Statistical monitoring applied to data science: an approach for the continuous validation of predictive classification models
Year of defense: | 2022 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de São Carlos, Câmpus São Carlos |
Graduate program: | Programa de Pós-Graduação em Engenharia de Produção - PPGEP |
Department: | Not informed by the institution |
Country: | Not informed by the institution |
Keywords in Portuguese: | |
Keywords in English: | |
CNPq knowledge area: | |
Access link: | https://repositorio.ufscar.br/handle/20.500.14289/16304 |
Abstract: | Predictive models apply to data on observable variables, called independent, to infer the behavior of another variable, observable or not, called dependent. A particular and widely used case is the binary classification model, in which the dependent variable can take one of two values: yes or no (positive/negative, success/failure). This thesis shows that increasingly digitized operational environments allow more complex applications of these classification models. Added to this is the need to increase business competitiveness through the search for information that reduces costs or increases profitability: it is the "perfect storm" that increases the importance, scope, financial impact, and time horizon of use of these models. This phenomenon occurs both within industry, with Big Data Analytics (BDA), and in other sectors, with the development of Data Science (DS). However, the boundary conditions, i.e., the operational conditions existing when the model was created, can undergo significant variations, due to technical problems in the generation, capture, or flow of information, or even to changes in the relationships between the variables studied, which can reduce the predictive quality of the created model. The literature review showed that several researchers argue that it is important to periodically check the hit-and-miss performance of these models; however, more specific criteria and methods defining a suitable checking frequency and sample sizes for this monitoring are lacking. To fill this gap, the concepts of Design Science Research were used to integrate Statistical Process Monitoring (SPM) with the model-building methods applied in the field of DS. In this integration, Phases I and II of SPM were related to a structured process of data analysis and model generation, creating an approach for the continuous validation of such models. The approach was validated using analytical and simulation techniques applied to Cohen's kappa index, resulting in prescriptive criteria for its use, supported by comparisons based on the Matthews correlation coefficient (MCC) and Youden's index. Control charts based on kappa were found to perform well with m = 5 samples of size n = 500, provided the expected agreement Pe is less than 0.8. The simulations also showed that monitoring with kappa requires fewer samples than the other indices studied. |
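The indices compared in the abstract can all be computed from a binary confusion matrix. The sketch below is an illustration of those standard formulas only, not the thesis's monitoring procedure; the function name and example counts are hypothetical. It shows Cohen's kappa, the expected agreement Pe that appears in the Pe < 0.8 criterion, the MCC, and Youden's index:

```python
from math import sqrt

def agreement_indices(tp, fp, fn, tn):
    """Standard agreement indices for a binary confusion matrix
    (illustrative sketch; not the thesis's implementation)."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n  # observed agreement (accuracy)
    # expected agreement by chance, from the marginal proportions
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (po - pe) / (1 - pe)          # Cohen's kappa
    mcc = (tp * tn - fp * fn) / sqrt(     # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    j = tp / (tp + fn) + tn / (tn + fp) - 1  # Youden's J: sens + spec - 1
    return {"kappa": kappa, "pe": pe, "mcc": mcc, "youden_j": j}

# Hypothetical confusion matrix for one monitoring sample
print(agreement_indices(tp=40, fp=5, fn=10, tn=45))
```

In a kappa control chart, such a value would be computed for each periodic sample of n classified cases and plotted against control limits derived in Phase I.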