Model risk in credit scoring models with big data applications

Bibliographic details
Defense year: 2023
Main author: Yoshida Junior, Valter Takuo
Advisor: Schiozer, Rafael Felipe
Defense committee: Not informed by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Not informed by the institution
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Keywords in English:
Access link: https://hdl.handle.net/10438/33946
Abstract: Large databases and Machine Learning have increased our ability to produce credit scoring models with varying numbers of observations and explanatory variables. Although managers and regulators have concerns about the potential risks associated with algorithms' discretion in variable selection and model building, and with the lack of causality, insufficient attention has been given to the inappropriate use of high hit-rate credit scoring models, that is, to credit scoring model risk. This study fills this gap by proposing a novel model risk measure, Credit Scoring Model Risk, based on the correlation between the dependent variable and the generated predictions. This work empirically tests the measure in plug-in LASSO credit scoring models and finds that adding loans from different banks to increase the number of observations is not optimal on an in-sample basis, challenging the generally accepted assumption that more data leads to better predictions. However, the evaluation of model performance using in-sample data may be unstable across out-of-time estimations. Therefore, decision-making (choosing a model among a variety of possibilities) based exclusively on in-sample measures may be problematic: banks' loan portfolios change over time, and models can be born uncalibrated (or poorly fitted to the current portfolio) and can behave differently under new macroeconomic conditions or during exogenous and stochastic events. This work also proposes a procedure to forecast the best-performing model in out-of-time datasets. Three complementary approaches help the model user choose between the segmented and full-data models for out-of-time applications by predicting which model tends to have the higher correlation (and thus the lower model risk). The first approach is based on the concept of "shrinkage"; the second uses a Monte Carlo simulation; and the third is a Bayesian estimation of covariances.
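
The abstract describes the measure only as being based on the correlation between the dependent variable and the generated predictions, so the Python sketch below is an illustrative assumption rather than the thesis's exact formula: it fits a LASSO-penalized logistic scoring model (standing in for the plug-in LASSO) on simulated loan data and reports a correlation-based risk proxy both in-sample and out-of-time. The function model_risk, the 1 - rho transform, and the simulated data are all hypothetical.

    # Hypothetical sketch of a correlation-based model risk proxy.
    # The exact measure in the thesis is not specified here; this is an
    # assumed reading (Pearson correlation between outcomes and predictions).
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import LogisticRegressionCV

    rng = np.random.default_rng(0)

    # Simulated loan-level data: X = borrower features, y = default indicator.
    n, p = 5000, 20
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]          # sparse true signal
    y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 1.0))))

    # Split into an "in-sample" window and a later "out-of-time" window.
    X_in, y_in = X[:4000], y[:4000]
    X_oot, y_oot = X[4000:], y[4000:]

    # LASSO-penalized logistic credit scoring model (cross-validated penalty),
    # standing in for the plug-in LASSO used in the thesis.
    model = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, max_iter=5000)
    model.fit(X_in, y_in)

    def model_risk(y_true, p_hat):
        # Lower prediction-outcome correlation => higher model risk.
        rho, _ = pearsonr(y_true, p_hat)
        return 1.0 - rho                             # illustrative transform only

    print("in-sample risk:   ", model_risk(y_in, model.predict_proba(X_in)[:, 1]))
    print("out-of-time risk: ", model_risk(y_oot, model.predict_proba(X_oot)[:, 1]))

Comparing the two printed values mirrors the abstract's point: a model that looks strong in-sample can carry higher risk once evaluated on a later, out-of-time window.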
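The abstract names its second selection approach, a Monte Carlo simulation, without specifying it. The sketch below is one hypothetical reading under stated assumptions: bootstrap resampling simulates the sampling distribution of each candidate model's prediction-outcome correlation, and the model that more often attains the higher correlation (lower risk) is preferred for out-of-time use. The function simulate_win_rate and the inputs p_full and p_segmented (predicted default probabilities from the full-data and segmented models) are illustrative names, not from the thesis.

    # Hypothetical Monte Carlo reading of the abstract's second approach:
    # bootstrap the in-sample data to simulate each model's prediction-outcome
    # correlation, then pick the model that wins more often.
    import numpy as np
    from scipy.stats import pearsonr

    def simulate_win_rate(y, p_full, p_segmented, n_sims=2000, seed=1):
        # Share of bootstrap draws in which the full-data model attains the
        # higher correlation (i.e., the lower correlation-based model risk).
        rng = np.random.default_rng(seed)
        n, wins = len(y), 0
        for _ in range(n_sims):
            idx = rng.integers(0, n, size=n)         # bootstrap resample
            rho_full, _ = pearsonr(y[idx], p_full[idx])
            rho_seg, _ = pearsonr(y[idx], p_segmented[idx])
            wins += rho_full > rho_seg
        return wins / n_sims

    # Usage (p_full and p_segmented would be the two candidates' predicted
    # default probabilities on the same loans):
    # win_rate = simulate_win_rate(y_in, p_full, p_segmented)
    # prefer the full-data model for out-of-time use if win_rate > 0.5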