Predição de default de empresas: técnicas de machine learning em dados desbalanceados

Bibliographic Details
Main Author: Cordeiro, Tiago Vilas Boas
Publication Date: 2020
Format: Master thesis
Language: por
Source: Repositório Institucional do FGV (FGV Repositório Digital)
Download full: https://hdl.handle.net/10438/29873
Summary: Given the importance of credit risk management for the banking sector, probability-of-default models have become fundamental. With the growth in the volume of customer information and in computational capacity, several techniques have been studied and applied. This study uses two traditional linear techniques, Linear Discriminant Analysis and Logistic Regression, and four non-linear ensemble techniques, Bagging, Random Forest, AdaBoost and Stacking, applied to the problem of predicting default among Brazilian companies using information from their financial statements. The results indicate that data transformations and the treatment of class imbalance have a strong impact on the predictive power of Logistic Regression. Moreover, Random Forest was the best-performing technique regardless of the scenario and the metric used.
id FGV_fa8c9a23bdc357a3d5152eba05bfd1ef
oai_identifier_str oai:repositorio.fgv.br:10438/29873
network_acronym_str FGV
network_name_str Repositório Institucional do FGV (FGV Repositório Digital)
repository_id_str 3974
dc.title.por.fl_str_mv Predição de default de empresas: técnicas de machine learning em dados desbalanceados
title Predição de default de empresas: técnicas de machine learning em dados desbalanceados
author Cordeiro, Tiago Vilas Boas
author_facet Cordeiro, Tiago Vilas Boas
author_role author
dc.contributor.unidadefgv.por.fl_str_mv Escolas::EESP
dc.contributor.member.none.fl_str_mv Costa, Oswaldo Luiz do Valle
Matsumoto, Élia Yathie
dc.contributor.author.fl_str_mv Cordeiro, Tiago Vilas Boas
dc.contributor.advisor1.fl_str_mv Chela, João Luiz
contributor_str_mv Chela, João Luiz
dc.subject.eng.fl_str_mv Machine learning
Logistic regression
Random forest
Probability of default
Risk rating models
dc.subject.por.fl_str_mv Regressão logística
Probabilidade de default
Modelos de rating
dc.subject.area.por.fl_str_mv Economia
dc.subject.bibliodata.por.fl_str_mv Aprendizado do computador
Análise de regressão logística
Avaliação de riscos
publishDate 2020
dc.date.accessioned.fl_str_mv 2020-12-01T13:25:25Z
dc.date.available.fl_str_mv 2020-12-01T13:25:25Z
dc.date.issued.fl_str_mv 2020-11-11
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/10438/29873
url https://hdl.handle.net/10438/29873
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositório Institucional do FGV (FGV Repositório Digital)
instname:Fundação Getulio Vargas (FGV)
instacron:FGV
instname_str Fundação Getulio Vargas (FGV)
instacron_str FGV
institution FGV
reponame_str Repositório Institucional do FGV (FGV Repositório Digital)
collection Repositório Institucional do FGV (FGV Repositório Digital)
bitstream.url.fl_str_mv https://repositorio.fgv.br/bitstreams/4bef6e65-0934-46fa-86de-0e65709f5c15/download
https://repositorio.fgv.br/bitstreams/97977563-2413-4189-bf04-0f9849cbb751/download
https://repositorio.fgv.br/bitstreams/2b4515b8-f04f-4e5c-8e21-debdd7c18e99/download
https://repositorio.fgv.br/bitstreams/98d92f9c-0cc7-47d5-b46e-356b01e312a8/download
bitstream.checksum.fl_str_mv 9ce3ccfba0c78168fc069564f442ee13
61fadda3692b7ca097e5137cbd912d65
bc3cde20c21f901a035f7db71816e650
dfb340242cced38a6cca06c627998fa1
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional do FGV (FGV Repositório Digital) - Fundação Getulio Vargas (FGV)
_version_ 1827846549396258816