Unsupervised text analysis. Applications: chemical and electrical sectors

Bibliographic details
Year of defense: 2020
Main author: Lucas Augusto Ferreira de Oliveira
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
Brazil
ENG - DEPARTAMENTO DE ENGENHARIA QUÍMICA
Programa de Pós-Graduação em Engenharia Química
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/35653
Abstract: Text analysis has existed as a field for some years, but it has advanced considerably as the capacity to collect and store information in text format has grown. Text analysis can be divided into database analysis, text mining, and information extraction. All three are explored in this work, which proposes a methodology for discovering and naming clusters. The methodology uses natural language processing (NLP) through an unsupervised machine learning approach. Two real case studies are used. The first concerns CEMIG, one of the main concessionaires in the electricity sector in Brazil, and aims to group the text messages of its customers or, in other words, to discover the intents of its users. The second concerns a company that sells machinery for civil construction, also in Brazil, and aims to group technical opinions, issued in text format, on laboratory analyses of fluids used in the machines. These analyses are written by different analysts, hence the need to standardize this information. Satisfactory results were obtained in both cases. The combination of PCA for dimensionality reduction and k-means for clustering was, in general, the best performer on three criteria: the silhouette coefficient, the usual evaluation metric, was generally higher than 0.95; the cluster called “random”, which gathers phrases with little expressive content, held around 6% of the data; and computational processing time was significantly low. The methodology proved quite efficient for these cases and can be used in other contexts.
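The two clustering metrics named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definition of the silhouette coefficient with Euclidean distance; it is not the dissertation's code. The function names `silhouette_coefficient` and `cluster_share`, the toy 2-D points (standing in for PCA-reduced text vectors), and the labels are all assumptions made for illustration only.

```python
from math import dist


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    # Group points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (dist(p, p) == 0, so dividing by len(own) - 1 excludes p itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def cluster_share(labels, target):
    """Fraction of items assigned to one cluster (e.g. a catch-all cluster)."""
    return sum(1 for lab in labels if lab == target) / len(labels)


# Hypothetical toy data: two tight, well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
          (10.0, 10.0), (10.0, 10.1), (10.1, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]
mixed_labels = [0, 1, 0, 1, 0, 1]  # deliberately poor assignment

print("well separated:", silhouette_coefficient(points, good_labels))
print("poorly assigned:", silhouette_coefficient(points, mixed_labels))
print("share of cluster 1:", cluster_share(good_labels, 1))
```

Values close to 1 indicate compact, well-separated clusters, which is how a silhouette coefficient above 0.95 should be read. On real data a library routine such as scikit-learn's `silhouette_score` would normally be preferred over this quadratic-time loop.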