Modelos para análise de textos: um comparativo do número de tópicos
Year of defense: | 2024 |
---|---|
Lead author: | |
Advisor: | |
Defense committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de São Carlos, Câmpus São Carlos |
Graduate program: | Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Keywords in English: | |
CNPq knowledge area: | |
Access link: | https://repositorio.ufscar.br/handle/20.500.14289/20846 |
Abstract: | Text modeling has gained visibility and popularity in recent years because of the large and ever-increasing amount of information present in daily life and consumed in various ways. For these models to be efficient and applicable, the prior step of data preprocessing is essential, as it helps organize and treat the texts. One branch of text analysis is topic modeling, whose methods aim to uncover the topic structure of a document, grouping documents by their dominant topics (subjects) and thus simplifying the exploration of large volumes of textual data through the resulting dimensionality reduction. One of the pioneering methods in this context is the Mixture Model (MM), which assumes that each document is composed of words from a single topic. Given this limitation, Latent Dirichlet Allocation (LDA) has gained considerable visibility due to its greater flexibility, as it allows each document to exhibit multiple topics. In both methodologies, inference is generally performed via a Bayesian approach. However, both MM and LDA require the user to define the number of topics in advance. Performance metrics are therefore needed after the models are fitted, to help determine the best number of topics. In this work, in addition to contrasting these text analysis methodologies, we compare the metrics that measure model quality and are used to choose the number of topics. To do this, we apply the models and selection metrics to two real data sets. |
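
To make the contrast between the two models concrete, the marginal likelihood of a single document can be sketched as follows. The notation is standard textbook notation and an assumption here, not taken from the dissertation: K is the number of topics, π the topic weights, β_{k,v} the probability of word v under topic k, θ_d the topic proportions of document d, α the Dirichlet hyperparameter, and N_d the number of words in document d.

```latex
% Mixture Model (MM): a single topic generates every word of document d
p(\mathbf{w}_d \mid \pi, \beta)
  = \sum_{k=1}^{K} \pi_k \prod_{n=1}^{N_d} \beta_{k,\, w_{dn}}

% LDA: each word draws its own topic from document-specific proportions \theta_d
p(\mathbf{w}_d \mid \alpha, \beta)
  = \int \mathrm{Dir}(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} \sum_{k=1}^{K} \theta_{dk}\, \beta_{k,\, w_{dn}}
    \, d\theta_d
```

In MM the sum over topics sits outside the product over words, so one topic accounts for the whole document; in LDA it sits inside, so topics can vary from word to word. In both expressions K must be fixed before fitting, which is why a selection metric is needed to compare candidate values of K.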