Classificação multi-rótulo hierárquica de documentos textuais (Hierarchical multi-label classification of text documents)

Bibliographic details
Year of defense: 2009
Main author: Gustavo Henrique Orair
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Defense institution: Universidade Federal de Minas Gerais (UFMG)
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/SLSS-7WMHNG
Abstract: The amount of information stored in text databases is steadily increasing, and demand for automated techniques to organize this data continues to grow with it. In this context, academic and industry research has focused on automatic text classification. Most work on text classification develops techniques for settings in which the number of classes is limited and the dependencies between them are not significant. There are several relevant application scenarios in which these assumptions do not hold. To address these problems, a new research topic, Hierarchical Multi-label Classification (HMC), has received increasing attention but still represents a major challenge for the area. In HMC problems, the set of classes is likely to be much larger and, as such, the classes are hierarchically structured. Classic methods, in addition to ignoring the knowledge encoded in this structure, see their performance degrade when the number of classes is too large or when the classes are interdependent. In this work we perform an extensive literature study, present MASSIFICA, a framework for the development and analysis of HMC algorithms, and propose a lazy rule-based classification algorithm suitable for HMC problems. MASSIFICA was used as a benchmark to evaluate the performance of the proposed algorithm against well-known base classifiers under both flat and structured database (top-down) architectures. We also present results for a real application scenario: the classification of companies' economic activities. Finally, we discuss the challenges and how different solutions react to them. We conclude that the new algorithm, despite having lower performance at the first hierarchical levels, can perform competitively, particularly at the deeper levels of the hierarchy, where classes are generally uncommon and less information is available.
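
Illustrative note: the abstract compares performance at the first and at the deeper levels of the class hierarchy. The Python sketch below shows one way such a per-level comparison could be computed for a single document. It is not taken from the dissertation or from MASSIFICA: the toy hierarchy, the label names, and the functions `depth`, `with_ancestors`, and `per_level_f1` are all hypothetical, and per-level F1 is assumed here only as a plausible metric.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent); top-level classes map to None.
PARENT = {
    "agriculture": None,
    "manufacturing": None,
    "crops": "agriculture",
    "livestock": "agriculture",
    "cereals": "crops",
    "textiles": "manufacturing",
}


def depth(label):
    """Return the 1-based level of a label in the hierarchy."""
    level = 1
    while PARENT[label] is not None:
        label = PARENT[label]
        level += 1
    return level


def with_ancestors(labels):
    """Expand a label set so that every ancestor of each label is included."""
    expanded = set()
    for label in labels:
        while label is not None:
            expanded.add(label)
            label = PARENT[label]
    return expanded


def per_level_f1(true_labels, predicted_labels):
    """Compute F1 separately at each hierarchy level for one document."""
    true_by_level = defaultdict(set)
    pred_by_level = defaultdict(set)
    for label in with_ancestors(true_labels):
        true_by_level[depth(label)].add(label)
    for label in with_ancestors(predicted_labels):
        pred_by_level[depth(label)].add(label)

    scores = {}
    for level in sorted(set(true_by_level) | set(pred_by_level)):
        t, p = true_by_level[level], pred_by_level[level]
        # F1 = 2 * TP / (|true| + |predicted|); the denominator is never zero
        # because the level appears in at least one of the two sets.
        scores[level] = 2 * len(t & p) / (len(t) + len(p))
    return scores


scores = per_level_f1({"cereals", "textiles"}, {"livestock", "textiles"})
print(scores)
```

In this toy case the prediction matches the true labels exactly at level 1, partially at level 2, and misses the only level-3 class, which mirrors the kind of level-by-level breakdown the abstract refers to.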