Classificação multi-rótulo hierárquica de documentos textuais (Hierarchical multi-label classification of text documents)
Year of defense: | 2009 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defense institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/SLSS-7WMHNG |
Abstract: | The amount of information stored in text databases is steadily increasing, and so is the demand for automated techniques to organize this data. In this context, academic and industry research has focused on automatic text classification. Most work on text classification develops techniques for settings in which the number of classes is limited and the dependencies between them are not significant. There are several relevant application scenarios in which these assumptions do not hold. To address them, a new research topic, Hierarchical Multi-label Classification (HMC), has received increasing attention but still represents a major challenge for the area. In HMC problems, the set of classes is likely to be much larger and the classes are hierarchically structured. Classic methods, in addition to ignoring the knowledge encoded in this structure, see their performance degrade when the number of classes is too large or when the classes are interdependent. In this work we perform an extensive literature study, present MASSIFICA, a framework for the development and analysis of HMC algorithms, and propose a lazy, rule-based classification algorithm suitable for HMC problems. MASSIFICA was used as a benchmark to evaluate the performance of the proposed algorithm against well-known base classifiers under both flat and structured database (top-down) architectures. We also present results from a real application scenario: the classification of companies' economic activities. Finally, we discuss the challenges and how different solutions respond to them. We conclude that the new algorithm, despite having lower performance at the first hierarchical levels, can perform competitively, particularly at the deeper levels of the hierarchy, where classes are generally uncommon and less information is available. |