Construção de evidências para classificação automática de textos (Evidence construction for automatic text classification)

Bibliographic details
Year of defense: 2008
Main author: Fabio Soares Figueiredo
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/RVMR-7L3NSY
Abstract: Since the popularization of digital documents, automatic text classification has been considered an important research topic. Despite these research efforts, there is still demand for improving the performance of classifiers. Most research in automatic text classification focuses on the algorithmic side, and few efforts focus on enhancing the datasets used to train the classifiers, which is the focus of this paper. We propose a data treatment strategy, based on feature extraction, that precedes the classification task and enhances documents with discriminative features of each class that can increase classification effectiveness. Our strategy uses term co-occurrences to generate new discriminative features, called compound-features (or c-features), that can be incorporated into documents to help the classification task. The idea is that, when used in conjunction with single-features, the ambiguity and noise inherent to the components of c-features are reduced, making them more helpful for separating classes into more homogeneous partitions. However, the computational cost of feature extraction may make the method infeasible. In this paper, we devise a set of mechanisms that make the strategy computationally feasible while improving classifier effectiveness. We test this approach with several classification algorithms and standard text collections. Experimental results show gains in almost all evaluated scenarios, from the simplest algorithms, such as k-Nearest Neighbors (kNN) (46% gain in micro-averaged F1 on the 20 Newsgroups 18828 collection), to the most complex one, the state-of-the-art Support Vector Machine (SVM) (10.7% gain in macro-averaged F1 on the OHSUMED collection).
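The idea of building c-features from term co-occurrences can be illustrated with a minimal Python sketch. It is not the dissertation's actual method, whose mechanisms are not detailed in this record. The function names (`extract_c_features`, `augment`), the pairwise restriction, the `min_support` and `min_dominance` thresholds, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def extract_c_features(docs, labels, min_support=2, min_dominance=0.9):
    """Return term pairs that co-occur mostly within a single class.

    Assumed selection rule (hypothetical): a pair is kept when it appears
    in at least `min_support` training documents and at least
    `min_dominance` of those documents share one class.
    """
    pair_class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # Count each unordered pair of distinct terms once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            pair_class_counts[pair][label] += 1

    selected = set()
    for pair, counts in pair_class_counts.items():
        support = sum(counts.values())
        dominance = max(counts.values()) / support
        if support >= min_support and dominance >= min_dominance:
            selected.add(pair)
    return selected


def augment(tokens, c_features):
    """Append one joined token per selected pair present in the document."""
    present = set(tokens)
    extra = [f"{a}+{b}" for a, b in sorted(c_features)
             if a in present and b in present]
    return list(tokens) + extra


# Toy training data (invented): "apple" alone is ambiguous across classes.
docs = [
    ["apple", "pie", "recipe"],
    ["apple", "pie", "oven"],
    ["apple", "stock", "market"],
    ["apple", "stock", "price"],
]
labels = ["food", "food", "finance", "finance"]

c_features = extract_c_features(docs, labels)
augmented = [augment(d, c_features) for d in docs]
print(sorted(c_features))
print(augmented[0])
```

In this toy data the single term "apple" occurs in both classes, while the pairs "apple+pie" and "apple+stock" each occur in only one, which is the kind of ambiguity reduction the abstract describes. Enumerating all pairs is quadratic in the number of distinct terms per document, which hints at why the abstract raises computational cost as a concern.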