Medida de certeza na categorização multi-rótulo de texto e sua utilização como estratégia de poda do ranking de categorias

Detalhes bibliográficos
Ano de defesa: 2010
Autor(a) principal: Souza, Caribe Zampirolli de
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Universidade Federal do Espírito Santo
BR
Mestrado em Informática
Centro Tecnológico
UFES
Programa de Pós-Graduação em Informática
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
004
Link de acesso: http://repositorio.ufes.br/handle/10/6393
Resumo: A multi-label text categorization system typically computes degrees of belief when it comes to the categories of a pre-defined set, orders the categories by degree of belief, and attributes to the document categories with a higher degree of belief to determined threshold cut. It would be ideal if the degree of belief could inform the probability of the document be part of this category. Unfortunately, there isn t a categorization system that computes such probabilities and to map degrees of belief in probabilities is still a problem that isn`t well explored in IR. In this paper we propose a method based on Bayes rules to map degrees of belief in terms of multi-label text measures of categorization. There are other contributions in this work such as an strategy to determine the limits of threshold cut based on bayesian cut (BCut) and a variant for PBCut (position based bayesian CUT ). As an experience, we evaluated the impact of the proposed methods when performing the two techniques of the multi-label text categorization. The first technique is called knearest neighbor multi-label (ML-KNN) and the second technique is called VG-RAM weightless Neural Networks. Theses evaluations were made in the context of the categorization of economic activities description of Brazilian enterprises, according to the Economic Activities Classification in Brazil (CNAE). In this work we also investigated the impact in the performance of multi-label text categorization of the three cut methods commonly used in the IR literature: RCut, PCut, SCut and RTCut. Moreover, we propose a new variant for the so called PCut* and a new variant for SCut*. Finally, this work shows that the cut approach proposed, BCut and PBCut, produces a categorization performance superior to the other strategies presented in the literature of IR.