Medida de certeza na categorização multi-rótulo de texto e sua utilização como estratégia de poda do ranking de categorias
Year of defense: | 2010 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal do Espírito Santo BR Mestrado em Informática Centro Tecnológico UFES Programa de Pós-Graduação em Informática |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://repositorio.ufes.br/handle/10/6393 |
Abstract: | A multi-label text categorization system typically computes a degree of belief for each category in a predefined set, ranks the categories by degree of belief, and assigns to the document the categories whose degree of belief exceeds a given cut threshold. Ideally, the degree of belief would indicate the probability that the document belongs to the category. Unfortunately, no categorization system computes such probabilities, and mapping degrees of belief to probabilities remains a little-explored problem in information retrieval (IR). In this work we propose a method based on Bayes' rule to map degrees of belief to certainty measures for multi-label text categorization. Further contributions are a strategy for determining cut thresholds based on this mapping, called Bayesian cut (BCut), and a variant called position-based Bayesian cut (PBCut). We experimentally evaluated the impact of the proposed methods on the performance of two multi-label text categorization techniques: multi-label k-nearest neighbors (ML-KNN) and VG-RAM weightless neural networks. The evaluations were carried out in the context of categorizing descriptions of the economic activities of Brazilian companies according to the Brazilian National Classification of Economic Activities (CNAE). We also investigated the impact on multi-label text categorization performance of cut methods commonly used in the IR literature: RCut, PCut, SCut and RTCut. Moreover, we propose two new variants, referred to as PCut* and SCut*. Finally, this work shows that the proposed cut approaches, BCut and PBCut, yield better categorization performance than the other strategies presented in the IR literature. |
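
The rank-then-cut step described in the abstract can be illustrated with a minimal sketch. It assumes the classifier's degrees of belief are already available as a dictionary. The function names, category labels, and scores below are hypothetical and are not taken from the dissertation. `rank_cut` follows the usual description of RCut in the IR literature (keep the k top-ranked categories). BCut and PBCut are not reproduced here, because the abstract does not define them in enough detail.

```python
from typing import Dict, List


def rank_categories(beliefs: Dict[str, float]) -> List[str]:
    """Order categories from highest to lowest degree of belief."""
    return sorted(beliefs, key=lambda category: beliefs[category], reverse=True)


def threshold_cut(beliefs: Dict[str, float], threshold: float) -> List[str]:
    """Keep the ranked categories whose degree of belief reaches the threshold."""
    return [c for c in rank_categories(beliefs) if beliefs[c] >= threshold]


def rank_cut(beliefs: Dict[str, float], k: int) -> List[str]:
    """RCut-style pruning: keep only the k top-ranked categories."""
    return rank_categories(beliefs)[:k]


if __name__ == "__main__":
    # Hypothetical degrees of belief for one document (illustrative values only).
    beliefs = {"retail": 0.72, "wholesale": 0.55, "transport": 0.10}
    print(threshold_cut(beliefs, threshold=0.5))
    print(rank_cut(beliefs, k=1))
```

A score threshold lets the number of assigned categories vary per document, while a rank cut always assigns exactly k categories. That difference is one reason the choice of cut strategy affects multi-label performance.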