Uma investigação de aspectos da classificação de tópicos para textos curtos

Detalhes bibliográficos
Ano de defesa: 2015
Autor(a) principal: Oliveira, Ewerton Lopes Silva de
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Universidade Federal da Paraíba
Brasil
Informática
Programa de Pós-Graduação em Informática
UFPB
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
Link de acesso: https://repositorio.ufpb.br/jspui/handle/tede/7842
Resumo: In recent years a large number of scientific research has stimulated the use of web data as inputs for the epidemiological surveillance and knowledge discovery/mining related to public health in general. In order to make use of social media content, especially tweets, some approaches proposed before transform a content identification problem to a text classification problem, following the supervised learning scenario. However, during this process, some limitations attributed to the representation of messages as well as the extraction of attributes arise. From this, the present research is aimed to investigate the performance impact in the short social messages classification task using a continuous expansion of the training set approach with support of a measure of confidence in the predictions made. At the same time, the survey also aimed to evaluate alternatives for consideration and extraction of terms used for the classification in order to reduce dependencies on term-frequency based metrics. Restricted to the binary classification of tweets related to health events and written in English, the results showed a 9% improvement in F1, compared to the baseline used, showing that the action of expanding the classifier increases the performance, even in the case of short message classification task for health concerns. For the term weighting objective, the main contribution obtained is the ability to automatically indentify high discriminative terms in the dataset, without suffering limitations regarding term-frequency. This may, for example, be able to help build more robust and dynamic classification processes which make use of lists of specific terms for indexing contents on external database ( textit background knowledge). Overall, the results can benefit, by the improvement of the discussed hypotheses, the emergence of more robust applications in the field of surveillance, control and decision making to real health events (epidemiology, health campaigns, etc.), through the task of classifying short social messages.