Uma investigação de aspectos da classificação de tópicos para textos curtos (An investigation of aspects of topic classification for short texts)
Year of defense: | 2015 |
---|---|
Lead author: | |
Advisor: | |
Examination committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | Portuguese (por) |
Defending institution: | Universidade Federal da Paraíba; Brasil; Informática; Programa de Pós-Graduação em Informática; UFPB |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | https://repositorio.ufpb.br/jspui/handle/tede/7842 |
Abstract: | In recent years, a large body of scientific research has encouraged the use of web data as input for epidemiological surveillance and for knowledge discovery and mining related to public health in general. To make use of social media content, especially tweets, some earlier approaches recast the content identification problem as a text classification problem in a supervised learning setting. During this process, however, limitations arise in how messages are represented and how attributes are extracted. This research therefore investigates the performance impact, in the task of classifying short social messages, of continuously expanding the training set, supported by a confidence measure on the predictions made. It also evaluates alternatives for weighting and extracting the terms used for classification, in order to reduce dependence on term-frequency-based metrics. Restricted to the binary classification of English-language tweets related to health events, the results showed a 9% improvement in F1 over the baseline, indicating that this expansion increases performance even for short-message classification in the health domain. For the term-weighting objective, the main contribution is the ability to automatically identify highly discriminative terms in the dataset without the limitations of term frequency. This may, for example, help build more robust and dynamic classification processes that use lists of specific terms to index content in an external database (background knowledge). Overall, by advancing the hypotheses discussed, the results can support the emergence of more robust applications for surveillance, control, and decision making around real health events (epidemiology, health campaigns, etc.) through the classification of short social messages. |
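
The idea of expanding the training set under a confidence measure, as described in the abstract, can be illustrated with a minimal self-training sketch. Everything here is an assumption made for illustration and is not taken from the dissertation: the Laplace-smoothed naive Bayes classifier, the whitespace tokenizer, the function names (`train_nb`, `predict_proba`, `self_train`), the toy messages, and the use of the larger class probability as the confidence score.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_nb(examples):
    """Fit token counts for a binary naive Bayes model.

    examples: list of (text, label) pairs with label in {0, 1}.
    """
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def predict_proba(model, text):
    """Return P(label = 1 | text) with Laplace smoothing."""
    counts, docs, vocab = model
    total_docs = docs[0] + docs[1]
    log_score = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (total_docs + 2))
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        log_score[label] = score
    m = max(log_score.values())  # subtract the max for numerical stability
    e0 = math.exp(log_score[0] - m)
    e1 = math.exp(log_score[1] - m)
    return e1 / (e0 + e1)


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Repeatedly add confidently predicted messages to the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, rest = [], []
        for text in pool:
            p = predict_proba(model, text)
            confidence = max(p, 1.0 - p)  # assumed confidence measure
            if confidence >= threshold:
                confident.append((text, 1 if p >= 0.5 else 0))
            else:
                rest.append(text)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = rest
    return train_nb(labeled), labeled


# Toy, made-up messages: 1 = health-related, 0 = not health-related.
seed = [
    ("flu fever cough", 1),
    ("sick flu headache", 1),
    ("football match tonight", 0),
    ("great movie tonight", 0),
]
stream = ["flu fever", "movie tonight", "hello world"]
model, expanded = self_train(seed, stream, threshold=0.8)
```

In this toy run, the two messages that share vocabulary with the seed set clear the threshold and join the training data, while the message with no known tokens stays unlabeled. Raising `threshold` trades coverage for a lower risk of reinforcing wrong predictions.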