Subject classification through context-enriched language models

Bibliographic details
Year of defense: 2015
Main author: Alexandre Guelman Davis
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: Portuguese
Institution of defense: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/ESBF-9VKK2Q
Abstract: Over the years, humans have developed a complex and intricate system of communication, with means of conveying information that range from books, newspapers and television to, more recently, social media. However, efficiently retrieving and understanding social media messages in order to extract useful information is challenging, especially because shorter messages depend strongly on context. Users often assume that their social media audience is aware of the associated background and the underlying real-world events, which allows them to shorten their messages without compromising the effectiveness of communication. Traditional data mining algorithms do not account for contextual information. We argue that exploiting context could lead to more complete and accurate analyses of social media messages. In this work, we therefore demonstrate how relevant contextual information is to successfully filtering messages related to a selected subject. We also show that recall increases when context is taken into account. Furthermore, we propose methods for filtering relevant messages that do not rely on keywords alone, provided the context is known and can be detected. In this dissertation, we propose a novel approach to subject classification of social media messages that considers both textual and extra-textual (or contextual) information. This approach uses a proposed context-enriched language model and employs techniques based on concepts from computational linguistics, more specifically from the field of pragmatics. To analyze the impact of the proposed approach experimentally, we used datasets containing messages about three major American sports (football, baseball and basketball). Results indicate up to a 50% improvement in retrieval over text-based approaches due to the use of contextual information.