Temporal effects in text classification: characterization and data engineering (Efeitos temporais em classificação de textos: caracterização e engenharia de dados)
Year of defense: | 2008 |
---|---|
Lead author: | |
Advisor: | |
Examination committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/SLSS-8CEG6C |
Abstract: | Automatic Document Classification (ADC) has become an important research topic due to the increasing amount of information available on the Internet. ADC usually follows a standard supervised learning strategy, in which we first build a classification model using pre-classified documents and then use this model to classify new documents. One major challenge for ADC in many scenarios is that the characteristics of the documents and the classes to which they belong may change over time. However, most current ADC techniques are applied without taking into account the temporal evolution of the document collection. In this work, we characterize the temporal evolution in ADC in detail, based on an analysis methodology for the temporal effects, and we propose data engineering strategies to deal with these effects. In the analysis methodology, we show that the temporal evolution may be explained by three factors: class distribution, term distribution, and class similarity. We employ experimental methodologies and metrics capable of isolating each of these factors in order to analyze them separately. Moreover, we present data engineering strategies that incorporate the temporal aspects into the databases through data filtering and data transformation. Data filtering consists of selecting the documents that will be part of the training set, whereas data transformation is a process in which the terms of the documents in the database are changed, assigning them a new label that incorporates the temporal aspects. Using an exhaustive filtering strategy, we showed that, with only 69% of the ACM database, we are able to reach an accuracy of 89.76%, and with only 25% of the MedLine, an accuracy of 87.57%, which means gains of up to 20% in accuracy with training sets much smaller than the entire database. However, we know that this strategy is not feasible in real scenarios. On the other hand, with our data transformation strategies, we obtained a gain of up to 6.5% in accuracy, and these strategies may be applied in real scenarios and even extended to other algorithms. |
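
The two data engineering ideas named in the abstract, data filtering and data transformation, can be illustrated with a minimal sketch. The abstract does not specify how documents are selected or how terms are relabeled, so everything here is an assumption made for illustration: the `Doc` record, the function names `filter_by_window` and `tag_terms_with_year`, the fixed year window, and the `term@year` labeling scheme are all hypothetical and are not the dissertation's actual method.

```python
from collections import namedtuple

# Hypothetical document record: creation year, class label, list of terms.
Doc = namedtuple("Doc", ["year", "label", "terms"])


def filter_by_window(docs, reference_year, window):
    """Data filtering (assumed form): keep only training documents whose
    creation year lies within `window` years of `reference_year`."""
    return [d for d in docs if abs(d.year - reference_year) <= window]


def tag_terms_with_year(doc):
    """Data transformation (assumed form): give each term a new label that
    carries the document's creation year, e.g. 'query' -> 'query@1995'."""
    return Doc(doc.year, doc.label, [f"{t}@{doc.year}" for t in doc.terms])


# Toy training set, invented for this example only.
training = [
    Doc(1995, "databases", ["query", "index"]),
    Doc(2005, "databases", ["query", "xml"]),
    Doc(2006, "networks", ["wireless", "routing"]),
]

# Filtering: train only on documents close in time to a 2006 test document.
recent = filter_by_window(training, reference_year=2006, window=2)

# Transformation: keep every document, but make its terms time-aware.
transformed = [tag_terms_with_year(d) for d in training]
```

In this sketch, filtering shrinks the training set while transformation keeps all documents and changes their vocabulary, which mirrors the contrast the abstract draws between the two strategies. Either output could then be passed to any standard text classifier.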