Aprendizagem eficiente de classificadores sequenciais em padrões longos
Ano de defesa: | 2013 |
---|---|
Autor(a) principal: | |
Orientador(a): | |
Banca de defesa: | |
Tipo de documento: | Dissertação |
Tipo de acesso: | Acesso aberto |
Idioma: | por |
Instituição de defesa: |
Universidade Federal de Minas Gerais
UFMG |
Programa de Pós-Graduação: |
Não Informado pela instituição
|
Departamento: |
Não Informado pela instituição
|
País: |
Não Informado pela instituição
|
Palavras-chave em Português: | |
Link de acesso: | http://hdl.handle.net/1843/ESBF-97CJGD |
Resumo: | Many applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of events or elements (rather than unordered sets of features), that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier must be built using sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels for Support Vector Machines. Such approaches, however, are computationally hard, and the typically high-dimensional nature of sequential data poses serious challenges to their feasibility, especially if the data shows long range dependencies (i.e., long patterns are necessary in order to model the data). In this paper we introduce algorithms that build highly effective sequential classifiers by exploiting adjacency or proximity information, either to improve classification accuracy or to ensure O(nn) learning cost, where n is the dimension (number of features) comprising a given test instance. Our algorithms are based on enumerating (approximately) contiguous sequences from the training data on a demand-driven basis, exploiting a lightweight and flexible sequence matching function and an innovative sequence enumeration strategy called pattern silhouettes, which make our classifiers fast but also robust even in noisy data. Our empirical results on actual datasets show that, in most of the cases, our classifiers are faster than existing solutions (sometimes orders of magnitude faster), also providing significant accuracy improvements in most of the evaluated cases. |