Dimensionality reduction in hierarchical multi-label classification databases using autoencoders

Bibliographic details
Year of defense: 2019
Main author: Siqueira, Rafael Fernandes
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: por
Institution of defense: Universidade Tecnológica Federal do Paraná
Ponta Grossa
Brazil
Programa de Pós-Graduação em Ciência da Computação
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://repositorio.utfpr.edu.br/jspui/handle/1/4472
Abstract: Protein function prediction in bioinformatics data is an example of a Hierarchical Multi-label Classification problem, in which each instance can be associated with multiple classes that are themselves organized in a hierarchy. The high dimensionality of attributes and classes affects classifier performance, in both computational cost and predictive capacity, because it hinders the search for patterns and the discovery of useful knowledge. Feature Extraction is one of the techniques used to reduce the dimensionality of databases and thereby eliminate irrelevant and/or redundant attributes that tend to confuse a learning algorithm. In this technique, combinations and/or transformations of the original attributes generate new, more significant attributes that represent the database in a smaller space. This work proposes a new feature extraction method for hierarchical multi-label classification, FEAE-HMC, which is based on Deep Learning concepts and techniques and adapts a classic Autoencoder network. The FEAE-HMC method is divided into two main steps: feature extraction, and evaluation of the reduced data set using a hierarchical multi-label classifier (Clus-HMC and MHC-CNN) together with the area under the precision-recall curve (AUPRC) as the performance measure. The experiments use biological data from 10 Gene Ontology databases, whose classes are structured in a hierarchy in the form of a Directed Acyclic Graph (DAG). According to the experimental results, the FEAE-HMC method was able to extract lower-dimensional representations that can incorporate correlations between the attributes and the labels. When submitted to a Hierarchical Multi-label Classifier, these representations generate models with predictive performance equivalent or even superior to that obtained with the original database.
The difference in AUPRC between the full database and a reduced database, with a reduction of up to 90% of the original dimensionality, is less than 0.047 for both classifiers. Statistical tests show that the reduced databases extracted by FEAE-HMC are at least statistically equivalent to the original databases.
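
As a rough illustration of the feature extraction step described in the abstract, the following is a minimal sketch of a generic single-hidden-layer autoencoder that compresses 50 attributes into 5 (a 90% reduction). It is not the FEAE-HMC method itself: the abstract does not specify the adaptations made to the network, so the architecture, the hyperparameters, the function name `train_autoencoder`, and the synthetic data used here are all assumptions made for illustration only. The evaluation step with a hierarchical multi-label classifier and AUPRC is omitted.

```python
import numpy as np


def train_autoencoder(X, code_dim, epochs=500, lr=0.1, seed=0):
    """Train a generic one-hidden-layer autoencoder with gradient descent.

    Hypothetical helper for illustration; not the FEAE-HMC architecture.
    Returns (encode, losses), where encode maps rows to code_dim features.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(n_features, code_dim))
    b_enc = np.zeros(code_dim)
    W_dec = rng.normal(0.0, 0.1, size=(code_dim, n_features))
    b_dec = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)       # encoder: reduced representation
        X_hat = H @ W_dec + b_dec            # linear decoder: reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_W_dec = H.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - H ** 2)   # derivative of tanh
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec

    def encode(A):
        return np.tanh(A @ W_enc + b_enc)

    return encode, losses


# Synthetic stand-in data (assumption): 200 instances, 50 correlated attributes.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each attribute

encode, losses = train_autoencoder(X, code_dim=5)
codes = encode(X)                            # reduced data set, shape (200, 5)
```

In a pipeline like the one the abstract outlines, `codes` would replace the original attribute matrix as input to the classifier, and the resulting AUPRC would be compared against the value obtained with the full attribute set.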