Machine learning for the classification of exon and intron structures in human genome data

Bibliographic details
Year of defense: 2019
Main author: SANZ, Albaro Ramon Paiva
Advisor: FERREIRA, Tiago Alessandro Espínola
Defense committee: CUNHA FILHO, Moacyr, BALBINO, Valdir Queiroz, SANTOS, Antônio de Pádua, MIRANDA, Péricles Barbosa Cunha de
Document type: Thesis
Access type: Open access
Language: Portuguese
Defending institution: Universidade Federal Rural de Pernambuco
Graduate program: Programa de Pós-Graduação em Biometria e Estatística Aplicada
Department: Departamento de Estatística e Informática
Country: Brazil
Keywords in Portuguese:
CNPq knowledge area:
Access link: http://www.tede2.ufrpe.br:8080/tede2/handle/tede2/8150
Abstract: Classification techniques are often used to solve bioinformatics problems. Most genes in a DNA sequence are transcribed into messenger RNA and translated into protein. DNA contains regions that encode proteins (exons) and regions that do not (introns); the boundaries between exons and introns are called splice sites. During the transcription process, the introns are "cut" out; this process, known as splicing, places the exons of a gene consecutively, ready to be translated into the amino acid sequence that makes up the protein. At splice sites, the transition from the coding region (exon) to the non-coding region (intron), denoted EI, is marked by the nucleotides GT, and the transition from the non-coding region (intron) to the coding region (exon), denoted IE, is marked by the nucleotides AG. Only a small percentage of these dinucleotide occurrences are actual splice sites. This study presents a methodology for the EI and IE classification problem, which consists of obtaining probability distributions with machine learning techniques and deriving different performance measures from them. Several algorithms (Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF), and Naive Bayes (NB)) are tested and compared to find the best classifier. To select the best classifier, the best-known measures based on the confusion matrix are applied (Accuracy, Specificity, and Sensitivity, among others), along with the Kolmogorov-Smirnov distance (KS) as a performance measure for the classification models. More precisely, the KS measures the degree of separation between the class probability distributions; greater separation is an indication of greater accuracy. The results presented in this study are equal or superior in accuracy to those reported in the literature.
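
The evaluation idea in the abstract can be illustrated with a minimal sketch. It assumes a classifier has already produced, for each candidate site, a probability of being a true splice site. The function names (ks_distance, confusion_measures), the 0.5 threshold, and the toy scores are hypothetical and chosen only for illustration; they are not taken from the thesis. The KS distance is computed here as the largest gap between the empirical cumulative distributions of the scores of the two classes.

```python
from bisect import bisect_right


def ks_distance(scores_pos, scores_neg):
    """Largest gap between the empirical CDFs of two score samples (0 to 1)."""
    pos = sorted(scores_pos)
    neg = sorted(scores_neg)
    best = 0.0
    # The gap can only change at observed score values, so check each one.
    for t in sorted(set(pos) | set(neg)):
        cdf_pos = bisect_right(pos, t) / len(pos)
        cdf_neg = bisect_right(neg, t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def confusion_measures(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity and specificity at a fixed (assumed) threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy data (invented): 1 = true splice site, 0 = false candidate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

pos_scores = [p for y, p in zip(y_true, y_prob) if y == 1]
neg_scores = [p for y, p in zip(y_true, y_prob) if y == 0]

ks = ks_distance(pos_scores, neg_scores)
measures = confusion_measures(y_true, y_prob)
print(ks)        # 0.75
print(measures)  # accuracy, sensitivity and specificity are all 0.75 here
```

A KS value near 1 means the two classes receive almost non-overlapping scores, while a value near 0 means the score distributions are nearly indistinguishable. Unlike the confusion-matrix measures, the KS distance does not depend on a particular decision threshold.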