Development of a sequence analysis service using a model based on attributes of PSI-BLAST results

Bibliographic details
Year of defense: 2013
Main author: Henrique de Assis Lopes Ribeiro
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/BUOS-9K6JMK
Abstract: PSI-BLAST is one of the main tools for remote homology search, a task that is essential for molecular modeling, secondary structure prediction, and functional annotation of hypothetical proteins. Nevertheless, the literature reports high false-positive rates in PSI-BLAST searches, mostly because of the unsupervised way in which PSI-BLAST calculates the PSSM weights. In this work we combine PSI-BLAST with supervised machine learning techniques that predict the probability that a result is correct. To do so, 1200 PANTHER queries were selected and split into two groups: 800 for training and 400 for testing. These queries were submitted to PSI-BLAST against a PANTHER-UniProt multi-FASTA database. Each subject found was then classified as belonging to the same cluster as the query, to a different cluster, or to no cluster, in which case the subject was discarded. In addition, 17 features were created from the subject scores found in each iteration and from the query size. With these features, an ensemble of neural networks and random forests was trained and achieved an AUC of 0.94 on the test set. The 1200 queries were also submitted to BLASTp, and a neural network model was trained that achieved an AUC of 0.78 on the test set. This model takes only 3 features and was proposed as a heuristic for the main PSI-BLAST-based model. These ML-BLAST (Machine Learn-BLAST) models were applied to 900 recently annotated proteins, and the similarity between subject and query annotations was compared. These tests yielded a model of weighted annotation relevance and an annotation suggestion method based on annotation consensus. The ML-BLAST models were also applied to the hypothetical proteins of four microorganisms and were able to suggest annotations for about half of them. These models, together with a set of other metrics, were integrated into a new tool called Annothetic (Annotate Hypothetical). Despite its name, this tool can be applied not only to protein annotation but also to any task that requires remote similarity search.
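
Illustrative sketch (not taken from the thesis): the labeling rule and the AUC metric mentioned in the abstract could be expressed roughly as follows. The function names, the cluster identifiers, and the pairwise AUC formulation are assumptions made for illustration only; the abstract does not describe the 17 features or the actual implementation, so neither is reproduced here.

```python
from typing import Dict, List, Optional


def label_subject(query_cluster: str, subject_cluster: Optional[str]) -> Optional[int]:
    """Label one subject relative to its query (hypothetical helper).

    Returns 1 if the subject is in the same cluster as the query,
    0 if it is in a different cluster, and None if it has no cluster
    (the abstract states that such subjects are discarded).
    """
    if subject_cluster is None:
        return None
    return 1 if subject_cluster == query_cluster else 0


def label_hits(query_cluster: str,
               subject_clusters: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Label every subject of one query, dropping subjects without a cluster."""
    labels = {}
    for subject_id, cluster in subject_clusters.items():
        label = label_subject(query_cluster, cluster)
        if label is not None:
            labels[subject_id] = label
    return labels


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example has the
    higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: cluster names and scores are placeholders.
hits = {"subj1": "cluster_A", "subj2": "cluster_B", "subj3": None}
labeled = label_hits("cluster_A", hits)          # subj3 is discarded
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In this sketch the scores passed to `auc` stand in for the predicted probability that a result is correct; any trained classifier could supply them.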