Development of a sequence analysis service using a model based on attributes of PSI-BLAST results
Year of defense: | 2013 |
---|---|
Lead author: | |
Advisor: | |
Defense committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | Portuguese (por) |
Defending institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/BUOS-9K6JMK |
Abstract: | PSI-BLAST is one of the main tools for remote homology search, a task that is essential for molecular modeling, secondary structure prediction, and functional annotation of hypothetical proteins. Nevertheless, the literature reports high false-positive rates in PSI-BLAST searches, mostly because of the unsupervised way PSI-BLAST calculates the PSSM weights. In this work we combine PSI-BLAST with supervised machine learning techniques that predict the probability of a result being correct. To do so, 1,200 PANTHER queries were selected and split into two groups: 800 for training and 400 for testing. These queries were submitted to PSI-BLAST against a PANTHER-UniProt multi-FASTA database. Each subject found was then labeled as belonging to the same cluster as the query, to a different cluster, or to no cluster, in which case the subject was discarded. In addition, 17 features were created from the subject scores found in each iteration and from the query size. With these features, an ensemble of neural networks and random forests was trained and achieved an AUC of 0.94 on the test set. The 1,200 queries were also submitted to BLASTp, and a neural network model was trained that achieved an AUC of 0.78 on the test set. This model uses only 3 features and was proposed as a heuristic for the main PSI-BLAST-based model. These ML-BLAST (Machine Learning BLAST) models were applied to 900 recently annotated proteins, and the similarity between subject and query annotations was compared. These tests produced a model of weighted annotation relevance and a method for suggesting annotations based on annotation consensus. The ML-BLAST models were also applied to the hypothetical proteins of four microorganisms and were able to suggest annotations for about half of them. These models, together with a set of other metrics, were integrated into a new tool called Annothetic (Annotate Hypothetical). Despite its name, this tool can be applied not only to protein annotation but also to any task that requires remote similarity search. |
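
The evaluation protocol summarized in the abstract (an 800/400 split of 1,200 queries, a combined score from two model families, and AUC on the held-out group) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the function names, the toy labels and probabilities, and in particular the unweighted averaging used to combine the two models, since the abstract does not state how the ensemble was formed. The AUC is computed by its pairwise definition, the fraction of (positive, negative) pairs in which the positive receives the higher score.

```python
import random


def split_queries(query_ids, n_train=800, seed=0):
    """Shuffle query IDs and split them into a training and a test group."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


def ensemble_probability(p_nn, p_rf):
    """Combine two models by unweighted averaging (an assumption, not the thesis method)."""
    return [(a + b) / 2.0 for a, b in zip(p_nn, p_rf)]


def auc(labels, scores):
    """Pairwise AUC: chance that a positive outscores a negative; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical query IDs standing in for the 1,200 selected queries.
train_ids, test_ids = split_queries(range(1200))

# Toy data: 1 = subject in the same cluster as the query, 0 = different cluster.
labels = [1, 1, 0, 1, 0, 0]
p_nn = [0.9, 0.6, 0.5, 0.3, 0.2, 0.1]  # hypothetical neural network outputs
p_rf = [0.8, 0.7, 0.6, 0.4, 0.3, 0.2]  # hypothetical random forest outputs

combined = ensemble_probability(p_nn, p_rf)
toy_auc = auc(labels, combined)
print(f"train={len(train_ids)} test={len(test_ids)} AUC={toy_auc:.3f}")
```

The quadratic pairwise loop is chosen for clarity on small examples; a rank-based formulation or a library routine would be the practical choice for thousands of PSI-BLAST subjects.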