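Detection of isofunctional protein subfamilies using data integration and spectral clustering (original title: Detecção de subfamílias proteicas isofuncionais utilizando integração de dados e agrupamento espectral)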

Bibliographic details
Year of defense: 2015
Main author: Elisa Boari de Lima
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por
Defense institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/BUOS-APTNCE
Abstract: Despite the best research efforts, a substantial and ever-increasing number of predicted proteins still lack functional annotation. As more and more genomes are sequenced, the vast majority of proteins may only be annotated computationally, given that experimental investigation is difficult, expensive, and time-consuming. This highlights the need for computational methods to determine protein functions quickly and reliably. However, no large-scale approach currently exists that is capable of revealing the functions of all hypothetical genes in the already sequenced genomes. This goal can only be reached through numerous research efforts, and the work presented herein is a computational effort aiming to take a step toward that goal. We believe dividing a protein family into same-specificity subtypes, which share specific functions uncommon to the family as a whole, is a first step toward reducing the complexity of the function annotation problem. Hence, the purpose of this work is to detect isofunctional subfamilies inside a family of unknown function, as well as to identify residues responsible for subfamily differentiation. For this purpose, the similarity between protein pairs according to various data types is studied and interpreted as functional similarity evidence. Data are integrated using genetic programming and then provided to a spectral clustering algorithm, which creates clusters of similar proteins. Four case studies were performed, applying the proposed framework to well-known protein families and to a family of unknown function, and comparing its results to those obtained by ASMC, a similar method found in the literature. Results showed our fully automated technique obtained better clusters than ASMC for the nucleotidyl cyclase and protein kinase families, and equivalent results for serine proteases and the DUF849 family, for which clusters were defined with manual intervention.
Clusters produced by our framework corresponded closely to the known subfamilies and were more contrasting than those produced by ASMC. Additionally, for the families whose specificity-determining positions are known, such residues were among those our technique considered most important for differentiating a given group. The best results consistently involved multiple data types, thus confirming our initial hypothesis that similarities according to different knowledge domains may be used as functional similarity evidence. Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; the use of domain knowledge for detecting isofunctional subfamilies in a protein family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
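
Illustrative sketch (not part of the original record): the abstract describes integrating pairwise similarities from several data types and passing the result to a spectral clustering algorithm. The Python snippet below is a minimal sketch of that general idea, under clearly stated assumptions. A fixed weighted average stands in for the combination that the thesis evolves with genetic programming, the clustering is reduced to a two-way split on the sign of the Fiedler vector, and the similarity matrices are synthetic toy data. The names `block_similarity`, `integrate`, and `spectral_bipartition` are hypothetical and do not come from the thesis.

```python
import numpy as np


def block_similarity(within, between, sizes=(3, 3)):
    """Toy pairwise similarity matrix with one block per hypothetical subfamily."""
    groups = np.repeat(np.arange(len(sizes)), sizes)
    S = np.where(groups[:, None] == groups[None, :], within, between).astype(float)
    np.fill_diagonal(S, 1.0)  # every protein is fully similar to itself
    return S


def integrate(similarities, weights):
    """Weighted average of similarity matrices.

    Assumption: fixed weights replace the genetic-programming step
    mentioned in the abstract.
    """
    w = np.asarray(weights, dtype=float)
    stacked = np.stack(similarities)  # shape (n_sources, n, n)
    return np.tensordot(w / w.sum(), stacked, axes=1)


def spectral_bipartition(S):
    """Split items into two clusters by the sign of the Fiedler vector
    of the symmetric normalized graph Laplacian."""
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)


# Two synthetic "data types" (for example, sequence- and structure-based similarity).
sequence_sim = block_similarity(within=0.9, between=0.1)
structure_sim = block_similarity(within=0.7, between=0.3)

combined = integrate([sequence_sim, structure_sim], weights=[0.5, 0.5])
labels = spectral_bipartition(combined)
print(labels)  # the first three and last three proteins receive different labels
```

Which cluster is labeled 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. A fuller treatment would use more than two clusters (for instance, k-means on several leading eigenvectors) and real similarity measures.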