Enzymap: exploring protein metadata to model and predict annotation changes in UniProt/Swiss-Prot

Bibliographic details
Year of defense: 2013
Main author: Sabrina de Azevedo Silveira
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese
Degree-granting institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/BUBD-9DKH48
Abstract: In recent decades there has been a surge in the amount of available biological data. New DNA sequencing technologies have made a growing number of large data projects economically feasible, which has led to an exponential increase in DNA sequence data. Vast amounts of other data have also been produced, such as protein sequences and structures, gene-expression measurements, protein and genetic interactions, and phenotype studies. Much of this data is organized and publicly available to the scientific community in biological repositories via the Internet. These repositories store not only raw biological data but also relevant information such as protein function, literature information, and the relationship between a protein and its encoding gene, among other metadata, also called annotation. In this work we propose a supervised learning approach to characterize and predict annotation changes in temporal data, which we term ENZYmatic Metadata Annotation Predictor (ENZYMAP). More precisely, we are interested in predicting enzyme function annotation based on UniProt/Swiss-Prot entry metadata. This proposal allows us to suggest possible corrections to annotations in biological repositories, and it can be used to complement other annotation methods, improving the quality and reliability of these data. Our approach uses data that are already available to enhance the repository, so it does not demand new, expensive bench experiments. Furthermore, there is a huge volume of data that cannot be analyzed manually, hence the importance of reliable automatic annotation methods. We performed an initial exploration of the data in which changes in enzyme annotation were modeled considering the numeric and hierarchical nature of the Enzyme Commission (EC) number, the enzyme classification system.
This step led to the creation of an interactive visualization tool called ADVISe and to the publication of an article at the IEEE Symposium on Biological Data Visualization (BioVis), 2012. Some metadata from Swiss-Prot were then selected to discriminate entries that experienced a specific EC change type from those whose annotation remained constant. Occurrence matrices were proposed to model EC number changes in terms of Swiss-Prot metadata, and these matrices served as input for the supervised learning approach. We performed three experiments to characterize and predict EC number changes: Descriptive Multiclass, in which we concluded that the selected metadata were able to discriminate entries that underwent a specific EC number change from those whose annotation remained constant; Predictive Multiclass, which indicated that predicting the last occurrence of an EC change type using a single multiclass classifier with a scarce number of examples was not possible; and Predictive Common Source, in which we concluded that predicting an EC change type using more specialized classifiers is possible even with a scarce number of examples. We compared predictions made by ENZYMAP with predictions made by DETECT, a technique able to associate an EC number with the residue sequence of a protein, and both were checked against Swiss-Prot annotations. The percentage of predictions in accordance with Swiss-Prot is greater for our approach than for DETECT at all four levels of EC annotation.
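As an illustration of the hierarchical nature of EC numbers mentioned in the abstract, the following minimal sketch labels an annotation change by the first of the four EC levels at which two annotations differ. It is not the thesis's implementation: the function names `parse_ec` and `ec_change_level`, the entry identifiers, and the example history are all hypothetical, and the abstract does not specify how ENZYMAP actually encodes change types or builds its occurrence matrices.

```python
from collections import Counter


def parse_ec(ec: str) -> list[str]:
    """Split an EC number such as '3.4.21.4' into its four hierarchical fields."""
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return parts


def ec_change_level(old: str, new: str) -> int:
    """Return the first level (1-4) at which two EC numbers differ, or 0 if equal.

    A '-' placeholder (incomplete annotation) is compared as an ordinary field,
    so '3.4.-.-' -> '3.4.21.4' counts as a change at level 3. This is an
    assumption of this sketch, not a rule taken from the thesis.
    """
    pairs = zip(parse_ec(old), parse_ec(new))
    for level, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return level
    return 0


# Hypothetical annotation history:
# (entry id, EC number in an older release, EC number in a newer release).
history = [
    ("entry_A", "3.4.21.4", "3.4.21.4"),
    ("entry_B", "3.4.21.4", "3.4.21.5"),
    ("entry_C", "2.7.1.1", "2.7.11.1"),
    ("entry_D", "1.1.1.1", "4.2.1.1"),
    ("entry_E", "3.4.-.-", "3.4.21.4"),
]

# Tally how many entries changed at each level (0 means unchanged).
changes = Counter(ec_change_level(old, new) for _, old, new in history)

for level in sorted(changes):
    label = "unchanged" if level == 0 else f"changed at level {level}"
    print(f"{label}: {changes[level]}")
```

Labels of this kind could serve as the classes that a supervised classifier tries to separate using entry metadata, which is the general setup the abstract describes. How the metadata are encoded as features is left out here because the record does not detail it.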