Use of linear algebra for similarity analysis and pattern extraction in protein sequences

Bibliographic details
Year of defense: 2010
Main author: Braulio Roberto Goncalves Marinho Couto
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/BUOS-8L4RSA
Abstract: Extracting patterns from protein sequence data is one of the challenges of Computational Biology. Here we use linear algebra methods and logistic regression models to analyze sequences without requiring multiple alignments. First, we treat a biomolecular sequence as a complex written language and recode it as a p-peptide frequency vector using all possible overlapping p-peptide windows. With 20 amino acids, this yields a high-dimensional vector with 20^p entries, where p is the word size. Singular value decomposition (SVD) and/or logistic regression models are then applied to the data to extract patterns or to allow visualization of high-dimensional data. Spearman correlation (r) was used to evaluate the association between the statistics used by BLAST and the similarity metrics used by SVD. Euclidean distance was negatively correlated with bit score (r>-0.6) and positively correlated with E value (r>+0.7). Cosine was negatively correlated with E value (r>-0.7) and positively correlated with bit score (r>+0.8). In addition, we compared the edit distance between each pair of sequences with the corresponding cosines and Euclidean distances from SVD. The correlation between cosine and edit distance was -0.32 (P < 0.01), and that between Euclidean distance and edit distance was +0.70 (P < 0.01). We also evaluated the ability of SVD to classify sequences according to their categories. With a 3-peptide frequency matrix, all queries were correctly classified (accuracy = 100%). We propose a biological significance for the SVD: the singular value spectrum, visualized as scree plots, reveals the main components, that is, the processes hidden in the protein database. Feature selection for protein sequence classification was performed using logistic regression models and SVD. Beyond feature selection, combining logistic regression models with SVD classified unknown sequences better than SVD alone.
We also present a method that uses information from known protein databases to build logistic regression models that allow predictions for a new amino acid sequence. We successfully tested the method in ten instances, which generated models for predicting insulin, globin, keratin, cytochrome, albumin, collagen, fibrinogen, and proteins related to cystic fibrosis, Alzheimer disease, and schizophrenia. SVD followed by optimization allows visualization of high-dimensional genomes by mapping multivariate data from their high-dimensional representation into 2D or 3D space. These results and the characteristics described are important because SVD can address potential problems with alignment algorithms and can substitute for those methods, for example, in whole-genome analysis.
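
The recoding step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`peptide_frequencies`, `cosine`, `euclidean`), the normalization to relative frequencies, and the toy sequences are all assumptions made for illustration. The sketch also compares the raw frequency vectors directly, whereas the abstract computes cosine and Euclidean distance after SVD. Only the nonzero entries of the 20^p-dimensional vector are stored.

```python
from collections import Counter
from math import sqrt


def peptide_frequencies(sequence, p=3):
    """Relative frequency of each overlapping p-peptide in a sequence.

    Returns a sparse dict holding only the p-peptides that occur.
    Normalizing counts to relative frequencies is an assumption here.
    """
    if len(sequence) < p:
        raise ValueError("sequence is shorter than the word size p")
    # Slide a window of width p one residue at a time.
    windows = [sequence[i:i + p] for i in range(len(sequence) - p + 1)]
    counts = Counter(windows)
    total = len(windows)
    return {word: n / total for word, n in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two sparse frequency vectors."""
    words = set(u) | set(v)
    return sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in words))


# Hypothetical toy sequences, not taken from the thesis.
a = peptide_frequencies("MKVLAAGIVK", p=3)
b = peptide_frequencies("MKVLAAGLVK", p=3)
c = peptide_frequencies("GSSPPWWHHC", p=3)

print(cosine(a, b), euclidean(a, b))  # similar sequences
print(cosine(a, c), euclidean(a, c))  # sequences sharing no 3-peptide
```

In this representation, sequences that share many p-peptides have a cosine close to 1 and a small Euclidean distance, while sequences that share none have a cosine of 0, and no alignment is needed to compute either measure.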