Racionalizando a utilização do algoritmo PHRED para a análise de seqüências de DNA (Rationalizing the use of the PHRED algorithm for DNA sequence analysis)

Bibliographic details
Year of defense: 2006
Main author: Francisco Prosdocimi de Castro Santos
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese (por)
Institution of defense: Universidade Federal de Minas Gerais (UFMG)
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/GRFO-7ZGK4Q
Abstract: Science is sometimes dogmatic. Even the most thoughtful scientists are sometimes forced to accept as true something believed by the community in order to advance their research. In the field of genomic research, some dogmas remain attached to scientific culture, and the main goal of this thesis is to question some of these dogmas and to bring to the light of reason consistent knowledge about specific aspects of the base-calling process. To evaluate the execution of PHRED, the main base-caller used in genome projects, we first developed a consistent methodology of analysis. Using this methodology, we tried to reduce the number of variables to be analyzed in sequencing reads, making our analysis free of the particularities of any specific sequencing reaction. With this in mind, we sequenced a well-known cloning vector (pUC18) in a single pool, homogenizing the samples before and after the sequencing reaction. In this way, 846 sequences of the pUC18 cloning vector were produced in a single pool and compared, through local alignments, with a positive control: the published sequence of this molecule. This comparison allowed us both to identify precisely the errors occurring in sequencing and/or base-calling and to evaluate different parameters used for running PHRED. We verified (1) an error pattern very similar to the expected one, (2) the impossibility of predicting errors by evaluating the quality values of the bases neighboring miscalled bases, (3) the high frequency of mismatches at low quality values and (4) the presence of some indels in high-quality regions. We also applied these base-calling data to the design of sequencing primers, and one study was published on this subject.
In an attempt to softmask low-quality bases, we carried out another study to find the best PHRED quality value for masking most of the errors without masking correct bases. Moreover, we studied and adjusted PHRED trimming parameters in order to retrieve from the sequence only the biologically relevant information. Finally, we analyzed consensus production from different numbers of sequencing reads in order to find the appropriate number of re-sequencing runs of a sample needed to generate a high-fidelity molecule.
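As a rough illustration of the softmasking idea mentioned in the abstract, the sketch below lowercases every base whose PHRED quality value falls below a cutoff. It relies on the standard definition of a PHRED quality value, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The function names and the cutoff of 20 are assumptions made for this example only; the abstract does not state which quality value the thesis found to be best.

```python
def error_probability(quality: int) -> float:
    """Estimated miscall probability for a PHRED quality value (Q = -10 * log10(P))."""
    return 10 ** (-quality / 10)


def softmask(bases: str, qualities: list[int], threshold: int) -> str:
    """Lowercase bases with quality below `threshold`; keep the rest uppercase.

    `threshold` is a hypothetical, user-chosen cutoff, not a value from the thesis.
    """
    if len(bases) != len(qualities):
        raise ValueError("bases and qualities must have the same length")
    return "".join(
        base.lower() if quality < threshold else base.upper()
        for base, quality in zip(bases, qualities)
    )


if __name__ == "__main__":
    read = "ACGTAC"
    quals = [35, 12, 40, 8, 20, 19]
    # Illustrative cutoff of 20, which corresponds to a 1% estimated error rate.
    print(softmask(read, quals, threshold=20))
    print(error_probability(20))
```

Softmasking keeps the low-quality bases in the read, so downstream tools can still see them while treating them as less reliable. A higher cutoff masks more errors but also more correct bases, which is the trade-off the abstract describes.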