Sistema para análise de sequências nucleotídicas do HIV disponíveis no GenBank

Detalhes bibliográficos
Ano de defesa: 2015
Autor(a) principal: Gonçalves, José Irahe Kasprzykowski lattes
Orientador(a): Queiroz, Artur Trancoso Lopo de lattes
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Universidade Estadual de Feira de Santana
Programa de Pós-Graduação: Mestrado em Computação Aplicada
Departamento: DEPARTAMENTO DE CIÊNCIAS EXATAS
País: Brasil
Palavras-chave em Português:
HIV
Palavras-chave em Inglês:
HIV
Área do conhecimento CNPq:
Link de acesso: http://localhost:8080/tede/handle/tede/327
Resumo: HIV infects over 40 million people worldwide and is considered by the World Health Organization a large scale pandemic. Which the associated disease has no cure. New data and analysis can help new treatment and vaccine development. However, the dataset is vast, with over 500,000 sequences available on GenBank. This data still lacks essential information such as subtyping and genome location. To help minimize these problems we developed a system for automated analysis from GenBank data. The tool performs sequence map according to HXB2 and subtyping by comparison with subtype reference sequences. This process uses Needleman-Wusch and Smith-Waterman respectively. All 582,678 sequences were mapped in 5 days and 14 hours and subtyped in 1 day and 7 hours with our algorithm, while the original approach was estimated to finish in 36 and 97 years respectively. Our tool was able to analyse the massive data in a reliable time. No current subtyping tool can analyse this high-throughput data. Our results showed that pol and gag genes were the most prevalent genes on the dataset, and could be explained because treatment and subtyping are based on these genes. Moreover, the structural genes were most prevalent, with 66.41%. This highlighted the low representation of regulatory genes on available data. The subtyping results showed that the subtype B was most frequent, with 45.96%. The recombinants together represent 43.37%. Furthermore, subtype C presented only 4.12% and the other pure subtypes less than 4%. Also, the geographical data was recovered from database and USA presented higher frequency, with 24.50%, showing a significant country bias. Our results present a new HIV subtype distribution with the most complete and recent dataset.Herein, we presented a new user friendly software for massive data analysis of viruses. This software is able to analyse highly mutational virus data, such as HCV and HIV in reliable time. Further, severe country bias raises questions regarding world subtype distribution. The analysis of all sequences from HIV provides new epidemy insights about subtypes and country distribution.