Proposal of a data mining pipeline to improve ab initio prediction of bacterial small rna

Detalhes bibliográficos
Ano de defesa: 2018
Autor(a) principal: Reinoso Vilca, Fabio Ivan
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: eng
Instituição de defesa: Universidade Federal de Viçosa
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
RNA
Link de acesso: http://www.locus.ufv.br/handle/123456789/23926
Resumo: Bacterial small RNAs (sRNAs) are usually non-coding RNAs (ncRNAs) with a size of 50–500 nucleotides, and act mainly as post-transcriptional regulators. Prediction of sRNAs is a challenging issue in bioinformatics. The current computational tools deliver a high number of false positives. Hence, the development of more precise predictive methods is of fundamental importance to narrow the number of costly and time-consuming sequence validations on the laboratory workbench. In this work, we collected a series of features from the existent computational tools for ncRNA prediction in order to select the best ones for classifying putative bacterial sRNA sequences. Out of the 264 initially-chosen features, 22 relevant and non-redundant features could be selected by using feature-selection algorithms. To validate this proposal we used a dataset built with only experimentally-validated sRNAs from different bacteria sub-strains, considered as model organisms in genetics, as well as non-sRNA sequences. Finally, a Random Forest algorithm was applied for the classification task. Our first validation experiment of this proposal covered the single sequence prediction task, using 6 testing sets. Our pipeline presented better results than the only ab initio method we could find in literature. The differentiating characteristics of our method are the lower computational cost, the dimensionality reduction and the analytic power analysis due to the single 22 features selected. Our approach could reach an average of 80% of Accuracy, 71.28% of Precision, 82.11% of Specificity and an area under the ROC curve of 0.879. Furthermore, we presented a Genome-wide framework to sRNA prediction, obtaining a 39% lower False Positive Ratio and the double of Specificity than the above-mentioned ab initio method.