Analysis of microRNA precursors in multiple species by data mining techniques

Lopes, Ivani de Oliveira Negrão

Analysis of microRNA precursors in multiple species by data mining techniques

Detalhes bibliográficos
Ano de defesa:	2014
Autor(a) principal:	Lopes, Ivani de Oliveira Negrão
Orientador(a):	Não Informado pela instituição
Banca de defesa:	Não Informado pela instituição
Tipo de documento:	Tese
Tipo de acesso:	Acesso aberto
Idioma:	eng
Instituição de defesa:	Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação:	Não Informado pela instituição
Departamento:	Não Informado pela instituição
País:	Não Informado pela instituição
Palavras-chave em Português:	Data mining Ensembles Mineração de dados Pre-miRNA prediction Predição de pre-microRNA
Link de acesso:	http://www.teses.usp.br/teses/disponiveis/55/55134/tde-19092014-155038/
Resumo:	RNA Sequencing has recently emerged as a breakthrough technology for microRNA (miRNA) discovery. This technology has allowed the discovery of thousands of miRNAs in a large number of species. However, despite the benefits of this technology, it also carries its own limitations, including the need for sequencing read libraries and of the genome. Differently, ab initio computational methods need only the genome as input to search for genonic locus likely to give rise to novel miRNAs. In the core of most of these methods, there are predictive models induced by using data mining techniques able to distinguish between real (positive) and pseudo (negative) miRNA precursors (pre-miRNA). Nevertheless, the applicability of current literature ab initio methods have been compromised by high false detection rates and/or by other computational difficulties. In this work, we investigated how the main aspects involved in the induction of predictive models for pre-miRNA affect the predictive performance. Particularly, we evaluate the discriminant power of feature sets proposed in the literature, whose computational costs and composition vary widely. The computational experiments were carried out using sequence data from 45 species, which covered species from eight phyla. The predictive performance of the classification models induced using large training set sizes (≥ 1; 608) composed of instances extracted from real and pseudo human pre-miRNA sequences did not differ significantly among the feature sets that lead to the maximal accuracies. Moreover, the differences in the predictive performances obtained by these models, due to the learning algorithms, were neglectable. Inspired by these results, we obtained a feature set which can be computed 34 times faster than the less costly among those feature sets, producing the maximal accuracies, albeit the proposed feature set has achieved accuracy within 0.1% of the maximal accuracies. When classification models using the elements previously discussed were induced using small training sets (120) from 45 species, we showed that the feature sets that produced the highest accuracies in the classification of human sequences were also more likely to produce higher accuracies for other species. Nevertheless, we showed that the learning complexity of pre-miRNAs vary strongly among species, even among those from the same phylum. These results showed that the existence of specie specific features indicated in previous studies may be correlated with the learning complexity. As a consequence, the predictive accuracies of models induced with different species and same features and instances spaces vary largely. In our results, we show that the use of training examples from species phylogenetically more complex may increase the predictive performances for less complex species. Finally, by using ensembles of computationally less costly feature sets, we showed alternative ways to increase the predictive performance for many species while keeping the computational costs of the analysis lower than those using the feature sets from the literature. Since in miRNA discovery the number of putative miRNA loci is in the order of millions, the analysis of putative miRNAs using a computationally expensive feature set and or inaccurate models would be wasteful or even unfeasible for large genomes. In this work, we explore most of the learning aspects implemented in current ab initio pre-miRNA prediction tools, which may lead to the development of new efficient ab initio pre-miRNA discovery tools

Analysis of microRNA precursors in multiple species by data mining techniques

Registros relacionados