Prediction and validation of essential genes in prokaryotes and eukaryotes using machine learning and sequence-intrinsic attributes

Bibliographic details
Year of defense: 2021
Main author: Giovanni Marques de Castro
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por (Portuguese)
Defending institution: Universidade Federal de Minas Gerais
Brazil
ICB - INSTITUTO DE CIÊNCIAS BIOLOGICAS
Programa de Pós-Graduação em Bioinformatica
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/74615
Abstract: Essential genes are defined as those whose functional product is required for the organism's viability. Non-essential genes, in contrast, are those whose functional product can be absent while still yielding phenotypically viable individuals. The large-scale characterization of these genes provides a description of minimal genomes compatible with cellular life and suggests potential molecular targets for the development of specific biopesticides with a smaller ecological footprint. However, since the experimental characterization of these genes is costly and time-consuming, several computational strategies have been used to predict essential genes. A common approach is the use of machine learning algorithms, which learn from data and experience. Machine learning algorithms developed to predict essential genes can use two types of gene-centric attributes: 1) extrinsic, defined as those that use information not contained in the gene sequence itself (e.g. gene expression profile, annotation); 2) intrinsic, defined as those computed from the gene sequence only (e.g. frequency of k-mers, entropy). Although essential gene predictors that use extrinsic attributes outperform those that use only intrinsic attributes, the former generalize poorly, since they cannot be applied to non-model organisms for which extrinsic information is unavailable. In this work, we developed and validated a complete computational routine for the prediction of essential genes in prokaryotes and eukaryotes. Specifically, we developed an R package that integrates and calculates 5093 nucleotide and 9815 protein attributes, totaling 14908 intrinsic attributes.
These attributes, together with the labels (essential genes versus non-essential genes), are then used to train random forest and gradient boosting models, following state-of-the-art practices for evaluating model performance. We validated our methodology by gathering high-quality sequence data for essential and non-essential genes from two phylogenetically distant prokaryote species (Acinetobacter baylyi, Proteobacteria; Staphylococcus aureus, Firmicutes) and from two insect species (Drosophila melanogaster, Diptera; Tribolium castaneum, Coleoptera). We used these data to train individual classifiers for each species. As validation, we demonstrate that classifiers trained with data from one prokaryote or insect species are able to predict essential genes in another prokaryote or insect species, respectively, which emulates routine use of the tool on new organisms. The source code for attribute calculation and model training, as well as the databases of essential and non-essential genes used in this study, are available at https://github.com/g1o/GeneEssentiality.
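
As an illustration of what the abstract calls intrinsic attributes, the following minimal sketch computes two of the examples it names (k-mer frequencies and entropy) from a nucleotide sequence alone. It is written in Python for illustration only; the thesis implements its attributes in an R package, and the function names here (`kmer_frequencies`, `shannon_entropy`, `intrinsic_features`) are hypothetical and do not reflect the GeneEssentiality API. Taking entropy as the Shannon entropy of the k-mer distribution is also an assumption, since the abstract does not specify the definition used.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Relative frequency of every possible k-mer over `alphabet` in `seq`.

    Windows containing symbols outside the alphabet (e.g. N) are ignored,
    so the returned frequencies sum to 1 whenever any valid k-mer exists.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def shannon_entropy(freqs):
    """Shannon entropy (in bits) of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def intrinsic_features(seq, k=2):
    """Feature dictionary built from the sequence only (hypothetical layout)."""
    freqs = kmer_frequencies(seq, k)
    features = {f"kmer_{km}": f for km, f in freqs.items()}
    features["entropy"] = shannon_entropy(freqs)
    return features
```

Calling `intrinsic_features(sequence)` on each gene would give one feature row per gene; paired with essential versus non-essential labels, rows of this kind are the sort of input a random forest or gradient boosting classifier could be trained on.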