Pattern recognition using complex network metrics for feature extraction, representation, and classification of RNA sequences

Bibliographic details
Year of defense: 2018
Main author: Katahira, Isaque
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: por
Institution of defense: Universidade Tecnológica Federal do Paraná
Cornelio Procopio
Brazil
Programa de Pós-Graduação em Bioinformática
UTFPR
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://repositorio.utfpr.edu.br/jspui/handle/1/3368
Abstract: With the emergence of Next Generation Sequencers (NGS), a large volume of DNA and RNA has been sequenced quickly and at relatively low cost. NGS has an output capacity of several thousand sequences simultaneously, producing a massive volume of data to be analyzed. Computational tools therefore become essential not only for extracting the data but also for selecting and analyzing it. This research presents a model capable of extracting features for the classification of coding and non-coding RNAs. The BiologicAl Sequences NETwork (BASiNET) package, available at https://cran.r-project.org/package=BASiNET, implements the developed method, which represents RNA sequences as complex networks, since these are efficient at representing real systems such as biological systems. To represent the selected sequences, the complex network is configured by the step size parameter, which represents the connections between nucleotides, and by the word size parameter, which represents the number of nucleotides per vertex. The least dense edges are then removed to generate subnetworks, which result from the progressive elimination of 1 to n edges from the network. Each subnetwork is then subjected to the following measures: closeness, degree, maximum degree, minimum degree, betweenness, clustering coefficient, average shortest path, standard deviation, and motifs. The measures extracted from each subnetwork make up the feature vector, whose values are fed into a supervised classification algorithm that, by detecting patterns, distinguishes the sequences under 10-fold cross-validation. The BASiNET tool was applied to two data sets. The results were compared with those of other tools: Predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme (PLEK), Coding-Non-Coding Index (CNCI), and Coding Potential Calculator (CPC2).
The comparison indicates that BASiNET performs better, since it achieved higher average accuracy in identifying coding and non-coding RNAs in the two experimental data sets. Averaged over the two experiments, accuracy in identifying coding RNAs was higher by 8.6% with respect to CNCI, 11.4% with respect to PLEK, and 4.4% with respect to CPC2. For non-coding RNAs, the overall average was 2.2%, 2.6%, and 1.5% higher than CNCI, PLEK, and CPC2, respectively. The improvement in the accuracy indices reinforces the stability and homogeneity of the method. Finally, the method implemented by BASiNET uses open source tools, can be executed on a computer with a basic configuration, and can be extended to the classification of other sequences such as DNA and amino acids.
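
The pipeline described in the abstract (sequence to network, thresholded subnetworks, topological measures, feature vector) can be illustrated with a minimal Python sketch. This is not the BASiNET implementation, which is an R package; every function name here is hypothetical. It assumes that each word is linked to the word immediately following it, that the "least dense" edges are those with the lowest weight, and it computes only four of the measures listed (mean, maximum, and minimum degree, and mean clustering coefficient).

```python
from collections import Counter
from itertools import combinations


def build_network(seq, word=3, step=1):
    """Count weighted undirected edges between adjacent words of a sequence."""
    edges = Counter()
    # Assumption: the word starting at position i is linked to the word that
    # immediately follows it (position i + word); the window moves by `step`.
    for i in range(0, len(seq) - 2 * word + 1, step):
        a, b = seq[i:i + word], seq[i + word:i + 2 * word]
        if a != b:  # self-loops are skipped in this sketch
            edges[frozenset((a, b))] += 1
    return edges


def subnetwork(edges, threshold):
    """Keep edges whose weight exceeds `threshold`; return adjacency sets."""
    adj = {}
    for pair, weight in edges.items():
        if weight > threshold:
            a, b = tuple(pair)
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def measures(adj):
    """Return [mean degree, max degree, min degree, mean clustering]."""
    if not adj:
        return [0.0, 0, 0, 0.0]
    degrees = [len(nbrs) for nbrs in adj.values()]
    clustering = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            clustering.append(0.0)
            continue
        # Count links among the neighbours of this vertex.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        clustering.append(2.0 * links / (k * (k - 1)))
    return [sum(degrees) / len(degrees), max(degrees), min(degrees),
            sum(clustering) / len(clustering)]


def feature_vector(seq, word=3, step=1, max_threshold=3):
    """Concatenate the measures of each thresholded subnetwork."""
    edges = build_network(seq, word, step)
    features = []
    for t in range(max_threshold):
        features.extend(measures(subnetwork(edges, t)))
    return features
```

Calling `feature_vector(sequence)` on each labeled RNA sequence would give one row per sequence, which could then be passed to any supervised classifier and evaluated with 10-fold cross-validation as the abstract describes.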