Predicting non-coding RNA function using Artificial Intelligence

Bibliographic Details
Main Author: Correia, David Alexandre da Costa
Publication Date: 2024
Format: Master thesis
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10400.5/96901
Summary: Master's thesis, Bioinformatics and Computational Biology, 2024, Universidade de Lisboa, Faculdade de Ciências
Subjects: Non-coding RNAs; Relation Extraction; Text Mining; Distant Supervision; Large Language Models; Master's theses - 2024; Departamento de Informática
Contributors: Martiniano, Hugo Filipe de Mesquita Costa, 1978-; Couto, Francisco José Moreira; Repositório da Universidade de Lisboa
Access: open access (application/pdf)
Dates listed: 2024; 2025-01-07T10:51:19Z
Abstract: Non-coding RNAs (ncRNAs) represent the majority of human gene products and are involved in various important biological processes, being considered relevant disease biomarkers and therapeutic agents. However, there are few functional annotation databases dedicated to ncRNAs, and information about these biomolecules remains sparsely distributed, mostly in the form of scientific research articles. It is therefore of pivotal importance to aggregate and summarize the existing information.

Natural Language Processing methods applied to text mining enable automatic information extraction and summarization from textual data. These techniques can be used to generate collections of annotated sentences expressing relations between entities, called relational corpora. In this work, a text mining pipeline was implemented to generate an ncRNA-phenotype relational corpus (ncoRP) using Distant Supervision Relation Extraction (DSRE), consisting of 21,608 annotated articles, 2,835 unique ncRNAs, 1,118 unique phenotypes and 35,295 unique relations, with a precision of 0.761 and an F1-score of 0.593, calculated through human validation. Because DSRE methods require a set of previously documented relations to function, a high-fidelity ncRNA-phenotype relation dataset, consisting of 214,300 unique relations, was created by aggregating five ncRNA-disease functional annotation databases. Together, ncoRP and the relation dataset represent important contributions towards addressing the sparseness of information about ncRNAs.

Large Language Models (LLMs) are an emerging type of language model, showing great capabilities in general task-solving through text generation, without requiring fine-tuning on large datasets. This benefit shows promise for applications in Relation Extraction (RE) when compared to data-intensive state-of-the-art deep learning methods. In this work, an LLM RE methodology is proposed and evaluated, achieving an F1-score of 0.978 by combining the RE task with a preceding sentence filtering task and applying prompting principles such as in-context learning and Chain-of-Thought self-explanation.
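The distant-supervision idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis pipeline: the knowledge base, the entity vocabularies, the example sentences and the names `KNOWN_RELATIONS`, `find_mentions` and `distant_label` are all hypothetical, and entity detection is reduced to case-insensitive substring matching. It only shows the core assumption of DSRE: a sentence that mentions both entities of a previously documented relation is labelled as expressing that relation.

```python
from itertools import product

# Hypothetical toy knowledge base of documented (ncRNA, phenotype) pairs.
# The thesis aggregates five annotation databases; this set is only a stand-in.
KNOWN_RELATIONS = {("malat1", "lung cancer"), ("hotair", "breast cancer")}

# Hypothetical entity vocabularies used for naive mention detection.
NCRNAS = ["MALAT1", "HOTAIR"]
PHENOTYPES = ["lung cancer", "breast cancer"]


def find_mentions(sentence, vocabulary):
    """Return the vocabulary terms that occur in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return [term for term in vocabulary if term.lower() in lowered]


def distant_label(sentence):
    """Label every co-occurring (ncRNA, phenotype) pair in a sentence.

    A pair is labelled True when it appears in the knowledge base and
    False otherwise; sentences without both entity types yield no pairs.
    """
    rnas = find_mentions(sentence, NCRNAS)
    phenotypes = find_mentions(sentence, PHENOTYPES)
    return [
        (rna, phen, (rna.lower(), phen.lower()) in KNOWN_RELATIONS)
        for rna, phen in product(rnas, phenotypes)
    ]


if __name__ == "__main__":
    # Made-up sentences, for illustration only.
    toy_sentences = [
        "MALAT1 expression was elevated in lung cancer samples.",
        "HOTAIR was measured in lung cancer tissue.",
        "The samples were stored at low temperature.",
    ]
    for sentence in toy_sentences:
        print(sentence, "->", distant_label(sentence))
```

Labels produced this way are noisy, because co-occurrence does not guarantee that the sentence actually states the relation; this is consistent with the abstract reporting precision and F1-score obtained through human validation.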