BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION

Bibliographic details
Main author: Ramos, Diogo Luís Embaixador
Publication date: 2024
Document type: Dissertation
Language: eng
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10362/182368
Abstract: This dissertation explores state-of-the-art deep learning models for document retrieval in biomedical research, using as a case study the Exposome-Explorer database, which contains manually curated entries on biomarkers of exposure to environmental risk factors for various diseases. Previous work has employed simple machine learning algorithms to reduce expert workload by improving the accuracy and efficiency of document retrieval. In this dissertation, traditional document retrieval methods, such as BM25, are evaluated alongside transformer models such as MonoBERT, DistilBERT, and PubMedBERT to assess their suitability for the task. Results show that PubMedBERT, pre-trained on biomedical text, offers the best performance in retrieving relevant documents, with BM25 contributing significantly to initial dataset refinement. However, challenges persist, such as variability in the curated data and in precision and recall, particularly for smaller datasets with fewer training examples, such as pollutant biomarkers. This research represents a step forward in automating and refining the curation of biomedical databases, ensuring faster and more reliable results. Future work will involve applying the trained models to the latest version of the Exposome-Explorer database and enhancing BM25 with RM3 query expansion for improved document ranking. Additional optimization of the models will be explored to address performance variability and improve overall retrieval accuracy across different biomarker datasets.
id RCAP_aafc46c5d2f23c4da4414bfc2b1da8d0
oai_identifier_str oai:run.unl.pt:10362/182368
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
title BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
spellingShingle BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
Ramos, Diogo Luís Embaixador
DEEP LEARNING
DOCUMENT RETRIEVAL
DATABASE CURATION
BIOMEDICAL LITERATURE
INFORMATION RETRIEVAL
Scientific Domain/Area::Engineering and Technology::Electrical, Electronic and Computer Engineering
title_short BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
title_full BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
title_fullStr BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
title_full_unstemmed BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
title_sort BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
author Ramos, Diogo Luís Embaixador
author_facet Ramos, Diogo Luís Embaixador
author_role author
dc.contributor.none.fl_str_mv Lamúrias, André
RUN
dc.contributor.author.fl_str_mv Ramos, Diogo Luís Embaixador
dc.subject.por.fl_str_mv DEEP LEARNING
DOCUMENT RETRIEVAL
DATABASE CURATION
BIOMEDICAL LITERATURE
INFORMATION RETRIEVAL
Scientific Domain/Area::Engineering and Technology::Electrical, Electronic and Computer Engineering
topic DEEP LEARNING
DOCUMENT RETRIEVAL
DATABASE CURATION
BIOMEDICAL LITERATURE
INFORMATION RETRIEVAL
Scientific Domain/Area::Engineering and Technology::Electrical, Electronic and Computer Engineering
description This dissertation explores state-of-the-art deep learning models for document retrieval in biomedical research, using as a case study the Exposome-Explorer database, which contains manually curated entries on biomarkers of exposure to environmental risk factors for various diseases. Previous work has employed simple machine learning algorithms to reduce expert workload by improving the accuracy and efficiency of document retrieval. In this dissertation, traditional document retrieval methods, such as BM25, are evaluated alongside transformer models such as MonoBERT, DistilBERT, and PubMedBERT to assess their suitability for the task. Results show that PubMedBERT, pre-trained on biomedical text, offers the best performance in retrieving relevant documents, with BM25 contributing significantly to initial dataset refinement. However, challenges persist, such as variability in the curated data and in precision and recall, particularly for smaller datasets with fewer training examples, such as pollutant biomarkers. This research represents a step forward in automating and refining the curation of biomedical databases, ensuring faster and more reliable results. Future work will involve applying the trained models to the latest version of the Exposome-Explorer database and enhancing BM25 with RM3 query expansion for improved document ranking. Additional optimization of the models will be explored to address performance variability and improve overall retrieval accuracy across different biomarker datasets.
publishDate 2024
dc.date.none.fl_str_mv 2024-12
2024-12-01T00:00:00Z
2025-04-16T15:06:41Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/182368
url http://hdl.handle.net/10362/182368
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833602701567459328