BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION
Main author: | Ramos, Diogo Luís Embaixador |
---|---|
Publication date: | 2024 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Full text: | http://hdl.handle.net/10362/182368 |
Abstract: | This dissertation explores state-of-the-art deep learning models for document retrieval in biomedical research, using the Exposome-Explorer database as a case study. The database contains manually curated entries on biomarkers of exposure to environmental risk factors for various diseases. Previous work has employed simple machine learning algorithms to reduce expert workload by improving the accuracy and efficiency of document retrieval. In this dissertation, traditional document retrieval methods such as BM25 are evaluated alongside transformer models such as MonoBERT, DistilBERT, and PubMedBERT to assess their suitability for the task. Results show that PubMedBERT, pre-trained on biomedical text, offers the best performance in retrieving relevant documents, with BM25 contributing significantly to initial dataset refinement. However, challenges persist, including variability in the curated data and in precision and recall, particularly for smaller datasets with fewer training examples, such as pollutant biomarkers. This research represents a step forward in automating and refining the curation of biomedical databases, with the aim of faster and more reliable results. Future work will involve applying the trained models to the latest version of the Exposome-Explorer database and enhancing BM25 with RM3 query expansion for improved document ranking. Additional optimization of the models will be explored to address performance variability and improve overall retrieval accuracy across different biomarker datasets. |
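
The abstract names BM25 as the traditional retrieval method evaluated alongside the transformer models. As an illustration only, the following is a minimal sketch of Okapi BM25 scoring. The whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, the function names, and the toy corpus are all assumptions made for this example; none of them are taken from the dissertation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF (the "+1" variant), which stays non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, docs):
    """Return document indices sorted from highest to lowest BM25 score."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy corpus (hypothetical abstracts, not taken from Exposome-Explorer).
corpus = [
    "urinary biomarkers of dietary polyphenol exposure",
    "deep learning for image segmentation",
    "blood biomarkers of pollutant exposure in adults",
]
ranking = rank("pollutant exposure biomarkers", corpus)
```

Here `ranking` lists the document that matches all three query terms first and the unrelated document last. In a pipeline like the one the abstract outlines, the top-ranked candidates would presumably then be handed to a transformer model such as PubMedBERT for finer-grained relevance scoring.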
id | RCAP_aafc46c5d2f23c4da4414bfc2b1da8d0 |
---|---|
oai_identifier_str | oai:run.unl.pt:10362/182368 |
network_acronym_str | RCAP |
network_name_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str | https://opendoar.ac.uk/repository/7160 |
dc.title.none.fl_str_mv | BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION |
title | BIOMEDICAL DOCUMENT RETRIEVAL FOR DATABASE CURATION |
author | Ramos, Diogo Luís Embaixador |
author_facet | Ramos, Diogo Luís Embaixador |
author_role | author |
dc.contributor.none.fl_str_mv | Lamúrias, André; RUN |
dc.contributor.author.fl_str_mv | Ramos, Diogo Luís Embaixador |
dc.subject.por.fl_str_mv | DEEP LEARNING; DOCUMENT RETRIEVAL; DATABASE CURATION; BIOMEDICAL LITERATURE; INFORMATION RETRIEVAL; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
topic | DEEP LEARNING; DOCUMENT RETRIEVAL; DATABASE CURATION; BIOMEDICAL LITERATURE; INFORMATION RETRIEVAL; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
publishDate | 2024 |
dc.date.none.fl_str_mv | 2024-12; 2024-12-01T00:00:00Z; 2025-04-16T15:06:41Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
format | masterThesis |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10362/182368 |
url | http://hdl.handle.net/10362/182368 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.source.none.fl_str_mv | reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
instname_str | FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv | info@rcaap.pt |
_version_ | 1833602701567459328 |