Bibliographic details
Year of defense: 2022
Lead author: Carmo, Vinícius Cleves de Oliveira
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: eng
Defending institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/3/3141/tde-24052023-152815/
Abstract:
Question Answering (QA) systems that operate over textual databases aim to improve on traditional information retrieval systems: while the latter retrieve a number of relevant documents from a document pool, the former can also find and present direct answers to users. Recent improvements in QA have been based on deep neural networks, which require large volumes of labeled data for training. Most existing datasets target general knowledge and, even though a few datasets exist for specific domains (such as biomedicine), most domains have no labeled, or easy-to-label, dataset available. This creates an obstacle to the development of domain-specific QA systems. We propose a framework for developing domain-specific QA systems that leverages unsupervised learning to avoid the costs of large-scale dataset labeling. The contributions of this work are twofold. First, we apply domain-adaptive pretraining to improve the out-of-domain performance of reading comprehension and question answering systems. This technique achieves state-of-the-art results on two Reading Comprehension datasets, and it exceeds the performance of state-of-the-art domain adaptation techniques in the literature by a significant margin: 2.3 exact match and 5.2 F1-score on BioASQ. Second, we propose a framework for domain-specific question answering in the low-data regime. For document retrieval, we combine BM25 with a custom text processing pipeline. We find that, in a low-data setting, statistical document retrieval models outperform neural models when the data in the target domain differ from the data used for training. For answer selection, we apply a neural reader trained with domain-adaptive pretraining to improve generalization to the target domain. We also present a case study that applies the proposed framework to the offshore engineering domain.
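The retrieval step mentioned in the abstract can be illustrated with a minimal BM25 sketch. This is not the dissertation's implementation: the `tokenize` function is a hypothetical stand-in for the custom text processing pipeline (which the abstract does not describe), the `k1` and `b` values are commonly used defaults rather than values taken from the work, the IDF uses the smoothed variant with `+ 1` inside the logarithm, and the example documents are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for the custom text processing pipeline:
    # lowercase the text and keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 ranker over an in-memory list of documents."""

    def __init__(self, documents, k1=1.5, b=0.75):
        # k1 and b are assumed defaults, not values from the dissertation.
        self.k1 = k1
        self.b = b
        self.doc_freqs = [Counter(tokenize(d)) for d in documents]
        self.doc_lens = [sum(f.values()) for f in self.doc_freqs]
        n_docs = len(documents)
        # Guard against an empty or all-empty collection.
        self.avg_len = (sum(self.doc_lens) / n_docs) if n_docs else 1.0
        self.avg_len = self.avg_len or 1.0
        # Document frequency: number of documents containing each term.
        df = Counter()
        for freqs in self.doc_freqs:
            df.update(freqs.keys())
        # Smoothed IDF, which stays positive for very common terms.
        self.idf = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query, index):
        """BM25 score of the document at `index` for `query`."""
        freqs = self.doc_freqs[index]
        length_norm = self.k1 * (
            1 - self.b + self.b * self.doc_lens[index] / self.avg_len
        )
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            total += self.idf[term] * tf * (self.k1 + 1) / (tf + length_norm)
        return total

    def top_k(self, query, k=3):
        """Indices of the k highest-scoring documents, best first."""
        ranked = sorted(
            range(len(self.doc_freqs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return ranked[:k]


# Invented example documents, loosely themed on offshore engineering.
docs = [
    "Mooring lines keep a floating platform in position.",
    "BM25 ranks documents by term frequency and document length.",
    "Risers carry fluids from the seabed to the platform.",
]
retriever = BM25(docs)
best = retriever.top_k("How do mooring lines work?", k=1)
```

In a pipeline like the one the abstract describes, the top-ranked documents would then be passed to the neural reader for answer selection. A statistical ranker of this kind needs no labeled training data, which fits the low-data setting the abstract targets.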