Bibliographic details
Year of defense: |
2024 |
Main author: |
Cordeiro, Fábio Corrêa |
Advisor: |
Coelho, Flávio Codeço |
Defense committee: |
Not provided by the institution |
Document type: |
Thesis
|
Access type: |
Open access |
Language: |
eng |
Defending institution: |
Not provided by the institution
|
Graduate program: |
Not provided by the institution
|
Department: |
Not provided by the institution
|
Country: |
Not provided by the institution
|
Keywords in English: |
|
Access link: |
https://hdl.handle.net/10438/35868
|
Abstract: |
Numerous companies are interested in gathering strategic information from their document repositories. This is especially relevant for the oil and gas industry, which holds large repositories of geoscientific reports from several decades of production. Making this information available in a structured format can unlock valuable knowledge buried in these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built with general-domain texts extracted from the Internet and written primarily in English. This thesis presents a methodology for extracting geoscientific entities and relations from technical documents and populating a knowledge graph, the Petro KGraph. We also developed a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. Throughout the text, we describe these resources and the process used to train the machine learning models, and we review the relevant literature. Finally, we evaluate each model and the overall methodology. We developed an innovative Entity Linking approach that can find new entities beyond those already known. Another crucial contribution is that the new resources and evaluation procedures constitute a new benchmark for the Portuguese language and the geoscience domain. We evaluated an information retrieval system that uses the Petro KGraph to expand its queries; it performed slightly better than the same system without query expansion. Plans for future work include building an improved information retrieval test set, comparing results across different graph embedding algorithms, and testing language models released after the BERT models. |