Petro KGraph: a methodology for extracting knowledge graph from technical documents - an application in the oil and gas industry.

Bibliographic details
Year of defense: 2024
Main author: Cordeiro, Fábio Corrêa
Advisor: Coelho, Flávio Codeço
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in English:
Access link: https://hdl.handle.net/10438/35868
Abstract: Numerous companies are interested in gathering strategic information from their document repositories. This is especially relevant for the oil and gas industry, which has large repositories of geoscientific reports from several decades of production. Making this information available in a structured format can unlock valuable information among these mountains of data, which is crucial to support a wide range of industrial and academic applications. However, most natural language processing resources were built with general-domain texts extracted from the Internet and written primarily in English. This thesis presents a methodology for extracting geoscientific entities and relations from technical documents and populating a knowledge graph - the Petro KGraph. We also developed a comprehensive set of natural language processing and information extraction resources for the oil and gas industry in Portuguese. Throughout the text, we describe these resources and the process used to train the machine learning models, and we review the relevant literature. Finally, we evaluate each model and the overall methodology. We developed an innovative Entity Linking approach that allows finding new entities beyond those already known. Another crucial contribution is that the new resources and evaluation procedures constitute a new benchmark for the Portuguese language and the geoscience domain. We evaluated an information retrieval system that uses the Petro KGraph to expand its queries; it achieved a slightly better result than the same system without query expansion. Plans for future work include building an improved information retrieval test set, comparing the results obtained with different graph embedding algorithms, and testing language models released after the BERT models.