Heterogeneous Graphs for Text Representation: An Integrated Approach with Language Models

Bibliographic details
Year of defense: 2023
Main author: Santos, Brucce Neves dos
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Doctoral thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-25032024-112903/
Abstract: Data representation through graphs is essential for analyzing complex relationships in fields like computer science and biology. In real-world scenarios, relationships between vertices do not always follow a uniform pattern, creating the need for heterogeneous graphs, which represent different types of vertices and various relationships in complex systems. However, heterogeneous graphs come with challenges. The diversity of vertex and relationship types makes these structures harder to understand and analyze than homogeneous graphs. To address this challenge, several machine learning models specific to heterogeneous graphs have been developed to capture the semantics of relationships between entities. Text representation in heterogeneous graphs is also challenging due to the lack of structure in textual data, which can lead to information loss. Additionally, heterogeneous graphs struggle to capture detailed semantic information in texts, as they are primarily designed to represent explicit structures and structural relationships. Resolving textual ambiguities is also difficult for heterogeneous graphs, since it requires a deep understanding of textual context. While language models excel at text comprehension, they may not be suitable for representing complex entities and relationships in real-world systems: accurately identifying entities mentioned in texts and linking them to real-world entities can be challenging. The integration of heterogeneous graphs and language models offers a promising solution. It combines the structural knowledge of heterogeneous graphs with the textual understanding of language models, producing embeddings that incorporate both the structural complexity of graphs and natural language understanding. This approach can enhance performance in natural language processing, recommendation, and information retrieval tasks. This doctoral thesis focuses on overcoming the limitations of heterogeneous graphs in representing semantic information in texts. The proposal is to combine heterogeneous graphs with language models, leveraging the advantages of both approaches: while graphs represent structures and relationships, language models specialize in efficiently understanding and generating text. The underlying hypothesis is that this combination yields richer data representations, improving performance in complex data analyses. This thesis introduces a two-stage approach that combines label propagation techniques and language model embeddings to generate vector representations of vertices in heterogeneous graphs. Within this approach, the EPHG-CR (Embedding Propagation for Heterogeneous Graphs with Class Refinement) method is proposed, which differs from prior methods by considering not only edge weights but also vertex relevance to the task's classes, bringing vertices of the same class closer together while taking the graph's topology into account. This approach was compared with a language model on the aspect-based sentiment analysis task, showing competitive results and, in some cases, slight superiority. Furthermore, the thesis explores applications of the auxiliary vertex embeddings in other tasks, demonstrating another advantage of the approach.
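To make the two-stage idea described in the abstract more concrete, the sketch below illustrates, in plain NumPy, how auxiliary (text-free) vertices in a heterogeneous graph could inherit embeddings from text vertices through weighted, class-aware propagation. This is a hypothetical illustration only: the vertex names, edge weights, class_boost parameter, and propagation rule are assumptions made for the example and do not reproduce the thesis's actual EPHG-CR formulation; in practice the document embeddings would come from a language model rather than random vectors.

# Minimal sketch: class-aware embedding propagation in a heterogeneous graph.
# Assumptions, not the thesis's EPHG-CR method: vertex names, weights, and
# the propagation rule are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding size; a real setup would use the LM's hidden size

# Heterogeneous vertex set: documents (carry text) and auxiliary vertices
# (e.g. aspect terms) that have no text of their own.
doc_vertices = ["d1", "d2", "d3"]
aux_vertices = ["aspect_food", "aspect_service"]

# Document embeddings would normally come from a language model
# (e.g. sentence embeddings); random vectors stand in for them here.
embeddings = {v: rng.normal(size=DIM) for v in doc_vertices}
embeddings.update({v: np.zeros(DIM) for v in aux_vertices})

# Weighted edges between documents and auxiliary vertices (weights made up).
edges = {
    ("d1", "aspect_food"): 1.0,
    ("d2", "aspect_food"): 0.5,
    ("d2", "aspect_service"): 1.0,
    ("d3", "aspect_service"): 0.8,
}

# Known task classes for the documents; used to bias propagation so that
# vertices tied to the same class end up closer (the class-refinement idea).
doc_class = {"d1": "positive", "d2": "positive", "d3": "negative"}

def propagate(embeddings, edges, doc_class, iters=10, class_boost=2.0):
    """Set each auxiliary vertex to a weighted average of its neighbours,
    up-weighting neighbours from the vertex's dominant class."""
    for _ in range(iters):
        for aux in aux_vertices:
            nbrs = [(d, w) for (d, a), w in edges.items() if a == aux]
            if not nbrs:
                continue
            # Dominant class among neighbours, by total edge weight.
            class_w = {}
            for d, w in nbrs:
                class_w[doc_class[d]] = class_w.get(doc_class[d], 0.0) + w
            dominant = max(class_w, key=class_w.get)
            # Weighted average, boosting same-class neighbours.
            num, den = np.zeros(DIM), 0.0
            for d, w in nbrs:
                w_eff = w * (class_boost if doc_class[d] == dominant else 1.0)
                num += w_eff * embeddings[d]
                den += w_eff
            embeddings[aux] = num / den
    return embeddings

embeddings = propagate(embeddings, edges, doc_class)
for v in aux_vertices:
    print(v, np.round(embeddings[v], 3))

The design choice illustrated here is the one the abstract emphasizes: propagation is driven not only by edge weights but also by how relevant each neighbour is to a task class, so that auxiliary vertices drift toward the region of the embedding space occupied by their dominant class while still reflecting the graph's topology.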