Word co-occurrence network analysis using Word Embedding

Bibliographic details
Year of defense: 2024
Main author: Quispe, Laura Vanessa Cruz
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
Abstract: Recent studies demonstrate that human language can be effectively modeled as a complex network, commonly referred to as a word co-occurrence network. These networks exhibit characteristics typical of scale-free and small-world networks, aligning them with fundamental concepts in network theory. Word co-occurrence networks have shown notable success in text classification, primarily because they capture the structural and syntactic properties of a text without relying on parsers that require deeper linguistic knowledge of the language. However, the increasing use of word embeddings across various applications highlights the importance of integrating contextual and semantic information, which co-occurrence networks, in their traditional form, may lack. In this research, we propose extending the modeling of word co-occurrence networks by incorporating word embedding data to generate virtual edges, thereby unifying syntactic, semantic, and contextual elements within the same network. This approach aims to improve several aspects of text classification, particularly quality, robustness, and adaptability to short texts, which often present unique challenges. Given the generalizability of the proposed model and the flexible nature of embeddings, we believe that these networks can further our understanding of how word embeddings operate within complex network structures. Our experiments show that virtual edges generated from embeddings such as GloVe, Word2Vec, and FastText enhance the discriminative power of the network, significantly improving text classification performance. Additionally, we found that the best results are achieved when stop-words are retained and a simple global thresholding strategy is applied to establish virtual edges.
Moreover, incorporating word embeddings in these networks preserves, and even enhances, their high level of informativeness, allowing the network to better distinguish human texts from nonsensical texts in both short and long texts. Finally, combining word embeddings with stop-word filtering gives the network semantic richness, enabling it to capture the semantic and contextual information of texts. In contrast, using word embeddings without stop-word filtering preserves the network's ability to capture the underlying syntactic structure, making it possible to identify the linguistic properties of different languages. This approach adds robustness to word co-occurrence networks: their original syntactic capabilities are preserved and are not compromised by the addition of virtual edges.
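
The idea of virtual edges described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that co-occurrence edges link adjacent words, that similarity between embeddings is measured with cosine similarity, and that a single global threshold decides which unlinked word pairs receive a virtual edge. The function names, the threshold value, and the two-dimensional toy vectors are hypothetical; real vectors would come from a model such as GloVe, Word2Vec, or FastText.

```python
import math
from itertools import combinations


def cooccurrence_edges(tokens, window=1):
    """Link each token to the tokens that follow it within `window` positions."""
    edges = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                edges.add(frozenset((word, other)))
    return edges


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def virtual_edges(words, embeddings, existing, threshold):
    """Return edges for unlinked pairs whose similarity reaches the global threshold."""
    added = set()
    for a, b in combinations(sorted(words), 2):
        pair = frozenset((a, b))
        if pair in existing or a not in embeddings or b not in embeddings:
            continue
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            added.add(pair)
    return added


# Hypothetical toy vectors, made up for illustration only.
toy_embeddings = {
    "the": (0.1, 0.9),
    "cat": (0.9, 0.2),
    "dog": (0.8, 0.3),
    "chased": (0.4, 0.6),
}

# Stop-words such as "the" are kept, as the abstract reports for the best results.
tokens = "the cat chased the dog".split()
real = cooccurrence_edges(tokens)
virtual = virtual_edges(set(tokens), toy_embeddings, real, threshold=0.95)
network = real | virtual
```

In this toy example, "cat" and "dog" never appear next to each other, yet their vectors are similar enough to receive a virtual edge, so the resulting network combines the syntactic links of the text with a semantic link derived from the embeddings.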