Word co-occurrence network analysis using Word Embedding

Bibliographic Details
Main Author: Quispe, Laura Vanessa Cruz
Publication Date: 2024
Format: Doctoral thesis
Language: eng
Source: Biblioteca Digital de Teses e Dissertações da USP
Download full: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
Summary: Recent studies demonstrate that human language can be effectively modeled as a complex network, commonly referred to as a word co-occurrence network. These networks exhibit characteristics typical of scale-free and small-world networks, aligning them with fundamental concepts in network theory. Word co-occurrence networks have shown notable success in text classification, primarily because they capture the structural and syntactic properties of a text without relying on parsers that require deeper linguistic knowledge of the language. However, the increasing use of word embeddings across various applications highlights the importance of integrating contextual and semantic information, which co-occurrence networks in their traditional form may lack. In this research, we propose to extend the modeling of word co-occurrence networks by incorporating word embedding data to generate virtual edges, thereby unifying syntactic, semantic, and contextual elements within the same network. This approach aims to improve several aspects of text classification, particularly quality, robustness, and adaptability to short texts, which often present unique challenges. Given the generality of the proposed model and the flexible nature of embeddings, we believe that these networks can further our understanding of how word embeddings operate within complex network structures. The results of our experiments reveal that virtual edges generated from embeddings such as GloVe, Word2Vec, and FastText enhance the discriminative power of the network, significantly improving text classification performance. Additionally, we found that the best results are achieved when stop-words are retained and a simple global thresholding strategy is applied to establish virtual edges. Moreover, incorporating word embeddings into these networks maintains, and even improves, a high level of informativeness, allowing the network to better distinguish human-written texts from nonsensical texts, for both short and long texts. Finally, combining word embeddings with stop-word filtering gives the network semantic richness, enabling it to capture the semantic and contextual information of texts. In contrast, using word embeddings without stop-word filtering retains the network's ability to capture the underlying syntactic structure, making it possible to identify the linguistic properties of different languages. This approach adds robustness to word co-occurrence networks, preserving their original syntactic capabilities without their being compromised by the addition of virtual edges.
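The idea of virtual edges described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: the adjacency window of one word, the cosine-similarity measure, the threshold value of 0.95, the two-dimensional toy vectors, and the function names cooccurrence_edges and virtual_edges are all assumptions chosen for illustration. In practice the vectors would come from a pretrained model such as GloVe, Word2Vec, or FastText.

```python
import math
from itertools import combinations


def cooccurrence_edges(tokens):
    """Link every pair of adjacent words (assumed window of one word)."""
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            edges.add(frozenset((a, b)))
    return edges


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def virtual_edges(tokens, embeddings, threshold, existing):
    """Add an edge between non-adjacent words whose embedding similarity
    reaches one global threshold (assumed form of global thresholding)."""
    vocab = [w for w in sorted(set(tokens)) if w in embeddings]
    edges = set()
    for a, b in combinations(vocab, 2):
        pair = frozenset((a, b))
        if pair not in existing and cosine(embeddings[a], embeddings[b]) >= threshold:
            edges.add(pair)
    return edges


# Toy text with stop-words retained; the vectors below are made up.
tokens = "the cat sat on the mat while the dog slept".split()
toy_embeddings = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "mat": [0.0, 1.0],
}

cooc = cooccurrence_edges(tokens)
virtual = virtual_edges(tokens, toy_embeddings, threshold=0.95, existing=cooc)

print(sorted(sorted(edge) for edge in virtual))
```

In this toy case "cat" and "dog" never appear next to each other, so the co-occurrence network alone does not connect them; the similarity threshold adds that link as a virtual edge while leaving the original co-occurrence edges untouched.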
id USP_cd7c8f70a2e994f1447eaccce2efc1e2
oai_identifier_str oai:teses.usp.br:tde-16012025-155108
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Word co-occurrence network analysis using Word Embedding
Análise de redes de coocorrência de palavras usando Word Embeddings
title Word co-occurrence network analysis using Word Embedding
author Quispe, Laura Vanessa Cruz
author_facet Quispe, Laura Vanessa Cruz
author_role author
dc.contributor.none.fl_str_mv Amancio, Diego Raphael
dc.contributor.author.fl_str_mv Quispe, Laura Vanessa Cruz
dc.subject.por.fl_str_mv Análise de redes
Classificação de texto
Complex networks
Network analysis
Redes complexas
Redes de coocorrência de palavras
Text classification
Word co-occurrence networks
Word embeddings
publishDate 2024
dc.date.none.fl_str_mv 2024-11-13
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br || atendimento@aguia.usp.br
_version_ 1831147751824424960