Word co-occurrence network analysis using Word Embedding

Quispe, Laura Vanessa Cruz

Bibliographic Details
Main Author:	Quispe, Laura Vanessa Cruz
Publication Date:	2024
Format:	Doctoral thesis
Language:	eng
Source:	Biblioteca Digital de Teses e Dissertações da USP
Download full:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
Summary:	Recent studies show that human language can be modeled effectively as a complex network, commonly called a word co-occurrence network. These networks exhibit the scale-free and small-world properties that are central to network theory. Word co-occurrence networks have been notably successful in text classification, mainly because they capture the structural and syntactic properties of a text without relying on parsers that require deeper linguistic knowledge of the language. However, the growing use of word embeddings across applications highlights the importance of contextual and semantic information, which co-occurrence networks in their traditional form may lack. In this research, we propose extending word co-occurrence networks by using word embedding data to generate virtual edges, thereby unifying syntactic, semantic, and contextual elements within the same network. The approach aims to improve text classification in terms of quality, robustness, and adaptability to short texts, which often present unique challenges. Because the proposed model is general and embeddings are flexible, we believe these networks can further our understanding of how word embeddings operate within complex network structures. Our experiments show that virtual edges generated from embeddings such as GloVe, Word2Vec, and FastText enhance the discriminative power of the network and significantly improve text classification performance. We also found that the best results are achieved when stop-words are retained and a simple global thresholding strategy is used to establish virtual edges. Moreover, incorporating word embeddings preserves a high level of informativeness, allowing the network to better distinguish human texts from nonsensical texts, for both short and long texts. Finally, combining word embeddings with stop-word filtering gives the network semantic richness, enabling it to capture the semantic and contextual information of texts. In contrast, using word embeddings without stop-word filtering retains the network's ability to capture the underlying syntactic structure, making it possible to identify the linguistic properties of different languages. This approach adds robustness to word co-occurrence networks: the addition of virtual edges does not compromise their original syntactic capabilities.
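
The idea of virtual edges described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation: the summary does not state the similarity measure, the co-occurrence window, or the threshold value, so cosine similarity, a window of one word, and a threshold of 0.95 are assumptions made here. The function names and the two-dimensional toy vectors are likewise hypothetical and stand in for real GloVe, Word2Vec, or FastText embeddings.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_edges(tokens, window=1):
    """Link each word to the words that follow it within `window` positions."""
    edges = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                edges.add(frozenset((word, other)))
    return edges


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def virtual_edges(words, embeddings, existing, threshold):
    """Return new edges between words that are not already linked and whose
    embedding similarity meets one global threshold."""
    added = set()
    for a, b in combinations(sorted(words), 2):
        pair = frozenset((a, b))
        if pair in existing or a not in embeddings or b not in embeddings:
            continue
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            added.add(pair)
    return added


# Toy 2-d vectors invented for illustration only; not real embedding values.
toy_embeddings = {
    "the": (0.1, 0.9),
    "cat": (0.9, 0.2),
    "dog": (0.8, 0.3),
    "chased": (0.4, 0.6),
}

# Stop-words such as "the" are kept, as the summary reports works best.
tokens = "the cat chased the dog".split()
real = cooccurrence_edges(tokens)
virtual = virtual_edges(set(tokens), toy_embeddings, real, threshold=0.95)
```

With these toy vectors, the only virtual edge added links "cat" and "dog", two words that never appear next to each other in the sentence but have similar vectors. The combined edge set `real | virtual` would then be the network passed on to feature extraction and classification.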

Item metadata

id	USP_cd7c8f70a2e994f1447eaccce2efc1e2
oai_identifier_str	oai:teses.usp.br:tde-16012025-155108
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str	2721
dc.title.none.fl_str_mv	Word co-occurrence network analysis using Word Embedding Análise de redes de coocorrência de palavras usando Word Embeddings
title	Word co-occurrence network analysis using Word Embedding
title_short	Word co-occurrence network analysis using Word Embedding
title_full	Word co-occurrence network analysis using Word Embedding
title_fullStr	Word co-occurrence network analysis using Word Embedding
title_full_unstemmed	Word co-occurrence network analysis using Word Embedding
title_sort	Word co-occurrence network analysis using Word Embedding
author	Quispe, Laura Vanessa Cruz
author_facet	Quispe, Laura Vanessa Cruz
author_role	author
dc.contributor.none.fl_str_mv	Amancio, Diego Raphael
dc.contributor.author.fl_str_mv	Quispe, Laura Vanessa Cruz
dc.subject.por.fl_str_mv	Análise de redes; Classificação de texto; Complex networks; Network analysis; Redes complexas; Redes de coocorrência de palavras; Text classification; Word co-occurrence networks; Word embeddings
topic	Análise de redes; Classificação de texto; Complex networks; Network analysis; Redes complexas; Redes de coocorrência de palavras; Text classification; Word co-occurrence networks; Word embeddings
publishDate	2024
dc.date.none.fl_str_mv	2024-11-13
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Release the content for public access. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Release the content for public access.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br; atendimento@aguia.usp.br
_version_	1831147751824424960

Similar Items