Word co-occurrence network analysis using Word Embedding
Main Author: | Quispe, Laura Vanessa Cruz |
---|---|
Publication Date: | 2024 |
Format: | Doctoral thesis |
Language: | eng |
Source: | Biblioteca Digital de Teses e Dissertações da USP |
Download full: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/ |
Summary: | Recent studies demonstrate that human language can be effectively modeled as a complex network, commonly referred to as a word co-occurrence network. These networks exhibit characteristics typical of scale-free and small-world networks, aligning them with fundamental concepts in network theory. Word co-occurrence networks have been used with notable success in text classification, primarily because they capture the structural and syntactic properties of a text without relying on parsers that require deeper linguistic knowledge of the language. However, the increasing use of word embeddings across various applications highlights the importance of integrating contextual and semantic information, which co-occurrence networks in their traditional form may lack. In this research, we propose to extend the modeling of word co-occurrence networks by using word embeddings to generate virtual edges, thereby unifying syntactic, semantic, and contextual elements within the same network. This approach aims to improve text classification in terms of quality, robustness, and adaptability to short texts, which often present unique challenges. Because the proposed model is general and embeddings are flexible, we believe that these networks can further our understanding of how word embeddings operate within complex network structures. Our experiments show that virtual edges generated from embeddings such as GloVe, Word2Vec, and FastText enhance the discriminative power of the network, significantly improving text classification performance. Additionally, we found that the best results are achieved when stop-words are retained and a simple global thresholding strategy is applied to establish virtual edges. 
Moreover, incorporating word embeddings enhances these networks while maintaining a high level of informativeness, allowing the network to better distinguish human texts from nonsensical texts, for both short and long texts. Finally, combining word embeddings with stop-word filtering gives the network semantic richness, enabling it to capture the semantic and contextual information of texts. In contrast, using word embeddings without stop-word filtering retains the network's ability to capture the underlying syntactic structure, making it possible to identify the linguistic properties of different languages. This approach adds robustness to word co-occurrence networks: their original syntactic capabilities are preserved and are not compromised by the addition of virtual edges. |
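
The virtual-edge idea described in the summary can be illustrated with a minimal sketch. It is not the thesis implementation: the adjacency window, the use of cosine similarity, the toy two-dimensional vectors, the threshold value, and the function names `cooccurrence_edges` and `virtual_edges` are all assumptions chosen for illustration. The record only states that virtual edges come from word embeddings and that a simple global threshold is applied.

```python
import math
from itertools import combinations


def cooccurrence_edges(tokens, window=1):
    """Link each token to the tokens that follow it within `window` positions."""
    edges = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                edges.add(frozenset((word, other)))
    return edges


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def virtual_edges(words, embeddings, existing, threshold):
    """Assumed rule: link every not-yet-connected pair whose similarity
    reaches a single global threshold."""
    virtual = set()
    for a, b in combinations(sorted(words), 2):
        pair = frozenset((a, b))
        if pair in existing or a not in embeddings or b not in embeddings:
            continue
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            virtual.add(pair)
    return virtual


# Toy vectors invented for this example; real GloVe, Word2Vec, or FastText
# vectors would be loaded from a pretrained model instead.
TOY_EMBEDDINGS = {
    "the": (0.1, 0.9),
    "cat": (0.9, 0.2),
    "dog": (0.8, 0.3),
    "chased": (0.5, 0.5),
}

tokens = "the cat chased the dog".split()  # stop-words are kept
real = cooccurrence_edges(tokens)
virtual = virtual_edges(set(tokens), TOY_EMBEDDINGS, real, threshold=0.95)

print(sorted(tuple(sorted(e)) for e in real))
print(sorted(tuple(sorted(e)) for e in virtual))
```

In this toy run, "cat" and "dog" never appear next to each other, yet their vectors are similar enough to be joined by a virtual edge, while "chased" and "dog" fall below the threshold and stay unconnected.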
id |
USP_cd7c8f70a2e994f1447eaccce2efc1e2 |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-16012025-155108 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
dc.title.none.fl_str_mv |
Word co-occurrence network analysis using Word Embedding; Análise de redes de coocorrência de palavras usando Word Embeddings |
title |
Word co-occurrence network analysis using Word Embedding |
title_short |
Word co-occurrence network analysis using Word Embedding |
title_full |
Word co-occurrence network analysis using Word Embedding |
title_fullStr |
Word co-occurrence network analysis using Word Embedding |
title_full_unstemmed |
Word co-occurrence network analysis using Word Embedding |
title_sort |
Word co-occurrence network analysis using Word Embedding |
author |
Quispe, Laura Vanessa Cruz |
author_facet |
Quispe, Laura Vanessa Cruz |
author_role |
author |
dc.contributor.none.fl_str_mv |
Amancio, Diego Raphael |
dc.contributor.author.fl_str_mv |
Quispe, Laura Vanessa Cruz |
dc.subject.por.fl_str_mv |
Análise de redes; Classificação de texto; Complex networks; Network analysis; Redes complexas; Redes de coocorrência de palavras; Text classification; Word co-occurrence networks; Word embeddings |
topic |
Análise de redes; Classificação de texto; Complex networks; Network analysis; Redes complexas; Redes de coocorrência de palavras; Text classification; Word co-occurrence networks; Word embeddings |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024-11-13 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/ |
url |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16012025-155108/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
|
dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Release the content for public access. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.coverage.none.fl_str_mv |
|
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
_version_ |
1831147751824424960 |