Employing syntactical dependency and a mesoscopic scale to model books\' narratives through recurrence networks

Detalhes bibliográficos
Ano de defesa: 2023
Autor(a) principal: Souza, Bárbara Cortes e
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: eng
Instituição de defesa: Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
NLP
PLN
Link de acesso: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-02042024-162451/
Resumo: In recent years, science has been deeply impacted by the growing amount of data available for research. Specifically, the continuous increase of textual data availability has been essential for the development and proposal of new methodologies to tackle text processing problems. There are several new approaches that focus on different components of linguistics, such as lexicon, syntax and semantics. Natural Language Processing, for example, is a multidisciplinary field that concerns the interaction between natural languages and computers. Some examples of problems in this field are: topic detection, text classification, stylometry, automatic summarization, and others. Since natural languages are actually complex systems, it is also appropriate to represent them as complex networks, to help address these various challenges. One well known example of text modelling method is the word adjacency network, that maps each of the words in a text into nodes, and create an edge between any pair of terms that occur adjacent to each other in the text. In this Masters work, however, we focus on a larger, mesoscopic scale with the intent of capturing the overall context of a narrative. In this methodology, a single node refers to a sequence of paragraphs in the text, and the edges are created between the most similar ones. Additionally, we apply syntactical dependency knowledge to increase informativeness and, therefore, obtain a better performance on grasping the contextual semantics of the text. Finally, one can extract significant network measures in order to characterize it, including accessibility, symmetry and the new proposed recurrence signature, as a manner of capturing topological properties that reflect the narratives context. Several method validations have been performed, including a comparison with other trivial measures, two experiments to discriminate real from meaningless texts and between literary genres and, finally, a comparison of the current method to other orthodox approaches, namely co-occurrence networks and doc2vec.