Interactive keyterm-based document clustering and visualization via neural language models

Cabral, Eric Macedo


Bibliographic details
Year of defense:	2020
Main author:	Cabral, Eric Macedo
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Dissertation
Access type:	Open access
Language:	eng
Defense institution:	Biblioteca Digitais de Teses e Dissertações da USP
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords:	Agrupamento interativo de documentos; Interactive document clustering; Modelos neurais de linguagem; Neural language models; Visual analytics; Visualização analítica
Access link:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
Abstract:	Interactive data clustering techniques put the user in the clustering algorithm loop, allowing not only better clustering quality, but also supporting the knowledge discovery task in large textual corpora. The keyterm-guided approach is arguably intuitive, allowing the user to interact with representative words instead of interacting with a large volume of full-length documents or complex topic models. More than making the clustering algorithm adjustable with little user effort, the visual interactive clustering approach allows the user to focus on exploring the corpus as an incremental task. After each interaction, the user can obtain new information about the corpus and express it as feedback to the clustering algorithm. The visual analytics system Vis-Kt presents itself as an interactive keyterm-based document clustering system, embedded with techniques that outperform state-of-the-art ones, such as Latent Dirichlet Allocation and Non-negative Matrix Factorization. With a user-guided approach, Vis-Kt allows the user to draw her insights into the corpus by describing document clusters with a small set of significant terms. However, Vis-Kt and its underlying clustering algorithms depend on the Bag-of-Words model, which has several limitations concerning the scalability of information extraction, the incrementality of the process, and the semantic representation of the data. In order to overcome the limitations inherent to the Bag-of-Words model, we propose an update of the keyterm-based representation model to a machine learning approach based on neural language models. Such a model can extract semantic information and relationships from the words that are included in the corpus. This project's main contribution is a novel interactive document clustering algorithm guided by keyterms and based on neural language models. This approach shows a significant improvement compared to the baseline algorithms, considered state-of-the-art.
The proposed clustering algorithm allows Vis-Kt to work incrementally, without the need to repeat the entire learning and clustering processes from the beginning. This makes the system suitable for analyzing text streams. In order to contribute to the task of knowledge discovery and to support its incremental aspect, a visual component based on the Sankey diagram was developed to depict the cluster membership changes throughout the clustering loop after each interaction with the corpus. A set of quantitative experiments on publicly available text datasets was performed to evaluate the obtained clustering results. The results reported in this work show that, in most of the tested cases, the proposed algorithm presents a significant improvement in clustering quality measures in comparison with the baseline algorithms. In all cases, the proposed algorithm showed a gain in processing time, especially on the largest datasets. We also report two usage scenarios to qualitatively evaluate the proposed visual component.

