Bibliographic details
Year of defense: 2020
Main author: Cabral, Eric Macedo
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: eng
Defending institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20082020-093906/
Abstract:
Interactive data clustering techniques put the user in the clustering algorithm loop, which not only improves clustering quality but also supports the knowledge discovery task in large textual corpora. The keyterm-guided approach is arguably intuitive, allowing the user to interact with representative words instead of a large volume of full-length documents or complex topic models. Beyond making the clustering algorithm adjustable with little user effort, the visual interactive clustering approach allows the user to focus on exploring the corpus as an incremental task. After each interaction, the user can obtain new information about the corpus and express it as feedback to the clustering algorithm. The visual analytics system Vis-Kt is an interactive keyterm-based document clustering system, embedded with techniques that outperform state-of-the-art ones such as Latent Dirichlet Allocation and Non-negative Matrix Factorization. With a user-guided approach, Vis-Kt allows the user to draw her insights into the corpus by describing document clusters with a small set of significant terms. However, Vis-Kt and its underlying clustering algorithms depend on the Bag-of-Words model, which has several limitations concerning the scalability of information extraction, the incrementality of the process, and the semantic representation of the data. To overcome the limitations inherent to the Bag-of-Words model, we propose updating the keyterm-based representation model to a machine learning approach based on neural language models. Such a model can extract semantic information and relationships from the words in the corpus. This project's main contribution is a novel interactive document clustering algorithm guided by keyterms and based on neural language models. This approach shows a significant improvement over the baseline algorithms, which are considered state-of-the-art.
The proposed clustering algorithm allows Vis-Kt to work incrementally, without the need to repeat the entire learning and clustering processes from the beginning. This makes the system suitable for analyzing text streams. To contribute to the task of knowledge discovery and to support its incremental aspect, a visual component based on the Sankey diagram was developed to depict the changes in cluster membership throughout the clustering loop after each interaction with the corpus. A set of quantitative experiments on publicly available text datasets was performed to evaluate the clustering results. The results reported in this work show that, in most of the cases tested, the proposed algorithm presents a significant improvement in clustering quality measures in comparison with the baseline algorithms. In all cases, the proposed algorithm showed a gain in processing time, especially on the largest datasets. We also report two usage scenarios to qualitatively evaluate the proposed visual component.