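Estruturas de indexação métricas em operações distribuídas de agrupamento por similaridade em dados de alta dimensionalidade [Metric indexing structures in distributed similarity grouping operations on high-dimensional data]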

Bibliographic details
Year of defense: 2024
Main author: Silva, Ana Paula Cassiano Alves da
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Uberlândia
Brazil
Programa de Pós-graduação em Ciência da Computação
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://repositorio.ufu.br/handle/123456789/44498
http://doi.org/10.14393/ufu.di.2024.619
Abstract: The prevalence of Big Data presents significant challenges for extracting knowledge from large volumes of complex data. Cluster analysis, crucial for identifying patterns and similarities, relies on techniques such as similarity search and similarity join, which are essential for queries based on intrinsic data relationships. However, the high dimensionality and massive volume of data make these operations computationally expensive. Distributed systems such as Apache Hadoop and Spark have been adopted to improve the performance of these analyses. Partitioning and pruning techniques, along with dissimilarity-based methods such as distance functions, are key to optimizing data manipulation. Recently, the SGB operator and its evolution, the DSG operator, have brought significant advances by allowing similarity-based clusters to be computed on distributed platforms. However, the growing demand for faster and more accurate analysis requires continuous improvement. In this context, we propose the DSG-VPTree operator, which integrates the VP-Tree data structure with the DSG operator, aiming for more efficient and balanced partitioning. This work details the implementation of DSG-VPTree in the Spark environment and evaluates its execution time against the DSG operator, demonstrating that it overcomes scalability limitations of previous solutions. The proposal offers an efficient solution for similarity search operations on large volumes of data, contributing to the evolution of Big Data analysis techniques. The experiments show that DSG-VPTree outperforms DSG by 40%, with shorter execution times and better scalability on high-dimensional data.
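Illustrative note: the abstract credits the VP-Tree with producing balanced partitions that can be pruned using a distance function. The sketch below is a generic, single-machine vantage-point tree in Python, included only to make that idea concrete. It is not the DSG-VPTree operator and does not use Spark; the names `VPNode`, `build_vp_tree`, and `range_search`, the random choice of vantage point, and the use of Euclidean distance are all assumptions made for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, one possible metric (assumed for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class VPNode:
    vantage: Point                      # the vantage point stored at this node
    radius: float = 0.0                 # median distance used as the split threshold
    inside: Optional["VPNode"] = None   # points with distance <= radius
    outside: Optional["VPNode"] = None  # points with distance >= radius


def build_vp_tree(points: List[Point],
                  dist: Callable[[Point, Point], float] = euclidean,
                  rng: Optional[random.Random] = None) -> Optional[VPNode]:
    """Recursively split points at the median distance to a vantage point."""
    if not points:
        return None
    rng = rng or random.Random(0)
    pts = list(points)
    vantage = pts.pop(rng.randrange(len(pts)))  # hypothetical choice: random vantage
    if not pts:
        return VPNode(vantage)
    ranked = sorted(((dist(vantage, p), p) for p in pts), key=lambda t: t[0])
    mid = len(ranked) // 2
    node = VPNode(vantage, radius=ranked[mid][0])
    # Splitting by rank keeps the two halves nearly equal in size (balanced).
    node.inside = build_vp_tree([p for _, p in ranked[:mid]], dist, rng)
    node.outside = build_vp_tree([p for _, p in ranked[mid:]], dist, rng)
    return node


def range_search(node: Optional[VPNode], query: Point, eps: float,
                 dist: Callable[[Point, Point], float] = euclidean,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every stored point within eps of query, pruning subtrees
    that the triangle inequality proves cannot contain a match."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)
    if d <= eps:
        out.append(node.vantage)
    if d - eps <= node.radius:   # query ball may reach the inner region
        range_search(node.inside, query, eps, dist, out)
    if d + eps >= node.radius:   # query ball may reach the outer region
        range_search(node.outside, query, eps, dist, out)
    return out
```

Because each split sends about half of the remaining points to each side, the tree stays balanced regardless of how the data are distributed, which is the property the abstract associates with more even partitioning. How the actual operator maps such partitions onto Spark workers is not described in this record.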