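Metric indexing structures in distributed similarity grouping operations on high-dimensional data (original title: Estruturas de indexação métricas em operações distribuídas de agrupamento por similaridade em dados de alta dimensionalidade)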
Year of defense: | 2024 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Master's dissertation |
Access type: | Open access |
Language: | por |
Defending institution: |
Universidade Federal de Uberlândia
Brazil, Programa de Pós-graduação em Ciência da Computação |
Graduate program: |
Not provided by the institution
|
Department: |
Not provided by the institution
|
Country: |
Not provided by the institution
|
Keywords in Portuguese: | |
Access link: | https://repositorio.ufu.br/handle/123456789/44498 http://doi.org/10.14393/ufu.di.2024.619 |
Abstract: | The prevalence of Big Data poses significant challenges for extracting knowledge from large volumes of complex data. Cluster analysis, which is crucial for identifying patterns and similarities, relies on techniques such as similarity search and similarity join, which are essential for queries based on intrinsic relationships in the data. However, high dimensionality and massive data volumes make these operations computationally expensive. Distributed systems such as Apache Hadoop and Spark have been adopted to improve the performance of these analyses. Partitioning and pruning techniques, together with dissimilarity-based methods such as distance functions, are key to optimizing data manipulation. Recently, the SGB operator and its evolution, the DSG operator, have brought significant advances by allowing similarity-based clusters to be computed on distributed platforms. Nevertheless, the growing demand for faster and more accurate analysis calls for continuous improvement. In this context, we propose the DSG-VPTree operator, which integrates the VP-Tree data structure with the DSG operator, aiming for more efficient and balanced partitioning. This work details the implementation of DSG-VPTree in the Spark environment and evaluates its execution time against the DSG operator, showing that it overcomes the scalability limitations of previous solutions. The proposal offers an efficient solution for similarity search operations on large volumes of data, contributing to the evolution of Big Data analysis techniques. The experiments show that DSG-VPTree outperforms DSG by 40%, with shorter execution times and better scalability on high-dimensional data. |
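
The abstract names the VP-Tree (vantage-point tree) as the structure that gives DSG-VPTree its balanced partitioning and pruning. The sketch below is a minimal single-machine illustration of that general idea, not the dissertation's Spark implementation: it splits points at the median distance to a randomly chosen vantage point, which keeps the two halves balanced, and uses the triangle inequality to skip subtrees during a range query. All names (`VPNode`, `build_vp_tree`, `range_search`, `euclidean`) and the choice of Euclidean distance are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric obeying the triangle inequality works."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class VPNode:
    vantage: Point                      # pivot stored at this node
    radius: float = 0.0                 # median distance from the pivot
    inside: Optional["VPNode"] = None   # points with distance <= radius
    outside: Optional["VPNode"] = None  # points with distance >= radius


def build_vp_tree(points: Sequence[Point],
                  dist: Callable[[Point, Point], float] = euclidean,
                  rng: Optional[random.Random] = None) -> Optional[VPNode]:
    """Recursively split points at the median distance to a random pivot."""
    if not points:
        return None
    rng = rng or random.Random(0)
    rest = list(points)
    vantage = rest.pop(rng.randrange(len(rest)))
    if not rest:
        return VPNode(vantage)
    # Sorting by distance and cutting at the middle keeps both halves balanced.
    ranked = sorted(rest, key=lambda p: dist(vantage, p))
    mid = len(ranked) // 2
    node = VPNode(vantage, radius=dist(vantage, ranked[mid]))
    node.inside = build_vp_tree(ranked[:mid], dist, rng)
    node.outside = build_vp_tree(ranked[mid:], dist, rng)
    return node


def range_search(node: Optional[VPNode], query: Point, r: float,
                 dist: Callable[[Point, Point], float] = euclidean,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every point within distance r of query, pruning subtrees."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)
    if d <= r:
        out.append(node.vantage)
    # Triangle inequality: a match p satisfies d - r <= dist(vantage, p) <= d + r.
    if d - r <= node.radius:
        range_search(node.inside, query, r, dist, out)
    if d + r >= node.radius:
        range_search(node.outside, query, r, dist, out)
    return out
```

Calling `build_vp_tree(points)` once and then `range_search(tree, query, r)` returns the same set of points as a linear scan with the same distance function, while visiting only the subtrees whose distance range can overlap the query ball. In a distributed setting, the subtrees at a fixed depth could plausibly serve as partitions, though how the dissertation maps the tree onto Spark partitions is not stated in the abstract.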