GPU-Accelerated Approximate K-Nearest Neighbors over Unbounded Datasets

Bibliographic details
Main author: Lopes, Gonçalo Pedro Santos
Publication date: 2021
Document type: Dissertation
Language: eng
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10362/166583
Abstract: With the growing demand for computational power in this era of information, algorithms like K-Nearest Neighbours (KNN), a lazy learning algorithm, have been developed for applications such as similarity search, clustering and classification, which draw inferences over large datasets. Since most KNN algorithms have their time complexity lower-bounded by the size of the input data, processing becomes too slow for almost all applications when dealing with unbounded datasets, such as those produced by an input stream of data. In the current state of the art, we have access to many Approximate Nearest Neighbours (ANN) algorithms which, instead of calculating the exact result like standard KNN implementations, approximate it. This allows systems to trade some output accuracy for major performance gains and, when coupled with techniques like Locality Sensitive Hashing (LSH), they also try to optimize memory usage in Graphics Processing Unit (GPU)-based implementations. However, when dealing with stream-based inputs and growing datasets, it is easy to run into heap memory problems and inefficiencies with the currently available algorithms. In this work we propose SPLASH, a system that implements an ANN algorithm that supports not only stream-based inputs but also datasets that may not fit in the GPU's memory. With the added dataset-building functionality, we developed an algorithm capable of scaling by growing its sample data while maintaining a constant query processing time. By meeting the stated requirements, we developed an implementation that advances the state of the art. Test results against a baseline ANN implementation showed that SPLASH is more efficient in long-term executions, with good accuracy and precision in the outputs. They also showed that our system is stable when stress-tested in a stream-based context, maintaining a constant query processing time while effectively growing the dataset.
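The accuracy-for-speed trade-off that the abstract attributes to ANN with LSH can be illustrated with a minimal sketch. This is not SPLASH and not taken from the dissertation: the class `HyperplaneLSH`, its parameters (`dim`, `num_bits`, `seed`) and the choice of random-hyperplane hashing are assumptions made purely for illustration, and the code runs on the CPU with the Python standard library only.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Toy random-hyperplane LSH index (hypothetical; not SPLASH itself)."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        # One random hyperplane (normal vector) per signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(list)  # signature -> list of (id, vector)

    def _signature(self, vec):
        # Bit i records which side of hyperplane i the vector falls on.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0.0
            for plane in self.planes
        )

    def insert(self, item_id, vec):
        """Add one point; points may arrive one at a time, as from a stream."""
        self.buckets[self._signature(vec)].append((item_id, list(vec)))

    def query(self, vec, k):
        """Return up to k ids from the query's own bucket, nearest first."""
        candidates = self.buckets.get(self._signature(vec), [])

        def dist2(entry):
            # Squared Euclidean distance between a stored vector and the query.
            return sum((a - b) ** 2 for a, b in zip(entry[1], vec))

        return [item_id for item_id, _ in sorted(candidates, key=dist2)[:k]]


index = HyperplaneLSH(dim=3, num_bits=4, seed=1)
index.insert("a", [1.0, 0.0, 0.0])
index.insert("b", [1.0, 0.1, 0.0])
index.insert("c", [-1.0, 0.0, 0.0])
result = index.query([1.0, 0.0, 0.0], k=2)
```

A query inspects only the bucket that shares its signature, so it avoids scanning the whole dataset but can miss true neighbours that hashed elsewhere; that is the approximation. In this toy version the buckets themselves still grow with the data, so it makes no claim about the constant query time reported for SPLASH.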
id RCAP_b9fe1fbb15009ce39684b15d7dfb1a1e
oai_identifier_str oai:run.unl.pt:10362/166583
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling With the growing need for computational power in the current information era, lazy learning algorithms such as K-Nearest Neighbours (KNN) have been developed for applications in similarity search, clustering and classification through inferences over large datasets. Since most KNN implementations have a lower bound on time complexity proportional to the size of the input dataset, when unlimited quantities of data received over a continuous stream must be processed, execution becomes too slow for many of the algorithm's purposes. In the current state of the art, we have access to numerous Approximate Nearest Neighbours (ANN) algorithms, which approximate the results of the similarity search. The approximation sacrifices part of the accuracy of the results but reduces execution time. By combining ANN with techniques such as Locality Sensitive Hashing (LSH), it is possible to optimize the program's memory usage. However, with the use of large datasets, memory-space errors and inefficiencies inevitably arise. In this thesis we propose SPLASH, a system that implements an ANN algorithm on the Graphics Processing Unit (GPU) that supports not only continuous data inputs but also datasets larger than the resources available on the device. With the addition of dynamic dataset construction, we developed an algorithm able to scale as the sample data grows indefinitely, while keeping query execution time constant. Having met the proposed requirements, we conclude that our work represents an advance in the state of the art. The results of comparing SPLASH with a baseline implementation of an ANN algorithm show that our system is equally accurate and more efficient over multiple executions. SPLASH also proved stable when run for long periods, keeping query time constant despite the continuous growth of the dataset.
author Lopes, Gonçalo Pedro Santos
author_facet Lopes, Gonçalo Pedro Santos
author_role author
dc.contributor.none.fl_str_mv Paulino, Hervé
RUN
dc.contributor.author.fl_str_mv Lopes, Gonçalo Pedro Santos
dc.subject.por.fl_str_mv KNN
Big Datasets
Data Streams
LSH
ANN
GPU Memory Management
Domain/Scientific Area::Engineering and Technology::Electrical, Electronic and Computer Engineering
publishDate 2021
dc.date.none.fl_str_mv 2021-02
2021-02-01T00:00:00Z
2024-04-24T13:21:02Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/166583
url http://hdl.handle.net/10362/166583
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833597015290806272