GPU-Accelerated Approximate K-Nearest Neighbors over Unbounded Datasets

Lopes, Gonçalo Pedro Santos

Bibliographic details
Main author:	Lopes, Gonçalo Pedro Santos
Publication date:	2021
Document type:	Dissertation
Language:	eng
Source title:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text:	http://hdl.handle.net/10362/166583
Abstract:	With the growing demand for computational power in the information era, algorithms such as K-Nearest Neighbours (KNN), a lazy learning algorithm, have been developed for applications such as similarity search, clustering, and classification, which draw inferences from large datasets. Since the time complexity of most KNN algorithms is bounded below by the size of the input data, processing becomes too slow for almost all applications when the dataset is unbounded, as is the case with an input data stream. In the current state of the art, there are many Approximate Nearest Neighbours (ANN) algorithms that approximate the result instead of computing it exactly, as standard KNN implementations do. This allows systems to trade some output accuracy for major performance gains and, when coupled with techniques like Locality Sensitive Hashing (LSH), these algorithms also try to optimize memory usage in Graphics Processing Unit (GPU)-based implementations. However, when dealing with stream-based inputs and growing datasets, it is easy to run into memory heap problems and inefficiencies with the currently available algorithms. In this work we propose SPLASH, a system that implements an ANN algorithm that supports not only stream-based inputs but also datasets that may not fit in the GPU's memory. With the added dataset-building functionality, we developed an algorithm capable of scaling by growing its sample data while maintaining a constant query processing time as the dataset grows. By meeting the stated requirements, we developed an implementation that advances the state of the art. Test results against a base ANN implementation showed that SPLASH is more efficient in long-term executions, with good accuracy and precision in the outputs. They also showed that our system is stable when stress-tested in a stream-based context, maintaining a constant query processing time while effectively growing the dataset.

Item metadata

id	RCAP_b9fe1fbb15009ce39684b15d7dfb1a1e
oai_identifier_str	oai:run.unl.pt:10362/166583
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv	GPU-Accelerated Approximate K-Nearest Neighbors over Unbounded Datasets
author	Lopes, Gonçalo Pedro Santos
author_facet	Lopes, Gonçalo Pedro Santos
author_role	author
dc.contributor.none.fl_str_mv	Paulino, Hervé RUN
dc.contributor.author.fl_str_mv	Lopes, Gonçalo Pedro Santos
dc.subject.por.fl_str_mv	KNN; Big Datasets; Data Streams; LSH; ANN; GPU Memory Management; Scientific Domain/Area::Engineering and Technology::Electrical Engineering, Electronic Engineering, and Informatics
publishDate	2021
dc.date.none.fl_str_mv	2021-02 2021-02-01T00:00:00Z 2024-04-24T13:21:02Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10362/166583
url	http://hdl.handle.net/10362/166583
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833597015290806272
