GPU-Accelerated Approximate K-Nearest Neighbors over Unbounded Datasets
| Main author: | Lopes, Gonçalo Pedro Santos |
|---|---|
| Publication date: | 2021 |
| Document type: | Dissertation |
| Language: | eng |
| Source title: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Full text: | http://hdl.handle.net/10362/166583 |
Abstract: | With the growing demand for computational power in the information era, algorithms such as K-Nearest Neighbours (KNN), a lazy learning algorithm, have been developed for applications such as similarity search, clustering and classification, which make inferences over large datasets. Since the time complexity of most KNN algorithms is bounded below by the size of the input data, processing becomes too slow for almost all applications when dealing with unbounded datasets, such as those produced by an input data stream. The current state of the art offers many Approximate Nearest Neighbours (ANN) algorithms which, instead of computing the exact result as standard KNN implementations do, approximate it. This allows systems to trade some output accuracy for major performance gains and, when coupled with techniques such as Locality Sensitive Hashing (LSH), they also try to optimize memory usage in Graphics Processing Unit (GPU)-based implementations. However, when dealing with stream-based inputs and growing datasets, it is easy to run into heap memory problems and inefficiencies with the currently available algorithms. In this work we propose SPLASH, a system that implements an ANN algorithm supporting not only stream-based inputs but also datasets that may not fit in the GPU’s memory. With the added dataset-building functionality, we developed an algorithm capable of scaling by growing its sample data while maintaining a constant query processing time. By meeting the stated requirements, we developed an implementation that advances the state of the art. Test results against a baseline ANN implementation showed that SPLASH is more efficient in long-running executions, with good accuracy and precision in its outputs. They also showed that our system is stable when stress-tested in a stream-based context, maintaining a constant query processing time while effectively growing the dataset. |
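The trade-off the abstract describes can be illustrated with a minimal sketch. It is not SPLASH and not the thesis's algorithm: it is a toy, CPU-only, sign-random-projection LSH index written for illustration, and every name in it (`exact_knn`, `HyperplaneLSH`, `n_bits`, `seed`) is hypothetical. It contrasts a brute-force KNN scan, whose cost grows with the dataset, with a hashed index that accepts streamed inserts and only ranks the points sharing the query's bucket.

```python
import math
import random
from collections import defaultdict


def exact_knn(data, query, k):
    """Brute-force KNN: every query scans all of `data`."""
    order = sorted(range(len(data)), key=lambda i: math.dist(data[i], query))
    return order[:k]


class HyperplaneLSH:
    """Toy LSH index (illustrative assumption, not the SPLASH design)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per hash bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(list)  # hash key -> point indices
        self.points = []

    def _key(self, v):
        # Each bit records on which side of a hyperplane the vector lies.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0.0
                     for plane in self.planes)

    def insert(self, v):
        """Add one point; suitable for data arriving as a stream."""
        self.points.append(v)
        self.buckets[self._key(v)].append(len(self.points) - 1)

    def query(self, q, k):
        """Approximate KNN: rank only the points in the query's bucket."""
        candidates = self.buckets.get(self._key(q), [])
        ranked = sorted(candidates, key=lambda i: math.dist(self.points[i], q))
        return ranked[:k]


if __name__ == "__main__":
    rng = random.Random(1)
    index = HyperplaneLSH(dim=4, n_bits=4)
    stream = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(200)]
    for point in stream:  # the dataset grows as points arrive
        index.insert(point)
    print(exact_knn(stream, stream[10], 3))
    print(index.query(stream[10], 3))
```

Because the approximate query only examines one bucket, it may return fewer than `k` neighbours or miss true neighbours that hashed elsewhere; that loss is the accuracy given up in exchange for not scanning the whole dataset. This sketch does not address GPU execution or datasets larger than device memory, which are the problems the thesis targets.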
| id |
RCAP_b9fe1fbb15009ce39684b15d7dfb1a1e |
|---|---|
| oai_identifier_str |
oai:run.unl.pt:10362/166583 |
| network_acronym_str |
RCAP |
| network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str |
https://opendoar.ac.uk/repository/7160 |
| dc.title.none.fl_str_mv |
GPU-Accelerated Approximate K-Nearest Neighbors over Unbounded Datasets |
| author |
Lopes, Gonçalo Pedro Santos |
| author_role |
author |
| dc.contributor.none.fl_str_mv |
Paulino, Hervé RUN |
| dc.contributor.author.fl_str_mv |
Lopes, Gonçalo Pedro Santos |
| dc.subject.por.fl_str_mv |
KNN; Big Datasets; Data Streams; LSH; ANN; GPU Memory Management; Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática |
| publishDate |
2021 |
| dc.date.none.fl_str_mv |
2021-02 2021-02-01T00:00:00Z 2024-04-24T13:21:02Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
| format |
masterThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/166583 |
| url |
http://hdl.handle.net/10362/166583 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP |
| instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str |
RCAAP |
| institution |
RCAAP |
| reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv |
info@rcaap.pt |
| _version_ |
1833597015290806272 |