Towards terabyte-scale outlier detection using GPUs
Year of defense: | 2017 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | eng |
Defending institution: | Universidade Federal de Minas Gerais, Brasil, ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO, Programa de Pós-Graduação em Ciência da Computação, UFMG |
Graduate program: | Not informed by the institution |
Department: | Not informed by the institution |
Country: | Not informed by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/32269 |
Abstract: | Outlier detection is an important data mining task for finding unusual data records in datasets. These anomalies often carry useful information that can be employed in a wide range of practical applications, such as network intrusion detection and fraud discovery in credit card or insurance databases, among several others. There are several challenges associated with the outlier detection problem, and its computational cost is a major one. Significant research has been done to improve these methods' runtime complexity through the use of data partitioning, ordering, and pruning rules. Although these advances allow outlier detection to be performed in near-linear time, they are not enough to enable processing large-scale datasets in a reasonable time. Even state-of-the-art methods are limited to processing small-scale datasets and/or to finding just a tiny fraction of the true top-n outliers. Recently, GPU-based implementations have emerged as an alternative to address the computational bottleneck. They have shown promising results but, to the best of our knowledge, all distance-based GPU algorithms currently available are designed for in-memory detection: they require the dataset to fit and be loaded into the GPU's memory. Consequently, their applicability is limited because they cannot be used in the scenario where the GPU's computational power would be the most useful: processing large-scale datasets. The goal of this work is to use GPUs to accelerate the outlier detection process in terabyte-scale, disk-resident datasets. To achieve this, we had to develop algorithms and strategies to overcome the massive reductions in GPU computational throughput caused by disk accesses. We make two main contributions in this work. First, we developed a set of tools and abstractions for out-of-core distance-based outlier detection on GPUs, including an effective parallelization strategy; algorithms and high-performance GPU kernels for essential operations of distance-based outlier detection; and an I/O subsystem that reduces data transfer overhead while allowing I/O and computation to overlap. The second main contribution is the development of a novel distance-based outlier detection algorithm for GPUs, DROIDg, capable of processing large-scale, disk-resident datasets in reasonable time. It leverages a new ranking heuristic, proposed by ourselves, to improve the efficiency of its pruning rule, thereby massively reducing the amount of computation required by the detection. Our experimental analysis focused on assessing the performance benefits of using GPUs for outlier detection in large-scale datasets. Thus, we compared DROIDg against some of the best out-of-core outlier detection algorithms available for CPUs: Orca, Diskaware, and Dolphin. DROIDg achieved speedups between 10X and 137X over the best sequential algorithm. Moreover, it displayed far superior scalability with respect to the dataset size and the number of outliers being detected. These results show that GPUs enable outlier detection to be performed at scales far beyond what even state-of-the-art CPU algorithms are capable of. |
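For readers unfamiliar with the pruning rule mentioned in the abstract, the sketch below illustrates the classic nested-loop cutoff idea used by CPU detectors such as Orca for the "distance to the k-th nearest neighbour" outlier score: a candidate is discarded as soon as its running k-NN distance estimate falls below the score of the weakest outlier found so far. This is a minimal, generic CPU illustration, not the dissertation's DROIDg GPU algorithm or its ranking heuristic; the function name and parameters are hypothetical.

```python
import heapq
import math
import random


def knn_distance_outliers(points, k=5, n=10):
    """Return the top-n points ranked by distance to their k-th nearest
    neighbour, using the nested-loop pruning rule: a candidate is dropped
    once its running k-NN distance estimate falls below the cutoff
    (the score of the weakest current top-n outlier)."""
    top = []      # min-heap of (score, index) holding the current top-n outliers
    cutoff = 0.0  # score of the weakest current top-n outlier

    for i, p in enumerate(points):
        neighbours = []  # max-heap (via negation) of p's k smallest distances so far
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(neighbours) < k:
                heapq.heappush(neighbours, -d)
            elif d < -neighbours[0]:
                heapq.heapreplace(neighbours, -d)
            # Once k neighbours are known, the k-th distance can only shrink;
            # if it is already below the cutoff, p cannot be a top-n outlier.
            if len(neighbours) == k and len(top) == n and -neighbours[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -neighbours[0]  # distance to p's k-th nearest neighbour
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)


if __name__ == "__main__":
    random.seed(0)
    data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
    data += [(8.0, 8.0), (-9.0, 7.5)]  # two obvious outliers
    for score, idx in knn_distance_outliers(data, k=5, n=5):
        print(f"point {idx} score {score:.3f}")
```

The cutoff makes the inner loop terminate early for most inliers, which is what gives such detectors near-linear behaviour in practice; the dissertation's contribution lies in making this style of detection effective on GPUs for disk-resident, terabyte-scale data.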