ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS

Veloso, Lays Helena Lopes

ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS

Detalhes bibliográficos
Ano de defesa:	2015
Autor(a) principal:	Veloso, Lays Helena Lopes
Orientador(a):	Senger, Luciano José
Banca de defesa:	Vaz, Maria Salete Marcon Gomes , Góis, Lourival Aparecido de
Tipo de documento:	Dissertação
Tipo de acesso:	Acesso aberto
Idioma:	por
Instituição de defesa:	UNIVERSIDADE ESTADUAL DE PONTA GROSSA
Programa de Pós-Graduação:	Programa de Pós Graduação Computação Aplicada
Departamento:	Computação para Tecnologias em Agricultura
País:	BR
Palavras-chave em Português:	K-Means Paralelo MapReduce Hadoop dados de fluxo mineração de dados
Palavras-chave em Inglês:	Parallel K-Means MapReduce Hadoop flux data data mining
Área do conhecimento CNPq:	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Link de acesso:	http://tede2.uepg.br/jspui/handle/prefix/127
Resumo:	This study aimed to investigate the use of a parallel K-means clustering algorithm,based on parallel MapReduce model, to improve the response time of the data mining. The parallel K-Means was implemented in three phases, performed in each iteration: assignment of samples to groups with nearest centroid by Mappers, in parallel; local grouping of samples assigned to the same group from Mappers using a Combiner and update of the centroids by the Reducer. The performance of the algorithm was evaluated in respect to SpeedUp and ScaleUp. To achieve this, experiments were run in single-node mode and on a Hadoop cluster consisting of six of-the-shelf computers. The data were clustered comprise flux towers measurements from agricultural regions and belong to Ameriflux. The results showed performance gains with increasing number of machines and the best time was obtained using six machines reaching the speedup of 3,25. To support our results, ANOVA analysis was applied from repetitions using 3, 4 and 6 machines in the cluster, respectively. The ANOVA show low variance between the execution times obtained for the same number of machines and a significant difference between means of each number of machines. The ScaleUp analysis show that the application scale well with an equivalent increase in data size and the number of machines, achieving similar performance. With the results as expected, this paper presents a parallel and scalable implementation of the K-Means to run on a Hadoop cluster and improve the response time of clustering to large databases.

ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS

Registros relacionados