Adaptive scheduling for Apache Hadoop

Bibliographic details
Year of defense: 2016
Main author: Cassales, Guilherme Weigert
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Master's thesis (Dissertação)
Access type: Open access
Language: Portuguese (por)
Institution of defense: Universidade Federal de Santa Maria (UFSM), Brasil; Centro de Tecnologia; Programa de Pós-Graduação em Informática; Ciência da Computação
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: http://repositorio.ufsm.br/handle/1/12025
Abstract: Many alternatives have been employed to process the data generated by current applications in a timely manner. One of them, Apache Hadoop, combines parallel and distributed processing with the MapReduce paradigm to provide an environment able to process huge data volumes under a simple programming model. However, Apache Hadoop was designed for dedicated and homogeneous clusters, a limitation that creates challenges for those who wish to use the framework in other circumstances. Acquiring a dedicated cluster is often impracticable due to its cost, and acquiring replacement parts can threaten a cluster's homogeneity. In these cases, companies commonly turn to the idle computing resources in their own networks; under such conditions, however, the original Hadoop distribution shows serious performance issues. This study therefore aimed to improve Hadoop's ability to adapt to pervasive and shared environments, where resource availability varies during execution. Context-awareness techniques were used to collect information about the capacity available on each worker node, and distributed communication techniques were used to keep this information up to date on the scheduler. The joint use of both techniques aimed to minimize and/or eliminate the overload that would otherwise occur on shared nodes. It resulted in a performance improvement of up to 50% on a shared cluster, compared to the original distribution, and indicated that a simple solution can positively impact scheduling, widening the range of environments where Hadoop can be used.
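
To make the approach described in the abstract concrete, the sketch below shows, in Java (Hadoop's implementation language), one way a worker-node daemon could sample its locally available capacity and push it toward the scheduler. This is an illustration only, not the dissertation's actual code: the class names (NodeContextCollector, NodeContext), the use of the JDK's com.sun.management.OperatingSystemMXBean as the context probe, the 5-second sampling interval, and the logging placeholder that stands in for the real scheduler-update transport are all assumptions.

import java.lang.management.ManagementFactory;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical worker-node daemon: periodically samples the capacity still
// available on this machine and publishes it so the scheduler can avoid
// overloading nodes that are shared with other (non-Hadoop) workloads.
public class NodeContextCollector {

    /** Snapshot of the capacity currently available on this node. */
    static final class NodeContext {
        final int cores;              // total logical processors
        final long freeMemoryBytes;   // free physical memory right now
        final double systemCpuLoad;   // whole-system load in [0.0, 1.0], -1.0 if unknown

        NodeContext(int cores, long freeMemoryBytes, double systemCpuLoad) {
            this.cores = cores;
            this.freeMemoryBytes = freeMemoryBytes;
            this.systemCpuLoad = systemCpuLoad;
        }

        @Override
        public String toString() {
            return String.format("cores=%d freeMem=%dMiB cpuLoad=%.2f",
                    cores, freeMemoryBytes / (1024 * 1024), systemCpuLoad);
        }
    }

    /** Probe the OS through the JDK's platform MXBean (one possible probe). */
    static NodeContext sample() {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        // Note: on JDK 14+ these two accessors are deprecated in favor of
        // getFreeMemorySize() and getCpuLoad().
        return new NodeContext(
                Runtime.getRuntime().availableProcessors(),
                os.getFreePhysicalMemorySize(),
                os.getSystemCpuLoad());
    }

    /**
     * Placeholder for the distributed-communication step: a real
     * implementation would push the snapshot to the scheduler node
     * (e.g. over RPC); here it is only logged.
     */
    static void publish(NodeContext context) {
        System.out.println("context update -> scheduler: " + context);
    }

    public static void main(String[] args) {
        Timer timer = new Timer("context-collector");
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                publish(sample());
            }
        }, 0L, 5_000L); // arbitrary 5-second sampling interval
    }
}

In the dissertation's setting, the publish step would be an actual distributed communication mechanism to the node running the scheduler, and the scheduler would weigh these snapshots when assigning tasks, so that shared nodes with little spare capacity receive less work.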