Locality-based scheduling in the Watershed environment (Escalonamento baseado em localidade no ambiente Watershed)

Bibliographic details
Year of defense: 2016
Main author: Bruno Cerqueira Hott
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: Portuguese
Degree-granting institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/ESBF-AEDNUF
Abstract: Increased connectivity and bandwidth on the Internet, combined with the falling cost of electronic equipment in general, have caused an explosion in the volume of data traveling over the network. At the same time, the resources available to store these data have been growing, which led to the appearance of systems designed specifically to process them. An early example is Google's MapReduce model, which was followed by several open-source implementations such as Hadoop and by new models such as Spark. Storing these huge data sets also required a solution, and distributed file systems such as HDFS and Tachyon emerged. Because the data are now very large and are distributed over multiple machines in a cluster, the problem arises of effectively placing applications close to their data. If this is not done, the cost of moving data through the system can be very high and can impair the final performance of the application. Depending on where the data reside, an application may access them directly from the disk of the local machine, from local memory through in-memory caching, or from another machine in the cluster over the network. The various trade-offs involved in terms of storage capacity, access time, and computational cost make the placement decision nontrivial. This work implements scheduling based on data locality in the Watershed processing environment. To that end, Watershed was integrated with the Hadoop ecosystem by creating communication channels with the distributed file systems HDFS and Tachyon. Based on the location information provided by these systems, we implemented a locality-based process scheduler for Watershed applications running on those file systems. Finally, experiments were conducted to compare the various ways of handling files: through the local file system, through a distributed file system, or in memory.
The results show the advantages of taking into account the placement of data in scheduling such applications.
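
The locality-based placement described in the abstract can be illustrated with a minimal sketch. This is not the Watershed scheduler itself, whose design is not detailed in this record; it is a simple greedy heuristic written under stated assumptions. The function name `assign_blocks`, the block and host names, and the input format (a mapping from each data block to the hosts holding a replica, of the kind a distributed file system can report) are all hypothetical.

```python
def assign_blocks(block_hosts, workers):
    """Greedy locality-aware assignment (illustrative sketch only).

    block_hosts: dict mapping a block id to the list of hosts that
                 hold a replica of that block (hypothetical input).
    workers:     list of hosts on which processing instances can run.

    Returns a dict mapping each block id to (chosen_host, is_local).
    """
    load = {w: 0 for w in workers}  # number of blocks given to each worker
    assignment = {}
    for block, hosts in block_hosts.items():
        # Prefer workers that already store a replica of the block.
        local = [h for h in hosts if h in load]
        # Fall back to any worker, which implies a read over the network.
        candidates = local or workers
        # Among the candidates, pick the least-loaded one (first on ties).
        chosen = min(candidates, key=lambda w: load[w])
        assignment[block] = (chosen, chosen in hosts)
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    replicas = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n9"]}
    for block, (host, is_local) in assign_blocks(replicas, ["n1", "n2", "n3"]).items():
        print(block, host, "local" if is_local else "remote")
```

In this sketch a block is read remotely only when no worker holds a replica of it. A real scheduler would also have to weigh the differences between disk, memory, and network access that the abstract mentions.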