Escalonador distribuído de tarefas para o Apache Spark (Distributed task scheduler for Apache Spark)
| Field | Value |
|---|---|
| Year of defense | 2018 |
| Main author | (not provided) |
| Advisor | (not provided) |
| Defense committee | (not provided) |
| Document type | Dissertation (master's thesis) |
| Access type | Open access |
| Language | Portuguese (por) |
| Defense institution | Universidade Federal do Rio de Janeiro, Brazil. Instituto Alberto Luiz Coimbra de Pós-Graduação e Pesquisa de Engenharia, Programa de Pós-Graduação em Engenharia de Sistemas e Computação, UFRJ |
| Graduate program | Not informed by the institution |
| Department | Not informed by the institution |
| Country | Not informed by the institution |
| Keywords (Portuguese) | (not provided) |
| Access link | http://hdl.handle.net/11422/13042 |
Abstract: Data-intensive frameworks play a prominent role in industry and academia. They make the distribution and parallelization of computation transparent to data scientists, increasing their productivity when developing data analytics applications such as implementations of machine learning and deep learning algorithms. However, providing this abstraction layer raises several issues, such as achieving good performance across many processing units, handling communication, and coping with storage devices. One of the most relevant challenges concerns task scheduling: performing it in an efficient and scalable manner matters because it significantly affects the performance and utilization of computing resources. This work proposes a hierarchical way of distributing task scheduling in data-intensive frameworks, introducing the scheduler assistant, whose role is to alleviate the central scheduler's job by taking responsibility for a share of the scheduling load. We use Apache Spark to implement a version of the hierarchical distributed task scheduler and to run comparative experiments that test the proposal's scalability. Using 32 computational nodes, the results show that our proof of concept maintains execution times similar to those of the original version of Apache Spark. Moreover, we show that by deploying scheduler assistants, the system makes better use of the computational nodes' processors during the experiments. Finally, we expose a bottleneck caused by the centralization of scheduling decisions in the Apache Spark execution model.
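The abstract describes a hierarchy in which a central scheduler delegates shares of the task queue to scheduler assistants, each of which assigns its share to the executors it manages. The following is a minimal toy sketch of that idea in Python; all class and method names here (`CentralScheduler`, `SchedulerAssistant`, `submit`, `schedule`) are illustrative assumptions, not the dissertation's implementation or Spark's actual API.

```python
class SchedulerAssistant:
    """Assigns tasks to a subset of executors, taking over that share of
    the scheduling load from the central scheduler (toy illustration,
    not the dissertation's code or Spark's API)."""

    def __init__(self, name, executors):
        self.name = name
        self.executors = executors  # executor ids this assistant manages
        self.assignments = {}       # task -> executor

    def schedule(self, tasks):
        # Round-robin assignment within this assistant's partition only;
        # no coordination with the central scheduler is needed per task.
        for i, task in enumerate(tasks):
            self.assignments[task] = self.executors[i % len(self.executors)]
        return self.assignments


class CentralScheduler:
    """Splits the task queue among assistants instead of assigning every
    task itself, which is the bottleneck the abstract points out in a
    fully centralized model."""

    def __init__(self, assistants):
        self.assistants = assistants

    def submit(self, tasks):
        tasks = list(tasks)
        n = len(self.assistants)
        plan = {}
        # Hand each assistant an interleaved share of the queue; the
        # per-task assignment work then happens in the assistants.
        for i, assistant in enumerate(self.assistants):
            plan.update(assistant.schedule(tasks[i::n]))
        return plan


if __name__ == "__main__":
    assistants = [
        SchedulerAssistant("a0", ["e0", "e1"]),
        SchedulerAssistant("a1", ["e2", "e3"]),
    ]
    plan = CentralScheduler(assistants).submit(f"t{i}" for i in range(8))
    print(plan)
```

In this sketch the central scheduler's per-task cost is reduced to partitioning the queue, while the O(tasks) assignment loop runs in the assistants; the dissertation's point is that this division keeps execution times comparable to stock Spark while using node processors more fully.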