Dynamic configuration architecture for the checkpoint technique in distributed processing frameworks

Bibliographic details
Defense year: 2019
Main author: Cardoso, Paulo Vinicius Mendonça
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Master's thesis (dissertação)
Access type: Open access
Language: Portuguese (por)
Defense institution: Universidade Federal de Santa Maria
Brazil
Computer Science
UFSM
Programa de Pós-Graduação em Ciência da Computação
Centro de Tecnologia
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: http://repositorio.ufsm.br/handle/1/19346
Abstract: Processing data on High-Performance Computing (HPC) systems is a common task given the large amounts of information being generated. However, as the complexity of these systems grows, reliability and performance problems arise, which makes the search for fault tolerance techniques important in this context. Checkpoint and Recovery (CR) is a widely used fault tolerance technique that recovers from failures using previously saved stable system states. In Apache Hadoop and Apache Spark, two distributed high-performance processing frameworks, checkpointing supports the recovery steps after failure events. However, checkpoint attributes in both frameworks are configured statically: they depend on choices made by the system developer and cannot be changed at run time, so inappropriate choices may harm the system's reliability and/or performance. This work therefore presents a solution for dynamically configuring the checkpoint technique in Hadoop and Spark. The proposal is embodied in the Dynamic Configuration Architecture (DCA), which relies on definitions of monitoring metrics. The main goal of the DCA is to adapt checkpoint attributes in real time according to the needs of the framework. Beyond defining the architecture, validations were performed in controlled failure scenarios to measure the DCA's efficiency. The results show that dynamically configured checkpointing achieved a balance between performance and reliability (measured by recovery time) in most of the tested scenarios. In failure-free executions the DCA did not introduce significant intrusiveness, and in the controlled failure scenarios recovery was fast. The DCA also shows a notable advantage in Spark: it can trigger checkpoints even in parts of the source code that are inaccessible to developers. In future work, DCA optimizations will be developed and validated; the monitoring metrics will be improved, as will the DCA components. With these optimizations, more thorough validations covering several failure and workload scenarios will be performed so that the DCA's performance can be fully measured.
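The abstract does not detail how the DCA derives new checkpoint attributes from its monitoring metrics. As a purely hypothetical illustration of the general idea (not the dissertation's actual algorithm), a tuner could re-estimate the mean time between failures (MTBF) from observed failure events and derive a new checkpoint interval from Young's classic approximation, T = sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint:

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for a near-optimal checkpoint interval:
    T = sqrt(2 * C * MTBF), with C the cost of one checkpoint (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class CheckpointTuner:
    """Toy adaptive tuner (illustrative only): records failure timestamps,
    re-estimates the MTBF, and updates the checkpoint interval accordingly."""

    def __init__(self, checkpoint_cost_s: float, initial_interval_s: float):
        self.checkpoint_cost_s = checkpoint_cost_s
        self.interval_s = initial_interval_s
        self._failure_times: list[float] = []

    def record_failure(self, t_s: float) -> None:
        self._failure_times.append(t_s)
        if len(self._failure_times) >= 2:
            # Mean gap between consecutive observed failures = MTBF estimate.
            gaps = [b - a for a, b in
                    zip(self._failure_times, self._failure_times[1:])]
            mtbf = sum(gaps) / len(gaps)
            self.interval_s = young_interval(self.checkpoint_cost_s, mtbf)


# Example: 30 s checkpoint cost, failures observed every 30 minutes.
tuner = CheckpointTuner(checkpoint_cost_s=30.0, initial_interval_s=600.0)
for t in (0.0, 1800.0, 3600.0):
    tuner.record_failure(t)
print(round(tuner.interval_s, 1))  # sqrt(2 * 30 * 1800) ≈ 328.6 s
```

In a real deployment such a tuner would sit behind the framework's configuration interface, pushing the new interval to Hadoop or Spark at run time; the metric used here (observed failure gaps) is only one possible input among the monitoring metrics the DCA defines.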