Desempenho e disponibilidade em sistemas distribuídos em larga escala (Performance and availability in large-scale distributed systems)
| Year of defense | 2005 |
|---|---|
| Main author | |
| Advisor | |
| Defense committee | |
| Document type | Dissertation |
| Access type | Open access |
| Language | Portuguese (por) |
| Defending institution | Universidade Federal de Minas Gerais (UFMG) |
| Graduate program | Not informed by the institution |
| Department | Not informed by the institution |
| Country | Not informed by the institution |
| Keywords in Portuguese | |
| Access link | http://hdl.handle.net/1843/RVMR-6HKGXG |
Abstract:

Scientific workflow systems provide scientists with a suite of tools and infrastructure to build data analysis applications from reusable components and execute them. The challenges in implementing workflow middleware support for scientific applications are many. Analysis often requires processing large volumes of data through a series of simple and complex operations.

To support data processing efficiently, a workflow middleware system should leverage distributed computing power and storage space (both disk and memory) and implement optimizations for large data retrieval and for scheduling of I/O and computation components. Another challenging issue is enabling fault tolerance in the middleware fabric. An analysis workflow with complex operations on large data can take a long time to execute, so the probability of a failure during execution should be considered. Efficient mechanisms are needed to support recovery from a failure without having to redo much of the computation already done.

In this thesis, we propose and evaluate a fault tolerance framework for applications that process data using a pipelined network of user-defined operations in a distributed environment. We provide functionality and protocols to efficiently manage input, intermediate, and output data and associated metadata, and to recover from certain types of faults that may occur in the system. In our approach, intermediate results and messages exchanged among application components are maintained in a distributed data management infrastructure along with additional metadata. The infrastructure consists of a persistent storage manager that stores checkpointed information in a distributed database and a distributed cache that reduces the overhead of checkpointing. We have developed a protocol among the various components of the system to manage checkpoints and related information efficiently. The experimental results show that our approach provides an asynchronous data storage mechanism that minimizes the overhead imposed on workflow execution.
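The abstract describes an architecture in which intermediate results are checkpointed asynchronously: pipeline stages hand their data to a distributed cache, and a persistent storage manager writes it to a distributed database in the background, keeping durable I/O off the workflow's critical path. The sketch below is only an illustration of that general idea; the names (`AsyncCheckpointer`, `checkpoint`, `recover`, `persist`) are hypothetical and are not taken from the dissertation or its middleware.

```python
# Minimal sketch (assumed names, not the thesis implementation) of asynchronous
# checkpointing: a pipeline stage caches its intermediate result and returns
# immediately, while a background thread persists the record to durable storage.
import json
import pathlib
import queue
import threading


class AsyncCheckpointer:
    """Buffers checkpoints in memory and persists them on a background thread."""

    def __init__(self, persist):
        self._cache = {}               # in-memory stand-in for the distributed cache
        self._pending = queue.Queue()  # checkpoints waiting to be made durable
        self._persist = persist        # callable that writes to durable storage
        threading.Thread(target=self._drain, daemon=True).start()

    def checkpoint(self, key, payload, metadata):
        """Called by a pipeline stage; returns as soon as the record is cached."""
        record = {"key": key, "payload": payload, "meta": metadata}
        self._cache[key] = record
        self._pending.put(record)      # durable write happens asynchronously

    def recover(self, key):
        """After a failure, restart a stage from its last cached checkpoint."""
        return self._cache.get(key)

    def flush(self):
        """Block until every queued checkpoint has been persisted."""
        self._pending.join()

    def _drain(self):
        while True:
            record = self._pending.get()
            self._persist(record)      # e.g. an insert into a distributed database
            self._pending.task_done()


if __name__ == "__main__":
    store = pathlib.Path("checkpoints.jsonl")

    def persist(record):
        # Stand-in for the persistent storage manager: append key and metadata.
        with store.open("a") as f:
            f.write(json.dumps({"key": record["key"], "meta": record["meta"]}) + "\n")

    cp = AsyncCheckpointer(persist)
    cp.checkpoint("stage2/chunk-17", payload=b"...", metadata={"op": "filter"})
    cp.flush()
    print(cp.recover("stage2/chunk-17")["meta"])
```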