Desempenho e disponibilidade em sistemas distribuídos em larga escala (Performance and availability in large-scale distributed systems)
| Year of defense | 2005 |
|---|---|
| Main author | |
| Advisor | |
| Defense committee | |
| Document type | Dissertation |
| Access type | Open access |
| Language | Portuguese (por) |
| Defending institution | Universidade Federal de Minas Gerais (UFMG) |
| Graduate program | Not informed by the institution |
| Department | Not informed by the institution |
| Country | Not informed by the institution |
| Keywords in Portuguese | |
| Access link | http://hdl.handle.net/1843/RVMR-6HKGXG |
Abstract:

Scientific workflow systems provide scientists with a suite of tools and infrastructure to build data analysis applications from reusable components and execute them. The challenges in implementing workflow middleware support for scientific applications are many. Analysis often requires processing large volumes of data through a series of simple and complex operations.

To support data processing efficiently, a workflow middleware system should leverage distributed computing power and storage space (both disk and memory) and implement optimizations for large data retrieval and for scheduling of I/O and computation components. Another challenging issue is enabling fault tolerance in the middleware fabric. An analysis workflow with complex operations on large data can take a long time to execute, so the probability of a failure during execution should be considered. Efficient mechanisms are needed to support recovery from a failure without having to redo much of the computation already done.

In this thesis, we propose and evaluate a fault tolerance framework for applications that process data using a pipelined network of user-defined operations in a distributed environment. We provide functionality and protocols to efficiently manage input, intermediate, and output data and associated metadata, and to recover from certain types of faults that may occur in the system. In our approach, intermediate results and messages exchanged among application components are maintained in a distributed data management infrastructure along with additional metadata. The infrastructure consists of a persistent storage manager that stores checkpointed information in a distributed database and a distributed cache that reduces the overhead of checkpointing. We have developed a protocol among the various components of the system to manage checkpoints and related information efficiently. The experimental results show that our approach provides an asynchronous data storage mechanism that minimizes the overhead imposed on workflow execution.
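The abstract describes an architecture in which intermediate results are checkpointed asynchronously: pipeline stages hand their data to a distributed cache, and a persistent storage manager writes it to a distributed database in the background, keeping durable I/O off the workflow's critical path. The sketch below is only an illustration of that general idea; the names (`AsyncCheckpointer`, `checkpoint`, `recover`, `persist`) are hypothetical and are not taken from the dissertation or its middleware.

```python
# Minimal sketch (assumed names, not the thesis implementation) of asynchronous
# checkpointing: a pipeline stage caches its intermediate result and returns
# immediately, while a background thread persists the record to durable storage.
import json
import pathlib
import queue
import threading


class AsyncCheckpointer:
    """Buffers checkpoints in memory and persists them on a background thread."""

    def __init__(self, persist):
        self._cache = {}               # in-memory stand-in for the distributed cache
        self._pending = queue.Queue()  # checkpoints waiting to be made durable
        self._persist = persist        # callable that writes to durable storage
        threading.Thread(target=self._drain, daemon=True).start()

    def checkpoint(self, key, payload, metadata):
        """Called by a pipeline stage; returns as soon as the record is cached."""
        record = {"key": key, "payload": payload, "meta": metadata}
        self._cache[key] = record
        self._pending.put(record)      # durable write happens asynchronously

    def recover(self, key):
        """After a failure, restart a stage from its last cached checkpoint."""
        return self._cache.get(key)

    def flush(self):
        """Block until every queued checkpoint has been persisted."""
        self._pending.join()

    def _drain(self):
        while True:
            record = self._pending.get()
            self._persist(record)      # e.g. an insert into a distributed database
            self._pending.task_done()


if __name__ == "__main__":
    store = pathlib.Path("checkpoints.jsonl")

    def persist(record):
        # Stand-in for the persistent storage manager: append key and metadata.
        with store.open("a") as f:
            f.write(json.dumps({"key": record["key"], "meta": record["meta"]}) + "\n")

    cp = AsyncCheckpointer(persist)
    cp.checkpoint("stage2/chunk-17", payload=b"...", metadata={"op": "filter"})
    cp.flush()
    print(cp.recover("stage2/chunk-17")["meta"])
```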