Tolerância a falhas em elementos de processamento de MPSoCs

Barreto, Francisco Favorino da Silva

Tolerância a falhas em elementos de processamento de MPSoCs

Detalhes bibliográficos
Ano de defesa:	2015
Autor(a) principal:	Barreto, Francisco Favorino da Silva
Orientador(a):	Amory, Alexandre de Morais
Banca de defesa:	Não Informado pela instituição
Tipo de documento:	Dissertação
Tipo de acesso:	Acesso aberto
Idioma:	por
Instituição de defesa:	Pontifícia Universidade Católica do Rio Grande do Sul
Programa de Pós-Graduação:	Programa de Pós-Graduação em Ciência da Computação
Departamento:	Faculdade de Informática
País:	Brasil
Palavras-chave em Português:	INFORMÁTICA MULTIPROCESSADORES TOLERÂNCIA A FALHAS (INFORMÁTICA)
Área do conhecimento CNPq:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Link de acesso:	http://tede2.pucrs.br/tede2/handle/tede/6435
Resumo:	The need of more processing capacity for embedded systems nowadays is pushing the research of MPSoCs with tens or hundreds of processors. These characteristics bring design challenges such as scalability and dependability. Such complex systems must have fault tolerant methods to ensure acceptable reliability and availability. This way, the user is not exposed to significant data losses, malfunctioning and even the total system failure. Considering this technology trend, the present work proposes a fault tolerance method with focus in fault recovery. The method uses concepts largely explored in distributed systems to solve the problem of permanent failures in the processing elements of MPSoCs. The implementation is exclusively in software, and recovers the system exposed to a permanent failure on processing elements, reallocating all tasks that were executing in the faulty element to a healthy processing element. The failed application tasks restart their executions since there is no context saving, enabling a lightweight method. The experiments are performed in the HeMPS platform, evaluating the most relevant parameters as recovery time, communication bandwidth impact, scalability and others. In the absence of faults, the proposed protocol has 21 Kbytes of memory area (20% more compared to the original kernel) and no overhead in terms of execution time. In the presence of faults, the results demonstrate total recovery times from 0.2ms to 1ms, depending on the number of reallocated tasks (1 to 7). The biggest impact in the protocol time is related with the reallocation task phase.

Tolerância a falhas em elementos de processamento de MPSoCs

Registros relacionados