Layered approach for runtime fault recovery in NOC-Based MPSOCS

Wächter, Eduardo Weber

Layered approach for runtime fault recovery in NOC-Based MPSOCS

Detalhes bibliográficos
Ano de defesa:	2015
Autor(a) principal:	Wächter, Eduardo Weber
Orientador(a):	Moraes, Fernando Gehm
Banca de defesa:	Não Informado pela instituição
Tipo de documento:	Tese
Tipo de acesso:	Acesso aberto
Idioma:	eng
Instituição de defesa:	Pontifícia Universidade Católica do Rio Grande do Sul
Programa de Pós-Graduação:	Programa de Pós-Graduação em Ciência da Computação
Departamento:	Faculdade de Informática
País:	Brasil
Palavras-chave em Português:	INFORMÁTICA ARQUITETURA DE COMPUTADOR MICROPROCESSADORES
Área do conhecimento CNPq:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Link de acesso:	http://tede2.pucrs.br/tede2/handle/tede/6279
Resumo:	Mechanisms for fault-tolerance in MPSoCs are mandatory to cope with defects during fabrication or faults during product lifetime. For instance, permanent faults on the interconnect network can stall or crash applications, even though the MPSoCs’ network has alternative faultfree paths to a given destination. Runtime Fault Tolerance provide self-organization mechanisms to continue delivering their processing services despite defective cores due to the presence of permanent and/or transient faults throughout their lifetime. This Thesis presents a runtime layered approach to a fault-tolerant MPSoC, where each layer is responsible for solving one part of the problem. The approach is built on top of a novel small specialized network used to search fault-free paths. The first layer, named physical layer, is responsible for the fault detection and fault isolation of defective routers. The second layer, named the network layer, is responsible for replacing the original faulty path by an alternative fault-free path. A fault-tolerant routing method executes a path search mechanism and reconfigures the network to use the faulty-free path. The third layer, named transport layer, implements a fault-tolerant communication protocol that triggers the path search in the network layer when a packet does not reach its destination. The last layer, application layer, is responsible for moving tasks from the defective processing element (PE) to a healthy PE, saving the task’s internal state, and restoring it in case of fault while executing a task. Results at the network layer, show a fast path finding method. The entire process of finding alternative paths takes typically less than 2000 clock cycles or 20 microseconds. In the transport layer, different approaches were evaluated being capable of detecting a lost message and start the retransmission. The results show that the overhead to retransmit the message is 2.46X compared to the time to transmit a message without fault, being all other messages transmitted with no overhead. For the DTW, MPEG, and synthetic applications the average-case application execution overhead was 0.17%, 0.09%, and 0.42%, respectively. This represents less than 5% of the application execution overhead worst case. At the application layer, the entire fault recovery protocol executes fast, with a low execution time overhead with no faults (5.67%) and with faults (17.33% - 28.34%).

Layered approach for runtime fault recovery in NOC-Based MPSOCS

Registros relacionados