Adding native support for task scheduling to a Linux-capable RISC-V multicore system

Bibliographic Details
Main Author: Morais, Lucas Henrique
Publication Date: 2019
Format: Master thesis
Language: eng
Source: Biblioteca Digital de Teses e Dissertações da USP
Download full: http://www.teses.usp.br/teses/disponiveis/45/45134/tde-28092019-060958/
Summary: The Task Scheduling Paradigm is a general technique for leveraging fine and coarse grain parallelism from applications of several domains with minimum impact on code readability, relying on the automatic inference of data dependencies among tasks. The performance of Task Parallel applications is correlated with the speed at which the underlying Task Scheduling System is able to detect such dependencies, something that is critical for fine-granularity workloads, which cannot amortize scheduling overheads with long periods of useful computation. That being the case, several groups have recently been developing FPGA-accelerated Task Scheduling Systems architectures where a software Task Scheduling Runtime is able to offload its bookkeeping computations to an FPGA-based accelerator with the goal of efficiently scheduling fine-grained tasks to CPU cores. Even though these FPGA-accelerated systems offer substantial gains over the software-only baseline, it is also true that FPGA-CPU communication bottlenecks prevent such designs from handling scenarios with either large number of cores or very fine-grained tasks. With that in mind, we proposed the implementation of a Native Task Scheduling System that is, a processor with native support for task scheduling embedded into its architecture with the goal of substantially reducing these overheads. More specifically, this project aimed at embedding the HW logic of Picos, a mature Task Scheduling Accelerator developed by the Barcelona Supercomputing Center (BSC), into Rocket Chip, an open-source, silicon-proven, multi-core implementation of RISC-V. The ISA of the resulting system provides special instructions for Task Applications to interact with this Task Scheduling Logic, ruling out all FPGA-CPU communication latencies. To evaluate the prototype performance, we both (1) adapted Nanos, a mature Task Scheduling runtime, to benefit from the new task-scheduling-accelerating instructions; and (2) developed Phentos, a new HW-accelerated light weight Task Scheduling runtime. Our experiments show that task parallel programs using Nanos-RV the Nanos version ported to our system are on average 2.13 times faster than those being serviced by baseline Nanos, while programs running on Phentos are 13.19 times faster, considering geometric means. Using eight cores, Nanos-RV is able to deliver speedups with respect to serial execution of up to 5.62 times, while Phentos produces speedups of up to 5.72 times.
id USP_c05190fdd8dc7e57a72d4a1c6b80fdbc
oai_identifier_str oai:teses.usp.br:tde-28092019-060958
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
spelling Adding native support for task scheduling to a Linux-capable RISC-V multicore systemAdicionando suporte nativo a paralelismo de tarefas a um sistema RISC-V multicore com suporte a LinuxChiselChiselParalelismo por tarefasParallel programmingProgramação paralelaRISC-VRISC-VRocket chipRocket chipTask schedulingThe Task Scheduling Paradigm is a general technique for leveraging fine and coarse grain parallelism from applications of several domains with minimum impact on code readability, relying on the automatic inference of data dependencies among tasks. The performance of Task Parallel applications is correlated with the speed at which the underlying Task Scheduling System is able to detect such dependencies, something that is critical for fine-granularity workloads, which cannot amortize scheduling overheads with long periods of useful computation. That being the case, several groups have recently been developing FPGA-accelerated Task Scheduling Systems architectures where a software Task Scheduling Runtime is able to offload its bookkeeping computations to an FPGA-based accelerator with the goal of efficiently scheduling fine-grained tasks to CPU cores. Even though these FPGA-accelerated systems offer substantial gains over the software-only baseline, it is also true that FPGA-CPU communication bottlenecks prevent such designs from handling scenarios with either large number of cores or very fine-grained tasks. With that in mind, we proposed the implementation of a Native Task Scheduling System that is, a processor with native support for task scheduling embedded into its architecture with the goal of substantially reducing these overheads. More specifically, this project aimed at embedding the HW logic of Picos, a mature Task Scheduling Accelerator developed by the Barcelona Supercomputing Center (BSC), into Rocket Chip, an open-source, silicon-proven, multi-core implementation of RISC-V. The ISA of the resulting system provides special instructions for Task Applications to interact with this Task Scheduling Logic, ruling out all FPGA-CPU communication latencies. To evaluate the prototype performance, we both (1) adapted Nanos, a mature Task Scheduling runtime, to benefit from the new task-scheduling-accelerating instructions; and (2) developed Phentos, a new HW-accelerated light weight Task Scheduling runtime. Our experiments show that task parallel programs using Nanos-RV the Nanos version ported to our system are on average 2.13 times faster than those being serviced by baseline Nanos, while programs running on Phentos are 13.19 times faster, considering geometric means. Using eight cores, Nanos-RV is able to deliver speedups with respect to serial execution of up to 5.62 times, while Phentos produces speedups of up to 5.72 times.Paralelismo por Tarefas é uma técnica genérica de extração de paralelismo de granularidade arbitrária aplicável a programas de vários domínios, com mínimo impacto sobre legibilidade de código, baseada na inferência automática de dependências de dados entre tarefas. O desempenho de aplicações paralelas baseadas nesse paradigma depende da velocidade com a qual o runtime de Paralelismo por Tarefas que lhe dá suporte é capaz de detectar tais dependências, fato que é ainda mais crítico para aplicações envolvendo tarefas de granularidade fina, já que nesse cenário o overhead de escalonamento não é amortizado por períodos significativamente maiores de computação útil. Recentemente, diversos grupos têm desenvolvido Sistemas de Suporte a Paralelismo por Tarefas acelerados por FPGAs, os quais são capazes de fazer offload das operações de inferência de dependências para um acelerador em FPGA de modo a melhorar o seu desempenho ao lidar com tarefas de granularidade fina. Por outro lado, ainda que esses sistemas acelerados por FPGA apresentem ganhos substanciais com relação às alternativas baseadas puramente em software, o desempenho dessas soluções é prejudicado por gargalos de comunicação entre a CPU e a FPGA, os quais limitam a capacidade desses sistemas de lidar com cenários envolvendo grande número de núcleos ou tarefas muito finas. Motivados por isso, implementamos um Sistema de Suporte Nativo a Paralelismo por Tarefas isto é, um processador com suporte arquitetural nativo a Paralelismo por Tarefas com o objetivo de reduzir consideravelmente tais overheads de comunicação. Mais especificamente, integramos a lógica em hardware do Picos, um acelerador de Paralelismo por Tarefas desenvolvido pelo Barcelona Supercomputing Center (BSC), ao Rocket Chip, uma implementação multi-core de código livre do RISC-V desenvolvida pela Universidade da Califórnia, Berkeley. O sistema resultante contém em sua ISA (Instruction Set Architecture) as instruções necessárias para que aplicações baseadas em tarefas possam interagir diretamente com essa lógica de escalonamento, minimizando os overheads associados ao uso de runtimes intermediários e eliminando toda a latência de comunicação FPGA-CPU. Para avaliar a performance do protótipo que então se construiu, nós tanto (1) adaptamos o runtime de escalonamento de tarefas Nanos para que ele pudesse ser acelerado pelas novas instruções de escalonamento de tarefas, quanto (2) criamos um novo runtime leve de escalonamento de tarefas a que demos o nome de Phentos. Nossos experimentos mostram que programas baseados em paralelismo por tarefas usando o runtime Nanos-RV a versão do runtime Nanos com suporte ao sistema que produzimos são executados em média 2,13 vezes mais rapidamente do que versões dos mesmos programas utilizando a versão básica do Nanos, enquanto programas executados com o Phentos são em média 13,19 vezes mais rápidos do que suas versões correspondentes baseadas na mesma versão básica do Nanos. Tais valores médios correspondem à média geométrica dos conjuntos de dados pertinentes. Usando oito núcleos, Nanos-RV entrega ganhos de desempenho com relação a execuções seriais de até 5,62 vezes, enquanto Phentos entrega ganhos de até 5,72 vezes.Biblioteca Digitais de Teses e Dissertações da USPAraújo, Guido Costa Souza deLejbman, Alfredo Goldman VelMorais, Lucas Henrique2019-08-22info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://www.teses.usp.br/teses/disponiveis/45/45134/tde-28092019-060958/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2019-11-08T23:43:27Zoai:teses.usp.br:tde-28092019-060958Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.bropendoar:27212019-11-08T23:43:27Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv Adding native support for task scheduling to a Linux-capable RISC-V multicore system
Adicionando suporte nativo a paralelismo de tarefas a um sistema RISC-V multicore com suporte a Linux
title Adding native support for task scheduling to a Linux-capable RISC-V multicore system
spellingShingle Adding native support for task scheduling to a Linux-capable RISC-V multicore system
Morais, Lucas Henrique
Chisel
Chisel
Paralelismo por tarefas
Parallel programming
Programação paralela
RISC-V
RISC-V
Rocket chip
Rocket chip
Task scheduling
title_short Adding native support for task scheduling to a Linux-capable RISC-V multicore system
title_full Adding native support for task scheduling to a Linux-capable RISC-V multicore system
title_fullStr Adding native support for task scheduling to a Linux-capable RISC-V multicore system
title_full_unstemmed Adding native support for task scheduling to a Linux-capable RISC-V multicore system
title_sort Adding native support for task scheduling to a Linux-capable RISC-V multicore system
author Morais, Lucas Henrique
author_facet Morais, Lucas Henrique
author_role author
dc.contributor.none.fl_str_mv Araújo, Guido Costa Souza de
Lejbman, Alfredo Goldman Vel
dc.contributor.author.fl_str_mv Morais, Lucas Henrique
dc.subject.por.fl_str_mv Chisel
Chisel
Paralelismo por tarefas
Parallel programming
Programação paralela
RISC-V
RISC-V
Rocket chip
Rocket chip
Task scheduling
topic Chisel
Chisel
Paralelismo por tarefas
Parallel programming
Programação paralela
RISC-V
RISC-V
Rocket chip
Rocket chip
Task scheduling
description The Task Scheduling Paradigm is a general technique for leveraging fine and coarse grain parallelism from applications of several domains with minimum impact on code readability, relying on the automatic inference of data dependencies among tasks. The performance of Task Parallel applications is correlated with the speed at which the underlying Task Scheduling System is able to detect such dependencies, something that is critical for fine-granularity workloads, which cannot amortize scheduling overheads with long periods of useful computation. That being the case, several groups have recently been developing FPGA-accelerated Task Scheduling Systems architectures where a software Task Scheduling Runtime is able to offload its bookkeeping computations to an FPGA-based accelerator with the goal of efficiently scheduling fine-grained tasks to CPU cores. Even though these FPGA-accelerated systems offer substantial gains over the software-only baseline, it is also true that FPGA-CPU communication bottlenecks prevent such designs from handling scenarios with either large number of cores or very fine-grained tasks. With that in mind, we proposed the implementation of a Native Task Scheduling System that is, a processor with native support for task scheduling embedded into its architecture with the goal of substantially reducing these overheads. More specifically, this project aimed at embedding the HW logic of Picos, a mature Task Scheduling Accelerator developed by the Barcelona Supercomputing Center (BSC), into Rocket Chip, an open-source, silicon-proven, multi-core implementation of RISC-V. The ISA of the resulting system provides special instructions for Task Applications to interact with this Task Scheduling Logic, ruling out all FPGA-CPU communication latencies. To evaluate the prototype performance, we both (1) adapted Nanos, a mature Task Scheduling runtime, to benefit from the new task-scheduling-accelerating instructions; and (2) developed Phentos, a new HW-accelerated light weight Task Scheduling runtime. Our experiments show that task parallel programs using Nanos-RV the Nanos version ported to our system are on average 2.13 times faster than those being serviced by baseline Nanos, while programs running on Phentos are 13.19 times faster, considering geometric means. Using eight cores, Nanos-RV is able to deliver speedups with respect to serial execution of up to 5.62 times, while Phentos produces speedups of up to 5.72 times.
publishDate 2019
dc.date.none.fl_str_mv 2019-08-22
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://www.teses.usp.br/teses/disponiveis/45/45134/tde-28092019-060958/
url http://www.teses.usp.br/teses/disponiveis/45/45134/tde-28092019-060958/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Liberar o conteúdo para acesso público.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Liberar o conteúdo para acesso público.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1826319126693412864