Scheduling computations

Bibliographic Details
Main Author: Rito, Guilherme Miguel Teixeira
Publication Date: 2016
Format: Master thesis
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10362/166873
Summary: For quite some time, the Work Stealing algorithm has been the de facto standard for scheduling multithreaded computations. To ensure scalability and achieve high performance, work is distributed across processors. Each processor, in turn, owns a concurrent work queue that it uses to keep track of its assigned tasks. When a processor's work queue becomes empty, the processor becomes a thief and starts targeting victims uniformly at random, from which it attempts to steal tasks. This strategy has been proven efficient in both theory and practice, and is currently used in state-of-the-art Work Stealing algorithms. Nevertheless, purely receiver-initiated load balancing schemes, such as Work Stealing's, are known to be unsuitable for scheduling computations with little or unbalanced parallelism. Moreover, due to the concurrent nature of work queues, even local operations require memory fences, which are extremely expensive on modern computer architectures. Consequently, even when a processor is busy, it may incur costly overheads caused by local accesses to its own work queue. Finally, as the scheduler's load balancer relies on random steals, its performance when executing memory-bound computations is very limited. Despite all efforts, no silver bullet has been found and, even worse, all these limitations still exist in state-of-the-art Work Stealing algorithms. In this thesis we make three major theoretical contributions, each addressing one of the aforementioned limitations. First, we prove that Work Stealing can easily be extended to make use of custom load balancers that, for various classes of workloads (e.g. memory-bound computations), can greatly boost the scheduler's performance while maintaining Work Stealing's high performance in the general setting. Then, we present a provably efficient scheduler that mixes receiver-initiated and sender-initiated policies, and show theoretically that it overcomes Work Stealing's limitations for the execution of computations with little or irregular parallelism. Finally, we present a novel scheduling algorithm whose expected runtime bounds are optimal within a constant factor and that avoids most of the costs associated with memory fences, bounding the total expected overhead incurred by memory fences to O(P T∞), where T∞ is the critical-path length of a computation and P is the number of processors. This contrasts with state-of-the-art Work Stealing algorithms, where the total overhead incurred by these synchronization mechanisms can grow proportionally with the total amount of work. From this perspective, our proposal greatly improves on the state-of-the-art Work Stealing algorithm. In fact, as we will prove, for several classes of computations the overheads incurred by our algorithm are exponentially smaller than those incurred by state-of-the-art Work Stealing algorithms.
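The baseline strategy described in the summary (one work queue per processor, idle processors stealing from victims chosen uniformly at random) can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions and is not taken from the thesis: the function name `run_work_stealing`, the binary task tree, and the round-based model are all hypothetical, and a real scheduler would use concurrent deques, which is where the memory-fence costs mentioned above come from.

```python
import random
from collections import deque


def run_work_stealing(root_size, num_workers=4, seed=0):
    """Simulate round-based work stealing (illustrative model only).

    A task is an integer n >= 1. A task of size 1 is a leaf (one unit
    of work); a larger task splits into subtasks of sizes n // 2 and
    n - n // 2. Returns (leaves_executed, steal_attempts).
    """
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root_size)  # all work starts on worker 0
    leaves = 0
    steal_attempts = 0

    while any(queues):
        # Snapshot which workers have nothing to do in this round.
        idle = [w for w in range(num_workers) if not queues[w]]

        # Phase 1: every busy worker pops from the bottom of its own queue.
        for worker in range(num_workers):
            own = queues[worker]
            if not own:
                continue
            task = own.pop()
            if task <= 1:
                leaves += 1
            else:
                own.append(task // 2)
                own.append(task - task // 2)

        # Phase 2: every idle worker becomes a thief and targets one
        # victim chosen uniformly at random among the other workers.
        for thief in idle:
            victim = rng.choice([w for w in range(num_workers) if w != thief])
            steal_attempts += 1
            if queues[victim]:
                # Steal from the top, the end opposite to the owner's.
                queues[thief].append(queues[victim].popleft())

    return leaves, steal_attempts
```

In this model the number of executed leaves always equals `root_size`, whatever the seed, while the number of steal attempts depends on the random victim choices. Stealing from the end opposite to the owner's is the usual design choice because the oldest tasks tend to be the largest.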
id RCAP_f1dcb95a8d50a929a274d3e29bd19f71
oai_identifier_str oai:run.unl.pt:10362/166873
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Scheduling computations
title Scheduling computations
spellingShingle Scheduling computations
Rito, Guilherme Miguel Teixeira
Scheduling algorithms
Randomized algorithms
Parallel computing
Distributed computing
Dynamic load balancing
Scientific Domain/Area::Engineering and Technology::Electrical Engineering, Electronics and Informatics
title_short Scheduling computations
title_full Scheduling computations
title_fullStr Scheduling computations
title_full_unstemmed Scheduling computations
title_sort Scheduling computations
author Rito, Guilherme Miguel Teixeira
author_facet Rito, Guilherme Miguel Teixeira
author_role author
dc.contributor.none.fl_str_mv Paulino, Hervé
RUN
dc.contributor.author.fl_str_mv Rito, Guilherme Miguel Teixeira
dc.subject.por.fl_str_mv Scheduling algorithms
Randomized algorithms
Parallel computing
Distributed computing
Dynamic load balancing
Scientific Domain/Area::Engineering and Technology::Electrical Engineering, Electronics and Informatics
topic Scheduling algorithms
Randomized algorithms
Parallel computing
Distributed computing
Dynamic load balancing
Scientific Domain/Area::Engineering and Technology::Electrical Engineering, Electronics and Informatics
publishDate 2016
dc.date.none.fl_str_mv 2016-11
2016-11-01T00:00:00Z
2024-05-02T15:22:58Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/166873
url http://hdl.handle.net/10362/166873
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833597016847941632