ASLI schemes as a kernel convolved way to optimize stencil computation.

Bibliographic details
Year of defense: 2021
Lead author: Januário, Guilherme Carvalho
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/3/3141/tde-18052021-154710/
Abstract: Stencil computation is notorious for having its performance limited by main memory access. On current computers, this implies underutilization of the central processing units. To cope with this limitation, multiple approaches that rely on reordering the computation have been proposed, most notably variations of space-blocking and time-blocking. This work introduces a technique to speed up stencil computation that is not based on space-blocking or time-blocking. Stencil computation implies multiple iterations of traversals through every domain point, with each iteration updating every point based on the previous values of the neighboring points. The technique introduced, named Aggregate Stencil-Loop Iteration (ASLI), works by updating the value of each domain point using the original stencil operator convolved with itself one or more times. The approach implies traversing the data domain fewer times than a straightforward iterative stencil implementation would, with each traversal performing more computation per data item fetched into registers. This more complex operator creates new opportunities for in-register data reuse and increases the FLOPs-to-load ratio. Computation and data reuse schemes are developed for its application to 1-, 2-, and 3-dimensional stencils. The Influence Table is presented to assist in the calculation of convolved coefficients. An integer sequence is derived. For 2D and 3D star-shaped stencils, the total number of FLOPs increases, but better interaction with the memory makes ASLI beneficial even when compared with optimized non-ASLI implementations. ASLI is relatively easy to implement, allowing more scientists to productively extract better performance from supercomputing clusters.
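As an illustration of the aggregation idea described above, the following minimal Python sketch checks that one sweep with a stencil convolved with itself gives the same result as two sweeps with the original stencil. It is not taken from the thesis: the function names `apply_stencil` and `convolve`, the three-point weights, the sample data, and the periodic boundary handling are all assumptions chosen to keep the example short.

```python
def apply_stencil(u, coeffs):
    """One sweep of a 1D stencil over u, assuming periodic boundaries.

    coeffs maps an integer offset to its weight (hypothetical representation).
    """
    n = len(u)
    return [sum(w * u[(i + off) % n] for off, w in coeffs.items())
            for i in range(n)]


def convolve(a, b):
    """Convolve two stencils given as offset -> weight dictionaries."""
    out = {}
    for off_a, w_a in a.items():
        for off_b, w_b in b.items():
            out[off_a + off_b] = out.get(off_a + off_b, 0.0) + w_a * w_b
    return out


# Illustrative three-point stencil (weights are an assumption, not from the thesis).
stencil = {-1: 0.25, 0: 0.5, 1: 0.25}

# Aggregated operator: the stencil convolved with itself once (five points).
aggregated = convolve(stencil, stencil)

# Arbitrary sample data.
u0 = [float((7 * i) % 11) for i in range(16)]

# Two traversals with the original operator versus one with the aggregated one.
two_sweeps = apply_stencil(apply_stencil(u0, stencil), stencil)
one_sweep = apply_stencil(u0, aggregated)

max_diff = max(abs(x - y) for x, y in zip(two_sweeps, one_sweep))
print("aggregated offsets:", sorted(aggregated))
print("max difference:", max_diff)
```

The single aggregated sweep reads the domain once instead of twice, which is the source of the reduced memory traffic the abstract refers to. A real implementation would also need a boundary treatment suited to the problem, which this sketch sidesteps by assuming periodicity.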
Performance results are shown for a variety of platforms, demonstrating the soundness of the approach and exemplifying how it can be straightforwardly combined with existing techniques and solutions, helping to increase the performance of existing optimization methods. In order to better express ASLI and to enable comparison with other approaches, a methodology is outlined and new metrics are set forth for evaluating stencil implementations and, possibly, the scalability of memory access in a machine. ASLI can be regarded as the application of a broader principle, namely Kernel Convolution, to the particular case of stencil computation. From this perspective, the Influence Table could promote the use of Kernel Convolution in other applications.
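The abstract's remark that the total number of FLOPs increases for 2D and 3D star-shaped stencils can be made concrete with a rough count. The sketch below convolves a radius-1 star stencil with itself in 1, 2, and 3 dimensions and compares the number of coefficient multiplications per point for two plain sweeps against one aggregated sweep. It is a simplification under stated assumptions: the helper names, the unit weights, and the use of multiplications as a proxy for FLOPs are hypothetical, it ignores additions and any symmetry-based factoring, and it is not the thesis's Influence Table or its FLOP accounting.

```python
def star_stencil(dim, center=1.0, arm=1.0):
    """Radius-1 star stencil: the center point plus 2*dim axis neighbors.

    The weights are placeholders; only the set of offsets matters here.
    """
    coeffs = {(0,) * dim: center}
    for axis in range(dim):
        for step in (-1, 1):
            offset = tuple(step if a == axis else 0 for a in range(dim))
            coeffs[offset] = arm
    return coeffs


def convolve(a, b):
    """Convolve two stencils given as offset-tuple -> weight dictionaries."""
    out = {}
    for off_a, w_a in a.items():
        for off_b, w_b in b.items():
            key = tuple(x + y for x, y in zip(off_a, off_b))
            out[key] = out.get(key, 0.0) + w_a * w_b
    return out


def multiplications_per_point(dim):
    """Return (two plain sweeps, one aggregated sweep) multiplication counts."""
    s = star_stencil(dim)
    aggregated = convolve(s, s)
    return 2 * len(s), len(aggregated)


counts = {dim: multiplications_per_point(dim) for dim in (1, 2, 3)}
for dim, (plain, agg) in counts.items():
    print(f"{dim}D: two plain sweeps = {plain}, one aggregated sweep = {agg}")
```

Under these assumptions the counts are 6 versus 5 in 1D, 10 versus 13 in 2D, and 14 versus 25 in 3D, so the aggregated operator does more arithmetic per point in 2D and 3D while traversing the domain half as many times.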