research-article

Public Access

Automatic Hierarchical Parallelization of Linear Recurrences

Authors:

Sepideh Maleki,

Martin BurtscherAuthors Info & Claims

ACM SIGPLAN Notices, Volume 53, Issue 2

Pages 128 - 138

https://doi.org/10.1145/3296957.3173168

Published: 19 March 2018 Publication History

Abstract

Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-efficient algorithm to compute linear recurrences that is amenable to automatic parallelization and suitable for hierarchical massively-parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of memory copy for large inputs, which cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and digital filters, our automatically parallelized code outperforms the fastest prior implementations.

References

[1]

Alg3: https://github.com/andmax/gpufilter/, accessed 8/8/2017.

[2]

G.E. Blelloch. "Scans as Primitive Parallel Operations." IEEE Transactions on Computers, 38(11):1526--1538. 1989.

Digital Library

[3]

G.E. Blelloch. "Prefix Sums and Their Applications." In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.

[4]

G. Chaurasia, J. Ragan-Kelley, S. Paris, G. Drettakis, and F. Durand. "Compiling High Performance Recursive Filters." In Proceedings of the 7th Conference on High-Performance Graphics, pp. 85--94. 2015.

Digital Library

[5]

CUB: https://nvlabs.github.io/cub/, accessed 8/8/2017.

[6]

Y. Dotsenko, N.K. Govindaraju, P.P. Sloan, C. Boyd, and J. Manferdelli. "Fast Scan Algorithms on Graphics Processors." In Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 205--213. 2008.

Digital Library

[7]

J. Hensley, T. Scheuermann, G. Coombe, M. Singh, and A. Lastra. "Fast Summed-Area Table Generation and its Applications." Computer Graphics Forum, 24(3):547--555. 2005.

[8]

W.D. Hillis and G.L. Steele. "Data Parallel Algorithms." Communications of the ACM, 29(12): 1170--1183. 1986.

Digital Library

[9]

R.M. Karp, R.E. Miller, and S. Winograd. "The Organization of Computations for Uniform Recurrence equations." Journal of the ACM, 14:3, pp. 563--590. 1967.

Digital Library

[10]

P.M. Kogge and H.S. Stone. "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations." IEEE Transactions on Computers, 22(8):786--793. 1973.

Digital Library

[11]

S. Maleki, A. Yang, and M. Burtscher. "Higher-Order and Tuple-Based Massively-Parallel Prefix Sums." In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 539--552. 2016.

Digital Library

[12]

B. Merry. "A Performance Comparison of Sort and Scan Libraries for GPUs." World Scientific Publishing Company. 2014.

[13]

D. Merrill and M. Garland. "Single-Pass Parallel Prefix Scan with Decoupled Look-back." NVIDIA Technical Report NVR-2016-002. 2016.

[14]

n-nacci numbers: https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers, accessed 8/8/2017.

[15]

D. Nehab, A. Maximo, R.S. Lima, and H. Hoppe. "GPU-Efficient Recursive Filtering and Summed-Area Tables." In Proceedings of the SIGGRAPH Asia Conference, pp. 176:1--176:12. 2011.

Digital Library

[16]

A.V. Oppenheim and R.W. Schafer. "Discrete-Time Signal Processing." 3rd Edition. Prentice Hall. 2009.

Digital Library

[17]

Rec: https://github.com/mit-gfx/recfilter, accessed 8/8/2017.

[18]

SAM: http://cs.txstate.edu/~burtscher/research/SAM/, accessed 8/8/2017.

[19]

S. Sengupta, A.E. Lefohn, and J.D. Owens. "A Work-Efficient Step-Efficient Prefix Sum Algorithm." In Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, pp. 26--27. 2006.

[20]

S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. "Scan Primitives for GPU Computing." In Proceedings of Graphics Hardware, pp. 97--106. 2007.

Digital Library

[21]

S. Sengupta, M. Harris, and M. Garland. "Efficient Parallel Scan Algorithms for GPUs." NVIDIA. 2008 - gpucomputing.net.

[22]

S.W. Smith. "Digital Signal Processing: A Practical Guide for Engineers and Scientists." Newnes, 2002. ISBN 0--7506--7444-X.

[23]

H.S. Stone. "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations." Journal of the ACM, 20(1):27--38. 1973.

Digital Library

[24]

W. Sung and S. Mitra. "Efficient Multi-Processor Implementation of Recursive Digital Filters." In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 11:257--260. 1986.

[25]

W. Thies, M. Karczmarek, and S.P. Amarasinghe. "StreamIt: A Language for Streaming Applications." In Proceedings of the 11th International Conference on Compiler Construction, pp. 179--196. 2002.

Digital Library

Cited By

Bielecki WBłaszyński P(2021)Parallel Tiled Code for Computing General Linear Recurrence EquationsElectronics10.3390/electronics1017205010:17(2050)Online publication date: 25-Aug-2021
https://doi.org/10.3390/electronics10172050
Maleki SAgarwal UBurtscher MPingali KLee JPetrank E(2021)BiPartProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3437801.3441611(161-174)Online publication date: 17-Feb-2021
https://dl.acm.org/doi/10.1145/3437801.3441611
Huang YDi SYu XLi GCappello FMohror KArnold DBadia R(2023)cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End PerformanceProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607048(1-13)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607048
Show More Cited By

Index Terms

Automatic Hierarchical Parallelization of Linear Recurrences
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent algorithms
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms

Recommendations

Automatic Hierarchical Parallelization of Linear Recurrences
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-...
Low complexity algorithms for linear recurrences
ISSAC '06: Proceedings of the 2006 international symposium on Symbolic and algebraic computation

We consider two kinds of problems: the computation of polynomial and rational solutions of linear recurrences with coefficients that are polynomials with integer coefficients; indefinite and definite summation of sequences that are hypergeometric over ...
A comparison of automatic parallelization tools/compilers on the SGI origin 2000
SC '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing

Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is time consuming and costly, porting codes would ideally be automated by using some parallelization ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices

ACM SIGPLAN Notices Volume 53, Issue 2

ASPLOS '18

February 2018

809 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3296957

Editor:
Matthew Fluet
Rodchester Institude of Technology

Issue’s Table of Contents

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018
827 pages
ISBN:9781450349116
DOI:10.1145/3173162
General Chairs:
Xipeng Shen
North Carolina State University, USA
,
James Tuck
North Carolina State University, USA
,
Program Chairs:
Ricardo Bianchini
Microsoft Research, USA
,
Vivek Sarkar
Georgia Institute of Technology, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018

Published in SIGPLAN Volume 53, Issue 2

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Science Foundation

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
792
Total Downloads

Downloads (Last 12 months)116
Downloads (Last 6 weeks)17

Reflects downloads up to 29 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Bielecki WBłaszyński P(2021)Parallel Tiled Code for Computing General Linear Recurrence EquationsElectronics10.3390/electronics1017205010:17(2050)Online publication date: 25-Aug-2021
https://doi.org/10.3390/electronics10172050
Maleki SAgarwal UBurtscher MPingali KLee JPetrank E(2021)BiPartProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3437801.3441611(161-174)Online publication date: 17-Feb-2021
https://dl.acm.org/doi/10.1145/3437801.3441611
Huang YDi SYu XLi GCappello FMohror KArnold DBadia R(2023)cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End PerformanceProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607048(1-13)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607048
Maximo A(2021)GPU efficient 1D and 3D recursive filteringDigital Signal Processing10.1016/j.dsp.2021.103076114(103076)Online publication date: Jul-2021
https://doi.org/10.1016/j.dsp.2021.103076
Xia YJiang PAgrawal GGupta RShen X(2020)Scaling out speculative execution of finite-state machines with parallel mergeProceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3332466.3374524(160-172)Online publication date: 19-Feb-2020
https://dl.acm.org/doi/10.1145/3332466.3374524
Bahig HFathy K(2020)An efficient parallel strategy for high-cost prefix operationThe Journal of Supercomputing10.1007/s11227-020-03473-xOnline publication date: 5-Nov-2020
https://doi.org/10.1007/s11227-020-03473-x
Xia YJiang PAgrawal GAmaral JKulkarni M(2019)Enabling prefix sum parallelism pattern for recurrences with principled function reconstructionProceedings of the 28th International Conference on Compiler Construction10.1145/3302516.3307354(17-28)Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3302516.3307354

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents