Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Authors: HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Kunle Olukotun

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 63 - 74

https://doi.org/10.1109/MICRO.2014.23

Published: 13 December 2014

Abstract

Recent work has explored using higher-level languages to improve programmer productivity on GPUs. These languages often rely on high-level computation patterns (e.g., Map and Reduce) that encode parallel semantics and so enable automatic compilation to GPU kernels. However, efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications.

To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations, guided by the mapping, that avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x, respectively.
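The constraint-driven selection described above can be illustrated with a small sketch. This is not the paper's algorithm: the candidate block sizes, the 1024-thread limit, the warp size of 32, the two soft constraints, and every function name below are assumptions chosen for illustration, and the degree-of-parallelism parameter is left out for brevity. The idea shown is only the general shape: enumerate one block size per nesting level, discard mappings that violate a hard constraint, and rank the rest by how many soft constraints they satisfy.

```python
import math
from itertools import product

# Assumed hardware figures (typical for CUDA GPUs, not taken from the paper).
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32

# Hypothetical candidate block sizes for each nesting level.
BLOCK_SIZES = (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)


def candidate_mappings(depth):
    """Enumerate one block size per nesting level (index 0 = outermost)."""
    return product(BLOCK_SIZES, repeat=depth)


def satisfies_hard_constraints(blocks):
    """Hard constraint: the thread block must fit on the device."""
    return math.prod(blocks) <= MAX_THREADS_PER_BLOCK


def soft_score(blocks, contiguous_level):
    """Soft constraints: each satisfied preference adds to the score."""
    score = 0
    # Prefer a warp-multiple block size on the level that reads memory
    # contiguously, so that accesses can coalesce (weighted higher).
    if blocks[contiguous_level] % WARP_SIZE == 0:
        score += 2
    # Prefer reasonably large thread blocks (assumed threshold of 256).
    if math.prod(blocks) >= 256:
        score += 1
    return score


def select_mapping(depth, contiguous_level):
    """Prune by hard constraints, then pick the highest-scoring mapping."""
    feasible = [b for b in candidate_mappings(depth)
                if satisfies_hard_constraints(b)]
    return max(feasible, key=lambda b: soft_score(b, contiguous_level))


if __name__ == "__main__":
    # Two nested patterns; the inner one (level 1) accesses memory contiguously.
    print(select_mapping(depth=2, contiguous_level=1))
```

Several mappings tie for the top score in this toy setup; a realistic selector would need further soft constraints or a cost model to break such ties.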

Index Terms

Locality-Aware Mapping of Nested Parallel Patterns on GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Hardware validation
  2. Integrated circuits

Published In

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 2014, 697 pages

ISBN: 9781479969982

General Chair: Krisztian Flautner (ARM)

Program Chairs: Thomas F. Wenisch (University of Michigan), Emre Ozer (ARM)

Publications Chair: Michael Ferdman (Stony Brook University)

Publisher

IEEE Computer Society, United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

MICRO-47

Sponsor:

SIGMICRO

MICRO-47: The 47th Annual IEEE/ACM International Symposium on Microarchitecture

December 13 - 17, 2014

Cambridge, United Kingdom

Acceptance Rates

MICRO-47 Paper Acceptance Rate 53 of 279 submissions, 19%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%
