DOI: 10.1109/MICRO.2014.23

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Published: 13 December 2014

Abstract

Recent work has explored using higher-level languages to improve programmer productivity on GPUs. These languages often utilize high-level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications.
To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x, respectively.
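
The constrained search described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the candidate block sizes, the two soft constraints, and their weights are all assumptions made for illustration. Each nest level is assigned a logical dimension and a block size, a hard constraint rejects mappings that exceed the per-block thread limit, and soft constraints score the mappings that remain.

```python
from itertools import permutations, product
from math import prod

MAX_THREADS_PER_BLOCK = 1024  # per-block thread limit on current CUDA GPUs
WARP_SIZE = 32                # threads per warp on NVIDIA GPUs
BLOCK_SIZES = (1, 32, 64, 128, 256, 512, 1024)  # assumed candidate sizes


def candidate_mappings(num_levels):
    """Yield every assignment of (dimension, block_size) to each nest level.

    Supports up to three levels, one per logical dimension x, y, z.
    """
    dims = ("x", "y", "z")[:num_levels]
    for dim_order in permutations(dims):
        for sizes in product(BLOCK_SIZES, repeat=num_levels):
            yield tuple(zip(dim_order, sizes))


def satisfies_hard(mapping):
    """Hard constraint: a block may not exceed the hardware thread limit."""
    return prod(size for _, size in mapping) <= MAX_THREADS_PER_BLOCK


def soft_score(mapping, contiguous_level):
    """Soft constraints (assumed weights): higher is better."""
    score = 0
    dim, size = mapping[contiguous_level]
    # Prefer the level with contiguous memory accesses on dimension x with a
    # warp-multiple block size, so that neighboring threads touch neighboring
    # addresses.
    if dim == "x" and size % WARP_SIZE == 0:
        score += 10
    # Mildly prefer blocks with enough threads to keep the GPU busy.
    if prod(size for _, size in mapping) >= 256:
        score += 1
    return score


def select_mapping(num_levels, contiguous_level):
    """Prune with the hard constraint, then pick the best-scoring mapping."""
    feasible = (m for m in candidate_mappings(num_levels) if satisfies_hard(m))
    return max(feasible, key=lambda m: soft_score(m, contiguous_level))


if __name__ == "__main__":
    # Example: a Map over matrix rows with a nested Reduce over each row's
    # columns; level 1 (the inner Reduce) reads contiguous memory in a
    # row-major layout.
    print(select_mapping(2, contiguous_level=1))
```

For the row-wise reduction in the example, the sketch places the inner level on dimension x, which is the choice a simple 1D mapping over rows alone would miss.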

Published In

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
December 2014
697 pages
ISBN:9781479969982

Publisher

IEEE Computer Society

United States

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

MICRO-47

Acceptance Rates

MICRO-47 Paper Acceptance Rate 53 of 279 submissions, 19%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2018) Tigr. ACM SIGPLAN Notices 53:2 (622-636). DOI: 10.1145/3296957.3173180. Online publication date: 19-Mar-2018
  • (2018) Tigr. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (622-636). DOI: 10.1145/3173162.3173180. Online publication date: 19-Mar-2018
  • (2017) Lift: a functional data-parallel IR for high-performance GPU code generation. Proceedings of the 2017 International Symposium on Code Generation and Optimization (74-85). DOI: 10.5555/3049832.3049841. Online publication date: 4-Feb-2017
  • (2017) Plasticine. ACM SIGARCH Computer Architecture News 45:2 (389-402). DOI: 10.1145/3140659.3080256. Online publication date: 24-Jun-2017
  • (2017) Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. ACM SIGPLAN Notices 52:6 (556-571). DOI: 10.1145/3140587.3062354. Online publication date: 14-Jun-2017
  • (2017) Strategies for regular segmented reductions on GPU. Proceedings of the 6th ACM SIGPLAN International Workshop on Functional High-Performance Computing (42-52). DOI: 10.1145/3122948.3122952. Online publication date: 7-Sep-2017
  • (2017) Plasticine. Proceedings of the 44th Annual International Symposium on Computer Architecture (389-402). DOI: 10.1145/3079856.3080256. Online publication date: 24-Jun-2017
  • (2017) Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (556-571). DOI: 10.1145/3062341.3062354. Online publication date: 14-Jun-2017
  • (2017) Enable back memory and global synchronization on LLC buffer. The Journal of Supercomputing 73:12 (5414-5439). DOI: 10.1007/s11227-017-2093-8. Online publication date: 1-Dec-2017
  • (2016) Efficient kernel synthesis for performance portable programming. The 49th Annual IEEE/ACM International Symposium on Microarchitecture (1-13). DOI: 10.5555/3195638.3195653. Online publication date: 15-Oct-2016
