
DOI: 10.1145/3033019.3033022

Optimized two-level parallelization for GPU accelerators using the polyhedral model

Published: 05 February 2017

Abstract

While GPUs play an increasingly important role in today's high-performance computers, optimizing GPU performance continues to impose large burdens upon programmers. A major challenge in optimizing codes for GPUs stems from the two levels of hardware parallelism, blocks and threads; each of these levels has significantly different characteristics, requiring different optimization strategies.
In this paper, we propose a novel compiler optimization algorithm for GPU parallelism. Our approach is based on the polyhedral model, which has enabled significant advances in program analysis and transformation compared to traditional AST-based frameworks. We extend polyhedral schedules to enable two-level parallelization through the idea of superposition, which integrates separate schedules for block-level and thread-level parallelism. Our experimental results demonstrate that our proposed compiler optimization framework can deliver 1.8x and 2.1x geometric mean improvements on NVIDIA Tesla M2050 and K80 GPUs, compared to a state-of-the-art polyhedral parallel code generator (PPCG) for GPGPUs.
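The two-level decomposition the abstract describes can be sketched in plain Python (our illustration, not the paper's implementation): a flat iteration space is split into block-level and thread-level coordinates, mirroring the familiar `i = blockIdx.x * blockDim.x + threadIdx.x` mapping that generated CUDA code uses.

```python
# Illustrative sketch (not from the paper): decompose a 1-D iteration
# space i in [0, n) into the two GPU parallelism levels that the
# paper's superposed schedules target: blocks and threads.

def two_level_schedule(n, threads_per_block):
    """Map each iteration i to (block_id, thread_id), the inverse of
    i = blockIdx.x * blockDim.x + threadIdx.x in generated CUDA code."""
    mapping = {}
    for i in range(n):
        mapping[i] = (i // threads_per_block, i % threads_per_block)
    return mapping

m = two_level_schedule(10, 4)
# iteration 5 lands in block 1, thread 1 when blockDim.x == 4
assert m[5] == (1, 1)
```

The point of the paper's superposition idea is that the schedules chosen for the block level and the thread level need not be the same; this toy mapping only shows how one iteration space is shared between the two levels.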




    Published In

    CC 2017: Proceedings of the 26th International Conference on Compiler Construction
    February 2017
    141 pages
    ISBN:9781450352338
    DOI:10.1145/3033019

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. CUDA code generation
    2. GPUs
    3. program transformations
    4. data locality optimizations
    5. memory coalescing
    6. parallelization
    7. polyhedral model

    Qualifiers

    • Research-article

    Conference

    CC '17: Compiler Construction
    February 5-6, 2017
    Austin, TX, USA


    Cited By

    • (2022) Optimizing GPU deep learning operators with polyhedral scheduling constraint injection. In Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pages 313-324. DOI: 10.1109/CGO53902.2022.9741260
    • (2022) Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing. In Euro-Par 2022: Parallel Processing, pages 350-366. DOI: 10.1007/978-3-031-12597-3_22
    • (2021) LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs. In 2021 IEEE International Conference on Networking, Architecture and Storage (NAS), pages 1-8. DOI: 10.1109/NAS51552.2021.9605411
    • (2020) Locality-Centric Data and Threadblock Management for Massive GPUs. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1022-1036. DOI: 10.1109/MICRO50266.2020.00086
    • (2019) Performance evaluation of OpenMP's target construct on GPUs: exploring compiler optimisations. International Journal of High Performance Computing and Networking, 13(1):54-69. DOI: 10.5555/3302714.3302718
    • (2018) Compiler assisted coalescing. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-11. DOI: 10.1145/3243176.3243203
    • (2018) Warp-Consolidation. In Proceedings of the 2018 International Conference on Supercomputing, pages 53-64. DOI: 10.1145/3205289.3205294
    • (2018) Polyhedral expression propagation. In Proceedings of the 27th International Conference on Compiler Construction, pages 25-36. DOI: 10.1145/3178372.3179529
