
DOI: 10.1145/3033019.3033022

Optimized two-level parallelization for GPU accelerators using the polyhedral model

Published: 05 February 2017

Abstract

While GPUs play an increasingly important role in today's high-performance computers, optimizing GPU performance continues to impose large burdens upon programmers. A major challenge in optimizing codes for GPUs stems from the two levels of hardware parallelism, blocks and threads; each of these levels has significantly different characteristics, requiring different optimization strategies.
In this paper, we propose a novel compiler optimization algorithm for GPU parallelism. Our approach is based on the polyhedral model, which has enabled significant advances in program analysis and transformation compared to traditional AST-based frameworks. We extend polyhedral schedules to enable two-level parallelization through the idea of superposition, which integrates separate schedules for block-level and thread-level parallelism. Our experimental results demonstrate that our proposed compiler optimization framework can deliver 1.8x and 2.1x geometric mean improvements on NVIDIA Tesla M2050 and K80 GPUs, compared to a state-of-the-art polyhedral parallel code generator (PPCG) for GPGPUs.
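The two-level decomposition the abstract describes can be sketched in plain Python (our illustration, not the paper's implementation): a flat iteration space is split into block-level and thread-level coordinates, mirroring the familiar `i = blockIdx.x * blockDim.x + threadIdx.x` mapping that generated CUDA code uses.

```python
# Illustrative sketch (not from the paper): decompose a 1-D iteration
# space i in [0, n) into the two GPU parallelism levels that the
# paper's superposed schedules target: blocks and threads.

def two_level_schedule(n, threads_per_block):
    """Map each iteration i to (block_id, thread_id), the inverse of
    i = blockIdx.x * blockDim.x + threadIdx.x in generated CUDA code."""
    mapping = {}
    for i in range(n):
        mapping[i] = (i // threads_per_block, i % threads_per_block)
    return mapping

m = two_level_schedule(10, 4)
# iteration 5 lands in block 1, thread 1 when blockDim.x == 4
assert m[5] == (1, 1)
```

The point of the paper's superposition idea is that the schedules chosen for the block level and the thread level need not be the same; this toy mapping only shows how one iteration space is shared between the two levels.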




    Published In

    CC 2017: Proceedings of the 26th International Conference on Compiler Construction
    February 2017
    141 pages
    ISBN:9781450352338
    DOI:10.1145/3033019

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. CUDA code generation
    2. GPUs
    3. program transformations
    4. data locality optimizations
    5. memory coalescing
    6. parallelization
    7. polyhedral model

    Qualifiers

    • Research-article

    Conference

    CC '17: Compiler Construction
    February 5-6, 2017
    Austin, TX, USA


    Cited By

    • (2022) Optimizing GPU deep learning operators with polyhedral scheduling constraint injection. In Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pages 313-324. DOI: 10.1109/CGO53902.2022.9741260
    • (2022) Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing. In Euro-Par 2022: Parallel Processing, pages 350-366. DOI: 10.1007/978-3-031-12597-3_22
    • (2021) LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs. In 2021 IEEE International Conference on Networking, Architecture and Storage (NAS), pages 1-8. DOI: 10.1109/NAS51552.2021.9605411
    • (2020) Locality-Centric Data and Threadblock Management for Massive GPUs. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1022-1036. DOI: 10.1109/MICRO50266.2020.00086
    • (2019) Performance evaluation of OpenMP's target construct on GPUs: exploring compiler optimisations. International Journal of High Performance Computing and Networking, 13(1):54-69. DOI: 10.5555/3302714.3302718
    • (2018) Compiler assisted coalescing. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-11. DOI: 10.1145/3243176.3243203
    • (2018) Warp-Consolidation. In Proceedings of the 2018 International Conference on Supercomputing, pages 53-64. DOI: 10.1145/3205289.3205294
    • (2018) Polyhedral expression propagation. In Proceedings of the 27th International Conference on Compiler Construction, pages 25-36. DOI: 10.1145/3178372.3179529
