research-article

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion

Authors:

Y. Liu, L. Huang, M. Wu, H. Cui, F. Lv, X. Feng, and Jingling Xue

CC 2019: Proceedings of the 28th International Conference on Compiler Construction

Pages 2 - 16

https://doi.org/10.1145/3302516.3307350

Published: 16 February 2019

Abstract

OpenCL offers code portability but no performance portability. Given an OpenCL program X written specifically for one platform P, existing OpenCL compilers, which usually optimize its host and kernel codes individually, often yield poor performance on another platform Q. Instead of obtaining a performance-improved version of X for Q via manual tuning, we aim to achieve this automatically with a source-to-source OpenCL compiler framework, PPOpenCL. By fusing X's host and kernel thread codes (with the operations in different work-items in the same work-group represented explicitly), we are able to apply data flow analyses and, subsequently, performance-enhancing optimizations on a fused control flow graph specifically for platform Q. Validation against OpenCL benchmarks shows that PPOpenCL (implemented in Clang 3.9.1) can achieve significantly improved portable performance on the seven platforms considered.
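The idea of representing the work-items of a work-group explicitly next to the host code can be pictured with a small sketch in plain C. It is not PPOpenCL's actual representation or output: the scaling kernel and the names `scale_work_item` and `fused_scale` are hypothetical, chosen only for illustration, and the sketch assumes a one-dimensional index range whose size is a multiple of the work-group size.

```c
#include <stddef.h>

/* Hypothetical kernel body for one work-item. In OpenCL C this would be
 * roughly: out[gid] = in[gid] * factor, with gid = get_global_id(0). */
static void scale_work_item(size_t group_id, size_t local_id, size_t local_size,
                            const float *in, float *out, float factor)
{
    /* Global index derived from the work-group and work-item indices. */
    size_t gid = group_id * local_size + local_id;
    out[gid] = in[gid] * factor;
}

/* Illustrative "fused" view: the host-side launch geometry and the kernel
 * body live in one function, with work-groups and work-items written as
 * explicit loops. Assumes n is a multiple of local_size. */
void fused_scale(const float *in, float *out, size_t n,
                 size_t local_size, float factor)
{
    size_t num_groups = n / local_size;
    for (size_t g = 0; g < num_groups; ++g) {          /* each work-group */
        for (size_t l = 0; l < local_size; ++l) {      /* each work-item  */
            scale_work_item(g, l, local_size, in, out, factor);
        }
    }
}
```

Once host values such as `local_size` and `factor` are in the same function as the per-work-item code, a standard data flow analysis can in principle relate them to their uses inside the kernel body, which is the kind of visibility the abstract attributes to the fused control flow graph.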

Cited By

Halbiniak K, Szustak L, Olas T, Wyrzykowski R, Gepner P (2020). Exploration of OpenCL Heterogeneous Programming for Porting Solidification Modeling to CPU‐GPU Platforms. Concurrency and Computation: Practice and Experience 33:4. Online publication date: 9-Oct-2020.
https://doi.org/10.1002/cpe.6011

Index Terms

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Information & Contributors

Information

Published In

CC 2019: Proceedings of the 28th International Conference on Compiler Construction

February 2019

204 pages

ISBN:9781450362771

DOI:10.1145/3302516

General Chair: J. Nelson Amaral (University of Alberta, Canada)

Program Chair: Milind Kulkarni (Purdue University, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CC '19

CC '19: 28th International Conference on Compiler Construction

February 16 - 17, 2019

Washington, DC, USA

Bibliometrics & Citations

Total Citations: 1
Total Downloads: 369
Downloads (Last 12 months): 23
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Nov 2024
