DOI: 10.1145/3302516.3307350
research-article

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion

Published: 16 February 2019

Abstract

OpenCL offers code portability but no performance portability. Given an OpenCL program X specifically written for one platform P, existing OpenCL compilers, which usually optimize its host and kernel code individually, often yield poor performance on another platform Q. Instead of obtaining a performance-improved version of X for Q via manual tuning, we aim to achieve this automatically with a source-to-source OpenCL compiler framework, PPOpenCL. By fusing X's host and kernel thread code (with the operations in different work-items in the same work-group represented explicitly), we are able to apply data flow analyses and, subsequently, performance-enhancing optimizations on a fused control flow graph specifically for platform Q. Validation against OpenCL benchmarks shows that PPOpenCL (implemented in Clang 3.9.1) can achieve significantly improved portable performance on the seven platforms considered.
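To make the idea of host and kernel code fusion concrete, here is a minimal sketch. It is not PPOpenCL's actual representation or output. It only models, in plain Python, how a kernel launch that is opaque to the host can be rewritten as explicit loops over work-groups and work-items. All names (scale_kernel, run_unfused, run_fused, local_size) are hypothetical and chosen for illustration.

```python
# Hypothetical illustration only: the names and structure here are
# assumptions, not PPOpenCL's real intermediate form or generated code.

def scale_kernel(gid, data, factor):
    """Body of a toy kernel: one work-item scales one element."""
    data[gid] = data[gid] * factor


def run_unfused(data, factor):
    """Host code that treats the kernel launch as an opaque call."""
    for gid in range(len(data)):
        # Each iteration stands in for one independently launched work-item.
        scale_kernel(gid, data, factor)
    return data


def run_fused(data, factor, local_size):
    """Host and kernel code in one routine, with work-groups and
    work-items written out as explicit nested loops."""
    global_size = len(data)
    if global_size % local_size != 0:
        raise ValueError("toy example: local_size must divide the data length")
    num_groups = global_size // local_size
    for group_id in range(num_groups):        # iterate over work-groups
        for local_id in range(local_size):    # work-items within one group
            gid = group_id * local_size + local_id
            data[gid] = data[gid] * factor    # kernel body, inlined
    return data
```

Both routines compute the same result, for example run_fused([1, 2, 3, 4], 2, 2) returns [2, 4, 6, 8]. In the fused form the work-item loops and the host-side values (such as factor and local_size) sit in one routine, which is the kind of setting in which a compiler could, in principle, run ordinary data flow analyses and loop transformations across the host/kernel boundary.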



    Published In

    CC 2019: Proceedings of the 28th International Conference on Compiler Construction
    February 2019
    204 pages
    ISBN:9781450362771
    DOI:10.1145/3302516
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Compiler
    2. Heterogeneous computing
    3. OpenCL

    Conference

    CC '19
