Designing a Tunable Nested Data-Parallel Programming System

Published: 28 December 2016

Abstract

This article describes Surge, a nested data-parallel programming system designed to simplify the porting and tuning of parallel applications to multiple target architectures. Surge decouples high-level specification of computations, expressed using a C++ programming interface, from low-level implementation details using two first-class constructs: schedules and policies. Schedules describe the valid ways in which data-parallel operators may be implemented, while policies encapsulate a set of parameters that govern platform-specific code generation. These two mechanisms are used to implement a code generation system that analyzes computations and automatically generates a search space of valid platform-specific implementations. An input- and architecture-adaptive autotuning system then explores this search space to find optimized implementations. We express five real-world benchmarks from domains such as machine learning and sparse linear algebra in Surge, and from their high-level specifications, Surge automatically generates CPU and GPU implementations that perform on par with or better than manually optimized versions.


Cited By

  • (2022) Simplified High Level Parallelism Expression on Heterogeneous Systems through Data Partition Pattern Description. The Computer Journal 66, 6, 1400--1418. DOI: 10.1093/comjnl/bxac017. Online publication date: 14 March 2022.
  • (2019) An Autotuning Protocol to Rapidly Build Autotuners. ACM Transactions on Parallel Computing 5, 2, 1--25. DOI: 10.1145/3291527. Online publication date: 4 January 2019.


Published In

ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 4
December 2016
648 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3012405
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 December 2016
Accepted: 01 October 2016
Revised: 01 September 2016
Received: 01 June 2016
Published in TACO Volume 13, Issue 4

Author Tags

  1. Nested data parallelism
  2. autotuning
  3. performance portability

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • U.S. Government
  • DARPA

