research-article

Unified Programming Models for Heterogeneous High-Performance Computers

Authors:

Wei-Min ZhengAuthors Info & Claims

Journal of Computer Science and Technology, Volume 38, Issue 1

Pages 211 - 218

https://doi.org/10.1007/s11390-023-2888-4

Published: 31 January 2023 Publication History

Abstract

Unified programming models can effectively improve program portability on various heterogeneous high-performance computers. Existing unified programming models put a lot of effort to code portability but are still far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model including four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction unifies the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. Resource-aware lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This is a perspective paper to show our viewpoints on programming models for emerging heterogeneous systems.

References

[1]

Dongarra JJ, Meuer HW, and Strohmaier E Top500 supercomputer sites Supercomputer 1997 13 1 89-111

[2]

Vazhkudai S S, de Supinski B R, Bland A S et al. The design, deployment, and evaluation of the CORAL pre-exascale systems. In Proc. the 2018 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2018, pp.661–672.

[3]

Fu HH, Liao JF, Yang JZ, et al. The Sunway Taihu-Light supercomputer: System and applications Science China Information Sciences 2016 59 7 072001

[4]

Fu H H, Liao J F, Xue W et al. Refactoring and optimizing the community atmosphere model (CAM) on the Sunway TaihuLight supercomputer. In Proc. the 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2016, pp.969–980.

[5]

Neale R B, Gettelman A, Park S et al. Description of the NCAR community atmosphere model (CAM 5.0). No. NCAR/TN-486+STR, 2010.

[6]

Edwards H C, Trott C R, Sunderland D. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing, 2014, 74(12): 3202–3216.

[7]

Trott C R, Lebrun-Grandié D, Arndt D et al. Kokkos 3: Programming model extensions for the exascale era. IEEE Trans. Parallel and Distributed Systems, 2022, 33(4): 805–817.

[8]

Beckingsale D A, Burmark J, Hornung R et al. RAJA: Portable performance for large-scale scientific applications. In Proc. the 2019 IEEE/ACM International workshop on Performance, Portability and Productivity in HPC (P3HPC), Nov. 2019, pp.71–81.

[9]

Reinders J, Ashbaugh B, Brodman J, Kinsner M, Pennycook J, Tian X M. Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL. Springer Nature, 2021.

[10]

Pennycook SJ, Sewall JD, and Lee VW Implications of a metric for performance portability Future Generation Computer Systems 2019 92 947-958

[11]

Lin W C, McIntosh-Smith S. Comparing Julia to performance portable parallel programming models for HPC. In Proc. the 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov. 2021, pp.94–105.

[12]

Ma Z X, He J A, Qiu J Z et al. BaGuaLu: Targeting brain scale pretrained models with over 37 million cores. In Proc. the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Apr. 2022, pp.192–204.

[13]

Zhang YM, Lu K, and Chen WG Processing extreme-scale graphs on China’s supercomputers Communications of the ACM 2021 64 11 60-63

[14]

Zhang Y, Yang M, Baghdadi R, Kamil S, Shun J. Graphit: A high-performance graph DSL. Proceedings of the ACM on Programming Languages, 2018, 2(OOPSLA): Article No. 121.

[15]

Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proc. the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2013, pp.519–530.

[16]

Chen T Q, Moreau T, Jiang Z H et al. TVM: An automated end-to-end optimizing compiler for deep learning. In Proc. the 13th USENIX Conference on Operating Systems Design and Implementation, Oct. 2018, pp.579–594.

[17]

Ben-Nun T, de Fine Licht J, Ziogas A N, Schneider T, Hoefler T. Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures. In Proc. the 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2019, Article No. 81.

[18]

Ziogas A N, Ben-Nun T, Fernández G I, Schneider T, Luisier M, Hoefler T. A data-centric approach to extremescale ab initio dissipative quantum transport simulations. In Proc. the 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2019, Article No. 1.

[19]

Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. In Proc. the 2004 International Symposium on Code Generation and Optimization, Mar. 2004, pp.75–86.

[20]

Lattner C, Amini M, Bondhugula U, Cohen A, Davis A, Pienaar J, Riddle R, Shpeisman T, Vasilache N, Zinenko O. MLIR: A compiler infrastructure for the end of Moore’s law. arXiv: 2002.11054, 2020. https://arxiv.org/abs/2002.11054, Mar. 2020.

[21]

Gysi T, Müller C, Zinenko O, Herhut S, Davis E, Wicky T, Fuhrer O, Hoefler T, Grosser T. Domain-specific multi-level IR rewriting for GPU: The open earth compiler for GPU-accelerated climate simulation. ACM Transactions on Architecture and Code Optimization, 2021, 18(4): Article No. 51.

[22]

McCaskey A, Nguyen T. A MLIR dialect for quantum assembly languages. In Proc. the 2021 IEEE International Conference on Quantum Computing and Engineering, Oct. 2021, pp.255–264.

[23]

Yoo A B, Jette M A, Grondona M. SLURM: Simple Linux utility for resource management. In Proc. the 9th International Workshop on Job Scheduling Strategies for Parallel Processing, Jun. 2003, pp.44–60.

[24]

Bode B, Halstead D M, Kendall R et al. The portable batch scheduler and the Maui scheduler on Linux clusters. In Proc. the 4th Annual Linux Showcase & Conference, Oct. 2000.

[25]

Vavilapalli V K, Murthy A C, Douglas C et al. Apache Hadoop YARN: Yet another resource negotiator. In Proc. the 4th Annual Symposium on Cloud Computing, Oct. 2013, Article No. 5.

[26]

Hindman B, Konwinski A, Zaharia M et al. Mesos: A platform for fine-grained resource sharing in the data center. In Proc. the 8th USENIX Conference on Networked Systems Design and Implementation, Mar. 2011, pp.295–308.

[27]

Tang X C, Wang H J, Ma X S et al. Spread-n-Share: Improving application performance and cluster throughput with resource-aware job placement. In Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2019, Article No. 12.

Recommendations

Evaluation of directive-based performance portable programming models

We present an extended exploration of the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators. To do this, we use examples of algorithms with varying ...
Performance portable C++ programming with RAJA
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming

With the rapid change of computing architectures, and variety of programming models; the ability to develop performance portable applications has become of great importance. This is particularly true in large production codes where developing and ...
Programming model for a heterogeneous x86 platform
PLDI '09

The client computing platform is moving towards a heterogeneous architecture consisting of a combination of cores focused on scalar performance, and a set of throughput-oriented cores. The throughput oriented cores (e.g. a GPU) may be connected over ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Computer Science and Technology

Journal of Computer Science and Technology Volume 38, Issue 1

Feb 2023

218 pages

ISSN:1000-9000

Issue’s Table of Contents

© Institute of Computing Technology, Chinese Academy of Sciences 2023.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 31 January 2023

Accepted: 10 January 2023

Received: 05 October 2022

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 03 Jan 2025

Other Metrics

View Author Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents