research-article

Mixed-data-model heterogeneous compilation and OpenMP offloading

Authors:

Björn Forsberg,

Alessandro Capotondi,

Andrea Marongiu,

Tobias Grosser,

Luca Benini

CC 2020: Proceedings of the 29th International Conference on Compiler Construction

Pages 119 - 131

https://doi.org/10.1145/3377555.3377891

Published: 24 February 2020

Abstract

Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever-more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and the accelerator. Today 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption.

In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
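
To make the address-width discrepancy concrete, the following Python sketch works through the arithmetic for a 64-bit host and a 32-bit accelerator. It is only an illustration of the problem statement, not the compiler technique proposed in the paper; the helper names (`addressable_bytes`, `fits_native`, `truncate`) and both example addresses are hypothetical.

```python
# Illustrative sketch only: all names and addresses here are hypothetical
# and are not taken from the paper or from LLVM.

HOST_ADDR_BITS = 64   # host data model, e.g. 64-bit pointers
ACCEL_ADDR_BITS = 32  # accelerator data model, e.g. 32-bit pointers


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses a pointer of `bits` width can name."""
    return 1 << bits


def fits_native(addr: int, bits: int = ACCEL_ADDR_BITS) -> bool:
    """True if `addr` can be held in a native pointer of the given width."""
    return 0 <= addr < addressable_bytes(bits)


def truncate(addr: int, bits: int = ACCEL_ADDR_BITS) -> int:
    """What a naive narrowing cast would do: keep only the low `bits` bits."""
    return addr & (addressable_bytes(bits) - 1)


local_buf = 0x1000_0000            # hypothetical accelerator-local address
host_buf = 0x0000_7F3A_1000_0000   # hypothetical host virtual address

print(addressable_bytes(ACCEL_ADDR_BITS) // 2**30, "GiB")  # 4 GiB
print(fits_native(local_buf))    # True
print(fits_native(host_buf))     # False
print(hex(truncate(host_buf)))   # 0x10000000, the same value as local_buf
```

The last line shows why shared host memory cannot simply be accessed through the accelerator's native pointer type: a narrowed host address can alias an unrelated local address, so a compiler has to keep track of which pointers need the wider representation.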

Cited By

Kasmeridis I, Dimakopoulos V (2022). OpenMP Offloading in the Jetson Nano Platform. Workshop Proceedings of the 51st International Conference on Parallel Processing, 1-8. Online publication date: 29-Aug-2022. https://dl.acm.org/doi/10.1145/3547276.3548517

Kurth A, Forsberg B, Benini L (2022). HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 33:12, 4368-4382. Online publication date: 1-Dec-2022. https://doi.org/10.1109/TPDS.2022.3189390

Cavalcante M, Kurth A, Schuiki F, Benini L, Palesi M, Palermo G, Graves C, Arima E (2020). Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems. Proceedings of the 17th ACM International Conference on Computing Frontiers, 81-88. Online publication date: 11-May-2020. https://dl.acm.org/doi/10.1145/3387902.3392631

Index Terms

Mixed-data-model heterogeneous compilation and OpenMP offloading
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Published In

CC 2020: Proceedings of the 29th International Conference on Compiler Construction

February 2020

222 pages

ISBN:9781450371209

DOI:10.1145/3377555

General Chairs: Louis-Noël Pouchet (Colorado State University, USA), Alexandra Jimborean (Uppsala University, Sweden)

Copyright © 2020 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Horizon 2020

Conference

CC '20: 29th International Conference on Compiler Construction

February 22 - 23, 2020

San Diego, CA, USA
