DOI: 10.1145/3377555.3377891
research-article

Mixed-data-model heterogeneous compilation and OpenMP offloading

Published: 24 February 2020

Abstract

Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and that of the accelerator. Today, 64-bit host processors are commonplace, but few accelerators exceed 32-bit addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption.
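To make the address-width discrepancy concrete, the following is a minimal sketch in C, assuming a 64-bit host and a 32-bit accelerator. The type names `host_addr_t` and `accel_addr_t` and both helper functions are hypothetical and chosen for illustration only; they are not taken from the paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical address types: a 64-bit host address and a 32-bit
 * accelerator-native address. Illustrative names, not from the paper. */
typedef uint64_t host_addr_t;
typedef uint32_t accel_addr_t;

/* True if the host address is representable in the accelerator's
 * native 32-bit address space. */
static bool fits_accel_native(host_addr_t a) {
    return a <= UINT32_MAX;
}

/* Narrow a host address to the accelerator's native width. This is only
 * valid when the address fits; otherwise the upper 32 bits would be lost. */
static accel_addr_t narrow_to_accel(host_addr_t a) {
    assert(fits_accel_native(a));
    return (accel_addr_t)a;
}
```

Any host address above the 4 GiB boundary fails the check, which is the case a mixed-data-model compiler has to handle rather than silently truncate.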
In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7% compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
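As an illustration of the kind of offloading the abstract refers to, here is a minimal sketch of a standard OpenMP 4.5 `target` region in C. The function name `saxpy_offload` and the kernel itself are assumptions made for this example; the abstract does not show the authors' benchmark code or their compiler's internals.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] += a * x[i] for i in [0, n).
 * The map clauses name the host arrays that the device may access:
 * x is read-only on the device, y is read and written back.
 * Hypothetical example kernel, not taken from the paper. */
static void saxpy_offload(size_t n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

When compiled without OpenMP support the pragma is ignored and the loop runs on the host, so the function behaves the same either way. The source code contains no address-width handling at all, which is the sense in which offloading is transparent to the programmer.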

Cited By

  • (2022) OpenMP Offloading in the Jetson Nano Platform. Workshop Proceedings of the 51st International Conference on Parallel Processing, 1-8. DOI: 10.1145/3547276.3548517. Online publication date: 29-Aug-2022
  • (2022) HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 33:12, 4368-4382. DOI: 10.1109/TPDS.2022.3189390. Online publication date: 1-Dec-2022
  • (2020) Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems. Proceedings of the 17th ACM International Conference on Computing Frontiers, 81-88. DOI: 10.1145/3387902.3392631. Online publication date: 11-May-2020


Published In

CC 2020: Proceedings of the 29th International Conference on Compiler Construction
February 2020
222 pages
ISBN: 9781450371209
DOI: 10.1145/3377555
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Compilers
  2. Data Models
  3. Heterogeneous Computer Architectures
  4. Offloading
  5. OpenMP
  6. Runtime Libraries

Conference

CC '20