research-article

HERO: an open-source research platform for HW/SW exploration of heterogeneous manycore systems

Authors:

Alessandro Capotondi,

Andrea Marongiu

ANDARE '18: Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems

Article No.: 5, Pages 1 - 6

https://doi.org/10.1145/3295816.3295821

Published: 04 November 2018

Abstract

Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine "standard platform" software support (e.g. the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities.

In this work, we present HERO, a HeSoC platform that tackles this challenge in a novel way. HERO's host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source manycore processing engine, based on the extensible, open RISC-V ISA.

We evaluate a prototype implementation of HERO, in which the PMCA, implemented on an FPGA fabric, is coupled with a hard ARM Cortex-A host processor, and show that the run-time overhead compared to manually written PMCA code operating on private physical memory is lower than 10% for pivotal benchmarks and operating conditions.
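
The abstract does not define how the overhead is computed. One plausible reading, assumed here, is the relative run-time difference between the platform-managed version and the hand-written baseline. The sketch below illustrates that reading; the function name and the timings are hypothetical and are not measurements from the paper.

```python
def relative_overhead(t_platform: float, t_baseline: float) -> float:
    """Return the relative overhead (t_platform - t_baseline) / t_baseline.

    Assumption: "overhead" means relative run-time increase over the
    hand-written baseline; the abstract itself does not state the formula.
    """
    if t_baseline <= 0:
        raise ValueError("baseline run time must be positive")
    return (t_platform - t_baseline) / t_baseline


# Hypothetical run times in milliseconds (illustrative only).
baseline_ms = 100.0   # manually written PMCA code, private physical memory
platform_ms = 108.0   # same kernel run through the platform's software stack

overhead = relative_overhead(platform_ms, baseline_ms)
print(f"relative overhead: {overhead:.1%}")
print("below 10% threshold:", overhead < 0.10)
```

Under this definition, a claim of "lower than 10%" means the platform-managed run takes less than 1.1 times the baseline run time.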

Published In

ANDARE '18: Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems

November 2018

36 pages

ISBN: 9781450365918

DOI: 10.1145/3295816

General Chairs:
Andrea Bartolini, University of Bologna, Italy
João M. P. Cardoso, Faculdade de Engenharia da Universidade do Porto, Portugal
Cristina Silvano, Politecnico di Milano, Italy

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ANDARE'18

ANDARE'18: 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems

November 4, 2018

Limassol, Cyprus

Acceptance Rates

Overall Acceptance Rate 3 of 4 submissions, 75%
