Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware Acceleration

Authors:

Georgios Zacharopoulos,

Muhammad Huzaifa,

David Brooks

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3

Article No.: 53, Pages 1 - 23

https://doi.org/10.1145/3580394

Published: 20 April 2023

Abstract

Designing heterogeneous systems that include domain-specific accelerators is a challenging and time-consuming process. Under area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, applications in domains such as Extended Reality (XR) offer opportunities for several forms of parallel execution, including loop-level, task-level, and pipeline parallelism. To assist the design process and expose every available level of parallelism, we present Trireme, a fully automated toolchain that explores multiple levels of parallelism and produces domain-specific accelerator designs and configurations that maximize performance for a given area budget. FPGA SoCs were used as target platforms, and Catapult HLS [7] was used to synthesize RTL in a commercial 12 nm FinFET technology. Experiments on demanding benchmarks from the XR domain showed speedups of up to 20×, and of up to 37× for smaller applications, compared to software-only implementations.

References

[1]

Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2 (Feb. 2011), 1–7.

[2]

Coen Bron and Joep Kerbosch. 1973. Algorithm 457: Finding all cliques of an undirected graph. Commun. ACM 16, 9 (1973), 575–577.

[3]

Iulian Brumar, Georgios Zacharopoulos, Yuan Yao, Saketh Rama, Gu-Yeon Wei, and David Brooks. 2022. Early DSE and automatic generation of coarse grained merged accelerators. ACM Trans. Embed. Comput. Syst. (June 2022).

[4]

Cadence. 2016. Stratus High-Level Synthesis. Retrieved from https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/synthesis/stratus-high-level-synthesis.html.

[5]

Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks. 2014. HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. In Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). IEEE, 217–228.

[6]

Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, et al. 2013. From software to accelerators with LegUp high-level synthesis. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. IEEE.

[7]

Catapult. 2017. Catapult High-level Synthesis. Retrieved from https://eda.sw.siemens.com/en-US/ic/ic-design/high-level-synthesis-and-verification-platform/.

[8]

David Durst, Matthew Feldman, Dillon Huff, David Akeley, Ross Daly, Gilbert Louis Bernstein, Marco Patrignani, Kayvon Fatahalian, and Pat Hanrahan. 2020. Type-directed scheduling of streaming accelerators. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, 408–422.

[9]

Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In ACM SIGARCH Computer Architecture News, Vol. 39. 365–376.

[10]

Lorenzo Ferretti, Andrea Cini, Georgios Zacharopoulos, Cesare Alippi, and Laura Pozzi. 2021. A graph deep learning framework for high-level synthesis design space exploration. arXiv preprint arXiv:2111.14767 (2021).

[11]

Lorenzo Ferretti, Andrea Cini, Georgios Zacharopoulos, Cesare Alippi, and Laura Pozzi. 2022. Graph neural networks for high-level synthesis design space exploration. ACM Trans. Des. Automat. Electron. Syst. 28, 2 (2022), 20.

[12]

Muhammad Huzaifa, Rishi Desai, Samuel Grayson, Xutao Jiang, Ying Jing, Jae Lee, Fang Lu, Yihan Pang, Joseph Ravichandran, Finn Sinclair, Boyuan Tian, Hengzhi Yuan, Jeffrey Zhang, and Sarita V. Adve. 2021. ILLIXR: Enabling end-to-end extended reality research. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). 24–38.

[13]

David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, et al. 2018. Spatial: A language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 296–311.

[14]

Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous parallel virtual machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 68–80.

[15]

Snehasish Kumar, Vijayalakshmi Srinivasan, Amirali Sharifian, Nick Sumner, and Arrvindh Shriraman. 2016. Peruse and profit: Estimating the accelerability of loops. In Proceedings of the International Conference on Supercomputing. 1–13.

[16]

Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 242–251.

[17]

Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2nd International Symposium on Code Generation and Optimization. 75–88.

[18]

LLVM Project. Circuit IR Compilers and Tools (CIRCT). https://github.com/llvm/circt.

[19]

Steven Margerm, Amirali Sharifian, Apala Guha, Arrvindh Shriraman, and Gilles Pokam. 2018. TAPAS: Generating parallel accelerators from parallel programs. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 245–257.

[20]

Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. 2012. An overview of today’s high-level synthesis tools. Des. Automat. Embed. Syst. 16, 3 (Sept. 2012), 31–51.

[21]

Luigi Nardi, Artur Souza, David Koeplinger, and Kunle Olukotun. 2019. HyperMapper: A practical design space exploration framework. In Proceedings of the IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 425–426.

[22]

Tan Nguyen, Swathi Gurumani, Kyle Rupnow, and Deming Chen. 2016. FCUDA-SoC: Platform integration for field-programmable SoC with the CUDA-to-FPGA compiler. In Proceedings of the ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 5–14.

[23]

Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-Mei W. Hwu. 2009. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the IEEE 7th Symposium on Application Specific Processors. IEEE, 35–42.

[24]

Christian Pilato and Fabrizio Ferrandi. 2012. Bambu: A free framework for the high level synthesis of complex applications. In Proceedings of the 23rd International Conference on Field Programmable Logic and Applications.

[25]

Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks for accelerator design and customized architectures. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 110–119.

[26]

Samuel Rogers, Joshua Slycord, Mohammadreza Baharani, and Hamed Tabkhi. 2020. gem5-SALAM: A system architecture for LLVM-based accelerator modeling. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 471–482.

[27]

Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2017. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 249–265.

[28]

Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Proceedings of the 41st Annual International Symposium on Computer Architecture. IEEE, 97–108.

[29]

Yakun Sophia Shao, Sam Likun Xi, Vijayalakshmi Srinivasan, Gu-Yeon Wei, and David Brooks. 2016. Co-designing accelerators and SoC interfaces using gem5-aladdin. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–12.

[30]

Tom Simonite. 2016. Moore’s law is dead. Now what? MIT Technol. Rev. May 13 (2016), 40–41.

[31]

John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Cent. Reliab. High-perform. Comput. 127 (2012).

[32]

Xilinx. 2017. Vivado High-level Synthesis. Retrieved from www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.

[33]

Xilinx. 2017. Xilinx All Programmable SoC portfolio. Retrieved from www.xilinx.com/products/silicon-devices/soc.html.

[34]

Yuan Yao and Saketh Rama. yaoyuannnn. CAVA: Camera Vision Pipeline on gem5-Aladdin. https://github.com/yaoyuannnn/cava.

[35]

Georgios Zacharopoulos, Andrea Barbon, Giovanni Ansaloni, and Laura Pozzi. 2018. Machine learning approach for loop unrolling factor prediction in high level synthesis. In Proceedings of the IEEE International Conference on High Performance Computing & Simulation (HPCS). 91–97.

[36]

Georgios Zacharopoulos, Lorenzo Ferretti, Giovanni Ansaloni, Giuseppe Di Guglielmo, Luca Carloni, and Laura Pozzi. 2019. Compiler-assisted selection of hardware acceleration candidates from application source code. In Proceedings of the International Conference on Computer Design. 1–9.

[37]

Georgios Zacharopoulos, Lorenzo Ferretti, Emanuele Giaquinta, Giovanni Ansaloni, and Laura Pozzi. 2019. RegionSeeker: Automatically identifying and selecting accelerators from application source code. IEEE Trans. Comput.-aid Des. Integ. Circ. Syst. 38, 4 (Apr. 2019), 741–754.

[38]

Georgios Zacharopoulos and Laura Pozzi. 2017. ClrFreqCFGPrinter: A Tool for Frequency Annotated Control Flow Graph Generation. Technical Report. European LLVM Developers Meeting.

[39]

Ruoyu Zhou and Timothy M. Jones. 2019. Janus: Statically-driven and profile-guided automatic dynamic binary parallelisation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 15–25.

Cited By

S. Gorgin, M. Karvandi, S. Moghari, M. Fallah, and J. Lee. 2024. A Hardware Realization Framework for Fuzzy Inference System Optimization. Electronics 13, 4 (2024), 690. Online publication date: 8-Feb-2024.
https://doi.org/10.3390/electronics13040690

E. Hussein, B. Waschneck, and C. Mayr. 2024. Automating application-driven customization of ASIPs. Journal of Systems Architecture: the EUROMICRO Journal 148, C (2024). Online publication date: 2-Jul-2024.
https://dl.acm.org/doi/10.1016/j.sysarc.2024.103080

Information

Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3

May 2023

519 pages

ISSN: 1539-9087

EISSN: 1558-3465

DOI: 10.1145/3592782

Editor: Tulika Mitra, National University of Singapore, Singapore

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 20 April 2023

Online AM: 17 January 2023

Accepted: 09 January 2023

Revised: 11 November 2022

Received: 06 October 2021

Published in TECS Volume 22, Issue 3

Qualifiers

Research-article

Funding Sources

Software Analysis for Heterogeneous Computing Architectures
Swiss National Science Foundation (SNSF), by the National Science Foundation (US)
NSF
DARPA through the Domain-Specific System on Chip (DSSoC)
Applications Driving Architectures (ADA) Research Center
