
Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware Acceleration

Published: 20 April 2023

Abstract

The design of heterogeneous systems that include domain-specific accelerators is a challenging and time-consuming process. While taking into account area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, applications in domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop-level, task-level, and pipeline parallelism. To assist the design process and expose every possible level of parallelism, we present Trireme, a fully automated tool-chain that explores multiple levels of parallelism and produces domain-specific accelerator designs and configurations that maximize performance, given an area budget. FPGA SoCs were used as target platforms, and Catapult HLS [7] was used to synthesize RTL using a commercial 12 nm FinFET technology. Experiments on demanding benchmarks from the XR domain revealed a speedup of up to 20×, as well as a speedup of up to 37× for smaller applications, compared to software-only implementations.
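As a rough illustration of the hardware/software partitioning decision described above, the following sketch selects the accelerator candidates that save the most cycles while fitting an area budget. It is a minimal sketch and not Trireme's algorithm: the `Candidate` type, the area and cycle figures, and the assumption that savings from different candidates simply add up (with no parallel overlap between them) are all hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Candidate(NamedTuple):
    """A hypothetical accelerator candidate (all values are illustrative)."""
    name: str
    area: int          # area cost in arbitrary units
    saved_cycles: int  # cycles saved compared with running in software


def select_accelerators(candidates, area_budget):
    """Exhaustively pick the subset with the largest total saving within the budget.

    Assumes savings are independent and additive, which ignores the
    interactions between parallel accelerators that a real tool must model.
    """
    best, best_saving = (), 0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            area = sum(c.area for c in subset)
            saving = sum(c.saved_cycles for c in subset)
            if area <= area_budget and saving > best_saving:
                best, best_saving = subset, saving
    return [c.name for c in best], best_saving


def speedup(software_cycles, saved_cycles):
    """Speedup of the accelerated design over the software-only baseline."""
    return software_cycles / (software_cycles - saved_cycles)
```

With three hypothetical candidates `A` (area 4, saves 400 cycles), `B` (area 3, saves 300) and `C` (area 5, saves 450) and a budget of 8 area units, the sketch picks `B` and `C`. Exhaustive search is only practical for a handful of candidates; it is used here purely to keep the example short.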

References

[1]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH Comput. Archit. News 39, 2 (Feb. 2011), 1–7.
[2]
Coen Bron and Joep Kerbosch. 1973. Algorithm 457: Finding all cliques of an undirected graph. In Communications ACM, Vol. 9. 575–577.
[3]
Iulian Brumar, Georgios Zacharopoulos, Yuan Yao, Saketh Rama, Gu-Yeon Wei, and David Brooks. 2022. Early DSE and automatic generation of coarse grained merged accelerators. ACM Trans. Embed. Comput. Syst. (June 2022).
[4]
Cadence. 2016. Stratus High-Level Synthesis. Retrieved from https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/synthesis/stratus-high-level-synthesis.html.
[5]
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks. 2014. HELIX-RC: An architecture-compiler co-design for automatic parallelization of irregular programs. In Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). IEEE, 217–228.
[6]
Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, et al. 2013. From software to accelerators with LegUp high-level synthesis. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems. IEEE.
[7]
Catapult. 2017. Catapult High-level Synthesis. Retrieved from https://eda.sw.siemens.com/en-US/ic/ic-design/high-level-synthesis-and-verification-platform/.
[8]
David Durst, Matthew Feldman, Dillon Huff, David Akeley, Ross Daly, Gilbert Louis Bernstein, Marco Patrignani, Kayvon Fatahalian, and Pat Hanrahan. 2020. Type-directed scheduling of streaming accelerators. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, 408–422.
[9]
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In ACM SIGARCH Computer Architecture News, Vol. 39. 365–376.
[10]
Lorenzo Ferretti, Andrea Cini, Georgios Zacharopoulos, Cesare Alippi, and Laura Pozzi. 2021. A graph deep learning framework for high-level synthesis design space exploration. arXiv preprint arXiv:2111.14767 (2021).
[11]
Lorenzo Ferretti, Andrea Cini, Georgios Zacharopoulos, Cesare Alippi, and Laura Pozzi. 2022. Graph neural networks for high-level synthesis design space exploration. ACM Trans. Des. Automat. Electron. Syst. 28, 2 (2022), 20.
[12]
Muhammad Huzaifa, Rishi Desai, Samuel Grayson, Xutao Jiang, Ying Jing, Jae Lee, Fang Lu, Yihan Pang, Joseph Ravichandran, Finn Sinclair, Boyuan Tian, Hengzhi Yuan, Jeffrey Zhang, and Sarita V. Adve. 2021. ILLIXR: Enabling end-to-end extended reality research. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). 24–38.
[13]
David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, et al. 2018. Spatial: A language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 296–311.
[14]
Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous parallel virtual machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 68–80.
[15]
Snehasish Kumar, Vijayalakshmi Srinivasan, Amirali Sharifian, Nick Sumner, and Arrvindh Shriraman. 2016. Peruse and profit: Estimating the accelerability of loops. In Proceedings of the International Conference on Supercomputing. 1–13.
[16]
Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. 2019. HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 242–251.
[17]
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2nd International Symposium on Code Generation and Optimization. 75–88.
[18]
LLVM Project. Circuit IR Compilers and Tools (CIRCT). https://github.com/llvm/circt.
[19]
Steven Margerm, Amirali Sharifian, Apala Guha, Arrvindh Shriraman, and Gilles Pokam. 2018. TAPAS: Generating parallel accelerators from parallel programs. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 245–257.
[20]
Wim Meeus, Kristof Van Beeck, Toon Goedemé, Jan Meel, and Dirk Stroobandt. 2012. An overview of today’s high-level synthesis tools. Des. Automat. Embed. Syst. 16, 3 (Sept. 2012), 31–51.
[21]
Luigi Nardi, Artur Souza, David Koeplinger, and Kunle Olukotun. 2019. HyperMapper: A practical design space exploration framework. In Proceedings of the IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 425–426.
[22]
Tan Nguyen, Swathi Gurumani, Kyle Rupnow, and Deming Chen. 2016. FCUDA-SoC: Platform integration for field-programmable SoC with the CUDA-to-FPGA compiler. In Proceedings of the ACM/SIGDA International Symposium on Field-programmable Gate Arrays. 5–14.
[23]
Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-Mei W. Hwu. 2009. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the IEEE 7th Symposium on Application Specific Processors. IEEE, 35–42.
[24]
Christian Pilato and Fabrizio Ferrandi. 2012. Bambu: A free framework for the high level synthesis of complex applications. In Proceedings of the 23rd International Conference on Field Programmable Logic and Applications.
[25]
Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks for accelerator design and customized architectures. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). IEEE, 110–119.
[26]
Samuel Rogers, Joshua Slycord, Mohammadreza Baharani, and Hamed Tabkhi. 2020. gem5-SALAM: A system architecture for LLVM-based accelerator modeling. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 471–482.
[27]
Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2017. Tapir: Embedding fork-join parallelism into LLVM’s intermediate representation. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 249–265.
[28]
Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Proceedings of the 41st Annual International Symposium on Computer Architecture. IEEE, 97–108.
[29]
Yakun Sophia Shao, Sam Likun Xi, Vijayalakshmi Srinivasan, Gu-Yeon Wei, and David Brooks. 2016. Co-designing accelerators and SoC interfaces using gem5-aladdin. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–12.
[30]
Tom Simonite. 2016. Moore’s law is dead. Now what? MIT Technol. Rev. May 13 (2016), 40–41.
[31]
John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Cent. Reliab. High-perform. Comput. 127 (2012).
[32]
Xilinx. 2017. Vivado High-level Synthesis. Retrieved from www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[33]
Xilinx. 2017. Xilinx All Programmable SoC portfolio. Retrieved from www.xilinx.com/products/silicon-devices/soc.html.
[34]
Yuan Yao and Saketh Rama. yaoyuannnn. CAVA: Camera Vision Pipeline on gem5-Aladdin. https://github.com/yaoyuannnn/cava.
[35]
Georgios Zacharopoulos, Andrea Barbon, Giovanni Ansaloni, and Laura Pozzi. 2018. Machine learning approach for loop unrolling factor prediction in high level synthesis. In Proceedings of the IEEE International Conference on High Performance Computing & Simulation (HPCS). 91–97.
[36]
Georgios Zacharopoulos, Lorenzo Ferretti, Giovanni Ansaloni, Giuseppe Di Guglielmo, Luca Carloni, and Laura Pozzi. 2019. Compiler-assisted selection of hardware acceleration candidates from application source code. In Proceedings of the International Conference on Computer Design. 1–9.
[37]
Georgios Zacharopoulos, Lorenzo Ferretti, Emanuele Giaquinta, Giovanni Ansaloni, and Laura Pozzi. 2019. RegionSeeker: Automatically identifying and selecting accelerators from application source code. IEEE Trans. Comput.-aid Des. Integ. Circ. Syst. 38, 4 (Apr. 2019), 741–754.
[38]
Georgios Zacharopoulos and Laura Pozzi. 2017. ClrFreqCFGPrinter: A Tool for Frequency Annotated Control Flow Graph Generation. Technical Report. European LLVM Developers Meeting.
[39]
Ruoyu Zhou and Timothy M. Jones. 2019. Janus: Statically-driven and profile-guided automatic dynamic binary parallelisation. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 15–25.

Cited By

  • (2024) A Hardware Realization Framework for Fuzzy Inference System Optimization. Electronics 13:4 (690). DOI: 10.3390/electronics13040690. Online publication date: 8-Feb-2024
  • (2024) Automating application-driven customization of ASIPs. Journal of Systems Architecture: the EUROMICRO Journal 148:C. DOI: 10.1016/j.sysarc.2024.103080. Online publication date: 2-Jul-2024

Published In

ACM Transactions on Embedded Computing Systems  Volume 22, Issue 3
May 2023
519 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3592782
  • Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 April 2023
Online AM: 17 January 2023
Accepted: 09 January 2023
Revised: 11 November 2022
Received: 06 October 2021
Published in TECS Volume 22, Issue 3

Author Tags

  1. Accelerators
  2. ASICs
  3. compiler techniques and optimizations
  4. design tools
  5. heterogeneous systems parallelism

Qualifiers

  • Research-article

Funding Sources

  • Software Analysis for Heterogeneous Computing Architectures
  • Swiss National Science Foundation (SNSF), by the National Science Foundation (US)
  • NSF
  • DARPA through the Domain-Specific System on Chip (DSSoC)
  • Applications Driving Architectures (ADA) Research Center

Article Metrics

  • Downloads (Last 12 months): 498
  • Downloads (Last 6 weeks): 36
Reflects downloads up to 18 Nov 2024
