Research article · Open access · DOI: 10.1145/3559009.3569659

Auto-Partitioning Heterogeneous Task-Parallel Programs with StreamBlocks

Published: 27 January 2023

Abstract

FPGAs play an increasing role in the reconfigurable accelerator landscape. A key challenge in designing FPGA-based systems is partitioning computation between processor cores and FPGAs. An appropriate division of labor is difficult to predict in advance and requires experiments and measurements. When an investigation requires rewriting part of the system in a new language or with a new programming model, the high cost of that rewrite can delay design-space exploration. A single-language system with an appropriate programming model and a compiler that targets both platforms turns this tedious exploration into a simple recompile with new compiler directives.
This work introduces StreamBlocks, a unified open-source software/FPGA compiler and runtime that takes dataflow programs written in CAL and automatically partitions them across heterogeneous CPU/FPGA platforms. The explicit task-parallel semantics of dataflow allow our compiler to simultaneously take advantage of thread parallelism in software and spatial parallelism in hardware.
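To make the task-parallel dataflow model concrete, the following is a minimal sketch in Python, not CAL and not StreamBlocks output. It assumes three hypothetical actors (`source`, `scale`, `sink`) that communicate only through bounded FIFO channels, with each actor mapped to its own thread. All names here are illustrative assumptions.

```python
import queue
import threading

# Hypothetical illustration only: this is not StreamBlocks or CAL code.
_END = object()  # sentinel token marking the end of a stream


def source(out_q, values):
    """Actor that emits each input value as a token, then the end marker."""
    for v in values:
        out_q.put(v)
    out_q.put(_END)


def scale(in_q, out_q, factor):
    """Actor that consumes one token, multiplies it, and emits the result."""
    while True:
        token = in_q.get()
        if token is _END:
            out_q.put(_END)
            return
        out_q.put(token * factor)


def sink(in_q, results):
    """Actor that collects tokens until the end marker arrives."""
    while True:
        token = in_q.get()
        if token is _END:
            return
        results.append(token)


def run_pipeline(values, factor):
    """Wire the three actors with bounded FIFOs and run one thread per actor."""
    a_to_b = queue.Queue(maxsize=4)
    b_to_c = queue.Queue(maxsize=4)
    results = []
    threads = [
        threading.Thread(target=source, args=(a_to_b, values)),
        threading.Thread(target=scale, args=(a_to_b, b_to_c, factor)),
        threading.Thread(target=sink, args=(b_to_c, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the actors share no state other than the channels, any of them could in principle be moved to a different execution resource without changing the others, which is the property that makes partitioning a mapping decision rather than a rewrite.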
StreamBlocks is augmented with a profile-guided auto-partitioning tool that helps identify the best hardware-software partitions. We demonstrate the capability of our compiler in finding the right balance between hardware and software execution on both a high-end datacenter accelerator card and an embedded board. Our experiments exhibit a 4–7× speedup over trivial partitions. This speedup is achieved automatically with zero code modifications.
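As a rough illustration of what profile-guided partitioning involves, the sketch below exhaustively searches hardware/software assignments for a tiny actor network. It is not the StreamBlocks partitioner: the cost model (software actor times add up and are divided across cores, hardware actors run spatially in parallel so only the slowest counts, and every channel crossing the boundary pays a fixed transfer cost) and all names and numbers are assumptions chosen for the example.

```python
from itertools import product


def estimate_time(assignment, sw_time, hw_time, channels, transfer_cost, cores=1):
    """Estimate run time of one partition under an assumed, simplified cost model."""
    # Software actors share the CPU cores, so their profiled times accumulate.
    sw_total = sum(sw_time[a] for a, side in assignment.items() if side == "sw")
    # Hardware actors run side by side, so the slowest one is the bottleneck.
    hw_max = max(
        (hw_time[a] for a, side in assignment.items() if side == "hw"),
        default=0.0,
    )
    # Each channel whose endpoints sit on different sides pays a transfer cost.
    crossings = sum(1 for src, dst in channels if assignment[src] != assignment[dst])
    return max(sw_total / cores, hw_max) + crossings * transfer_cost


def best_partition(sw_time, hw_time, channels, transfer_cost, cores=1):
    """Try every sw/hw assignment and return (estimated_time, assignment)."""
    actors = sorted(sw_time)
    best = None
    for sides in product(("sw", "hw"), repeat=len(actors)):
        assignment = dict(zip(actors, sides))
        t = estimate_time(assignment, sw_time, hw_time, channels, transfer_cost, cores)
        if best is None or t < best[0]:
            best = (t, assignment)
    return best


# Made-up profile data for three hypothetical actors in a chain a -> b -> c.
SW_TIME = {"a": 1.0, "b": 10.0, "c": 1.0}
HW_TIME = {"a": 5.0, "b": 2.0, "c": 5.0}
CHANNELS = [("a", "b"), ("b", "c")]

if __name__ == "__main__":
    print(best_partition(SW_TIME, HW_TIME, CHANNELS, transfer_cost=0.5))
```

Exhaustive search grows as 2^n in the number of actors, so it is only practical for toy networks; a real tool would need a smarter search and measured rather than invented costs.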

Supplementary Material

ZIP File (p398-emami-supp.zip)
Supplemental files.


Cited By

  • (2024) Auto-Generating Diverse Heterogeneous Designs. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 116-123. DOI: 10.1109/IPDPSW63119.2024.00035. Online publication date: 27-May-2024
  • (2023) Informing Static Mapping and Local Scheduling of Stream Programs with Trace Analysis. 2023 25th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 98-103. DOI: 10.1109/SYNASC61333.2023.00021. Online publication date: 11-Sep-2023
  • (2023) New Architecture for Real-Time Image Computing Using Parallel Processing Based on DSP/FPGA. 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), 1-4. DOI: 10.1109/ICECCME57830.2023.10252728. Online publication date: 19-Jul-2023



Published In

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
October 2022
569 pages
ISBN: 9781450398688
DOI: 10.1145/3559009
© 2022 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. actors
  2. partitioning
  3. reconfigurable computing

Qualifiers

  • Research-article

Conference

PACT '22

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
