Nothing Special   »   [go: up one dir, main page]

skip to main content
tutorial

Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware

Published: 24 June 2017 Publication History

Abstract

CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To unleash the power of FPGA, however, the programmability gap has to be filled so that applications specified in high-level programming languages can be efficiently mapped and scheduled on FPGA. The above problem is even more challenging for irregular applications, in which the execution dependency can only be determined at run time. Thus over-serialized accelerators are generated from existing works that rely on compile time analysis to schedule the computation.
In this work, we propose a comprehensive software-hardware co-design framework, which captures parallelism in irregular applications and aggressively schedules pipelined execution on reconfigurable platform. Based on an inherently parallel abstraction packaging parallelism for runtime schedule, our framework significantly differs from existing works that tend to schedule executions at compile time. An irregular application is formulated as a set of tasks with their dependencies specified as rules describing the conditions under which a subset of tasks can be executed concurrently. Then datapaths on FPGA will be generated by transforming applications in the formulation into task pipelines orchestrated by evaluating rules at runtime, which could exploit fine-grained pipeline parallelism as handcrafted accelerators do.
An evaluation shows that this framework is able to produce datapath with its quality close to handcrafted designs. Experiments show that generated accelerators are dramatically more efficient than those created by current high-level synthesis tools. Meanwhile, accelerators generated for a set of irregular applications attain 0.5x~1.9x performance compared to equivalent software implementations we selected on a server-grade 10-core processor, with the memory subsystem remaining as the bottleneck.

References

[1]
2006. 9th DIMACS Implementation Challenge: Shortest Paths. (2006). http://www.dis.uniroma1.it/challenge9/
[2]
2015. Xeon+fpga platform for the data center. (2015). https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
[3]
Altera Corporation. 2013. Implementing FPGA Design with the OpenCL Standard (Altera). (Nov. 2013).
[4]
Altera Corporation. 2016. Altera SDK for OpenCL - Best Practices Guide. (2016).
[5]
K. Arvind and Rishiyur S. Nikhil. 1990. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Comput. 39 (March 1990), 300--318.
[6]
Joshua Auerbach, David F. Bacon, Perry Cheng, and Rodric Rabbah. 2010. Lime: A Java-compatible and Synthesizable Language for Heterogeneous Architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '10). ACM, New York, NY, USA, 89--108.
[7]
David Bacon, Rodric Rabbah, and Sunil Shukla. 2013. FPGA Programming for the Masses. Queue 11 (Feb. 2013), 40:40--40:52.
[8]
Daniel U. Becker and William J. Dally. 2009. Allocator implementations for network-on-chip routers. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. 1--12.
[9]
Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. 2012. Internally Deterministic Parallel Algorithms Can Be Fast. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12). ACM, New York, NY, USA, 181--192.
[10]
Joseph Tobin Buck. 1993. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. Ph.D. Dissertation. University of California, Berkeley. AAI9431898.
[11]
Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. 2008. Software Transactional Memory: Why Is It Only a Research Toy? Queue 6 (Sept. 2008), 40:46--40:58.
[12]
Fei Chen, Yi Shan, Yu Zhang, Yu Wang, Hubertus Franke, Xiaotao Chang, and Kun Wang. 2014. Enabling FPGAs in the Cloud. In Proceedings of the 11th ACM Conference on Computing Frontiers (CF '14). ACM, New York, NY, USA, 3:1--3:10.
[13]
Boris V. Cherkassky, Andrew V. Goldberg, and Tomasz Radzik. 1996. Shortest paths algorithms: Theory and experimental evaluation. Mathematical Programming 73 (May 1996), 129--174.
[14]
Young-kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2016. A Quantitative Analysis on Microarchitectures of Modern CPU-FPGA Platforms. In Proceedings of the 53rd Annual Design Automation Conference (DAC '16). ACM, New York, NY, USA, 109:1--109:6.
[15]
Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30 (April 2011), 473--491.
[16]
Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 105--110.
[17]
Alejandro Duran, Xavier Teruel, Roger Ferrer, Xavier Martorell, and Eduard Ayguade. 2009. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing (ICPP '09). IEEE Computer Society, Washington, DC, USA, 124--131.
[18]
Rahulkumar Gayatri, Rosa. M. Badia, and Eduard Aygaude. 2013. Loop level speculation in a task based programming model. In 20th Annual International Conference on High Performance Computing. 39--48.
[19]
Andreas Gerstlauer, Christian Haubelt, Andy D. Pimentel, Todor P. Stefanov, Daniel D. Gajski, and JÃijrgen Teich. 2009. Electronic System-Level Synthesis Methodologies. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 28 (Oct. 2009), 1517--1530.
[20]
Paul Grigoras, Pavel Burovskiy, and Wayne Luk. 2016. CASK: Open-Source Custom Architectures for Sparse Kernels. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 179--184.
[21]
Muhammad Amber Hassaan, Martin Burtscher, and Keshav Pingali. 2011. Ordered vs. Unordered: A Comparison of Parallelism and Work-efficiency in Irregular Algorithms. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '11). ACM, New York, NY, USA, 3--12.
[22]
Muhammad Amber Hassaan, Donald D. Nguyen, and Keshav K. Pingali. 2015. Kinetic Dependence Graphs. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 457--471.
[23]
Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient Execution of Memory Access Phases Using Dataflow Specialization. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 118--130.
[24]
Martin Isenburg, Yuanxin Liu, Jonathan Shewchuk, and Jack Snoeyink. 2006. Streaming Computation of Delaunay Triangulations. In ACM SIGGRAPH 2006 Papers (SIGGRAPH '06). ACM, New York, NY, USA, 1049--1056.
[25]
Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. 2014. SDA: Software-defined accelerator for large-scale DNN systems. IEEE Press, 1--23.
[26]
Nick P. Johnson, Hanjun Kim, Prakash Prabhu, Ayal Zaks, and David I. August. 2012. Speculative Separation for Privatization and Reductions. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '12). ACM, New York, NY, USA, 359--370.
[27]
Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[28]
Kurt Keutzer, Sharad Malik, A.Richard Newton, Jan M. Rabaey, and Alberto Sangiovanni-Vincentelli. 2000. System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19 (Dec. 2000), 1523--1543.
[29]
David Koeplinger, Christina Delimitrou, Raghu Prabhakar, Christos Kozyrakis, Yaqi Zhang, and Kunle Olukotun. 2016. Automatic Generation of Efficient Accelerators for Reconfigurable Hardware. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA '16). IEEE Press, Piscataway, NJ, USA, 115--127.
[30]
Venkata Krishnan and Josep Torrellas. 1999. A Chip-Multiprocessor Architecture with Speculative Multithreading. IEEE Trans. Comput. 48 (Sept. 1999), 866--880.
[31]
Konstantinos Krommydas, Wu-chun Feng, Christos D. Antonopoulos, and Nikolaos Bellas. 2015. OpenDwarfs: Characterization of Dwarf-Based Benchmarks on Fixed and Reconfigurable Architectures. Journal of Signal Processing Systems 85 (Oct. 2015), 373--392.
[32]
Milind Kulkarni, Martin Burtscher, Rajeshkar Inkulu, Keshav Pingali, and Calin CasÃğaval. 2009. How Much Parallelism is There in Irregular Applications?. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '09). ACM, New York, NY, USA, 3--14.
[33]
Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. 2007. Optimistic Parallelism Requires Abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '07). ACM, New York, NY, USA, 211--222.
[34]
Edward A. Lee and Alberto Sangiovanni-Vincentelli. 1998. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 17 (Dec. 1998), 1217--1229.
[35]
Charles E. Leiserson and Tao B. Schardl. 2010. A Work-efficient Parallel Breadth-first Search Algorithm (or How to Cope with the Nondeterminism of Reducers). In Proceedings of the Twenty-second Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '10). ACM, New York, NY, USA, 303--314.
[36]
Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam. 2015. Exploring the Potential of Heterogeneous Von Neumann/Dataflow Execution Models. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 298--310.
[37]
Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for Graph Analytics Acceleration. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16). ACM, New York, NY, USA, 111--117.
[38]
Karl J. Ottenstein, Robert A. Ballance, and Arthur B. MacCabe. 1990. The Program Dependence Web: A Representation Supporting Control-, Data-, and Demand-driven Interpretation of Imperative Languages. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation (PLDI '90). ACM, New York, NY, USA, 257--271.
[39]
Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-MeiW.Hwu. 2013. Efficient Compilation of CUDA Kernels for High-performance Computing on FPGAs. ACM Trans. Embed. Comput. Syst. 13 (Sept. 2013), 25:1--25:26.
[40]
Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, and Joel Emer. 2013. Triggered Instructions: A Control Paradigm for Spatially-programmed Architectures. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 142--153.
[41]
Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario MÃl'ndez-Lojo, Dimitrios Prountzos, and Xin Sui. 2011. The Tao of Parallelism in Algorithms. In Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11). ACM, New York, NY, USA, 12--25.
[42]
Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. 2016. Generating Configurable Hardware from Parallel Patterns. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '16). ACM, New York, NY, USA, 651--665.
[43]
Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth, Gopal Jan, Gray Michael, Haselman Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi, and Xiao Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-scale Datacenter Services. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 13--24. http://dl.acm.org/citation.cfm?id=2665671.2665678
[44]
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August. 2010. Speculative Parallelization Using Software Multi-threaded Transactions. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (ASPLOS XV). ACM, New York, NY, USA, 65--76.
[45]
Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. 2006. McRT-STM: A High Performance Software Transactional Memory System for a Multi-core Runtime. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '06). ACM, New York, NY, USA, 187--197.
[46]
Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. 2014. Navigating the Maze of Graph Analytics Frameworks Using Massive Graph Datasets. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 979--990.
[47]
Nir Shavit and Dan Touitou. 1995. Software Transactional Memory. In Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '95). ACM, New York, NY, USA, 204--213.
[48]
Yaman Umuroglu, Donn Morrison, and Magnus Jahre. 2015. Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform. In Proceedings of the 2015 25th International Conference on Field Programmable Logic and Applications (FPL). IEEE Computer Society, London, UK, 1--8.
[49]
Dani Voitsechov and Yoav Etsion. 2014. Single-graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs. In Proceeding of the 41st Annual International Symposium on Computer Architecuture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 205--216. http://dl.acm.org/citation.cfm?id=2665671.2665703
[50]
Skyler Windh, Xiaoyin Ma, R.J. Halstead, Prerna Budhkar, Zabdiel Luna, Omar Hussaini, and Walid A. Najjar. 2015. High-Level Language Tools for Reconfigurable Computing. Proc. IEEE 103 (March 2015), 390--408.
[51]
Felix Winterstein, Samuel Bayliss, and George A. Constantinides. 2013. High-level synthesis of dynamic data structures: A case study using Vivado HLS. In 2013 International Conference on Field-Programmable Technology (FPT). 362--365.
[52]
Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2015. Accelerating Large-Scale Single-Source Shortest Path on FPGA. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW '15). IEEE Computer Society, Washington, DC, USA, 129--136.

Cited By

View all
  • (2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
  • (2021)An Elastic Task Scheduling Scheme on Coarse-Grained Reconfigurable ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308480432:12(3066-3080)Online publication date: 1-Dec-2021
  • (2021)Skew-Oblivious Data Routing for Data Intensive Applications on FPGAs with HLS2021 58th ACM/IEEE Design Automation Conference (DAC)10.1109/DAC18074.2021.9586184(937-942)Online publication date: 5-Dec-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News
ACM SIGARCH Computer Architecture News  Volume 45, Issue 2
ISCA'17
May 2017
715 pages
ISSN:0163-5964
DOI:10.1145/3140659
Issue’s Table of Contents
  • cover image ACM Conferences
    ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture
    June 2017
    736 pages
    ISBN:9781450348928
    DOI:10.1145/3079856
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2017
Published in SIGARCH Volume 45, Issue 2

Check for updates

Author Tags

  1. FPGA
  2. Hardware Accelerator
  3. Parallel Programming

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)40
  • Downloads (Last 6 weeks)5
Reflects downloads up to 27 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
  • (2021)An Elastic Task Scheduling Scheme on Coarse-Grained Reconfigurable ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308480432:12(3066-3080)Online publication date: 1-Dec-2021
  • (2021)Skew-Oblivious Data Routing for Data Intensive Applications on FPGAs with HLS2021 58th ACM/IEEE Design Automation Conference (DAC)10.1109/DAC18074.2021.9586184(937-942)Online publication date: 5-Dec-2021
  • (2019)On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs2019 29th International Conference on Field Programmable Logic and Applications (FPL)10.1109/FPL.2019.00020(67-73)Online publication date: Sep-2019
  • (2019)A Survey on Graph Processing Accelerators: Challenges and OpportunitiesJournal of Computer Science and Technology10.1007/s11390-019-1914-z34:2(339-371)Online publication date: 22-Mar-2019
  • (2019)Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platformThe Journal of Supercomputing10.1007/s11227-019-02935-1Online publication date: 18-Jun-2019
  • (2018)Compilation Method of Reconfigurable Cryptographic ProcessorsReconfigurable Cryptographic Processor10.1007/978-981-10-8899-5_4(169-211)Online publication date: 17-May-2018
  • (2017)Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD.2017.25(137-144)Online publication date: Oct-2017
  • (2023)System Design and Function Implementation Based on Decentralized Artificial Intelligence Internet of Things2023 5th International Conference on Artificial Intelligence and Computer Applications (ICAICA)10.1109/ICAICA58456.2023.10405571(600-605)Online publication date: 28-Nov-2023
  • (2023)Phloem: Automatic Acceleration of Irregular Applications with Fine-Grain Pipeline Parallelism2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071026(1262-1274)Online publication date: Feb-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media