research-article

Multithreaded pipeline synthesis for data-parallel kernels

Authors:

Zhiru Zhang

ICCAD '14: Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design

Pages 718 - 725

Published: 03 November 2014

Abstract

Pipelining is an important technique in high-level synthesis: it overlaps the execution of successive loop iterations or threads to achieve high throughput for loop/function kernels. Because existing pipelining techniques typically enforce in-order thread execution, a variable-latency operation in one thread blocks all subsequent threads, resulting in considerable performance degradation. In this paper, we propose a multithreaded pipelining approach that enables context switching to allow out-of-order thread execution for data-parallel kernels. To ensure that the synthesized pipeline is complexity-effective, we further propose efficient scheduling algorithms for minimizing the hardware overhead associated with context management. Experimental results show that our proposed techniques can significantly improve the effective pipeline throughput over conventional approaches while conserving hardware resources.
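The blocking effect described in the abstract can be illustrated with a toy cycle-count model. The sketch below is not the paper's architecture or its scheduling algorithm; it assumes a single variable-latency stage, one thread issued per cycle at most, and a fixed number of context slots. The function names `in_order_cycles` and `multithreaded_cycles` and the `contexts` parameter are hypothetical, chosen only for this illustration.

```python
import heapq


def in_order_cycles(latencies):
    """Toy in-order model: each thread holds the variable-latency stage
    until its operation completes, so later threads wait behind it."""
    return sum(latencies)


def multithreaded_cycles(latencies, contexts):
    """Toy out-of-order model (an assumption, not the paper's design):
    a waiting thread is parked in one of `contexts` slots, and at most
    one new thread issues per cycle while a slot is free."""
    if contexts < 1:
        raise ValueError("contexts must be at least 1")
    parked = []        # min-heap of completion cycles of parked threads
    cycle = 0          # earliest cycle at which the next thread may issue
    last_finish = 0    # completion cycle of the slowest thread so far
    for lat in latencies:
        if len(parked) == contexts:
            # Every slot is occupied: wait for the earliest completion.
            cycle = max(cycle, heapq.heappop(parked))
        done = cycle + lat
        heapq.heappush(parked, done)
        last_finish = max(last_finish, done)
        cycle += 1
    return last_finish


# One long-latency thread (20 cycles) among short ones (1 cycle each).
latencies = [1, 1, 20, 1, 1, 1, 1, 1]
print(in_order_cycles(latencies))          # blocking pipeline
print(multithreaded_cycles(latencies, 4))  # four context slots
print(multithreaded_cycles(latencies, 1))  # one slot behaves like in-order
```

In this model the in-order pipeline needs 27 cycles, while four context slots let the short threads overtake the slow one and finish in 22 cycles. With a single slot the model degenerates to the in-order case, which hints at why the number of contexts, and the hardware that stores them, is a cost worth minimizing.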





Published In

ICCAD '14: Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design

November 2014, 801 pages

ISBN: 9781479962778

General Chair:
Yao-Wen Chang
National Taiwan Univ., Taipei, Taiwan

Sponsors

CEDA: Council on Electronic Design Automation
ACM: Association for Computing Machinery
SIGDA: ACM Special Interest Group on Design Automation
IEEE CAS

In-Cooperation

IEEE SSCS Shanghai Chapter
IEEE-EDS: Electronic Devices Society

Publisher

IEEE Press

Conference

ICCAD '14: IEEE/ACM International Conference on Computer-Aided Design

November 3 - 6, 2014

San Jose, California

Acceptance Rates

Overall Acceptance Rate 457 of 1,762 submissions, 26%


