An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

Authors:

Peng Zhang

DAC '14: Proceedings of the 51st Annual Design Automation Conference

Pages 1 - 6

https://doi.org/10.1145/2593069.2593090

Published: 01 June 2014

Abstract

High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level (RTL) specifications. Such high-throughput computation creates a correspondingly high demand for data. To keep data accesses from becoming the bottleneck, on-chip memories are used as data reuse buffers that reduce off-chip accesses. Memory partitioning is also used to increase memory bandwidth by scheduling multiple simultaneous accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of non-uniform memory partitioning. We use stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of non-uniform memory partitioning. Our novel method always achieves both the minimum memory size and the minimum number of memory banks, which no prior work can guarantee. We develop a generalized microarchitecture that decouples stencil accesses from computation, and an automated design flow that integrates our microarchitecture with the HLS-generated computation kernel to form a complete accelerator.
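The abstract does not spell out the buffer construction, so the following is only a minimal sketch of what a non-uniformly partitioned reuse buffer could look like, not the paper's implementation. It assumes a 5-point stencil, an input grid streamed once in row-major order, and a chain of FIFOs whose depths equal the gaps between the stencil's linearized offsets. The names `OFFSETS`, `fifo_sizes`, `stencil_stream`, and `stencil_direct` are hypothetical and chosen for illustration.

```python
from collections import deque

# Assumed 5-point stencil, given as (row, column) offsets from the centre.
OFFSETS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]


def fifo_sizes(width, offsets=OFFSETS):
    """Depth of each FIFO: the gap between successive linearized offsets."""
    lin = sorted((di * width + dj for di, dj in offsets), reverse=True)
    return [a - b for a, b in zip(lin, lin[1:])]


def stencil_stream(grid):
    """Sum the 5 stencil points per interior cell, reading each input once."""
    height, width = len(grid), len(grid[0])
    # One FIFO per gap, pre-filled with placeholders so each has a fixed delay.
    fifos = [deque([None] * size) for size in fifo_sizes(width)]
    # Delay between the newest tap and the centre element of the window.
    lead = max(di * width + dj for di, dj in OFFSETS)
    out = {}
    stream = (value for row in grid for value in row)
    for t, x in enumerate(stream):
        taps = [x]          # tap 0 is the element arriving this cycle
        value = x
        for fifo in fifos:  # each FIFO feeds the next one in the chain
            oldest = fifo.popleft()
            fifo.append(value)
            value = oldest
            taps.append(value)
        i, j = divmod(t - lead, width)  # grid position of the window centre
        if 1 <= i <= height - 2 and 1 <= j <= width - 2:
            out[(i, j)] = sum(taps)
    return out


def stencil_direct(grid):
    """Reference result using random access to the whole grid."""
    height, width = len(grid), len(grid[0])
    return {
        (i, j): sum(grid[i + di][j + dj] for di, dj in OFFSETS)
        for i in range(1, height - 1)
        for j in range(1, width - 1)
    }
```

For a grid of width `W`, this sketch uses four FIFOs of depths `W-1`, `1`, `1`, and `W-1`, for `2W` buffered elements in total. Each FIFO sees one push and one pop per cycle, so all five stencil points are available at once without any bank serving two reads. The depths are unequal, which is the sense in which the partitioning is non-uniform.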

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers
1. Hardware
  1. Electronic design automation

Published In

DAC '14: Proceedings of the 51st Annual Design Automation Conference

June 2014

1249 pages

ISBN: 9781450327305

DOI: 10.1145/2593069

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

EDAC: Electronic Design Automation Consortium
SIGBED: ACM Special Interest Group on Embedded Systems
SIGDA: ACM Special Interest Group on Design Automation
IEEE-CEDA

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

DAC '14

DAC '14: The 51st Annual Design Automation Conference 2014

June 1 - 5, 2014

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Bibliometrics

Article Metrics

Total Citations: 16

Total Downloads: 577

Downloads (last 12 months): 21

Downloads (last 6 weeks): 6

Reflects downloads up to 16 Nov 2024
