Abstract
FPGAs have been widely used to accelerate a variety of applications. For many data-intensive applications, memory bandwidth limits performance. 3D memories with through-silicon-via connections offer a potential solution to these latency and bandwidth limitations. In this paper, we revisit the classic 2D FFT problem to evaluate the performance of a 3D memory integrated FPGA. Fully utilizing the fine-grained parallelism in 3D memory requires data layouts that take into account the structure and organization of the memory. We propose dynamic data layouts for optimizing the performance of the 3D architecture. In 2D FFT, data are accessed in row-major order in the first phase and in column-major order in the second phase. The column-major accesses result in high memory latency and low bandwidth because of the high row activation overhead of the memory. Using the proposed dynamic data layouts, we improve memory access performance in the second phase without degrading the performance of the first phase. By exploiting parallelism in the third dimension of the memory, data parallelism can be increased to further improve performance. We adopt a model-based approach for the 3D memory and perform experiments on the FPGA to validate our analysis and evaluate the performance. Compared with the baseline architecture, our approach achieves up to \(40\times \) peak memory bandwidth utilization for column-wise FFT, resulting in approximately \(97\,\%\) improvement in throughput for the complete 2D FFT application.
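The row activation overhead described above can be illustrated with a small counting model. The following Python sketch is not the layout proposed in the paper; it is a minimal illustration under stated assumptions: a single memory bank with an open-page policy, a hypothetical row buffer of 64 elements, a 64 × 64 matrix, and a simple static 8 × 8 tiled layout as a stand-in for a layout that is aware of the memory organization. All function names and parameters are made up for this example.

```python
def count_activations(addresses, row_size):
    """Count row activations for an address stream.

    Assumes one bank with an open-page policy: a new activation is
    needed whenever the accessed memory row differs from the open one.
    """
    activations = 0
    open_row = None
    for addr in addresses:
        row = addr // row_size
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def row_major_addr(i, j, n):
    """Address of element (i, j) when the matrix is stored row by row."""
    return i * n + j


def tiled_addr(i, j, n, th, tw):
    """Address of element (i, j) when th x tw tiles are stored contiguously.

    Tiles are ordered row-major; elements inside a tile are row-major too.
    """
    tiles_per_row = n // tw
    tile_id = (i // th) * tiles_per_row + (j // tw)
    return tile_id * (th * tw) + (i % th) * tw + (j % tw)


def traverse(n, order, addr_fn):
    """Address stream for a row-wise or column-wise sweep of an n x n matrix."""
    if order == "row":
        return [addr_fn(i, j) for i in range(n) for j in range(n)]
    return [addr_fn(i, j) for j in range(n) for i in range(n)]


N = 64          # matrix dimension (assumed)
ROW_SIZE = 64   # elements per memory row (assumed)
TH, TW = 8, 8   # tile shape chosen so that one tile fills one memory row

layouts = {
    "row_major": lambda i, j: row_major_addr(i, j, N),
    "tiled": lambda i, j: tiled_addr(i, j, N, TH, TW),
}

results = {
    (name, order): count_activations(traverse(N, order, fn), ROW_SIZE)
    for name, fn in layouts.items()
    for order in ("row", "column")
}

for key, value in sorted(results.items()):
    print(key, value)
```

Under these assumptions, the row-major layout needs one activation per matrix row in the row-wise phase but one activation per element in the column-wise phase, which is the overhead the abstract refers to. The static tiled layout balances the two phases but makes the row-wise phase more expensive than in the row-major layout, so it does not reproduce the paper's claim of improving the second phase without degrading the first; it only shows why the mapping of matrix elements to memory rows matters.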
Additional information
This material was supported by the NSF under Grant Number ACI-1339756.
Cite this article
Chen, R., Singapura, S.G. & Prasanna, V.K. Optimal dynamic data layouts for 2D FFT on 3D memory integrated FPGA. J Supercomput 73, 652–663 (2017). https://doi.org/10.1007/s11227-016-1772-1
DOI: https://doi.org/10.1007/s11227-016-1772-1