Abstract
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Calculated as 5n log2(n)/t for DFT n where t is the total runtime.
References
CACTI 6.5. HP labs. http://www.hpl.hp.com/research/cacti/.
DDR3-1600 dram datasheet. MT41J256M4, Micron. http://www.micron.com/parts/dram/ddr3-sdram.
DDR3 sdram system-power calculator. Micron. http://www.micron.com/products/support/power-calc.
DesignWare library. Synopsys. http://www.synopsys.com/dw.
McPAT 1.0. HP labs. http://www.hpl.hp.com/research/mcpat/.
CUDA toolkit 5.0 performance report (2013). Nvidia. https://developer.nvidia.com/cuda-math-library.
Akin, B., Franchetti, F., & Hoe, J.C. (2014). FFTs with near-optimal memory access through block data layouts. In Proceedings of IEEE international conference on acoustics speech and signal processing (ICASSP).
Akin, B., Franchetti, F., & Hoe, J.C. (2014). Understanding the design space of dram-optimized hardware FFT accelerators. In IEEE 25th international conference on application-specific systems, architectures and processors, ASAP 2014, June 18-20, (pp. 248–255). Zurich.
Akin, B., Franchetti, F., & Hoe, J.C. (2015). Data reorganization in memory using 3d-stacked dram. In Proceedings of the 42nd international symposium on computer architecture (ISCA).
Akin, B., Hoe, J.C., & Franchetti, F. (2014). Hamlet: hardware accelerated memory layout transform within 3d-stacked DRAM. In IEEE high performance extreme computing conference, HPEC 2014, September 9-11 (pp. 1–6). Waltham.
Akin, B., Milder, P.A., Franchetti, F., & Hoe, J.C. (2012). Memory bandwidth efficient two-dimensional fast Fourier transform algorithm and implementation for large problem sizes. In Proceedings of the IEEE symposium on FCCM (pp. 188–191).
Chen, K., Li, S., Muralimanohar, N., Ahn, J.H., Brockman, J., & Jouppi, N. (2012). CACTI-3DD: architecture-level modeling for 3D die-stacked DRAM main memory. In Design, automation test in Europe (DATE) (pp. 33–38).
Chung, E.S., Milder, P.A., Hoe, J.C., & Mai, K. (2010). Single-chip heterogeneous computing: does the future include custom logic, FPGAs, and GPGPUs?. In Proceedings of the 43th IEEE/ACM international symposium on microarchitecture (MICRO).
Cooley, J.W., & Tukey, J.W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90), 297–301.
Eleftheriou, M., Fitch, B., Rayshubskiy, A., Ward, T.J.C., & Germain, R. (2005). Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: implementation and early performance measurements. IBM Journal of Research and Development, 49(2.3), 457–464.
Franchetti, F., & Püschel, M. (2011). Encyclopedia of Parallel Computing, chap, Fast fourier transform: Springer.
Frigo, M., & Johnson, S.G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, Special issue on Program Generation Optimization, and Platform Adaptation, 93(2), 216–231.
Govindaraju, N.K., Lloyd, B., Dotsenko, Y., Smith, B., & Manferdelli, J. (2008). High performance discrete Fourier transforms on graphics processors. In Proceedings of the ACM/IEEE conference on supercomputing (SC) (pp. 2:1–2:12).
Loh, G.H. (2008). 3D-stacked memory architectures for multi-core processors. In Proceedings of the 35th annual international symposium on computer architecture, (ISCA) (pp. 453–464).
Milder, P.A., Franchetti, F., Hoe, J.C., & Püschel, M. (2012). Computer generation of hardware for linear digital signal processing transforms. ACM Transactions on Design Automation of Electronic Systems, 17(2).
Milder, P.A., Hoe, J.C., & Püschel, M. (2009). Automatic generation of streaming datapaths for arbitrary fixed permutations. In Design, automation and test in Europe (DATE) (pp. 1118–1123).
Pawlowski, J.T. (2011). Hybrid memory cube (HMC). In Hotchips.
Pedram, A., McCalpin, J., & Gerstlauer, A. (2013). Transforming a linear algebra core to an FFT accelerator. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 175–184).
Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., & Rizzolo, N. (2005). SPIRAL: code generation for DSP transforms. Proceedings of IEEE, Special Issue on Program Generation Optimization, and Adaptation, 93(2), 232–275.
Rosenfeld, P., Cooper-Balis, E., & Jacob, B. (2011). Dramsim2: a cycle accurate memory system simulator. IEEE Computer Architecture Letters, 10(1), 16–19.
Loan Van, C. (1992). Computational frameworks for the fast Fourier transform: SIAM.
Weis, C., Loi, I., Benini, L., & Wehn, N. (2013). Exploration and optimization of 3-D integrated dram subsystems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(4), 597–610.
Yu, C.L., Irick, K., Chakrabarti, C., & Narayanan, V. (2010). Multidimensional DFT IP generator for FPGA platforms. IEEE Transactions on Circuits and Systems, 58(4), 755–764.
Zhu, Q., Akin, B., Sumbul, H., Sadi, F., Hoe, J., Pileggi, L., & Franchetti, F. (2013). A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In 2013 IEEE international 3D systems integration conference (3DIC) (pp. 1–7).
Zhu, Q., Vaidyanathan, K., Shacham, O., Horowitz, M., Pileggi, L., & Franchetti, F. (2012). Design automation framework for application-specific logic-in-memory blocks. In Proceedings of IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 125–132).
Acknowledgments
The work was sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement no. HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government. No official endorsement should be inferred.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Akin, B., Franchetti, F. & Hoe, J.C. FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation. J Sign Process Syst 85, 67–82 (2016). https://doi.org/10.1007/s11265-015-1018-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11265-015-1018-0