Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

A truly two-dimensional systolic array FPGA implementation of QR decomposition

Published: 29 October 2009 Publication History

Abstract

We have implemented a two-dimensional systolic array QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. QR decomposition is a key step in many DSP applications including sonar beamforming, channel equalization, and 3G wireless communication. Compared to previous work that implements Givens rotations using a one-dimensional systolic array, our implementation uses a truly two-dimensional systolic array architecture. As a result, latency scales well for larger matrices. In addition, prior work avoids divide and square root operations in the Givens rotation algorithm by using special operations such as CORDIC or special number systems such as the logarithmic number system (LNS). In contrast, our design uses straightforward floating-point divide and square root implementations, which makes it easier to be used within a larger system. In our design, the input matrix size can be configured at compile time to many different sizes, making it easily scalable to future large FPGAs or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130MHz for the IEEE single-precision floating-point format. The peak performance for a 12 × 12 input matrix is approximately 35 GFLOPs.

References

[1]
Accel-dsp. AccelWare DSP IP Toolkits. http://www.xilinx.com/ise/dspdesignprod/acceldsp/accelware/.
[2]
Altera-cordic. ALTERA CORDIC reference design. http://www.altera.com/literature/an/an263.pdf.
[3]
Boppana, D., Dhanoa, K., and Kempa, J. 2004. FPGA-based embedded processing architecture for the QRD-RLS algorithm. In Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04). IEEE, Los Alamitos, CA, 330--331.
[4]
Dohler, R. 1991. Squared Givens rotations. IMA J. Numer. Anal. 1--5.
[5]
Ercegovac, M. D., Lang, T., Muller, J.-M., and Tisserand, A. 2000. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. IEEE Trans. Comput. 49, 7, 628--637.
[6]
Fitton, M. P., Perry, S., and Jackson, R. 2004. Reconfigurable antenna processing with matrix decomposition using FPGA-based application specific integrated processors. In Proceedings of the Software Defined Radio Technical Conference. ACM, New York.
[7]
Francis, J. G. F. 1962. The QR transformation. Comput. J. 4, 4, 332--345.
[8]
Gay, M. and Phillips, I. 2005. Real-time adaptive beam forming: FPGA implementation using QR decomposition. J. Electron. Defense.
[9]
Gentleman, M. W. and T. Kung, T. H. 1981. Matrix triangularization by systolic arrays. In Proceedings of Real-Time Signal Processing IV, SPIE 298. ACM, New York, 19--26.
[10]
Givens, W. 1958. Computation of plane unitary rotations transforming a general matrix to triangular form. In J. Soc. Indust. Appl. Math. 6, 26--50.
[11]
Golub, G. H. and Van Loan, C. F. 1996. Matrix Computations. Johns Hopkins University Press, Baltimore, MD.
[12]
Hermanek, J. A. and Schier, P. R. 2004. Architecture design for FPGA implementation of finite interval CMA. In Proceedings of the 12th European Signal Processing Conference. ACM, New York, 1--4.
[13]
Householder, A. S. 1958. Unitary triangularization of a nonsymmetric matrix. J. ACM 5, 4, 339--342.
[14]
Hung, P., Fahmy, H., Mencer, O., and Flynn, M. J. 1999. Fast division algorithm with a small lookup table. In Proceedings of the 33rd Asilomar Conference on Signals, Systems and Computers. IEEE, Los Alamitos, CA, 1465--1468.
[15]
Karkooti, M., Cavallaro, J.R., and Dick, C. 2005. FPGA implementation of matrix inversion using QRD-RLS algorithm. In Proceedings of the 39rd Asilomar Conference on Signals, Systems and Computers. IEEE, Los Alamitos, CA, 1625--1629.
[16]
Louka, B. and Tchuente, M. 1988. Givens elimination on systolic arrays. In Proceedings of the 2nd International Conference on Super-Computing. ACM, New York, 638--647.
[17]
Matousek1, R., Tich, M., Pohl, Z., Kadlec, J., Softley, C., and Coleman, N. 2002. Logarithmic number system and floating-point arithmetics on FPGA. Lecture Notes in Computer Science, vol. 2438. Springer, Berlin, 175--188.
[18]
Moon, T. K. and Stirling, W. C. 2000. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall International, Upper Saddle River, NJ, 285--300.
[19]
Northeastern University Floating-Point Library. Northeastern University variable precision floating-point modules. http://www.ece.neu.edu/groups/rcl/projects/floatingpoint/.
[20]
Quixilica-qr. Quixilica Floating-Point QR Processor Core. http://www.eonic.co.kr/data/datasheet/transtech/FPGA/qx qr.pdf.
[21]
Rice, J. R. 1966. Experiments on Gram-Schmidt orthogonalization. Math. Comput. 20, 94, 325--328.
[22]
Schier, J. and Hermanek, A. 2004. Using logarithmic arithmetic to implement the recursive least squares algorithm in FPGA. In Proceedings of the 14th International Conference on Field-Programmable Logic and Applications (FPL'04). Springer, Berlin, 1149--1151.
[23]
Schier, J. and Kadlec, J. 2003. Using logarithmic arithmetic for FPGA implementation of givens rotations. In Proceedings of the 6th Baiona Workshop on Signal Processing in Communications. ACM, New York, 199--204.
[24]
Sergyienko, A. and Maslennikov, O. 2002. Implementation of givens QR-decomposition in FPGA. Lecture Notes in Computer Science, vol. 2328. Springer, Berlin, 458--465.
[25]
Sucha, P., Hanzalek, Z., Hermanek, A., and Schier, J. 2007. Scheduling of iterative algorithms with matrix operations for efficient FPGA design implementation of finite interval constant modulus algorithm. J. VLSI Signal Process. Syst. 46, 1, 35--53.
[26]
Volder, J. 1959. The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. EC-8, 3, 330--334.
[27]
Walke, R. L., M. Smith, R. W., and Lightbody, G. 1999. Architectures for adaptive weight calculation on ASIC and FPGA. In Proceedings of the 33rd Asilomar Conference on Signals, Systems and Computers. Vol. 2. IEEE, Los Alamitos, CA, 1375--1380.
[28]
Walke, R. L., M. Smith, R. W., and Lightbody, G. 2000. 20 GFLOPS QR processor on a Xilinx Virtex-E FPGA. In Proceedings of the Advanced Signal Processing Algorithms, Architectures, and Implementations X, SPIE. Vol. 4116. 300--310.
[29]
Wang, X., Braganza, S., and Leeser, M. 2006. Advanced components in the variable precision floating-point library. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '06). IEEE, Los Alamitos, CA, 249--258.
[30]
Xilinx. XILINX LogiCORE floating-point operator v3.0. http://www.xilinx.com/bvdocs/ipcenter/data sheet/floatingpointds335.pdf.
[31]
Xilinx-cordic. XILINX LogiCORE CORDIC v3.0. http://www.xilinx.com/bvdocs/ipcenter/datasheet/cordic.pdf.

Cited By

View all
  • (2024)QR-PULP: Streamlining QR Decomposition for RISC-V Parallel Ultra-Low-Power PlatformsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649210(147-154)Online publication date: 7-May-2024
  • (2024)Flexible Systolic Array Platform on Virtual 2-D Multi-FPGA PlaneProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3635035.3637285(84-94)Online publication date: 18-Jan-2024
  • (2022)Floating Point Implementation of the Improved QRD and OMP for Compressive Sensing Signal ReconstructionSensing and Imaging10.1007/s11220-022-00389-z23:1Online publication date: 26-Jun-2022
  • Show More Cited By

Recommendations

Reviews

Vladimir Botchev

Wang and Leeser describe in this paper a straightforward implementation of a QR decomposition (QRD) processor, based on Givens rotations. This specialized processor is a two-dimensional (2D) triangular semi-systolic array. The authors use their own well-tested and proven floating-point arithmetic units for the design of a "classic" square root cell. They provide some background on the algorithm and architectural details for the computing cells. A timing schedule for a particular implementation and some performance data are also shown. There are two major omissions in the paper: first, there is no comparison with other types of QRD in terms of precision, complexity, and speed; second, the authors do not even mention how their design could be used for arbitrary-sized matrices, also known as systolic scheduling. In short, the paper makes a better case for the authors' arithmetic units than for the systolic design that uses them, which is central to the paper. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Embedded Computing Systems
ACM Transactions on Embedded Computing Systems  Volume 9, Issue 1
October 2009
184 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/1596532
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 29 October 2009
Accepted: 01 February 2009
Revised: 01 January 2009
Received: 01 June 2008
Published in TECS Volume 9, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. FPGA
  2. QR decomposition
  3. floating point

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)43
  • Downloads (Last 6 weeks)8
Reflects downloads up to 01 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)QR-PULP: Streamlining QR Decomposition for RISC-V Parallel Ultra-Low-Power PlatformsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649210(147-154)Online publication date: 7-May-2024
  • (2024)Flexible Systolic Array Platform on Virtual 2-D Multi-FPGA PlaneProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3635035.3637285(84-94)Online publication date: 18-Jan-2024
  • (2022)Floating Point Implementation of the Improved QRD and OMP for Compressive Sensing Signal ReconstructionSensing and Imaging10.1007/s11220-022-00389-z23:1Online publication date: 26-Jun-2022
  • (2021)Accelerating Matrix Processing for MIMO SystemsProceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies10.1145/3468044.3468050(1-6)Online publication date: 21-Jun-2021
  • (2021)Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAsACM Transactions on Architecture and Code Optimization10.1145/344777518:3(1-25)Online publication date: 10-May-2021
  • (2021)Efficient Floating-Point Givens Rotation UnitCircuits, Systems, and Signal Processing10.1007/s00034-020-01580-x40:5(2419-2442)Online publication date: 1-May-2021
  • (2019)Neural Network Circuits and Parallel ImplementationsNeural Networks and Statistical Learning10.1007/978-1-4471-7452-3_28(829-851)Online publication date: 13-Sep-2019
  • (2018)High-Performance QR Decomposition for FPGAsProceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays10.1145/3174243.3174273(183-188)Online publication date: 15-Feb-2018
  • (2016)Distributed Gram-Schmidt orthogonalization with simultaneous elements refinementEURASIP Journal on Advances in Signal Processing10.1186/s13634-016-0322-62016:1Online publication date: 24-Feb-2016
  • (2016)Open-Source Variable-Precision Floating-Point Library for Major Commercial FPGAsACM Transactions on Reconfigurable Technology and Systems10.1145/28515079:3(1-17)Online publication date: 19-Jul-2016
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media