research-article

A truly two-dimensional systolic array FPGA implementation of QR decomposition

Authors:

Miriam LeeserAuthors Info & Claims

ACM Transactions on Embedded Computing Systems (TECS), Volume 9, Issue 1

Article No.: 3, Pages 1 - 17

https://doi.org/10.1145/1596532.1596535

Published: 29 October 2009 Publication History

Abstract

We have implemented a two-dimensional systolic array QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. QR decomposition is a key step in many DSP applications including sonar beamforming, channel equalization, and 3G wireless communication. Compared to previous work that implements Givens rotations using a one-dimensional systolic array, our implementation uses a truly two-dimensional systolic array architecture. As a result, latency scales well for larger matrices. In addition, prior work avoids divide and square root operations in the Givens rotation algorithm by using special operations such as CORDIC or special number systems such as the logarithmic number system (LNS). In contrast, our design uses straightforward floating-point divide and square root implementations, which makes it easier to be used within a larger system. In our design, the input matrix size can be configured at compile time to many different sizes, making it easily scalable to future large FPGAs or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130MHz for the IEEE single-precision floating-point format. The peak performance for a 12 × 12 input matrix is approximately 35 GFLOPs.

References

[1]

Accel-dsp. AccelWare DSP IP Toolkits. http://www.xilinx.com/ise/dspdesignprod/acceldsp/accelware/.

[2]

Altera-cordic. ALTERA CORDIC reference design. http://www.altera.com/literature/an/an263.pdf.

[3]

Boppana, D., Dhanoa, K., and Kempa, J. 2004. FPGA-based embedded processing architecture for the QRD-RLS algorithm. In Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04). IEEE, Los Alamitos, CA, 330--331.

Digital Library

[4]

Dohler, R. 1991. Squared Givens rotations. IMA J. Numer. Anal. 1--5.

[5]

Ercegovac, M. D., Lang, T., Muller, J.-M., and Tisserand, A. 2000. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. IEEE Trans. Comput. 49, 7, 628--637.

Digital Library

[6]

Fitton, M. P., Perry, S., and Jackson, R. 2004. Reconfigurable antenna processing with matrix decomposition using FPGA-based application specific integrated processors. In Proceedings of the Software Defined Radio Technical Conference. ACM, New York.

[7]

Francis, J. G. F. 1962. The QR transformation. Comput. J. 4, 4, 332--345.

[8]

Gay, M. and Phillips, I. 2005. Real-time adaptive beam forming: FPGA implementation using QR decomposition. J. Electron. Defense.

[9]

Gentleman, M. W. and T. Kung, T. H. 1981. Matrix triangularization by systolic arrays. In Proceedings of Real-Time Signal Processing IV, SPIE 298. ACM, New York, 19--26.

[10]

Givens, W. 1958. Computation of plane unitary rotations transforming a general matrix to triangular form. In J. Soc. Indust. Appl. Math. 6, 26--50.

[11]

Golub, G. H. and Van Loan, C. F. 1996. Matrix Computations. Johns Hopkins University Press, Baltimore, MD.

[12]

Hermanek, J. A. and Schier, P. R. 2004. Architecture design for FPGA implementation of finite interval CMA. In Proceedings of the 12th European Signal Processing Conference. ACM, New York, 1--4.

[13]

Householder, A. S. 1958. Unitary triangularization of a nonsymmetric matrix. J. ACM 5, 4, 339--342.

Digital Library

[14]

Hung, P., Fahmy, H., Mencer, O., and Flynn, M. J. 1999. Fast division algorithm with a small lookup table. In Proceedings of the 33rd Asilomar Conference on Signals, Systems and Computers. IEEE, Los Alamitos, CA, 1465--1468.

Digital Library

[15]

Karkooti, M., Cavallaro, J.R., and Dick, C. 2005. FPGA implementation of matrix inversion using QRD-RLS algorithm. In Proceedings of the 39rd Asilomar Conference on Signals, Systems and Computers. IEEE, Los Alamitos, CA, 1625--1629.

[16]

Louka, B. and Tchuente, M. 1988. Givens elimination on systolic arrays. In Proceedings of the 2nd International Conference on Super-Computing. ACM, New York, 638--647.

Digital Library

[17]

Matousek1, R., Tich, M., Pohl, Z., Kadlec, J., Softley, C., and Coleman, N. 2002. Logarithmic number system and floating-point arithmetics on FPGA. Lecture Notes in Computer Science, vol. 2438. Springer, Berlin, 175--188.

Digital Library

[18]

Moon, T. K. and Stirling, W. C. 2000. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall International, Upper Saddle River, NJ, 285--300.

[19]

Northeastern University Floating-Point Library. Northeastern University variable precision floating-point modules. http://www.ece.neu.edu/groups/rcl/projects/floatingpoint/.

[20]

Quixilica-qr. Quixilica Floating-Point QR Processor Core. http://www.eonic.co.kr/data/datasheet/transtech/FPGA/qx qr.pdf.

[21]

Rice, J. R. 1966. Experiments on Gram-Schmidt orthogonalization. Math. Comput. 20, 94, 325--328.

[22]

Schier, J. and Hermanek, A. 2004. Using logarithmic arithmetic to implement the recursive least squares algorithm in FPGA. In Proceedings of the 14th International Conference on Field-Programmable Logic and Applications (FPL'04). Springer, Berlin, 1149--1151.

[23]

Schier, J. and Kadlec, J. 2003. Using logarithmic arithmetic for FPGA implementation of givens rotations. In Proceedings of the 6th Baiona Workshop on Signal Processing in Communications. ACM, New York, 199--204.

[24]

Sergyienko, A. and Maslennikov, O. 2002. Implementation of givens QR-decomposition in FPGA. Lecture Notes in Computer Science, vol. 2328. Springer, Berlin, 458--465.

Digital Library

[25]

Sucha, P., Hanzalek, Z., Hermanek, A., and Schier, J. 2007. Scheduling of iterative algorithms with matrix operations for efficient FPGA design implementation of finite interval constant modulus algorithm. J. VLSI Signal Process. Syst. 46, 1, 35--53.

Digital Library

[26]

Volder, J. 1959. The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. EC-8, 3, 330--334.

[27]

Walke, R. L., M. Smith, R. W., and Lightbody, G. 1999. Architectures for adaptive weight calculation on ASIC and FPGA. In Proceedings of the 33rd Asilomar Conference on Signals, Systems and Computers. Vol. 2. IEEE, Los Alamitos, CA, 1375--1380.

[28]

Walke, R. L., M. Smith, R. W., and Lightbody, G. 2000. 20 GFLOPS QR processor on a Xilinx Virtex-E FPGA. In Proceedings of the Advanced Signal Processing Algorithms, Architectures, and Implementations X, SPIE. Vol. 4116. 300--310.

[29]

Wang, X., Braganza, S., and Leeser, M. 2006. Advanced components in the variable precision floating-point library. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '06). IEEE, Los Alamitos, CA, 249--258.

Digital Library

[30]

Xilinx. XILINX LogiCORE floating-point operator v3.0. http://www.xilinx.com/bvdocs/ipcenter/data sheet/floatingpointds335.pdf.

[31]

Xilinx-cordic. XILINX LogiCORE CORDIC v3.0. http://www.xilinx.com/bvdocs/ipcenter/datasheet/cordic.pdf.

Cited By

Kiamarzi ARossi DTagliavini G(2024)QR-PULP: Streamlining QR Decomposition for RISC-V Parallel Ultra-Low-Power PlatformsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649210(147-154)Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1145/3649153.3649210
Ueno TDel Sozzo ESano K(2024)Flexible Systolic Array Platform on Virtual 2-D Multi-FPGA PlaneProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3635035.3637285(84-94)Online publication date: 18-Jan-2024
https://dl.acm.org/doi/10.1145/3635035.3637285
Alahari RKodati SKalitkar K(2022)Floating Point Implementation of the Improved QRD and OMP for Compressive Sensing Signal ReconstructionSensing and Imaging10.1007/s11220-022-00389-z23:1Online publication date: 26-Jun-2022
https://doi.org/10.1007/s11220-022-00389-z
Show More Cited By

Index Terms

A truly two-dimensional systolic array FPGA implementation of QR decomposition

Recommendations

CORDIC-Based Enhanced Systolic Array Architecture for QR Decomposition
Special Section on RAW2014

Multiple input multiple output (MIMO) with orthogonal frequency division multiplexing (OFDM) systems typically use orthogonal-triangular (QR) decomposition. In this article, we present an enhanced systolic array architecture to realize QR decomposition ...
Efficient FPGA implementation of qr decomposition using a systolic array architecture
FPGA '08: Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays

QR decomposition is used in many signal processing applications. We have implemented a systolic array QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. It uses a truly two dimensional systolic array architecture so latency ...
Analysis of a QR Algorithm for Computing Singular Values

We extend the Golub--Kahan algorithm for computing the singular value decomposition of bidiagonal matrices to triangular matrices $R$. Our algorithm avoids the explicit formation of $R^TR$ or $RR^T$.

We derive a relation between left and right ...

Reviews

Reviewer: Vladimir Botchev

Wang and Leeser describe in this paper a straightforward implementation of a QR decomposition (QRD) processor, based on Givens rotations. This specialized processor is a two-dimensional (2D) triangular semi-systolic array. The authors use their own well-tested and proven floating-point arithmetic units for the design of a "classic" square root cell. They provide some background on the algorithm and architectural details for the computing cells. A timing schedule for a particular implementation and some performance data are also shown. There are two major omissions in the paper: first, there is no comparison with other types of QRD in terms of precision, complexity, and speed; second, the authors do not even mention how their design could be used for arbitrary-sized matrices, also known as systolic scheduling. In short, the paper makes a better case for the authors' arithmetic units than for the systolic design that uses them, which is central to the paper. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Embedded Computing Systems

ACM Transactions on Embedded Computing Systems Volume 9, Issue 1

October 2009

184 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/1596532

Issue’s Table of Contents

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 29 October 2009

Accepted: 01 February 2009

Revised: 01 January 2009

Received: 01 June 2008

Published in TECS Volume 9, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Division of Engineering Education and Centers

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

20
Total Citations
View Citations
1,305
Total Downloads

Downloads (Last 12 months)43
Downloads (Last 6 weeks)8

Reflects downloads up to 01 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Kiamarzi ARossi DTagliavini G(2024)QR-PULP: Streamlining QR Decomposition for RISC-V Parallel Ultra-Low-Power PlatformsProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649210(147-154)Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1145/3649153.3649210
Ueno TDel Sozzo ESano K(2024)Flexible Systolic Array Platform on Virtual 2-D Multi-FPGA PlaneProceedings of the International Conference on High Performance Computing in Asia-Pacific Region10.1145/3635035.3637285(84-94)Online publication date: 18-Jan-2024
https://dl.acm.org/doi/10.1145/3635035.3637285
Alahari RKodati SKalitkar K(2022)Floating Point Implementation of the Improved QRD and OMP for Compressive Sensing Signal ReconstructionSensing and Imaging10.1007/s11220-022-00389-z23:1Online publication date: 26-Jun-2022
https://doi.org/10.1007/s11220-022-00389-z
Xu JLeeser M(2021)Accelerating Matrix Processing for MIMO SystemsProceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies10.1145/3468044.3468050(1-6)Online publication date: 21-Jun-2021
https://dl.acm.org/doi/10.1145/3468044.3468050
Borbon JHuang JWong BNajjar W(2021)Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAsACM Transactions on Architecture and Code Optimization10.1145/344777518:3(1-25)Online publication date: 10-May-2021
https://dl.acm.org/doi/10.1145/3447775
Hormigo JMuñoz S(2021)Efficient Floating-Point Givens Rotation UnitCircuits, Systems, and Signal Processing10.1007/s00034-020-01580-x40:5(2419-2442)Online publication date: 1-May-2021
https://dl.acm.org/doi/10.1007/s00034-020-01580-x
Du KSwamy MDu KSwamy M(2019)Neural Network Circuits and Parallel ImplementationsNeural Networks and Statistical Learning10.1007/978-1-4471-7452-3_28(829-851)Online publication date: 13-Sep-2019
https://doi.org/10.1007/978-1-4471-7452-3_28
Langhammer MPasca BAnderson JBazargan K(2018)High-Performance QR Decomposition for FPGAsProceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays10.1145/3174243.3174273(183-188)Online publication date: 15-Feb-2018
https://dl.acm.org/doi/10.1145/3174243.3174273
Slučiak OStraková HRupp MGansterer W(2016)Distributed Gram-Schmidt orthogonalization with simultaneous elements refinementEURASIP Journal on Advances in Signal Processing10.1186/s13634-016-0322-62016:1Online publication date: 24-Feb-2016
https://doi.org/10.1186/s13634-016-0322-6
Fang XLeeser M(2016)Open-Source Variable-Precision Floating-Point Library for Major Commercial FPGAsACM Transactions on Reconfigurable Technology and Systems10.1145/28515079:3(1-17)Online publication date: 19-Jul-2016
https://dl.acm.org/doi/10.1145/2851507
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents