research-article

Architectural modifications to enhance the floating-point performance of FPGAs

Authors:

Michael J. Beauchamp,

Keith D. Underwood,

K. Scott Hemmert

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 16, Issue 2

Pages 177-187

https://doi.org/10.1109/TVLSI.2007.912041

Published: 01 February 2008

Abstract

With the density of field-programmable gate arrays (FPGAs) steadily increasing, FPGAs have reached the point where they are capable of implementing complex floating-point applications. However, their general-purpose nature has limited their use in scientific applications that require floating-point arithmetic, because floating-point operations still consume a large amount of FPGA resources. This paper considers three architectural modifications that make floating-point operations more efficient on FPGAs. The first modification embeds floating-point multiply-add units in an island-style FPGA. While these embedded units offer a dramatic reduction in area and improvement in clock rate, they are a significant architectural change and may not be justified by the market. The next two modifications target a major component of IEEE-compliant floating-point computations: variable-length shifters. The first alternative to implementing these shifters in lookup tables (LUTs) is a coarse-grained approach: embedding variable-length shifters in the FPGA fabric. These shifters offer a significant reduction in area with a modest increase in clock rate, and they are smaller and more general than embedded floating-point units. The second alternative is a fine-grained approach: adding a 4:1 multiplexer unit inside a configurable logic block (CLB), in parallel with each 4-LUT. Although this offers the smallest overall area improvement, it does offer a significant improvement in clock rate with only a trivial increase in the size of the CLB.
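In IEEE-style floating-point addition, the significand of the smaller operand is shifted right by the exponent difference (alignment), and the result is shifted left after a leading-zero count (normalization); both steps need a shifter whose distance is known only at run time. The sketch below is an illustrative Python model, not the authors' hardware: it builds a logical right shifter entirely from 4:1 multiplexers, the primitive that the fine-grained modification places beside each 4-LUT. A 4:1 multiplexer has six inputs (four data, two select), so a single 4-LUT cannot implement one by itself, and a radix-4 shifter built from such units needs roughly half as many mux levels as a radix-2 shifter built from 2:1 muxes. The function names, the bit-list representation, and the 53-bit example width are assumptions chosen for illustration.

```python
def mux4(d0: int, d1: int, d2: int, d3: int, sel: int) -> int:
    """Model a 4:1 multiplexer: return the 1-bit input chosen by the 2-bit select."""
    return (d0, d1, d2, d3)[sel]


def mux4_stages(width: int) -> int:
    """Number of 4:1-mux stages needed so every shift in 0..width-1 is reachable."""
    stages = 0
    while 4 ** stages < width:
        stages += 1
    return stages


def mux2_stages(width: int) -> int:
    """Number of 2:1-mux stages for the same shift range (binary barrel shifter)."""
    stages = 0
    while 2 ** stages < width:
        stages += 1
    return stages


def barrel_shift_right(bits: list[int], amount: int) -> list[int]:
    """Logical right shift of a bit vector (index 0 = LSB) built only from 4:1 muxes.

    Stage k consumes shift-amount bits 2k and 2k+1 and shifts by 0, 1, 2, or 3
    times 4**k positions; zeros are shifted in above the MSB.
    """
    width = len(bits)
    if not 0 <= amount < max(width, 1):
        raise ValueError("shift amount must be in [0, width)")
    out = list(bits)
    for k in range(mux4_stages(width)):
        step = 4 ** k
        sel = (amount >> (2 * k)) & 0b11  # two select bits drive this stage
        padded = out + [0] * (3 * step)  # zero fill for taps past the MSB
        out = [mux4(padded[i], padded[i + step], padded[i + 2 * step],
                    padded[i + 3 * step], sel) for i in range(width)]
    return out


def to_bits(value: int, width: int) -> list[int]:
    """Unpack an integer into a little-endian list of bits."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list[int]) -> int:
    """Pack a little-endian list of bits back into an integer."""
    return sum(b << i for i, b in enumerate(bits))


if __name__ == "__main__":
    width = 53  # e.g. a double-precision significand including the hidden bit
    x = 0b1011 << 40
    for amt in range(width):
        # Every shift distance must match Python's built-in right shift.
        assert from_bits(barrel_shift_right(to_bits(x, width), amt)) == x >> amt
    print(mux4_stages(width), mux2_stages(width))  # prints "3 6"
```

Running the module checks every shift distance against Python's `>>` operator and prints the stage counts for a 53-bit vector: 3 levels of 4:1 muxes versus 6 levels of 2:1 muxes. The model counts logic levels only; it says nothing about routing delay or the area and clock-rate figures reported in the paper.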

Index Terms

1. Computer systems organization
  1. Embedded and cyber-physical systems
  2. Real-time systems
2. Hardware

Information

Published In

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Volume 16, Issue 2

February 2008

104 pages

ISSN:1063-8210

Copyright © 2008.

Publisher

IEEE Educational Activities Department

United States

Bibliometrics

Total citations: 13
Total downloads: 0 (last 12 months: 0; last 6 weeks: 0)
Reflects downloads up to 06 Feb 2025.

Cited By

Pitchai S, Pitchai S (2023). Area-latency efficient floating point adder using interleaved alignment and normalization. Microprocessors & Microsystems, 99:C. DOI: 10.1016/j.micpro.2023.104842. Online publication date: 1-Jun-2023.
https://dl.acm.org/doi/10.1016/j.micpro.2023.104842

Pimentel J, Bohnenstiehl B, Baas B (2017). Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25:1, pp. 100-113. DOI: 10.1109/TVLSI.2016.2580142. Online publication date: 1-Jan-2017.
https://dl.acm.org/doi/10.1109/TVLSI.2016.2580142

Tang S, Lemieux G, Constantinides G, Chen D (2015). Area Optimization of Arithmetic Units by Component Sharing for FPGAs (Abstract Only). Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, p. 276. DOI: 10.1145/2684746.2689146. Online publication date: 22-Feb-2015.
https://dl.acm.org/doi/10.1145/2684746.2689146

Langhammer M, Pasca B, Constantinides G, Chen D (2015). Floating-Point DSP Block Architecture for FPGAs. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 117-125. DOI: 10.1145/2684746.2689071. Online publication date: 22-Feb-2015.
https://dl.acm.org/doi/10.1145/2684746.2689071

Moctar Y, George N, Parandeh-Afshar H, Ienne P, Lemieux G, Brisk P, Compton K, Hutchings B (2012). Reducing the cost of floating-point mantissa alignment and normalization in FPGAs. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 255-264. DOI: 10.1145/2145694.2145738. Online publication date: 22-Feb-2012.
https://dl.acm.org/doi/10.1145/2145694.2145738

Tomasi M, Vanegas M, Barranco F, Díaz J, Ros E (2012). Real-time architecture for a robust multi-scale stereo engine on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20:12, pp. 2208-2219. DOI: 10.1109/TVLSI.2011.2172007. Online publication date: 1-Dec-2012.
https://dl.acm.org/doi/10.1109/TVLSI.2011.2172007

Yu C, Smith A, Luk W, Leong P, Wilton S (2012). Optimizing floating point units in hybrid FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20:7, pp. 1295-1303. DOI: 10.1109/TVLSI.2011.2153883. Online publication date: 1-Jul-2012.
https://dl.acm.org/doi/10.1109/TVLSI.2011.2153883

Smith A, Constantinides G, Cheung P (2010). An Automated Flow for Arithmetic Component Generation in Field-Programmable Gate Arrays. ACM Transactions on Reconfigurable Technology and Systems, 3:3, pp. 1-21. DOI: 10.1145/1839480.1839483. Online publication date: 1-Sep-2010.
https://dl.acm.org/doi/10.1145/1839480.1839483

Jin Z, Pittman R, Forin A, Cheung P, Wawrzynek J (2010). Reconfigurable custom floating-point instructions (abstract only). Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, p. 287. DOI: 10.1145/1723112.1723173. Online publication date: 21-Feb-2010.
https://dl.acm.org/doi/10.1145/1723112.1723173

Parandeh-Afshar H, Verma A, Brisk P, Ienne P (2010). Improving FPGA performance for carry-save arithmetic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18:4, pp. 578-590. DOI: 10.1109/TVLSI.2009.2014380. Online publication date: 1-Apr-2010.
https://dl.acm.org/doi/10.1109/TVLSI.2009.2014380
