FPGA-Based Processor Acceleration for Image Processing Applications
Figure 1. Bandwidth/memory distribution in a Xilinx Virtex-7 FPGA, highlighting how bandwidth and computation improve nearer the datapath parts of the FPGA.
Figure 2. Illustration of possible data and task parallel decomposition of a dataflow algorithm found in image processing designs, where the number of rows indicates the level of parallelism.
Figure 3. A brief description of the design flow of a hardware and software heterogeneous system highlighting key features. More detail of the flow is contained in reference [11].
Figure 4. (a) Impact of DSP48E1 configurations on maximum achievable clock frequency for different speed grades of Kintex-7 FPGAs: fully pipelined without (NOPATDET) and with (PATDET) the PATtern DETector; multiply with no MREG (MULT_NOMREG) and with pattern detector (MULT_NOMREG_PATDET); and multiply with pre-adder and no ADREG (PREADD_MULT_NOADREG). (b) Impact of BRAM configurations on the maximum achievable clock frequency of Artix-7, Kintex-7 and Virtex-7 FPGAs for single and true-dual port RAM configurations.
Figure 5. A range of dataflow models taken from [24,25]. (a) DFG node without internal storage, called configuration ①; (b) DFG actor without internal storage t1 and constant i, called configuration ②; (c) Programmable DFG actor with internal storage t1, t2 and t3 and constants i and j, called configuration ③.
Figure 6. FPGA datapath models resulting from Figure 5. (a) Programmable ALU corresponding to configuration ①; (b) Fine-grained processor corresponding to configuration ②; (c) Coarse-grained processor corresponding to configuration ③.
Figure 7. Impact of the various datapath models ①, ②, ③ on f_max across Xilinx Artix-7, Kintex-7 and Virtex-7 FPGA families.
Figure 8. Block diagram of the FPGA-based soft core Image Processing Processor (IPPro) datapath, highlighting where relevant the fixed Xilinx FPGA resources utilised by the approach.
Figure 9. System architecture of IPPro-based hardware acceleration, highlighting data distribution and control infrastructure, FIFO configuration and Finite-State-Machine control.
Figure 10. High-level implementation of the k-means clustering algorithm: (a) Graphical view of the Orcc dataflow network; (b) Part of the dataflow network including the connections; (c) Part of the Distance.cal file showing distance calculation in RVC-CAL, where two pixels are received through an input FIFO channel, processed and sent to an output FIFO channel; (d) Compiled IPPro assembly code of Distance.cal.
Figure 11. IPPro-based hardware accelerator designs used to explore and analyse the impact of parallelism on area and performance: Single core IPPro ①, eight-way parallel SIMD IPPro ②, parallel Dual core IPPro ③ and combined Dual core 8-way SIMD IPPro ④.
Figure 12. Section execution times and ratios for each stage of the traffic sign recognition algorithm.
Figure 13. (a) The simplified IPPro assembly code of the 3 × 3 dilation operation. (b) The output result of the implemented design.
Figure 14. Stage-wise comparison of traffic sign recognition acceleration using ARM and IPPro based approaches.
Abstract
1. Introduction
- Exploration of mapping the functionality of a k-means clustering function, resulting in a possible speed-up of up to 8 times and a power efficiency (fps/W) that is 57, 28 and 1.7 times better than an ARM Cortex-A7 CPU, NVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU, respectively.
- Acceleration of the colour and morphology operations of a traffic sign recognition application, resulting in speed-ups of 4.5 and 9.6 times, respectively, on a ZedBoard.
2. Background
2.1. Accelerating Image Processing Algorithms
- Customised hardware accelerator designs in HDLs which require long development times but can be optimised in terms of performance and area.
- Application specific hardware accelerators which are generally optimized for a single function, non-programmable and created using IP cores.
- Designs created using high-level synthesis tools such as Xilinx’s Vivado HLS tool and Altera’s OpenCL compiler, which convert a C-based specification into synthesizable RTL code [15], allowing pipelining and parallelization to be explored.
- Programmable hardware accelerators in the form of vendor-specific soft processors such as Xilinx’s MicroBlaze and Altera’s NIOS II processors, and customized hard/soft processors.
2.2. Soft Processor Architectures
3. System Implementation
- Programmability: there is a need for a design methodology which includes a flexible data communication interface to exchange data. Intellectual Property (IP) cores and HLS tool [15]/OpenCL design routes increase programming abstraction but do not provide the flexible system infrastructure needed for image processing systems.
- Dataflow support: the dataflow model of computation is a recognized model for data-intensive applications. Algorithms are represented as a directed graph composed of nodes (actors) as computational units and edges as communication channels [21]. While the actors run explicitly in parallel, as decided by the user, actor functionality can be either sequential or concurrent. Current FPGA realizations exploit the concurrency of the whole design at a higher level but eliminate reprogrammability. A better approach is to keep reprogrammability while still maximizing parallelism by running actors on simple “pipelined” processors; the actors still run their code explicitly in parallel (user-specified).
- Heterogeneity: the processing features of FPGAs should be integrated with CPUs. Since dataflow supports both sequential and concurrent platforms, the challenge is to map sequential code effectively onto CPUs and parallelizable code onto the FPGA.
- Toolset availability: design tools created specifically to compile user-defined dataflow programs at higher levels to a fully reprogrammable heterogeneous platform should be available.
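The dataflow model described in the list above can be illustrated with a minimal Python sketch. It is not the IPPro or Orcc implementation; the `Fifo`, `Actor`, `run` and `demo` names are hypothetical and chosen only for this illustration. Each actor fires when a token is available on every input FIFO and writes one result token to its output FIFO.

```python
from collections import deque


class Fifo:
    """Unbounded FIFO channel, modelling one edge of the dataflow graph."""

    def __init__(self):
        self._q = deque()

    def push(self, token):
        self._q.append(token)

    def get(self):
        return self._q.popleft()

    def __len__(self):
        return len(self._q)


class Actor:
    """Graph node that fires when every input FIFO holds at least one token."""

    def __init__(self, func, inputs, output):
        self.func = func
        self.inputs = inputs
        self.output = output

    def can_fire(self):
        return all(len(f) > 0 for f in self.inputs)

    def fire(self):
        # Consume one token per input, produce one token on the output.
        args = [f.get() for f in self.inputs]
        self.output.push(self.func(*args))


def run(actors):
    """Keep firing ready actors until no actor can fire any more."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


def demo():
    # Two-stage pipeline (assumed example): scale a pixel, then clamp to 8 bits.
    src, mid, out = Fifo(), Fifo(), Fifo()
    scale = Actor(lambda p: p * 2, [src], mid)
    clamp = Actor(lambda p: min(p, 255), [mid], out)
    for pixel in [10, 100, 200]:
        src.push(pixel)
    run([scale, clamp])
    return [out.get() for _ in range(len(out))]
```

In this sketch the scheduler is sequential; on the platform described here each actor would instead run on its own processor core, with the FIFOs carrying data between them.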
High-Level Programming Environment
- Efficient FPGA-based processor design that operates at a higher operating frequency.
- Reducing the actor’s execution time by decomposing it into multiple pipelined stages, thus reducing to improve the . Shorter actors can be merged sequentially to minimise the data transfer overhead by localising data into FIFOs between processing stages.
- Vertical scaling to exploit data parallelism by mapping an actor on multiple processor cores, thus reducing () at the cost of additional system-level data distribution, control, and collection mechanisms.
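As a rough illustration of the vertical scaling point above, the following Python sketch distributes a pixel stream across several cores, applies the same actor on each, and collects the results in the original order. The round-robin policy and the `distribute`, `collect`, `run_data_parallel` and `invert` names are assumptions made for this example; the source does not specify the distribution scheme at this point.

```python
def distribute(pixels, n_cores):
    """Round-robin distribution of a pixel stream to n_cores input queues."""
    lanes = [[] for _ in range(n_cores)]
    for i, p in enumerate(pixels):
        lanes[i % n_cores].append(p)
    return lanes


def collect(lanes):
    """Interleave per-core outputs back into the original stream order."""
    n = len(lanes)
    total = sum(len(lane) for lane in lanes)
    # Element i of the stream was sent to lane i % n at position i // n.
    return [lanes[i % n][i // n] for i in range(total)]


def run_data_parallel(actor, pixels, n_cores):
    """Apply the same actor on every lane; each lane models one core."""
    lanes = distribute(pixels, n_cores)
    results = [[actor(p) for p in lane] for lane in lanes]
    return collect(results)


def invert(p):
    """Example point operation: negative of an 8-bit pixel."""
    return 255 - p
```

For example, `run_data_parallel(invert, list(range(10)), 8)` gives the same result as applying `invert` to each pixel in turn. The `distribute` and `collect` steps stand in for the extra system-level data distribution and collection that the bullet mentions as the cost of this form of parallelism.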
- No explicitly balanced actors or actions are provided by the user.
- The actors include actions which are balanced and do not depend on each other, e.g., no global variable in an actor is updated by one action and then used by another; otherwise, these would need to be decomposed into separate actors.
- The actors are explicitly balanced and only require hardware/software partitioning.
- Examination of the xdf dataflow network file and assignment and recording of the actor mapping to the processors on the network.
- Compilation of each actor’s RVC-CAL code to IPPro assembly code.
- Generation of control register values, mainly for AXI Lite Registers, and parameters required by the developed C-APIs running on the host CPU.
4. Exploration of Efficient FPGA-Based Processor Design
4.1. Exploration of FPGA Fabric for Soft Core Processor Architecture
4.2. Functionality vs. Performance Trade-Off Analysis
4.3. Image Processing Processor (IPPro)
4.4. Processor Micro-Benchmarks
5. System Architecture
Control Infrastructure
6. Case Study 1: k-Means Clustering Algorithm
6.1. High-Level System Description
6.2. IPPro-Based Hardware Acceleration Designs
6.3. Power Measurement
7. Case Study 2: Traffic Sign Recognition
Acceleration of Colour and Morphology Filter
8. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Conti, F.; Rossi, D.; Pullini, A.; Loi, I.; Benini, L. PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision. J. Signal Process. Syst. 2016, 84, 339–354.
- Lamport, L. The Parallel Execution of DO Loops. Commun. ACM 1974, 17, 83–93.
- Markov, I.L. Limits on Fundamental Limits to Computation. Nature 2014, 512, 147–154.
- Bacon, D.F.; Rabbah, R.; Shukla, S. FPGA Programming for the Masses. ACM Queue Mag. 2013, 11, 40–52.
- Gort, M.; Anderson, J. Design re-use for compile time reduction in FPGA high-level synthesis flows. In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT), Shanghai, China, 10–12 December 2014; pp. 4–11.
- Yiannacouras, P.; Steffan, J.G.; Rose, J. VESPA: Portable, scalable, and flexible FPGA-based vector processors. In Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Atlanta, GA, USA, 19–24 October 2008; pp. 61–70.
- Severance, A.; Lemieux, G.G. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, Montreal, QC, Canada, 29 September–4 October 2013; pp. 1–10.
- Andryc, K.; Merchant, M.; Tessier, R. FlexGrip: A soft GPGPU for FPGAs. In Proceedings of the 23rd International Conference on Field Programmable Logic and Applications (FPL 2013), Porto, Portugal, 2–4 September 2013; pp. 230–237.
- Cheah, H.Y.; Brosser, F.; Fahmy, S.A.; Maskell, D.L. The iDEA DSP block-based soft processor for FPGAs. ACM Trans. Reconfig. Technol. Syst. 2014, 7, 19.
- Siddiqui, F.M.; Russell, M.; Bardak, B.; Woods, R.; Rafferty, K. IPPro: FPGA based image processing processor. In Proceedings of the IEEE Workshop on Signal Processing Systems, Belfast, UK, 20–22 October 2014; pp. 1–6.
- Amiri, M.; Siddiqui, F.M.; Kelly, C.; Woods, R.; Rafferty, K.; Bardak, B. FPGA-Based Soft-Core Processors for Image Processing Applications. J. Signal Process. Syst. 2017, 87, 139–156.
- Bourrasset, C.; Maggiani, L.; Sérot, J.; Berry, F. Dataflow object detection system for FPGA-based smart camera. IET Circuits Devices Syst. 2016, 10, 280–291.
- Nugteren, C.; Corporaal, H.; Mesman, B. Skeleton-based automatic parallelization of image processing algorithms for GPUs. In Proceedings of the 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, Samos, Greece, 18–21 July 2011; pp. 25–32.
- Brodtkorb, A.R.; Dyken, C.; Hagen, T.R.; Hjelmervik, J.M.; Storaasli, O.O. State-of-the-art in Heterogeneous Computing. Sci. Program. 2010, 18, 1–33.
- Neuendorffer, S.; Li, T.; Wang, D. Accelerating OpenCV Applications with Zynq-7000 All Programmable SoC Using Vivado HLS Video Libraries; Technical Report; Xilinx Inc.: San Jose, CA, USA, 2015.
- Strik, M.T.; Timmer, A.H.; Van Meerbergen, J.L.; van Rootselaar, G.J. Heterogeneous multiprocessor for the management of real-time video and graphics streams. IEEE J. Solid-State Circuits 2000, 35, 1722–1731.
- Zhang, J.; Zhang, Z.; Zhou, S.; Tan, M.; Liu, X.; Cheng, X.; Cong, J. Bit-level Optimization for High-level Synthesis and FPGA-based Acceleration. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2010; pp. 59–68.
- Nikhil, R. Bluespec System Verilog: Efficient, correct RTL from high level specifications. In Proceedings of the Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE ’04), San Diego, CA, USA, 23–25 June 2004; pp. 69–70.
- Kapre, N. Custom FPGA-based soft-processors for sparse graph acceleration. In Proceedings of the 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Toronto, ON, Canada, 27–29 July 2015; pp. 9–16.
- LaForest, C.E.; Steffan, J.G. Octavo: An FPGA-centric processor family. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2012; pp. 219–228.
- Sutherland, W.R. On-Line Graphical Specification of Computer Procedures. Technical Report, DTIC Document. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1966.
- Eker, J.; Janneck, J. CAL Language Report; Tech. Rep. UCB/ERL M; University of California: Berkeley, CA, USA, 2003; Volume 3.
- Yviquel, H.; Lorence, A.; Jerbi, K.; Cocherel, G.; Sanchez, A.; Raulet, M. Orcc: Multimedia Development Made Easy. In Proceedings of the 21st ACM International Conference on Multimedia (MM ’13), Barcelona, Spain, 21–25 October 2013; pp. 863–866.
- So, H.K.H.; Liu, C. FPGA Overlays. In FPGAs for Software Programmers; Springer: Berlin, Germany, 2016; pp. 285–305.
- Gupta, S. Comparison of Different Data Flow Graph Models; Technical Report; University of Stuttgart: Stuttgart, Germany, 2010.
- Kelly, C.; Siddiqui, F.M.; Bardak, B.; Woods, R. Histogram of oriented gradients front end processing: An FPGA based processor approach. In Proceedings of the 2014 IEEE Workshop on Signal Processing Systems (SiPS), Belfast, UK, 20–22 October 2014; pp. 1–6.
- Schleuniger, P.; McKee, S.A.; Karlsson, S. Design Principles for Synthesizable Processor Cores. In Proceedings of the 25th International Conference on Architecture of Computing Systems (ARCS); Springer: Berlin/Heidelberg, Germany, 2012; pp. 111–122.
- García, G.J.; Jara, C.A.; Pomares, J.; Alabdo, A.; Poggi, L.M.; Torres, F. A survey on FPGA-based sensor systems: Towards intelligent and reconfigurable low-power sensors for computer vision, control and signal processing. Sensors 2014, 14, 6247–6278.
- Mogelmose, A.; Trivedi, M.M.; Moeslund, T.B. Vision-Based Traffic Sign Detection and Analysis for Intelligent Driver Assistance Systems: Perspectives and Survey. IEEE Trans. Intell. Transp. Syst. 2012, 13, 1484–1497.
Operation Type | Domain | Output Depends on | Memory Pattern | Execution Pattern | Examples |
---|---|---|---|---|---|
Point and Line | Spatial | Single input pixel | Pipelined | One-to-one | Intensity change by factor, Negative image-inversion. |
Area/Local | Spatial | Neighbouring pixels | Coalesced | Tree | Convolution functions: Sobel, Sharpen, Emboss. |
Geometric | Spatial | Whole frame | Recursive non-coalesced | Large reduction tree | Rotate, Scale, Translate, Reflect, Perspective and Affine. |
Product | Family | Part Number | BRAM (18 Kb Each) | DSP48E1 | GMAC/s | BRAM/DSP |
---|---|---|---|---|---|---|
Standalone | Artix-7 | XC7A200T | 730 | 740 | 929 | 0.99 |
Standalone | Kintex-7 | XC7K480T | 1910 | 1920 | 2845 | 0.99 |
Standalone | Virtex-7 | XC7VX980T | 3000 | 3600 | 5335 | 0.83 |
Zynq SoC | Artix-7 | XC7Z020 | 280 | 220 | 276 | 1.27 |
Zynq SoC | Kintex-7 | XC7Z045 | 1090 | 900 | 1334 | 1.21 |
Addressing Mode | Data Abstraction | Supported Instructions |
---|---|---|
FIFO handling | Stream access | get, push |
Register File–FIFO | Stream and randomly accessed data | addrf, subrf, mulrf, orrf, minrf, maxrf etc. |
Register File–Register File | Randomly accessed data | str, add, mul, mulacc, and, min, max etc. |
Kernel Memory–FIFO | Stream and fixed values | addkm, mulkm, minkm, maxkm etc. |
Resource | IPPro | Graph-SoC [19] | FlexGrip 8 SP * [8] | MicroBlaze |
---|---|---|---|---|
FFs | 422 | 551 | 12,972 (= 103,776/8) | 518 |
LUTs | 478 | 974 | 8916 (= 71,323/8) | 897 |
BRAMs | 1 | 9 | 15 (= 120/8) | 4 |
DSP48E1 | 1 | 1 | 19.5 (= 156/8) | 3 |
Stages | 5 | 3 | 5 | 5 |
Freq. (MHz) | 337 | 200 | 100 | 211 |
(a)
Micro-benchmark | MicroBlaze Exec. Time (µs) | IPPro Exec. Time (µs) | Speed-up |
---|---|---|---|
FPGA Fabric | Kintex-7 | Kintex-7 | |
Freq. (MHz) | 287 | 337 | |
Convolution | 0.60 | 0.14 | 4.41 |
Degree-2 Polynomial | 5.92 | 3.29 | 1.80 |
5-tap FIR | 47.73 | 5.34 | 8.94 |
Matrix Multiplication | 0.67 | 0.10 | 6.7 |
Sum of Abs. Diff. | 0.73 | 0.77 | 0.95 |
Fibonacci | 4.70 | 3.56 | 1.32 |
(b)
Resource | MicroBlaze | IPPro | Ratio |
---|---|---|---|
FFs | 746 | 422 | 1.77 |
LUTs | 1114 | 478 | 2.33 |
BRAMs | 4 | 2 | 2.67 |
DSP48E1 | 0 | 1 | 0.00 |
Design | Acceleration Paradigm | Mapping | Data Parallelism | Task Parallelism |
---|---|---|---|---|
① | Single core IPPro | Single actor | No | No |
② | 8-way SIMD IPPro | Single actor | Yes | No |
③ | Dual core IPPro | Dual actor | No | Yes |
④ | Dual core 8-way SIMD IPPro | Dual actor | Yes | Yes |
Single Actor | ① Single Core IPPro: Exec. (ms) | ① Single Core IPPro: fps | ② 8-Way SIMD IPPro: Exec. (ms) | ② 8-Way SIMD IPPro: fps |
---|---|---|---|---|
Distance Calculation | 118.21 | 8.45 | 23.37 | 42.78 |
Averaging | 145.17 | 6.88 | 27.02 | 37.00 |
k-Means Acceleration | Area: LUT | Area: FF | Area: BRAM | Area: DSP | Performance: Exec. (ms) | Performance: fps |
---|---|---|---|---|---|---|
① Combined stages using Single-core IPPro | 4736 | 5197 | 4.5 | 1 | 263.38 | 3.8 |
② Combined stages using 8-way SIMD IPPro | 10,941 | 12,279 | 18.5 | 8 | 50.39 | 19.8 |
③ Dual-core IPPro | 4987 | 5519 | 4.5 | 2 | 163.2 | 6 |
④ Dual 8-way SIMD IPPro | 13,864 | 16,106 | 18.5 | 16 | 35.9 | 28 |
Software implementation on ARM Cortex-A7 | - | - | - | - | 286 | 3.5 |
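The relationship between the execution times and frame rates in this table, and the "up to 8 times" speed-up quoted in the Introduction, can be checked with a short Python sketch. It assumes that fps is simply 1000 divided by the per-frame execution time in milliseconds and that speed-up is measured against the ARM Cortex-A7 software implementation; the dictionary keys and function names are hypothetical labels for this example.

```python
# Per-frame execution times in ms, copied from the table above.
EXEC_MS = {
    "single-core IPPro": 263.38,
    "8-way SIMD IPPro": 50.39,
    "dual-core IPPro": 163.2,
    "dual 8-way SIMD IPPro": 35.9,
    "ARM Cortex-A7 (software)": 286.0,
}


def fps(exec_ms):
    """Frames per second, assuming one frame per execution."""
    return 1000.0 / exec_ms


def speedup(baseline_ms, accelerated_ms):
    """How many times faster the accelerated design is than the baseline."""
    return baseline_ms / accelerated_ms


def summarise(exec_ms=EXEC_MS, baseline="ARM Cortex-A7 (software)"):
    """Map each design to (fps, speed-up over the baseline)."""
    base = exec_ms[baseline]
    return {
        name: (round(fps(ms), 1), round(speedup(base, ms), 2))
        for name, ms in exec_ms.items()
    }
```

Under these assumptions the dual 8-way SIMD design works out at roughly 27.9 fps and a speed-up of about 7.97 over the ARM implementation, in line with the rounded values of 28 fps and 8 times reported in the text.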
Impl. | Power: Static (mW) | Power: Dyn. (mW) | Power: Tot. (mW) | Freq. (MHz) | Exec. (ms) | fps | Power Efficiency (fps/W) | TU (×) | Efficiency (fps/TU) (×) | Efficiency (fps/W/TU) (×) |
---|---|---|---|---|---|---|---|---|---|---|
③ | 118 | 18 | 136 | 100 | 163.2 | 6 | 44.1 | 591 (9%) | 1.0 | 74.6 |
④ | 122 | 92 | 214 | 100 | 35.9 | 28 | 130.8 | 1564 (23%) | 1.8 | 83.6 |
Plat. | Impl. | Power: Static (W) | Power: Dyn. (W) | Power: Tot. (W) | Freq. (MHz) | Exec. (ms) | fps | Power Effic. (fps/W) | TU (×) | Efficiency (fps/TU) (×) | Efficiency (fps/W/TU) (×) |
---|---|---|---|---|---|---|---|---|---|---|---|
FPGA | ③ | 0.15 | 0.03 | 0.19 | 337 | 48.43 | 21 | 114.1 | 0.6 (9%) | 3.6 | 193.1 |
FPGA | ④ | 0.16 | 0.15 | 0.31 | 337 | 10.65 | 94 | 300.3 | 1.0 (6%) | 6.0 | 192.0 |
GPU | OpenCL | 37 | 27 | 64 | 1127 | 1.19 | 840 | 13.1 | 1.3 (26%) | 63.1 | 9.8 |
GPU | CUDA | 37 | 22 | 59 | 1127 | 1.58 | 632 | 10.7 | 1.2 (24%) | 51.5 | 8.7 |
eGPU | Mali | 0.12 | - | 1.56 | 600 | 3.69 | 271 | 173 | - | - | - |
eCPU | Cortex | 0.25 | - | 0.67 | 1200 | 286 | 3.49 | 5.2 | - | - | - |
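The power-efficiency ratios quoted in the Introduction (57, 28 and 1.7 times) can be related to the fps/W column of this table with the short Python sketch below. Which table row backs each quoted ratio is an assumption made here: the 28 times figure appears to match the CUDA row rather than the OpenCL row. The dictionary keys and the `efficiency_gain` name are hypothetical labels for this example.

```python
# Power efficiency in fps/W, copied from the table above.
FPS_PER_WATT = {
    "FPGA design 4 (dual 8-way SIMD IPPro)": 300.3,
    "eCPU (Cortex)": 5.2,
    "GPU (CUDA)": 10.7,
    "GPU (OpenCL)": 13.1,
    "eGPU (Mali)": 173.0,
}


def efficiency_gain(reference, other, table=FPS_PER_WATT):
    """Ratio of the reference platform's fps/W to another platform's fps/W."""
    return table[reference] / table[other]


def gains_over(reference="FPGA design 4 (dual 8-way SIMD IPPro)", table=FPS_PER_WATT):
    """Efficiency gain of the reference over every other platform, to 2 d.p."""
    return {
        name: round(efficiency_gain(reference, name, table), 2)
        for name in table
        if name != reference
    }
```

With the table values, the gains come out at about 57.75 over the embedded CPU, 28.07 over the CUDA implementation and 1.74 over the embedded GPU, which agree with the quoted 57, 28 and 1.7 if those figures were rounded down.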
Description | Colour | Morphology |
---|---|---|
No. of cores | 32 | 16 |
FF | 41,624 (39%) | 43,588 (41%) |
LUT | 29,945 (56%) | 33,545 (63%) |
DSP48E1 | 32 (15%) | 48 (22%) |
BRAM | 60 (42%) | 112 (80%) |
Cycles/Pixel | 160 | 26 |
Exec. (ms) | 19.7 (8.7 *) | 41.3 (18.3 *) |
Speed-up | 4.5× (10.3× *) | 9.6× (21.75× *) |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Siddiqui, F.; Amiri, S.; Minhas, U.I.; Deng, T.; Woods, R.; Rafferty, K.; Crookes, D. FPGA-Based Processor Acceleration for Image Processing Applications. J. Imaging 2019, 5, 16. https://doi.org/10.3390/jimaging5010016