


VHDL Generator for A High Performance
Convolutional Neural Network FPGA-Based
Accelerator
Muhammad K.A. Hamdan and Diane T. Rover
Electrical and Computer Engineering Department
Iowa State University of Science and Technology
Ames, IA United States
{mhamdan, drover}@iastate.edu

Abstract — The Convolutional Neural Network (CNN) has proven to be a highly accurate and effective algorithm, used in a variety of applications such as handwritten digit recognition, visual recognition, and image classification. State-of-the-art CNNs are computationally intensive; however, their parallel and modular nature makes platforms like FPGAs well suited for the acceleration process. A typical CNN takes a very long development round on FPGAs, hence in this paper we propose a tool which allows developers, through a configurable user interface, to automatically generate VHDL code for their desired CNN model. The generated code or architecture is modular, massively parallel, reconfigurable, scalable, fully pipelined, and adaptive to different CNN models. We demonstrate the automatic VHDL generator and its adaptability by implementing a small-scale CNN model, "LeNet", and a large-scale one, "AlexNet". The parameters of small-scale models are automatically hard-coded as constants (part of the programmable logic) to overcome the memory bottleneck issue. On a Xilinx Virtex-7 running at 200 MHz, the system is capable of processing up to 125K images/s of size 28×28 for LeNet, and it achieves a peak performance of 611.52 GOP/s and 414 FPS for AlexNet.

Keywords — VHDL generator; CNNs; AlexNet; parallelism; reconfigurable; adaptability; pipeline; scalable; FPGA.

I. INTRODUCTION
In the past years, machine learning has advanced like never before, with many algorithms proposed to solve problems like visual recognition and image classification. The Convolutional Neural Network, a popular type of neural network inspired by the visual cortex of the brain and by the mathematical operation called convolution, has gained popularity in applications such as image classification [1], data analysis, visual object recognition, and self-driving cars [2]. The interest in CNNs is driven by the high performance and accuracy they have shown. For example, the AlexNet model won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012, achieving a top-5 accuracy of 84.7%. The popularity of CNNs would not have been possible without continually developed models such as LeNet [3], AlexNet [4], VGG, GoogLeNet, and ResNet, as well as the availability of powerful computing platforms.

Large CNNs are computationally expensive, requiring over a billion operations per image, which makes general-purpose processors inefficient at implementing CNN models; thus, platforms like GPUs, ASICs, and FPGAs have attracted a lot of attention because of their high performance. FPGAs in particular seem to fit the job well because they are reconfigurable, take advantage of the inherent parallelism in CNNs, and are power efficient. Indeed, many CNN accelerators have been proposed for different purposes and with different techniques and methodologies [5][6][7][8][9]. CNNs are known for their frequent data accesses, computational complexity, and very long development round, hence an efficient implementation is required. In this paper, we propose a GUI-based tool to significantly speed up the process of CNN development; we also highly optimize the computation component and efficiently manage memory accesses.

The key contributions of this work are as follows:

A. A VHDL generator with the following features:
• Easy configuration, support for externally pre-configured models, and support for model checking and validation
• Flexibility, scalability, and adaptability to small and large-scale CNN models
• A test-bench for testing and simulation purposes
• Compared to the HLS-based work in [10], our generated optimized implementation achieved a speedup of 6.1x
• With standard HLS tools such as Vivado HLS, users have to go through the lengthy development process of programming in a high-level language. By contrast, with our tool users only have to configure the model of their choice, without doing any programming.
B. A scalable, reconfigurable, fully-pipelined, and massively parallel accelerator
C. The VHDL generator was tested on two benchmark models (LeNet and AlexNet) as well as other hand-tuned models. The system can process up to 125K images/s for LeNet and achieved a peak performance of 611.52 GOP/s for AlexNet
D. An executable of the VHDL generator will be available at: https://github.com/mhamdan91/cnn_vhdl_generator

The rest of this paper is organized as follows. Section II briefly reviews convolutional neural networks. Section III describes the VHDL generation tool and its architecture. Section IV presents related work. Section V describes the hardware architecture. Section VI describes our implementation details. Section VII presents the conclusion and future work.

Fig. 1 A visualization of a CNN layer that arranges its neurons in three dimensions (width, height, depth). The 3D input volume is transformed into a 3D output volume of neuron activations in every layer [19]

Fig. 2 AlexNet architecture: the ImageNet 2012 winning CNN model

II. BACKGROUND
A Convolutional Neural Network consists of various layers: convolutional and fully-connected layers, where most of the operations are performed; pooling layers, which are used to avoid overfitting; and a classification layer, which classifies the final results into classes. A typical layer consists of 3D volumes of neurons, as shown in Figure 1 (width, height, and depth, where "depth" refers to the number of feature maps or activation maps, not to the number of layers in the CNN).

CNNs typically start with a convolutional layer, which takes the input image and decomposes it into different feature maps such as edges, lines, and curves. Multiple processes are applied to the extracted feature maps throughout the entire network. Extracted feature maps from the last layer (typically a fully connected layer) are classified into output classes using a classifier such as the SoftMax classifier. For example, the architecture of AlexNet [4], shown in Figure 2, classifies 224×224 colored images into 1000 different output classes.
A. Convolutional Layer
The convolutional layer performs a mathematical operation called convolution, which involves 3-dimensional multiply-accumulate (MACC) operations. As shown in Figure 3, a kernel of weights is multiplied by a set of inputs (the receptive region), and the weighted inputs are summed together. A bias, whose value is usually 1, is added to the summed weighted inputs to ensure that neurons fire. An activation function is then applied to the accumulated sum to limit the output to a reasonable range, and the results are forwarded to the corresponding neurons in the next layer. The output size of a feature map is computed as shown in Equation 1.

$Output_{size} = \dfrac{Input_{width} - Filter_{size} + 2 \times Padding}{Stride} + 1$  (1)

Fig. 3 Right: A mathematical representation of the convolution operation followed by a nonlinearity function. Left: An input of size 7×7×1 with padding of 1, a stride of 2, and a receptive field of 3×3 is convolved with a filter (in red) of size 3×3×1; the summed weighted inputs, plus the bias, are stored in the 3×3×1 output neurons (in green) [19]
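As an illustration of Equation 1 and the sliding-window operation (a minimal NumPy sketch of ours, not code from the tool):

```python
import numpy as np

def conv_output_size(input_width, filter_size, padding, stride):
    # Equation 1: (Input_width - Filter_size + 2 * Padding) / Stride + 1
    return (input_width - filter_size + 2 * padding) // stride + 1

def conv2d(image, kernel, bias=1.0, padding=0, stride=1):
    """Naive single-channel convolution with bias addition and ReLU."""
    k = kernel.shape[0]
    out = conv_output_size(image.shape[0], k, padding, stride)
    padded = np.pad(image, padding)
    result = np.zeros((out, out))
    for r in range(out):
        for c in range(out):
            region = padded[r * stride:r * stride + k, c * stride:c * stride + k]
            result[r, c] = np.sum(region * kernel) + bias    # weighted sum + bias
    return np.maximum(result, 0)                             # ReLU = max(0, x)

print(conv_output_size(28, 5, 0, 1))   # a 28x28 image and 5x5 filter -> 24x24 map
```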
B. Activation Function
The activation function is used to ensure nonlinearity in the network as well as to discard unnecessary information. Among the various activation functions, Sigmoid, Tanh, and ReLU are the most commonly used. The Sigmoid and Tanh activation functions require longer training time in CNNs [4], unlike ReLU, which converges faster during training. ReLU is defined as a zero-thresholding operation: ReLU = max(0, x).
C. Pooling Layer
Spatial pooling is a form of nonlinear subsampling that is utilized to reduce the feature dimensions as we go deeper in the network. Max and average pooling are the most common pooling methods. In max pooling, as adopted in AlexNet, a set of neurons is subsampled based on the size of the pooling filter: the maximum neuron value in that filter is passed to the corresponding neuron in the next layer, and the rest of the neurons are dropped, as shown in Equation 2 ($Filter_{size} = 2 \times 2$). In average pooling, the value forwarded to the corresponding neuron in the next layer is the average of all neurons in the filter, as shown in Equation 3.

$Passed_{neuron} \rightarrow \max(2x, x, 0.5x, 3x) = 3x$  (2)

$Passed_{neuron} \rightarrow \operatorname{avg}(x, 2x, 3x, 4x, 5x) = 3x$  (3)
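Both pooling variants reduce to the same sliding-window loop with a different reduction operator, as in this illustrative sketch (the naming is ours):

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a size x size window over the feature map; keep the max
    (Equation 2) or the average (Equation 3) of each window."""
    out = (fmap.shape[0] - size) // stride + 1
    op = np.max if mode == "max" else np.mean
    result = np.zeros((out, out))
    for r in range(out):
        for c in range(out):
            window = fmap[r * stride:r * stride + size, c * stride:c * stride + size]
            result[r, c] = op(window)
    return result

fmap = np.array([[2., 1., 0., 1.],
                 [0.5, 3., 2., 0.],
                 [1., 0., 4., 1.],
                 [2., 1., 0., 3.]])
print(pool2d(fmap, mode="max"))   # [[3. 2.] [2. 4.]]
```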
D. Fully-Connected Layer
The fully connected (FC) layer usually comes before the classification layer, and it comprises the highest number of parameters because every neuron in this layer is connected to all neurons in the previous layer, with the parameters residing on the connections between those neurons. Inputs in this layer are multiplied by the corresponding weights and the biases are added, respectively; then the nonlinearity is applied as shown in Equation 4.

$OUT_{neuron}^{i} = \sum_{j=1}^{K_{input}} INPUT^{i} \times weight^{ij} + Bias^{i}$  (4)

The output of the nonlinearity in the last FC layer is passed to a classifier, such as the SoftMax classifier, which converts the output neurons to probabilities in the range (0, 1) for the classification layer. The classification layer ("final layer") compares the labels of the top probabilities from the SoftMax classifier with the actual labels of the available classes, thus giving the accuracy of the model.
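A compact sketch of an FC layer per Equation 4 followed by a SoftMax classifier (illustrative Python; the names are ours):

```python
import numpy as np

def fully_connected(x, W, b):
    # Equation 4: every output neuron is the weighted sum of all inputs plus a
    # bias, followed by the nonlinearity (ReLU here).
    return np.maximum(x @ W + b, 0)

def softmax(logits):
    # SoftMax classifier: maps output neurons to probabilities in (0, 1).
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

x = np.array([0.5, 1.0, 0.25])            # toy input neurons
W = np.random.rand(3, 4)                  # toy 3-input, 4-output FC layer
b = np.zeros(4)
print(softmax(fully_connected(x, W, b)))  # four class probabilities summing to 1
```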

III. VHDL GENERATION TOOL ARCHITECTURE
The tool produces an optimized, parameterized implementation of a desired CNN model through a series of processes. We developed a VHDL-based library to build the architecture of the specified model through a GUI. Figure 4 shows the top-level tool flow for generating VHDL code.

Fig. 4 VHDL generation tool flow: model specifications are imported from a text file or configured manually via the GUI; the configuration is validated (a failed check produces an error message and the incorrect configuration must be fixed) and can optionally be saved to a file; parameters are then imported from a text file and processed; models that meet the small-scale constraints have their parameters hard-coded; finally the VHDL code, optionally with a test-bench, is generated and stored on disk

The main building blocks of the tool are model configuration and validation, and parameters inclusion. These blocks are detailed as follows.

A. Model Configuration and Validation
In this block, developers can load a pre-configured model from a text file that abides by a particular configuration syntax, or they can choose to configure their model manually using the GUI. Once configuration is complete, the user is prompted to validate the configuration to ensure that it meets a standard CNN configuration. On an unsuccessful validation check (incorrect configuration), a message is displayed informing the user of the changes they have to make to fix the errors. On a successful validation check, the user can proceed to the next stage, which is parameters inclusion. The model configurations supported by the current version of the tool are listed in Table I. The syntax of the configuration file is shown in Table II, and the parameters configuration is shown in Table III.

Table I Tool-supported configurations
Image Size             User-defined
Output Classifier      SoftMax
Filter Size            User-defined
Feature maps           User-defined
No. of Classes         User-defined
Activation Functions   ReLU, Sigmoid, Tanh, Average and Max Pool
Layer type             Convolution, Pooling, FC, LRN

Table II Example configuration syntax for a conv → pool → FC network
Row_count,3
Image_Size,28
Image_type,Colored,24
No_Classes,10
Classifier,SoftMax
Convolution,2,2,0,2,ReLU,
Pooling,2,2,0,2,Max-Pool,
Fully-Connected,4,1,0,1,ReLU,

Row_count represents the number of layers; Image_Size is the input image dimension; Image_type specifies whether the image is colored or grayscale, where 24 is the input data width for colored images and 8 for grayscale; No_Classes represents the number of output classes; and Classifier is the classifier function. In a layer line such as Convolution,2,2,0,2,ReLU, the fields respectively represent the layer name, the number of output feature maps, the filter size, the padding, the stride size, and the activation function; the same syntax applies to pooling and fully connected layers.
Table III Parameters (weights and biases) for the configuration in Table II

Convolution,1
Filter_1_1,0001,0010,0011,0010,1,$
Filter_1_2,0001,0010,0011,0010,1,$
Filter_1_3,0001,0010,0011,0010,1,$
Filter_2_1,0001,0010,0011,0010,0,$
Filter_2_2,0001,0010,0011,0010,0,$
Filter_2_3,0001,0010,0011,0010,0,$
Pooling,1
Fully-Connected,1
Filter_1_1,0101,1,$
Filter_1_2,0101,1,$
Filter_2_1,0111,0,$
Filter_2_2,0101,0,$
Filter_3_1,0101,1,$
Filter_3_2,0101,1,$
Filter_4_1,0111,0,$
Filter_4_2,0101,0,$

Each filter line follows the pattern Filter_fmap_kernel,weight,...,bias,$. The convolutional layer has 2 feature maps in our example, and since the image is colored there are 3 different kernels for each output feature map; the bias value is the same for a distinct feature map. The pooling layer has no parameters. For the fully-connected layer, the weights are 2×4×1 and the biases are 4×2; biases are optional, depending on whether the trained model uses them.
B. Parameters Inclusion and VHDL File Generation
Parameters are handled according to the specified CNN model: for small-scale models, which typically comprise fewer than 100K parameters, the parameters are consolidated within the generated VHDL code as part of the programmable logic (PL); otherwise, the parameters are stored in an external memory source.

Parameters must be formatted according to the model configuration in order to have a successful VHDL generation. The user should specify the layer name, list all kernels used in each feature map along with their weights, specify the bias values, and end each line with a dollar sign, as shown in Table III. The tool supports binary, decimal, and hexadecimal representations of parameters. The sizes of weights and biases are specified in the GUI, so for our example the tool expects a weight size of 4 bits and a bias size of 1 bit. If the parameters file does not correspond to the configuration, an error message is displayed to the user highlighting the error. Figure 5 illustrates the options given to incorporate parameters.

Fig. 5 Parameters inclusion and storage type selection
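For small-scale models, hard-coding means each kernel in the parameters file ultimately becomes a constant in the generated VHDL. The emitter below is purely our own illustration of that idea; the type name weights_t and the exact declaration format are hypothetical, not the tool's actual output:

```python
def emit_vhdl_kernel(line):
    """Turn a Table III line like 'Filter_1_1,0001,0010,0011,0010,1,$' into a
    VHDL constant declaration (illustrative sketch; weights_t is a made-up
    array type, not taken from the paper)."""
    fields = line.strip().split(",")
    name, weights, bias = fields[0], fields[1:-2], fields[-2]
    elems = ", ".join(f'"{w}"' for w in weights)
    return f'constant {name} : weights_t := ({elems});  -- bias = "{bias}"'

print(emit_vhdl_kernel("Filter_1_1,0001,0010,0011,0010,1,$"))
# constant Filter_1_1 : weights_t := ("0001", "0010", "0011", "0010");  -- bias = "1"
```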
IV. RELATED WORK
The main drawback of accelerating CNNs on FPGAs is the long development round. A few implementations have tackled this issue. For example, in [11] the authors proposed an FPGA framework, based on the Caffe framework, to map CNN layers to an FPGA platform. The framework uses Xilinx SDAccel to map CNN layers and generate the bit-stream file. To optimize computations, they increase the number of hardware units used to process a task, which in turn increases hardware resources linearly, making it an inefficient optimization method.

HLS tools such as Vivado HLS [12] are a good escape from low-level programming; however, such tools are not highly optimized to take full advantage of the available parallelism in CNNs. In [10] the authors use Vivado HLS 2014 to implement a 5-layer accelerator for the MNIST dataset. Their system is capable of processing about 20.8K images/s, while our system is capable of processing up to 125K images/s.

HDL generation for CNNs was previously proposed: in [13] the authors use a high-level descriptive language to generate Verilog code for CNN models. They generate each layer independently by specifying its parameters, then combine all of the layers to obtain a complete accelerator. They do not state anywhere that they store parameters on-chip or hard-code them, implying that they use an external memory even for small-scale models, which is not an efficient way to handle parameters. Their accelerator can achieve 222.1 GOP/s for AlexNet, while ours can achieve 611.52 GOP/s for the same model.
In [14] the authors avoid loading parameters from an external memory source by storing them in on-chip memory. In their implementation, they adopt a parallel-serial style to increase the throughput; however, this strategy does not take full advantage of the available parallelism in the CNN, and furthermore the different layers do not work concurrently. They implemented a small-scale neural network that performs digit recognition on a Xilinx XC7Z045. At 172 MHz, their system is capable of processing about 70K 28×28 images per second. In our implementation, we hard-code parameters as part of the PL to maximize the utilization of hardware resources and overcome memory bandwidth limitations. In our highly parallelized implementation, the system is capable of processing up to 125K 28×28 images/s while running at 200 MHz.

Optimizing computation in CNNs can significantly improve the overall performance of a CNN model. Many attempts have been made to optimize computation through various parallelism approaches. The authors in [15][16] use parallelism only in convolution operations and output feature maps. This work implements three types of parallelism: parallelism in convolution operations, parallelism in input feature maps, and parallelism in output feature maps. In addition, the design in this work is implemented in a pipelined style, which helped increase the throughput of the system, achieving a peak performance of 611.52 GOP/s and 414 FPS (224×224) for AlexNet.

V. HARDWARE ARCHITECTURE
Figure 6 describes the top-level architecture of the proposed system. The same architecture is used in small and large-scale models, except that in small-scale models we do not use an external memory to store parameters.

Fig. 6 Top-level architecture of the system

A. Convolutional Layer Architecture
The process in this layer begins by streaming input data to a sliding window, where the sliding window has the size of the weights kernel and is used to perform the convolution operation. The convolution operation is fully pipelined and parallelized, where all multiplication operations are performed at once for a complete receptive region and for different feature maps. An adder tree is used to add up the results, followed by a bias-addition stage. The activation function (ReLU), a simple zero-thresholding operation, is directly applied to all extracted feature maps; the output from the ReLU (intermediate values) is then stored in buffers which feed the next layer. Figure 7 shows the processing element (PE) details.

Fig. 7 Example processing element details in a convolutional layer for a 3×3 filter
B. Pooling Layer Architecture
The pooling layer takes the values stored in the buffers of the previous layer and applies a sliding window that has the size of the pooling filter, with a step size based on the specified stride value. This sliding window is similar to the one in the convolutional layer, except that the performed operation is max or average pooling and no weight multiplication is performed. Details of the pooling layer architecture are described in Figure 8.

Fig. 8 Max pooling operation architecture

C. Fully-Connected Layer Architecture
The architecture of the FC layer is similar to the convolutional layer architecture, but convolution is replaced with a matrix multiplication operation. The first FC layer in AlexNet requires about 37.7 million multiplication operations: the input vector is of size (1 × 9216) and the weights matrix is of size ((6×6×256) × 4096). To perform such a massive matrix multiplication, we divide the input vector into small, equal (1 × $X_n^i$) vectors and likewise divide the weights matrix into ($X_n^{ij}$ × 1) vectors. The multiplication is performed as shown in Equation 5. Results from the small vector multiplications are stored in a temporary output; when all multiplications for a complete input vector are done, the final results are generated and stored in the designated outputs $Y_1 \rightarrow Y_j$. The multiplication operation is illustrated in Figure 9.

$\sum_{j=1}^{m=4096} \sum_{i=1}^{k=9216/n} (1 \times X_n^i) \ast (X_n^{ij} \times 1) = Y_{ij}$  (5)

Fig. 9 Small-scale matrix multiplication: the 1×9216 input vector and the corresponding weight-column slices are multiplied chunk by chunk, accumulating into the 1×4096 output vector
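A behavioral sketch of this vector-splitting scheme (illustrative Python with our own naming, not the hardware implementation):

```python
import numpy as np

def fc_tiled(x, W, chunk=256):
    """Matrix-vector product computed in 1 x chunk slices, as in Equation 5."""
    n_in, n_out = W.shape            # e.g. 9216 x 4096 for AlexNet's first FC layer
    y = np.zeros(n_out)
    for j in range(n_out):           # one designated output Y_j per column
        acc = 0.0
        for i in range(0, n_in, chunk):
            # small (1 x X_n) by (X_n x 1) product -> partial sum in a temporary output
            acc += x[i:i + chunk] @ W[i:i + chunk, j]
        y[j] = acc
    return y

x = np.random.rand(9216)
W = np.random.rand(9216, 4096) * 0.01
assert np.allclose(fc_tiled(x, W), x @ W)   # matches the full product
```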
VI. IMPLEMENTATION
To demonstrate the functionality of the VHDL generation tool, we implemented two benchmark models, LeNet and AlexNet. In the AlexNet implementation, 16-bit fixed-point precision is used to represent weights and intermediate values, while 8-bit fixed-point precision is used in LeNet. The tool supports different precisions, from a single bit up to 32 bits.
A. LeNet Model
The LeNet model comprises three convolutional layers, two pooling layers, and one fully connected layer. The number of parameters required for the entire model is only 3.75 times the parameters required for the first convolutional layer in AlexNet. Nevertheless, this small model is good enough to perform digit recognition with decent accuracy. Since the number of parameters in LeNet is relatively small compared to AlexNet, we managed to hard-code them as part of the PL. This strategy helped significantly improve the overall throughput of the system as well as reduce the number of DSPs used. The post-place-and-route synthesis report of the hardware resources used is shown in Table IV.

Table IV Resource utilization for the LeNet model
Layer/Resources            Slice Registers   LUTs     DSPs
Available (Virtex VC709)   866400            433200   3600
CONV 1                     1784              1857     25
POOL 1                     966               1137     0
CONV 2                     20848             21643    50
POOL 2                     1121              1304     0
FCs                        209396            238541   0
Total                      234115            264482   75
Utilization                27.02%            61.05%   2.08%

B. AlexNet Model
Our implementation of AlexNet on the Virtex-7 uses 16-bit fixed-point precision for weight representation. In memory management, we adopt the strategy presented in [17] to manage the memory requirements of the fully-connected layers, but we also balance the input transfer and weight transfer to allow room for increasing the input batch size, thus improving the overall performance. Table V shows the hardware resource utilization of the AlexNet model.

Table V Resource utilization for the AlexNet model
Resources (Virtex VC709)   FFs      LUTs     DSPs    BRAMs
Available                  866400   433200   3600    2940
Used                       269845   287461   2070    2023
Utilization                31.14%   66.35%   57.5%   68.8%


Table VI Comparison with other implementations of the AlexNet model
Work        Platform           Frequency (MHz)   GOP/s    FPS   Processing/Image (ms)
[8]         Altera Stratix-V   120               136.5    50    20.1
[17]        Virtex7-VX690T     156               565.9    391   2.56
[18]        Stratix-V GXA7     100               114.5    -     >12.5
[6]         Virtex7-VX485T     100               61.62    47    21.61
This work   Virtex7-VX690T     200               611.52   414   2.41

Table VII Comparison with other automatic HDL generation implementations
Work                Platform         Frequency (MHz)   GOP/s / GMAC/s   CNN model
[11]                Virtex7-VX690T   200               45.8 GOP/s       AlexNet
[10]                Virtex7-VX485T   150               16.42 GMAC/s     LeNet
[13]                Virtex7-VX690T   100               222.1 GOP/s      AlexNet
This work (VC709)   Virtex7-VX690T   200               611.52 GOP/s     AlexNet
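As a quick consistency check on Table VI (our arithmetic, not a claim from the paper), the per-image latency and per-inference work can be recovered from the throughput figures:

```python
gop_per_s, fps = 611.52, 414
print(1000 / fps)        # ~2.42 ms per image, in line with the table's 2.41 ms
print(gop_per_s / fps)   # ~1.48 GOP per image, plausible for an AlexNet forward pass
```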
VII. CONCLUSION AND FUTURE WORK
In this work, we proposed a VHDL generation tool that is optimized to generate a modular, scalable, reconfigurable, and highly parallel implementation of CNN models. We demonstrated our VHDL generator by implementing a small-scale (LeNet) and a large-scale (AlexNet) CNN model on a Virtex-7 running at 200 MHz. Our system is capable of processing up to 125K images/s for the small-scale model and achieved 414 FPS and 611.52 GOP/s for the large-scale one. We aim to extend this work as follows: first, incorporate a design space exploration methodology for choosing an adequate FPGA platform for the desired CNN model; second, give developers the ability to choose the desired parallelism methodology to meet their own hardware resource constraints; third, support all CNN styles besides the Conv→Pool→FC style, and support other neural network algorithms such as recurrent neural networks (RNNs). Lastly, we aim to visualize the built model through the GUI.
REFERENCES
[1] J. Deng, W. Dong, R. Socher, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2–9, 2009.
[2] J. L. F. Pereira and R. J. F. Rossetti, "An integrated architecture for autonomous vehicles simulation," Proc. 27th Annu. ACM Symp. Appl. Comput. - SAC '12, pp. 286–292, 2012.
[3] http://yann.lecun.com/exdb/lenet/.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.
[5] J. Qiu et al., "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network," Proc. 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays - FPGA '16, pp. 26–35, 2016.
[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," Proc. 2015 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays - FPGA '15, pp. 161–170, 2015.
[7] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 247, 2010.
[8] N. Suda et al., "Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks," Proc. 2016 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays - FPGA '16, 2016.
[9] C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based Processor for Convolutional Networks."
[10] Y. Zhou and J. Jiang, "An FPGA-based accelerator implementation for deep convolutional neural networks," Proc. 2015 4th Int. Conf. Comput. Sci. Netw. Technol. (ICCSNT), pp. 829–832, 2016.
[11] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi, "Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks," arXiv, 2016.
[12] "Vivado High-Level Synthesis." [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html. [Accessed: 07-Aug-2017].
[13] Z. Liu, Y. Dou, J. Jiang, and J. Xu, "Automatic Code Generation of Convolutional Neural Networks in FPGA Implementation," Int. Conf. Field-Programmable Technology (FPT), pp. 61–68, 2016.
[14] J. Park and W. Sung, "FPGA Based Implementation of Deep Neural Networks Using On-Chip Memory Only," Proc. ICASSP 2016, pp. 1011–1015, 2016.
[15] M. Sankaradas et al., "A Massively Parallel Coprocessor for Convolutional Neural Networks," Proc. IEEE Int. Conf. Application-Specific Systems, Architectures and Processors (ASAP), pp. 53–60, 2009.
[16] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf, "A programmable parallel accelerator for learning and classification," Proc. 19th Int. Conf. Parallel Archit. Compil. Tech. - PACT '10, p. 273, 2010.
[17] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," 2016 26th Int. Conf. Field Programmable Logic and Applications (FPL), pp. 1–9, 2016.
[18] Y. Ma, N. Suda, Y. Cao, J. S. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA," 2016 26th Int. Conf. Field-Programmable Logic and Applications (FPL), 2016.
[19] "CS231n Convolutional Neural Networks for Visual Recognition." [Online]. Available: http://cs231n.github.io/convolutional-networks/. [Accessed: 01-Jan-2017].
