CN109086867B - Convolutional neural network acceleration system based on FPGA - Google Patents
- Publication number: CN109086867B (application CN201810710069.1A)
- Authority: CN (China)
- Legal status: Expired - Fee Related
Classifications
- G06N3/045: Combinations of networks
- G06N3/048: Activation functions
- G06N3/063: Physical realisation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a convolutional neural network acceleration system based on an FPGA (field programmable gate array). A convolutional neural network on the FPGA is accelerated based on an OpenCL programming framework. The system comprises a data preprocessing module, a data post-processing module, a convolutional neural network computing module, a data storage module, and a network model configuration module; the convolutional neural network computing module comprises a convolution computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully-connected computation submodule. In use, the acceleration system can set the computation parallelism according to the hardware resources of the FPGA so as to adapt to different FPGAs and different convolutional neural networks. It runs convolutional neural networks on FPGAs in an efficient, parallel, pipelined manner, effectively reducing system power consumption, greatly increasing the processing speed of convolutional neural networks, and meeting real-time requirements.
Description
Technical Field
The invention belongs to the technical field of neural network computing, and particularly relates to a convolutional neural network acceleration system based on an FPGA (field programmable gate array).
Background
With the continuous maturing of deep learning technology, convolutional neural networks are widely used in fields such as computer vision, speech recognition, and natural language processing, and achieve good results in practical applications such as face detection and speech recognition. In recent years, thanks to ever-larger training data sets and continuously innovated network structures, the accuracy and performance of convolutional neural networks have improved remarkably. However, as network structures become more complex, practical applications impose ever stricter requirements on real-time performance and cost, and therefore on the computing capability and energy consumption of the hardware that runs the network.
The FPGA offers abundant computing resources, high flexibility, and high energy efficiency. Compared with conventional digital circuit systems, it has the advantages of programmability, high integration, high speed, and high reliability, and it is increasingly being used to accelerate neural networks. OpenCL is a heterogeneous computing language based on the traditional C language. It can run on accelerators such as CPUs, GPUs, FPGAs, and DSPs, and its high level of language abstraction lets programmers develop high-performance applications without knowing the hardware circuits and low-level details, greatly reducing the complexity of the programming process.
In November 2012, Altera formally introduced a Software Development Kit (SDK) that combines the massively parallel architecture of the FPGA with the OpenCL parallel programming model for OpenCL development on FPGAs. With this kit, programmers familiar with C can quickly master the development of high-performance, low-power, high-efficiency FPGA applications in a high-level OpenCL environment. The Altera OpenCL SDK is adopted here to accelerate convolutional neural network computation on the FPGA; the FPGA serves as an external accelerator of the host, enabling the host and the external FPGA accelerator to work cooperatively.
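As a hedged illustration of this cooperative host/accelerator setup, the following minimal C sketch shows how a host typically loads an offline-compiled FPGA kernel binary under the Altera OpenCL SDK; the binary name cnn.aocx and the kernel name conv_compute are assumptions, and error handling is omitted.

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    /* the FPGA board enumerates as an accelerator device */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* FPGA kernels are compiled offline; the host loads the binary image */
    FILE *f = fopen("cnn.aocx", "rb");
    fseek(f, 0, SEEK_END);
    size_t len = (size_t)ftell(f);
    rewind(f);
    unsigned char *bin = malloc(len);
    fread(bin, 1, len, f);
    fclose(f);

    cl_program prog = clCreateProgramWithBinary(
        ctx, 1, &device, &len, (const unsigned char **)&bin, NULL, NULL);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);
    cl_kernel conv = clCreateKernel(prog, "conv_compute", NULL);

    /* ... create DDR-backed buffers, set arguments, enqueue kernels,
       and read results back over PCIe ... */
    (void)queue; (void)conv;
    return 0;
}
```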
Disclosure of Invention
In view of at least one of the above defects or needs for improvement in the prior art, the present invention provides an FPGA-based convolutional neural network acceleration system, which aims to restructure the existing convolutional neural network computation so as to fully exploit the parallelism within the computation and the pipelining between computation layers, thereby increasing the processing speed of the convolutional neural network.
In order to achieve the above object, according to an aspect of the present invention, there is provided a convolutional neural network acceleration system based on FPGA, comprising a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module, and a network model configuration module; the data preprocessing module, the convolutional neural network computing module and the data post-processing module are realized on the basis of an FPGA (field programmable gate array), the data storage module is realized on the basis of off-chip storage of the FPGA, and the network model configuration module is realized on the basis of on-chip storage of the FPGA;
the data preprocessing module is used for reading the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage, and preprocessing them: the 4-dimensional convolution kernel parameters are rearranged into 3 dimensions, and the input feature map is unrolled and copied using a sliding window so that the local feature maps inside the sliding window correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be computed on directly; after preprocessing, the processed convolution kernel parameters and input feature map are sent to the convolutional neural network computing module;
the network model configuration module is used for configuring the parameters of the convolutional neural network computing module; the convolutional neural network computing module implements the convolutional layer, activation function layer, pooling layer, and fully-connected layer of the convolutional neural network as separate units, so that various network structures can be constructed through parameter configuration; according to the configuration parameters, it performs inter-layer pipelined processing of the convolution, activation, pooling, and fully-connected computations on the convolution kernel parameters and input feature maps received from the data preprocessing module, with parallel processing inside each layer; the processing result is sent to the data post-processing module;
the data post-processing module is used for writing the output data of the convolutional neural network computing module into the data storage module;
the data storage module is used for storing the model parameters, the intermediate feature map results, and the final calculation result of the convolutional neural network, and it exchanges data with an external host through a PCIe interface.
Preferably, in the above convolutional neural network acceleration system, the convolutional neural network computation module includes a convolution computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully-connected computation submodule; these submodules are connected according to the network model configuration parameters predefined by the network model configuration module;
after receiving the convolution kernel parameters and feature map sent by the data preprocessing module, the convolutional neural network computing module starts processing through the submodules organized by the configuration parameters, and sends the result to the data post-processing module when processing is finished;
specifically, the convolution computation submodule performs the convolution calculation using the input convolution kernel parameters and feature map, and sends the result to the activation function computation submodule;
the activation function computation submodule selects an activation function according to the activation function configuration parameters predefined by the network model parameter configuration module, performs the activation calculation on the feature map using the selected function, and, once finished, sends the result to the pooling computation submodule or the fully-connected computation submodule according to the parameter configuration;
the pooling computation submodule performs the pooling calculation on the received feature map and, according to the configuration parameters predefined by the network model configuration module, sends the pooling result either to the fully-connected computation submodule or directly to the data post-processing module;
and the fully-connected computation submodule performs the fully-connected calculation on the received feature map and sends the result to the data post-processing module.
Preferably, in the above convolutional neural network acceleration system, the data preprocessing module includes a data transmission submodule, a convolution kernel parameter preprocessing submodule, and a feature map preprocessing submodule;
the data transmission submodule is used for controlling the transfer of feature maps and convolution kernel parameters between the data storage module and the convolutional neural network computing module; the convolution kernel parameter preprocessing submodule is used for rearranging and serializing the convolution kernel parameters; and the feature map preprocessing submodule is used for unrolling, copying, and arranging the feature map.
Preferably, in the above convolutional neural network acceleration system, the data storage module includes a convolution kernel parameter storage submodule and a feature map storage submodule; the convolution kernel parameter storage submodule stores the convolution kernel parameters, and the feature map storage submodule stores the input feature map and the temporary feature maps produced during computation; the storage submodules are preferably partitioned from a DDR memory connected to the FPGA, and within the OpenCL programming framework the data storage module serves as global memory.
Preferably, in the above convolutional neural network acceleration system, the data transmission submodule includes a DDR controller, a data transmission bus, and a storage buffer;
the DDR controller is used for controlling data transmission between the DDR and the FPGA; the data transmission bus connects the DDR and the FPGA and is the channel over which data travels; and the storage buffer temporarily holds data, reducing the FPGA's reads of the DDR and improving the data transmission speed.
Preferably, in the above convolutional neural network acceleration system, the convolution computation submodule includes one or more matrix multiplication computation submodules; the number of matrix multiplication computation submodules is set by the configuration parameters predefined by the network model configuration module; the computations of the matrix multiplication computation submodules execute in parallel;
the matrix multiplication computation submodule uses the Winograd minimal filtering algorithm to accelerate the operation, and is used for computing the matrix multiplication between a single convolution kernel and the corresponding local feature map.
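For reference, the core of the Winograd minimal filtering idea can be sketched in plain C for the one-dimensional case F(2,3), which produces two convolution outputs of a 3-tap filter with 4 multiplications instead of 6; the 2-D transform used for k x k kernels nests this construction. This is a generic illustration of the algorithm, not the patent's exact implementation, and the function name is assumed.

```c
/* 1-D Winograd minimal filtering F(2,3): two outputs of a 3-tap
 * convolution over four inputs, using 4 multiplications. */
static void winograd_f23(const float d[4], const float g[3], float y[2]) {
    /* filter-side transform (can be precomputed once per kernel) */
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];

    /* data-side transform and element-wise products */
    float m0 = (d[0] - d[2]) * u0;
    float m1 = (d[1] + d[2]) * u1;
    float m2 = (d[2] - d[1]) * u2;
    float m3 = (d[1] - d[3]) * u3;

    /* inverse transform: y[0] = d0*g0 + d1*g1 + d2*g2,
                          y[1] = d1*g0 + d2*g1 + d3*g2 */
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}
```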
Preferably, in the above convolutional neural network acceleration system, the activation function computation submodule includes an activation function selection submodule, a Sigmoid function computation submodule, a Tanh function computation submodule, and a ReLU function computation submodule;
the activation function selection submodule is connected to the Sigmoid, Tanh, and ReLU function computation submodules respectively, and the feature map data is sent to one of these three computation submodules;
the activation function selection submodule is used for setting the activation calculation mode of the feature maps in the convolutional neural network;
the Sigmoid function computation submodule is used for computing the Sigmoid function; the Tanh function computation submodule is used for computing the Tanh function; and the ReLU function computation submodule is used for computing the ReLU function.
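A behavioral OpenCL sketch of this selection structure follows; the encoding of act_type matches the configuration-column key given later in Table 1 (1 = ReLU, 2 = Sigmoid, 3 = Tanh), while the kernel name and argument list are assumptions.

```c
__kernel void activation(__global const float *in,
                         __global float *out,
                         const int n,
                         const int act_type) /* 1=ReLU, 2=Sigmoid, 3=Tanh */
{
    for (int i = 0; i < n; ++i) {
        float x = in[i];
        switch (act_type) {
        case 1:  out[i] = x > 0.0f ? x : 0.0f;     break; /* ReLU */
        case 2:  out[i] = 1.0f / (1.0f + exp(-x)); break; /* Sigmoid */
        case 3:  out[i] = tanh(x);                 break; /* Tanh */
        default: out[i] = x;                       break; /* 0: passthrough */
        }
    }
}
```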
Preferably, in the above convolutional neural network acceleration system, the pooling computation submodule includes a double buffer formed from two FPGA on-chip memories;
the double-buffer structure is used for storing temporary feature map data during the pooling calculation. The buffer size is set by the network configuration parameters predefined by the network model parameter configuration module, and differs between pooling layers; ping-pong read/write operation through the double-buffer structure enables pipelined processing of the pooling calculation.
Preferably, in the above convolutional neural network acceleration system, the network model parameter configuration module is implemented in FPGA on-chip storage and is used for storing the network model configuration parameters, including the size of the network input feature map, the size and number of the convolution kernels in the convolution computation submodule, the size of the pooling window in the pooling computation submodule, the parameter scale of the fully-connected computation submodule, and the computation parallelism; the data in the network model parameter configuration module is preferably written in advance, before system startup.
Preferably, in the above convolutional neural network acceleration system, the convolutional neural network computation module is formed by cascading the convolution, activation function, pooling, and fully-connected computation submodules according to the network model configuration parameters; the submodules use OpenCL channels for data transmission, computations inside each submodule execute in parallel, and computations between submodules proceed in a pipelined manner.
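A minimal sketch of this channel-based pipelining, using the Intel (Altera) FPGA OpenCL channels extension (named cl_altera_channels with read_channel_altera/write_channel_altera in older SDK releases), is shown below; the channel depths and kernel bodies are illustrative assumptions.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float conv_to_act __attribute__((depth(64)));
channel float act_to_pool __attribute__((depth(64)));

__kernel void conv_stage(__global const float *in, const int n) {
    for (int i = 0; i < n; ++i) {
        float v = in[i];              /* stand-in for a convolution result */
        write_channel_intel(conv_to_act, v);
    }
}

__kernel void act_stage(const int n) {
    for (int i = 0; i < n; ++i) {
        float v = read_channel_intel(conv_to_act);
        write_channel_intel(act_to_pool, v > 0.0f ? v : 0.0f); /* ReLU */
    }
}
```

Because both kernels run concurrently on the FPGA and exchange data through on-chip channels rather than global memory, the stages overlap in time, which is the pipelining described above.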
The FPGA-based convolutional neural network acceleration system provided by the invention combines the structural characteristics of convolutional neural network models, the characteristics of the FPGA chip, and the advantages of the OpenCL programming framework. It restructures the existing convolutional neural network computation and designs corresponding modules, fully exploiting the parallelism within the computation and the pipelining between computation layers, so that the computation better matches the design characteristics of the FPGA, makes reasonable and efficient use of the FPGA's computing resources, and increases the processing speed of the convolutional neural network. In general, compared with the prior art, the technical solution conceived by the present invention achieves the following beneficial effects:
(1) The FPGA-based convolutional neural network acceleration system provided by the invention uses the computational characteristics of each layer of the convolutional neural network to design a system architecture suited to pipelined processing and parallel computing. The data preprocessing module, the convolutional neural network computing module, and the data post-processing module form a pipeline: the preprocessing and post-processing modules control data transfer between the storage module and the computing module, and the convolution kernel parameters and feature maps pass through the three modules in turn, completing the pipelined stages of data reading, computation, and storage. The convolutional, activation function, pooling, and fully-connected layers are designed as separate computing modules, and various network structures can be constructed through parameter configuration. The processing of each submodule is further divided into several small stages; the data of each layer's submodule passes through stages such as data reading, data processing, and data storage, forming a pipeline similar to a computer instruction pipeline. Computation within a network layer can execute in parallel and computation between layers can execute in pipelined fashion, which effectively increases the processing speed of the convolutional neural network.
(2) The FPGA-based convolutional neural network acceleration system exploits the low data dependence between convolution kernel parameters and local feature maps in convolutional computation. In the parallel computing structure of the convolution computation submodule, each step computes the data of the windows corresponding to the convolution kernels and the input feature map; because the data computed by different convolution kernels are independent, multiple computations can proceed in parallel. In this structure, the sliding window is removed and the data inside the original sliding windows is unrolled directly into multiple data blocks that are fed in during computation, so several data blocks are computed against the convolution kernels simultaneously, further increasing the processing speed.
(3) With the FPGA-based convolutional neural network acceleration system, partial pooling can begin as soon as data enters the pooling computation submodule during the computation of the convolutional neural network. Since the computations of multiple convolution kernels run in parallel, partial results on some channels are generated simultaneously, meaning that some inputs of the pooling computation submodule are already available. Both the pooling computation submodule and the convolution computation submodule compute in units of sliding windows, so a pooling operation can start as soon as all data in a given pooling window is available, rather than only after all convolution computations have finished. The convolution computation submodule can generate data on several channels at once, and the channels are independent of one another during pooling, so the per-channel computations in the pooling computation submodule can run in parallel, greatly increasing the processing speed of the convolutional neural network.
(4) In the FPGA-based convolutional neural network acceleration system, the network model parameters are configurable: a configuration file sets the network model structure and the parallelism of network computation, so that through parameter configuration the system can run different kinds of network models on FPGAs with different computing capabilities.
(5) In the preferred scheme of the FPGA-based convolutional neural network acceleration system, the Winograd minimal filtering algorithm is adopted in the convolutional layer computation, which accelerates the convolution calculation;
a ping-pong buffer is adopted in the pooling layer computation, which accelerates the pooling calculation and reduces storage usage;
batch computation is adopted in the fully-connected layer, which reduces accesses to external storage during computation, and segmented computation is adopted, which simplifies the high-dimensional matrix multiplication, increasing processing speed and lowering the demands on the FPGA hardware's computing capability;
and each computing module of the convolutional neural network is implemented as an OpenCL kernel program, which reduces development difficulty.
Drawings
FIG. 1 is a schematic diagram of an architecture of one embodiment of an FPGA-based convolutional neural network acceleration system provided by the present invention;
FIG. 2 is a schematic processing diagram of a data preprocessing module in an embodiment;
FIG. 3 is a schematic processing diagram of a convolution calculation submodule in an embodiment;
FIG. 4 is a schematic processing diagram of an activation function calculation sub-module in an embodiment;
FIG. 5 is a process diagram of the pooling calculation sub-module in an embodiment;
FIG. 6 is a schematic processing diagram of a fully-connected computation submodule in an embodiment;
FIG. 7 is a schematic processing diagram of a data post-processing module in an embodiment;
FIG. 8 is a processing flow chart of the acceleration system in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to fig. 1, an embodiment of the convolutional neural network acceleration system based on the FPGA according to the present invention includes a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module, and a network model configuration module;
the input end of the data preprocessing module is connected with the data storage module, the input end of the convolutional neural network computing module is connected with the output end of the data preprocessing module, the input end of the data post-processing module is connected with the output end of the convolutional neural network computing module, and the input end of the data storage module is connected with the output end of the data post-processing module; the convolutional neural network calculation module is also connected with the network model configuration module;
the data preprocessing module reads the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage and preprocesses them: the 4-dimensional convolution kernel parameters are rearranged into 3-dimensional form, and the input feature map is unrolled and copied using a sliding window so that the local feature maps inside the sliding window correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be computed on directly; after preprocessing, the processed convolution kernel parameters and input feature map are sent to the convolutional neural network computing module;
the network model configuration module configures the parameters of the convolutional neural network computing module; the convolutional neural network computing module processes the convolution kernel parameters and the input feature maps received from the data preprocessing module according to the configuration parameters and sends the processing result to the data post-processing module;
the convolutional neural network computing module comprises a convolutional computing submodule, an activation function computing submodule, a pooling computing submodule and a full-connection computing submodule, and the submodules in the convolutional neural network computing module are connected according to network model configuration parameters predefined by the network model configuration module;
the data post-processing module is used for writing the output data of the convolutional neural network computing module into the data storage module;
the data storage module is used for storing model parameters, a model result and a final calculation result of the convolutional neural network, and the data storage module exchanges data with an external host through a PCIe interface.
Referring to fig. 2, the data preprocessing module reads the convolution kernel parameters and the input feature map from the data storage module. When reading the convolution kernels, it reads PARALLEL_KERNEL convolution kernels, each of size k × k × Ci, according to the parameters predefined in the model parameter configuration module, where Ci denotes the number of channels of the input feature map. After the convolution kernels are read in, they are serialized: the four-dimensional kernel tensor of size k × k × Ci × PARALLEL_KERNEL is arranged into a three-dimensional form of size k × k × (Ci × PARALLEL_KERNEL).
When processing the input feature map, the feature map of size H × W × Ci is first padded, and then unrolled according to the sliding-window size and the stride: the unrolled feature map consists of ((W − k)/stride + 1) × ((H − k)/stride + 1) sliding-window patches, each of size k × k × Ci.
After the input feature map is unrolled, it is cut according to the configuration parameters into blocks of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × Ci, and each cut block is copied so that the number of copies equals the number of convolution kernels, finally giving a feature map of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × (Ci × PARALLEL_KERNEL); this enables parallel computation between multiple convolution kernels and the feature map. Once the convolution kernels and the feature map have been processed, the preprocessed convolution kernel parameters and feature map are sent to the convolutional neural network computing module for processing.
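The preprocessing just described can be sketched in plain C as follows; the array layouts (kernel tensor [PARALLEL_KERNEL][Ci][k][k], feature map [Ci][H][W]) and the function names are assumptions made for illustration, and padding is omitted.

```c
#include <string.h>

/* kernels: [PARALLEL_KERNEL][Ci][k][k] -> out: [Ci*PARALLEL_KERNEL][k][k] */
void serialize_kernels(const float *kernels, float *out,
                       int parallel_kernel, int ci, int k) {
    memcpy(out, kernels, (size_t)parallel_kernel * ci * k * k * sizeof(float));
    /* with this layout the flattening is a pure reinterpretation:
       channel c of kernel p becomes plane p*ci + c of the 3-D tensor */
}

/* feature map: [Ci][H][W] -> one k*k patch per sliding-window position */
void unroll_feature_map(const float *fmap, float *windows,
                        int ci, int h, int w, int k, int stride) {
    int out_w = (w - k) / stride + 1;
    int out_h = (h - k) / stride + 1;
    float *dst = windows;
    for (int c = 0; c < ci; ++c)
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x)
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx)
                        *dst++ = fmap[(c * h + y * stride + ky) * w
                                      + x * stride + kx];
}
```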
The processing flow of the convolution computation submodule of the convolutional neural network computing module is shown in fig. 3. The module's inputs are the convolution kernel parameters and feature map generated by the data preprocessing module and the relevant configuration parameters predefined in the network model configuration module. The preprocessed convolution kernels and feature map are three-dimensional matrices with PARALLEL_KERNEL × Ci channels; the convolution kernel and feature map on each channel are fed into different OpenCL compute units, which perform two-dimensional matrix multiplication using the Winograd matrix multiplication submodules. The computations of the OpenCL compute units can proceed in parallel, and the computation result has size (PARALLEL_FEATURE_W/k) × (PARALLEL_FEATURE_H/k) with Ci × PARALLEL_KERNEL channels. The convolution computation submodule thus turns the input feature map into a partial output feature map of the convolutional layer, which is handled differently according to the type of the next layer. If the next layer predefined in the network model configuration is a convolutional layer or a fully-connected layer, the output feature map skips the pooling layer and the data post-processing module writes the result back to external storage for processing; if the next layer predefined in the network model configuration is a pooling layer, the feature map is sent to the pooling computation submodule for pooling.
Referring to fig. 4, the activation function computation submodule in the embodiment includes an activation function selection submodule and three function computation submodules. The selector in the activation function selection submodule is determined by a configuration parameter in the model configuration module, and the three function computation submodules correspond to the Sigmoid, Tanh, and ReLU activation functions respectively. The input feature map is sent to a function computation submodule along the path determined by the activation function selection submodule; after the activation computation finishes, the result is sent to the data storage module or the pooling computation submodule according to the configuration parameters.
Referring to fig. 5, the pooling computation submodule uses two ping-pong buffers of size pool_size × W to store the computation results coming from the activation function computation submodule, where pool_size and W are configuration parameters. The upstream results are first filled continuously into Buffer1, and partial pooling within the buffer can already run during filling. Once Buffer1 is full, the results are filled into Buffer2; while Buffer2 is filling, the data in Buffer2 can be pooled, and the data spanning Buffer1 and Buffer2 can also be pooled. When Buffer2 is full, the results are again filled into Buffer1, and the two buffers alternate in this way until the whole pooling calculation is complete. A pooling window can also span the two buffers, drawing data from both; such a window can be computed while one buffer is being computed on and the other is being filled. Since there is no data dependence between pooling windows, a loop-unrolling method can be used to compute different windows simultaneously.
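The double-buffer behavior can be sketched as a single-work-item OpenCL kernel; on an FPGA the compiler pipelines the loop iterations, which is what overlaps the filling of one buffer with the pooling of the other. POOL_SIZE, W_MAX, the max-pooling choice, and the channel names are illustrative assumptions.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float act_to_pool;   /* from the activation stage */
channel float pool_out;      /* to the next stage or post-processing */

#define POOL_SIZE 2          /* pooling window edge, a config parameter */
#define W_MAX 224            /* upper bound on the row width W */

__kernel void pool_stage(const int w, const int row_groups) {
    float buf[2][POOL_SIZE * W_MAX];  /* ping-pong buffers in on-chip RAM */
    int which = 0;

    for (int t = 0; t < row_groups; ++t) {
        /* fill the current buffer with POOL_SIZE rows of results */
        for (int i = 0; i < POOL_SIZE * w; ++i)
            buf[which][i] = read_channel_intel(act_to_pool);

        /* max-pool the buffer that was just filled; in hardware this
           overlaps with refilling the other buffer on the next iteration */
        for (int x = 0; x + POOL_SIZE <= w; x += POOL_SIZE) {
            float m = -INFINITY;
            for (int r = 0; r < POOL_SIZE; ++r)
                for (int c = 0; c < POOL_SIZE; ++c)
                    m = fmax(m, buf[which][r * w + x + c]);
            write_channel_intel(pool_out, m);
        }
        which = 1 - which;   /* swap the ping-pong roles */
    }
}
```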
Referring to fig. 6, during processing the fully-connected computation submodule divides the input matrix formed by N input vectors horizontally into dim1/m segments, where N is the number of input feature vectors, dim1 is the dimension of the input feature vectors, and m is the segment length. Each segment forms a sub-matrix of size m × N; each sub-matrix is multiplied by the corresponding part of the weight matrix to obtain a partial result of the output matrix, and the dim1/m partial results are accumulated to form the final computation result, the N output vectors. When computing the product of a sub-matrix and the corresponding part of the weight matrix, Winograd minimal filtering matrix multiplication is used to accelerate the computation.
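A behavioral C sketch of this segmented computation is given below (using an ordinary inner product in place of the Winograd-accelerated sub-matrix product); the row-major layouts and the output dimension n are assumptions for illustration.

```c
/* input:   dim1 x N, row-major
 * weights: n x dim1, row-major
 * output:  n x N, must be zero-initialized by the caller */
void fc_segmented(const float *input, const float *weights, float *output,
                  int dim1, int n, int N, int m) {
    for (int seg = 0; seg < dim1 / m; ++seg) {
        /* multiply the seg-th m x N input slice by the matching
           n x m weight slice and accumulate into the n x N result */
        for (int r = 0; r < n; ++r)
            for (int c = 0; c < N; ++c) {
                float acc = 0.0f;
                for (int i = 0; i < m; ++i)
                    acc += weights[r * dim1 + seg * m + i]
                         * input[(seg * m + i) * N + c];
                output[r * N + c] += acc;
            }
    }
}
```

Splitting along the dim1 dimension keeps each working set small (one m × N slice at a time), which is what reduces external-memory accesses as stated in effect (5) above.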
Referring to fig. 7, after the pooling computation submodule or the fully-connected computation submodule of the convolutional neural network computing module finishes processing, the data post-processing module begins writing the data they output back into the data storage module. During this process, the barrier operation of the OpenCL framework is used to guarantee that transmission starts only after all computation results are available, and that the next round of processing starts only after all data transmission has completed.
Referring to fig. 8, the processing flow of the acceleration system provided by the embodiment consists of three main parts. The first part is the kernel compilation process: to make maximal use of the computing and storage resources on the FPGA, suitable network computation parallelism parameters must be chosen. In the embodiment this parallelism search is automated by a program. Initial values of PARALLEL_FEATURE and PARALLEL_KERNEL are set in the convolutional neural network kernel program, and the kernel is compiled with the Altera OpenCL SDK; after compilation, the resource utilization (storage, logic, and computing resources, among others) is read from the compile report. If resource utilization has not reached its maximum, the values of PARALLEL_FEATURE and PARALLEL_KERNEL are updated and the kernel is recompiled, until the maximum hardware resource utilization is reached; the finished compilation yields the hardware program that runs on the FPGA.
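A hedged host-side sketch of this automated search follows; aoc is the Altera OpenCL offline compiler, while the -D macro values, the report path, and the parse_utilization() helper are hypothetical details of this illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* hypothetical helper that extracts overall resource utilization (0..1)
   from the compile report produced by aoc */
double parse_utilization(const char *report_path);

int main(void) {
    int pf = 8, pk = 4;              /* initial PARALLEL_FEATURE / _KERNEL */
    double util = 0.0;
    while (util < 0.90) {            /* stop when near full utilization */
        char cmd[256];
        snprintf(cmd, sizeof cmd,
                 "aoc -DPARALLEL_FEATURE=%d -DPARALLEL_KERNEL=%d "
                 "cnn.cl -o cnn.aocx", pf, pk);
        if (system(cmd) != 0)
            break;                   /* compile failed: keep the last build */
        util = parse_utilization("cnn/reports/report.html");
        if (util < 0.90) {
            pf *= 2;                 /* grow parallelism and recompile */
            pk *= 2;
        }
    }
    return 0;
}
```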
The second part is the parameter configuration process, which covers the network model computation parameters and the model configuration parameters. The network model computation parameters are read directly from a Caffe model file (caffemodel); the model configuration parameters include each layer's input feature map size, convolution kernel size, pooling window size, and so on. Parameter configuration is carried out with the clSetKernelArg() function of OpenCL; Table 1 below illustrates the types and values of the model configuration parameters, taking VGG16 as an example.
TABLE 1 types of model configuration parameters and parameter value examples
In the table above, in the activation function column, 0 indicates no activation function, 1 indicates the ReLU activation function, 2 indicates the Sigmoid activation function, and 3 indicates the Tanh activation function; in the Output dst column, 1 indicates output to the data storage module, 2 indicates output to the pooling computation submodule, and 3 indicates output to the convolution computation submodule.
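As a hedged illustration of this configuration step, the host might set one layer's parameters as follows; the argument indices, the kernel object conv, and the sample values are assumptions, while clSetKernelArg() itself is the standard OpenCL call named above.

```c
#include <CL/cl.h>

/* configure one convolutional layer; "conv" is an already-created kernel */
cl_int configure_layer(cl_kernel conv) {
    cl_int err = CL_SUCCESS;
    int in_w = 224, in_h = 224, in_c = 3;  /* input feature map size    */
    int k = 3, n_kernels = 64;             /* kernel size and count     */
    int act_type = 1;                      /* 1 = ReLU (see column key) */
    int output_dst = 2;                    /* 2 = pooling submodule     */

    err |= clSetKernelArg(conv, 0, sizeof(int), &in_w);
    err |= clSetKernelArg(conv, 1, sizeof(int), &in_h);
    err |= clSetKernelArg(conv, 2, sizeof(int), &in_c);
    err |= clSetKernelArg(conv, 3, sizeof(int), &k);
    err |= clSetKernelArg(conv, 4, sizeof(int), &n_kernels);
    err |= clSetKernelArg(conv, 5, sizeof(int), &act_type);
    err |= clSetKernelArg(conv, 6, sizeof(int), &output_dst);
    return err;
}
```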
The third part is the running process of the neural network: when the host transmits a picture to the data storage module, the system on the FPGA starts running; when the run finishes, the data storage module returns the calculation result to the host, and execution ends when there are no more pictures to process.
The embodiment of the FPGA-based convolutional neural network acceleration system implements the VGG16 and AlexNet network models on a DE5a-Net development board and runs performance tests with picture data of size 224 × 224 × 3. The experimental data show a processing speed of 160 ms/image for VGG16 and 12 ms/image for AlexNet, which is superior to other FPGA implementation schemes.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (8)
1. A convolutional neural network acceleration system based on FPGA is characterized by comprising a data preprocessing module, a convolutional neural network calculation module, a data post-processing module, a data storage module and a network model configuration module; the data preprocessing module, the convolutional neural network computing module and the data post-processing module are realized on the basis of an FPGA (field programmable gate array), the data storage module is realized on the basis of FPGA off-chip storage, and the network model configuration module is realized on the basis of FPGA on-chip storage;
the data preprocessing module is used for reading the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage, and preprocessing them: the 4-dimensional convolution kernel parameters are rearranged into 3 dimensions, and the input feature map is unrolled and copied using a sliding window, so that the local feature maps inside the sliding window correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be computed on directly; after preprocessing, the processed convolution kernel parameters and input feature map are sent to the convolutional neural network computing module; specifically, when reading the convolution kernels, PARALLEL_KERNEL convolution kernels, each of size k × k × Ci, are read according to the parameters predefined in the model parameter configuration module, where Ci represents the number of channels of the input feature map; after the convolution kernels are read in, they are serialized, i.e., the four-dimensional convolution kernels of size k × k × Ci × PARALLEL_KERNEL are arranged into a three-dimensional form of size k × k × (Ci × PARALLEL_KERNEL);
when processing the input feature map, the feature map of size H × W × Ci is first padded and then unrolled according to the sliding-window size and the stride, the unrolled feature map consisting of ((W − k)/stride + 1) × ((H − k)/stride + 1) sliding-window patches, each of size k × k × Ci;
after the input feature map is unrolled, it is cut according to the configuration parameters into blocks of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × Ci, and the cut blocks are copied so that the number of feature map copies equals the number of convolution kernels, finally giving a feature map of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × (Ci × PARALLEL_KERNEL), enabling parallel computation of the multiple convolution kernels and the feature map;
the network model configuration module is used for configuring the parameters of the convolutional neural network computing module; the convolutional neural network computing module is used for implementing the convolutional layer, activation function layer, pooling layer, and fully-connected layer of the convolutional neural network separately, constructing various network structures through parameter configuration, carrying out inter-layer pipelined processing of the convolution, activation, pooling, and fully-connected computations on the convolution kernel parameters and input feature maps received from the data preprocessing module according to the configuration parameters, and sending the processing results to the data post-processing module;
the data post-processing module is used for writing the output data of the convolutional neural network computing module into the data storage module;
the data storage module is used for storing the model parameters, the intermediate feature map calculation results, and the final calculation result of the convolutional neural network, and the data storage module exchanges data with an external host through a PCIe interface;
the convolutional neural network computing module is formed by cascading a convolution computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully-connected computation submodule according to the network model configuration parameters; the submodules use OpenCL channels for data transmission, computations inside the submodules are executed in parallel, and computations between submodules proceed in a pipelined manner;
the convolution computation submodule comprises one or more matrix multiplication computation submodules; the number of the matrix multiplication calculation sub-modules is set by configuration parameters predefined by a network model configuration module; the processing among the matrix multiplication computation submodules is executed in parallel;
the matrix multiplication computation submodule uses the Winograd minimal filtering algorithm to accelerate the operation, and is used for computing the matrix multiplication between a single convolution kernel and the corresponding local feature map.
2. The convolutional neural network acceleration system of claim 1, wherein the convolutional neural network computation module comprises a convolutional computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully-connected computation submodule; the sub-modules in the convolutional neural network computing module are connected according to network model configuration parameters predefined by the network model configuration module;
the convolution computation submodule performs the convolution calculation using the input convolution kernel parameters and feature map, and sends the result to the activation function computation submodule after the convolution calculation is completed;
the activation function computation submodule selects an activation function according to the activation function configuration parameters predefined by the network model parameter configuration module, performs the activation calculation on the feature map using the selected activation function, and sends the result to the pooling computation submodule or the fully-connected computation submodule according to the parameter configuration after the activation calculation is completed;
the pooling computation submodule is used for performing the pooling calculation on the received feature map and, according to the configuration parameters predefined by the network model configuration module, sending the pooling result to the fully-connected computation submodule or directly to the data post-processing module;
and the fully-connected computation submodule is used for performing the fully-connected calculation on the received feature map and sending the result to the data post-processing module.
3. The convolutional neural network acceleration system as claimed in claim 2, wherein the activation function calculation sub-module includes an activation function selection sub-module, a Sigmoid function calculation sub-module, a Tanh function calculation sub-module, and a ReLU function calculation sub-module;
the activation function selection submodule is connected to the Sigmoid function computation submodule, the Tanh function computation submodule, and the ReLU function computation submodule respectively, and the feature map data is sent to one of these three computation submodules;
the activation function selection submodule is used for setting the activation calculation mode of the feature maps in the convolutional neural network; the Sigmoid function computation submodule is used for computing the Sigmoid function; the Tanh function computation submodule is used for computing the Tanh function; and the ReLU function computation submodule is used for computing the ReLU function.
4. The convolutional neural network acceleration system as claimed in any one of claims 1 to 3, wherein the pooling computation submodule comprises a double buffer formed from two FPGA on-chip memories for storing temporary feature map data during the pooling calculation; the buffer size is set by the network configuration parameters predefined by the network model parameter configuration module and differs between pooling layers, and ping-pong read/write operation through the double-buffer structure achieves pipelined processing of the pooling calculation.
5. The convolutional neural network acceleration system of claim 1, wherein the data preprocessing module comprises a data transmission sub-module, a convolution kernel parameter preprocessing sub-module, and a feature map preprocessing sub-module;
the data transmission submodule is used for controlling the transfer of the feature maps and convolution kernel parameters between the data storage module and the convolutional neural network computing module; the convolution kernel parameter preprocessing submodule is used for rearranging and serializing the convolution kernel parameters; and the feature map preprocessing submodule is used for unrolling, copying, and arranging the feature map.
6. The convolutional neural network acceleration system of claim 5, wherein the data transmission submodule comprises a DDR controller, a data transmission bus, and a storage buffer;
the DDR controller is used for controlling data transmission between the DDR and the FPGA; the data transmission bus connects the DDR and the FPGA and is the channel for data transmission; and the storage buffer is used for temporarily storing data, reducing the FPGA's reads of the DDR and improving the data transmission speed.
7. The convolutional neural network acceleration system of claim 1, wherein the data storage module comprises a convolution kernel parameter storage submodule and a feature map storage submodule, the convolution kernel parameter storage submodule being used for storing the convolution kernel parameters and the feature map storage submodule being used for storing the input feature map and the temporary feature maps produced during computation; the storage submodules are all partitioned from a DDR memory connected to the FPGA.
8. The convolutional neural network acceleration system of claim 1, wherein the network model parameter configuration module is configured to store network model configuration parameters, including the size of the network input feature map, the size and number of convolutional kernel parameters in the convolutional calculation submodule, the size of the pooling window in the pooling calculation submodule, the parameter size of the fully-connected calculation submodule, and the calculation parallelism; the data in the network model parameter configuration module is written in advance before the system is started.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810710069.1A | 2018-07-02 | 2018-07-02 | Convolutional neural network acceleration system based on FPGA |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN109086867A | 2018-12-25 |
| CN109086867B | 2021-06-08 |
Families Citing this family (64)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10678244B2 (en) | 2017-03-23 | 2020-06-09 | Tesla, Inc. | Data synthesis for autonomous control systems |
US10671349B2 (en) | 2017-07-24 | 2020-06-02 | Tesla, Inc. | Accelerated mathematical engine |
US11409692B2 (en) | 2017-07-24 | 2022-08-09 | Tesla, Inc. | Vector computational unit |
US11157441B2 (en) | 2017-07-24 | 2021-10-26 | Tesla, Inc. | Computational array microprocessor system using non-consecutive data formatting |
US11893393B2 (en) | 2017-07-24 | 2024-02-06 | Tesla, Inc. | Computational array microprocessor system with hardware arbiter managing memory requests |
US11561791B2 (en) | 2018-02-01 | 2023-01-24 | Tesla, Inc. | Vector computational unit receiving data elements in parallel from a last row of a computational array |
US11215999B2 (en) | 2018-06-20 | 2022-01-04 | Tesla, Inc. | Data pipeline and deep learning system for autonomous driving |
US11361457B2 (en) | 2018-07-20 | 2022-06-14 | Tesla, Inc. | Annotation cross-labeling for autonomous control systems |
US11636333B2 (en) | 2018-07-26 | 2023-04-25 | Tesla, Inc. | Optimizing neural network structures for embedded systems |
US11562231B2 (en) | 2018-09-03 | 2023-01-24 | Tesla, Inc. | Neural networks for embedded devices |
KR20210072048A (en) | 2018-10-11 | 2021-06-16 | 테슬라, 인크. | Systems and methods for training machine models with augmented data |
US11196678B2 (en) | 2018-10-25 | 2021-12-07 | Tesla, Inc. | QOS manager for system on a chip communications |
US11816585B2 (en) | 2018-12-03 | 2023-11-14 | Tesla, Inc. | Machine learning models operating at different frequencies for autonomous vehicles |
US11537811B2 (en) | 2018-12-04 | 2022-12-27 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11610117B2 (en) | 2018-12-27 | 2023-03-21 | Tesla, Inc. | System and method for adapting a neural network model on a hardware platform |
CN109656721A (en) * | 2018-12-28 | 2019-04-19 | 上海新储集成电路有限公司 | A kind of efficient intelligence system |
CN109685209B (en) * | 2018-12-29 | 2020-11-06 | 瑞芯微电子股份有限公司 | Device and method for accelerating operation speed of neural network |
CN109948784B (en) * | 2019-01-03 | 2023-04-18 | 重庆邮电大学 | Convolutional neural network accelerator circuit based on rapid filtering algorithm |
CN109961139A (en) * | 2019-01-08 | 2019-07-02 | 广东浪潮大数据研究有限公司 | A kind of accelerated method, device, equipment and the storage medium of residual error network |
CN109784489B (en) * | 2019-01-16 | 2021-07-30 | 北京大学软件与微电子学院 | Convolutional neural network IP core based on FPGA |
CN109767002B (en) * | 2019-01-17 | 2023-04-21 | 山东浪潮科学研究院有限公司 | Neural network acceleration method based on multi-block FPGA cooperative processing |
CN109799977B (en) * | 2019-01-25 | 2021-07-27 | 西安电子科技大学 | Method and system for developing and scheduling data by instruction program |
US11150664B2 (en) | 2019-02-01 | 2021-10-19 | Tesla, Inc. | Predicting three-dimensional features for autonomous driving |
US10997461B2 (en) | 2019-02-01 | 2021-05-04 | Tesla, Inc. | Generating ground truth for machine learning from time series elements |
US11567514B2 (en) | 2019-02-11 | 2023-01-31 | Tesla, Inc. | Autonomous and user controlled vehicle summon to a target |
US10956755B2 (en) | 2019-02-19 | 2021-03-23 | Tesla, Inc. | Estimating object properties using visual image data |
CN109976903B (en) | 2019-02-22 | 2021-06-29 | 华中科技大学 | Deep learning heterogeneous computing method and system based on layer width memory allocation |
US11580386B2 (en) | 2019-03-18 | 2023-02-14 | Electronics And Telecommunications Research Institute | Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system |
CN110188869B (en) * | 2019-05-05 | 2021-08-10 | 北京中科汇成科技有限公司 | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm |
CN110334801A (en) * | 2019-05-09 | 2019-10-15 | 苏州浪潮智能科技有限公司 | Hardware acceleration method, apparatus, device and system for convolutional neural networks |
CN110263925B (en) * | 2019-06-04 | 2022-03-15 | 电子科技大学 | Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image feature extraction method, apparatus, device and computer-readable storage medium |
CN110490311A (en) * | 2019-07-08 | 2019-11-22 | 华南理工大学 | Convolutional neural network accelerator based on the RISC-V architecture and its control method |
CN110458279B (en) * | 2019-07-15 | 2022-05-20 | 武汉魅瞳科技有限公司 | FPGA-based binary neural network acceleration method and system |
CN110390392B (en) * | 2019-08-01 | 2021-02-19 | 上海安路信息科技有限公司 | FPGA-based convolution parameter acceleration device and data read/write method |
CN110443357B (en) * | 2019-08-07 | 2020-09-15 | 上海燧原智能科技有限公司 | Convolutional neural network calculation optimization method and device, computer equipment and medium |
CN110852930B (en) * | 2019-10-25 | 2021-06-29 | 华中科技大学 | FPGA graph processing acceleration method and system based on OpenCL |
CN111079923B (en) * | 2019-11-08 | 2023-10-13 | 中国科学院上海高等研究院 | Spark convolutional neural network system suitable for edge computing platform and circuit thereof |
CN111105015A (en) * | 2019-12-06 | 2020-05-05 | 浪潮(北京)电子信息产业有限公司 | General-purpose CNN inference accelerator, control method thereof, and readable storage medium |
CN111160544B (en) * | 2019-12-31 | 2021-04-23 | 上海安路信息科技股份有限公司 | Data activation method and FPGA data activation system |
CN111210019B (en) * | 2020-01-16 | 2022-06-24 | 电子科技大学 | Neural network inference method based on software and hardware cooperative acceleration |
CN111242289B (en) * | 2020-01-19 | 2023-04-07 | 清华大学 | Scalable convolutional neural network acceleration system and method |
CN111325327B (en) * | 2020-03-06 | 2022-03-08 | 四川九洲电器集团有限责任公司 | General-purpose convolutional neural network computing architecture for embedded platforms and method of use |
CN111340198B (en) * | 2020-03-26 | 2023-05-05 | 上海大学 | FPGA-based neural network accelerator with high data reuse |
CN111626403B (en) * | 2020-05-14 | 2022-05-10 | 北京航空航天大学 | Convolutional neural network accelerator based on CPU-FPGA memory sharing |
CN111583095B (en) * | 2020-05-22 | 2022-03-22 | 浪潮电子信息产业股份有限公司 | Image data storage method, image data processing system and related device |
CN111736986B (en) * | 2020-05-29 | 2023-06-23 | 浪潮(北京)电子信息产业有限公司 | FPGA-accelerated execution method for deep learning models and related apparatus |
CN111753974B (en) * | 2020-06-22 | 2024-10-15 | 深圳鲲云信息科技有限公司 | Neural network accelerator |
CN111860781B (en) * | 2020-07-10 | 2024-06-28 | 逢亿科技(上海)有限公司 | Convolutional neural network feature decoding system based on FPGA |
CN111931913B (en) * | 2020-08-10 | 2023-08-01 | 西安电子科技大学 | Caffe-based method for deploying a convolutional neural network on an FPGA |
CN112149814A (en) * | 2020-09-23 | 2020-12-29 | 哈尔滨理工大学 | Convolutional neural network acceleration system based on FPGA |
CN112101284A (en) * | 2020-09-25 | 2020-12-18 | 北京百度网讯科技有限公司 | Image recognition method, training method, device and system of image recognition model |
CN112905526B (en) * | 2021-01-21 | 2022-07-08 | 北京理工大学 | FPGA implementation method for multiple types of convolution |
CN112766478B (en) * | 2021-01-21 | 2024-04-12 | 中国电子科技集团公司信息科学研究院 | FPGA pipeline structure for convolutional neural networks |
CN112732638B (en) * | 2021-01-22 | 2022-05-06 | 上海交通大学 | Heterogeneous acceleration system and method based on CTPN network |
CN112819140B (en) * | 2021-02-02 | 2022-06-24 | 电子科技大学 | OpenCL-based FPGA one-dimensional signal recognition neural network acceleration method |
CN112949845B (en) * | 2021-03-08 | 2022-08-09 | 内蒙古大学 | Deep convolutional neural network accelerator based on FPGA |
CN113065647B (en) * | 2021-03-30 | 2023-04-25 | 西安电子科技大学 | Calculation-storage communication system and communication method for accelerating neural network |
CN113517007B (en) * | 2021-04-29 | 2023-07-25 | 西安交通大学 | Pipeline processing method and system, and memristor array |
CN113467783B (en) * | 2021-07-19 | 2023-09-12 | 中科曙光国际信息产业有限公司 | Kernel function compilation method and device for an artificial intelligence accelerator |
CN114943635B (en) * | 2021-09-30 | 2023-08-22 | 太初(无锡)电子科技有限公司 | Fusion operator design and implementation method based on heterogeneous collaborative computing core |
CN113949592B (en) * | 2021-12-22 | 2022-03-22 | 湖南大学 | FPGA-based adversarial-attack defense system and method |
CN114997392B (en) * | 2022-08-03 | 2022-10-21 | 成都图影视讯科技有限公司 | Architecture and architectural methods for neural network computing |
CN117195989B (en) * | 2023-11-06 | 2024-06-04 | 深圳市九天睿芯科技有限公司 | Vector processor, neural network accelerator, chip and electronic equipment |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10891538B2 (en) * | 2016-08-11 | 2021-01-12 | Nvidia Corporation | Sparse convolutional neural network accelerator |
CN107341127B (en) * | 2017-07-05 | 2020-04-14 | 西安电子科技大学 | Convolutional neural network acceleration method based on OpenCL standard |
CN107657581B (en) * | 2017-09-28 | 2020-12-22 | 中国人民解放军国防科技大学 | Convolutional neural network CNN hardware accelerator and acceleration method |
CN108154229B (en) * | 2018-01-10 | 2022-04-08 | 西安电子科技大学 | Image processing method based on an FPGA-accelerated convolutional neural network framework |
- 2018
  - 2018-07-02 CN CN201810710069.1A patent/CN109086867B/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN109086867A (en) | 2018-12-25 |
Similar Documents
Publication | Title |
---|---|
CN109086867B (en) | Convolutional neural network acceleration system based on FPGA |
CN111459877B (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration |
Tanomoto et al. | A CGRA-based approach for accelerating convolutional neural networks |
CN111079923B (en) | Spark convolutional neural network system suitable for edge computing platform and circuit thereof |
Kästner et al. | Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ |
JP2021510219A (en) | Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior |
CN112508184B (en) | Design method of fast image recognition accelerator based on convolutional neural network |
CN114450661A (en) | Compiler flow logic for reconfigurable architecture |
CN111210019B (en) | Neural network inference method based on software and hardware cooperative acceleration |
GB2625452A (en) | Neural network comprising matrix multiplication |
CN114970849A (en) | Hardware accelerator multi-array parallel computing method and system |
CN114035916A (en) | Method for compiling and scheduling calculation graph and related product |
Seto et al. | Small memory footprint neural network accelerators |
US20230021204A1 (en) | Neural network comprising matrix multiplication |
WO2022047423A1 (en) | Memory processing unit architecture mapping techniques |
CN112799599A (en) | Data storage method, computing core, chip and electronic equipment |
Wu | Review on FPGA-based accelerators in deep learning |
Hu et al. | On-chip instruction generation for cross-layer CNN accelerator on FPGA |
CN114595813B (en) | Heterogeneous acceleration processor and data computing method |
CN109583006B (en) | Dynamic optimization method of field programmable gate array convolution layer based on cyclic cutting and rearrangement |
Hu et al. | Data optimization CNN accelerator design on FPGA |
CN115374395A (en) | Hardware structure for carrying out scheduling calculation through algorithm control unit |
Li et al. | FPGA-based object detection acceleration architecture design |
WO2022095676A1 (en) | Neural network sparsification device and method, and corresponding product |
CN116185377A (en) | Optimization method and device for calculation graph and related product |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210608 |