
CN109086867B - Convolutional neural network acceleration system based on FPGA - Google Patents


Info

Publication number
CN109086867B
CN109086867B
Authority
CN
China
Prior art keywords
module
calculation
submodule
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810710069.1A
Other languages
Chinese (zh)
Other versions
CN109086867A (en)
Inventor
李开
邹复好
孙浩
李全
祁迪
贺坤坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Meitong Technology Co ltd
Original Assignee
Wuhan Meitong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Meitong Technology Co ltd filed Critical Wuhan Meitong Technology Co ltd
Priority to CN201810710069.1A
Publication of CN109086867A
Application granted
Publication of CN109086867B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a convolutional neural network acceleration system based on an FPGA (field programmable gate array). A convolutional neural network on the FPGA is accelerated based on an OpenCL programming framework. The system comprises a data preprocessing module, a data post-processing module, a convolutional neural network calculation module, a data storage module and a network model configuration module; the convolutional neural network calculation module comprises a convolution calculation submodule, an activation function calculation submodule, a pooling calculation submodule and a fully-connected calculation submodule. In use, the acceleration system can set the calculation parallelism according to the hardware resources of the FPGA so as to adapt to different FPGAs and different convolutional neural networks, and can run convolutional neural networks on FPGAs in an efficient, parallel, pipelined manner, effectively reducing system power consumption, greatly increasing the processing speed of convolutional neural networks, and meeting real-time requirements.

Description

Convolutional neural network acceleration system based on FPGA
Technical Field
The invention belongs to the technical field of neural network computing, and particularly relates to a convolutional neural network acceleration system based on an FPGA (field programmable gate array).
Background
With the continuing maturation of deep learning technology, convolutional neural networks are widely used in fields such as computer vision, speech recognition and natural language processing, and achieve good results in practical application scenarios such as face detection and speech recognition. In recent years, thanks to ever-larger training data sets and continual innovation in network structures, the accuracy and performance of convolutional neural networks have improved remarkably. However, as network structures become more complex, the demands for high real-time performance and low cost in practical applications keep rising, and so do the requirements on the computing capability and energy consumption of the hardware that runs the network.
The FPGA offers abundant computing resources, high flexibility and high energy efficiency; compared with conventional digital circuit systems, it has the advantages of programmability, high integration, high speed and high reliability, and it is increasingly being tried for neural network acceleration. OpenCL is a heterogeneous computing language based on the traditional C language. It can run on accelerators such as CPUs, GPUs, FPGAs and DSPs, and its high level of language abstraction lets programmers develop high-performance applications without knowing the hardware circuits and low-level details, greatly reducing the complexity of the programming process.
In November 2012, Altera formally introduced a Software Development Kit (SDK) that integrates the powerful parallel architecture of the FPGA with the OpenCL parallel programming model for OpenCL development on FPGAs. With it, programmers familiar with C can quickly adapt to and master the development of high-performance, low-power, high-efficiency FPGA applications in a high-level OpenCL environment. This invention adopts the Altera OpenCL SDK to accelerate convolutional neural network computation on the FPGA; the FPGA serves as an external accelerator of the host, so that the host and the external FPGA accelerator work cooperatively.
Disclosure of Invention
In view of at least one of the above defects or needs for improvement in the prior art, the present invention provides an FPGA-based convolutional neural network acceleration system, which aims to restructure the existing convolutional neural network computation so as to fully exploit the parallelism within the computation process and the pipelining between the computation layers, thereby increasing the processing speed of the convolutional neural network.
In order to achieve the above object, according to an aspect of the present invention, there is provided a convolutional neural network acceleration system based on FPGA, comprising a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module, and a network model configuration module; the data preprocessing module, the convolutional neural network computing module and the data post-processing module are realized on the basis of an FPGA (field programmable gate array), the data storage module is realized on the basis of off-chip storage of the FPGA, and the network model configuration module is realized on the basis of on-chip storage of the FPGA;
the data preprocessing module is used for reading the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage, and preprocessing them: the 4-dimensional convolution kernel parameters are rearranged into 3 dimensions, and the input feature map is expanded and copied using a sliding window so that the local feature maps in the sliding window correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be used directly in computation; after preprocessing, the processed convolution kernel parameters and input feature map are sent to the convolutional neural network calculation module;
the network model configuration module is used for configuring the parameters of the convolutional neural network calculation module; the convolutional neural network calculation module implements the convolutional layer, activation function layer, pooling layer and fully-connected layer of the convolutional neural network as separate units, so that various network structures can be constructed through parameter configuration; according to the configuration parameters, it performs inter-layer pipelined processing of convolution, activation, pooling and fully-connected calculation on the convolution kernel parameters and input feature maps received from the data preprocessing module, with parallel processing within each layer; the processing result is sent to the data post-processing module;
the data post-processing module is used for writing the output data of the convolutional neural network computing module into the data storage module;
the data storage module is used for storing the model parameters, the intermediate feature map results and the final calculation result of the convolutional neural network, and it exchanges data with an external host through a PCIe interface.
Preferably, in the above convolutional neural network acceleration system, the convolutional neural network computation module includes a convolutional computation submodule, an activation function computation submodule, a pooling computation submodule, and a full-connection computation submodule, and these submodules inside the convolutional neural network computation module are connected according to network model configuration parameters predefined by the network model configuration module;
after receiving the convolution kernel parameters and feature map sent by the data preprocessing module, the convolutional neural network calculation module starts processing with its submodules organized according to the configured parameters, and sends the result to the data post-processing module when processing is finished;
specifically, the convolution calculation submodule performs convolution calculation by using the input convolution kernel parameters and the characteristic diagram, and sends the result to the activation function calculation submodule;
the activation function calculation sub-module selects an activation function according to activation function configuration parameters predefined by the network model parameter configuration module, performs activation calculation on the feature map by using the selected activation function, and sends a result to the pooling calculation sub-module or the full-connection calculation sub-module according to the parameter configuration after the activation calculation is completed;
the pooling calculation submodule is used for performing pooling calculation on the received feature map and, according to configuration parameters predefined by the network model configuration module, sending the pooling result either to the fully-connected calculation submodule or directly to the data post-processing module;
and the full-connection calculation submodule is used for performing full-connection calculation on the received characteristic diagram and sending a full-connection result to the data post-processing module.
Preferably, in the above convolutional neural network acceleration system, the data preprocessing module includes a data transmission sub-module, a convolutional kernel parameter preprocessing sub-module, and a feature map preprocessing sub-module;
the data transmission submodule is used for controlling the transmission of the characteristic diagram and the convolution kernel parameter between the data storage module and the convolution neural network computing module; the convolution kernel parameter preprocessing submodule is used for rearranging and sorting convolution kernel parameters; the characteristic diagram preprocessing submodule is used for unfolding, copying and arranging the characteristic diagram.
Preferably, in the above convolutional neural network acceleration system, the data storage module includes a convolution kernel parameter storage submodule and a feature map storage submodule; the convolution kernel parameter storage submodule is used for storing the convolution kernel parameters, and the feature map storage submodule is used for storing the input feature map and the temporary feature maps produced during calculation; the storage submodules are preferably partitioned within a DDR memory attached to the FPGA, and under the OpenCL programming framework the data storage module serves as the global memory.
Preferably, in the convolutional neural network acceleration system, the data transmission submodule includes a DDR controller, a data transmission bus, and a storage buffer;
the DDR controller is used for controlling data transmission between the DDR and the FPGA, and a data transmission bus is connected with the DDR and the FPGA and is a channel for data transmission; the storage buffer is used for temporarily storing data, reducing the reading of DDR by the FPGA and improving the data transmission speed.
Preferably, in the convolutional neural network acceleration system, the convolution computation submodule includes one or more matrix multiplication computation submodules; the number of the matrix multiplication calculation sub-modules is set by configuration parameters predefined by a network model configuration module; the calculation among the matrix multiplication calculation submodules is executed in parallel;
the matrix multiplication calculation submodule accelerates its operation with the Winograd minimal filtering algorithm and is used for computing the matrix multiplication between a single convolution kernel and the corresponding local feature map.
Preferably, in the above convolutional neural network acceleration system, the activation function calculation sub-module includes an activation function selection sub-module, a Sigmoid function calculation sub-module, a Tanh function calculation sub-module, and a ReLU function calculation sub-module;
the activation function selection sub-module is respectively connected with the Sigmoid function calculation sub-module, the Tanh function calculation sub-module and the ReLU function calculation sub-module, and the data of the characteristic diagram is sent to one of the three calculation sub-modules;
the activation function selection submodule is used for setting an activation calculation mode of a characteristic diagram in the convolutional neural network;
the Sigmoid function calculation sub-module is used for calculating a Sigmoid function; the Tanh function calculation submodule is used for calculating a Tanh function; and the ReLU function calculation submodule is used for calculating the ReLU function.
Preferably, in the convolutional neural network acceleration system, the pooling calculation submodule includes a double buffer formed by two FPGA on-chip memories;
the double-buffer structure is used for storing temporary feature map data during pooling calculation; the buffer size is set by network configuration parameters predefined by the network model parameter configuration module, and differs between pooling layers. Ping-pong read-write operation is realized through the double-buffer structure, enabling pipelined processing of the pooling calculation.
Preferably, in the convolutional neural network acceleration system, the network model parameter configuration module is implemented with FPGA on-chip storage and is used for storing the network model configuration parameters, including the size of the network input feature map, the size and number of convolution kernel parameters in the convolution calculation submodule, the size of the pooling window in the pooling calculation submodule, the parameter scale of the fully-connected calculation submodule, and the calculation parallelism; the data in the network model parameter configuration module is preferably written in advance, before system startup.
Preferably, in the above convolutional neural network acceleration system, the convolutional neural network computation module is formed by cascading a convolutional computation submodule, an activation function computation submodule, a pooling computation submodule, and a full-connection computation submodule according to network model configuration parameters, the submodules use OpenCL channels for data transmission, the computations in the submodules are executed in parallel, and the computations between the submodules are performed in a pipeline manner.
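As a rough illustration of this cascading, the following OpenCL sketch connects two pipeline stages with Altera OpenCL channels. The kernel names, channel depths and the trivial per-element work are illustrative assumptions; only the channels extension and its read/write calls are the actual SDK mechanism referred to above.

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float conv_to_act __attribute__((depth(64)));
channel float act_to_pool __attribute__((depth(64)));

/* Producer stage, standing in for the convolution submodule. */
__kernel void conv_stage(__global const float *restrict feat,
                         __global const float *restrict weights,
                         int n) {
    for (int i = 0; i < n; ++i) {
        float acc = feat[i] * weights[i];   /* placeholder for the MAC tree */
        write_channel_altera(conv_to_act, acc);
    }
}

/* Consumer stage, standing in for the activation submodule (ReLU). */
__kernel void relu_stage(int n) {
    for (int i = 0; i < n; ++i) {
        float v = read_channel_altera(conv_to_act);
        write_channel_altera(act_to_pool, v > 0.0f ? v : 0.0f);
        /* a downstream pooling kernel would read act_to_pool */
    }
}
```

Because both kernels run concurrently on the FPGA and the channels carry data directly between them, intermediate results never round-trip through global memory, which is the point of the pipelined organization described above.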
The FPGA-based convolutional neural network acceleration system provided by the invention combines the structural characteristics of the convolutional neural network model, the characteristics of the FPGA chip and the advantages of the OpenCL programming framework. It restructures the existing convolutional neural network computation and designs corresponding modules, fully exploiting the parallelism of the convolutional neural network within the computation process and the pipelining between computation layers, so that the computation better matches the design characteristics of the FPGA, uses the FPGA's computing resources reasonably and efficiently, and increases the processing speed of the convolutional neural network. In general, compared with the prior art, the technical solution conceived by the present invention achieves the following beneficial effects:
(1) the FPGA-based convolutional neural network acceleration system provided by the invention exploits the computational characteristics of each layer of the convolutional neural network to design a system architecture suitable for pipelined processing and parallel computing; the data preprocessing module, the convolutional neural network calculation module and the data post-processing module form a pipeline structure; data transmission between the storage module and the calculation module is controlled through the data preprocessing module and the data post-processing module, and the convolution kernel parameters and feature map pass through the three modules of the pipeline in turn to complete the pipelined steps of data reading, data calculation and data storage; the convolutional layer, activation function layer, pooling layer and fully-connected layer of the convolutional neural network are designed as separate calculation modules, and various network structures are constructed through parameter configuration; the processing of each submodule is divided into several small stages, and the data of each layer's submodule passes through stages such as data reading, data processing and data storage, forming a pipeline similar to a computer instruction pipeline; calculation within a layer can be executed in parallel and calculation between layers can be executed in a pipelined manner, effectively increasing the processing speed of the convolutional neural network.
(2) The FPGA-based convolutional neural network acceleration system exploits the low data dependency between convolution kernel parameters and local feature maps in convolutional neural network calculation. In the parallel computing structure of the convolution calculation submodule, each pass computes a convolution kernel against the data of its corresponding window on the input feature map; since the data computed by different convolution kernels are independent under this structure, multiple computations can proceed in parallel. Furthermore, the sliding window is removed and the data of the original sliding windows are laid out directly as multiple data blocks, and the corresponding data blocks are input directly during computation, so that multiple data blocks are computed against the convolution kernels simultaneously, further increasing the processing speed.
(3) With the FPGA-based convolutional neural network acceleration system, partial pooling calculation can begin as soon as data enters the pooling calculation submodule during the calculation of the convolutional neural network. Since the computations of the multiple convolution kernels are parallel, partial results on some channels are generated simultaneously, that is, some inputs of the pooling calculation submodule are already available. Both the pooling calculation submodule and the convolution calculation submodule compute in units of a sliding window, so a pooling operation can start once all data in a given window of the pooling calculation submodule have been obtained, rather than only after all calculations of the convolution calculation submodule have finished. The convolution calculation submodule can generate data on multiple channels simultaneously, and the channels are independent of each other during pooling, so the calculation on each channel in the pooling calculation submodule can proceed in parallel, greatly increasing the processing speed of the convolutional neural network.
(4) The convolutional neural network acceleration system based on the FPGA has the advantages that the parameters of the network model can be configured, and the configuration file is used for setting the structure of the network model and the parallelism in network calculation, so that the convolutional neural network can be operated by different types of network models and FPGAs with different calculation capabilities through parameter configuration.
(5) In the preferred scheme of the FPGA-based convolutional neural network acceleration system, the Winograd minimal filtering algorithm is adopted in the convolutional layer calculation, which accelerates the convolution computation;
a ping-pong buffer area is adopted in the calculation process of the pooling layer, so that the pooling calculation can be accelerated and the use of a storage space can be reduced;
batched calculation is adopted in the fully-connected layer, which reduces accesses to the external storage space during computation, and segmented calculation is adopted, which simplifies the high-dimensional matrix multiplication, increasing the processing speed and reducing the demands on the computing capability of the FPGA hardware;
and each computing module in the convolutional neural network is realized by adopting an OpenCL kernel program, so that the development difficulty can be reduced.
Drawings
FIG. 1 is a schematic diagram of an architecture of one embodiment of an FPGA-based convolutional neural network acceleration system provided by the present invention;
FIG. 2 is a schematic processing diagram of a data preprocessing module in an embodiment;
FIG. 3 is a schematic processing diagram of a convolution calculation submodule in an embodiment;
FIG. 4 is a schematic processing diagram of an activation function calculation sub-module in an embodiment;
FIG. 5 is a process diagram of the pooling calculation sub-module in an embodiment;
FIG. 6 is a schematic processing diagram of a fully-connected computation submodule in an embodiment;
FIG. 7 is a schematic processing diagram of a data post-processing module in an embodiment;
fig. 8 is a process flow diagram of the acceleration system in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to fig. 1, an embodiment of the convolutional neural network acceleration system based on the FPGA according to the present invention includes a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module, and a network model configuration module;
the input end of the data preprocessing module is connected with the data storage module, the input end of the convolutional neural network computing module is connected with the output end of the data preprocessing module, the input end of the data post-processing module is connected with the output end of the convolutional neural network computing module, and the input end of the data storage module is connected with the output end of the data post-processing module; the convolutional neural network calculation module is also connected with the network model configuration module;
the data preprocessing module is used for reading the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage and preprocessing them: the 4-dimensional convolution kernel parameters are rearranged into 3-dimensional form, and the input feature map is expanded and copied using a sliding window so that the local feature maps in the sliding window correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be used directly in computation; after preprocessing, the processed convolution kernel parameters and input feature map are sent to the convolutional neural network calculation module;
the network model configuration module is used for configuring the parameters of the convolutional neural network calculation module; the convolutional neural network calculation module is used for processing the convolution kernel parameters and input feature maps received from the data preprocessing module according to the configuration parameters and sending the processing result to the data post-processing module;
the convolutional neural network computing module comprises a convolutional computing submodule, an activation function computing submodule, a pooling computing submodule and a full-connection computing submodule, and the submodules in the convolutional neural network computing module are connected according to network model configuration parameters predefined by the network model configuration module;
the data post-processing module is used for writing the output data of the convolutional neural network computing module into the data storage module;
the data storage module is used for storing model parameters, a model result and a final calculation result of the convolutional neural network, and the data storage module exchanges data with an external host through a PCIe interface.
Referring to fig. 2, the data preprocessing module reads the convolution kernel parameters and the input feature map from the data storage module. When reading the convolution kernels, it reads PARALLEL_KERNEL convolution kernels, each of size k × k × Ci, according to the parameters predefined in the model parameter configuration module, where Ci denotes the number of channels of the input feature map. After the convolution kernels are read in, they are serialized: the four-dimensional convolution kernel array of size k × k × Ci × PARALLEL_KERNEL is rearranged into a three-dimensional form of size k × k × (Ci × PARALLEL_KERNEL).
When processing the input feature map, the module first reads a feature map of size H × W × Ci, then unfolds it according to the size k of the sliding window and the movement stride on the feature map; the unfolded feature map has size (k × ((W - k)/stride + 1)) × (k × ((H - k)/stride + 1)) × Ci.
After the input feature map is unfolded, it is cut according to the configuration parameters into pieces of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × Ci, and each cut piece is copied a number of times equal to the number of convolution kernels, giving feature map blocks of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × (Ci × PARALLEL_KERNEL), which enables parallel computation of multiple convolution kernels with the feature map. After the convolution kernels and feature map have been processed, the processed convolution kernel parameters and feature map are sent to the convolutional neural network calculation module for processing.
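To make the rearrangement above concrete, here is a minimal host-side C sketch of the sliding-window expansion (an im2col-style layout). The function name, the channel-major array layout and the omission of the PARALLEL_KERNEL copying step are assumptions for illustration, not the patent's actual data structures.

```c
void expand_feature_map(const float *in, float *out,
                        int H, int W, int C, int k, int stride) {
    int out_h = (H - k) / stride + 1;   /* window positions per column */
    int out_w = (W - k) / stride + 1;   /* window positions per row    */
    int idx = 0;
    for (int c = 0; c < C; ++c)
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x)
                /* copy one k x k window so that each local feature map is
                 * contiguous and lines up with a serialized kernel */
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx)
                        out[idx++] =
                            in[(c * H + y * stride + ky) * W + x * stride + kx];
}
```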
The processing flow of the convolution calculation submodule of the convolutional neural network calculation module is shown in fig. 3. The inputs of this submodule are the convolution kernel parameters and feature map generated by the data preprocessing module, together with the related configuration parameters predefined in the network model configuration module. The preprocessed convolution kernels and feature map are three-dimensional matrices whose channel count is Ci × PARALLEL_KERNEL; the convolution kernel and feature map on each channel are fed into different OpenCL compute units, which perform the two-dimensional matrix multiplication using the Winograd matrix multiplication submodules. The computations of the OpenCL compute units can proceed in parallel, and the computation result has size (PARALLEL_FEATURE_W/k) × (PARALLEL_FEATURE_H/k) with channel count Ci × PARALLEL_KERNEL. The convolution calculation submodule thus turns the input feature map into a partial output feature map of the convolutional layer, which is handled differently according to the type of the next layer. If the next layer predefined in the network model configuration is a convolutional layer or a fully-connected layer, the output feature map skips the pooling layer and the data post-processing module writes the result back to external storage for further processing; if the next layer predefined in the network model configuration is a pooling layer, the feature map is sent to the pooling calculation submodule for pooling.
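Since the matrix multiplication submodules use the Winograd minimal filtering algorithm, a reduced one-dimensional example may help. The sketch below computes F(2,3), i.e. two outputs of a 3-tap filter with 4 multiplications instead of 6; the patent applies the same idea to two-dimensional tiles, and this standalone C function only illustrates the transform, it is not the patent's implementation.

```c
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    float m1 = (d[0] - d[2]) * g[0];
    float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3;   /* equals d0*g0 + d1*g1 + d2*g2 */
    y[1] = m2 - m3 - m4;   /* equals d1*g0 + d2*g1 + d3*g2 */
}
```

The saving in multiplications is what makes the algorithm attractive on an FPGA, where DSP multiplier blocks are the scarce resource.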
Referring to fig. 4, the activation function calculation sub-module in the embodiment includes an activation function selection sub-module and three function calculation sub-modules, a selector in the activation function selection sub-module is determined by a configuration parameter in the model configuration module, and the three function calculation sub-modules respectively correspond to the calculation of the Sigmoid, tanh, and ReLU activation functions. And the input feature diagram is sent to the function calculation submodule for activation function calculation processing according to the path determined by the activation function selection submodule, and is sent to the data storage module or the pooling calculation submodule according to the configuration parameters after the activation function calculation processing is finished.
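A minimal OpenCL C sketch of such an activation selector is shown below. The act_type encoding (0 = none, 1 = ReLU, 2 = Sigmoid, 3 = Tanh) follows Table 1 later in this description; the function itself and its use of a switch are illustrative assumptions.

```c
float activate(float x, int act_type) {
    switch (act_type) {
    case 1:  return x > 0.0f ? x : 0.0f;        /* ReLU    */
    case 2:  return 1.0f / (1.0f + exp(-x));    /* Sigmoid */
    case 3:  return tanh(x);                    /* Tanh    */
    default: return x;                          /* no activation */
    }
}
```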
Referring to fig. 5, the pooling calculation submodule uses two ping-pong buffers of size pool_size × W to store the calculation results from the activation function calculation submodule, where pool_size and W are configuration parameters. The results are first filled continuously into Buffer1, and partial pooling within the buffer can already be performed during filling. Once Buffer1 is full, the results are filled into Buffer2; while Buffer2 is being filled, the data in Buffer2 can be pooled, and the data spanning Buffer1 and Buffer2 can also be pooled. When Buffer2 is full, the results are filled into Buffer1 again, and the two buffers alternate in this way until the whole pooling calculation is completed. A pooling window may also lie between the two buffers, with its data coming from both; such a window can be computed while one buffer performs its calculation operation and the other performs its filling operation. Since there is no data dependency between pooling windows, loop unrolling can be used to compute different windows simultaneously.
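The following self-contained C sketch mimics that ping-pong schedule in software, assuming max pooling, a stand-in producer fill_row() in place of the activation submodule, and illustrative sizes POOL and W; on the FPGA the fill and pool steps of the two buffers overlap in time rather than running sequentially as here.

```c
#include <float.h>

#define POOL 2
#define W    8

static void fill_row(float *buf, int n) {        /* stand-in producer */
    for (int i = 0; i < n; ++i) buf[i] = (float)i;
}

static void pool_buffer(const float *buf, float *out) {
    /* max-pool one POOL x W buffer into W/POOL outputs */
    for (int x = 0; x < W; x += POOL) {
        float m = -FLT_MAX;
        for (int r = 0; r < POOL; ++r)
            for (int i = 0; i < POOL; ++i)
                if (buf[r * W + x + i] > m) m = buf[r * W + x + i];
        out[x / POOL] = m;
    }
}

void pool_pingpong(int row_blocks, float *out) {
    float buf[2][POOL * W];
    int fill = 0;                        /* buffer currently being filled */
    for (int b = 0; b < row_blocks; ++b) {
        fill_row(buf[fill], POOL * W);   /* producer writes one buffer... */
        if (b > 0)                       /* ...while the other is pooled  */
            pool_buffer(buf[1 - fill], out + (b - 1) * (W / POOL));
        fill = 1 - fill;                 /* swap roles (ping-pong)        */
    }
    pool_buffer(buf[1 - fill], out + (row_blocks - 1) * (W / POOL));
}
```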
Referring to fig. 6, during processing the fully-connected calculation submodule divides the input matrix formed by N input vectors transversely into dim1/m segments, where N is the number of input feature vectors, dim1 is the dimension of the input feature vectors, and m is the segment length. Each segment forms a submatrix of size m × N, which is multiplied with the corresponding part of the weight matrix to obtain a partial result of the output matrix; the partial results of the dim1/m segments are accumulated to give the final calculation result formed by the N output vectors. When computing the product of a submatrix with the corresponding part of the weight matrix, Winograd minimal filtering matrix multiplication is used to accelerate the computation.
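A plain C sketch of this segmented computation is given below: the dim1-long inner product is split into segments of length m and the partial products accumulated, so only one m-wide slice of the input and weight matrices is live at a time. Plain multiply-accumulate is used here instead of the Winograd-accelerated multiplication, and all names are illustrative.

```c
void fc_segmented(const float *in,   /* dim1 x N input matrix     */
                  const float *wt,   /* dim2 x dim1 weight matrix */
                  float *out,        /* dim2 x N output matrix    */
                  int dim1, int dim2, int N, int m) {
    for (int i = 0; i < dim2 * N; ++i) out[i] = 0.0f;
    for (int s = 0; s < dim1; s += m)                /* one segment */
        for (int r = 0; r < dim2; ++r)
            for (int c = 0; c < N; ++c) {
                float acc = 0.0f;
                for (int k = s; k < s + m && k < dim1; ++k)
                    acc += wt[r * dim1 + k] * in[k * N + c];
                out[r * N + c] += acc;               /* merge partials */
            }
}
```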
Referring to fig. 7, after the pooling calculation submodule or fully-connected calculation submodule in the convolutional neural network calculation module finishes processing, the data post-processing module starts writing the data they output back into the data storage module; in this process, the barrier operation of the OpenCL framework is used to ensure that transmission starts only after all calculation results are obtained and that the next processing step starts only after all data transmission is completed.
Referring to fig. 8, the processing flow of the acceleration system provided by the embodiment mainly includes three parts. The first part is the kernel program compilation process: to make maximal use of the computing and storage resources on the FPGA, suitable network computation parallelism parameters must be set. In the embodiment, setting the parallelism parameters is completed automatically by a program: initial values of PARALLEL_FEATURE and PARALLEL_KERNEL are set in the kernel program of the convolutional neural network, the kernel program is then compiled with the Altera OpenCL SDK, and after compilation the resource utilization (storage resources, logic resources, computing resources, etc.) is obtained from the compilation report; if resource utilization has not reached its maximum, the values of PARALLEL_FEATURE and PARALLEL_KERNEL are updated and the program is recompiled, until maximal hardware resource utilization is obtained. The compilation yields a hardware program that can run on the FPGA.
The second part is the parameter configuration process, which covers network model calculation parameters and model configuration parameters. The network model calculation parameters are read directly from a Caffe model file (caffemodel); the model configuration parameters include the input feature map size of each layer, the convolution kernel size, the pooling window size, and so on. Parameter configuration is completed with the clSetKernelArg() function in OpenCL. Table 1 below illustrates the types and values of the model configuration parameters, taking VGG16 as an example.
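As an illustration of this step, the host-side C sketch below pushes one layer's configuration parameters to a kernel with clSetKernelArg(). Only clSetKernelArg() itself is the real OpenCL API; the argument order, parameter list and kernel object are assumptions, since the patent does not publish its kernel signatures.

```c
#include <CL/cl.h>

cl_int configure_layer(cl_kernel kernel,
                       cl_int in_w, cl_int in_h, cl_int in_c,
                       cl_int k, cl_int stride,
                       cl_int act_type, cl_int output_dst) {
    cl_int err = CL_SUCCESS;
    err |= clSetKernelArg(kernel, 0, sizeof(cl_int), &in_w);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_int), &in_h);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_int), &in_c);
    err |= clSetKernelArg(kernel, 3, sizeof(cl_int), &k);
    err |= clSetKernelArg(kernel, 4, sizeof(cl_int), &stride);
    err |= clSetKernelArg(kernel, 5, sizeof(cl_int), &act_type);
    err |= clSetKernelArg(kernel, 6, sizeof(cl_int), &output_dst);
    return err;   /* caller checks against CL_SUCCESS */
}
```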
TABLE 1 types of model configuration parameters and parameter value examples
(The table content is reproduced as images in the original publication and is not recoverable here.)
In the above table, in the activation function column, 0 indicates no activation function, 1 indicates the ReLU activation function, 2 indicates the Sigmoid activation function, and 3 indicates the Tanh activation function; in the Output dst column, 1 indicates output to the data storage module, 2 indicates output to the pooling calculation submodule, and 3 indicates output to the convolution calculation submodule.
The third part is the running process of the neural network: when the host transmits a picture to the data storage module, the system on the FPGA starts running; after a run finishes, the data storage module returns the calculation result to the host; the run ends when no more pictures are input.
The embodiment provides an FPGA-based convolutional neural network acceleration system that realizes the VGG16 and AlexNet network models on a DE5a-Net development board and performs performance tests with picture data of size 224 × 224 × 3. The experimental data show a processing speed of 160 ms/image for VGG16 and 12 ms/image for AlexNet, which is superior to other FPGA implementation schemes.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A convolutional neural network acceleration system based on FPGA is characterized by comprising a data preprocessing module, a convolutional neural network calculation module, a data post-processing module, a data storage module and a network model configuration module; the data preprocessing module, the convolutional neural network computing module and the data post-processing module are realized on the basis of an FPGA (field programmable gate array), the data storage module is realized on the basis of FPGA off-chip storage, and the network model configuration module is realized on the basis of FPGA on-chip storage;
the data preprocessing module is used for reading the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage, and preprocessing them: the 4-dimensional convolution kernel parameters are rearranged into 3 dimensions, and the input feature map is expanded and copied using a sliding window so that the local feature maps in the sliding window correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be used directly in computation; after preprocessing, the processed convolution kernel parameters and input feature map are sent to the convolutional neural network calculation module; specifically, when reading the convolution kernels, PARALLEL_KERNEL convolution kernels, each of size k × k × Ci, are read according to the parameters predefined in the model parameter configuration module, where Ci represents the number of channels of the input feature map; after the convolution kernels are read in, they are serialized, that is, the four-dimensional convolution kernel array of size k × k × Ci × PARALLEL_KERNEL is rearranged into a three-dimensional form of size k × k × (Ci × PARALLEL_KERNEL);
when processing the input feature map, a feature map of size H × W × Ci is read first, and the feature map is then unfolded according to the size of the sliding window and the movement stride on the feature map; the unfolded feature map has size (k × ((W - k)/stride + 1)) × (k × ((H - k)/stride + 1)) × Ci;
after the input feature map is unfolded, it is cut according to the configuration parameters into pieces of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × Ci, and the cut pieces are copied so that the number of feature map copies equals the number of convolution kernels, finally giving feature map blocks of size PARALLEL_FEATURE_W × PARALLEL_FEATURE_H × (Ci × PARALLEL_KERNEL), which enables parallel computation of the plurality of convolution kernels with the feature map;
the network model configuration module is used for configuring the parameters of the convolutional neural network calculation module; the convolutional neural network calculation module is used for implementing the convolutional layer, activation function layer, pooling layer and fully-connected layer of the convolutional neural network as separate units, constructing various network structures through parameter configuration, performing inter-layer pipelined processing of convolution, activation, pooling and fully-connected calculation on the convolution kernel parameters and input feature maps received from the data preprocessing module according to the configuration parameters, and sending the processing results to the data post-processing module;
the data post-processing module is used for writing the output data of the convolutional neural network computing module into the data storage module;
the data storage module is used for storing the model parameters, the intermediate feature map results and the final calculation result of the convolutional neural network, and exchanges data with an external host through a PCIe interface;
the convolutional neural network calculation module is formed by a convolution calculation submodule, an activation function calculation submodule, a pooling calculation submodule and a fully-connected calculation submodule cascaded according to the network model configuration parameters; data is transmitted between the submodules using OpenCL channels, calculation within a submodule is executed in parallel, and calculation between submodules is carried out in a pipelined manner;
the convolution computation submodule comprises one or more matrix multiplication computation submodules; the number of the matrix multiplication calculation sub-modules is set by configuration parameters predefined by a network model configuration module; the processing among the matrix multiplication computation submodules is executed in parallel;
the matrix multiplication calculation submodule accelerates its operation with the Winograd minimal filtering algorithm and is used for computing the matrix multiplication between a single convolution kernel and the corresponding local feature map.
2. The convolutional neural network acceleration system of claim 1, wherein the convolutional neural network computation module comprises a convolutional computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully-connected computation submodule; the sub-modules in the convolutional neural network computing module are connected according to network model configuration parameters predefined by the network model configuration module;
the convolution calculation submodule performs convolution calculation by using the input convolution kernel parameters and the characteristic diagram, and sends a result to the activation function calculation submodule after the convolution calculation is completed;
the activation function calculation submodule selects an activation function according to activation function configuration parameters predefined by the network model parameter configuration module; performing activation calculation on the feature map by using the selected activation function, and sending a result to the pooling calculation submodule or the full-connection calculation submodule according to parameter configuration after the activation calculation is completed;
the pooling calculation sub-module is used for performing pooling calculation on the received characteristic diagram and sending a pooling result to the full-connection calculation module according to configuration parameters predefined by the network model configuration module or directly to the data post-processing module;
and the full-connection calculation submodule is used for performing full-connection calculation on the received characteristic diagram and sending a full-connection result to the data post-processing module.
3. The convolutional neural network acceleration system as claimed in claim 2, wherein the activation function calculation sub-module includes an activation function selection sub-module, a Sigmoid function calculation sub-module, a Tanh function calculation sub-module, and a ReLU function calculation sub-module;
the activation function selection submodule is respectively connected with the Sigmoid function calculation submodule, the Tanh function calculation submodule and the ReLU function calculation submodule, and data of the characteristic diagram is sent to one of the three calculation submodules;
the activation function selection submodule is used for setting an activation calculation mode of a characteristic diagram in the convolutional neural network; the Sigmoid function calculation sub-module is used for calculating a Sigmoid function; the Tanh function calculation submodule is used for calculating a Tanh function; and the ReLU function calculation submodule is used for calculating the ReLU function.
4. The convolutional neural network acceleration system as claimed in any one of claims 1 to 3, wherein the pooling calculation submodule comprises a double buffer formed by two FPGA on-chip memories for storing temporary feature map data during pooling calculation; the buffer size is set by network configuration parameters predefined by the network model parameter configuration module and differs between pooling layers, and ping-pong read-write operation is realized through the double-buffer structure to achieve pipelined processing of the pooling calculation.
5. The convolutional neural network acceleration system of claim 1, wherein the data preprocessing module comprises a data transmission sub-module, a convolution kernel parameter preprocessing sub-module, and a feature map preprocessing sub-module;
the data transmission submodule is used for controlling the transmission of the characteristic diagram and the convolution kernel parameter between the data storage module and the convolution neural network computing module; the convolution kernel parameter preprocessing submodule is used for rearranging and sorting the convolution kernel parameters; the characteristic diagram preprocessing submodule is used for expanding, copying and sorting the characteristic diagram.
6. The convolutional neural network acceleration system of claim 5, wherein the data transfer submodule comprises a DDR controller, a data transfer bus and a memory cache;
the DDR controller is used for controlling data transmission between the DDR and the FPGA, and a data transmission bus is connected with the DDR and the FPGA and is a data transmission channel; the storage buffer is used for temporarily storing data, reducing the reading of DDR by the FPGA and improving the data transmission speed.
7. The convolutional neural network acceleration system of claim 1, wherein the data storage module comprises a convolutional kernel parameter storage submodule and a feature map storage submodule, the convolutional kernel parameter storage submodule is used for storing convolutional kernel parameters, and the feature map storage submodule is used for storing an input feature map and a temporary feature map in the calculation process; the storage sub-modules are all formed by dividing a DDR memory connected with the FPGA.
8. The convolutional neural network acceleration system of claim 1, wherein the network model parameter configuration module is configured to store network model configuration parameters, including the size of the network input feature map, the size and number of convolutional kernel parameters in the convolutional calculation submodule, the size of the pooling window in the pooling calculation submodule, the parameter size of the fully-connected calculation submodule, and the calculation parallelism; the data in the network model parameter configuration module is written in advance before the system is started.
CN201810710069.1A 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA Expired - Fee Related CN109086867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810710069.1A CN109086867B (en) 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810710069.1A CN109086867B (en) 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA

Publications (2)

Publication Number Publication Date
CN109086867A CN109086867A (en) 2018-12-25
CN109086867B true CN109086867B (en) 2021-06-08

Family

ID=64836906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810710069.1A Expired - Fee Related CN109086867B (en) 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA

Country Status (1)

Country Link
CN (1) CN109086867B (en)

Families Citing this family (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678244B2 (en) 2017-03-23 2020-06-09 Tesla, Inc. Data synthesis for autonomous control systems
US10671349B2 (en) 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11157441B2 (en) 2017-07-24 2021-10-26 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11215999B2 (en) 2018-06-20 2022-01-04 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11361457B2 (en) 2018-07-20 2022-06-14 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
KR20210072048A (en) 2018-10-11 2021-06-16 테슬라, 인크. Systems and methods for training machine models with augmented data
US11196678B2 (en) 2018-10-25 2021-12-07 Tesla, Inc. QOS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
CN109656721A (en) * 2018-12-28 2019-04-19 上海新储集成电路有限公司 A kind of efficient intelligence system
CN109685209B (en) * 2018-12-29 2020-11-06 瑞芯微电子股份有限公司 Device and method for accelerating operation speed of neural network
CN109948784B (en) * 2019-01-03 2023-04-18 重庆邮电大学 Convolutional neural network accelerator circuit based on rapid filtering algorithm
CN109961139A (en) * 2019-01-08 2019-07-02 广东浪潮大数据研究有限公司 A kind of accelerated method, device, equipment and the storage medium of residual error network
CN109784489B (en) * 2019-01-16 2021-07-30 北京大学软件与微电子学院 Convolutional neural network IP core based on FPGA
CN109767002B (en) * 2019-01-17 2023-04-21 山东浪潮科学研究院有限公司 Neural network acceleration method based on multi-block FPGA cooperative processing
CN109799977B (en) * 2019-01-25 2021-07-27 西安电子科技大学 Method and system for developing and scheduling data by instruction program
US11150664B2 (en) 2019-02-01 2021-10-19 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US10997461B2 (en) 2019-02-01 2021-05-04 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US10956755B2 (en) 2019-02-19 2021-03-23 Tesla, Inc. Estimating object properties using visual image data
CN109976903B (en) 2019-02-22 2021-06-29 华中科技大学 Deep learning heterogeneous computing method and system based on layer width memory allocation
US11580386B2 (en) 2019-03-18 2023-02-14 Electronics And Telecommunications Research Institute Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system
CN110188869B (en) * 2019-05-05 2021-08-10 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110334801A (en) * 2019-05-09 2019-10-15 苏州浪潮智能科技有限公司 A kind of hardware-accelerated method, apparatus, equipment and the system of convolutional neural networks
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN110458279B (en) * 2019-07-15 2022-05-20 武汉魅瞳科技有限公司 FPGA-based binary neural network acceleration method and system
CN110390392B (en) * 2019-08-01 2021-02-19 上海安路信息科技有限公司 Convolution parameter accelerating device based on FPGA and data reading and writing method
CN110443357B (en) * 2019-08-07 2020-09-15 上海燧原智能科技有限公司 Convolutional neural network calculation optimization method and device, computer equipment and medium
CN110852930B (en) * 2019-10-25 2021-06-29 华中科技大学 FPGA graph processing acceleration method and system based on OpenCL
CN111079923B (en) * 2019-11-08 2023-10-13 中国科学院上海高等研究院 Spark convolutional neural network system suitable for edge computing platform and circuit thereof
CN111105015A (en) * 2019-12-06 2020-05-05 浪潮(北京)电子信息产业有限公司 General CNN reasoning accelerator, control method thereof and readable storage medium
CN111160544B (en) * 2019-12-31 2021-04-23 上海安路信息科技股份有限公司 Data activation method and FPGA data activation system
CN111210019B (en) * 2020-01-16 2022-06-24 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111242289B (en) * 2020-01-19 2023-04-07 清华大学 Convolutional neural network acceleration system and method with expandable scale
CN111325327B (en) * 2020-03-06 2022-03-08 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111626403B (en) * 2020-05-14 2022-05-10 北京航空航天大学 Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN111583095B (en) * 2020-05-22 2022-03-22 浪潮电子信息产业股份有限公司 Image data storage method, image data processing system and related device
CN111736986B (en) * 2020-05-29 2023-06-23 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) acceleration execution method and related device of deep learning model
CN111753974B (en) * 2020-06-22 2024-10-15 深圳鲲云信息科技有限公司 Neural network accelerator
CN111860781B (en) * 2020-07-10 2024-06-28 逢亿科技(上海)有限公司 Convolutional neural network feature decoding system based on FPGA
CN111931913B (en) * 2020-08-10 2023-08-01 西安电子科技大学 Deployment method of convolutional neural network on FPGA (field programmable gate array) based on Caffe
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112101284A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Image recognition method, training method, device and system of image recognition model
CN112905526B (en) * 2021-01-21 2022-07-08 北京理工大学 FPGA implementation method for multiple types of convolution
CN112766478B (en) * 2021-01-21 2024-04-12 中国电子科技集团公司信息科学研究院 FPGA (field programmable Gate array) pipeline structure oriented to convolutional neural network
CN112732638B (en) * 2021-01-22 2022-05-06 上海交通大学 Heterogeneous acceleration system and method based on CTPN network
CN112819140B (en) * 2021-02-02 2022-06-24 电子科技大学 OpenCL-based FPGA one-dimensional signal recognition neural network acceleration method
CN112949845B (en) * 2021-03-08 2022-08-09 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113065647B (en) * 2021-03-30 2023-04-25 西安电子科技大学 Calculation-storage communication system and communication method for accelerating neural network
CN113517007B (en) * 2021-04-29 2023-07-25 西安交通大学 Flowing water processing method and system and memristor array
CN113467783B (en) * 2021-07-19 2023-09-12 中科曙光国际信息产业有限公司 Nuclear function compiling method and device of artificial intelligent accelerator
CN114943635B (en) * 2021-09-30 2023-08-22 太初(无锡)电子科技有限公司 Fusion operator design and implementation method based on heterogeneous collaborative computing core
CN113949592B (en) * 2021-12-22 2022-03-22 湖南大学 Anti-attack defense system and method based on FPGA
CN114997392B (en) * 2022-08-03 2022-10-21 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN117195989B (en) * 2023-11-06 2024-06-04 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10891538B2 (en) * 2016-08-11 2021-01-12 Nvidia Corporation Sparse convolutional neural network accelerator
CN107341127B (en) * 2017-07-05 2020-04-14 西安电子科技大学 Convolutional neural network acceleration method based on OpenCL standard
CN107657581B (en) * 2017-09-28 2020-12-22 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN108154229B (en) * 2018-01-10 2022-04-08 西安电子科技大学 Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework

Also Published As

Publication number Publication date
CN109086867A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
CN109086867B (en) Convolutional neural network acceleration system based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
Tanomoto et al. A CGRA-based approach for accelerating convolutional neural networks
CN111079923B (en) Spark convolutional neural network system suitable for edge computing platform and circuit thereof
Kästner et al. Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ
JP2021510219A (en) Multicast Network On-Chip Convolutional Neural Network Hardware Accelerator and Its Behavior
CN112508184B (en) Design method of fast image recognition accelerator based on convolutional neural network
CN114450661A (en) Compiler flow logic for reconfigurable architecture
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
GB2625452A (en) Neural network comprising matrix multiplication
CN114970849A (en) Hardware accelerator multi-array parallel computing method and system
CN114035916A (en) Method for compiling and scheduling calculation graph and related product
Seto et al. Small memory footprint neural network accelerators
US20230021204A1 (en) Neural network comprising matrix multiplication
WO2022047423A1 (en) Memory processing unit architecture mapping techniques
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
Wu Review on FPGA-based accelerators in deep learning
Hu et al. On-chip instruction generation for cross-layer CNN accelerator on FPGA
CN114595813B (en) Heterogeneous acceleration processor and data computing method
CN109583006B (en) Dynamic optimization method of field programmable gate array convolution layer based on cyclic cutting and rearrangement
Hu et al. Data optimization cnn accelerator design on fpga
CN115374395A (en) Hardware structure for carrying out scheduling calculation through algorithm control unit
Li et al. Fpga-based object detection acceleration architecture design
WO2022095676A1 (en) Neural network sparsification device and method, and corresponding product
CN116185377A (en) Optimization method and device for calculation graph and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210608)