CN108154229B - Image processing method based on FPGA (field programmable gate array) accelerated convolutional neural network framework - Google Patents
Image processing method based on FPGA (field programmable gate array) accelerated convolutional neural network framework
- Publication number
- CN108154229B (application CN201810022870.7A)
- Authority
- CN
- China
- Prior art keywords
- picture
- resource
- block ram
- size
- data
- Prior art date: 2018-01-10
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an image processing method based on an FPGA (field programmable gate array) accelerated convolutional neural network framework, which mainly solves the problems of low resource utilization and low speed in the prior art. The scheme is as follows: 1) calculate a picture segmentation fixed value from the designed picture parameters and the FPGA resource parameters; 2) determine the number of DDR3 chips from the picture segmentation fixed value and allocate block ram resources; 3) construct a convolutional neural network framework according to 1) and 2), comprising a picture storage module, a picture data distribution module, a convolution module, a pooling module, a module that stores pictures back to DDR3, and an instruction register set; 4) all modules obtain control instructions from the instruction register set through handshake signals, cooperate with one another, and process the image data according to the control instructions. The invention improves resource utilization and the acceleration effect through the FPGA-accelerated convolutional neural network framework, and can be used for image classification, target recognition, voice recognition and natural language processing.
Description
Technical Field
The invention belongs to the technical field of computer design, and particularly relates to a convolutional neural network implementation method which can be used for image classification, target recognition, voice recognition and natural language processing.
Background
With advances in integrated circuit design and manufacturing processes, field programmable gate arrays with high-speed, high-density programmable logic resources have developed rapidly, and the integration level of a single chip keeps rising. To further improve FPGA performance, mainstream chip manufacturers integrate customized DSP computing units with high-speed digital signal processing capability inside the chip; a DSP hard core is a component that realizes fixed-point operations efficiently and at low cost, so FPGAs are widely used in application fields such as video and image processing, network communication, information security and bioinformatics.
The convolutional neural network (CNN) is an artificial neural network structure widely applied to image classification, target recognition, speech recognition, natural language processing and other fields. In recent years, with the great improvement of computing power and the development of network structures, the performance and accuracy of CNNs have improved substantially, but the demand on the parallel computing capability of the operation units keeps growing, so GPUs (graphics processing units) and FPGAs (field programmable gate arrays) with parallel computing capability have become the mainstream directions.
A configurable computing architecture based on an FPGA can exploit the parallelism of the artificial neural network, and the weights and topology of the convolutional neural network can be changed through configuration. An artificial neural network realized on an FPGA has the flexibility of software design while approaching an application specific integrated circuit (ASIC) in computing performance, and efficient interconnection can be realized with on-chip programmable routing resources, so the FPGA is an important choice for hardware implementation of artificial neural networks.
Current patents and research directions basically use the OpenCL programming language as the construction core, aiming to reduce the time needed to convert a convolutional neural network algorithm into a hardware description language; they do not address accelerating the hardware description language code of the algorithm inside the FPGA, and since OpenCL is not the language that actually runs on the FPGA, the actual running speed of the FPGA is not ideal. The prior art implemented with OpenCL programming mainly focuses on accelerating the DSP modules in the FPGA; it neither implements the convolutional neural network algorithm as a whole nor optimizes the underlying hardware description language, so the FPGA computation resources cannot be fully utilized, the computation time increases, and the acceleration effect is not obvious.
Disclosure of Invention
The invention aims to provide a convolutional neural network implementation method based on FPGA acceleration, which implements the convolutional neural network as a whole in a hardware description language, optimizes the underlying hardware description language, makes full use of the FPGA operation resources, and maximizes the FPGA acceleration effect.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) parameter processing:
1a) reading the picture and the FPGA board resource parameters input by a user, wherein the resource parameters comprise: the picture size N, the total block ram resource S_sum, the number P of synchronous dynamic random access memories DDR3, and the number A of computing function chips DSP;
1b) designing the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers C, the number of activation function layers E, the number of multi-classification function softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1c) calculating the size value set X of each layer of pictures, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth D and the theoretical data transmission bandwidth Z from the data read in 1a) and the parameters designed in 1b);
(2) fixed values for picture segmentation are calculated:
2a) calculating the common divisors M of the sizes of each layer of picture from the size value set X obtained in step (1);
2b) calculating, from the common divisors obtained in 2a) and the total block ram resource S_sum read in step (1), the picture common divisors C that satisfy the block ram resource limit of the FPGA;
2c) calculating, from the resource-limited common divisors obtained in 2b) and the DSP resources read in step (1), the greatest common divisor satisfying the DSP resource limit as the picture segmentation fixed value n;
(3) determining the number of DDR 3:
calculating an actual data transmission bandwidth H according to the picture segmentation fixed value n, and comparing the actual data transmission bandwidth H with a theoretical data transmission bandwidth Z:
if H > Z, the number B of DDR3 is determined to be 2 or 1+2j, where j is an integer ≥ 1;
if H ≤ Z, the number B of DDR3 is determined to be 3 or 1+4i, where i is an integer ≥ 1 and i ≠ j;
(4) Resource allocation is carried out on the block ram on the FPGA:
4a) calculating the picture storage block ram resource S_pic from the picture segmentation fixed value n determined in step (2) and the channel number T in step (1);
4b) calculating, from the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), the remaining block ram storage resource S_last and the largest storage parameter block ram resource S_ne, and comparing the two: if S_last ≥ S_ne, S_ne is used as the parameter storage block ram resource S_par; if S_last < S_ne, S_last minus 0.5 Mbit is used as the parameter storage block ram resource S_par;
(5) Constructing a convolutional neural network framework and processing the input picture by combining the parameters of 1a), 1b), 2c), (3), 4a) and 4b):
5a) setting a picture storage module for taking the pixel points of the input picture out of the DDR3 and storing them, according to the picture segmentation fixed value n of 2c), the number of convolution layers J, the number of channels T, the picture storage block ram resource S_pic of 4a) and the DDR3 number B of (3);
5b) setting a picture data distribution module for distributing the picture data stored in 5a), according to the picture segmentation fixed value n of 2c), the picture storage block ram resource S_pic of 4a) and the parameter storage block ram resource S_par of 4b);
5c) setting a convolution module for performing convolution calculation on the picture data distributed in 5b), according to the picture segmentation fixed value n of 2c);
5d) setting a pooling module for pooling the picture data after the convolution calculation of 5c), according to the pooling function of 1b);
5e) setting a picture write-back module for storing the picture data pooled in 5d) back into the DDR3, according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c);
5f) setting an instruction register group module which, according to the picture size N of 1a), the convolution kernel size m, the number of convolution layers J, the number of pooling layers C, the number of activation function layers E, the number of softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out and the fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), constructs control instructions and distributes them to the modules set in 5a), 5b), 5c), 5d) and 5e).
Compared with the prior art, the invention has the following advantages:
1. the invention realizes a convolutional neural network framework based on FPGA acceleration through a hardware description language;
2. through the picture segmentation fixed value n in the parameter processing and the pipeline structure of the convolution module, the invention ensures that the largest possible number of DSP resources is used and that the convolution calculation runs without interruption; this uninterrupted calculation maximizes DSP utilization and transmission efficiency and realizes the acceleration effect of the convolutional neural network framework;
3. according to the invention, through dividing the picture in the picture storage module, the DDR3 transmission bandwidth is ensured to be at the maximum value, and the maximum transmission efficiency is realized;
4. the invention changes the design parameters in the parameter processing through the instruction register group module, and can realize the convolutional neural network with different picture sizes N and different convolutional layer number J parameters.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram of simulation results of an embodiment of the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in detail below with reference to the accompanying drawings;
Step 1. Parameter processing.
1.1) reading the picture and the FPGA board resource parameters input by the user, wherein the FPGA parameters comprise: the picture size N, the total block ram resource S_sum, the number P of synchronous dynamic random access memories DDR3, and the number A of computing function chips DSP;
1.2) design parameters, including: the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers C, the number of activation function layers E, the number of multi-classification function softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1.3) the computer calculates, from the read parameter values, the size value set X of each layer of pictures, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth and the theoretical data transmission bandwidth:
1.3a) solving the per-layer picture size value set X by the following formula:
X = N/2^i + 2,  i = 0, 1, 2, ...
wherein N is the picture size of 1.1), and X and i are integers;
1.3b) solving the maximum parallelizable number L by the following formula,
wherein A is the DSP resource number of 1.1), and m is the convolution kernel size of 1.2);
1.3c) solving the theoretical operation speed bandwidth D by the following formula:
D = f × m² × 32 × L,
wherein f is the FPGA operation frequency of 1.2), m is the convolution kernel size of 1.2), and L is the maximum parallelizable number of 1.3b);
1.3d) solving the theoretical data transmission bandwidth Z by the following formula:
Z = 4 × (P − 1),
wherein P is the number of DDR3 of 1.1).
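For readers who want to check these quantities numerically, the following Python sketch reproduces the formulas of 1.3a)-1.3d). The formula for L is not reproduced in this text, so the expression used here (L = A // m², one m×m multiplier window per parallel lane, consistent with D = f × m² × 32 × L) is an assumption rather than the patented formula.

```python
# Sketch of step 1 (parameter processing). The expression for L below is an
# assumption (see the note above); X, D and Z follow the formulas of 1.3a)-1.3d).
def parameter_processing(N, P, A, f, m):
    X = []                              # per-layer picture size value set
    i = 0
    while N % (2 ** i) == 0:            # keep only integer results of N / 2**i
        X.append(N // (2 ** i) + 2)     # X = N / 2**i + 2
        i += 1
    L = A // (m * m)                    # ASSUMED: one m*m DSP window per parallel lane
    D = f * m * m * 32 * L              # theoretical operation speed bandwidth
    Z = 4 * (P - 1)                     # theoretical data transmission bandwidth
    return X, L, D, Z
```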
Step 2. Calculate the picture segmentation fixed value.
2.1) solving the common divisors M of the per-layer picture sizes by the following formula:
M = GCD(X)
wherein X is the per-layer picture size value set of 1.3a), and GCD() denotes the common divisors;
2.2) solving the picture common divisors C that satisfy the block ram resource limit by the following formula:
C = max(M)
wherein M is the common divisors of the per-layer picture sizes of 2.1), T is the number of channels of 1.2), m is the convolution kernel size of 1.2),
S_sum is the total block ram resource of 1.1), and max() takes the maximum value;
2.3) solving the picture segmentation fixed value n that satisfies the DSP resource limit by the following formula:
n = max(C), n < L,
wherein C is the picture common divisors of 2.2), and L is the maximum parallelizable number of 1.3b).
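A behavioural Python sketch of step 2 is given below. The block ram constraint of 2.2) is only partly reproduced above, so the test used here (double-buffered d×d tiles over T channels of 32-bit data must fit in S_sum) is an assumption; the final selection n = max(C) with n < L follows 2.3).

```python
from functools import reduce
from math import gcd

# Sketch of step 2 (picture segmentation fixed value). The block ram test is an
# assumption; the structure M -> C -> n follows 2.1)-2.3).
def segmentation_fixed_value(X, S_sum, T, L):
    g = reduce(gcd, X)                                   # greatest common divisor of all layer sizes
    M = [d for d in range(1, g + 1) if g % d == 0]       # set of common divisors
    C = [d for d in M if 2 * d * d * T * 32 <= S_sum]    # ASSUMED block ram constraint
    n = max(d for d in C if d < L)                       # 2.3): largest divisor under the DSP limit
    return M, C, n
```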
Step 3. Determine the number of DDR3.
3.1) solving the actual data transmission bandwidth H by the following formula:
H = n² × 32 × max(T),
wherein n is the picture segmentation fixed value of 2.3), and T is the number of channels of 1.2);
3.2) comparing the actual data transmission bandwidth H with the theoretical data transmission bandwidth Z and determining the number B of DDR3:
if H > Z, the number B of DDR3 is determined to be 2 or 1+2j, where j is an integer greater than or equal to 1;
if H ≤ Z, the number B of DDR3 is determined to be 3 or 1+4i, where i is an integer greater than or equal to 1 and i ≠ j; Z is the theoretical data transmission bandwidth of 1.3d).
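The choice of B can be summarised by the short sketch below; where the text admits several values (2 or 1+2j, 3 or 1+4i), the smallest admissible value is returned, which is a simplification.

```python
# Sketch of step 3 (number of DDR3). H follows 3.1); when several values of B are
# admissible, the smallest one is returned here for simplicity.
def ddr3_count(n, T_max, Z):
    H = n * n * 32 * T_max       # actual data transmission bandwidth, H = n^2 * 32 * max(T)
    B = 2 if H > Z else 3        # 3.2): 2 (or 1 + 2j) when H > Z, else 3 (or 1 + 4i)
    return H, B
```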
Step 4. Allocate block ram resources on the FPGA.
4.1) solving the picture storage block ram resource S_pic by the following formula:
S_pic = max(M) × max(T) × 32,
wherein M is the common divisors of the per-layer picture sizes of 2.1), and T is the number of channels of 1.2);
4.2) solving the remaining block ram storage resource S_last by the following formula:
S_last = S_sum − 2 × S_pic,
wherein S_pic is the picture storage block ram resource of 4.1), and S_sum is the total block ram resource of 1.1);
4.3) obtaining the storage parameter block ram resource S_ne:
4.3a) solving the intermediate variable U by the following formula,
wherein n is the picture segmentation fixed value of 2.3), X is the per-layer picture size value set of 1.3a), T is the number of channels of 1.2), and max() takes the maximum value;
4.3b) solving the storage parameter block ram resource S_ne by the following formula,
wherein U is the intermediate variable of 4.3a), S_sum is the total block ram resource of 1.1), S_pic is the picture storage block ram resource of 4.1), and S_last is the remaining block ram storage resource of 4.2).
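Since the formulas for U and S_ne are not reproduced above, the sketch below takes S_ne as a precomputed input and only models the reproduced parts of step 4: the computation of S_pic and S_last and the comparison that yields S_par.

```python
# Sketch of step 4 (block ram allocation). S_ne is passed in because its formula is
# not reproduced in the text; all sizes are in bits.
HALF_MBIT = 512 * 1024

def allocate_block_ram(M_max, T_max, S_sum, S_ne):
    S_pic = M_max * T_max * 32             # 4.1): picture storage block ram resource
    S_last = S_sum - 2 * S_pic             # 4.2): remaining block ram after two picture buffers
    S_par = S_ne if S_last >= S_ne else S_last - HALF_MBIT   # 4b) comparison
    return S_pic, S_last, S_par
```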
and 5, setting a picture storage module.
5.1) dividing the B DDR3 into two parts: B−1 DDR3 are used to store picture pixel points, and the remaining 1 DDR3 is used to store parameters, wherein B is the DDR3 number of 3.2);
5.2) from each of the B−1 DDR3, taking picture pixel points in matrices of length n and the corresponding width, T times in total, wherein the picture pixel point start address begins at 0, is increased by n−1 after the picture has been taken T times, and returns to 0 when the fetching is complete; T is the number of channels of 1.2) and n is the picture segmentation fixed value of 2.3);
5.3) storing the picture pixel points taken out of the DDR3 in block ram resources of size S_pic, with storage addresses increasing by 1 in sequence from 0, wherein S_pic is the picture storage block ram resource of 4.1);
5.4) repeating steps 5.2)–5.3) J times, wherein J is the number of convolution layers of 1.2).
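One possible reading of the fetch pattern of 5.2)-5.3) is sketched below: each of the B−1 pixel DDR3 chips is read T times (once per channel) at the same start address, after which the start address advances by n−1; the tile width is left out because its expression is not reproduced in this text.

```python
# Behavioural sketch of the picture storage module fetch pattern (step 5). This is
# one reading of the address rule in 5.2); it is not the RTL of the patent.
def fetch_addresses(n, T, tile_positions):
    addr = 0
    for _ in range(tile_positions):
        for _ in range(T):        # one fetch per channel at the same start address
            yield addr
        addr += n - 1             # advance the start address after the T channel fetches
    # the start address returns to 0 when the next convolution layer begins
```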
Step 6. Set the picture data distribution module.
6.1) constructing an m × (n+1) register group, wherein the first m × n registers serve as the calculation group and the last m × 1 registers serve as the cache group; n is the picture segmentation fixed value of 2.3) and m is the convolution kernel size of 1.2);
6.2) taking picture data of length m and width n from the picture storage block ram resources and storing it in the calculation group constructed in 6.1), wherein the picture data start address begins at 0 and increases by m after each fetch; m is the convolution kernel size of 1.2) and n is the picture segmentation fixed value of 2.3).
6.3) the calculation group outputs picture data of length and width n to the convolution module each time; simultaneously, picture data of length m and width 1 is taken from the picture storage block ram and stored into the cache group, the address starting at 0 and incrementing by 1 each time. After the calculation group has output m−1 times, the register data of the first row of the calculation group is discarded, the register data of the second row is assigned to the first row, the third row to the second row, and so on, each remaining row being assigned to the row above; m is the convolution kernel size of 1.2).
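The row-shift behaviour of the m×(n+1) register group in 6.3) can be modelled as below; this is a behavioural sketch of one reading of the text, not the register-transfer implementation, and the refill of the freed last row from the cache group is shown as an already-assembled row of new data, which is an assumption.

```python
# Behavioural sketch of the row shift in step 6.3): the first row is discarded and
# every remaining row is assigned to the row above it; the freed last row is then
# refilled with the newly cached data.
def shift_rows(calc_rows, new_last_row):
    return calc_rows[1:] + [new_last_row]

# Example: a 3 x n calculation group receiving a new row of cached data.
rows = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
rows = shift_rows(rows, [4, 4, 4])   # -> [[2, 2, 2], [3, 3, 3], [4, 4, 4]]
```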
Step 7. Set the convolution module.
Inputting the matrix picture data of length and width n from step 6 into n² DSPs, multiplying pairwise, and, using a pipeline structure, adding the adjacent products pairwise to complete the convolution calculation; the convolution calculation result is input into the pooling module, wherein n is the picture segmentation fixed value of 2.3);
the pipeline structure means that, while the system processes data, a new instruction for processing data is accepted on every clock pulse.
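A behavioural model of one convolution window is shown below. The text does not state explicitly what the pixel data is multiplied with, so the use of an equally sized weight tile taken from the parameter block ram is an assumption; the pairwise adder stages mimic the pipelined addition described in step 7.

```python
# Behavioural sketch of the convolution module (step 7): n*n products (one per DSP)
# followed by pairwise, stage-by-stage additions as in a pipelined adder tree.
def convolve_window(pixels, weights):
    stage = [p * w for p, w in zip(pixels, weights)]   # n*n multiplications
    while len(stage) > 1:
        if len(stage) % 2:                             # pad odd stages with a zero
            stage.append(0)
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]
```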
Step 8. Set the pooling module.
Acquiring the picture data input in step 7 and, in the input order, subtracting every 4 picture data pairwise in all combinations
to obtain 6 results, and judging whether the highest bit of each of the 6 results is 1:
if it is 1, the minuend is discarded;
if it is 0, the subtrahend is discarded; the 6 results are processed in sequence, the last remaining picture data is the maximum of the 4 picture data, and this maximum is passed to the module that stores pictures back to DDR3.
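The sign-bit comparison of step 8 amounts to finding the maximum of every group of four values, as in the behavioural sketch below: for each of the 6 pairwise differences, a negative result (highest bit 1) discards the minuend and a non-negative result discards the subtrahend, leaving exactly one survivor.

```python
# Behavioural sketch of the pooling module (step 8): maximum of a group of four
# values via the signs of the six pairwise differences.
def max_of_four(group):
    assert len(group) == 4
    alive = set(range(4))
    for a in range(4):
        for b in range(a + 1, 4):
            if group[a] - group[b] < 0:   # highest bit 1: minuend is smaller
                alive.discard(a)
            else:                         # highest bit 0: subtrahend is not larger
                alive.discard(b)
    survivor = alive.pop()                # exactly one index remains
    return group[survivor]
```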
Step 9. Set the module that stores pictures back to DDR3.
Storing the picture data of step 8 in the block ram resource S_pic, then taking picture data of length n and the corresponding width from the block ram resource S_pic and storing it back into the DDR3, wherein the picture data read address starts at 0 and increments by 1 each time, and the DDR3 write address starts at 0 and increments by 8 each time; n is the picture segmentation fixed value of 2.3) and B is the number of DDR3 of 3.2).
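The write-back addressing of step 9 is simple enough to show directly; the sketch below only reproduces the stated address steps (read addresses by 1, write addresses by 8) and says nothing about the data width behind each step.

```python
# Sketch of the write-back addressing of step 9: block ram read addresses advance
# by 1, DDR3 write addresses by 8, both starting from 0.
def writeback_addresses(word_count):
    for k in range(word_count):
        yield k, 8 * k    # (block ram read address, DDR3 write address)
```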
Step 10. Set the instruction register group module.
10.1) constructing a register group of length 128 and width J + C + G + Q + 1 to store the control instructions, wherein J is the number of convolution layers of 1.2), C is the number of pooling layers of 1.2), G is the number of softmax layers of 1.2), and Q is the fully connected layer output value of 1.2);
10.2) constructing a 128-bit binary control instruction, whose fields are, in order from top to bottom: the 10-bit input picture size N, the 8-bit picture segmentation size n, the 4-bit convolution kernel size m, the 6-bit number of convolution layers J, the 6-bit number of pooling layers C, the 4-bit number of activation function layers E, the 4-bit number of softmax layers G, the 16-bit softmax layer input number I_in, the 16-bit softmax layer output number I_out, and the 54-bit fully connected layer output value Q; wherein N is the input picture size of 1.1), n is the picture segmentation fixed value of 2.3), m is the convolution kernel size of 1.2), J is the number of convolution layers of 1.2), C is the number of pooling layers of 1.2), E is the number of activation function layers of 1.2), G is the number of softmax layers of 1.2), I_in is the softmax layer input number of 1.2), I_out is the softmax layer output number of 1.2), and Q is the fully connected layer output value of 1.2);
10.3) transmitting the control instructions to the modules set in steps 5–9 simultaneously through handshake signals.
The handshake signal means that, before two modules communicate, they must acknowledge each other's enable signals, after which they can transmit data to each other.
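The field widths listed in 10.2) add up to exactly 128 bits (10+8+4+6+6+4+4+16+16+54 = 128). The sketch below packs such a control word; placing the first-listed field in the most significant bits is an assumption, since the text only says "from top to bottom".

```python
# Sketch of the 128-bit control word of step 10.2). The first-listed field is placed
# in the most significant bits (an assumed packing order).
def pack_instruction(N, n, m, J, C, E, G, I_in, I_out, Q):
    fields = [(N, 10), (n, 8), (m, 4), (J, 6), (C, 6),
              (E, 4), (G, 4), (I_in, 16), (I_out, 16), (Q, 54)]
    word = 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field value does not fit its width"
        word = (word << width) | value
    return word    # fits in 128 bits
```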
The effects of the present invention can be further illustrated by the following simulations.
1. Simulation conditions
The simulation uses a Pango Microsystems (Unigroup Tongchuang, 紫光同创) FPGA platform, model PGT180H;
reading FPGA resource parameters and design parameters input by a user by a computer:
the FPGA resource parameters comprise: picture size N224, total block ram resource Ssum9.2M, the number P of the sdram DDR3 is 3 and the number a of the computing function chips DSP is 424.
The design parameters include: the FPGA operation frequency f is 150M, the convolution kernel size M is 3, the number of convolution layers J is 8, the number of channels T is 524, the number of pooling layers C is 8, the number of activation function layers E is 8, the number of multi-classification function softmax layers G is 2, and the softmax layer input number Iin5120, softmax layer output number Iout1024, the number of full connection layers Q is 100, the pooling function is a maximum pooling function, and the activation function is a linear correction relu activation function;
2. Simulation content
Simulation 1: as can be seen from FIG. 2, the output result values of the convolution module are 9, 36, 81 and 144, which agree with the results of the convolutional neural network algorithm when the same image and design parameters are input on a CPU, verifying that the method correctly implements the convolutional neural network structure.
Simulation 2: using ModelSim software and the above parameters, the method of the present invention performs simulation processing on an input picture of length and width N = 224 at a clock frequency of 150 MHz, giving the FPGA simulation time and resource utilization shown in Table 1.
As can be seen from Table 1 below, the simulation time is 0.4 s, the maximum clock frequency is 187 MHz, and the DDR3 bandwidth is 3.73 Gbit/s, so the transmission efficiency is maximized; 390 DSPs are used, for a resource utilization of 87%; 9.1 Mbit of block ram resources are used, for a resource utilization of 98.9%; and the final calculation speed is 6.2 Gbit/s, realizing the acceleration effect of the convolutional neural network framework.
TABLE 1
Claims (13)
1. The image processing method based on the FPGA accelerated convolutional neural network framework comprises the following steps:
(1) parameter processing:
1a) reading the picture and the FPGA board resource parameters input by a user, wherein the resource parameters comprise: the picture size N, the total block ram resource S_sum, the number P of synchronous dynamic random access memories DDR3, and the number A of computing function chips DSP;
1b) designing the FPGA operation frequency f, the convolution kernel size m, the number of convolution layers J, the number of channels T, the number of pooling layers D, the number of activation function layers E, the number of multi-classification function softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out, the number of fully connected layers Q, a pooling function and an activation function;
1c) calculating the size value set X of each layer of pictures, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth D and the theoretical data transmission bandwidth Z from the data read in 1a) and the parameters designed in 1b);
(2) fixed values for picture segmentation are calculated:
2a) calculating the common divisors M of the sizes of each layer of picture from the size value set X calculated in step (1);
2b) calculating, from the common divisors obtained in 2a) and the total block ram resource S_sum read in step (1), the picture common divisors C that satisfy the block ram resource limit of the FPGA;
2c) calculating, from the resource-limited common divisors obtained in 2b) and the DSP resources read in step (1), the greatest common divisor satisfying the DSP resource limit as the picture segmentation fixed value n;
(3) determining the number of DDR 3:
calculating an actual data transmission bandwidth H according to the picture segmentation fixed value n, and comparing the actual data transmission bandwidth H with a theoretical data transmission bandwidth Z:
if H > Z, the number B of DDR3 is determined to be 2 or 1+2j, where j is an integer ≥ 1;
if H ≤ Z, the number B of DDR3 is determined to be 3 or 1+4i, where i is an integer ≥ 1 and i ≠ j;
(4) Resource allocation is carried out on the block ram on the FPGA:
4a) calculating the picture storage block ram resource S_pic from the picture segmentation fixed value n determined in step (2) and the channel number T in step (1);
4b) calculating, from the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), the remaining block ram storage resource S_last and the largest storage parameter block ram resource S_ne, and comparing the two: if S_last ≥ S_ne, S_ne is used as the parameter storage block ram resource S_par; if S_last < S_ne, S_last minus 0.5 Mbit is used as the parameter storage block ram resource S_par;
(5) Constructing a convolutional neural network framework and processing the input picture by combining the parameters of 1a), 1b), 2c), (3), 4a) and 4b):
5a) setting a picture storage module for taking the pixel points of the input picture out of the DDR3 and storing them, according to the picture segmentation fixed value n of 2c), the number of convolution layers J, the number of channels T, the picture storage block ram resource S_pic of 4a) and the DDR3 number B of (3);
5b) setting a picture data distribution module for distributing the picture data stored in 5a), according to the picture segmentation fixed value n of 2c), the picture storage block ram resource S_pic of 4a) and the parameter storage block ram resource S_par of 4b);
5c) setting a convolution module for performing convolution calculation on the picture data distributed in 5b), according to the picture segmentation fixed value n of 2c);
5d) setting a pooling module for pooling the picture data after the convolution calculation of 5c), according to the pooling function of 1b);
5e) setting a picture write-back module for storing the picture data pooled in 5d) back into the DDR3, according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c);
5f) setting an instruction register group module which, according to the picture size N of 1a), the convolution kernel size m, the number of convolution layers J, the number of pooling layers D, the number of activation function layers E, the number of softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out and the fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), constructs control instructions and distributes them to the modules set in 5a), 5b), 5c), 5d) and 5e).
2. The method according to claim 1, wherein in step 1c) the size value set X of each layer of pictures, the maximum convolution parallelizable number L, the theoretical operation speed bandwidth D and the theoretical data transmission bandwidth Z are calculated, from the data read in 1a) and the parameters designed in 1b), according to the following formulas:
X = N/2^i + 2,  i = 0, 1, 2, ...
D = f × m² × 32 × L
Z = 4 × (P − 1)
wherein N is the picture size, L is the maximum parallelizable number, A is the DSP resource number, m is the convolution kernel size, f is the FPGA operation frequency, P is the number of DDR3, and X and i are integers.
3. The method according to claim 1, wherein step 2a) calculates a common divisor M of the picture size of each layer according to the set of picture size values of each layer calculated in (1);
M=GCD(X)
where X is a set of per-layer picture size values and GCD () represents a common divisor.
4. The method according to claim 1, wherein the picture common divisors C satisfying the block ram resource limit of the FPGA are calculated from the picture common divisors M obtained in step 2a) and the total block ram resource S_sum read in (1):
C = max(M)
wherein M is the common divisors of the per-layer picture sizes, T is the number of channels, m is the convolution kernel size, S_sum is the block ram size in the FPGA, and max() takes the maximum value.
5. The method according to claim 1, wherein step 2c) calculates, from the common divisors obtained in 2b) and the DSP resources read in (1), the greatest common divisor satisfying the DSP resource limit as the picture segmentation fixed value n:
n = max(M) < L
wherein M is the picture common divisors limited by the block ram resources, L is the maximum parallelizable number, and max() takes the maximum value.
6. The method of claim 1, wherein step 4a) calculates the picture storage block ram resource S_pic from the picture segmentation fixed value n determined in (2) and the channel number T in (1):
S_pic = max(M) × max(T) × 32
wherein M is the picture common divisors limited by the block ram resources, T is the number of input channels, and max() takes the maximum value.
7. The method according to claim 1, wherein step 4b) calculates, from the picture storage block ram resource S_pic of 4a) and the total block ram resource S_sum of (1), the remaining block ram storage resource S_last and the storage parameter block ram resource S_ne:
S_last = S_sum − 2 × S_pic
8. The method according to claim 1, wherein in step 5a) the pixel points of the input picture are taken out of the DDR3 and stored according to the picture segmentation fixed value n of 2c), the number of convolution layers J, the number of channels T, the picture storage block ram resource S_pic of 4a), and the DDR3 number B of (3), as follows:
5a1) dividing the B DDR3 into two parts: B−1 DDR3 are used to store picture pixel points, and the remaining 1 DDR3 is used to store parameters;
5a2) from each of the B−1 DDR3, taking picture pixel points in matrices of length n and the corresponding width, T times in total, wherein the picture pixel point start address begins at 0, is increased by n−1 after the picture has been taken T times, and returns to 0 when the fetching is complete;
5a3) storing the picture pixel points taken out of the DDR3 in block ram resources of size S_pic, with storage addresses increasing by 1 in sequence from 0;
5a4) repeating steps 5a2)–5a3) J times.
9. The method of claim 1, wherein in step 5b) the picture data stored in 5a) is distributed according to the convolution kernel size m, the picture storage block ram resource S_pic of 4a), and the picture segmentation fixed value n of 2c), as follows:
5b1) constructing an m × (n+1) register group, wherein the first m × n registers serve as the calculation group and the last m × 1 registers serve as the cache group;
5b2) taking picture data of length m and width n from the picture storage block ram resources and storing it in the calculation group constructed in 5b1), wherein the picture data start address begins at 0 and increases by m after each fetch;
5b3) the calculation group outputs picture data of length and width n to the convolution module each time; simultaneously, picture data of length m and width 1 is taken from the picture storage block ram and stored into the cache group, the address starting at 0 and incrementing by 1 each time; after the calculation group has output m−1 times, the register data of the first row of the calculation group is discarded, the register data of the second row is assigned to the first row, the third row to the second row, and so on, each remaining row being assigned to the row above.
10. The method according to claim 1, wherein in step 5c) the convolution calculation is performed on the picture data distributed in 5b) according to the picture segmentation fixed value n of 2c): the matrix picture data of length and width n input in step 5b) is input into n² DSPs, multiplied pairwise, and, using a pipeline structure, the products are added pairwise to complete the convolution calculation;
the pipeline structure means that, while the system processes data, a new instruction for processing data is accepted on every clock pulse.
11. The method according to claim 1, wherein step 5d) pools the picture data after the convolution calculation of 5c) according to the pooling function of 1b), as follows:
5d1) acquiring the picture data input from 5c), and subtracting every 4 picture data pairwise in all combinations to obtain 6 results;
5d2) judging whether the highest bit of each of the 6 results of 5d1) is 1:
if it is 1, the minuend is discarded; if it is 0, the subtrahend is discarded; the 6 results are processed in sequence, and the last remaining picture data is the maximum of the 4 picture data.
12. The method as claimed in claim 1, wherein in step 5e) the picture data pooled in 5d) is stored back into the DDR3 according to the DDR3 number B of (3) and the picture segmentation fixed value n of 2c): the picture data pooled in step 5d) is stored in the block ram resource S_pic, and picture data of length n and the corresponding width is taken from the block ram resource S_pic and stored back into the DDR3, wherein the picture data read address starts at 0 and increments by 1 each time, and the DDR3 write address starts at 0 and increments by 8 each time.
13. The method according to claim 1, wherein in step 5f) the control instructions are constructed and distributed to the modules set in 5a), 5b), 5c), 5d) and 5e) according to the picture size N of 1a), the convolution kernel size m, the number of convolution layers J, the number of pooling layers D, the number of activation function layers E, the number of softmax layers G, the softmax layer input number I_in, the softmax layer output number I_out and the fully connected layer output value Q of 1b), and the picture segmentation size n of 2c), as follows:
5f1) constructing a register group of length 128 and width J + C + G + Q + 1 to store the instructions;
5f2) the instruction fields are, in order from top to bottom: the 10-bit input picture size N, the 8-bit picture segmentation size n, the 4-bit convolution kernel size m, the 6-bit number of convolution layers J, the 6-bit number of pooling layers D, the 4-bit number of activation function layers E, the 4-bit number of softmax layers G, the 16-bit softmax layer input number I_in, the 16-bit softmax layer output number I_out, and the 54-bit fully connected layer output value Q;
5f3) the instructions are transmitted simultaneously to the modules set in 5a), 5b), 5c), 5d) and 5e) through handshake signals;
the handshake signal means that, before two modules communicate, they must acknowledge each other's enable signals, after which they can transmit data to each other.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810022870.7A CN108154229B (en) | 2018-01-10 | 2018-01-10 | Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810022870.7A CN108154229B (en) | 2018-01-10 | 2018-01-10 | Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108154229A CN108154229A (en) | 2018-06-12 |
CN108154229B true CN108154229B (en) | 2022-04-08 |
Family
ID=62461260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810022870.7A Active CN108154229B (en) | 2018-01-10 | 2018-01-10 | Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108154229B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109086867B (en) * | 2018-07-02 | 2021-06-08 | 武汉魅瞳科技有限公司 | Convolutional neural network acceleration system based on FPGA |
CN109214506B (en) * | 2018-09-13 | 2022-04-15 | 深思考人工智能机器人科技(北京)有限公司 | Convolutional neural network establishing device and method based on pixels |
CN111667046A (en) * | 2019-03-08 | 2020-09-15 | 富泰华工业(深圳)有限公司 | Deep learning acceleration method and user terminal |
CN109978161B (en) * | 2019-03-08 | 2022-03-04 | 吉林大学 | Universal convolution-pooling synchronous processing convolution kernel system |
CN110175670B (en) * | 2019-04-09 | 2020-12-08 | 华中科技大学 | Method and system for realizing YOLOv2 detection network based on FPGA |
CN110413539B (en) * | 2019-06-19 | 2021-09-14 | 深圳云天励飞技术有限公司 | Data processing method and device |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image characteristic extracting method, device, equipment and computer readable storage medium |
CN110516800B (en) * | 2019-07-08 | 2022-03-04 | 山东师范大学 | Deep learning network application distributed self-assembly instruction processor core, processor, circuit and processing method |
CN110390392B (en) * | 2019-08-01 | 2021-02-19 | 上海安路信息科技有限公司 | Convolution parameter accelerating device based on FPGA and data reading and writing method |
CN114365148A (en) * | 2019-10-22 | 2022-04-15 | 深圳鲲云信息科技有限公司 | Neural network operation system and method |
CN113919982A (en) * | 2021-10-09 | 2022-01-11 | 中国联合网络通信有限公司重庆市分公司 | Language class course intelligent auxiliary learning system based on voice recognition technology |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10572824B2 (en) * | 2003-05-23 | 2020-02-25 | Ip Reservoir, Llc | System and method for low latency multi-functional pipeline with correlation logic and selectively activated/deactivated pipelined data processing engines |
JP6480644B1 (en) * | 2016-03-23 | 2019-03-13 | グーグル エルエルシー | Adaptive audio enhancement for multi-channel speech recognition |
- 2018-01-10: application CN201810022870.7A filed in China (granted as CN108154229B, status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102118289A (en) * | 2010-12-02 | 2011-07-06 | 西北工业大学 | Real-time image segmentation processing system and high-speed intelligent unified bus interface method based on Institute of Electrical and Electronic Engineers (IEEE) 1394 interface |
CN102420931A (en) * | 2011-07-26 | 2012-04-18 | 西安费斯达自动化工程有限公司 | Full-frame-rate image processing method based on FPGA (Field Programmable Gate Array) |
CN106228240A (en) * | 2016-07-30 | 2016-12-14 | 复旦大学 | Degree of depth convolutional neural networks implementation method based on FPGA |
CN106355244A (en) * | 2016-08-30 | 2017-01-25 | 深圳市诺比邻科技有限公司 | CNN (convolutional neural network) construction method and system |
CN106611216A (en) * | 2016-12-29 | 2017-05-03 | 北京旷视科技有限公司 | Computing method and device based on neural network |
CN107103113A (en) * | 2017-03-23 | 2017-08-29 | 中国科学院计算技术研究所 | Towards the Automation Design method, device and the optimization method of neural network processor |
Non-Patent Citations (3)
Title |
---|
Musical notes classification with neuromorphic auditory system using FPGA and a convolutional spiking network;E. Cerezuela-Escudero等;《2015 International Joint Conference on Neural Networks (IJCNN)》;20151001;1-7 * |
Research on the Parallel Structure of FPGA-Based Convolutional Neural Networks; Lu Zhijian (陆志坚); China Doctoral Dissertations Full-text Database, Information Science and Technology; 20140415 (No. 4); I140-12 *
Research on Deep Learning and Its Application in Medical Image Analysis; Wang Yuanyuan (王媛媛) et al.; Video Engineering (电视技术); 20161017; Vol. 40 (No. 10); 118-126 *
Also Published As
Publication number | Publication date |
---|---|
CN108154229A (en) | 2018-06-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108154229B (en) | Image processing method based on FPGA (field programmable Gate array) accelerated convolutional neural network framework | |
CN112214726B (en) | Operation accelerator | |
CN108805266B (en) | Reconfigurable CNN high-concurrency convolution accelerator | |
CN109409511B (en) | Convolution operation data flow scheduling method for dynamic reconfigurable array | |
CN108564168B (en) | Design method for neural network processor supporting multi-precision convolution | |
CN111667051A (en) | Neural network accelerator suitable for edge equipment and neural network acceleration calculation method | |
CN108229671B (en) | System and method for reducing storage bandwidth requirement of external data of accelerator | |
CN109063825A (en) | Convolutional neural networks accelerator | |
US20220083857A1 (en) | Convolutional neural network operation method and device | |
CN112668708B (en) | Convolution operation device for improving data utilization rate | |
CN113033794B (en) | Light weight neural network hardware accelerator based on deep separable convolution | |
CN109146065B (en) | Convolution operation method and device for two-dimensional data | |
WO2024193337A1 (en) | Convolutional neural network acceleration method and system, storage medium, apparatus, and device | |
CN112836813A (en) | Reconfigurable pulsation array system for mixed precision neural network calculation | |
CN115238863A (en) | Hardware acceleration method, system and application of convolutional neural network convolutional layer | |
CN109948787B (en) | Arithmetic device, chip and method for neural network convolution layer | |
CN111008691A (en) | Convolutional neural network accelerator architecture with weight and activation value both binarized | |
CN107783935B (en) | Approximate calculation reconfigurable array based on dynamic precision configurable operation | |
CN116888591A (en) | Matrix multiplier, matrix calculation method and related equipment | |
CN111667052A (en) | Standard and nonstandard volume consistency transformation method for special neural network accelerator | |
CN116611488A (en) | Vector processing unit, neural network processor and depth camera | |
CN113031912A (en) | Multiplier, data processing method, device and chip | |
CN116167425A (en) | Neural network acceleration method, device, equipment and medium | |
CN114372012B (en) | Universal and configurable high-energy-efficiency pooling calculation single-row output system and method | |
CN116957018A (en) | Method for realizing channel-by-channel convolution |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |