CN115081602A - Computing device, integrated circuit device and board card for executing Winograd convolution
- Publication number: CN115081602A (application CN202110266344.7A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06F17/15 — Correlation function computation including computation of convolution operations
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a computing device, an integrated circuit device and a board card for executing Winograd convolution. The computing device comprises a forward transform unit and an on-chip cache. The forward transform unit forward-transforms neuron data in units of a first data unit to generate forward-transformed data, the first data unit having a scale of [(r+1) × (s+1) l], where r is the height of the Winograd weight, s is the width of the Winograd weight, and l is the bandwidth of the on-chip cache. The invention has the technical effects of ensuring network precision, accelerating performance, reducing area and reducing power consumption.
Description
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to computing devices, integrated circuit devices, and boards that perform Winograd convolution.
Background
With the rapid development of the information age, research in artificial intelligence and machine learning has made remarkable progress, and the related industries have developed vigorously. Convolutional neural networks are widely used in computer vision, autonomous driving, machine translation, speech recognition, smart homes and other fields.
Convolutional neural networks involve large numbers of parameters and operations, so the execution performance of a convolutional neural network model is severely limited under the restricted area and computing power of portable mobile terminals; meanwhile, a processor not specially designed for convolution incurs huge power consumption when performing convolution operations.
Winograd convolution is a convolution acceleration technique based on a polynomial interpolation algorithm. The two inputs of the convolution operation, the neurons and the weights, are partitioned at a certain scale and each subjected to a linear transformation, the Winograd forward transform; the transformed neurons and weights are then multiplied bit-wise (element-wise), the bit-wise multiplication result is subjected to another linear transformation, the Winograd inverse transform, and finally a convolution result equivalent to the original convolution operation is obtained.
In the Winograd convolution operation, the forward and inverse transformation matrices of the neurons and weights consist only of simple fixed values, so the Winograd forward and inverse transforms of the neurons and weights can be realized with additions alone. The multiplications required by the Winograd algorithm occur only in the bit-wise multiplication step, whose multiplication complexity is considerably lower than that of the original convolution algorithm. Because the hardware cost (timing, power consumption and area) of a multiplication is much higher than that of an addition of the same bit width, replacing the original convolution with Winograd convolution brings obvious gains in hardware energy efficiency and operation time.
However, no existing hardware is designed for the Winograd convolution acceleration algorithm, so conventional artificial intelligence chips cannot fully exhibit the advantages of the Winograd convolution operation. A hardware device capable of efficiently running the Winograd convolution algorithm is therefore urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, the present invention provides a computing device, an integrated circuit device and a board card for performing Winograd convolution.
In one aspect, the present invention discloses a computing device for performing Winograd convolution, connected to an off-chip memory that stores neuron data and Winograd weights. The computing device includes a forward transform unit and an on-chip cache. The forward transform unit forward-transforms the neuron data in units of first data units to generate forward-transformed data, the first data unit having a scale of [(r+1) × (s+1) l], where r is the height of the Winograd weight, s is the width of the Winograd weight, and l is the bandwidth of the on-chip cache.
In another aspect, the present invention discloses an integrated circuit device including the aforementioned computing device, and a board including the integrated circuit device according to the aforementioned description.
The hardware structure provided by the invention can be matched with a Winograd convolution acceleration algorithm, and has the technical effects of ensuring network precision, accelerating performance, reducing area and reducing power consumption.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a schematic diagram illustrating a convolution kernel performing a convolution operation with an input neuron image;
FIG. 2 is a diagram showing the conversion of a raw convolution of F (2 × 2,3 × 3) to a Winograd convolution;
FIG. 3 is a visualization diagram illustrating a multiply-by-bit operation;
FIG. 4 is a diagram illustrating the homogeneous operation of forward-transformed data with weights;
fig. 5 is a structural diagram showing a board card of the embodiment of the present invention;
FIG. 6 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 7 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 8 is a diagram showing an overlapping portion when a transform is being performed;
FIG. 9 is a schematic diagram showing a neuron cache of an embodiment of the present invention;
FIG. 10 is a schematic diagram showing a forward transform unit of an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating a forward transform data cache of an embodiment of the present invention;
FIG. 12 is a diagram illustrating weight caching according to an embodiment of the invention;
FIG. 13 is a schematic diagram showing the forward transform data buffer output side of an embodiment of the present invention;
FIG. 14 is a diagram illustrating a weight buffer output side according to an embodiment of the present invention;
fig. 15 is a schematic diagram showing an inverse transform unit of an embodiment of the present invention;
FIG. 16 is a diagram illustrating the connectivity of result caches in an embodiment of the invention;
FIG. 17 is a schematic diagram showing the sequence of WNram when inputting and outputting data according to an embodiment of the present invention; and
fig. 18 is a schematic diagram showing the output scale of Wram of the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the description and claims of the present invention, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The Winograd convolution acceleration algorithm (hereinafter the Winograd algorithm or Winograd convolution) applies linear transformations to the operands of a convolution operation so as to minimize the number of multiplications required, replacing the eliminated multiplications with additional additions. In hardware, a multiplier is more complex than an adder, with larger area and power consumption and poorer overall processing performance, so a Winograd algorithm that in effect replaces multiplication with addition has great advantages when processing convolution operations.
For a two-dimensional convolution, assume the input neuron image has size H × W (H is its height and W its width) and the weight has size r × s (r is its height and s its width); the convolution can then be expressed as F(m × n, r × s), where m × n is the size of the output neuron image, m its height and n its width. To reduce hardware complexity, improve generality and achieve a good acceleration effect, the embodiment of the invention uses convolution kernels (i.e., weights) no larger than 3 × 3 as base convolution units, which are combined to perform Winograd convolution at any scale with a convolution stride of 1. In the embodiment of the present invention, an arbitrary F(m × n, r × s) is decomposed into combinations of five kinds of base convolutions with operation scales of 3 × 3, 3 × 2 (or 2 × 3), 3 × 1, 2 × 2 and 2 × 1. More specifically, an arbitrary F(m × n, r × s) is decomposed into a combination of the base convolutions F(2 × 2,3 × 3), F(2 × 2,3 × 2), F(2 × 2,2 × 3), F(2 × 2,3 × 1), F(2 × 2,2 × 2) and F(2 × 2,2 × 1). Note that since a 1 × 1 convolution cannot be accelerated by Winograd convolution, the 1 × 1 scale is not among the base convolution units of this embodiment.
Taking F(2 × 2,5 × 5) with an input neuron image of size 6 × 6 and a stride of 1 as an example, before the computing apparatus of the embodiment performs the Winograd convolution acceleration, the 6 × 6 input neuron image and the 5 × 5 convolution kernel need to be linearly split based on the base convolution units; the splitting process is shown in fig. 1.
Fig. 1 shows the convolution of a 5 × 5 convolution kernel 101 with a 6 × 6 input neuron image 102 to obtain a 2 × 2 convolution result 103. The convolution kernel 101 needs to be split into pieces of size 3 × 3, 3 × 2 (or 2 × 3), 3 × 1, 2 × 2 and 2 × 1; this embodiment preferentially selects 3 × 3, then 3 × 2 (or 2 × 3), then 3 × 1, then 2 × 2, and finally 2 × 1. According to this rule, the convolution kernel 101 is split into 4 base convolution kernels: the 3 × 3 first base convolution kernel 104, the 3 × 2 second base convolution kernel 105, the 2 × 3 third base convolution kernel 106 and the 2 × 2 fourth base convolution kernel 107, i.e., F(2 × 2,5 × 5) is decomposed into one F(2 × 2,3 × 3), one F(2 × 2,3 × 2), one F(2 × 2,2 × 3) and one F(2 × 2,2 × 2). The input neuron image 102 is correspondingly split into 4 pieces of sub-neuron data: the 4 × 4 first sub-neuron data 108, the 4 × 3 second sub-neuron data 109, the 3 × 4 third sub-neuron data 110 and the 3 × 3 fourth sub-neuron data 111.
Then Winograd convolution operation is carried out, namely: the first base convolution kernel 104 convolves with the first sub-neuron data 108 to generate a first sub-convolution result 112; convolving the second basis convolution kernel 105 with the second sub-neuron data 109 to generate a second sub-convolution result 113; convolving the third base convolution kernel 106 with the third sub-neuron data 110 to generate a third sub-convolution result 114; the fourth base convolution kernel 107 is convolved with the fourth sub-neuron data 111 to generate a fourth sub-convolution result 115.
Finally, the first sub-convolution result 112, the second sub-convolution result 113, the third sub-convolution result 114 and the fourth sub-convolution result 115 are added to obtain a convolution result 116, and the convolution result 116 is the same as the convolution result 103. The above is an example of using the Winograd convolution algorithm to implement the original convolution operation.
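For clarity, the following NumPy sketch (not part of the patent; the function names and the CNN-style cross-correlation convention are assumptions for illustration) checks the Fig. 1 decomposition numerically: the four base convolutions of the split kernel with the corresponding sub-neuron regions sum to the original F(2 × 2,5 × 5) result.

```python
# Minimal sketch: the F(2x2,5x5) split of Fig. 1, verified numerically.
import numpy as np

def conv2d_valid(d, g):
    """Stride-1 'valid' cross-correlation of input d with kernel g."""
    kh, kw = g.shape
    oh, ow = d.shape[0] - kh + 1, d.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(d[i:i + kh, j:j + kw] * g)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((6, 6))   # input neuron image 102
g = rng.standard_normal((5, 5))   # convolution kernel 101

# Four base kernels 104-107 with their matching sub-neuron regions 108-111.
partial = (
    conv2d_valid(d[0:4, 0:4], g[0:3, 0:3]) +   # F(2x2,3x3)
    conv2d_valid(d[0:4, 3:6], g[0:3, 3:5]) +   # F(2x2,3x2)
    conv2d_valid(d[3:6, 0:4], g[3:5, 0:3]) +   # F(2x2,2x3)
    conv2d_valid(d[3:6, 3:6], g[3:5, 3:5])     # F(2x2,2x2)
)
assert np.allclose(partial, conv2d_valid(d, g))  # equals convolution result 103
```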
Further, the Winograd algorithm can be expressed by the following equation:

Y = A^T [(G g G^T) ⊙ (B^T d B)] A

where Y denotes the output matrix of the convolution operation, A^T is the inverse-transform left-multiplication constant matrix, G is the weight-transform left-multiplication constant matrix, g is the weight of the original convolution, G^T is the weight-transform right-multiplication constant matrix, ⊙ denotes bit-wise (element-wise) multiplication, B^T is the neuron-transform left-multiplication constant matrix, d is the neuron data, B is the neuron-transform right-multiplication constant matrix, and A is the inverse-transform right-multiplication constant matrix. The left-multiplication and right-multiplication matrices of each transform are transposes of each other.
Taking F (2 × 2,3 × 3) as an example, the constant matrices are as follows:
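The constant matrices themselves are not legible in this text; for reference, the F(2 × 2,3 × 3) constant matrices commonly used in the Winograd convolution literature (an assumption here, not quoted from the original) are:

```latex
B^{T}=\begin{bmatrix}1&0&-1&0\\0&1&1&0\\0&-1&1&0\\0&1&0&-1\end{bmatrix},\qquad
G=\begin{bmatrix}1&0&0\\\tfrac{1}{2}&\tfrac{1}{2}&\tfrac{1}{2}\\\tfrac{1}{2}&-\tfrac{1}{2}&\tfrac{1}{2}\\0&0&1\end{bmatrix},\qquad
A^{T}=\begin{bmatrix}1&1&1&0\\0&1&-1&-1\end{bmatrix}
```

Note that B^T and A^T contain only 0, 1 and -1, consistent with the statement that the neuron forward transform and the inverse transform reduce to additions; G contains ±1/2, but G g G^T is computed offline as the Winograd weight.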
fig. 2 shows a schematic diagram of the conversion of the original convolution of F (2 × 2,3 × 3) into a Winograd convolution. As shown, neuron data 201 is convolved with convolution kernel 202. During calculation, the neuron data 201 is arranged in a row according to elements in the sliding window 203, the sliding window 203 slides for 4 times to form a4 × 9 matrix 204, then the elements of the convolution kernel 202 are arranged in a column to form a9 × 1 matrix 205, and the 4 × 9 matrix 204 and the 9 × 1 matrix 205 are subjected to convolution operation to obtain a4 × 1 convolution result 206.
Further, by partitioning along the dotted lines in the figure, the 4 × 9 matrix 204 is converted into a 2 × 3 block matrix 207, the 9 × 1 matrix 205 is converted into a 3 × 1 block matrix 208, and the 4 × 1 convolution result 206 is converted into a 2 × 1 convolution result 209. After the linear transformation, the first element of the 2 × 1 convolution result 209 is R0 = M0 + M1 + M2 and the second element is R1 = M1 - M2 - M3, where M0, M1, M2 and M3 are themselves given by corresponding sub-formulas.
Through this partitioning and linear transformation, the original convolution operation involves 36 multiplications, while the Winograd algorithm needs to execute only 16 multiplications, so the multiplication complexity is reduced by a factor of 2.25.
As can be seen from this conversion of the two-dimensional convolution, the Winograd algorithm mainly comprises the following steps. First, the weights are left- and right-multiplied by the weight constant matrices, i.e., G g G^T is computed, yielding the Winograd linearly-transformed weight, namely the Winograd weight. Next, the neuron data is subjected to the forward transform operation, i.e., left- and right-multiplication by the neuron constant matrices, B^T d B, yielding the forward-transformed data after the Winograd linear transform. Then, the forward-transformed data and the Winograd weight matrix are multiplied bit-wise, (G g G^T) ⊙ (B^T d B), yielding the bit-multiplied data. Finally, the bit-multiplied data is subjected to the inverse transform operation, i.e., left- and right-multiplication by the Winograd inverse-transform constant matrices, A^T L A, where L = (G g G^T) ⊙ (B^T d B), finally obtaining a convolution result equivalent to the original convolution.
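A minimal NumPy sketch of these four steps, assuming the standard F(2 × 2,3 × 3) constant matrices quoted above (the patent's own matrices are not legible in this text):

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float32)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3(d, g):
    """Winograd convolution of a 4x4 input tile d with a 3x3 kernel g."""
    U = G @ g @ G.T        # Winograd weight GgG^T (precomputed offline in the patent)
    V = B_T @ d @ B_T.T    # forward-transformed neuron tile B^T d B
    M = U * V              # bit-wise multiplication (16 multiplies)
    return A_T @ M @ A_T.T # inverse transform A^T M A -> 2x2 output tile

def direct_f2x2_3x3(d, g):
    """Reference: direct 2x2 'valid' cross-correlation."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4)).astype(np.float32)
g = rng.standard_normal((3, 3)).astype(np.float32)
assert np.allclose(winograd_f2x2_3x3(d, g), direct_f2x2_3x3(d, g), atol=1e-5)
```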
From the perspective of hardware design, the embodiment of the present invention pipelines these three major transformation steps according to the dependencies and distinct operational characteristics of the three processes, so as to achieve more efficient acceleration. The designs of the forward transform operation, the bit-wise multiplication operation and the inverse transform operation are described separately below.
The embodiment of the present invention uses a forward transform unit to implement the forward transform operation, i.e., to compute B^T d B. According to the rules of Winograd convolution, the forward-transform left-multiplication matrix B^T has size (m+r-1) × (m+r-1) and the right-multiplication matrix B has size (n+s-1) × (n+s-1). Since the elements of B^T and B consist only of 0, 1 and -1, the matrix multiplications of the forward transform can be decomposed into fixed-pattern additions, and the computing device of the embodiment configures a specific number of floating-point adders to complete the linear additions required by the whole matrix multiplication. Since the embodiment converts any original convolution into base convolutions, the scale of the forward transform unit is determined by the five base convolution scales listed above; the following analysis therefore considers FP32 data for the five base convolutions, taking a 2 × 2 convolution result as an example (i.e., m = n = 2).
For the F(2 × 2,3 × 3) base convolution, the forward-transform computing power requirement of the forward transform unit corresponds directly to the number of additions, namely 4 × (n+s-1) + 4 × (m+r-1) = 32 flops (floating-point operations), and the input and output quantities of the forward transform unit are: both input data and output data are (r+1)(s+1) × 32 = 16 × 32 bits, the factor of 32 bits arising because each FP32 datum occupies 32 bits. The hardware utilization of the forward transform unit is optimal when its input/output time equals its operation time, so the ratio of the input/output bandwidth of the forward transform unit to the addition computing power is preferably 16 : 32 = 1 : 2. In other words, when the cache bandwidth (or vectorization length) is l, the input bandwidth and output bandwidth of the forward transform unit are l × 32 bits and the computing power of its adder group is 2 × l flops. Each operation generates 16 final results and 8 intermediate results, so the minimum number of registers of the register file is l × 32 × (16+8).
For the F(2 × 2,3 × 2) (or 2 × 3) base convolution, the forward-transform computing power requirement of the forward transform unit is 4 × (n+s-1) + 2 × (m+r-1) = 20 flops, and the input and output quantities are: both input data and output data are (r+1)(s+1) × 32 = 12 × 32 bits. To increase the hardware utilization of the forward transform unit, the ratio of its input/output bandwidth to the addition computing power is preferably 12 : 20 = 3 : 5, i.e., the input and output bandwidths are l × 32 bits and the computing power of the adder group is (5/3) × l flops. Each calculation yields 12 final results and 6 intermediate results; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × (12+6).
For the F(2 × 2,2 × 2) base convolution, the forward-transform computing power requirement of the forward transform unit is 2 × (n+s-1) + 2 × (m+r-1) = 12 flops, and both the input data and the output data are (r+1)(s+1) × 32 = 9 × 32 bits, so the ratio of input/output bandwidth to addition computing power is preferably 9 : 12 = 3 : 4, i.e., the input and output bandwidths are l × 32 bits and the computing power of the adder group is (4/3) × l flops. Each calculation yields 9 final results and 6 intermediate results; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × (9+6).
For the F(2 × 2,3 × 1) base convolution, the forward-transform computing power requirement of the forward transform unit is 4 flops, and both the input data and the output data are (r+1) × 32 = 4 × 32 bits. The ratio of input/output bandwidth to addition computing power is therefore preferably 4 : 4 = 1 : 1, i.e., the input and output bandwidths are l × 32 bits and the computing power of the adder group is l flops. Each calculation yields 4 final results and 2 intermediate results, and with maximum pipelined use of the register file the minimum number of registers of the register file is l × 32 × (4+2).
For the F(2 × 2,2 × 1) base convolution, the forward-transform computing power requirement of the forward transform unit is 2 flops, and both the input data and the output data are (r+1) × 32 = 3 × 32 bits, so the ratio of input/output bandwidth to addition computing power is preferably 3 : 2, i.e., the input and output bandwidths are l × 32 bits and the computing power of the adder group is (2/3) × l flops. Each calculation yields 3 final results and 1 intermediate result, and with maximum pipelined use of the register file the minimum number of registers of the register file is l × 32 × (3+1).
In order to support all five of the aforementioned base convolutions simultaneously, the embodiment of the present invention selects equal input and output bandwidths for the forward transform unit and an addition computing power of twice the input/output bandwidth, i.e., the input and output bandwidths are both l × 32 bits, the computing power of the adder group is 2 × l flops, and the number of registers of the register file is l × 32 × (16+8).
Next, for the bit-wise multiply-accumulate operator, the embodiment, based on overall considerations of hardware design, scheduling strategy and execution performance, merges the bit-wise multiply-accumulate operation with the accumulation along the feature-map (channel) direction of the convolution neuron data and uses the same multiply-accumulate operator. This not only effectively reduces the overall complexity and resource consumption of the hardware design, but also reduces the access volume of the on-chip cache, saving power consumption and area while improving performance.
Assume the parameters of the convolutional layer are: input batch number N, number of input neuron channels Ci, input neuron data height Hi, input neuron data width Wi, number of output neuron channels Co, output neuron data height Ho, output neuron data width Wo, convolution kernel size r × s, and stride 1. Since this example supports F(2 × 2, r × s), Ho = Hi - r + 1, Wo = Wi - s + 1, and the number of Winograd units is T = ⌈Ho/2⌉ × ⌈Wo/2⌉, where T is the number of tiles along the HW direction.
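A small sketch of this shape bookkeeping; the ceiling form of T is an assumption for output sizes not divisible by 2, since the original expression for T is not legible here.

```python
import math

def winograd_tiling(Hi, Wi, r, s):
    """Output size and Winograd tile count T for F(2x2, r x s), stride 1.
    The ceiling form of T is an assumption for odd Ho or Wo."""
    Ho, Wo = Hi - r + 1, Wi - s + 1
    T = math.ceil(Ho / 2) * math.ceil(Wo / 2)   # number of 2x2 output tiles along HW
    return Ho, Wo, T

assert winograd_tiling(6, 6, 3, 3) == (4, 4, 4)
```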
Since the on-chip cache capacity is limited, the computing device of this embodiment computes with a single batch (N = 1); the scale of the input neuron data fed to the computing device is therefore [1 Ci Hi Wi], the scale of the forward-transformed data is [1 Ci T (r+1)×(s+1)], the scale of the original weights is [Co Ci r s], and the scale of the Winograd weights is [1 Co Ci (r+1)×(s+1)].
Fig. 3 shows a visual schematic of the bit-wise multiplication operation. Since N = 1, each of the aforementioned data can be represented in three dimensions: the scale of the forward-transformed data 301 is [Ci T (r+1)×(s+1)], its three dimensions being Ci, T (the number of HW tiles) and (r+1)×(s+1); the scale of the Winograd weight 302 is [Co Ci (r+1)×(s+1)], its three dimensions being Co, Ci and (r+1)×(s+1). The bit-wise multiplication is a cross bit-wise multiplication over Co in the HW direction, accumulated along the Ci direction, to obtain the bit-multiplied data 303 of scale [Co T (r+1)×(s+1)], whose three dimensions are Co, T and (r+1)×(s+1).
In more detail, the forward-transformed data 301 includes T data units of [Ci (r+1)×(s+1)] and the Winograd weight 302 includes Co data units of [Ci (r+1)×(s+1)]; their bit-wise multiplication yields intermediate results of [Ci (r+1)×(s+1)], which are then accumulated along the Ci direction. This process is the same as a matrix multiplication, so it can be merged into a matrix multiplication operation, using the hardware resources more effectively and reducing the register consumption for intermediate storage.
Since the forward-transformed data 301 includes T data units of [Ci (r+1)×(s+1)] and the Winograd weights 302 include Co data units of [Ci (r+1)×(s+1)], each data unit of the forward-transformed data 301 needs to be multiplied with each data unit of the Winograd weights 302. As shown in fig. 4, during the bit-wise multiplication one data unit 401 of the forward-transformed data 301 is operated on homogeneously with the Co weight data units, i.e., the Co direction is taken as the direction of parallel computation, producing an intermediate result 402. The next data unit is then taken from the forward-transformed data 301 and operated on with the Co weight data units to produce the next intermediate result, and so on until all T data units have been computed, obtaining the bit-multiplied data 303.
When the above data units are bit-wise multiplied and accumulated along the feature-map direction, the required amount of computation is (Ci + Ci - 1) × (r+1)×(s+1) flops. Since Ci is often very large, it is impractical to feed the bit-wise multiply-accumulate operator with Ci as the granularity of a real operation, so this embodiment further splits Ci and performs the multiply-accumulate with the vectorization length l as the unit; the multiply-accumulate over the other dimension (r+1)×(s+1) is split into (r+1)×(s+1) beats completed in sequence, and finally all results are added along the Ci direction to obtain the final result.
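The arithmetic of Figs. 3 and 4 reduces to the NumPy sketch below; the tensor layouts ([T, Ci, P] and [Co, Ci, P]) and the toy sizes are assumptions for illustration, while the hardware additionally splits Ci by l and spreads the (r+1)×(s+1) positions over beats.

```python
import numpy as np

Ci, Co, T, P = 32, 8, 6, 16          # P = (r+1)*(s+1) = 16 for the 3x3 base convolution

V = np.random.rand(T, Ci, P)         # forward-transformed data: T units of [Ci, P]
U = np.random.rand(Co, Ci, P)        # Winograd weights: Co units of [Ci, P]

# Bit-wise multiply per position, accumulate along Ci: output is [Co, T, P].
M = np.einsum('tcp,ocp->otp', V, U)

# Equivalent per-position matrix multiplication, the form the hardware maps onto.
M_ref = np.stack([U[:, :, p] @ V[:, :, p].T for p in range(P)], axis=-1)
assert np.allclose(M, M_ref)
```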
Since the output bandwidth of the forward transform unit is l × 32 bits, in order to keep the overall pipeline time from the forward transform unit to the bit-wise multiply-accumulate operator the same, this embodiment sets the computing power of each bit-wise multiply-accumulate unit within the operator to l + (l-1) flops, comprising l multiplications and l-1 additions. If the multiply-accumulate operator has ω parallel dimensions, i.e., includes ω units working simultaneously, its computing power is ω × (l + (l-1)) flops, a function of ω and l.
This embodiment is further provided with an inverse transform unit for performing the inverse transform operation, i.e., computing A^T L A based on the inverse-transform left-multiplication matrix A^T of size 2 × (m+r-1) and the right-multiplication matrix A of size (n+s-1) × 2, where L = (G g G^T) ⊙ (B^T d B). Since the elements of the inverse-transform left-multiplication matrix A^T and right-multiplication matrix A also consist only of 0, 1 and -1, the inverse matrix multiplication can likewise be decomposed into fixed-pattern additions. The adder group of the inverse transform unit is accordingly configured with a specific number of floating-point adders to complete the linear additions required for the whole matrix multiplication. The five base convolutions are analysed below to determine the scale of the inverse transform unit.
For the F(2 × 2,3 × 3) base convolution, the inverse-transform computing power requirement of the inverse transform unit (ITU) 715 is 24 flops, the input bandwidth is (r+1)(s+1) × 32 = 16 × 32 bits, and the output bandwidth is 4 × 32 bits. As with the forward transform unit, the hardware utilization of the inverse transform unit is optimal when its input and computation times are equal, so the ratio of input bandwidth to addition computing power is preferably 16 : 24 = 2 : 3, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (3/2) × l flops. Each calculation produces 16 final results and no intermediate results; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × 16.
For the F(2 × 2,3 × 2) (or 2 × 3) base convolution, the inverse-transform computing power requirement of the inverse transform unit is 16 flops, the input bandwidth is 12 × 32 bits and the output bandwidth is 4 × 32 bits; the ratio of input bandwidth to addition computing power is preferably 12 : 16 = 3 : 4, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (4/3) × l flops. Each calculation produces 12 final results and no intermediate results; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × 12.
For the F(2 × 2,2 × 2) base convolution, the inverse-transform computing power requirement of the inverse transform unit is 10 flops, the input bandwidth is 9 × 32 bits and the output bandwidth is 4 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 9 : 10, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (10/9) × l flops. Each calculation produces 9 final results and no intermediate results; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × 9.
For the F(2 × 2,3 × 1) base convolution, the inverse-transform computing power requirement of the inverse transform unit is 4 flops, the input bandwidth is 4 × 32 bits and the output bandwidth is 2 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 4 : 4 = 1 : 1, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is l flops. Each calculation produces 4 final results and 2 intermediate results; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × (4+2).
For the F(2 × 2,2 × 1) base convolution, the inverse-transform computing power requirement of the inverse transform unit is 2 flops, the input bandwidth is 3 × 32 bits and the output bandwidth is 3 × 32 bits, so the ratio of input bandwidth to addition computing power is preferably 3 : 2, i.e., the input bandwidth is l × 32 bits and the computing power of the adder group is (2/3) × l flops. Each calculation produces 3 final results and 1 intermediate result; with maximum pipelined use of the register file, the minimum number of registers of the register file is l × 32 × (3+1).
To satisfy and support all five base convolutions simultaneously, the addition computing power of the inverse transform unit may be set to 3/2 times the input bandwidth, i.e., when the input bandwidth is l × 32 bits, the computing power of the adder group is (3/2) × l flops.
However, to keep the hardware design relatively simple, this embodiment may further make the hardware configurations of the forward transform unit and the inverse transform unit identical. On the premise of simultaneously satisfying the requirements of both units, this embodiment adopts the design of the forward transform unit for the inverse transform unit, i.e., equal input and output bandwidths with an addition computing power of twice the input/output bandwidth. In other words, the input bandwidth of the inverse transform unit is l × 32 bits, the output bandwidth is also l × 32 bits, and the computing power of the adder group is 2 × l flops.
In summary, the bandwidths and computing powers of the three core modules performing the Winograd convolution operation in this embodiment (the forward transform unit, the bit-wise multiply-accumulate operator and the inverse transform unit) are all matched: the input bandwidths of the three core modules are all l × 32 bits, the output bandwidths are all l × 32 bits, the computing power of the forward transform unit is 2 × l flops, that of the bit-wise multiply-accumulate operator is ω × (l + (l-1)) flops, and that of the inverse transform unit is 2 × l flops.
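The matched parameters can be summarized as in the following sketch (the data structure and names are illustrative, not from the patent):

```python
from dataclasses import dataclass

@dataclass
class UnitSpec:
    in_bw_bits: int     # input bandwidth per beat
    out_bw_bits: int    # output bandwidth per beat
    flops: int          # adder / multiply-accumulate computing power

def winograd_pipeline(l: int, omega: int) -> dict:
    """Matched specs of the three core modules for vectorization length l
    and omega parallel bit-wise multiply-accumulate lanes."""
    return {
        "forward_transform": UnitSpec(l * 32, l * 32, 2 * l),
        "bit_mac":           UnitSpec(l * 32, l * 32, omega * (l + (l - 1))),
        "inverse_transform": UnitSpec(l * 32, l * 32, 2 * l),
    }

# e.g. winograd_pipeline(l=16, omega=4)  # omega is illustrative, not specified in the text
```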
As seen above, the Winograd convolution operation is directly related to the vectorization length parameter l. The vectorization length l is the minimum processing length and determines the neuron-transform reuse of the computing device in this embodiment: the larger l is, the higher the reuse rate, while the required memory access volume, operation count, power consumption and average hardware area decrease proportionally. However, the parameters of the convolutional layers change with the network model; as l increases, whenever the channel count of part of a network model is smaller than l, computing power is wasted, which degrades the acceleration effect and adds area and power overhead. Therefore, when determining the vectorization length l, these two factors must be traded off to plan the most suitable configuration of l.
Based on empirical values, weights were assigned in this embodiment to the main hardware components (such as the FP32 adders, the bit-wise multiplication units and the registers) to obtain their computing power and resource overhead as functions of l; it was found that when l is greater than 16 the utilization of hardware resources remains at a high level. Taking into account the input and output channel counts of currently common neural network models (such as LeNet, VGG16, VGG19 and AlexNet) and computing the resulting loss of computing power, the comprehensive computing power loss was found to rise significantly when l is greater than 64. From these two quantitative analyses, the computing device of this embodiment performs best when the vectorization length parameter l is between 16 and 64. Considering further generality for possible future network structures and parameters, this embodiment preferably selects l = 16.
Fig. 5 shows a schematic structural diagram of the foregoing embodiment in the form of a board card. As shown in fig. 5, the board card 50 includes a chip 501, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large input data volume, which places high requirements on the storage and computing capabilities of the platform.
The chip 501 is connected to an external device 503 through an external interface device 502. The external device 503 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. Data to be processed can be transferred from the external device 503 to the chip 501 through the external interface device 502, and the computation results of the chip 501 can be transmitted back to the external device 503 via the external interface device 502. The external interface device 502 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board card 50 also includes a memory device 504 for storing data, comprising one or more memory units 505. The memory device 504 is connected to and exchanges data with the control device 506 and the chip 501 via a bus. The control device 506 on the board card 50 is configured to regulate the state of the chip 501; for this purpose, in one application scenario, the control device 506 may include a single-chip microcomputer (MCU).
Fig. 6 is a structural diagram showing a combined processing device in the chip 501 of this embodiment. As shown in fig. 6, the combination processing device 60 includes a computing device 601, an interface device 602, a processing device 603, and a DRAM 604.
The computing device 601 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, especially Winograd convolution operations, which can interact with the processing device 603 through the interface device 602 to collectively perform the user-specified operations.
The interface device 602 is used for transmitting data and control commands between the computing device 601 and the processing device 603. For example, the computing device 601 may obtain input data from the processing device 603 via the interface device 602, and write the input data to the on-chip cache of the computing device 601. Further, the computing device 601 may obtain the control command from the processing device 603 via the interface device 602, and also write the control command into the on-chip cache of the computing device 601. Alternatively or optionally, the interface device 602 may also read data in an on-chip cache of the computing device 601 and transmit to the processing device 603.
The processing device 603 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping the computing device 601. Depending on the implementation, the processing device 603 may be one or more types of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As mentioned above, the computing device 601 of the present invention alone may be regarded as having a single-core structure or a homogeneous multi-core structure; considered together, however, the computing device 601 and the processing device 603 form a heterogeneous multi-core structure.
The DRAM 604 is used to store the data to be processed. It is an off-chip memory, generally 16 GB or larger, that stores data of the computing device 601 and/or the processing device 603, in particular the neuron data and weights on which the Winograd convolution operation is to be performed. In this embodiment, the processing device 603 has previously linearly transformed the weights of the original convolution into the Winograd weights G g G^T and stored them in the DRAM 604.
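A sketch of this offline weight transform, assuming 3 × 3 kernels, a [Co Ci r s] layout and the standard G matrix quoted earlier (layout and names are assumptions for illustration):

```python
import numpy as np

G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float32)

def precompute_winograd_weights(g):
    """g: original 3x3 weights of shape [Co, Ci, 3, 3] ->
    Winograd weights G g G^T of shape [Co, Ci, 4, 4], to be stored off-chip."""
    return np.einsum('ij,ocjk,lk->ocil', G, g, G)

g = np.random.rand(8, 16, 3, 3).astype(np.float32)
assert precompute_winograd_weights(g).shape == (8, 16, 4, 4)
```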
Fig. 7 shows a block diagram of the computing device 601. The computing device 601 includes a bus 701, a direct memory access (DMA) module 702, an instruction cache (Iram) 707, a decode unit (IDU) 708, a neuron cache (Nram) 709, a forward transform unit (NTU) 710, a forward transform data cache (WNram) 711, a weight cache (Wram) 712, a bit-wise multiply-accumulate operator (MAC) 713, a bit-multiplied data cache (WRram) 714, an inverse transform unit (ITU) 715, a result cache (Rram) 716, and an arithmetic logic unit (ALU) 717.
The bus 701 is the common communication trunk over which information is transmitted between the devices; it is a bundle of transmission lines and, according to the kind of information transmitted in the combined processing device 60, is the collective name for the data bus, the address bus and the control bus, which respectively transmit data, data addresses and commands. The bus 701 serves as the communication channel between the DRAM 604 and the computing device 601, and in this embodiment it is specifically PCIe.
The DMA module 702 is used to copy data from one address space to another, typically by transferring data between external memory (e.g., DRAM 604) and internal caches of the computing device 601. When the DMA transfer is to be performed, the processing device 603 gives the DMA module 702 the bus control right, and the DMA module 702 controls the bus 701 to transfer data, and after the DMA transfer is completed, the DMA module 702 gives the bus control right back to the processing device 603.
The DMA module 702 includes a neuron direct memory access (NDMA) 703, a weight direct memory access (WDMA) 704, an instruction direct memory access (IDMA) 705 and a result direct memory access (RDMA) 706. NDMA 703 is used to read neuron data from the DRAM 604, WDMA 704 to read Winograd weights from the DRAM 604, IDMA 705 to read instructions from the DRAM 604, and RDMA 706 to write the computation results to the DRAM 604. In other embodiments, NDMA 703, WDMA 704, IDMA 705 and RDMA 706 may be implemented by the same direct memory access unit.
Since the computing device 601 is mainly aimed at Winograd convolution calculation and has no or only low general-purpose processing capability, it depends heavily on the scheduling and data communication of the processing device 603 during task execution. The resulting very frequent input/output communication between the computing device 601 and the processing device 603 would greatly limit the operation performance of the computing device 601. For this reason, the computing device 601 is provided with several small-capacity on-chip caches for buffering data that can be temporarily stored and reused, such as Nram 709, WNram 711, Wram 712 and WRram 714.
When data is transferred on/off chip, the neuron data and the Winograd weights are transferred in units of a single batch (N = 1): the data unit of the neuron data is [Ci Hi Wi], the data unit of the Winograd weight is [Co Ci (r+1)×(s+1)], and the scale of the result of the Winograd convolution operation is [Co Ho Wo]. The former two are input data and the latter is output data; these are the minimum amounts transferred and computed in the computing device 601. The actual data throughput is determined by the size of the on-chip caches and the operation scheduling flow, as further described below.
From the characteristics of the convolution operation, the convolution over input data of the above scale can be split along multiple dimensions, for example the Ci direction, the HW image direction or the Co direction; however, once the Winograd transform is involved, the minimum operation split unit is F(2 × 2, r × s), and the minimum split unit in the HW direction is (r+1) × (s+1). Considering that the base convolution size used by the computing device 601 for Winograd acceleration does not exceed 3 × 3, this embodiment estimates the cache capacity based on the 3 × 3 base convolution, which consumes the most on-chip cache resources.
According to the rules of Winograd convolution, the forward transform operation is parallelized over the Ci direction with the vectorization length l, the bit-wise multiply-accumulate operation is parallelized over the Co direction in units of l, and the inverse transform is also parallelized over the Co direction in units of l; the size of the minimum neuron input data block participating in the operation can therefore be estimated as [l (r+1)×(s+1)]. Since the data block size of the neuron transform result is estimated with the 3 × 3 base convolution, the Winograd weight data block to be bit-wise multiplied and accumulated has size [l l 4×4], the bit-multiplied output data block has size [l 4×4], and the inverse transform output block has size [l 2×2].
Designing the on-chip caches according to these scales would meet all functional requirements, but the design goals of reuse and low power consumption must also be considered: the above figures are only the minimum input/output storage scales needed for functionality, and the potential for optimizing the input/output volume of the Winograd convolution operation needs to be explored further. This embodiment therefore plans the caches as follows.
In the neuron forward transform, the operation takes F(2 × 2, r × s) with vectorization length l as the minimum implementation unit; the data block fetched each time has size [l 4 4] and the neuron fetch step is kept at 2. As shown in fig. 8, there is an overlap 806 of one quarter between the data unit 801 to be transformed and the four data blocks 802, 803, 804, 805 generated by the sliding window, the overlap 806 having size [l 4 4]. In the process of forward transforming the data unit 801, the data blocks 802, 803, 804 and 805 each contain one overlap 806, so 4 overlaps 806 arise. When the data is split and moved in the minimum data unit [l 4 4], the data throughput required for the overlap portion 806 is quadrupled, increasing redundant data. To solve this problem, this embodiment caches data units of a larger set scale in the on-chip cache of the computing device 601, further reducing the input/output volume.
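The following NumPy sketch, assuming an [l H W] neuron slice and 4 × 4 tiles fetched with step 2, illustrates the overlap between adjacent sliding-window blocks that motivates caching larger data units:

```python
import numpy as np

def extract_tiles(neurons, tile=4, step=2):
    """Extract [l, tile, tile] blocks from an [l, H, W] slice with stride 2;
    adjacent tiles share two rows or two columns of data."""
    l, H, W = neurons.shape
    tiles = []
    for i in range(0, H - tile + 1, step):
        for j in range(0, W - tile + 1, step):
            tiles.append(neurons[:, i:i + tile, j:j + tile])
    return np.stack(tiles)

x = np.random.rand(16, 6, 6)
blocks = extract_tiles(x)                       # 4 tiles of shape [16, 4, 4]
# Columns 2-3 of the first tile equal columns 0-1 of its right-hand neighbour.
assert np.array_equal(blocks[0][:, :, 2:4], blocks[1][:, :, 0:2])
```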
As previously mentioned, the neuron data of scale [Ci Hi Wi] is convolved with the Winograd weight of scale [Co Ci (r+1)×(s+1)]. This embodiment keeps as many Winograd weights on chip as possible, that is, as many [l l (r+1)×(s+1)] weight blocks as the buffer allows, so that this batch of neuron data can be computed with only one weight loading operation, saving the input/output traffic of the weight data.
For the output data, since the convolutional neural network also contains other network layers such as activation, pooling and normalization, the convolution result needs to be cached on chip so that the subsequent layer operations can continue; the computing device 601 therefore reserves a buffer of fixed capacity for storing convolution results. This buffer can share space with the results that eventually come out of the various other layer operations, which avoids reloading the convolution results for those layers and reduces the traffic of transmitting the computation results off chip.
From the above optimization analysis, the buffer capacity for the neuron data should be as large as possible to reduce its total traffic; since the neuron data is accumulated along the Ci direction, the more data stored along the Ci direction, the fewer times the neuron data has to be reloaded and re-accumulated. The buffer space for the Winograd weights also needs to be as large as possible. Finally, this embodiment reserves output result space for the other layer operations. In summary, the on-chip cache in this embodiment is divided into three main blocks with different responsibilities: Nram 709 stores neuron data, Wram 712 stores Winograd weights, and Rram 716 stores convolution results. The computing device 601 further provides 2 buffers for temporarily storing intermediate results: WNram 711 temporarily stores the forward transform data, and WRram 714 temporarily stores the bit-wise multiplication and accumulation data.
Although larger buffers for the neuron data, the Winograd weights and the convolution results are in principle better, the buffer sizes are closely tied to the configuration of the operator resources; if the buffers are made too large, the computing capability of the computing device 601 is sacrificed. The criterion is the balance between input/output bottleneck pressure and compute pressure. This embodiment sets the size of Nram 709 to α × β × [l 4 4], where α is the directional coefficient of Ci and β is the directional coefficient of HW; the size of Wram 712 to α × γ × [l l 4 4], where γ is the directional coefficient of Co; and the size of Rram 716 to β × γ × [l 2 2]. The time required to complete the operation on data of these scales is l × α × β × γ.
Preferably, this embodiment selects l = 16, α = 4, β = 64 and γ = 16. Considering that each FP32 value occupies 4B, the storage capacity of Nram 709 is α × β × [l 4 4] × 4B = 256KB, the storage capacity of Wram 712 is α × γ × [l l 4 4] × 4B = 1MB, and the storage capacity of Rram 716 is β × γ × [l 2 2] × 4B = 256KB.
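These figures can be re-derived with a small sketch; note that the Nram formula α × β × [l 4 4] is a reconstruction from the stated 256KB capacity, and the parameter names follow the text above.

```python
# Buffer-size check using the preferred parameters; assumes FP32 (4 bytes) elements
# and the reconstructed Nram formula alpha * beta * [l 4 4].
l, alpha, beta, gamma = 16, 4, 64, 16
fp32 = 4  # bytes per element

nram = alpha * beta  * (l * 4 * 4)     * fp32   # neuron data buffer
wram = alpha * gamma * (l * l * 4 * 4) * fp32   # Winograd weight buffer
rram = beta  * gamma * (l * 2 * 2)     * fp32   # convolution result buffer

print(nram // 1024, "KB")   # 256 KB
print(wram // 1024, "KB")   # 1024 KB = 1 MB
print(rram // 1024, "KB")   # 256 KB
```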
Returning to fig. 7, Nram 709 temporarily stores the neuron data sent by NDMA 703 according to the decoded instruction, and NTU 710 reads the neuron data from Nram 709 to perform the forward transform, i.e., B^T dB, producing forward transform data that is temporarily stored in WNram 711. Fig. 9 shows a schematic diagram of Nram 709. In this embodiment, Nram 709 includes 4 storage arrays 901, 902, 903, 904, each of which includes 4 storage blocks 905, 906, 907, 908; each storage block consists of d w-bit storage locations, where d is also the number of addresses. Preferably, w is 128 and d is 1024, so each storage block is 16KB, each storage array is 64KB, and Nram 709 has a total capacity of 256KB, a total width of 4 × w = 64B and a depth of 4 × d = 4096.
In the width direction, the input bandwidth of Nram 709 is set to 4B, and the output bandwidth is matched to the input bandwidth of NTU 710. As mentioned above, the input bandwidth of NTU 710 is set to l × 32 bits, and with l preferably 16 this is 64B, so the output bandwidth of Nram 709 is likewise 4 × w = 64B. Since input and output of Nram 709 must proceed simultaneously, a dual-port input/output design is adopted.
Fig. 10 shows a schematic diagram of the NTU 710. The NTU 710 includes an input buffer 1001, a register file 1002, an adder set 1003, and an output buffer 1004.
When NTU 710 receives an instruction to load neuron data from Nram 709, the input buffer 1001 acts as a FIFO queue buffer that temporarily stores the neuron data based on the input bandwidth of 64B. The neuron data loading stage continues until all data has been received, the overall process being controlled by instructions issued by the IDU 708.
The register file 1002 fetches the temporarily stored neuron data from the input buffer 1001 in the programmed operation order according to the decoded instruction, stores it at specific addresses of the register file 1002, and uses the neuron data stored at those addresses as addition operands. In this embodiment, since the pipeline durations of the input stage, operation stage and output stage of NTU 710 should be equal, a buffering hardware resource conflict can arise. To solve this resource dependency, the register file 1002 is divided into a ping storage unit 1005 and a pong storage unit 1006 of the same size: the ith addition operand and the forward transform data generated from it are temporarily stored in the ping storage unit 1005, the (i+1)th addition operand and the (i+1)th forward transform data are temporarily stored in the pong storage unit 1006, and the (i+5)th addition operand and the (i+5)th forward transform data are temporarily stored in the ping storage unit 1005, overwriting the ith addition operand and the ith forward transform data; the register file 1002 stores data according to this rule.
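The ping/pong arrangement can be pictured with a minimal software model (an illustration only, assuming a simple bank alternation; the patent's exact indexing of i, i+1, i+5 depends on its pipeline depth):

```python
class PingPongRegFile:
    """Toy model: two equal banks so that loading block i+1 can overlap computing block i."""
    def __init__(self):
        self.banks = [None, None]            # bank 0 = ping, bank 1 = pong

    def write(self, i, operands):
        self.banks[i % 2] = operands         # a later block overwrites the same bank

    def read(self, i):
        return self.banks[i % 2]

rf = PingPongRegFile()
rf.write(0, "addition operands of block 0")  # goes to ping
rf.write(1, "addition operands of block 1")  # goes to pong; block 0 is computed meanwhile
print(rf.read(0), "|", rf.read(1))
```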
The adder group 1003 reads the addition operands in sequence from the specific addresses of the register file 1002 according to the decoded instruction and performs the addition operations. In this embodiment, there are 2 adder groups 1003, corresponding to the addition scheduling direction; each group includes 16 adders, corresponding to the vectorization direction l, and each adder is an FP32 adder. The additions of the forward transform of the Winograd convolution are performed along the channel direction of the neuron data in a specific order: the additions for the left-multiplication matrix B^T of the Winograd convolution are computed first, then the additions for the right-multiplication matrix B, finally producing the forward transform data, which is stored back to the register file 1002. The operation order, register allocation and operation time all depend on the convolution filter size and are controlled by instructions sent by the IDU 708. The operation stage has a data dependency on the neuron data loading stage; the two are executed in a pipelined manner, implemented by hardware counting.
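For reference, a sketch of the forward transform B^T dB for one 4×4 tile, using the standard F(2×2, 3×3) matrix from the common Winograd formulation; the patent does not spell out B, so this particular matrix is an assumption rather than the exact values wired into NTU 710.

```python
import numpy as np

# Standard F(2x2, 3x3) input-transform matrix (Lavin/Winograd formulation).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def forward_transform(d):
    """V = B^T d B for a 4x4 tile d -- only additions/subtractions are required,
    which is why NTU 710 can be built from adder groups alone."""
    return BT @ d @ BT.T

tile = np.arange(16, dtype=np.float32).reshape(4, 4)
print(forward_transform(tile))
```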
The output buffer 1004 is also a FIFO queue buffer, which temporarily stores the forward transform data taken in turn from the ping storage unit 1005 and the pong storage unit 1006. This output stage depends on the completion of the operation stage and outputs the buffered data based on the output bandwidth of 64B.
Specifically, the first buffer unit 1101, the second buffer unit 1102, the third buffer unit 1103 and the fourth buffer unit 1104 each have a width of w1 bytes and a depth of d1, and are divided into m parts in the depth direction. In this embodiment, m is preferably 8, w1 is 64 and d1 is 128, so each buffer unit is 64B wide and 128 deep, its address space is divided into 8 parts in the depth direction for data reuse, and each buffer unit is 8KB; the total capacity of WNram 711 is therefore 32KB.
Referring back to fig. 7, Wram 712 temporarily stores the Winograd weights sent from WDMA 704 according to the decoded instructions, and MAC 713 reads the Winograd weights from Wram 712 and the forward transform data from WNram 711 according to the decoded instructions and performs the bit-wise multiplication and accumulation of the two, that is, [(GgG^T)⊙(B^T dB)], generating the bit-wise multiplication data, which is temporarily stored in WRram 714.
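A sketch of this bit-wise multiplication and accumulation step, assuming the transformed weights U = GgG^T and transformed neurons V = B^T dB are already available; the accumulation over the Ci direction is what the MAC groups parallelize over the vectorization length l.

```python
import numpy as np

def eltwise_mac(U, V):
    """Element-wise multiply the transformed weights U [Co, Ci, 4, 4] with the
    transformed neurons V [Ci, 4, 4] and accumulate over the Ci direction."""
    return np.einsum('oixy,ixy->oxy', U, V)

Co, Ci = 2, 16                                        # Ci = l, the vectorization length
U = np.random.rand(Co, Ci, 4, 4).astype(np.float32)   # stands for G g G^T per channel
V = np.random.rand(Ci, 4, 4).astype(np.float32)       # stands for B^T d B per channel
M = eltwise_mac(U, V)
print(M.shape)                                        # (2, 4, 4) bit-wise multiplication data
```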
Fig. 12 shows a schematic diagram of Wram 712. In this embodiment, Wram 712 includes 4 storage arrays 1201, 1202, 1203, 1204, and WDMA 704 sends the Winograd weights to the storage arrays 1201, 1202, 1203, 1204 by distribution routing. Each storage array includes 4 storage blocks 1205, 1206, 1207, 1208, each storage block includes 4 storage cells 1209, 1210, 1211, 1212, and each storage cell has a size of d × w bits. As previously mentioned, w is 128 and d is 1024, so each storage block is 64KB, each storage array is 256KB, and the total capacity of Wram 712 is 1MB. Each storage array has a width of 4 × w = 512 bits and is segmented in the depth direction into 4 address-independent segments, each of depth d = 1024, for a total depth of 4 × d = 4096.
In this embodiment, each storage array 1201, 1202, 1203, 1204 independently has an input bandwidth and an output bandwidth of 4 × w bits, and the total output bandwidth of Wram 712 is 4 × 4 × w bits. Specifically, when w is 128, the input bandwidth and output bandwidth of each storage array are 64B, and the total output bandwidth is 256B.
In this embodiment, MAC 713 includes 64 MAC operators, divided into 4 groups that operate on 4 different batches, with the 16 MAC operators of each group laid out independently. The forward transform data of WNram 711 must be sent to all 64 MAC operators simultaneously so that it can be bit-wise multiplied and accumulated with different Winograd weights, so WNram 711 sends the forward transform data by broadcasting or distribution routing. Because the output load is large, in order to guarantee drive strength and timing, the forward transform data of WNram 711 passes through two stages of broadcast or distribution routing, N1 and N2: it is first sent to 4 N1 nodes, each N1 node broadcasts or routes it to 4 N2 nodes, and each N2 node broadcasts or routes it to 4 MAC operators.
Fig. 13 shows a schematic diagram of the WNram 711 output side. MAC 713 first performs the bit-wise multiplication and then accumulates the resulting vector in sequence; logically this is equivalent to computing an inner product, i.e., one element of a matrix multiplication. Each MAC group includes 16 MAC units 1301, i.e., ω = 16, and since l is preferably 16, the compute capability of each MAC group is 16 × (16 + (16-1)) = 496 FLOPs.
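The 496-FLOP figure can be checked directly (simple arithmetic, not patent text):

```python
# Each MAC unit computes a length-l FP32 dot product (l multiplications plus l-1
# additions), and a group holds omega such units.
l, omega = 16, 16
flops_per_unit = l + (l - 1)       # 31
print(omega * flops_per_unit)      # 496
```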
Fig. 14 shows a schematic diagram of the output side of Wram 712. The 4 outputs of Wram 712 are each responsible for the data transfer of 16 MAC units 1301; in this embodiment each storage array 1201, 1202, 1203, 1204 in fact serves the 16 MAC units 1301 under a single N1 node. Since the output bandwidth is only 64B, the bandwidth is time-division multiplexed: each N2 node occupies only one eighth of the bandwidth time, and the remaining half of the bandwidth time is left idle to reduce power consumption. In more detail, Wram 712 transmits the Winograd weight to the N1 node on a 64B bandwidth, the N1 node transmits the Winograd weight to the N2 nodes by broadcast on a 64B bandwidth, and each N2 node transmits the Winograd weight to each MAC unit 1301 by distribution routing on a 64B bandwidth. Each MAC unit 1301 can perform an FP32 multiply-accumulate operation of length l.
Figure 15 shows a schematic diagram of ITU 715. ITU 715 includes input buffer 1501, register file 1502, adder bank 1503, and output buffer 1504.
When ITU 715 receives an instruction to load bit-wise multiplication data from WRram 714, the input buffer 1501 acts as a FIFO queue buffer that temporarily stores the bit-wise multiplication data based on the input bandwidth. The loading stage continues until all data has been received; convolution filters of different sizes are given fixed, independent buffer resource partitions and input counts, and the overall process is controlled by instructions sent by the IDU 708.
The register file 1502 fetches the temporarily stored bit-wise multiplication data from the input buffer 1501 in a fixed operation order according to the decoded instruction, stores it at specific addresses of the register file 1502, and uses the bit-wise multiplication data stored at those addresses as addition operands. Similarly, to solve the resource dependency problem, the register file 1502 has a ping storage unit 1505 and a pong storage unit 1506 of the same size: the ith addition operand and the convolution result generated from it are temporarily stored in the ping storage unit 1505, the (i+1)th addition operand and the (i+1)th convolution result are temporarily stored in the pong storage unit 1506, and the (i+5)th addition operand and the (i+5)th convolution result are temporarily stored in the ping storage unit 1505, overwriting the ith addition operand and the ith convolution result; the register file 1502 stores data according to this rule.
The adder group 1503 reads the addition operands in sequence from the specific addresses of the register file 1502 according to the decoded instruction and performs the addition operations. Like the adder group 1003, there are 2 adder groups 1503, corresponding to the addition scheduling direction; each group includes 16 adders, corresponding to the vectorization direction, and each adder is an FP32 adder. The additions of the inverse transform of the Winograd convolution are performed along the channel direction of the bit-wise multiplication data in a specific order: the additions for the left-multiplication matrix A^T of the Winograd convolution are computed first, then the additions for the right-multiplication matrix A, generating the convolution result, which is stored back to the register file 1502. The operation order, register allocation and operation time all depend on the convolution filter size and are controlled by instructions sent by the IDU 708. The operation stage has a data dependency on the stage of loading the bit-wise multiplication data; the two are executed in a pipelined manner, implemented by hardware counting.
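For reference, a sketch of the inverse transform A^T M A under the same standard F(2×2, 3×3) assumption as the forward transform sketch; it reduces a 4×4 block of bit-wise multiplication data to a 2×2 output tile using additions and subtractions only, matching the adder-only structure of ITU 715.

```python
import numpy as np

# Standard F(2x2, 3x3) output-transform matrix (an assumed choice, not necessarily
# the exact matrix wired into ITU 715).
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def inverse_transform(M):
    """Y = A^T M A: a 4x4 block of bit-wise multiplication data becomes a 2x2 output tile."""
    return AT @ M @ AT.T

M = np.random.rand(4, 4).astype(np.float32)
print(inverse_transform(M))        # 2x2 convolution result tile
```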
The output buffer 1504 is also a FIFO queue buffer, which temporarily stores the convolution results taken in turn from the ping storage unit 1505 and the pong storage unit 1506. This output stage depends on the completion of the operation stage and outputs the buffered data based on the output bandwidth.
Besides Winograd convolution, the computing device 601 can perform all neural-network-related operations, and the ALU 717 performs two kinds of tasks according to the decoded instructions. The first is convolution fusion operations, i.e., operations that can be completed on chip in one pass together with the convolution layer without depending on more data, including activation, bias addition, partial-sum accumulation and the like. The second is non-convolution operations. The results of the ALU 717 operations are also buffered in Rram 716. The presence of the ALU 717 ensures that the various operations of a convolutional neural network can be fully carried out in the computing device 601, giving the computing device 601 generality and completeness for neural networks.
Fig. 16 shows a schematic diagram of the connections of Rram 716. The input ports of Rram 716 are connected to ITU 715 and ALU 717 and receive their output data. Since the convolution and the other operations do not occur at the same time, the two input ports never need to work simultaneously, so the input bandwidth of each storage array is kept at 64B and this 64B bandwidth is time-division multiplexed to receive the data of ITU 715 and ALU 717. Rram 716 likewise has 2 output ports, one connected to RDMA 706 and the other to ALU 717. After the ALU 717 operations are complete, Rram 716 sends the computation results to DRAM 604 via RDMA 706, so the data transfer to RDMA 706 and ALU 717 is also accomplished with a 64B output bandwidth, again time-multiplexed at the output.
When the computing device 601 implements the base convolution operations, the Winograd convolution is performed based on the 5 types of base convolution units. Quantities such as the Winograd transformation matrices of these 5 convolutions differ, which is why the differentiated hardware design described above is needed. Since the difference in data scale of each base convolution leads to differences in data reuse, data splitting, instruction splitting and compute partitioning during calculation, this embodiment further plans specific scheduling schemes in the instructions for the 5 base convolutions, to meet the requirements that the different base convolution scales place on parts of the hardware structure and on performance. In this embodiment, the following scheduling schemes are integrated directly into the hardware design and realized by hardware logic, so their scheduling policies cannot be changed arbitrarily. In other embodiments, the following scheduling schemes may also be implemented in software.
Nram 709 inputs neuron data to NTU 710 in units of the first data unit [(r+1)×(s+1) l]. The neuron data stored in Nram 709 comprises 4 batches (batch) of data, each batch spanning α × β × [(r+1)×(s+1) l]. Considering that weight reuse is related to the scale of the neuron transform result, the neuron forward transform is performed 8 consecutive times along the β direction (i.e., the HW direction); this is the first loop. Furthermore, the neuron data undergoes the forward transform in NTU 710 in units of the first data unit in a time-division multiplexed manner, so the second loop is along the batch direction, with a loop count of 4. Considering that MAC unit 1301 accumulates along the α direction (i.e., the Ci direction), the third loop dimension is the α direction. The last loop returns to the β direction, with a loop count of β/8 = 8.
The steps between all iterations of the above flow are pipelined to ensure that the instructions of the neuron forward transform can be fully pipelined, so that the overhead of the forward transform is amortized over the first data unit [(r+1)×(s+1) l] and thus minimized.
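A software rendering of this four-level loop order (an illustration only; in this embodiment the schedule is fixed in hardware, and the loop bounds below use the preferred parameters α = 4, β = 64 and 4 batches):

```python
def forward_transform_schedule(alpha=4, beta=64, batches=4):
    """Yield the order in which first data units are transformed (outermost loop first)."""
    for beta_outer in range(beta // 8):      # 4th loop: remaining HW steps, beta/8 = 8
        for a in range(alpha):               # 3rd loop: Ci direction
            for b in range(batches):         # 2nd loop: batch direction, 4
                for beta_inner in range(8):  # 1st loop: 8 consecutive transforms along HW
                    yield (beta_outer, a, b, beta_inner)

print(sum(1 for _ in forward_transform_schedule()))   # 8 * 4 * 4 * 8 = 1024 units
```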
The forward transform data is sent by routing to the first buffer unit 1101, the second buffer unit 1102, the third buffer unit 1103 and the fourth buffer unit 1104 of WNram 711, and WNram 711 likewise receives and temporarily stores the forward transform data in units of the first data unit. When WNram 711 outputs the forward transform data, it has to combine this with the weight reuse factor of 8 and traverse the HW direction to output different neurons to MAC 713. Data therefore accumulates between the input and output of WNram 711, and the input and output use different data dimensions.
Fig. 17 shows a sequence diagram of WNram 711 input and output, where the horizontal axis of the WNram 711 storage space is the r × s dimension and the vertical axis is the HW dimension. When the forward transform data is written into WNram 711, it is written in units of the first data unit, i.e., stored sequentially along the r × s dimension of a single HW position, following input order 1701 in the figure; when the forward transform data is read out of WNram 711, it is output in units of the second data unit, i.e., sequentially along the HW dimension of a single r × s position, following output order 1702 in the figure, and the size of the second data unit is the product of the bandwidth of WNram 711 and the number of data bits of the neuron data, i.e., l × 32 bits. Since the input and output orders are not consistent, WNram 711 must wait until a certain amount of forward transform data has been stored before it starts sending the forward transform data to MAC 713.
To avoid data errors, the WNram 711 storage space of the above-mentioned size is used to accumulate that amount of forward transform data, i.e., WNram 711 is filled up before the forward transform data can be sent to MAC 713.
When outputting the forward transform data, WNram 711 sends it N times based on the second data unit and output order 1702, where N depends on the size of Wram 712; that is, the same forward transform data is reused N times in MAC 713 with different Winograd weight data. In this embodiment, N is not greater than γ.
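The order mismatch and the N-fold reuse can be sketched as follows (an illustration with hypothetical sizes, not the actual RTL):

```python
import numpy as np

HW, RS, N = 8, 16, 2          # hypothetical: HW positions, (r+1)*(s+1) entries, reuse count
buf = np.empty((HW, RS), dtype=object)

# input order 1701: all r*s entries of one HW position at a time (first data unit)
for hw in range(HW):
    for rs in range(RS):
        buf[hw, rs] = f"V[hw={hw}, rs={rs}]"

# output order 1702: all HW positions of one r*s entry at a time (second data unit),
# repeated N times for N different Winograd weight blocks
out_order = [buf[hw, rs] for _ in range(N) for rs in range(RS) for hw in range(HW)]
print(out_order[:3])          # ['V[hw=0, rs=0]', 'V[hw=1, rs=0]', 'V[hw=2, rs=0]']
```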
In more detail, when Wram 712 outputs the Winograd weights at the scale of the third data unit, it outputs them first along the α direction and then along the γ direction, i.e., in the order indicated by the arrows in the figure, using ω × F × l as the data unit, so that the pipelined supply of weight data can be realized accurately.
The invention performs hardware design based on the characteristics of the Winograd algorithm to achieve general-purpose acceleration, provides a pipelined operation mode to speed up the Winograd convolution, and makes full use of reusable resources in the hardware implementation through time-division multiplexing, broadcast routing and similar techniques. The hardware structure provided by the invention matches the Winograd convolution algorithm and has the technical effects of preserving network accuracy, accelerating performance, reducing area and reducing power consumption.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can be practiced in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, a computing device for performing Winograd convolution, connected to an off-chip memory, where neuron data and a Winograd weight are stored, the computing device includes a forward transform unit and an on-chip buffer, where the forward transform unit forward transforms the neuron data by using a first data unit as a unit to generate forward transformed data, and the first data unit has a scale of [ (r +1) × (s +1) l ], where r is a height of the Winograd weight, s is a width of the Winograd weight, and l is a bandwidth of the on-chip buffer.
Clause a2, the computing device of clause a1, wherein the on-chip cache comprises a forward transform data cache that receives and temporarily stores the forward transform data in units of the first data unit.
Clause A3, the computing device of clause a2, wherein the forward transform data cache sends the forward transform data in units of a second data unit sized as a product of a bandwidth of the on-chip cache and a data bit count of the neuron data.
Clause a4, the computing device according to clause A3, wherein the on-chip cache further includes a weight cache for temporarily storing the Winograd weight from the off-chip memory, wherein the weight cache sends the Winograd weight in units of a third data unit, a size of the third data unit is ω × F × l, where ω is a multiply-accumulate parallel dimension, and F is (r +1) × (s + 1).
Clause a5, the computing device according to clause a4, further comprising a bit-wise multiplication and accumulation operator, configured to receive the forward transform data sent by the forward transform data buffer, and the Winograd weight sent by the weight buffer, and perform a bit-wise multiplication and accumulation operation to generate bit-wise multiplication data.
Clause a6, the computing device of clause a5, wherein the on-chip cache further comprises a bit-aligned multiplier data cache to temporarily store the bit-aligned multiplier data.
Clause a7, the computing device of clause a6, further comprising an inverse transform unit, for receiving the bit-by-bit multiplied data sent by the bit-by-bit multiplier data buffer, and performing an inverse transform to obtain a convolution result.
Clause A8, the computing device of clause a7, wherein the on-chip cache further comprises a convolution result cache to temporarily store the convolution result.
Clause a9, an integrated circuit device, comprising the computing device of any of clauses a 1-8.
Clause a10, a board comprising the integrated circuit device of clause a 9.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained by applying specific embodiments, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (10)
1. A computing device for executing Winograd convolution is connected to an off-chip memory, neuron data and Winograd weight values are stored in the off-chip memory, the computing device comprises a forward conversion unit and an on-chip cache, the forward conversion unit is used for forward converting the neuron data by taking a first data unit as a unit so as to generate forward conversion data, the scale of the first data unit is [ (r +1) x (s +1) l ], wherein r is the height of the Winograd weight values, s is the width of the Winograd weight values, and l is the bandwidth of the on-chip cache.
2. The computing device of claim 1, wherein the on-chip cache comprises a forward transform data cache that receives and temporarily stores the forward transform data in units of the first data units.
3. The computing device of claim 2, wherein the forward transform data buffer sends the forward transform data in units of second data units, the second data units sized as a product of a bandwidth of the on-chip buffer and a number of data bits of the neuron data.
4. The computing device of claim 3, wherein the on-chip cache further comprises a weight cache to temporarily store the Winograd weight from the off-chip memory, wherein the weight cache sends the Winograd weight in units of a third data unit, the third data unit being of a size ω xFxl, where ω is a multiply-accumulate parallel dimension and F is (r +1) x (s + 1).
5. The computing device of claim 4, further comprising a bit-wise multiplication and accumulation operator for receiving the forward transform data sent by the forward transform data buffer and the Winograd weights sent by the weight buffer, and performing a bit-wise multiplication and accumulation operation to generate bit-wise multiplication data.
6. The computing device of claim 5, wherein the on-chip cache further comprises a bit-by-bit multiplier data cache to temporarily store the bit-by-bit multiplier data.
7. The computing device of claim 6, further comprising an inverse transform unit to receive the bit-wise multiplied data sent by the bit-wise multiplier data buffer, and perform an inverse transform to obtain a convolution result.
8. The computing device of claim 7, wherein the on-chip cache further comprises a convolution result cache to temporarily store the convolution result.
9. An integrated circuit device comprising the computing device of any of claims 1 to 8.
10. A board card comprising the integrated circuit device of claim 9.