
WO2021184143A1 - Data processing apparatus and data processing method - Google Patents

Info

Publication number
WO2021184143A1
Authority
WO
WIPO (PCT)
Prior art keywords
bits
low
multiplier
multiplicand
product
Prior art date
Application number
PCT/CN2020/079431
Other languages
French (fr)
Chinese (zh)
Inventor
董镇江
李震桁
袁宏辉
谢环
蒋东龙
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN202080098682.8A (patent CN115280277A)
Priority to PCT/CN2020/079431 (patent WO2021184143A1)
Publication of WO2021184143A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G06F7/533 Reduction of the number of iteration steps or stages, e.g. using the Booth algorithm, log-sum, odd-even

Definitions

  • This application relates to the technical field of digital signal processing, and in particular to a data processing device and a data processing method.
  • Convolutional neural networks (CNNs) have a wide range of application scenarios in the fields of image and speech recognition.
  • The convolution calculation accounts for 90% of the calculation of the entire algorithm model. Therefore, efficient calculation of the convolution layers is the key to greatly improving the calculation efficiency of a CNN algorithm model.
  • Implementing the convolution calculation through hardware acceleration is an effective way to do this.
  • Nbit*2Nbit and 2Nbit*2Nbit processors have high logic resource overhead and low performance. Here * represents convolution and N is a positive integer.
  • FIG. 1 is a schematic diagram of an Nbit*Nbit data processing device. The processor includes Nbit×Nbit multipliers and an adder with a bit width of 2Nbit. Each Nbit×Nbit multiplier outputs 2Nbit data. When two or more sets of data take part in the convolution operation, an adder with a bit width of 2Nbit is required to accumulate the multiplier outputs.
  • The area of a 2Nbit×Nbit multiplier is twice that of an Nbit×Nbit multiplier, and the area of a 2Nbit×2Nbit multiplier is 4 times that of an Nbit×Nbit multiplier. Compared with using Nbit×Nbit multipliers, this method greatly increases the multiplier area.
  • In addition, this implementation requires an adder with a bit width of 4Nbit, which doubles the adder bit width compared with the Nbit*Nbit solution. The larger multiplier area and the wider adder increase the logic resource overhead of the processor and reduce its performance. Therefore, how to design a processor with low logic resource overhead is a problem that urgently needs to be solved.
  • This application provides a data processing device and a data processing method. In contrast to an approach in which the product of the high-order part of each multiplier and its multiplicand is shifted and then added directly to the product of the corresponding low-order part and the multiplicand, the solution provided in this application combines the partial products that have the same left-shift amount across multiple multiplication operations and accumulates the combined results, which greatly saves logic resources.
  • A first aspect of this application provides a data processing device, which may include a product calculation circuit and an accumulation circuit. The product calculation circuit is used to calculate a first set of products and a second set of products. The first set of products may include the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand.
  • The second set of products may include the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand. The first multiplier and the second multiplier are both 2N bits, and N is a positive integer.
  • The accumulation circuit is used to accumulate the first set of products and the second set of products separately.
  • In the technical solution provided by this application, the products of the high-order parts of multiple multipliers and their multiplicands are accumulated separately, and the shift is applied to the accumulated result. This avoids shifting the product of the high-order part of each multiplier and its multiplicand and adding it directly to the product of the corresponding low-order part and the multiplicand, which would expand the bit width of the adder.
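To illustrate the bit expansion mentioned here, the following minimal Python sketch compares the width of the largest N-bit by N-bit product before and after a left shift by N. It is not taken from the application: unsigned operands are assumed, and the helper name `product_widths` is hypothetical.

```python
def product_widths(n):
    """Return the bit widths of the largest N-bit x N-bit product
    before and after it is shifted left by N bits (unsigned operands assumed)."""
    max_operand = (1 << n) - 1              # largest unsigned N-bit value
    product = max_operand * max_operand     # largest N-bit x N-bit product
    unshifted = product.bit_length()        # width of the raw product
    shifted = (product << n).bit_length()   # width once shifted left by N
    return unshifted, shifted

# For N = 8 the raw product fits in 2N bits, while the shifted product needs 3N bits.
print(product_widths(8))
```

An adder that receives already shifted products therefore has to be wider than one that sums the unshifted products and shifts the sum once.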
  • When the first multiplicand and the second multiplicand are both N bits, the data processing device may further include a first shifter and a first adder. The first shifter is used to shift the accumulated result of the first set of products to obtain a first shift result.
  • The first adder is used to accumulate the first shift result and the accumulated result of the second set of products.
  • When the first multiplicand and the second multiplicand are 2N bits, the first set of products may include high-order products and high-low-order products, and the second set of products may include low-high-order products and low-order products.
  • The product calculation circuit is specifically used to calculate the high-order products, the high-low-order products, the low-high-order products, and the low-order products.
  • The high-order products may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand.
  • The high-low-order products may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand.
  • The low-high-order products may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand.
  • The low-order products may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • The accumulation circuit is specifically used to accumulate the high-order products, the high-low-order products, the low-high-order products, and the low-order products through a second adder with a bit width of 2Nbit. From the second possible implementation of the first aspect, it can be seen that a 2Nbit*2Nbit processor solution can be formed from Nbit×Nbit multipliers.
  • The data processing device may further include a second shifter, a third adder, a third shifter, a fourth shifter, and a fourth adder.
  • The second shifter is used to shift the accumulated result of the high-order products to the left by N bits to obtain a second shift result.
  • The third adder is used to accumulate the second shift result and the accumulated result of the high-low-order products.
  • The third shifter is used to shift the result output by the third adder to the left by N bits to obtain a third shift result.
  • The fourth shifter is used to shift the accumulated result of the low-high-order products to the left by N bits to obtain a fourth shift result.
  • The fourth adder is used to accumulate the third shift result, the fourth shift result, and the low-order products.
  • The data processing device may further include a fifth adder, used to accumulate the fourth shift result and the low-order products.
  • In that case, the fourth adder is specifically used to accumulate the third shift result and the result output by the fifth adder.
  • A split logic circuit may also be included, used to output, through a selector (MUX), the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
  • The split logic circuit is also used to construct a first association relationship.
  • The first association relationship may include the relationship between the high N bits of the first multiplier and the first multiplicand, the relationship between the low N bits of the first multiplier and the first multiplicand, the relationship between the high N bits of the second multiplier and the second multiplicand, and the relationship between the low N bits of the second multiplier and the second multiplicand.
  • Alternatively, a split logic circuit may be included, used to output, through a selector (MUX), the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
  • In this case the split logic circuit is also used to construct a second association relationship. The second association relationship may include: the relationship between the high N bits of the first multiplier and the high N bits of the first multiplicand; the relationship between the high N bits of the first multiplier and the low N bits of the first multiplicand; the relationship between the low N bits of the first multiplier and the high N bits of the first multiplicand; the relationship between the low N bits of the first multiplier and the low N bits of the first multiplicand; the relationship between the high N bits of the second multiplier and the high N bits of the second multiplicand; the relationship between the high N bits of the second multiplier and the low N bits of the second multiplicand; the relationship between the low N bits of the second multiplier and the high N bits of the second multiplicand; and the relationship between the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • The data processing device may further include a data random access memory (RAM) and a weight RAM. The data RAM is used to store the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
  • The weight RAM is used to store, according to the second association relationship, the high N bits and low N bits of the first multiplicand and the high N bits and low N bits of the second multiplicand.
  • The first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
  • A second aspect of this application provides a data processing method, which may include: calculating a first set of products and a second set of products. The first set of products may include the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand.
  • The second set of products may include the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand. The first multiplier and the second multiplier are both 2N bits, and N is a positive integer.
  • The first set of products and the second set of products are accumulated separately.
  • The method may further include: shifting the accumulated result of the first set of products to obtain a first shift result, and accumulating the first shift result and the accumulated result of the second set of products.
  • Calculating the first set of products and the second set of products may specifically include calculating high-order products, high-low-order products, low-high-order products, and low-order products.
  • The high-order products may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand.
  • The high-low-order products may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand.
  • The low-high-order products may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand.
  • The low-order products may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • Accumulating the first set of products and the second set of products separately may include: accumulating the high-order products, the high-low-order products, the low-high-order products, and the low-order products, respectively.
  • The method may further include: shifting the accumulated result of the high-order products to the left by N bits to obtain a second shift result; accumulating the second shift result and the accumulated result of the high-low-order products; shifting that sum to the left by N bits to obtain a third shift result; shifting the accumulated result of the low-high-order products to the left by N bits to obtain a fourth shift result; and accumulating the third shift result, the fourth shift result, and the low-order products.
  • The method may further include: accumulating the fourth shift result and the low-order products.
  • Accumulating the third shift result, the fourth shift result, and the low-order products may then include: accumulating the third shift result and the sum of the fourth shift result and the low-order products.
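The shift-and-accumulate sequence of the method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: operands are treated as unsigned, Python integers are unbounded so fixed adder widths are not modelled, and the function name `mac_2n_by_2n` is hypothetical.

```python
def mac_2n_by_2n(pairs, n):
    """Sum of x * w over (multiplier, multiplicand) pairs of 2N-bit unsigned values,
    using only N-bit x N-bit products."""
    mask = (1 << n) - 1
    hh = hl = lh = ll = 0
    for x, w in pairs:
        x_hi, x_lo = (x >> n) & mask, x & mask
        w_hi, w_lo = (w >> n) & mask, w & mask
        hh += x_hi * w_hi   # high-order products
        hl += x_hi * w_lo   # high-low-order products
        lh += x_lo * w_hi   # low-high-order products
        ll += x_lo * w_lo   # low-order products
    second_shift = hh << n                    # high-order sum shifted left by N
    third_shift = (second_shift + hl) << n    # add the high-low sum, shift left by N again
    fourth_shift = lh << n                    # low-high sum shifted left by N
    # Add the fourth shift result and the low-order sum first, then the third shift result.
    return third_shift + (fourth_shift + ll)

pairs = [(0xFF1A, 0x3F68), (0x415B, 0x1234)]
print(mac_2n_by_2n(pairs, 8) == sum(x * w for x, w in pairs))
```

Each group of partial products is summed before any shift, so only three shifts are needed regardless of how many operand pairs are accumulated.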
  • The method may further include: outputting the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
  • The method may further include: constructing a first association relationship. The first association relationship may include the relationship between the high N bits of the first multiplier and the first multiplicand, the relationship between the low N bits of the first multiplier and the first multiplicand, the relationship between the high N bits of the second multiplier and the second multiplicand, and the relationship between the low N bits of the second multiplier and the second multiplicand.
  • The method may further include: outputting the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
  • The method may further include: constructing a second association relationship. The second association relationship may include: the relationship between the high N bits of the first multiplier and the high N bits of the first multiplicand; the relationship between the high N bits of the first multiplier and the low N bits of the first multiplicand; the relationship between the low N bits of the first multiplier and the high N bits of the first multiplicand; the relationship between the low N bits of the first multiplier and the low N bits of the first multiplicand; the relationship between the high N bits of the second multiplier and the high N bits of the second multiplicand; the relationship between the high N bits of the second multiplier and the low N bits of the second multiplicand; the relationship between the low N bits of the second multiplier and the high N bits of the second multiplicand; and the relationship between the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • The first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
  • A third aspect of this application provides a data processing device, which may include a product calculation module and an accumulation module. The product calculation module is used to calculate a first set of products and a second set of products. The first set of products may include the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand.
  • The second set of products may include the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand. The first multiplier and the second multiplier are both 2N bits, and N is a positive integer.
  • The accumulation module is used to accumulate the first set of products and the second set of products separately.
  • In the technical solution provided by this application, the products of the high-order parts of multiple multipliers and their multiplicands are accumulated separately, and the shift is applied to the accumulated result. This avoids shifting the product of the high-order part of each multiplier and its multiplicand and adding it directly to the product of the corresponding low-order part and the multiplicand, which would expand the bit width of the addition module.
  • When the first multiplicand and the second multiplicand are both N bits, the data processing device may further include a first shift module and a first addition module. The first shift module is used to shift the accumulated result of the first set of products to obtain a first shift result.
  • The first addition module is used to accumulate the first shift result and the accumulated result of the second set of products.
  • When the first multiplicand and the second multiplicand are 2N bits, the first set of products may include high-order products and high-low-order products, and the second set of products may include low-high-order products and low-order products.
  • The product calculation module is specifically used to calculate the high-order products, the high-low-order products, the low-high-order products, and the low-order products.
  • The high-order products may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand.
  • The high-low-order products may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand.
  • The low-high-order products may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand.
  • The low-order products may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • The accumulation module is specifically used to accumulate the high-order products, the high-low-order products, the low-high-order products, and the low-order products through a second addition module with a bit width of 2Nbit. From the second possible implementation of the third aspect, it can be seen that a 2Nbit*2Nbit processing module solution can be formed from Nbit×Nbit multiplication modules.
  • The data processing device may further include a second shift module, a third addition module, a third shift module, a fourth shift module, and a fourth addition module.
  • The second shift module is used to shift the accumulated result of the high-order products to the left by N bits to obtain a second shift result.
  • The third addition module is used to accumulate the second shift result and the accumulated result of the high-low-order products.
  • The third shift module is used to shift the result output by the third addition module to the left by N bits to obtain a third shift result.
  • The fourth shift module is used to shift the accumulated result of the low-high-order products to the left by N bits to obtain a fourth shift result.
  • The fourth addition module is used to accumulate the third shift result, the fourth shift result, and the low-order products.
  • The data processing device may further include a fifth addition module, used to accumulate the fourth shift result and the low-order products.
  • In that case, the fourth addition module is specifically used to accumulate the third shift result and the result output by the fifth addition module.
  • A split logic module may also be included, used to output, through a selection module (MUX), the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
  • The split logic module is also used to construct a first association relationship.
  • The first association relationship may include the relationship between the high N bits of the first multiplier and the first multiplicand, the relationship between the low N bits of the first multiplier and the first multiplicand, the relationship between the high N bits of the second multiplier and the second multiplicand, and the relationship between the low N bits of the second multiplier and the second multiplicand.
  • Alternatively, a split logic module may be included, used to output, through a selection module (MUX), the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
  • In this case the split logic module is also used to construct a second association relationship. The second association relationship may include: the relationship between the high N bits of the first multiplier and the high N bits of the first multiplicand; the relationship between the high N bits of the first multiplier and the low N bits of the first multiplicand; the relationship between the low N bits of the first multiplier and the high N bits of the first multiplicand; the relationship between the low N bits of the first multiplier and the low N bits of the first multiplicand; the relationship between the high N bits of the second multiplier and the high N bits of the second multiplicand; the relationship between the high N bits of the second multiplier and the low N bits of the second multiplicand; the relationship between the low N bits of the second multiplier and the high N bits of the second multiplicand; and the relationship between the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • The data processing device may further include a data random access memory (RAM) module and a weight RAM. The data RAM is used to store the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
  • The weight RAM is used to store, according to the second association relationship, the high N bits and low N bits of the first multiplicand and the high N bits and low N bits of the second multiplicand.
  • The first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
  • A fourth aspect of this application provides a field programmable gate array (FPGA).
  • The FPGA may include the data processing device described in the first aspect or in any one of the possible implementations of the first aspect.
  • In this application, 2Nbit data is split into Nbit parts, so that multiplications involving 2Nbit data can be processed by an Nbit*Nbit data processing device, which avoids increasing the multiplier area.
  • In addition, the products of the high-order parts of multiple multipliers and their multiplicands are accumulated separately. This avoids shifting the product of the high-order part of each multiplier and its multiplicand and adding it directly to the product of the corresponding low-order part and the multiplicand, which would expand the bit width of the adder.
  • FIG. 1 is a schematic diagram of an Nbit*Nbit convolution processor;
  • FIG. 2 is a schematic diagram of Nbit*2Nbit and 2Nbit*2Nbit processors composed of Nbit×Nbit multipliers;
  • FIG. 3 is a schematic diagram of the convolution processing principle of a CNN;
  • FIG. 4 is an Nbit*2Nbit convolution processing solution provided by an embodiment of the application;
  • FIG. 5 is an Nbit*2Nbit convolution processing solution provided by an embodiment of the application;
  • FIG. 6 is a 2Nbit*2Nbit convolution processing solution provided by an embodiment of the application;
  • FIG. 7 is a schematic diagram of a calculation process in which the solution provided in an embodiment of the application is applied to a product;
  • FIG. 8 is a schematic diagram of a 2Nbit×Nbit splitting method provided by an embodiment of the application;
  • FIG. 9 is a schematic diagram of a 2Nbit×2Nbit splitting method provided by an embodiment of the application;
  • FIG. 10 is a schematic diagram of another calculation process in which the solution provided in an embodiment of the application is applied to a product
  • FIG. 11 is a schematic diagram of another calculation process in which the solution provided by an embodiment of the application is applied to a product
  • FIG. 12 is a schematic diagram of another calculation process in which the solution provided by an embodiment of the application is applied to a product
  • FIG. 13 is a schematic flowchart of a data processing method provided by an embodiment of this application.
  • The convolution operation is a weighted summation process. For example, each element in the image area being processed is multiplied by the corresponding element in the convolution kernel, and the sum of all the products is used as the new value of the center pixel of the area.
  • The convolution kernel is a fixed-size matrix composed of numerical parameters.
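As a small illustration of this weighted summation, the Python sketch below computes the new value for the center pixel of one 3×3 region. The region values, the kernel values, and the function name `convolve_at` are made up for the example and do not come from the application.

```python
def convolve_at(region, kernel):
    """Weighted sum of one image region with a kernel of the same size:
    multiply corresponding elements and add up all the products."""
    return sum(
        pixel * weight
        for region_row, kernel_row in zip(region, kernel)
        for pixel, weight in zip(region_row, kernel_row)
    )

region = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[0, 1, 0],
          [1, 2, 1],
          [0, 1, 0]]
new_center_value = convolve_at(region, kernel)  # replaces the center pixel (5)
print(new_center_value)
```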
  • The convolutional neural network performs convolution processing on the convolution kernel data A1, A2, ..., An and the feature layer data w1, w2, ..., wn. Specifically, each convolution kernel starts from the first pixel of the feature map and moves pixel by pixel along the row direction. When it reaches the end of a row, it moves down one pixel in the column direction, returns to the starting point of the row direction, and repeats the movement along the row until all pixels in the feature map have been traversed.
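The traversal order described here can be sketched as two nested loops. The Python example below is a minimal sketch that assumes a stride of one pixel and no padding, neither of which is stated in the text; the function name `conv2d_valid` is hypothetical.

```python
def conv2d_valid(feature, kernel):
    """Slide the kernel over the feature map row by row (stride 1, no padding assumed)
    and return the weighted sum at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(feature) - kh + 1):         # move down one pixel after each row pass
        out_row = []
        for c in range(len(feature[0]) - kw + 1):  # move pixel by pixel along the row
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += feature[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out

feature = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_valid(feature, kernel))
```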
  • The technical solution provided in this application can be applied to the field of original image processing, for example to a scenario in which an original image is denoised. Each pixel of an original image is generally represented by an integer of 10 to 12 bits. If the traditional 8bit data format is used for quantization, too much pixel information of the original image is lost and the denoising effect is unsatisfactory. Therefore, when processing an original image, it is necessary to use a high-bit floating point number (FP) data format or a high-bit integer (INT) data format for quantization. Processing in the floating-point data format incurs additional area and power consumption overhead for exponent processing, whereas processing in the INT data format saves this overhead.
  • The convolution kernel data can be Nbit and the feature layer data 2Nbit; or the convolution kernel data can be 2Nbit and the feature layer data Nbit; or the convolution kernel data and the feature layer data can both be 2Nbit, where N is a positive integer.
  • It should be noted that the solution provided in this application is not only applicable to the field of original image processing. How Nbit×Nbit multipliers constitute the Nbit*2Nbit and 2Nbit*2Nbit processors is described separately below.
  • FIG. 4 shows an Nbit*2Nbit convolution processing solution provided by an embodiment of this application.
  • The 2Nbit data is split into a high-bit part and a low-bit part. The high-bit part is the high N bits, that is, the first N bits of the 2Nbit data, and the low-bit part is the low N bits, that is, the last N bits of the 2Nbit data.
  • For example, when N is 8, the 2Nbit data is 16 bits, such as FF1A, where the high part is FF and the low part is 1A.
  • For 32-bit data such as 3F68415B, the high part is 3F68 and the low part is 415B.
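The split into high and low parts amounts to a shift and a mask. The following Python sketch reproduces the two examples from the text; it assumes unsigned values, and the helper name `split_high_low` is hypothetical.

```python
from typing import Tuple

def split_high_low(value: int, n: int) -> Tuple[int, int]:
    """Split a 2N-bit unsigned value into its high N bits and its low N bits."""
    mask = (1 << n) - 1          # N one-bits, e.g. 0xFF for N = 8
    high = (value >> n) & mask   # first (most significant) N bits
    low = value & mask           # last (least significant) N bits
    return high, low

# The two examples from the text: a 16-bit value with N = 8 and a 32-bit value with N = 16.
print([hex(part) for part in split_high_low(0xFF1A, 8)])
print([hex(part) for part in split_high_low(0x3F68415B, 16)])
```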
  • Consider the convolution operation A×B+C×D, where A and C are 2N bits and B and D are N bits.
  • A-high represents the high part of data A, that is, the high N bits of A, and C-low represents the low part of data C, that is, the low N bits of C; A-low and C-high are defined in the same way.
  • Then A×B+C×D = [(A-high×B + C-high×D) << N] + A-low×B + C-low×D.
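This identity follows from writing each 2N-bit operand as its high part times 2^N plus its low part. A short derivation, assuming unsigned operands and treating a left shift by N as multiplication by 2^N (the subscripted symbols correspond to A-high, A-low, C-high, and C-low in the text):

```latex
\begin{aligned}
A &= A_{\text{high}} \cdot 2^{N} + A_{\text{low}}, \qquad
C = C_{\text{high}} \cdot 2^{N} + C_{\text{low}} \\
A \times B + C \times D
  &= \left(A_{\text{high}} \cdot 2^{N} + A_{\text{low}}\right) B
   + \left(C_{\text{high}} \cdot 2^{N} + C_{\text{low}}\right) D \\
  &= \left(A_{\text{high}} \times B + C_{\text{high}} \times D\right) \cdot 2^{N}
   + A_{\text{low}} \times B + C_{\text{low}} \times D \\
  &= \left[\left(A_{\text{high}} \times B + C_{\text{high}} \times D\right) \ll N\right]
   + A_{\text{low}} \times B + C_{\text{low}} \times D
\end{aligned}
```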
  • the multiplier 401, the multiplier 402, the multiplier 403, and the multiplier are all Nbit ⁇ Nbit multipliers.
  • the multiplier 401 can be used to calculate A-high ⁇ B
  • the multiplier 402 can be used to calculate C-high ⁇ D
  • the multiplier 403 can be used to calculate A-low ⁇ B
  • the multiplier 404 can be used to calculate C- Low ⁇ D.
  • the adder 405 with a bit width of 2Nbit at the input can be used to accumulate the results output by the multiplier 401 and the multiplier 402 to obtain the accumulation result of the first set of products
  • the adder 406 with a bit width of 2Nbit at the input can be used for the multiplier.
  • 403 and the result output by the multiplier 404 are accumulated to obtain the accumulated result of the second set of products.
  • the solution provided in this application combines the product results of partial products that require the same number of left shifts.
  • the product of A-high ⁇ B and C-high ⁇ D The results need to be shifted by N bits to the left, so the partial products are combined, that is, the adder 405 is used for accumulation.
  • the result of the product of A-low ⁇ B and C-low ⁇ D does not need to be shifted to the left, that is, to the left by 0 bits, so the two partial products are combined, that is, accumulated by the adder 406.
  • the shifter 407 performs shift processing on the result output by the adder 405, specifically shifting it to the left by N bits.
  • the result output by the shifter 407 and the result output by the adder 406 are accumulated by the adder 408 to output the final result.
  • the bit width of the adder 408 is 2Nbit.
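The FIG. 4 data path described above can be mimicked in software as a quick sanity check. The following is a minimal Python sketch, not the patented circuit: it assumes unsigned operands, relies on Python's unbounded integers (so adder bit widths and overflow are not modeled), and the function names `split_high_low` and `mac_2n_by_n` are illustrative only. Variable names reuse the reference numerals 401 to 408 from the text.

```python
from typing import Tuple


def split_high_low(value: int, n: int) -> Tuple[int, int]:
    """Split a 2N-bit unsigned value into its high and low N-bit parts."""
    mask = (1 << n) - 1
    return (value >> n) & mask, value & mask


def mac_2n_by_n(a: int, b: int, c: int, d: int, n: int = 8) -> int:
    """Compute A*B + C*D from four N-bit x N-bit products and a single shift."""
    a_high, a_low = split_high_low(a, n)
    c_high, c_low = split_high_low(c, n)

    p401 = a_high * b  # multiplier 401
    p402 = c_high * d  # multiplier 402
    p403 = a_low * b   # multiplier 403
    p404 = c_low * d   # multiplier 404

    sum405 = p401 + p402        # adder 405: products that need a left shift of N
    sum406 = p403 + p404        # adder 406: products that need no shift
    shifted407 = sum405 << n    # shifter 407
    return shifted407 + sum406  # adder 408


# Example operands (assumed values): A and C are 16-bit, B and D are 8-bit.
a, b, c, d = 0xFF1A, 0x12, 0x3F68, 0x7B
result = mac_2n_by_n(a, b, c, d)
print(result == a * b + c * d)
```

The only shift in the sketch is applied to the already accumulated high-part sum, which is the point of grouping partial products by shift amount.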
  • the two-group convolution operation listed above does not mean that the technical solution provided by this application applies only to convolution operations on two groups of data. This application does not limit the number of data groups participating in the convolution operation; this will not be repeated below.
  • the following uses four groups of data as an example to illustrate how Nbit×Nbit multipliers constitute an Nbit*2Nbit processor. FIG. 5 shows an Nbit*2Nbit convolution processing solution provided by this embodiment of the application. Suppose there is a convolution operation on four groups of data: A×B+C×D+E×F+G×H.
  • A, C, E, G are 2Nbits, or 2N bits
  • B, D, F, and H are Nbits, or N bits.
  • A-high is used to represent the high part of data A, that is, the high N bits of A data
  • A-low is used to represent the low part of A data, that is, the low N bits of A data.
  • A×B+C×D+E×F+G×H = [(A-high×B + C-high×D + E-high×F + G-high×H) << N] + A-low×B + C-low×D + E-low×F + G-low×H.
  • A-high×B, C-high×D, E-high×F, and G-high×H can be calculated by the multiplier 501 to the multiplier 504, respectively, and A-low×B, C-low×D, E-low×F, and G-low×H can be calculated by the multiplier 505 to the multiplier 508, respectively.
  • the products A-high×B, C-high×D, E-high×F, and G-high×H all need to be shifted to the left by N bits, so these products are combined.
  • the output results of the multiplier 501 and the multiplier 502 can be accumulated by the adder 509
  • the output results of the multiplier 503 and the multiplier 504 can be accumulated by the adder 510
  • the output results of the adder 509 and the adder 510 can be accumulated by the adder 513.
  • the bit widths of the adder 509, the adder 510, and the adder 513 are all 2Nbit.
  • the products A-low×B, C-low×D, E-low×F, and G-low×H do not need to be shifted left, that is, they are shifted to the left by 0 bits, so these products are combined. Specifically, as shown in FIG. 5, the output results of the multiplier 505 and the multiplier 506 can be accumulated by the adder 511, the output results of the multiplier 507 and the multiplier 508 can be accumulated by the adder 512, and the output results of the adder 511 and the adder 512 can be accumulated by the adder 514. The bit widths of the adder 511, the adder 512, and the adder 514 are all 2Nbit.
  • the shifter 515 shifts the output result of the adder 513 to the left by N bits.
  • the adder 516 accumulates the output results of the shifter 515 and the adder 514.
  • this scheme separately accumulates the products of the high parts of the multipliers and the multiplicands and the products of the low parts of the multipliers and the multiplicands, shifts the accumulated high-part result as a whole, and then adds it to the accumulated low-part result to form the final result.
  • the technical solution provided in the present application does not need to directly extend the Nbit×Nbit multiplier, which avoids an increase in the area of the multiplier.
  • by separately accumulating the products of the high parts of multiple groups of multipliers and the multiplicands and only then performing the shift, the solution avoids shifting the product of the high part of each multiplier and its multiplicand and directly adding it to the product of the low part of the corresponding multiplier and the multiplicand, which would expand the bit width of the adder.
  • A×B+C×D = [((A-high×B) << N) + A-low×B] + [((C-high×D) << N) + C-low×D]
  • an adder with a bit width of 3Nbit is required to calculate the sum of (A-high×B) << N and A-low×B, and the sum of (C-high×D) << N and C-low×D.
  • the other adders in the scheme can be 2Nbit. Compared with the scheme in which the product of the high part of each multiplier and the multiplicand is shifted and then directly added to the product of the low part of the corresponding multiplier and the multiplicand, this solution does not need to perform shift processing multiple times, saving logic resources.
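To make the comparison concrete, the sketch below contrasts the grouped scheme with the per-pair scheme for an arbitrary number of 2Nbit×Nbit groups. It is an illustration under assumptions, not the application's circuit: operands are unsigned, Python integers do not overflow, the function names `grouped_mac` and `per_pair_mac` are hypothetical, and only left shifts applied to products are counted (the right shift used to split an operand stands in for wiring).

```python
from typing import List, Tuple


def grouped_mac(pairs: List[Tuple[int, int]], n: int = 8) -> Tuple[int, int]:
    """Accumulate high and low partial products separately, then shift once.

    Returns (result, number_of_product_shifts).
    """
    mask = (1 << n) - 1
    high_sum = 0
    low_sum = 0
    for multiplier, multiplicand in pairs:
        high_sum += ((multiplier >> n) & mask) * multiplicand
        low_sum += (multiplier & mask) * multiplicand
    return (high_sum << n) + low_sum, 1


def per_pair_mac(pairs: List[Tuple[int, int]], n: int = 8) -> Tuple[int, int]:
    """Shift every high partial product before adding its own low product."""
    mask = (1 << n) - 1
    total = 0
    shifts = 0
    for multiplier, multiplicand in pairs:
        high = ((multiplier >> n) & mask) * multiplicand
        low = (multiplier & mask) * multiplicand
        total += (high << n) + low  # each per-pair sum can be up to 3N bits wide
        shifts += 1
    return total, shifts


# Four example groups (assumed values): 16-bit multipliers, 8-bit multiplicands.
pairs = [(0xFF1A, 0x12), (0x3F68, 0x7B), (0x415B, 0x05), (0x00FF, 0xFF)]
grouped_result, grouped_shifts = grouped_mac(pairs)
per_pair_result, per_pair_shifts = per_pair_mac(pairs)
print(grouped_result == per_pair_result, grouped_shifts, per_pair_shifts)
```

Both functions return the same sum; the grouped version needs one product shift regardless of the number of groups, while the per-pair version needs one per group.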
  • a 2Nbit*2Nbit convolution processing solution provided by this embodiment of the application.
  • the 2Nbit data is split into a high-bit part and a low-bit part.
  • the explanation of the high-bit part and the low-bit part can be understood with reference to the description in FIG. 4, and the details will not be repeated here.
  • the product results that require the same number of left shifts in the partial products are combined.
  • the products of multiple high-order parts and high-order parts, high-order parts and low-order parts, low-order parts and high-order parts, and low-order parts and low-order parts are respectively combined by polynomials.
  • A×C+E×G, where A, C, E, and G are all 2Nbit.
  • A-high is used to represent the high part of data A, that is, the high N bits of A data
  • A-low is used to represent the low part of A data, that is, the low N bits of A data.
  • A×C+E×G = [(A-high×C-high + E-high×G-high) << 2N] + [(A-high×C-low + E-high×G-low) << N] + [(A-low×C-high + E-low×G-high) << N] + (A-low×C-low + E-low×G-low).
  • the multiplier 601 to the multiplier 608 are all Nbit×Nbit multipliers.
  • the product of the high-order part and the high-order part can be calculated by the multiplier 601 and the multiplier 602.
  • the multiplier 601 can calculate A-high×C-high, and the multiplier 602 can calculate E-high×G-high.
  • the multiplier 603 and the multiplier 604 can calculate the product of the high-order part and the low-order part, or alternatively the product of the low-order part and the high-order part.
  • the multiplier 603 can calculate A-high×C-low
  • the multiplier 604 can calculate E-high×G-low
  • alternatively, A-low×C-high can be calculated by the multiplier 603
  • and E-low×G-high can be calculated by the multiplier 604.
  • the multiplier 605 and the multiplier 606 calculate the product of the high-order part and the low-order part, where the product of the high-order part and the low-order part refers to A-high×C-low and E-high×G-low, and the product of the low-order part and the high-order part refers to A-low×C-high and E-low×G-high.
  • the product of the low-order part and the low-order part can be calculated by the multiplier 607 and the multiplier 608.
  • the multiplier 607 can calculate A-low×C-low
  • the multiplier 608 can calculate E-low×G-low.
  • the adder 609 accumulates the output results of the multiplier 601 and the multiplier 602
  • the adder 610 accumulates the output results of the multiplier 603 and the multiplier 604
  • the adder 611 accumulates the output results of the multiplier 605 and the multiplier 606
  • the adder 612 accumulates the output results of the multiplier 607 and the multiplier 608.
  • the bit widths of the adder 609, the adder 610, the adder 611, and the adder 612 are all 2Nbit.
  • the product of the high part and the high part (hereinafter referred to as the high product) needs to be shifted to the left by 2Nbit
  • the product of the high part and the low part (hereinafter referred to as the high-low product) needs to be shifted to the left by Nbit
  • the product of the low part and the high part (hereinafter referred to as the low-high product) needs to be shifted to the left by Nbit
  • the product of the low-order part and the low-order part (hereinafter referred to as the low-order product) does not need to be shifted to the left, that is, it is shifted to the left by 0bit.
  • the output result of the adder 609 can be shifted to the left by Nbit by the shifter 613, and the data output by the shifter 613 is 3Nbit, which is the first shift of the high-order product.
  • the adder 615 accumulates the output results of the shifter 613 and the adder 610, and the bit width of the adder 615 is 3Nbit.
  • the shifter 617 shifts the output result of the adder 615 to the left by Nbit, and the data output by the shifter 617 is 4Nbit. At this time, the high-order product completes the shift of 2Nbit.
  • the shifter 614 shifts the output result of the adder 611 to the left by N bits
  • the adder 616 accumulates the output results of the shifter 614 and the adder 612
  • the bit width of the adder 616 is 3Nbit
  • the adder 618 accumulates the output results of the shifter 617 and the adder 616 to obtain the final output result.
  • the bit width of the adder 618 is 4Nbit.
  • the technical solution provided in this application separately accumulates the products of the high part of the multiplier and the high part of the multiplicand, the products of the high part of the multiplier and the low part of the multiplicand, the products of the low part of the multiplier and the high part of the multiplicand, and the products of the low part of the multiplier and the low part of the multiplicand, and then shifts and adds the 4 accumulated results accordingly to get the final result.
  • the solution provided by this application avoids separately shifting, for each group, the product of the high part of the multiplier and the high part of the multiplicand and the products of the high and low parts, which would expand the bit width of the adders.
  • A×C+E×G = [(A-high×C-high) << 2N] + [(A-high×C-low) << N] + [(A-low×C-high) << N] + (A-low×C-low) + [(E-high×G-high) << 2N] + [(E-high×G-low) << N] + [(E-low×G-high) << N] + (E-low×G-low).
  • a scheme of that kind requires multiple shifts, and the more data involved in the convolution operation, the more shifts are required.
  • it also requires a large number of adders with a bit width of 3Nbit and 4Nbit adders. The solution of this application combines partial products with the same number of left shifts and accumulates each combination separately, which greatly saves logic resources.
  • the shifter 613, the shifter 614, and the shifter 617 can be turned on and off through the state machine.
  • specifically, a user can input instructions through the state machine to turn the shifter 613, the shifter 614, and the shifter 617 on and off.
  • the shifter 613, the shifter 614, and the shifter 617 can be controlled to be in an on state.
  • two sets of 2Nbit×2Nbit operations can be processed.
  • the shifter 613 can be controlled to be turned on, the shifter 614 turned on, and the shifter 617 turned off.
  • 4 groups of 2Nbit×Nbit operations can be processed.
  • the shifter 613, the shifter 614, and the shifter 617 can be controlled to be in an off state, and at this time, 8 groups of Nbit×Nbit operations can be processed.
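The three operating modes can be checked with a small software model of the adders 609 to 618 and the shifters 613, 614, and 617. This is a hedged sketch, not the application's implementation: operands are assumed unsigned, bit widths are not modeled, all function names are hypothetical, and the assignment of operands to the multipliers 601 to 608 in the 2Nbit×Nbit and Nbit×Nbit modes is an assumption chosen so that the sums work out (the text does not spell it out).

```python
from typing import Sequence, Tuple


def halves(value: int, n: int) -> Tuple[int, int]:
    """Return the (high, low) N-bit parts of a 2N-bit unsigned value."""
    mask = (1 << n) - 1
    return (value >> n) & mask, value & mask


def combine(products: Sequence[int], n: int,
            on613: bool, on614: bool, on617: bool) -> int:
    """Combine the eight multiplier outputs; each flag enables one shifter."""
    p601, p602, p603, p604, p605, p606, p607, p608 = products
    s609 = p601 + p602
    s610 = p603 + p604
    s611 = p605 + p606
    s612 = p607 + p608
    out613 = (s609 << n) if on613 else s609
    s615 = out613 + s610
    out617 = (s615 << n) if on617 else s615
    out614 = (s611 << n) if on614 else s611
    s616 = out614 + s612
    return out617 + s616  # adder 618


def two_groups_2n_by_2n(a: int, c: int, e: int, g: int, n: int = 8) -> int:
    """All shifters on: A*C + E*G with 2N-bit operands."""
    (ah, al), (ch, cl) = halves(a, n), halves(c, n)
    (eh, el), (gh, gl) = halves(e, n), halves(g, n)
    products = [ah * ch, eh * gh, ah * cl, eh * gl,
                al * ch, el * gh, al * cl, el * gl]
    return combine(products, n, True, True, True)


def four_groups_2n_by_n(pairs, n: int = 8) -> int:
    """Shifters 613 and 614 on, 617 off: four 2N-bit x N-bit products."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = pairs
    (x1h, x1l), (x2h, x2l) = halves(x1, n), halves(x2, n)
    (x3h, x3l), (x4h, x4l) = halves(x3, n), halves(x4, n)
    products = [x1h * y1, x2h * y2, x1l * y1, x2l * y2,
                x3h * y3, x4h * y4, x3l * y3, x4l * y4]
    return combine(products, n, True, True, False)


def eight_groups_n_by_n(pairs, n: int = 8) -> int:
    """All shifters off: eight N-bit x N-bit products are simply summed."""
    products = [x * y for x, y in pairs]
    return combine(products, n, False, False, False)


wide = two_groups_2n_by_2n(0xFF1A, 0x3F68, 0x415B, 0xFFFF)
mixed_pairs = [(0xFF1A, 0x12), (0x3F68, 0x7B), (0x415B, 0x05), (0x00FF, 0xFF)]
mixed = four_groups_2n_by_n(mixed_pairs)
narrow_pairs = [(i + 1, 0xFF - i) for i in range(8)]
narrow = eight_groups_n_by_n(narrow_pairs)
print(wide, mixed, narrow)
```

The same `combine` function serves all three modes; only the shifter flags and the operand routing change, which mirrors the idea of reusing one set of Nbit×Nbit multipliers.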
  • the above describes how to perform calculations based on the feature layer data and the convolution kernel data.
  • the above solutions can be implemented on any convolution operation device, such as a multiplier, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or another artificial intelligence (AI) chip.
  • FPGA field-programmable gate array
  • ASIC application-specific integrated circuit
  • GPU graphics processing unit
  • AI artificial intelligence
  • a double-rate synchronous dynamic random access memory (DDR) controller 702 reads data from the DDR701, and the data includes feature layer data and convolution kernel data.
  • the DDR controller 702 sends the read data to the splitting logic circuit 703, and the splitting logic circuit 703 splits the 2Nbit feature layer data into a high-order part and a low-order part, and stores the split data in the data random access memory (RAM).
  • DDR double-rate synchronous dynamic random access memory
  • the split logic circuit 703 splits the 2Nbit convolution kernel data into a high-order part and a low-order part, and stores the split data in the weight RAM 704.
  • the calculation circuit 706 obtains the feature layer data from the data RAM, and performs calculation with the convolution kernel data preloaded in the calculation circuit 706.
  • the specific calculation process can be understood with reference to the description of FIGS. 4 to 6. After the calculation circuit 706 completes the calculation, it writes the calculated result into the DDR 701 through the DDR controller 702 to complete the entire process.
  • a state machine (not shown in the figure) may also be included to control the turning off of the shifter in the calculation circuit. The specific principle has been described in detail above and will not be repeated here.
  • the DDR controller reads data from the DDR and sends the read data to the split logic circuit.
  • the split logic circuit splits the acquired data and establishes a corresponding relationship. For example, FIG. 8 shows a 2Nbit×Nbit splitting method.
  • the embodiment of this application does not limit the number of data groups participating in the calculation. In actual application scenarios, the data participating in the calculation can be two groups or more than two groups.
  • A×C+E×G, where A, C, E, and G are all 2Nbit
  • A×C+E×G = [(A-high×C-high + E-high×G-high) << 2N] + [(A-high×C-low + E-high×G-low) << N] + [(A-low×C-high + E-low×G-high) << N] + (A-low×C-low + E-low×G-low)
  • the split logic circuit splits data A into A-high and A-low, data C into C-high and C-low, data E into E-high and E-low, and data G into G-high and G-low, and establishes a corresponding relationship between A-high and C-high,
  • the split logic circuit stores the split feature layer data in the data RAM, and stores the split convolution kernel data in the weight RAM according to the corresponding relationship established above, as shown in Figure 10, with 2Nbit×2Nbit
  • a schematic diagram of a split logic circuit that splits 2Nbit data into two parts, a high part and a low part, is stored in the data RAM and parameter RAM. As shown in Figure 11, the data in the weight RAM is preloaded into the calculation circuit.
  • the first segment of data is preloaded into the calculation circuit 1, the second segment of data is preloaded into the calculation circuit 2, ..., and the nth segment of data is preloaded into the calculation circuit n.
  • the first segment of feature layer data is fetched from the data RAM and calculated with the first segment of data preloaded in the calculation circuit 1 to obtain a result.
  • the specific calculation process can be understood with reference to the description of Figure 6 and will not be repeated here.
  • after the calculation circuit 1 completes the calculation of the first segment of feature layer data and its preloaded first segment of data, it forwards the first segment of feature layer data to the calculation circuit 2, obtains the second segment of feature layer data from the data RAM, and calculates the second segment of feature layer data with the first segment of data preloaded in the calculation circuit 1 to obtain a result.
  • the calculation circuit 1 After each clock, the calculation circuit 1 obtains new data from the data RAM, and the calculation circuit 2 to the calculation circuit n forward the data of the characteristic layer processed by the previous clock to the next calculation circuit. After all the data stored in the data RAM have completed the calculation, the calculation circuit 1 to the calculation circuit n output data, and the data output by the calculation circuit is stored in the DDR through the DDR controller.
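The clock-by-clock forwarding described above can be pictured as a small schedule table. The sketch below is only an illustration under assumptions not stated in the source: one feature-layer segment enters calculation circuit 1 per clock, each circuit hands its segment to the next circuit one clock later, and all indices are 0-based. The function name `forwarding_schedule` is hypothetical, and the arithmetic performed inside each circuit is not modeled.

```python
from typing import List, Optional


def forwarding_schedule(num_circuits: int,
                        num_segments: int) -> List[List[Optional[int]]]:
    """For every clock, list the data segment each calculation circuit holds.

    Entry [clock][circuit] is a segment index, or None when the circuit is idle.
    """
    schedule = []
    for clock in range(num_segments + num_circuits - 1):
        row = []
        for circuit in range(num_circuits):
            segment = clock - circuit  # circuit k sees a segment k clocks late
            row.append(segment if 0 <= segment < num_segments else None)
        schedule.append(row)
    return schedule


# Example (assumed sizes): 3 calculation circuits, 4 feature-layer segments.
table = forwarding_schedule(3, 4)
for clock, row in enumerate(table):
    print(clock, row)
```

Under these assumptions every circuit eventually sees every segment, and the last circuit finishes `num_circuits - 1` clocks after the first one.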
  • FIG. 13 is a schematic flowchart of a data processing method provided in an embodiment of this application.
  • a data processing method provided by an embodiment of the present application may include the following steps:
  • the first group of products can include the product of the high N bits of the first multiplier and the first multiplicand and the product of the high N bits of the second multiplier and the second multiplicand; the second group of products can include the product of the low N bits of the first multiplier and the first multiplicand and the product of the low N bits of the second multiplier and the second multiplicand. The first and second multipliers are both 2N bits, and N is a positive integer.
  • the first group of products and the second group of products are respectively accumulated.
  • shift processing is performed on the accumulated result of the first group of products to obtain the first shift result. The first shift result and the second group of products are then accumulated.
  • this application can process the calculation of 2Nbit*Nbit through the Nbit×Nbit multiplier.
  • calculating the first set of products and the second set of products may specifically include: calculating the high-order product, where the high-order product may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand. Calculate the high-low product.
  • the high-low product can include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand. Calculate the low-high product.
  • the low-high product can include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand.
  • the low-order product is calculated.
  • the low-order product may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand.
  • Accumulating the first group of products and the second group of products separately may include: accumulating high-order products, high-low-order products, low-high-order products, and low-order products, respectively.
  • the solution provided in this application can process the calculation of 2Nbit*2Nbit through the Nbit×Nbit multiplier.
  • it may further include: shifting the accumulated result of the high-order products to the left by N bits to obtain the second shift result. Accumulate the second shift result and the accumulated result of the high-low products. The sum of the second shift result and the accumulated high-low products is shifted to the left by N bits to obtain the third shift result. The accumulated result of the low-high products is shifted to the left by N bits to obtain the fourth shift result. Accumulate the third shift result, the fourth shift result, and the low-order product.
  • it may further include: accumulating the fourth shift result and the low-order product.
  • accumulating the third shift result, the fourth shift result, and the low-order product may include: accumulating the third shift result with the accumulated sum of the fourth shift result and the low-order product.
  • it may further include: outputting the high N bits and low N bits of the first multiplier, and the high N bits and low N bits of the second multiplier.
  • it may further include: constructing a first association relationship, the first association relationship may include an association relationship between the high N bits of the first multiplier and the first multiplicand, and the low N bits of the first multiplier The association relationship with the first multiplicand, the association relationship between the high N bits of the second multiplier and the second multiplicand, and the association relationship between the low N bits of the second multiplier and the second multiplicand.
  • it may also include: outputting the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, and the high N bits and low N bits of the first multiplicand. N bits, the high and low N bits of the second multiplicand.
  • the second association relationship may include an association relationship between the high N bits of the first multiplier and the high N bits of the first multiplicand, and the first multiplier
  • the first multiplier and the second multiplier are feature layer data
  • the first multiplicand and the second multiplicand are convolution kernel data
  • the first multiplier and the second multiplier are Convolution kernel data
  • the first multiplicand and the second multiplicand are feature layer data.
  • the convolution kernel data can be Nbit
  • the feature layer data can be 2Nbit
  • the convolution kernel data can be 2Nbit
  • the feature layer data can be Nbit
  • the layer data is 2Nbit. It should be noted that the solution provided by this application can be applied not only to the field of original image processing.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

A data processing apparatus, comprising: a product calculation circuit, for calculating a first group of products and a second group of products, the first group of products comprising a product of the high N digits of a first multiplier and a first multiplicand and a product of the high N digits of a second multiplier and a second multiplicand, the second group of products comprising a product of the low N digits of the first multiplier and the first multiplicand and a product of the low N digits of the second multiplier and the second multiplicand, the first multiplier and the second multiplier being 2N digits, and N being a positive integer; and an accumulation circuit, for respectively performing accumulation processing on the first group of products and the second group of products. Partial products having the same left-shift amount in multiple groups of multiplication operations are combined, and an accumulation operation is separately performed on the combined results, thereby reducing the logic overhead of the data processing apparatus.

Description

Data processing device and data processing method
Technical field
This application relates to the technical field of digital signal processing, and in particular to a data processing device and a data processing method.
Background
Convolutional neural networks (CNNs) are widely used in fields such as image and speech recognition. In a model that implements a CNN algorithm, convolution accounts for 90% of the computation of the entire model. Efficient computation of the convolutional layers is therefore the key to greatly improving the computational efficiency of a CNN model, and implementing the convolution calculation through hardware acceleration is an effective approach.
At present, Nbit*2Nbit and 2Nbit*2Nbit processors have high logic resource overhead and low performance, where * denotes convolution and N is a positive integer. Specifically, FIG. 1 is a schematic diagram of an Nbit*Nbit data processing device; the processor includes Nbit×Nbit multipliers and adders with a bit width of 2Nbit. Each Nbit×Nbit multiplier outputs 2Nbit data. When two or more groups of data undergo a convolution operation, adders with a bit width of 2Nbit are needed to accumulate the multiplier outputs. The area of a 2Nbit×Nbit multiplier is twice that of an Nbit×Nbit multiplier, and the area of a 2Nbit×2Nbit multiplier is 4 times that of an Nbit×Nbit multiplier, so this approach greatly increases the multiplier area compared with Nbit×Nbit multipliers. In addition, as shown in FIG. 2, for the processor to handle 4 groups of Nbit×Nbit operations, 2 groups of 2Nbit×Nbit operations, and 1 group of 2Nbit×2Nbit operations within one clock, this implementation requires an adder with a bit width of 4Nbit, which doubles the adder bit width compared with the Nbit*Nbit solution. Both the increase in multiplier area and the expansion of adder bit width increase the logic resource overhead of the processor and reduce its performance. How to design a processor with low logic resource overhead is therefore a problem that urgently needs to be solved.
Summary of the invention
This application provides a data processing device and a data processing method. Compared with a scheme in which the product of the high part of each multiplier and its multiplicand is shifted and then directly added to the product of the low part of the corresponding multiplier and the multiplicand, the solution provided in this application combines the partial products that have the same left-shift amount across multiple groups of multiplication operations and accumulates each combined group separately, which greatly saves logic resources.
A first aspect of this application provides a data processing device, which may include: a product calculation circuit, used to calculate a first group of products and a second group of products, where the first group of products may include the product of the high N bits of a first multiplier and a first multiplicand and the product of the high N bits of a second multiplier and a second multiplicand, the second group of products may include the product of the low N bits of the first multiplier and the first multiplicand and the product of the low N bits of the second multiplier and the second multiplicand, the first multiplier and the second multiplier are both 2N bits, and N is a positive integer; and an accumulation circuit, used to accumulate the first group of products and the second group of products separately. As the first aspect shows, when two or more groups of data undergo a convolution operation, the technical solution provided by this application separately accumulates the products of the high parts of the multipliers and the multiplicands before shifting. This avoids shifting the product of the high part of each multiplier and its multiplicand and then directly adding it to the product of the low part of the corresponding multiplier and the multiplicand, which would expand the bit width of the adder.
Optionally, with reference to the first aspect, in a first possible implementation, the data processing device may further include a first shifter and a first adder, and the first multiplicand and the second multiplicand are both N bits. The first shifter is used to shift the accumulated result of the first group of products to obtain a first shift result. The first adder is used to accumulate the first shift result and the second group of products. The first possible implementation of the first aspect thus gives a specific scheme for building a 2Nbit*Nbit processor from Nbit×Nbit multipliers.
Optionally, with reference to the first aspect, in a second possible implementation, the first multiplicand and the second multiplicand are 2N bits, the first group of products may include high-order products and high-low products, and the second group of products may include low-high products and low-order products. The product calculation circuit is specifically used to: calculate the high-order products, which may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand; calculate the high-low products, which may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand; calculate the low-high products, which may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand; and calculate the low-order products, which may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand.
The accumulation circuit is specifically used to accumulate the high-order products, the high-low products, the low-high products, and the low-order products separately through second adders with a bit width of 2Nbit. The second possible implementation of the first aspect thus shows that a 2Nbit*2Nbit processor can be built from Nbit×Nbit multipliers.
Optionally, with reference to the second possible implementation of the first aspect, in a third possible implementation, the data processing device may further include a second shifter, a third adder, a third shifter, a fourth shifter, and a fourth adder. The second shifter is used to shift the accumulated result of the high-order products left by N bits to obtain a second shift result. The third adder is used to accumulate the second shift result and the accumulated result of the high-low products. The third shifter is used to shift the output of the third adder left by N bits to obtain a third shift result. The fourth shifter is used to shift the accumulated result of the low-high products left by N bits to obtain a fourth shift result. The fourth adder is used to accumulate the third shift result, the fourth shift result, and the low-order product.
Optionally, in combination with the third possible implementation of the first aspect, in a fourth possible implementation, the data processing apparatus may further include a fifth adder configured to accumulate the fourth shift result and the low-order product. The fourth adder is specifically configured to accumulate the third shift result and the output of the fifth adder.
Optionally, in combination with the first aspect or the first possible implementation of the first aspect, in a fifth possible implementation, the apparatus may further include a split logic circuit configured to output, through a selector (MUX), the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
Optionally, in combination with the fifth possible implementation of the first aspect, in a sixth possible implementation, the split logic circuit is further configured to construct a first association relationship. The first association relationship may include the association between the high N bits of the first multiplier and the first multiplicand, the association between the low N bits of the first multiplier and the first multiplicand, the association between the high N bits of the second multiplier and the second multiplicand, and the association between the low N bits of the second multiplier and the second multiplicand.
Optionally, in combination with any of the second to fourth possible implementations of the first aspect, in a seventh possible implementation, the apparatus may further include a split logic circuit configured to output, through a selector (MUX), the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
Optionally, in combination with the seventh possible implementation of the first aspect, in an eighth possible implementation, the split logic circuit is further configured to construct a second association relationship. The second association relationship may include the association between the high N bits of the first multiplier and the high N bits of the first multiplicand, the association between the high N bits of the first multiplier and the low N bits of the first multiplicand, the association between the low N bits of the first multiplier and the high N bits of the first multiplicand, the association between the low N bits of the first multiplier and the low N bits of the first multiplicand, the association between the high N bits of the second multiplier and the high N bits of the second multiplicand, the association between the high N bits of the second multiplier and the low N bits of the second multiplicand, the association between the low N bits of the second multiplier and the high N bits of the second multiplicand, and the association between the low N bits of the second multiplier and the low N bits of the second multiplicand.
Optionally, in combination with the eighth possible implementation of the first aspect, in a ninth possible implementation, the data processing apparatus may further include a data random access memory (RAM) and a weight RAM. The data RAM is configured to store the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier. The weight RAM is configured to store, according to the second association relationship, the high N bits and low N bits of the first multiplicand and the high N bits and low N bits of the second multiplicand.
Optionally, in combination with the first aspect or any of the first to ninth possible implementations of the first aspect, in a tenth possible implementation, the first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
A second aspect of the present application provides a data processing method, which may include: calculating a first set of products and a second set of products, where the first set of products may include the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand; the second set of products may include the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand; the first multiplier and the second multiplier are each 2N bits wide; and N is a positive integer. The method further includes accumulating the first set of products and the second set of products separately.
Optionally, in combination with the second aspect, in a first possible implementation, the method may further include: shifting the accumulated result of the first set of products to obtain a first shift result; and accumulating the first shift result and the second set of products.
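The accumulate-then-shift flow of this implementation can be sketched in software. The following Python snippet is a minimal illustration rather than the claimed circuit: it assumes unsigned operands, 2N-bit multipliers, and N-bit multiplicands, and the names `split_high_low` and `grouped_mac` are hypothetical helpers that do not appear in the application.

```python
def split_high_low(value, n):
    """Split an unsigned 2N-bit value into (high N bits, low N bits)."""
    mask = (1 << n) - 1
    return (value >> n) & mask, value & mask


def grouped_mac(pairs, n):
    """Sum multiplier * multiplicand over all pairs using only N x N products.

    Assumption: each multiplier is an unsigned 2N-bit value and each
    multiplicand is an unsigned N-bit value.
    """
    high_sum = 0  # accumulated first set of products (high N bits x multiplicand)
    low_sum = 0   # accumulated second set of products (low N bits x multiplicand)
    for multiplier, multiplicand in pairs:
        high, low = split_high_low(multiplier, n)
        high_sum += high * multiplicand
        low_sum += low * multiplicand
    # A single left shift is applied to the whole first set, then one final add.
    return (high_sum << n) + low_sum


pairs = [(0xFF1A, 0x7B), (0x1234, 0x05)]
result = grouped_mac(pairs, n=8)
```

Because the shift is applied once to the accumulated high-part sum, the result equals the direct sum of the full-width products.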
Optionally, in combination with the second aspect, in a second possible implementation, calculating the first set of products and the second set of products may specifically include: calculating a high-order product, which may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand; calculating a high-low product, which may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand; calculating a low-high product, which may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand; and calculating a low-order product, which may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand. Accumulating the first set of products and the second set of products separately may include: accumulating the high-order product, the high-low product, the low-high product, and the low-order product separately.
Optionally, in combination with the second possible implementation of the second aspect, in a third possible implementation, the method may further include: shifting the accumulated result of the high-order products left by N bits to obtain a second shift result; accumulating the second shift result and the accumulated result of the high-low products; shifting the result of that accumulation left by N bits to obtain a third shift result; shifting the accumulated result of the low-high products left by N bits to obtain a fourth shift result; and accumulating the third shift result, the fourth shift result, and the low-order product.
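The shift-and-add sequence of the third possible implementation can be checked with a short software model. The sketch below is illustrative only: it assumes unsigned 2N-bit multipliers and multiplicands, and the function names are hypothetical, not taken from the application.

```python
def split_high_low(value, n):
    """Split an unsigned 2N-bit value into (high N bits, low N bits)."""
    mask = (1 << n) - 1
    return (value >> n) & mask, value & mask


def grouped_mac_2n_by_2n(pairs, n):
    """Sum of 2N-bit x 2N-bit products built from N x N partial products."""
    hh = hl = lh = ll = 0  # four separately accumulated partial-product groups
    for multiplier, multiplicand in pairs:
        a_hi, a_lo = split_high_low(multiplier, n)
        b_hi, b_lo = split_high_low(multiplicand, n)
        hh += a_hi * b_hi  # high-order products
        hl += a_hi * b_lo  # high-low products
        lh += a_lo * b_hi  # low-high products
        ll += a_lo * b_lo  # low-order products
    second_shift = hh << n                # second shift result
    third_sum = second_shift + hl         # third accumulation
    third_shift = third_sum << n          # third shift result
    fourth_shift = lh << n                # fourth shift result
    return third_shift + fourth_shift + ll  # final accumulation


pairs = [(0xFF1A, 0xABCD), (0x1234, 0xFFFF)]
result = grouped_mac_2n_by_2n(pairs, n=8)
```

Shifting the high-order sum twice by N bits gives it a total weight of 2^(2N), while the high-low and low-high sums each receive a weight of 2^N, which matches the expansion of the full-width product.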
Optionally, in combination with the third possible implementation of the second aspect, in a fourth possible implementation, the method may further include: accumulating the fourth shift result and the low-order product. Accumulating the third shift result, the fourth shift result, and the low-order product may then include: accumulating the third shift result with the result obtained by accumulating the fourth shift result and the low-order product.
Optionally, in combination with the second aspect or the first possible implementation of the second aspect, in a fifth possible implementation, the method may further include: outputting the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
Optionally, in combination with the fifth possible implementation of the second aspect, in a sixth possible implementation, the method may further include: constructing a first association relationship. The first association relationship may include the association between the high N bits of the first multiplier and the first multiplicand, the association between the low N bits of the first multiplier and the first multiplicand, the association between the high N bits of the second multiplier and the second multiplicand, and the association between the low N bits of the second multiplier and the second multiplicand.
Optionally, in combination with any of the second to fourth possible implementations of the second aspect, in a seventh possible implementation, the method may further include: outputting the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
Optionally, in combination with the eighth possible implementation of the second aspect, in a ninth possible implementation, the method may further include: constructing a second association relationship. The second association relationship may include the association between the high N bits of the first multiplier and the high N bits of the first multiplicand, the association between the high N bits of the first multiplier and the low N bits of the first multiplicand, the association between the low N bits of the first multiplier and the high N bits of the first multiplicand, the association between the low N bits of the first multiplier and the low N bits of the first multiplicand, the association between the high N bits of the second multiplier and the high N bits of the second multiplicand, the association between the high N bits of the second multiplier and the low N bits of the second multiplicand, the association between the low N bits of the second multiplier and the high N bits of the second multiplicand, and the association between the low N bits of the second multiplier and the low N bits of the second multiplicand.
Optionally, in combination with the second aspect or any of the first to ninth possible implementations of the second aspect, in a tenth possible implementation, the first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
A third aspect of the present application provides a data processing apparatus, which may include a product calculation module configured to calculate a first set of products and a second set of products. The first set of products may include the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand. The second set of products may include the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand. The first multiplier and the second multiplier are each 2N bits wide, and N is a positive integer. The apparatus may further include an accumulation module configured to accumulate the first set of products and the second set of products separately. According to the third aspect, when two or more sets of data undergo a convolution operation, the products of the high-order parts of the multipliers and the multiplicands are accumulated separately before being shifted. This avoids the bit-width expansion of the addition module that would result if each product of a multiplier's high-order part and its multiplicand were shifted and then added directly to the product of the corresponding multiplier's low-order part and the multiplicand.
Optionally, in combination with the third aspect, in a first possible implementation, the data processing apparatus may further include a first shift module and a first addition module, and the first multiplicand and the second multiplicand are each N bits wide. The first shift module is configured to shift the accumulated result of the first set of products to obtain a first shift result. The first addition module is configured to accumulate the first shift result and the second set of products. The first possible implementation of the third aspect thus gives a specific way of using N-bit × N-bit multiplication modules to form a 2N-bit × N-bit processing module.
Optionally, in combination with the third aspect, in a second possible implementation, the first multiplicand and the second multiplicand are each 2N bits wide, the first set of products may include a high-order product and a high-low product, and the second set of products may include a low-high product and a low-order product. The product calculation module is specifically configured to: calculate the high-order product, which may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand; calculate the high-low product, which may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand; calculate the low-high product, which may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand; and calculate the low-order product, which may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand. The accumulation module is specifically configured to accumulate the high-order product, the high-low product, the low-high product, and the low-order product separately, through second addition modules with a bit width of 2N bits. The second possible implementation of the third aspect thus shows how N-bit × N-bit multiplication modules can form a 2N-bit × 2N-bit processing module.
Optionally, in combination with the second possible implementation of the third aspect, in a third possible implementation, the data processing apparatus may further include a second shift module, a third addition module, a third shift module, a fourth shift module, and a fourth addition module. The second shift module is configured to shift the accumulated result of the high-order products left by N bits to obtain a second shift result. The third addition module is configured to accumulate the second shift result and the accumulated result of the high-low products. The third shift module is configured to shift the output of the third addition module left by N bits to obtain a third shift result. The fourth shift module is configured to shift the accumulated result of the low-high products left by N bits to obtain a fourth shift result. The fourth addition module is configured to accumulate the third shift result, the fourth shift result, and the low-order product.
Optionally, in combination with the third possible implementation of the third aspect, in a fourth possible implementation, the data processing apparatus may further include a fifth addition module configured to accumulate the fourth shift result and the low-order product. The fourth addition module is specifically configured to accumulate the third shift result and the output of the fifth addition module.
Optionally, in combination with the third aspect or the first possible implementation of the third aspect, in a fifth possible implementation, the apparatus may further include a split logic module configured to output, through a selection module (MUX), the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier.
Optionally, in combination with the fifth possible implementation of the third aspect, in a sixth possible implementation, the split logic module is further configured to construct a first association relationship. The first association relationship may include the association between the high N bits of the first multiplier and the first multiplicand, the association between the low N bits of the first multiplier and the first multiplicand, the association between the high N bits of the second multiplier and the second multiplicand, and the association between the low N bits of the second multiplier and the second multiplicand.
Optionally, in combination with any of the second to fourth possible implementations of the third aspect, in a seventh possible implementation, the apparatus may further include a split logic module configured to output, through a selection module (MUX), the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
Optionally, in combination with the seventh possible implementation of the third aspect, in an eighth possible implementation, the split logic module is further configured to construct a second association relationship. The second association relationship may include the association between the high N bits of the first multiplier and the high N bits of the first multiplicand, the association between the high N bits of the first multiplier and the low N bits of the first multiplicand, the association between the low N bits of the first multiplier and the high N bits of the first multiplicand, the association between the low N bits of the first multiplier and the low N bits of the first multiplicand, the association between the high N bits of the second multiplier and the high N bits of the second multiplicand, the association between the high N bits of the second multiplier and the low N bits of the second multiplicand, the association between the low N bits of the second multiplier and the high N bits of the second multiplicand, and the association between the low N bits of the second multiplier and the low N bits of the second multiplicand.
Optionally, in combination with the eighth possible implementation of the third aspect, in a ninth possible implementation, the data processing apparatus may further include a data random access memory (RAM) module and a weight RAM. The data RAM is configured to store the high N bits and low N bits of the first multiplier and the high N bits and low N bits of the second multiplier. The weight RAM is configured to store, according to the second association relationship, the high N bits and low N bits of the first multiplicand and the high N bits and low N bits of the second multiplicand.
Optionally, in combination with the third aspect or any of the first to ninth possible implementations of the third aspect, in a tenth possible implementation, the first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
A fourth aspect of the present application provides a field programmable gate array (FPGA). The FPGA may include the data processing apparatus described in the first aspect or in any possible implementation of the first aspect.
With the technical solution provided by the embodiments of the present application, 2N-bit data is split into N-bit parts, so that multiplications involving 2N-bit data can be handled by an N-bit × N-bit data processing apparatus without increasing the multiplier area. In addition, the products of the high-order parts of multiple multipliers and their multiplicands are accumulated separately. This avoids the bit-width expansion of the adder that would result if each product of a multiplier's high-order part and its multiplicand were shifted and then added directly to the product of the corresponding multiplier's low-order part and the multiplicand.
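The adder-width argument can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions (unsigned operands and N = 8, neither of which is fixed by the application); it compares the width of the value fed to an adder when each high-part product is shifted first against the width when the high-part products are accumulated before any shift.

```python
N = 8                            # assumed operand width for illustration
max_operand = (1 << N) - 1       # largest unsigned N-bit value
max_product = max_operand * max_operand  # largest N x N product

# Shift each high-part product first: the adder input grows to about 3N bits.
shifted_input_bits = (max_product << N).bit_length()

# Accumulate the high-part products first: adder inputs stay at 2N bits.
grouped_input_bits = max_product.bit_length()

# Sum of two unshifted products needs one extra carry bit.
grouped_sum_bits = (max_product + max_product).bit_length()
```

Under these assumptions, the per-product approach presents 3N-bit values to the adder, while the grouped approach presents 2N-bit values and performs the shift only once, on the accumulated sum.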
Description of the Drawings
FIG. 1 shows an N-bit × N-bit convolution processor;
FIG. 2 shows N-bit × 2N-bit and 2N-bit × 2N-bit processors built from N-bit × N-bit multipliers;
FIG. 3 is a schematic diagram of the convolution processing principle of a CNN;
FIG. 4 shows an N-bit × 2N-bit convolution processing solution provided by an embodiment of this application;
FIG. 5 shows an N-bit × 2N-bit convolution processing solution provided by an embodiment of this application;
FIG. 6 shows a 2N-bit × 2N-bit convolution processing solution provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a calculation flow in which the solution provided by an embodiment of this application is applied in a product;
FIG. 8 is a schematic diagram of a 2N-bit × N-bit splitting method provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a 2N-bit × 2N-bit splitting method provided by an embodiment of this application;
FIG. 10 is a schematic diagram of another calculation flow in which the solution provided by an embodiment of this application is applied in a product;
FIG. 11 is a schematic diagram of another calculation flow in which the solution provided by an embodiment of this application is applied in a product;
FIG. 12 is a schematic diagram of another calculation flow in which the solution provided by an embodiment of this application is applied in a product;
FIG. 13 is a schematic flowchart of a data processing method provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application.
For ease of understanding, the convolution calculation process is first briefly introduced with reference to FIG. 3. A convolution operation is a weighted summation: each element in the image region being processed is multiplied by the corresponding element in the convolution kernel, and the sum of all the products is taken as the new value of the region's center pixel. A convolution kernel is a fixed-size matrix of numerical parameters. In FIG. 3, for one feature map, the convolutional neural network performs convolution processing on the convolution kernel data A1, A2, …, An and the feature layer data w1, w2, …, w3. Specifically, each convolution kernel starts at the first pixel of the feature map and moves pixel by pixel along the row direction. When it reaches the end of the row, it moves down one pixel in the column direction and returns to the starting position in the row direction, and the row-direction movement is repeated until all pixels in the feature map have been traversed.
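The sliding-window weighted sum described above can be sketched as follows. This is a minimal illustration, assuming a single-channel feature map, a stride of one pixel, and no padding (so the output is smaller than the input); the function name `conv2d_valid` and the sample values are hypothetical and not taken from the application.

```python
def conv2d_valid(feature_map, kernel):
    """Slide the kernel over the feature map row by row, one pixel at a time."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(feature_map) - k_rows + 1
    out_cols = len(feature_map[0]) - k_cols + 1
    output = []
    for row in range(out_rows):          # move down one pixel per outer step
        out_row = []
        for col in range(out_cols):      # move along the row direction
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    # multiply each region element by the matching kernel element
                    acc += feature_map[row + i][col + j] * kernel[i][j]
            out_row.append(acc)          # the sum of all products is the new value
        output.append(out_row)
    return output


feature_map = [[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_valid(feature_map, kernel)
```

Every output value is a sum of products, which is why a multiply-accumulate structure dominates the hardware cost of convolution.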
The technical solution provided in this application can be applied to raw image processing, for example to denoising a raw image. Each pixel of a raw image is generally represented by an integer of 10 to 12 bits. Quantizing it with the traditional 8-bit data format loses too much of the raw image's pixel information, and the denoising result is unsatisfactory. Raw image processing therefore requires quantization with a high-bit floating-point (FP) data format or a high-bit integer (INT) data format. Processing in a floating-point format incurs additional area and power overhead for exponent handling, whereas processing in the INT format avoids that overhead. This application uses N-bit × N-bit multipliers to form N-bit × 2N-bit and 2N-bit × 2N-bit processors, where N is a positive integer. When the solution of this application is applied to raw image processing, the convolution kernel data may be N bits and the feature layer data 2N bits; or the convolution kernel data may be 2N bits and the feature layer data N bits; or both the convolution kernel data and the feature layer data may be 2N bits. It should be noted that the solution provided in this application is not limited to raw image processing. How N-bit × N-bit multipliers form an N-bit × 2N-bit processor and a 2N-bit × 2N-bit processor is described separately below.
FIG. 4 shows an Nbit*2Nbit convolution processing solution provided by an embodiment of this application. In this solution, 2Nbit data is split into a high part and a low part: the high part is the high N bits (the first N bits) of the 2Nbit data, and the low part is the low N bits (the last N bits). For example, if N is 8, then 2Nbit is 16 bits; for the value FF1A, the high part is FF and the low part is 1A. For 32-bit data such as 3F68415B, the high part is 3F68 and the low part is 415B. Any prior-art method of splitting 2Nbit data into a high part and a low part may be used in the embodiments of this application. For example, a selector can perform the split by selecting the high N bits of data A for output to the multiplier 401, or the low N bits of data A for output to the multiplier 403. When two or more groups of data are multiplied, the partial products that require the same left shift are merged, as the following example illustrates.
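The high/low split can be illustrated with a shift and a mask. The sketch below assumes unsigned values; the helper name `split_high_low` is hypothetical, and the two test values are the ones used in the paragraph above.

```python
def split_high_low(value, n):
    """Split an unsigned 2N-bit value into its high N bits and low N bits."""
    mask = (1 << n) - 1
    high = (value >> n) & mask   # high N bits (the "first" N bits)
    low = value & mask           # low N bits (the "last" N bits)
    return high, low


hi16, lo16 = split_high_low(0xFF1A, 8)         # N = 8, 16-bit data
hi32, lo32 = split_high_low(0x3F68415B, 16)    # N = 16, 32-bit data
print(hex(hi16), hex(lo16))  # 0xff 0x1a
print(hex(hi32), hex(lo32))  # 0x3f68 0x415b
```

Recombining the parts as `(high << n) + low` returns the original value, which is the identity the later decompositions rely on.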
Suppose a convolution operation involves two groups of data: A×B+C×D, where A and C are 2Nbit (2N bits) and B and D are Nbit (N bits). Below, "A-high" denotes the high part of data A, that is, its high N bits, and "C-low" denotes the low part of data C, that is, its low N bits. Then A×B+C×D=[(A-high×B+C-high×D)<<N]+A-low×B+C-low×D. As shown in FIG. 4, the multiplier 401, the multiplier 402, the multiplier 403, and the multiplier 404 are all Nbit×Nbit multipliers. The multiplier 401 can calculate A-high×B, the multiplier 402 can calculate C-high×D, the multiplier 403 can calculate A-low×B, and the multiplier 404 can calculate C-low×D. The adder 405, whose input bit width is 2Nbit, accumulates the outputs of the multiplier 401 and the multiplier 402 to obtain the accumulated result of the first group of products. The adder 406, whose input bit width is also 2Nbit, accumulates the outputs of the multiplier 403 and the multiplier 404 to obtain the accumulated result of the second group of products.

The solution provided in this application merges the partial products that require the same left shift. In the A×B+C×D example above, the products A-high×B and C-high×D both need to be shifted left by N bits, so these two partial products are merged, that is, accumulated by the adder 405. The products A-low×B and C-low×D need no left shift (a left shift of 0 bits), so these two partial products are merged, that is, accumulated by the adder 406. The shifter 407 shifts the output of the adder 405 left by N bits. The adder 408 accumulates the output of the shifter 407 and the output of the adder 406 to produce the final result. The bit width of the adder 408 is 2Nbit.
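As a rough check of the identity above, the FIG. 4 datapath can be modeled with plain Python integers. This is a sketch under assumptions: operands are unsigned, Python's unbounded integers stand in for the hardware bit widths, and the variable names (`p401`, `s405`, and so on) merely echo the reference numerals for readability.

```python
N = 8
MASK = (1 << N) - 1


def split(value):
    """Return (high N bits, low N bits) of an unsigned 2N-bit value."""
    return (value >> N) & MASK, value & MASK


def mac_2n_by_n(a, b, c, d):
    """Model A*B + C*D with A, C 2N-bit and B, D N-bit, following FIG. 4."""
    a_hi, a_lo = split(a)
    c_hi, c_lo = split(c)
    p401 = a_hi * b            # multiplier 401: A-high x B
    p402 = c_hi * d            # multiplier 402: C-high x D
    p403 = a_lo * b            # multiplier 403: A-low x B
    p404 = c_lo * d            # multiplier 404: C-low x D
    s405 = p401 + p402         # adder 405: products that need a shift of N
    s406 = p403 + p404         # adder 406: products that need no shift
    shifted = s405 << N        # shifter 407: one shift for the whole group
    return shifted + s406      # adder 408: final result


a, b, c, d = 0xFF1A, 0x7B, 0x1234, 0xC8
total = mac_2n_by_n(a, b, c, d)
print(total == a * b + c * d)  # True
```

Only one shift is performed no matter how many products feed the adder 405, which is the point of merging partial products by shift amount.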
It should be noted that the two-group convolution example above does not mean that the technical solution provided in this application applies only to two groups of data. This application places no limit on the number of data items participating in the convolution operation, and this point is not repeated below. For a better understanding of the solution, the following uses four groups of data as an example to explain how Nbit×Nbit multipliers form an Nbit*2Nbit processor. FIG. 5 shows an Nbit*2Nbit convolution processing solution provided by an embodiment of this application. Suppose a convolution operation involves four groups of data: A×B+C×D+E×F+G×H, where A, C, E, and G are 2Nbit (2N bits) and B, D, F, and H are Nbit (N bits). Below, "A-high" denotes the high part of data A, that is, its high N bits, and "A-low" denotes the low part of data A, that is, its low N bits. Likewise, "C-high" and "C-low" denote the high N bits and low N bits of data C, "E-high" and "E-low" denote the high N bits and low N bits of data E, and "G-high" and "G-low" denote the high N bits and low N bits of data G. Then A×B+C×D+E×F+G×H=[(A-high×B+C-high×D+E-high×F+G-high×H)<<N]+A-low×B+C-low×D+E-low×F+G-low×H.

In this example, the multipliers 501 to 504 can calculate A-high×B, C-high×D, E-high×F, and G-high×H respectively, and the multipliers 505 to 508 can calculate A-low×B, C-low×D, E-low×F, and G-low×H respectively. The products A-high×B, C-high×D, E-high×F, and G-high×H all need to be shifted left by N bits, so they are merged. Specifically, as shown in FIG. 5, the adder 509 accumulates the outputs of the multiplier 501 and the multiplier 502, the adder 510 accumulates the outputs of the multiplier 503 and the multiplier 504, and the adder 513 accumulates the outputs of the adder 509 and the adder 510. The bit widths of the adder 509, the adder 510, and the adder 513 are all 2Nbit. The products A-low×B, C-low×D, E-low×F, and G-low×H need no left shift (a left shift of 0 bits), so they are merged as well. Specifically, as shown in FIG. 5, the adder 511 accumulates the outputs of the multiplier 505 and the multiplier 506, the adder 512 accumulates the outputs of the multiplier 507 and the multiplier 508, and the adder 514 accumulates the outputs of the adder 511 and the adder 512. The bit widths of the adder 511, the adder 512, and the adder 514 are all 2Nbit. The shifter 515 shifts the output of the adder 513 left by N bits. The adder 516 accumulates the outputs of the shifter 515 and the adder 514.
If A, C, E, and G are regarded as multipliers and B, D, F, and H as multiplicands, then this solution accumulates separately the products of the high parts of the multipliers with the multiplicands and the products of the low parts of the multipliers with the multiplicands. The accumulated result of the high-part products is shifted as a whole and then added to the accumulated result of the low-part products to form the final result.
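The same idea extends to any number of groups. The sketch below is an illustration only, assuming unsigned operands and using Python integers in place of fixed-width hardware; the function name `dot_2n_by_n` and the sample operands are hypothetical.

```python
def dot_2n_by_n(multipliers, multiplicands, n):
    """Sum of products of 2N-bit multipliers and N-bit multiplicands.

    High-part products and low-part products are accumulated separately;
    the high-part sum is shifted once, as a whole, before the final add.
    """
    mask = (1 << n) - 1
    high_sum = 0
    low_sum = 0
    for a, b in zip(multipliers, multiplicands):
        high_sum += ((a >> n) & mask) * b   # N-bit x N-bit product, needs << N
        low_sum += (a & mask) * b           # N-bit x N-bit product, no shift
    return (high_sum << n) + low_sum


# Four groups, as in FIG. 5: A*B + C*D + E*F + G*H with N = 8.
multipliers = [0xFF1A, 0x1234, 0xABCD, 0x0F0F]
multiplicands = [0x7B, 0xC8, 0x01, 0xFF]
grouped = dot_2n_by_n(multipliers, multiplicands, 8)
direct = sum(a * b for a, b in zip(multipliers, multiplicands))
print(grouped == direct)  # True
```

In hardware the two running sums would be adder trees rather than a loop, but the arithmetic is the same.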
Compared with the solutions mentioned in the background, the technical solution provided in this application does not need to extend the Nbit×Nbit multiplier directly, which avoids increasing the multiplier area. In addition, the products of the high parts of the multipliers with the multiplicands are accumulated separately and only then shifted. This avoids the adder widening that results when each high-part product is shifted and then added directly to the corresponding low-part product. For example, in some solutions, A×B+C×D=[((A-high×B)<<N)+A-low×B]+[((C-high×D)<<N)+C-low×D], which requires adders with a bit width of 3Nbit to calculate the sum of the shifted A-high×B and A-low×B and the sum of the shifted C-high×D and C-low×D. In this solution, as shown in FIG. 5, only the adder that outputs the final result needs to be 3Nbit; the other adders can be 2Nbit. Moreover, compared with a solution in which each high-part product is shifted and then added directly to the corresponding low-part product, this solution does not need to shift multiple times, which saves logic resources.
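The saving can be made concrete with a rough tally of shifters and adders for k groups of 2Nbit×Nbit products. The counts below are an informal estimate derived from the description and FIG. 5, not figures stated in the application; they ignore carry growth, and the function names are hypothetical.

```python
def grouped_scheme_resources(groups):
    """Shift once after accumulating (the scheme of FIG. 4 and FIG. 5)."""
    return {
        "shifters": 1,                   # one shift for the whole high-part sum
        "adders_2n": 2 * (groups - 1),   # two adder trees on 2N-bit products
        "adders_3n": 1,                  # only the final adder is wide
    }


def per_product_scheme_resources(groups):
    """Shift every high-part product, then add (the alternative scheme)."""
    return {
        "shifters": groups,                    # one shift per group
        "adders_2n": 0,
        "adders_3n": groups + (groups - 1),    # per-group sums plus accumulation
    }


print(grouped_scheme_resources(4))
# {'shifters': 1, 'adders_2n': 6, 'adders_3n': 1}
print(per_product_scheme_resources(4))
# {'shifters': 4, 'adders_2n': 0, 'adders_3n': 7}
```

For four groups the grouped tally matches FIG. 5: six 2Nbit adders (509 to 514), one shifter (515), and one final adder (516).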
FIG. 6 shows a 2Nbit*2Nbit convolution processing solution provided by an embodiment of this application. In this solution, 2Nbit data is split into a high part and a low part; the high part and the low part are as described for FIG. 4 and are not explained again here. When two or more groups of data are multiplied, the partial products that require the same left shift are merged. Specifically, the high-part×high-part products, the high-part×low-part products, the low-part×high-part products, and the low-part×low-part products are each merged across groups. Each merged set is accumulated separately, each accumulated result is shifted as a whole, and the shifted results are summed to obtain the final result.

For example, suppose a convolution operation involves two groups of data: A×C+E×G, where A, C, E, and G are all 2Nbit. Below, "A-high" denotes the high part of data A, that is, its high N bits, and "A-low" denotes the low part of data A, that is, its low N bits. Likewise, "C-high" and "C-low" denote the high N bits and low N bits of data C, "E-high" and "E-low" denote the high N bits and low N bits of data E, and "G-high" and "G-low" denote the high N bits and low N bits of data G. Then A×C+E×G=[(A-high×C-high+E-high×G-high)<<2N]+[(A-high×C-low+E-high×G-low)<<N]+[(A-low×C-high+E-low×G-high)<<N]+(A-low×C-low+E-low×G-low).

As shown in FIG. 6, the multipliers 601 to 608 are all Nbit×Nbit multipliers. The multiplier 601 and the multiplier 602 can calculate the products of a high part and a high part; for example, the multiplier 601 calculates A-high×C-high and the multiplier 602 calculates E-high×G-high. The multiplier 603 and the multiplier 604 can calculate either the products of a high part and a low part or the products of a low part and a high part. For example, the multiplier 603 calculates A-high×C-low and the multiplier 604 calculates E-high×G-low, or the multiplier 603 calculates A-low×C-high and the multiplier 604 calculates E-low×G-high. If the multipliers 603 and 604 calculate the high-part×low-part products, the multipliers 605 and 606 calculate the low-part×high-part products; if the multipliers 603 and 604 calculate the low-part×high-part products, the multipliers 605 and 606 calculate the high-part×low-part products. Here the high-part×low-part products are A-high×C-low and E-high×G-low, and the low-part×high-part products are A-low×C-high and E-low×G-high. The multiplier 607 and the multiplier 608 can calculate the products of a low part and a low part; for example, the multiplier 607 calculates A-low×C-low and the multiplier 608 calculates E-low×G-low.

The adder 609 accumulates the outputs of the multiplier 601 and the multiplier 602, the adder 610 accumulates the outputs of the multiplier 603 and the multiplier 604, the adder 611 accumulates the outputs of the multiplier 605 and the multiplier 606, and the adder 612 accumulates the outputs of the multiplier 607 and the multiplier 608. The bit widths of the adder 609, the adder 610, the adder 611, and the adder 612 are all 2Nbit.

The product of a high part and a high part (hereinafter the high product) needs to be shifted left by 2Nbit. The product of a high part and a low part (hereinafter the high-low product) and the product of a low part and a high part (hereinafter the low-high product) each need to be shifted left by Nbit. The product of a low part and a low part (hereinafter the low product) needs no left shift, that is, a left shift of 0 bits. In a specific implementation, the shifter 613 shifts the output of the adder 609 left by Nbit, and the data output by the shifter 613 is 3Nbit; this is the first shift of the high product. The adder 615 accumulates the outputs of the shifter 613 and the adder 610, and the bit width of the adder 615 is 3Nbit. The shifter 617 shifts the output of the adder 615 left by Nbit, and the data output by the shifter 617 is 4Nbit; at this point the high product has completed its 2Nbit shift. The shifter 614 shifts the output of the adder 611 left by N bits, and the adder 616 accumulates the outputs of the shifter 614 and the adder 612; the bit width of the adder 616 is 3Nbit. The adder 618 accumulates the outputs of the shifter 617 and the adder 616 to obtain the final output, and the bit width of the adder 618 is 4Nbit.
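The FIG. 6 datapath can be modeled the same way. In this sketch the high product is shifted in two stages of N bits each, as the description states. It assumes unsigned operands, assigns the high-low products to the multipliers 603 and 604 (one of the two options the text allows), and uses variable names that simply echo the reference numerals.

```python
N = 8
MASK = (1 << N) - 1


def split(value):
    """Return (high N bits, low N bits) of an unsigned 2N-bit value."""
    return (value >> N) & MASK, value & MASK


def mac_2n_by_2n(a, c, e, g):
    """Model A*C + E*G with all operands 2N-bit, following FIG. 6."""
    a_hi, a_lo = split(a)
    c_hi, c_lo = split(c)
    e_hi, e_lo = split(e)
    g_hi, g_lo = split(g)
    s609 = a_hi * c_hi + e_hi * g_hi   # multipliers 601, 602: high products
    s610 = a_hi * c_lo + e_hi * g_lo   # multipliers 603, 604: high-low products
    s611 = a_lo * c_hi + e_lo * g_hi   # multipliers 605, 606: low-high products
    s612 = a_lo * c_lo + e_lo * g_lo   # multipliers 607, 608: low products
    s615 = (s609 << N) + s610          # shifter 613, then adder 615
    s616 = (s611 << N) + s612          # shifter 614, then adder 616
    return (s615 << N) + s616          # shifter 617, then adder 618


a, c, e, g = 0xFF1A, 0x3F68, 0x415B, 0xFFFF
total = mac_2n_by_2n(a, c, e, g)
print(total == a * c + e * g)  # True
```

Expanding the last line gives (s609 << 2N) + (s610 << N) + (s611 << N) + s612, which is the four-term identity stated in the text.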
In the technical solution provided in this application, four kinds of products are accumulated separately across groups: the products of the high parts of the multipliers with the high parts of the multiplicands, the products of the high parts of the multipliers with the low parts of the multiplicands, the products of the low parts of the multipliers with the high parts of the multiplicands, and the products of the low parts of the multipliers with the low parts of the multiplicands. The four accumulated results are then shifted and added as required to obtain the final result. This avoids the adder widening that results when, for every group, the high×high product and the high×low and low×high products are each shifted individually. For example, in some solutions, A×C+E×G=[(A-high×C-high)<<2N]+[(A-high×C-low)<<N]+[(A-low×C-high)<<N]+(A-low×C-low)+[(E-high×G-high)<<2N]+[(E-high×G-low)<<N]+[(E-low×G-high)<<N]+(E-low×G-low). Such a solution requires many shifts, and the more data participates in the convolution operation, the more shifts are needed. It also requires a large number of adders with a bit width of 3Nbit and adders with a bit width of 4Nbit. This solution instead merges the partial products that have the same left shift and accumulates each merged set separately, which saves a substantial amount of logic resources.
In a specific implementation of this application, by turning the shifters on and off, 4 groups of Nbit×Nbit operations, 2 groups of 2Nbit×Nbit operations, and 1 group of 2Nbit×2Nbit operations can be processed within one clock. This is described in detail below with reference to FIG. 6. A state machine can control whether the shifter 613, the shifter 614, and the shifter 617 are on or off; specifically, a user can input an instruction through the state machine to turn the shifter 613, the shifter 614, and the shifter 617 on or off. In one specific implementation, the shifter 613, the shifter 614, and the shifter 617 are all on, in which case 2 groups of 2Nbit×2Nbit operations can be processed, as described above for FIG. 6. In one specific implementation, the shifter 613 is on, the shifter 614 is on, and the shifter 617 is off, in which case 4 groups of 2Nbit×Nbit operations can be processed. In one specific implementation, the shifter 613, the shifter 614, and the shifter 617 are all off, in which case 8 groups of Nbit×Nbit operations can be processed.
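One way to read the three shifter configurations is sketched below. It is an interpretation, not a statement of the application's design: it assumes that a shifter that is off passes its input through unshifted, that operands are unsigned, and that the operands are routed to the eight multipliers as shown. The function name `fig6_datapath` and the sample values are hypothetical.

```python
def split(value, n):
    """Return (high N bits, low N bits) of an unsigned 2N-bit value."""
    mask = (1 << n) - 1
    return (value >> n) & mask, value & mask


def fig6_datapath(pairs, n, sh613, sh614, sh617):
    """pairs: eight (x, y) operand pairs for the multipliers 601 to 608."""
    products = [x * y for x, y in pairs]
    s609 = products[0] + products[1]
    s610 = products[2] + products[3]
    s611 = products[4] + products[5]
    s612 = products[6] + products[7]
    s615 = ((s609 << n) if sh613 else s609) + s610
    s616 = ((s611 << n) if sh614 else s611) + s612
    return ((s615 << n) if sh617 else s615) + s616


N = 8
a, c, e, g = 0xFF1A, 0x3F68, 0x415B, 0x0102
(a_hi, a_lo), (c_hi, c_lo) = split(a, N), split(c, N)
(e_hi, e_lo), (g_hi, g_lo) = split(e, N), split(g, N)

# All shifters on: two 2N x 2N products, A*C + E*G.
full = fig6_datapath(
    [(a_hi, c_hi), (e_hi, g_hi), (a_hi, c_lo), (e_hi, g_lo),
     (a_lo, c_hi), (e_lo, g_hi), (a_lo, c_lo), (e_lo, g_lo)],
    N, True, True, True)

# Shifters 613 and 614 on, 617 off: four 2N x N products.
p, q, r, s = 0x11, 0x22, 0x33, 0x44
mixed = fig6_datapath(
    [(a_hi, p), (e_hi, q), (a_lo, p), (e_lo, q),
     (c_hi, r), (g_hi, s), (c_lo, r), (g_lo, s)],
    N, True, True, False)

# All shifters off: eight N x N products summed.
narrow_pairs = [(1, 2), (3, 4), (5, 6), (7, 8),
                (9, 10), (11, 12), (13, 14), (15, 16)]
narrow = fig6_datapath(narrow_pairs, N, False, False, False)

print(full == a * c + e * g)                      # True
print(mixed == a * p + e * q + c * r + g * s)     # True
print(narrow == sum(x * y for x, y in narrow_pairs))  # True
```

Under these assumptions the same eight multipliers and adder tree serve all three operand widths, and only the shifter enables and the operand routing change.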
The above describes how to perform calculations based on the feature layer data and the convolution kernel data. In a specific application scenario, the above solution can be implemented by any convolution operation apparatus, such as a multiplier, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or another chip such as an artificial intelligence (AI) chip.
The following describes the calculation flow involved in this application for a scenario in which the solution is applied to a specific product. The specific product may be any of the convolution operation apparatuses mentioned above. As shown in FIG. 7, a double data rate synchronous dynamic random access memory (DDR) controller 702 reads data from the DDR 701; the data includes feature layer data and convolution kernel data. The DDR controller 702 sends the data it has read to the splitting logic circuit 703. The splitting logic circuit 703 splits the 2Nbit feature layer data into a high part and a low part and stores the split data in the data random access memory (RAM) 705. The splitting logic circuit 703 also splits the 2Nbit convolution kernel data into a high part and a low part and stores the split data in the weight RAM 704. The calculation circuit 706 obtains the feature layer data from the data RAM and performs the calculation with the convolution kernel data preloaded in the calculation circuit 706; the specific calculation process can be understood with reference to the descriptions of FIG. 4 to FIG. 6. After the calculation circuit 706 completes the calculation, it writes the result to the DDR 701 through the DDR controller 702, which completes the whole flow. A specific implementation may further include a state machine (not shown in the figure) for controlling the turning off of the shifters in the calculation circuit. The principle has been described in detail above and is not repeated here.
The following takes the case in which the multiplier and the multiplicand are both 2Nbit, that is, the convolution kernel data and the feature layer data are both 2Nbit, as an example to describe the data calculation flow in the product structure shown in FIG. 7. The DDR controller reads data from the DDR and sends the data it has read to the splitting logic circuit. The splitting logic circuit splits the data it receives and establishes correspondences.

For example, FIG. 8 shows a 2Nbit×Nbit splitting method. Suppose four groups of data are multiplied, A×B+C×D+E×F+G×H, where A, C, E, and G are 2Nbit (2N bits) and B, D, F, and H are Nbit, so that A×B+C×D+E×F+G×H=[(A-high×B+C-high×D+E-high×F+G-high×H)<<N]+A-low×B+C-low×D+E-low×F+G-low×H. The splitting logic circuit then splits data A into A-high and A-low and associates each of them with data B; splits data C into C-high and C-low and associates each of them with data D; splits data E into E-high and E-low and associates each of them with data F; and splits data G into G-high and G-low and associates each of them with data H. It should be noted that the embodiments of this application do not limit the number of data items participating in the calculation; in an actual application scenario, two or more groups of data may participate.

FIG. 9 shows a 2Nbit×2Nbit splitting method for A×C+E×G, where A, C, E, and G are all 2Nbit, so that A×C+E×G=[(A-high×C-high+E-high×G-high)<<2N]+[(A-high×C-low+E-high×G-low)<<N]+[(A-low×C-high+E-low×G-high)<<N]+(A-low×C-low+E-low×G-low). The splitting logic circuit splits data A into A-high and A-low, data C into C-high and C-low, data E into E-high and E-low, and data G into G-high and G-low. It then associates A-high with C-high and G-high with E-high; A-high with C-low and G-high with E-low; A-low with C-high and G-low with E-high; and A-low with C-low and G-low with E-low.

The splitting logic circuit stores the split feature layer data in the data RAM and, according to the correspondences established above, stores the split convolution kernel data in the weight RAM. Taking 2Nbit×2Nbit as an example, FIG. 10 is a schematic diagram of the splitting logic circuit splitting 2Nbit data into a high part and a low part and storing them in the data RAM and the parameter RAM. As shown in FIG. 11, the data in the weight RAM is preloaded into the calculation circuits: segment 1 is preloaded into calculation circuit 1, segment 2 into calculation circuit 2, and so on, with segment n preloaded into calculation circuit n. As shown in FIG. 12, segment I is fetched from the data RAM and calculated with segment 1 preloaded in calculation circuit 1 to obtain a result; the specific calculation process can be understood with reference to the description of FIG. 6 and is not repeated here. After calculation circuit 1 completes the calculation of segment I with segment 1, it forwards segment I to calculation circuit 2, fetches segment II from the data RAM, and calculates segment II with segment 1 preloaded in calculation circuit 1 to obtain a result. In every clock thereafter, calculation circuit 1 fetches new data from the data RAM, and calculation circuit 2 to calculation circuit n forward the feature layer data they finished processing in the previous clock to the next calculation circuit. When all the data stored in the data RAM has been processed, calculation circuit 1 to calculation circuit n output their data, and the DDR controller stores the data output by the calculation circuits in the DDR.
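The forwarding pattern among calculation circuits 1 to n can be illustrated with a small clock-by-clock simulation. This is a loose sketch under assumptions: each calculation circuit holds one preloaded weight segment and computes a plain dot product with whatever data segment it currently holds, and the names `run_pipeline`, `held`, and `results` are hypothetical. The high/low splitting inside each circuit is omitted to keep the data movement visible.

```python
def dot(data_segment, weight_segment):
    """Stand-in for one calculation circuit's multiply-accumulate."""
    return sum(x * w for x, w in zip(data_segment, weight_segment))


def run_pipeline(data_segments, weight_segments):
    """Simulate circuits 1..n; returns one list of results per circuit."""
    n = len(weight_segments)
    held = [None] * n                  # data segment held by each circuit
    results = [[] for _ in range(n)]
    total_clocks = len(data_segments) + n - 1
    for clock in range(total_clocks):
        # Circuits 2..n receive the segment the previous circuit just finished.
        for k in range(n - 1, 0, -1):
            held[k] = held[k - 1]
        # Circuit 1 fetches a new segment from the data RAM, if any remain.
        held[0] = data_segments[clock] if clock < len(data_segments) else None
        for k in range(n):
            if held[k] is not None:
                results[k].append(dot(held[k], weight_segments[k]))
    return results


data_segments = [[1, 2], [3, 4], [5, 6]]   # segments I, II, III in the data RAM
weight_segments = [[1, 1], [2, 0]]         # segments 1, 2 preloaded per circuit
results = run_pipeline(data_segments, weight_segments)
print(results)  # [[3, 7, 11], [2, 6, 10]]
```

Each data segment is read from the data RAM only once and then passed from circuit to circuit, so every weight segment eventually sees every data segment.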
The data processing apparatus provided by the embodiments of this application, and an apparatus that includes it, have been described above. The data processing method provided by the embodiments of this application is described below.
FIG. 13 is a schematic flowchart of a data processing method provided by an embodiment of this application. As shown in FIG. 13, the data processing method provided by the embodiment of this application may include the following steps:
1301. Calculate a first group of products and a second group of products. The first group of products may include the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand. The second group of products may include the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand. The first multiplier and the second multiplier are both 2N bits, and N is a positive integer.
1302. Accumulate the first group of products and the second group of products separately.
In the technical solution provided in this application, when two or more groups of data undergo a convolution operation, the products of the high parts of the multipliers with the multiplicands are accumulated separately and only then shifted. This avoids the adder widening that results when each high-part product is shifted and then added directly to the corresponding low-part product.
In a specific implementation, the accumulated result of the first group of products is shifted to obtain a first shift result, and the first shift result and the second group of products are then accumulated. In this way, an N-bit × N-bit multiplier can be used to handle a 2N-bit × N-bit calculation.
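The 2N-bit × N-bit case can be sketched in software as follows. This is a minimal illustration rather than the claimed circuit: it assumes unsigned operands, and the function names `split_hi_lo` and `mac_2n_by_n` are hypothetical.

```python
def split_hi_lo(x, n):
    """Split an unsigned 2n-bit integer into (high n bits, low n bits)."""
    mask = (1 << n) - 1
    return (x >> n) & mask, x & mask


def mac_2n_by_n(multipliers, multiplicands, n):
    """Multiply-accumulate 2n-bit multipliers with n-bit multiplicands.

    Assumption: all values are unsigned. Every individual product is
    n-bit x n-bit; the shift is applied once, after accumulation.
    """
    hi_acc = 0  # accumulated first group: (high n bits) * multiplicand
    lo_acc = 0  # accumulated second group: (low n bits) * multiplicand
    for a, b in zip(multipliers, multiplicands):
        a_hi, a_lo = split_hi_lo(a, n)
        hi_acc += a_hi * b
        lo_acc += a_lo * b
    first_shift_result = hi_acc << n  # single left shift by n bits
    return first_shift_result + lo_acc


if __name__ == "__main__":
    a_vals = [0x1234, 0xBEEF]  # 16-bit multipliers (n = 8)
    b_vals = [0x56, 0x9A]      # 8-bit multiplicands
    direct = sum(a * b for a, b in zip(a_vals, b_vals))
    print(mac_2n_by_n(a_vals, b_vals, 8) == direct)
```

The result matches a direct multiply-accumulate because shifting the sum of the high-part products is equivalent to summing the individually shifted products.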
In a specific implementation, calculating the first group of products and the second group of products may specifically include the following. Calculate high-order products, which may include the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand. Calculate high-low products, which may include the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand. Calculate low-high products, which may include the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand. Calculate low-order products, which may include the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand. Accumulating the first group of products and the second group of products separately may include: separately accumulating the high-order products, the high-low products, the low-high products, and the low-order products. With this solution, an N-bit × N-bit multiplier can be used to handle a 2N-bit × 2N-bit calculation.
In a specific implementation, the method may further include: shifting the accumulated result of the high-order products left by N bits to obtain a second shift result; accumulating the second shift result and the accumulated result of the high-low products; shifting the result of that accumulation left by N bits to obtain a third shift result; shifting the accumulated result of the low-high products left by N bits to obtain a fourth shift result; and accumulating the third shift result, the fourth shift result, and the low-order products.
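The shift-and-add sequence for the 2N-bit × 2N-bit case can be sketched as follows. This is an illustrative sketch under the assumption of unsigned operands; the names `split_hi_lo` and `mac_2n_by_2n` are hypothetical, and the variable names mirror the "second", "third", and "fourth" shift results described above.

```python
def split_hi_lo(x, n):
    """Split an unsigned 2n-bit integer into (high n bits, low n bits)."""
    mask = (1 << n) - 1
    return (x >> n) & mask, x & mask


def mac_2n_by_2n(multipliers, multiplicands, n):
    """Multiply-accumulate 2n-bit x 2n-bit pairs using n-bit x n-bit products.

    Assumption: all values are unsigned.
    """
    hh = hl = lh = ll = 0  # four separate accumulators
    for a, b in zip(multipliers, multiplicands):
        a_hi, a_lo = split_hi_lo(a, n)
        b_hi, b_lo = split_hi_lo(b, n)
        hh += a_hi * b_hi  # high-order products
        hl += a_hi * b_lo  # high-low products
        lh += a_lo * b_hi  # low-high products
        ll += a_lo * b_lo  # low-order products
    second_shift = hh << n                  # accumulated high-order sum, shifted by n
    third_shift = (second_shift + hl) << n  # add high-low sum, shift by n again
    fourth_shift = lh << n                  # accumulated low-high sum, shifted by n
    return third_shift + fourth_shift + ll


if __name__ == "__main__":
    a_vals = [0x1234, 0xBEEF]
    b_vals = [0xCAFE, 0x0F0F]
    direct = sum(a * b for a, b in zip(a_vals, b_vals))
    print(mac_2n_by_2n(a_vals, b_vals, 8) == direct)
```

Shifting the high-order sum twice by N bits gives it a total weight of 2^(2N), while the high-low and low-high sums each receive a weight of 2^N, which is the standard expansion of a product of two split operands.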
In a specific implementation, the method may further include: accumulating the fourth shift result and the low-order products. In that case, accumulating the third shift result, the fourth shift result, and the low-order products may include: accumulating the third shift result with the result of accumulating the fourth shift result and the low-order products.
In a specific implementation, the method may further include: outputting the high N bits and low N bits of the first multiplier, and the high N bits and low N bits of the second multiplier.
In a specific implementation, the method may further include: constructing a first association relationship. The first association relationship may include an association between the high N bits of the first multiplier and the first multiplicand, an association between the low N bits of the first multiplier and the first multiplicand, an association between the high N bits of the second multiplier and the second multiplicand, and an association between the low N bits of the second multiplier and the second multiplicand.
In a specific implementation, the method may further include: outputting the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
In a specific implementation, the method may further include: constructing a second association relationship. The second association relationship may include, for the first multiplier and the first multiplicand, an association between the high N bits of the multiplier and the high N bits of the multiplicand, between the high N bits of the multiplier and the low N bits of the multiplicand, between the low N bits of the multiplier and the high N bits of the multiplicand, and between the low N bits of the multiplier and the low N bits of the multiplicand; and the same four associations for the second multiplier and the second multiplicand.
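One way to picture the second association relationship is as a list of pairings between operand halves. The application does not state how the association is represented, so the tuple layout and the name `build_second_association` below are purely hypothetical.

```python
def build_second_association(num_pairs):
    """Enumerate pairings of multiplier halves with multiplicand halves.

    Each entry is (multiplier index, multiplier half,
                   multiplicand index, multiplicand half).
    Hypothetical representation, for illustration only.
    """
    halves = ("high", "low")
    return [
        (i, a_half, i, b_half)
        for i in range(num_pairs)
        for a_half in halves
        for b_half in halves
    ]


if __name__ == "__main__":
    # Two operand pairs give eight pairings, in the order listed in the text.
    for entry in build_second_association(2):
        print(entry)
```

For two operand pairs this yields eight pairings, four per pair, corresponding to the high-order, high-low, low-high, and low-order products.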
In a specific implementation, the first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data; alternatively, the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data. When the solution is applied to raw image processing, the convolution kernel data may be N bits and the feature layer data 2N bits, or the convolution kernel data may be 2N bits and the feature layer data N bits, or both may be 2N bits. It should be noted that the solution provided in this application is not limited to raw image processing.
The data processing apparatus and data processing method provided in the embodiments of this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the foregoing embodiments is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims (22)

  1. A data processing apparatus, comprising:
    a product calculation circuit, configured to calculate a first group of products and a second group of products, wherein the first group of products comprises the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand; the second group of products comprises the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand; the first multiplier and the second multiplier are each 2N bits; and N is a positive integer; and
    an accumulation circuit, configured to accumulate the first group of products and the second group of products separately.
  2. The data processing apparatus according to claim 1, wherein the data processing apparatus further comprises a first shifter and a first adder, and the first multiplicand and the second multiplicand are each N bits;
    the first shifter is configured to shift the accumulated result of the first group of products to obtain a first shift result; and
    the first adder is configured to accumulate the first shift result and the second group of products.
  3. The data processing apparatus according to claim 1, wherein the first multiplicand and the second multiplicand are 2N bits, the first group of products comprises high-order products and high-low products, the second group of products comprises low-high products and low-order products, and the product calculation circuit is specifically configured to:
    calculate the high-order products, wherein the high-order products comprise the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand;
    calculate the high-low products, wherein the high-low products comprise the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand;
    calculate the low-high products, wherein the low-high products comprise the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand; and
    calculate the low-order products, wherein the low-order products comprise the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand; and
    the accumulation circuit is specifically configured to separately accumulate the high-order products, the high-low products, the low-high products, and the low-order products by using a second adder with a bit width of 2N bits.
  4. The data processing apparatus according to claim 3, wherein the data processing apparatus further comprises a second shifter, a third adder, a third shifter, a fourth shifter, and a fourth adder;
    the second shifter is configured to shift the accumulated result of the high-order products left by N bits to obtain a second shift result;
    the third adder is configured to accumulate the second shift result and the accumulated result of the high-low products;
    the third shifter is configured to shift the result output by the third adder left by N bits to obtain a third shift result;
    the fourth shifter is configured to shift the accumulated result of the low-high products left by N bits to obtain a fourth shift result; and
    the fourth adder is configured to accumulate the third shift result, the fourth shift result, and the low-order products.
  5. The data processing apparatus according to claim 4, wherein the data processing apparatus further comprises a fifth adder, configured to:
    accumulate the fourth shift result and the low-order products; and
    the fourth adder is specifically configured to accumulate the third shift result and the result output by the fifth adder.
  6. The data processing apparatus according to claim 1 or 2, further comprising a split logic circuit, configured to:
    output, through a selector (MUX), the high N bits and low N bits of the first multiplier, and the high N bits and low N bits of the second multiplier.
  7. The data processing apparatus according to claim 6, wherein the split logic circuit is further configured to:
    construct a first association relationship, wherein the first association relationship comprises an association between the high N bits of the first multiplier and the first multiplicand, an association between the low N bits of the first multiplier and the first multiplicand, an association between the high N bits of the second multiplier and the second multiplicand, and an association between the low N bits of the second multiplier and the second multiplicand.
  8. The data processing apparatus according to any one of claims 3 to 5, further comprising a split logic circuit, configured to:
    output, through a selector (MUX), the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
  9. The data processing apparatus according to claim 8, wherein the split logic circuit is further configured to:
    construct a second association relationship, wherein the second association relationship comprises an association between the high N bits of the first multiplier and the high N bits of the first multiplicand, an association between the high N bits of the first multiplier and the low N bits of the first multiplicand, an association between the low N bits of the first multiplier and the high N bits of the first multiplicand, an association between the low N bits of the first multiplier and the low N bits of the first multiplicand, an association between the high N bits of the second multiplier and the high N bits of the second multiplicand, an association between the high N bits of the second multiplier and the low N bits of the second multiplicand, an association between the low N bits of the second multiplier and the high N bits of the second multiplicand, and an association between the low N bits of the second multiplier and the low N bits of the second multiplicand.
  10. The data processing apparatus according to claim 9, wherein the data processing apparatus further comprises a data random access memory (RAM) and a weight RAM;
    the data RAM is configured to store the high N bits and low N bits of the first multiplier, and the high N bits and low N bits of the second multiplier; and
    the weight RAM is configured to store, according to the second association relationship, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
  11. The data processing apparatus according to any one of claims 1 to 10, wherein the first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data, or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
  12. A data processing method, comprising:
    calculating a first group of products and a second group of products, wherein the first group of products comprises the product of the high N bits of a first multiplier and a first multiplicand, and the product of the high N bits of a second multiplier and a second multiplicand; the second group of products comprises the product of the low N bits of the first multiplier and the first multiplicand, and the product of the low N bits of the second multiplier and the second multiplicand; the first multiplier and the second multiplier are each 2N bits; and N is a positive integer; and
    accumulating the first group of products and the second group of products separately.
  13. The data processing method according to claim 12, further comprising:
    shifting the accumulated result of the first group of products to obtain a first shift result; and
    accumulating the first shift result and the second group of products.
  14. The data processing method according to claim 12, wherein the calculating a first group of products and a second group of products specifically comprises:
    calculating the high-order products, wherein the high-order products comprise the product of the high N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the high N bits of the second multiplicand;
    calculating the high-low products, wherein the high-low products comprise the product of the high N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the high N bits of the second multiplier and the low N bits of the second multiplicand;
    calculating the low-high products, wherein the low-high products comprise the product of the low N bits of the first multiplier and the high N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the high N bits of the second multiplicand; and
    calculating the low-order products, wherein the low-order products comprise the product of the low N bits of the first multiplier and the low N bits of the first multiplicand, and the product of the low N bits of the second multiplier and the low N bits of the second multiplicand; and
    the accumulating the first group of products and the second group of products separately comprises:
    separately accumulating the high-order products, the high-low products, the low-high products, and the low-order products.
  15. The data processing method according to claim 14, further comprising:
    shifting the accumulated result of the high-order products left by N bits to obtain a second shift result;
    accumulating the second shift result and the accumulated result of the high-low products;
    shifting the result of accumulating the second shift result and the accumulated result of the high-low products left by N bits to obtain a third shift result;
    shifting the accumulated result of the low-high products left by N bits to obtain a fourth shift result; and
    accumulating the third shift result, the fourth shift result, and the low-order products.
  16. The data processing method according to claim 15, further comprising:
    accumulating the fourth shift result and the low-order products; wherein
    the accumulating the third shift result, the fourth shift result, and the low-order products comprises:
    accumulating the third shift result with the result of accumulating the fourth shift result and the low-order products.
  17. The data processing method according to claim 12 or 13, further comprising:
    outputting the high N bits and low N bits of the first multiplier, and the high N bits and low N bits of the second multiplier.
  18. The data processing method according to claim 17, further comprising:
    constructing a first association relationship, wherein the first association relationship comprises an association between the high N bits of the first multiplier and the first multiplicand, an association between the low N bits of the first multiplier and the first multiplicand, an association between the high N bits of the second multiplier and the second multiplicand, and an association between the low N bits of the second multiplier and the second multiplicand.
  19. The data processing method according to any one of claims 14 to 16, further comprising:
    outputting the high N bits and low N bits of the first multiplier, the high N bits and low N bits of the second multiplier, the high N bits and low N bits of the first multiplicand, and the high N bits and low N bits of the second multiplicand.
  20. The data processing method according to claim 19, further comprising:
    constructing a second association relationship, wherein the second association relationship comprises an association between the high N bits of the first multiplier and the high N bits of the first multiplicand, an association between the high N bits of the first multiplier and the low N bits of the first multiplicand, an association between the low N bits of the first multiplier and the high N bits of the first multiplicand, an association between the low N bits of the first multiplier and the low N bits of the first multiplicand, an association between the high N bits of the second multiplier and the high N bits of the second multiplicand, an association between the high N bits of the second multiplier and the low N bits of the second multiplicand, an association between the low N bits of the second multiplier and the high N bits of the second multiplicand, and an association between the low N bits of the second multiplier and the low N bits of the second multiplicand.
  21. The data processing method according to any one of claims 12 to 20, wherein the first multiplier and the second multiplier are feature layer data and the first multiplicand and the second multiplicand are convolution kernel data, or the first multiplier and the second multiplier are convolution kernel data and the first multiplicand and the second multiplicand are feature layer data.
  22. A field programmable gate array (FPGA), wherein the FPGA comprises the data processing apparatus according to any one of claims 1 to 11.
PCT/CN2020/079431 2020-03-16 2020-03-16 Data processing apparatus and data processing method WO2021184143A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080098682.8A CN115280277A (en) 2020-03-16 2020-03-16 Data processing device and data processing method
PCT/CN2020/079431 WO2021184143A1 (en) 2020-03-16 2020-03-16 Data processing apparatus and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/079431 WO2021184143A1 (en) 2020-03-16 2020-03-16 Data processing apparatus and data processing method

Publications (1)

Publication Number Publication Date
WO2021184143A1 true WO2021184143A1 (en) 2021-09-23

Family

ID=77767930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/079431 WO2021184143A1 (en) 2020-03-16 2020-03-16 Data processing apparatus and data processing method

Country Status (2)

Country Link
CN (1) CN115280277A (en)
WO (1) WO2021184143A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102671A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Inner product convolutional neural network accelerator
CN110109646A (en) * 2019-03-28 2019-08-09 北京迈格威科技有限公司 Data processing method, device and adder and multiplier and storage medium
CN110147252A (en) * 2019-04-28 2019-08-20 深兰科技(上海)有限公司 A kind of parallel calculating method and device of convolutional neural networks
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method

Also Published As

Publication number Publication date
CN115280277A (en) 2022-11-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20926238

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20926238

Country of ref document: EP

Kind code of ref document: A1