CN112612447B

CN112612447B - Matrix calculator and full-connection layer calculating method based on same

Info

Publication number: CN112612447B
Application number: CN202011638796.5A
Authority: CN
Inventors: 林广栋; 黄光红; 张笑; 顾大晔
Original assignee: Anhui Core Century Technology Co ltd
Current assignee: Anhui Core Century Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2023-12-08
Anticipated expiration: 2040-12-31
Also published as: CN112612447A

Abstract

The invention provides a matrix calculator and a full-connection layer calculating method based on the matrix calculator, wherein the matrix calculator comprises H rows and W columns of multiply-accumulate units, the multiply-accumulate units comprise multipliers and accumulators, an adding tree and a row of accumulate registers are arranged on each row of multiply-accumulate units, and the adding tree is used for calculating the sum of the current calculation results of the row of multiply-accumulate units and accumulating the current sum into the row of accumulate registers; the matrix calculator is provided with a first control circuit for controlling the addition tree and the row accumulation register of each row to work when the result matrix has only one row or one column; the matrix calculator is provided with a second control circuit for disabling the accumulator in the multiply-accumulate unit. The matrix calculator provided by the invention can efficiently realize matrix multiplication, especially when the result matrix has a smaller dimension, for example, when the result matrix has only one row or one column, the invention can efficiently utilize the hardware multiplier array, thereby achieving the effect of improving the calculation efficiency.

Description

Matrix calculator and full-connection layer calculating method based on same

Technical Field

The invention relates to the technical field of integrated circuits, in particular to a matrix calculator and a full-connection layer calculating method based on the matrix calculator.

Background

Matrix multiplication is the most basic operation in linear algebra, and is widely applied in the fields of image processing, artificial neural networks, deep learning and the like. Especially in the field of deep learning, most operations are converted into matrix multiplication operations. For the convolution layers, the convolution kernels are spread, the input layers participating in the convolution operation are also spread, and the convolution operation can be converted into common matrix operation.

Matrix operations require a large number of repeated multiply-accumulate calculations, with large amounts of data read in and large amounts of results written out. Conventional CPUs typically perform only one multiplication per cycle and are therefore poorly suited to computationally intensive algorithms such as matrix multiplication. Various hardware acceleration circuits have been designed for matrix operations; a common approach is to compute with an array of multiply-accumulate units. FIG. 1 shows a common arrangement of such an array, in which each MAC denotes one multiply-accumulate unit.

The matrix calculator shown in FIG. 1 can perform calculations of the form D = A × B + C, where each row of the matrix calculator receives the data of the corresponding row of the left matrix A and each column receives the data of the corresponding column of the right matrix B. In each calculation cycle, one new datum is input to each row and broadcast to all multiply-accumulate units of that row; at the same time, one new datum is input to each column and broadcast to all multiply-accumulate units of that column. Each multiply-accumulate unit multiplies the data received from its row and column directions and accumulates the product into its local multiply-accumulate result register. By the time all row data of the left matrix A have been input to the matrix calculator, all column data of the right matrix B should also have been input; that is, the number of data in each row of the left matrix A must equal the number of data in each column of the right matrix B. At this point, each multiply-accumulate unit stores one element of the result matrix.
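The behaviour of the conventional array described above can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit itself: the function name `conventional_mac_array` and the choice to preload each local register with the corresponding element of C are assumptions made for the sketch, and the left matrix is assumed to fit the array (one array row per row of A, one array column per column of B).

```python
def conventional_mac_array(A, B, C=None):
    """Software model of a conventional MAC array computing D = A x B + C.

    Assumption: A has as many rows as the array and B as many columns,
    so no blocking is needed. One loop iteration over k models one cycle.
    """
    H, N = len(A), len(A[0])
    W = len(B[0])
    assert len(B) == N, "rows of B must match columns of A"
    # Each MAC (i, j) owns one local multiply-accumulate result register.
    # Preloading it with C[i][j] is an assumption of this sketch.
    acc = [[(C[i][j] if C is not None else 0) for j in range(W)]
           for i in range(H)]
    for k in range(N):              # one cycle per inner index k
        for i in range(H):          # A[i][k] is broadcast along row i
            for j in range(W):      # B[k][j] is broadcast along column j
                acc[i][j] += A[i][k] * B[k][j]
    return acc
```

After N cycles every register holds one element of the result matrix, which is why an array with a single active column or row leaves most multipliers idle.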

Of course, if the number of rows of matrix A is greater than the number of rows of the matrix calculator and/or the number of columns of matrix B is greater than the number of columns of the matrix calculator, matrix A and/or matrix B must be partitioned into blocks and the result computed over multiple passes. For example, if matrix A has M rows and N columns, matrix B has N rows and K columns, and the matrix calculator has H rows and W columns, then one A × B operation requires at least $\lceil M/H \rceil \times \lceil K/W \rceil \times N$ cycles. The acceleration performance of the matrix calculator is not ideal when M and/or K is relatively small and N is relatively large.

The fully connected layer is one of the most common layers in the deep learning model, for example, in the image recognition model based on deep learning, the last layer is basically the fully connected layer, the output number of the fully connected layer is equal to the number of the types of the objects to be recognized (if the objects to be classified contain 10 types, the fully connected layer has 10 outputs), and the probability that the representative image is a certain type of image is calculated through a softmax function.

The fully connected layer can also be converted into a matrix calculation. If the last layer of the deep learning model contains N elements and the preceding layer contains M elements, the calculation of the fully connected layer is equivalent to computing the product of an N-row, M-column matrix and an M-row, 1-column matrix, or the product of a 1-row, M-column matrix and an M-row, N-column matrix. After the fully connected layer is converted into a matrix calculation, one of the matrices necessarily has only one row or one column, so the matrix calculator shown in FIG. 1 can use only one column or one row of its array. This makes it inefficient for fully connected layer calculation: if the matrix calculator contains H rows and W columns of multiply-accumulate units, the utilization is at most 1/H or 1/W.

For the fully connected layer, the result matrix is a single vector. Whether it is a row vector or a column vector, either M or K must equal 1, i.e. only 1 row or 1 column of the multipliers in the matrix calculator actually does useful work. The conventional matrix calculator shown in FIG. 1 is therefore not suitable for the calculation of the fully connected layer in a deep learning model.
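The utilization bound can be checked numerically. The following sketch assumes that, in a conventional array, a multiply-accumulate unit is useful only if it holds one element of the M-row, K-column result matrix; the function name `conventional_utilization` is hypothetical and introduced only for illustration.

```python
from fractions import Fraction

def conventional_utilization(M, K, H, W):
    """Fraction of an H x W MAC array that holds a useful result element
    for an M-row, K-column result matrix (assumption: one result element
    per MAC, at most H rows and W columns processed at a time)."""
    active = min(M, H) * min(K, W)
    return Fraction(active, H * W)

# Column-vector result (K = 1) on a 4 x 8 array: only one column works.
print(conventional_utilization(12, 1, 4, 8))
# Row-vector result (M = 1) on a 4 x 8 array: only one row works.
print(conventional_utilization(1, 12, 4, 8))
```

With a column-vector result the utilization is 1/W, and with a row-vector result it is 1/H, matching the bound stated above.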

Disclosure of Invention

Aiming at the defects of the existing matrix calculator, the invention provides a matrix calculator and a full-connection layer calculating method based on the matrix calculator, and the calculating efficiency of the matrix calculator is improved.

A matrix calculator comprises H rows and W columns of multiply-accumulate units, wherein each multiply-accumulate unit comprises a multiplier and an accumulator and is used for receiving data input from the row direction and the column direction, multiplying them, and accumulating the product through an internal accumulation register. An addition tree and a row accumulation register are provided for each row of multiply-accumulate units, the addition tree being used for calculating the sum of the current calculation results of that row of multiply-accumulate units and accumulating the current sum into the row accumulation register. The matrix calculator is provided with a first control circuit for controlling the addition tree and the row accumulation register of each row to work when the result matrix has only one row or one column, and with a second control circuit for disabling the accumulator in the multiply-accumulate unit.

Further, the product of an M-row, N-column matrix and an N-row, 1-column matrix is calculated by the H rows and W columns of multiply-accumulate units, wherein M > H and N > W, comprising the steps of:

step 1, controlling the addition tree and the row accumulation register of all rows to work through a first control circuit, and disabling the accumulator in all multiplication accumulation units through a second control circuit;

step 2, the left matrix is partitioned into blocks of H rows, and each round calculates the product of one block submatrix and the right matrix; at most $\lceil M/H \rceil$ rounds are needed to complete the whole matrix multiplication, wherein each round of calculation comprises the following steps:

step 2.1, in each round, H rows of left matrix data are sent to the matrix calculator; each row of data is sent from the row direction to the multiply-accumulate units of the corresponding row, the N data of each row are distributed over the W multiply-accumulate units of that row, and each multiply-accumulate unit is allocated at most $\lceil N/W \rceil$ data;

step 2.2, corresponding to step 2.1, in each round the N data of the 1 column of the right matrix are sent from the column direction to every row of the matrix calculator; the N data are distributed over the W multiply-accumulate units of the same row, and each multiply-accumulate unit is allocated at most $\lceil N/W \rceil$ data;

step 2.3, in each cycle of each round, only 1 datum is sent to each multiply-accumulate unit from each of the row direction and the column direction; in each cycle, each multiply-accumulate unit in the matrix calculator multiplies the data sent in from the row direction and the column direction, and the addition tree then sums all the products of the multipliers in the same row, adds this sum to the previous accumulated result in the row accumulation register, and writes the result back to the row accumulation register; after at most $\lceil N/W \rceil$ cycles, the product of the block submatrix and the right matrix is obtained, which is an H-row, 1-column matrix.
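Steps 1 to 2.3 can be illustrated with a small software model. The sketch below is not the hardware; it makes two assumptions that the text does not state explicitly: each multiply-accumulate unit receives a contiguous chunk of at most ceil(N/W) data from its row, and the row accumulation registers are cleared at the start of every round. The function name `fc_matvec_adder_tree` is hypothetical.

```python
def fc_matvec_adder_tree(A, b, H, W):
    """Model y = A x b on an H x W array working in addition-tree mode.

    A is an M x N list of lists, b a list of N values.
    Returns (y, cycles), where cycles counts the compute cycles used.
    """
    M, N = len(A), len(A[0])
    assert len(b) == N
    chunk = -(-N // W)                  # ceil(N / W) data per MAC
    y, cycles = [], 0
    for start in range(0, M, H):        # step 2: one round per block of H rows
        block = A[start:start + H]
        row_acc = [0] * len(block)      # row accumulation registers (cleared: assumption)
        for c in range(chunk):          # step 2.3: one datum per MAC per cycle
            for r, row in enumerate(block):
                products = []
                for j in range(W):      # MAC j holds items j*chunk .. j*chunk+chunk-1
                    idx = j * chunk + c
                    if idx < N:         # a MAC with no datum left stays idle
                        products.append(row[idx] * b[idx])
                # addition tree sums the row, then adds to the row register
                row_acc[r] += sum(products)
            cycles += 1
        y.extend(row_acc)
    return y, cycles
```

For a 12-row, 32-column left matrix on a 4-row, 8-column array, this model uses 3 rounds of 4 cycles each, and it also handles sizes that are not integer multiples of H and W by leaving some rows or units idle.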

Further, if M is an integer multiple of H and N is an integer multiple of W, each round of calculation proceeds as follows:

the H rows of left matrix data are sent to the corresponding H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, the 1st to $(N/W)$-th data, the $(N/W+1)$-th to $(2N/W)$-th data, the $(2N/W+1)$-th to $(3N/W)$-th data, ……, and the $((W-1)N/W+1)$-th to $N$-th data of the corresponding row of the left matrix;

correspondingly, the 1 column of right matrix data is sent to the H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, the 1st to $(N/W)$-th data, the $(N/W+1)$-th to $(2N/W)$-th data, the $(2N/W+1)$-th to $(3N/W)$-th data, ……, and the $((W-1)N/W+1)$-th to $N$-th data of the 1 column of the right matrix;

each multiplier receives 1 datum per cycle, so $N/W$ data require $N/W$ cycles; in each cycle, each multiplier multiplies the data sent in from the row direction and the column direction, and the addition tree then sums all the products of the multipliers in the same row, adds this sum to the previous accumulated result in the row accumulation register, and writes the result back to the row accumulation register; after $N/W$ cycles, the accumulated value stored in each row accumulation register is the product of one row of the left matrix and the 1st column of the right matrix.
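As a check on this procedure, the value held in a row accumulation register can be written out explicitly. The notation here is introduced only for illustration: $a_{i,k}$ is the element in row $i$, column $k$ of the left matrix, $b_k$ is the $k$-th element of the right matrix column, $y_i$ is the final content of the register for row $i$, $c$ indexes the cycle and $j$ indexes the multiplier within the row. Assuming each multiplier receives a contiguous group of $N/W$ data, as in the integer-multiple case above:

$$
y_i \;=\; \sum_{c=1}^{N/W} \; \sum_{j=1}^{W} a_{i,\,(j-1)N/W + c} \; b_{(j-1)N/W + c} \;=\; \sum_{k=1}^{N} a_{i,k}\, b_k
$$

The inner sum over $j$ is what the addition tree produces in cycle $c$, and the outer sum over $c$ is the accumulation in the row register. Because the index $(j-1)N/W + c$ takes every value from 1 to $N$ exactly once, the double sum equals the ordinary dot product of row $i$ with the right matrix column.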

The matrix calculator provided by the invention can efficiently realize matrix multiplication, especially when one dimension of the result matrix is small. For example, when the result matrix has only one row or one column, the invention makes efficient use of the hardware multiplier array and thereby improves calculation efficiency. This is exactly the situation that arises in the fully connected layer calculation of a deep learning model, where the result matrix obtained after conversion into a matrix calculation has only one row or one column.

Drawings

FIG. 1 is a diagram of a prior art matrix calculator architecture;

FIG. 2 is a diagram of a matrix calculator architecture of the present invention;

FIG. 3 is a schematic diagram of the internal circuit of the multiply-accumulate unit;

FIG. 4 is a block diagram of a left matrix;

FIG. 5 is a schematic diagram of input data for a row of multipliers.

Detailed Description

The invention will be described in further detail with reference to the drawings and the detailed description. The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Example 1

Taking a 4 row and 8 column matrix calculator as an example, the matrix calculator comprises 4 rows and 8 columns of multiply-accumulate units MAC, wherein the multiply-accumulate units MAC comprise multipliers MUL and accumulators ADD, an adding tree and a row of accumulation registers are arranged on each row of multiply-accumulate units, and the adding tree is used for calculating the sum of the current calculation results of the row of multiply-accumulate units and adding the current sum to the row of accumulation registers as shown in fig. 2.

The matrix calculator is provided with a first control circuit for controlling the addition tree and the row accumulation register of each row to work when the result matrix has only one row or one column; the matrix calculator is provided with a second control circuit for disabling the accumulator in the multiply-accumulate unit. The internal structure of the multiply-accumulate unit MAC is shown in fig. 3, where block (1) represents the input from the row direction, block (2) represents the input from the column direction, block (3) represents the multiplier, block (4) represents the output to the row-add tree, block (5) represents the adder, block (6) represents the multiply-accumulate register, and the multiply-accumulate result of one multiply-accumulate unit is stored. When both the first control circuit and the second control circuit are active, the modules (5) (6) are inactive, and the modules (1) (2) (3) (4) are active; when both the first control circuit and the second control circuit are inactive, the module (4) is inactive and the modules (1) (2) (3) (5) (6) are active.
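The two operating modes of a multiply-accumulate unit can be sketched as a small Python class. This is a simplified model under stated assumptions: the two control circuits are collapsed into a single boolean `tree_mode` (true when both circuits are active, false when both are inactive), and the class and method names are hypothetical. The comments refer to the module numbers of FIG. 3 as described above.

```python
class MacUnit:
    """Simplified model of one multiply-accumulate unit (MAC)."""

    def __init__(self):
        self.acc = 0  # module (6): local multiply-accumulate register

    def step(self, row_in, col_in, tree_mode):
        """Process one cycle with inputs from modules (1) and (2)."""
        product = row_in * col_in        # module (3): multiplier
        if tree_mode:
            # Both control circuits active: modules (5) and (6) are idle,
            # and the product leaves through module (4) to the row tree.
            return product
        # Both control circuits inactive: module (4) is idle, and the
        # adder (5) accumulates the product into the local register (6).
        self.acc += product
        return None
```

In tree mode the local register is never updated, so the per-row sum must be collected by the addition tree and row accumulation register outside the unit.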

The operation of the matrix calculator of FIG. 2 when calculating the product of an M-row, N-column matrix and an N-row, 1-column matrix is discussed below. The present invention mainly addresses the case where M > H and N > W. Since H = 4 and W = 8 in this embodiment, without loss of generality this embodiment sets M = 12 and N = 32; that is, the product of a 12-row, 32-column left matrix and a 32-row, 1-column right matrix is calculated by the 4 rows and 8 columns of multiply-accumulate units.

First, the first control circuit controls the operation of the addition tree and the row accumulation register of all rows, and the second control circuit disables the accumulators in all multiply-accumulate units.

The left matrix has 12 rows, whereas the matrix calculator has only 4 rows of multiply-accumulate units. The left matrix is therefore divided into 3 blocks of 4 rows each, as shown in FIG. 4. In each round, 4 rows of data are sent to the matrix calculator; the 32 data of each row are distributed over the 8 multiply-accumulate units of the corresponding row, so that each multiply-accumulate unit is allocated 4 data, which are sent in 1 per cycle over 4 cycles. All 12 rows are processed after 3 rounds.

At the same time, in each round the 32 data of the 1 column of the right matrix are sent to every row of the matrix calculator and distributed over the 8 multiply-accumulate units of that row; likewise, each multiply-accumulate unit is allocated 4 data, which are sent in 1 per cycle over 4 cycles.

Each multiplication and accumulation unit in the matrix calculator multiplies the data fed in the row direction (left matrix) and the column direction (right matrix) every period, then the addition tree fully adds the calculation results of the same-row multiplication and accumulation units, sums the calculation results with the last accumulation result in the row accumulation register, and updates the summation result to the row accumulation register, referring to fig. 5.

After the 4 cycles of the 1st round, the value stored in the 1st row accumulation register is the product of the 1st row of the left matrix and the 1st column of the right matrix, namely the 1st row of the result matrix; the value stored in the 2nd row accumulation register is the product of the 2nd row of the left matrix and the 1st column of the right matrix, namely the 2nd row of the result matrix; the value stored in the 3rd row accumulation register is the product of the 3rd row of the left matrix and the 1st column of the right matrix, namely the 3rd row of the result matrix; and the value stored in the 4th row accumulation register is the product of the 4th row of the left matrix and the 1st column of the right matrix, namely the 4th row of the result matrix.

After the 4 cycles of the 2nd round, the value stored in the 1st row accumulation register is the product of the 5th row of the left matrix and the 1st column of the right matrix, namely the 5th row of the result matrix; the value stored in the 2nd row accumulation register is the product of the 6th row of the left matrix and the 1st column of the right matrix, namely the 6th row of the result matrix; the value stored in the 3rd row accumulation register is the product of the 7th row of the left matrix and the 1st column of the right matrix, namely the 7th row of the result matrix; and the value stored in the 4th row accumulation register is the product of the 8th row of the left matrix and the 1st column of the right matrix, namely the 8th row of the result matrix. The 3rd round proceeds in the same way and is not described again.

In this embodiment, M is an integer multiple of H and N is an integer multiple of W. It is easy to see that the application of the present invention to fully connected layer calculation is not affected even when these are not integer multiples. For example, if the left matrix has 10 rows, the 3rd round only needs to input two rows of data; if the left matrix has 30 columns, the last multiply-accumulate unit of each row only needs to receive data in the first 2 of the 4 cycles.
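The two non-multiple examples can be reproduced with a short sketch. The helper names `rows_per_round` and `items_per_mac` are hypothetical, and the sketch assumes each multiply-accumulate unit is assigned a contiguous chunk of at most ceil(N/W) data, which is consistent with the statement that only the last unit of a row runs short.

```python
def rows_per_round(M, H):
    """Number of left-matrix rows fed to the array in each round."""
    return [min(H, M - start) for start in range(0, M, H)]

def items_per_mac(N, W):
    """Number of data items each of the W MACs in a row receives,
    assuming contiguous chunks of ceil(N / W) items."""
    chunk = -(-N // W)  # ceil(N / W)
    return [max(0, min(chunk, N - j * chunk)) for j in range(W)]

print(rows_per_round(10, 4))   # a 10-row left matrix on a 4-row array
print(items_per_mac(30, 8))    # a 30-column left matrix on an 8-column array
```

For 10 rows and H = 4 the third round carries only two rows, and for 30 columns and W = 8 the last unit of each row receives only two data, so it is active in the first 2 of the 4 cycles.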

If the left matrix has M rows, it must be partitioned into blocks, and in each round H rows of the left matrix are sent to the matrix calculator. In each cycle, W data from each of the H rows are sent to the corresponding rows of the H rows of multiply-accumulate units, so each round requires $\lceil N/W \rceil$ cycles to complete the multiplication of H rows of the left matrix by the 1 column of the right matrix. After $\lceil M/H \rceil$ rounds the whole matrix operation is complete, i.e. a total of $\lceil M/H \rceil \times \lceil N/W \rceil$ cycles are needed. If a conventional matrix calculator is used instead, $\lceil M/H \rceil \times \lceil 1/W \rceil \times N$, i.e. $\lceil M/H \rceil \times N$, cycles are needed. By comparison, the matrix calculator of the present invention achieves an efficiency improvement of about W times.
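The cycle counts can be compared numerically for the embodiment. The sketch below assumes the totals are ceil(M/H) × ceil(N/W) cycles for the calculator with addition trees and ceil(M/H) × N cycles for a conventional array that can use only one column; both formulas are inferred from the stated speed-up of about W times, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def adder_tree_cycles(M, N, H, W):
    """Assumed cycle count with addition trees: rounds x cycles per round."""
    return ceil_div(M, H) * ceil_div(N, W)

def conventional_cycles(M, N, H):
    """Assumed cycle count for a conventional array with a 1-column
    right matrix: one column of MACs works, N cycles per block of H rows."""
    return ceil_div(M, H) * N

M, N, H, W = 12, 32, 4, 8
new = adder_tree_cycles(M, N, H, W)
old = conventional_cycles(M, N, H)
print(new, old, old / new)
```

For M = 12, N = 32, H = 4 and W = 8 the ratio equals W exactly because N is a multiple of W; when it is not, the rounding in ceil(N/W) makes the gain slightly smaller than W.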

It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without inventive effort are intended to fall within the scope of the present invention.

Claims

1. A full connection layer computing method of a matrix calculator is characterized in that,

the matrix calculator comprises H rows and W columns of multiplication and accumulation units, wherein the multiplication and accumulation units comprise multipliers and accumulators, and the matrix calculator is characterized in that an addition tree and a row of accumulation registers are arranged on each row of multiplication and accumulation unit, and the addition tree is used for calculating the sum of the current calculation results of the row of multiplication and accumulation units and accumulating the current sum into the row of accumulation registers;

the matrix calculator is provided with a first control circuit for controlling the addition tree and the row accumulation register of each row to work when the result matrix has only one row or one column; the matrix calculator is provided with a second control circuit for disabling the accumulator in the multiply-accumulate unit;

calculating the product of an M-row, N-column matrix and an N-row, 1-column matrix by the H rows and W columns of multiply-accumulate units, wherein M > H and N > W, comprising the following steps:

step 2.3, in each cycle of each round, only 1 datum is sent to each multiply-accumulate unit from each of the row direction and the column direction; in each cycle, each multiply-accumulate unit in the matrix calculator multiplies the data sent in from the row direction and the column direction, and the addition tree then sums all the products of the multipliers in the same row, adds this sum to the previous accumulated result in the row accumulation register, and writes the result back to the row accumulation register; after at most $\lceil N/W \rceil$ cycles, the product of the block submatrix and the right matrix is obtained, which is an H-row, 1-column matrix.

2. The method of claim 1, wherein if M is an integer multiple of H and N is an integer multiple of W, each round of calculation proceeds as follows:

the H rows of left matrix data are sent to the corresponding H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, the 1st to $(N/W)$-th data, the $(N/W+1)$-th to $(2N/W)$-th data, the $(2N/W+1)$-th to $(3N/W)$-th data, ……, and the $((W-1)N/W+1)$-th to $N$-th data of the corresponding row of the left matrix;

each multiplier receives 1 datum per cycle, so $N/W$ data require $N/W$ cycles; in each cycle, each multiplier multiplies the data sent in from the row direction and the column direction, and the addition tree then sums all the products of the multipliers in the same row, adds this sum to the previous accumulated result in the row accumulation register, and writes the result back to the row accumulation register; after $N/W$ cycles, the accumulated value stored in each row accumulation register is the product of one row of the left matrix and the 1st column of the right matrix.