CN112612447A - Matrix calculator and full-connection-layer calculation method based on matrix calculator

Info

Publication number: CN112612447A
Application number: CN202011638796.5A
Authority: CN
Inventors: 林广栋; 黄光红; 张笑; 顾大晔
Original assignee: Anhui Core Century Technology Co ltd
Current assignee: Anhui Core Century Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-06
Anticipated expiration: 2040-12-31
Also published as: CN112612447B

Abstract

The invention provides a matrix calculator and a fully-connected layer calculation method based on the matrix calculator. The matrix calculator comprises H rows and W columns of multiply-accumulate units, each comprising a multiplier and an accumulator. Each row of multiply-accumulate units is provided with an addition tree and a row accumulation register; the addition tree calculates the sum of the current calculation results of that row's multiply-accumulate units and accumulates this sum into the row accumulation register. The matrix calculator is provided with a first control circuit, which enables the addition tree and the row accumulation register of each row when the result matrix has only one row or one column, and with a second control circuit, which disables the accumulators in the multiply-accumulate units. The matrix calculator can implement matrix multiplication efficiently, particularly when one dimension of the result matrix is small: for example, when the result matrix has only one row or one column, it still makes efficient use of the hardware multiplier array and thereby improves calculation efficiency.

Description

Matrix calculator and full-connection-layer calculation method based on matrix calculator

Technical Field

The invention relates to the technical field of integrated circuits, in particular to a matrix calculator and a full-connection layer calculation method based on the matrix calculator.

Background

Matrix multiplication is the most basic operation in linear algebra and is commonly applied in the fields of image processing, artificial neural networks, deep learning and the like. Especially in the deep learning field, most of the operations are converted into matrix multiplication operations. For convolution layers, a convolution kernel is expanded, input layers participating in convolution operation are also expanded, and the convolution operation can be converted into ordinary matrix operation.

Matrix operations require a large number of repeated multiply-accumulate calculations, with large amounts of data read in and many calculation results written out. A traditional CPU can perform only one multiplication per cycle and is therefore ill-suited to a computation-intensive algorithm such as matrix multiplication. Various hardware acceleration circuits have been designed for matrix operations; a common approach is to calculate with an array of multiply-accumulate units. Fig. 1 shows a common arrangement of such an array, in which each MAC denotes one multiply-accumulate unit.

The matrix calculator shown in fig. 1 can perform calculations of the form D = A × B + C. Each row of the matrix calculator receives the data of the corresponding row of the left matrix A, and each column receives the data of the corresponding column of the right matrix B. In each calculation cycle, one new datum is input to each row and broadcast to all multiply-accumulate units of that row; at the same time, one new datum is input to each column and broadcast to all multiply-accumulate units of that column. Each multiply-accumulate unit multiplies the data received from the row direction and the column direction and accumulates the product in its local multiply-accumulate result register. By the time all rows of the left matrix A have been input to the matrix calculator, all columns of the right matrix B should also have been input; that is, the number of data in each row of A should equal the number of data in each column of B. At that point, each multiply-accumulate unit holds one element of the result matrix.
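The broadcast-and-accumulate behaviour described above can be sketched as a small cycle-level model. This is only an illustration: the function name `conventional_mac_array` is hypothetical, the operands are assumed to fit the array without blocking, and preloading each local register with the matching element of C is an assumption about how the "+ C" term is realised.

```python
def conventional_mac_array(A, B, C):
    """Cycle-by-cycle model of a Fig. 1 style array computing D = A x B + C.

    Assumes A is H x N, B is N x W and C is H x W, so the operands
    fit the H x W array directly (no blocking).
    """
    H, N, W = len(A), len(B), len(B[0])
    # Assumption: each MAC's local register starts from the matching element of C.
    acc = [[C[i][j] for j in range(W)] for i in range(H)]
    for t in range(N):                          # one cycle per inner-dimension index
        row_in = [A[i][t] for i in range(H)]    # datum broadcast along each row
        col_in = [B[t][j] for j in range(W)]    # datum broadcast along each column
        for i in range(H):
            for j in range(W):
                acc[i][j] += row_in[i] * col_in[j]   # local multiply-accumulate
    return acc, N                               # result matrix and cycles used


D, cycles = conventional_mac_array([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 0], [0, 1]])
print(D, cycles)  # [[20, 22], [43, 51]] 2
```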

Of course, if the number of rows of the matrix A is greater than the number of rows of the matrix calculator and/or the number of columns of the matrix B is greater than the number of columns of the matrix calculator, the matrix A and/or the matrix B must be partitioned into blocks and the result calculated over multiple passes. For example, if the matrix A has M rows and N columns, the matrix B has N rows and K columns, and the matrix calculator has H rows and W columns, then one A × B operation requires at least $\lceil M/H \rceil \times \lceil K/W \rceil \times N$ cycles.

The acceleration performance of the matrix calculator is not ideal when M and K are small and N is large.
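A rough cycle count makes the weakness concrete. The sketch below assumes that each H × W tile of the result takes N cycles and that tiles are processed one after another, ignoring load and store overhead; the function name `conventional_cycles` is hypothetical.

```python
def conventional_cycles(M, N, K, H, W):
    """Estimated cycles for A (M x N) times B (N x K) on an H x W array of the Fig. 1 type.

    Assumption: each H x W tile of the result needs N cycles (one inner-dimension
    index per cycle) and tiles run sequentially; data movement overhead is ignored.
    """
    row_tiles = -(-M // H)   # ceil(M / H)
    col_tiles = -(-K // W)   # ceil(K / W)
    return row_tiles * col_tiles * N


# Small M and K, large N: a much larger array does not reduce the cycle count.
print(conventional_cycles(M=4, N=1024, K=1, H=4, W=8))     # 1024
print(conventional_cycles(M=4, N=1024, K=1, H=64, W=64))   # 1024
```

Under these assumptions the cycle count is bounded below by N, however many multiply-accumulate units the array provides.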

The fully-connected layer is one of the most common layers in deep learning models. For example, in image recognition models based on deep learning, the last layer is almost always a fully-connected layer whose number of outputs equals the number of object classes to be recognized (if the objects to be classified fall into 10 classes, the fully-connected layer should have 10 outputs); after a softmax function is applied, each output represents the probability that the image belongs to a given class.

The fully-connected layer can also be converted into a matrix calculation. If the last layer of the deep learning model contains N elements and the second-to-last layer contains M elements, the calculation of the fully-connected layer is equivalent to calculating the product of an N-row, M-column matrix and an M-row, 1-column matrix, or the product of a 1-row, M-column matrix and an M-row, N-column matrix. For such a calculation the result matrix has only one row or one column, so the matrix calculator shown in fig. 1 can use only one column or one row of its array. This makes it inefficient for fully-connected layer calculation: if the matrix calculator contains H rows and W columns of multiply-accumulate units, the utilization rate is at most 1/H or 1/W.

For the fully-connected layer, the result matrix is just a vector. Whether it is a row vector or a column vector, one of M and K must equal 1, so the matrix calculator effectively operates as a multiplier array with only 1 row or 1 column. The conventional matrix calculator shown in fig. 1 is therefore not well suited to calculating the fully-connected layers of deep learning models.
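The 1/H and 1/W bounds can be checked with a short calculation. This sketch assumes that a result tile occupies min(M, H) rows and min(K, W) columns of the array and that all other multiply-accumulate units sit idle; `array_utilization` is a hypothetical helper name.

```python
def array_utilization(M, K, H, W):
    """Fraction of MAC units doing useful work for an M x K result on an H x W array.

    Assumption: a full result tile occupies min(M, H) rows and min(K, W) columns
    of the array, and every other MAC unit is idle.
    """
    return (min(M, H) * min(K, W)) / (H * W)


print(array_utilization(M=12, K=1, H=4, W=8))   # column-vector result: 0.125 = 1/W
print(array_utilization(M=1, K=10, H=4, W=8))   # row-vector result:    0.25  = 1/H
```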

Disclosure of Invention

Aiming at the defects of the existing matrix calculator, the invention provides the matrix calculator and the full-connection layer calculation method based on the matrix calculator, and the calculation efficiency is improved.

A matrix calculator comprises H rows and W columns of multiply-accumulate units. Each multiply-accumulate unit comprises a multiplier and an accumulator, and is used for receiving data input from the row direction and the column direction, multiplying them, and accumulating the product through an internal accumulation register. Each row of multiply-accumulate units is provided with an addition tree and a row accumulation register, wherein the addition tree is used for calculating the sum of the current calculation results of that row's multiply-accumulate units and accumulating the current sum into the row accumulation register. The matrix calculator is provided with a first control circuit for enabling the addition tree and the row accumulation register of each row when the result matrix has only one row or one column, and with a second control circuit for disabling the accumulators in the multiply-accumulate units.
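As a behavioural illustration of one array row in this mode, the sketch below models the addition tree as a pairwise reduction and the row accumulation register as a running total. The names `adder_tree_sum` and `RowAccumulator` are hypothetical, and padding an odd number of inputs with zero is an assumption about the tree; it is not a description of the actual circuit.

```python
def adder_tree_sum(values):
    """Sum a list by pairwise reduction, level by level, in the manner of an adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:            # odd count: pad with a zero input (assumption)
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


class RowAccumulator:
    """One array row with local accumulators disabled: W multipliers, a tree, a row register."""

    def __init__(self, width):
        self.width = width
        self.row_register = 0         # the row accumulation register

    def cycle(self, row_inputs, col_inputs):
        """Multiply W input pairs, reduce them in the tree, and update the row register."""
        assert len(row_inputs) == len(col_inputs) == self.width
        products = [a * b for a, b in zip(row_inputs, col_inputs)]
        self.row_register += adder_tree_sum(products)
        return self.row_register


row = RowAccumulator(width=4)
row.cycle([1, 2, 3, 4], [1, 1, 1, 1])         # register becomes 10
print(row.cycle([1, 0, 0, 0], [5, 0, 0, 0]))  # 15
```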

Further, the product of a matrix with M rows and N columns and a matrix with N rows and 1 column is calculated by the H rows and W columns of multiply-accumulate units, where M > H and N > W, through the following steps:

step 1, controlling the operation of the addition trees and the row accumulation registers of all rows through a first control circuit, and forbidding the accumulators in all multiply-accumulate units through a second control circuit;

step 2, partitioning the left matrix into blocks of H rows, calculating the product of one block submatrix and the right matrix in each round, and completing the whole matrix multiplication in at most $\lceil M/H \rceil$ rounds, wherein each round of calculation proceeds as follows:

step 2.1, sending H rows of left matrix data to the matrix calculator in each round, each row of data being sent from the row direction to the multiply-accumulate units of the corresponding row, the N data of each row being distributed over the W multiply-accumulate units of the same row, with each multiply-accumulate unit allocated at most $\lceil N/W \rceil$ data;

step 2.2, corresponding to step 2.1, feeding the N data of the 1 column of the right matrix into each row of the matrix calculator from the column direction, the N data being distributed over the W multiply-accumulate units of the same row, with each multiply-accumulate unit allocated at most $\lceil N/W \rceil$ data;

step 2.3, in each cycle, sending only 1 datum to each multiply-accumulate unit from each of the row direction and the column direction; in each cycle, each multiply-accumulate unit in the matrix calculator multiplies the data sent from the row direction and the column direction, then the addition tree sums the products of the multipliers in the same row, adds this sum to the previous accumulation result in the row accumulation register, and writes the new sum back to the row accumulation register; after at most $\lceil N/W \rceil$ cycles, the product of the block submatrix and the right matrix is obtained, this product being a matrix with H rows and 1 column.
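The steps above can be sketched as a cycle-level simulation. This is a minimal sketch, not the hardware: `fc_matvec` is a hypothetical name, each multiplier is assumed to receive a contiguous chunk of ceil(N / W) data, positions past the end of a row are assumed to contribute zero, and the row registers are assumed to be cleared at the start of each round.

```python
def fc_matvec(A, x, H, W):
    """Compute y = A x (A is M x N, x has N entries) on an H x W array in vector mode.

    Returns (y, cycles). Assumptions: multiplier j handles a contiguous chunk of
    ceil(N / W) data, and positions past the end of the row contribute zero.
    """
    M, N = len(A), len(x)
    per_unit = -(-N // W)                  # ceil(N / W) data per multiplier
    y, cycles = [], 0
    for start in range(0, M, H):           # step 2: one round per block of H rows
        block = A[start:start + H]
        row_regs = [0] * len(block)        # row accumulation registers, cleared per round
        for t in range(per_unit):          # step 2.3: one datum per multiplier per cycle
            for i, row in enumerate(block):
                products = []
                for j in range(W):
                    k = j * per_unit + t   # data index handled by multiplier j this cycle
                    products.append(row[k] * x[k] if k < N else 0)
                row_regs[i] += sum(products)   # addition tree, then row register update
            cycles += 1
        y.extend(row_regs)
    return y, cycles


A = [[(i * 7 + j * 3) % 11 - 5 for j in range(32)] for i in range(12)]
x = [(j * 5) % 7 - 3 for j in range(32)]
y, cycles = fc_matvec(A, x, H=4, W=8)
print(cycles)  # 12: three rounds of four cycles each
```

The result can be compared against a plain dot product per row to confirm that the schedule covers every element exactly once.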

Further, if M is an integer multiple of H and N is an integer multiple of W, each round of calculation proceeds as follows:

H rows of left matrix data are sent to the corresponding H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, data $1$ to $\frac{N}{W}$, data $\frac{N}{W}+1$ to $\frac{2N}{W}$, data $\frac{2N}{W}+1$ to $\frac{3N}{W}$, ..., and data $\frac{(W-1)N}{W}+1$ to $N$ of the corresponding row of the left matrix;

the 1 column of right matrix data is sent to each of the H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, data $1$ to $\frac{N}{W}$, data $\frac{N}{W}+1$ to $\frac{2N}{W}$, data $\frac{2N}{W}+1$ to $\frac{3N}{W}$, ..., and data $\frac{(W-1)N}{W}+1$ to $N$ of column 1 of the right matrix;

each multiplier is fed 1 datum per cycle, so receiving its $\frac{N}{W}$ data takes $\frac{N}{W}$ cycles; in each cycle, each multiplier multiplies the data sent from the row direction and the column direction, then the addition tree sums the products of the multipliers in the same row, adds this sum to the previous accumulation result in the row accumulation register, and writes the new sum back to the row accumulation register; after $\frac{N}{W}$ cycles, the accumulated value stored in each row accumulation register is the product of one row of the left matrix and column 1 of the right matrix.
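One way to picture the distribution of data over the W multipliers of a row is to list which data indices each multiplier receives. The sketch below assumes contiguous chunks of N / W data per multiplier, which is an interpretation consistent with the 30-column remark in Example 1, not a quoted formula; `unit_schedule` is a hypothetical name.

```python
def unit_schedule(N, W):
    """Return schedule[j][t]: the 1-based data index fed to multiplier j+1 in cycle t+1.

    Assumes N is a multiple of W and that each multiplier receives one
    contiguous chunk of N // W data (an interpretation, not a quoted formula).
    """
    if N % W:
        raise ValueError("this sketch only covers N divisible by W")
    per_unit = N // W
    return [[j * per_unit + t + 1 for t in range(per_unit)] for j in range(W)]


for j, indices in enumerate(unit_schedule(32, 8), start=1):
    print(f"multiplier {j}: data {indices[0]}..{indices[-1]}")
# multiplier 1: data 1..4
# ...
# multiplier 8: data 29..32
```

The same schedule applies to the row of the left matrix and to the column of the right matrix, so each multiplier always sees a matching pair of elements.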

The matrix calculator provided by the invention can efficiently realize matrix multiplication, particularly under the condition that the result matrix has a smaller dimension, for example, under the condition that the result matrix has only one row or one column, the matrix calculator can efficiently utilize a hardware multiplier array to achieve the effect of improving the calculation efficiency.

Drawings

FIG. 1 is a prior art matrix calculator architecture;

FIG. 2 is a matrix calculator structure of the present invention;

FIG. 3 is a schematic diagram of an internal circuit of the multiply-accumulate unit;

FIG. 4 is a schematic diagram of a left matrix block;

FIG. 5 is a diagram of input data to a row multiplier.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The embodiments of the present invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Example 1

Taking a matrix calculator with 4 rows and 8 columns as an example, the matrix calculator includes 4 rows and 8 columns of multiply-accumulate units MAC, each of which includes a multiplier MUL and an accumulator ADD. Each row of multiply-accumulate units is provided with an adder tree and a row accumulation register, as shown in fig. 2; the adder tree calculates the sum of the current calculation results of that row's multiply-accumulate units and accumulates this sum into the row accumulation register.

The matrix calculator is provided with a first control circuit for enabling the addition tree and the row accumulation register of each row when the result matrix has only one row or one column, and with a second control circuit for disabling the accumulators in the multiply-accumulate units. The internal structure of the multiply-accumulate unit MAC is shown in fig. 3, where module ① represents the input from the row direction, module ② the input from the column direction, module ③ the multiplier, module ④ the output to the row addition tree, module ⑤ the adder, and module ⑥ the multiply-accumulate register, which stores the multiply-accumulate result of the unit. When the first control circuit and the second control circuit are both active, modules ⑤ and ⑥ do not work and module ④ works; when both are inactive, module ④ does not work and modules ⑤ and ⑥ work.
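The two operating modes of a single multiply-accumulate unit can be illustrated with a small behavioural model. The class name `MacUnit` and the single `vector_mode` flag (standing in for both control circuits being active together) are hypothetical simplifications.

```python
class MacUnit:
    """Behavioural model of one multiply-accumulate unit with two operating modes."""

    def __init__(self):
        self.local_register = 0    # the unit's own multiply-accumulate register

    def cycle(self, row_in, col_in, vector_mode):
        """Process one cycle; `vector_mode` stands for both control circuits being active."""
        product = row_in * col_in
        if vector_mode:
            # Local adder and register are bypassed; the product goes to the row's tree.
            return product
        # Conventional mode: accumulate locally, nothing is sent to the tree.
        self.local_register += product
        return None


unit = MacUnit()
print(unit.cycle(2, 3, vector_mode=True), unit.local_register)    # 6 0
print(unit.cycle(2, 3, vector_mode=False), unit.local_register)   # None 6
```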

The following describes the steps by which the matrix calculator of fig. 2 calculates the product of an M-row, N-column matrix and an N-row, 1-column matrix. The invention mainly addresses the case M > H, N > W. Since H = 4 and W = 8 in this embodiment, this embodiment sets, without loss of generality, M = 12 and N = 32; that is, the product of a left matrix with 12 rows and 32 columns and a right matrix with 32 rows and 1 column is calculated by 4 rows and 8 columns of multiply-accumulate units.

First, the adder tree and the row accumulation registers of all rows are controlled to work by the first control circuit, and the accumulators in all multiply-accumulate units are disabled by the second control circuit.

The left matrix has 12 rows, while the matrix calculator has only 4 rows of multiply-accumulate units, so the left matrix is divided into 3 blocks of 4 rows each, as shown in fig. 4. In each round, 4 rows of data are sent to the matrix calculator; the 32 data of each row are distributed over the 8 multiply-accumulate units of the same row, 4 data per unit, with 1 datum sent per cycle, so a round takes 4 cycles. All 12 rows are processed after 3 rounds.

Meanwhile, in each round the 32 data of the 1 column of the right matrix are sent to every row of the matrix calculator and distributed over the 8 multiply-accumulate units of that row; likewise, each multiply-accumulate unit is allocated 4 data, with 1 datum sent per cycle, finishing in 4 cycles.

In each period, each multiply-accumulate unit in the matrix calculator performs multiplication operation on data sent in the row direction (left matrix) and the column direction (right matrix), then the addition tree performs full addition on the calculation result of the multiply-accumulate unit in the same row, sums the calculation result with the last accumulation result in the row accumulation register, and updates the summation result to the row accumulation register, which is shown in fig. 5.

After the 4 cycles of the 1st round, the row accumulation register of row 1 holds the product of row 1 of the left matrix and column 1 of the right matrix, i.e. row 1 of the result matrix; the row accumulation register of row 2 holds the product of row 2 of the left matrix and column 1 of the right matrix, i.e. row 2 of the result matrix; the row accumulation register of row 3 holds the product of row 3 of the left matrix and column 1 of the right matrix, i.e. row 3 of the result matrix; and the row accumulation register of row 4 holds the product of row 4 of the left matrix and column 1 of the right matrix, i.e. row 4 of the result matrix.

After the 4 cycles of the 2nd round, the row accumulation register of row 1 holds the product of row 5 of the left matrix and column 1 of the right matrix, i.e. row 5 of the result matrix; the row accumulation register of row 2 holds the product of row 6 of the left matrix and column 1 of the right matrix, i.e. row 6 of the result matrix; the row accumulation register of row 3 holds the product of row 7 of the left matrix and column 1 of the right matrix, i.e. row 7 of the result matrix; and the row accumulation register of row 4 holds the product of row 8 of the left matrix and column 1 of the right matrix, i.e. row 8 of the result matrix. The 3rd round proceeds in the same way and is not described again.

In this embodiment, M is an integer multiple of H and N is an integer multiple of W. It is easy to see that the invention still applies to fully-connected layer calculation when they are not integer multiples. For example, if the left matrix has 10 rows, only two rows of data need to be input in the 3rd round; if the left matrix has 30 columns, the last multiply-accumulate unit in each row receives data only in the first 2 of the 4 cycles.
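The two non-multiple cases mentioned above can be reproduced with a short calculation. The helper `round_occupancy` is hypothetical and assumes data are assigned to the multipliers of a row in contiguous chunks of ceil(N / W).

```python
def round_occupancy(M, N, H, W):
    """Describe how full each round is when M or N is not a multiple of H or W.

    Returns (rows_per_round, active_cycles_per_unit), assuming data are assigned
    to the W multipliers of a row in contiguous chunks of ceil(N / W).
    """
    per_unit = -(-N // W)                                     # cycles per round
    rows_per_round = [min(H, M - s) for s in range(0, M, H)]  # rows fed in each round
    active = [max(0, min(per_unit, N - j * per_unit)) for j in range(W)]
    return rows_per_round, active


print(round_occupancy(10, 32, 4, 8))  # ([4, 4, 2], [4, 4, 4, 4, 4, 4, 4, 4])
print(round_occupancy(12, 30, 4, 8))  # ([4, 4, 4], [4, 4, 4, 4, 4, 4, 4, 2])
```

With 10 rows the third round carries only two rows, and with 30 columns the last multiplier of each row is active in only 2 of the 4 cycles, matching the description.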

If the left matrix has M rows, it must be partitioned into blocks, and H rows are sent to the matrix calculator in each round. In each cycle, W data from each of the H rows are sent to the corresponding row of multiply-accumulate units of the matrix calculator, so each round needs $\lceil N/W \rceil$ cycles to complete the multiplication of H rows of the left matrix by the 1 column of the right matrix. The operation on the whole matrix is completed after $\lceil M/H \rceil$ rounds, i.e. the matrix operation takes $\lceil M/H \rceil \times \lceil N/W \rceil$ cycles. If a conventional matrix calculator is used instead, the calculation takes $\lceil M/H \rceil \times \lceil 1/W \rceil \times N$, i.e. $\lceil M/H \rceil \times N$, cycles. Compared with the prior art, the matrix calculator provided by the invention can therefore achieve an efficiency improvement of about W times.
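As a worked check with the numbers of Example 1 (M = 12, N = 32, H = 4, W = 8), and assuming cycle counts of $\lceil M/H \rceil \lceil N/W \rceil$ for the proposed calculator and $\lceil M/H \rceil N$ for the conventional one, with data movement overhead ignored:

```latex
\begin{aligned}
T_{\text{proposed}}     &= \left\lceil \tfrac{12}{4} \right\rceil \times \left\lceil \tfrac{32}{8} \right\rceil = 3 \times 4 = 12 \text{ cycles},\\
T_{\text{conventional}} &= \left\lceil \tfrac{12}{4} \right\rceil \times \left\lceil \tfrac{1}{8} \right\rceil \times 32 = 3 \times 1 \times 32 = 96 \text{ cycles},\\
\frac{T_{\text{conventional}}}{T_{\text{proposed}}} &= \frac{N}{\lceil N/W \rceil} = \frac{32}{4} = 8 = W.
\end{aligned}
```

In general the ratio $N / \lceil N/W \rceil$ is at most W and equals W exactly when N is a multiple of W, which is consistent with the "about W times" figure.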

It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by one of ordinary skill in the art and related arts based on the embodiments of the present invention without any creative effort, shall fall within the protection scope of the present invention.

Claims

1. A matrix calculator comprising H rows and W columns of multiply-accumulate units, each multiply-accumulate unit comprising a multiplier and an accumulator, characterized in that each row of multiply-accumulate units is provided with an addition tree and a row accumulation register, the addition tree being used for calculating the sum of the current calculation results of that row's multiply-accumulate units and accumulating the current sum into the row accumulation register;

the matrix calculator is provided with a first control circuit for controlling the addition tree and the row accumulation register of each row to work when the result matrix has only one row or one column; the matrix calculator is provided with a second control circuit for disabling the accumulator in the multiply-accumulate unit.

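2. The fully-connected layer calculation method based on the matrix calculator of claim 1, wherein the product of a matrix with M rows and N columns and a matrix with N rows and 1 column is calculated by the H rows and W columns of multiply-accumulate units, wherein M > H, N > W, comprising the steps of:

step 1, controlling the operation of the addition trees and the row accumulation registers of all rows through the first control circuit, and disabling the accumulators in all multiply-accumulate units through the second control circuit;

step 2, partitioning the left matrix into blocks of H rows, calculating the product of one block submatrix and the right matrix in each round, and completing the whole matrix multiplication in at most $\lceil M/H \rceil$ rounds, wherein each round of calculation proceeds as follows:

step 2.1, sending H rows of left matrix data to the matrix calculator in each round, each row of data being sent from the row direction to the multiply-accumulate units of the corresponding row, the N data of each row being distributed over the W multiply-accumulate units of the same row, with each multiply-accumulate unit allocated at most $\lceil N/W \rceil$ data;

step 2.2, corresponding to step 2.1, feeding the N data of the 1 column of the right matrix into each row of the matrix calculator from the column direction, the N data being distributed over the W multiply-accumulate units of the same row, with each multiply-accumulate unit allocated at most $\lceil N/W \rceil$ data;

step 2.3, in each cycle, sending only 1 datum to each multiply-accumulate unit from each of the row direction and the column direction; in each cycle, each multiply-accumulate unit in the matrix calculator multiplies the data sent from the row direction and the column direction, then the addition tree sums the products of the multipliers in the same row, adds this sum to the previous accumulation result in the row accumulation register, and writes the new sum back to the row accumulation register; after at most $\lceil N/W \rceil$ cycles, the product of the block submatrix and the right matrix is obtained, this product being a matrix with H rows and 1 column.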

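3. The fully-connected layer calculation method according to claim 2, wherein if M is an integer multiple of H and N is an integer multiple of W, each round of calculation proceeds as follows:

H rows of left matrix data are sent to the corresponding H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, data $1$ to $\frac{N}{W}$, data $\frac{N}{W}+1$ to $\frac{2N}{W}$, data $\frac{2N}{W}+1$ to $\frac{3N}{W}$, ..., and data $\frac{(W-1)N}{W}+1$ to $N$ of the corresponding row of the left matrix;

the 1 column of right matrix data is sent to each of the H rows of multipliers of the matrix calculator, and the W multipliers in each row are fed, in order, data $1$ to $\frac{N}{W}$, data $\frac{N}{W}+1$ to $\frac{2N}{W}$, data $\frac{2N}{W}+1$ to $\frac{3N}{W}$, ..., and data $\frac{(W-1)N}{W}+1$ to $N$ of column 1 of the right matrix;

each multiplier is fed 1 datum per cycle, so receiving its $\frac{N}{W}$ data takes $\frac{N}{W}$ cycles; in each cycle, each multiplier multiplies the data sent from the row direction and the column direction, then the addition tree sums the products of the multipliers in the same row, adds this sum to the previous accumulation result in the row accumulation register, and writes the new sum back to the row accumulation register; after $\frac{N}{W}$ cycles, the accumulated value stored in each row accumulation register is the product of one row of the left matrix and column 1 of the right matrix.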