CN109961131B - Neural network forward operation method and related product - Google Patents
Neural network forward operation method and related product
- Publication number
- CN109961131B (application CN201711347407.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- matrix
- processing circuit
- input data
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present disclosure provides a method for performing a forward operation of a neural network on an integrated circuit chip device, the neural network comprising a plurality of layers, the method comprising: receiving a first calculation instruction, and analyzing the first calculation instruction to obtain a first operation contained in the ith layer of the forward operation, input data corresponding to the first calculation instruction, and weight data; determining a first complexity of the first operation according to the input data, the weight data and the first operation, and determining, according to the first complexity, a first data type of the input data and the weight data when the first operation is executed, wherein the first data type comprises a floating-point type or a fixed-point type; and executing the first operation contained in the ith layer of the forward operation on the input data and the weight data according to the first data type. The technical scheme provided by the disclosure has the advantages of a small amount of calculation and low power consumption.
Description
Technical Field
The present disclosure relates to the field of neural networks, and more particularly, to a method for forward operation of a neural network and a related product.
Background
Artificial Neural Networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. From an information-processing perspective, an ANN abstracts the neuron network of the human brain, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is also often referred to directly as a neural network or neural-like network. A neural network is an operational model formed by connecting a large number of nodes (or neurons). Existing neural network operation relies on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) to realize the forward operation of the neural network, and this forward operation involves a large amount of calculation and high power consumption.
Disclosure of Invention
The embodiments of the present disclosure provide a neural network forward operation method and a related product, which can improve the processing speed and efficiency of a computing device.
In a first aspect, a method is provided for performing a forward operation of a neural network on an integrated circuit chip device, the neural network comprising a plurality of layers, the method comprising:
receiving a first calculation instruction, and analyzing the first calculation instruction to obtain a first operation contained in the ith layer of the forward operation of the first calculation instruction, input data corresponding to the first calculation instruction and weight data; the value range of i is an integer greater than or equal to 1, and if i is greater than or equal to 2, the input data is output data of the (i-1) th layer;
determining a first complexity of a first operation according to the input data, the weight data and the first operation, and determining a first data type of the input data and the weight data when the first operation is executed according to the first complexity, wherein the first data type comprises: a floating point type or a fixed point type;
and executing the first operation contained in the ith layer of the forward operation on the input data and the weight data according to the first data type.
In a second aspect, an integrated circuit chip device is provided for performing a forward operation of a neural network, the neural network including a plurality of layers, the device comprising: a processing circuit and an external interface;
the external interface is used for receiving a first calculation instruction;
the processing circuit is used for analyzing a first calculation instruction to obtain a first operation contained in the ith layer of the forward operation by the first calculation instruction, input data corresponding to the first calculation instruction and weight data; the value range of i is an integer greater than or equal to 1, and if i is greater than or equal to 2, the input data is output data of the (i-1) th layer;
the processing circuit is further configured to determine a first complexity of a first operation according to the input data, the weight data, and the first operation, and determine a first data type of the input data and the weight data when the first operation is performed according to the first complexity, where the first data type includes: a floating point type or a fixed point type;
the processing circuit is further used for executing a first operation contained in the ith layer of the forward operation on the input data and the weight data according to a first data type.
In a third aspect, a neural network computing device is provided, which includes one or more integrated circuit chip devices provided in the second aspect.
In a fourth aspect, there is provided a combined processing apparatus comprising: the neural network arithmetic device, the universal interconnection interface and the universal processing device are provided by the third aspect;
the neural network operation device is connected with the general processing device through the general interconnection interface.
In a fifth aspect, there is provided a chip integrating the apparatus of the second aspect, the apparatus of the third aspect, or the apparatus of the fourth aspect.
In a sixth aspect, an electronic device is provided, the electronic device comprising the chip of the fifth aspect.
It can be seen that, in the embodiments of the present disclosure, a data type conversion circuit is provided to convert the type of a data block before the operation is performed, which saves transmission resources and calculation resources; the solution therefore has the advantages of low power consumption and a small amount of calculation.
Drawings
Fig. 1 is a schematic diagram of a forward operation of a neural network.
FIG. 1a is a schematic block diagram of a fixed point data type.
Fig. 2a is a schematic diagram of convolved input data.
Fig. 2b is a schematic diagram of a convolution kernel.
FIG. 2c is a diagram of an operation window of a three-dimensional data block of input data.
FIG. 2d is a diagram of another operation window of a three-dimensional data block of input data.
FIG. 2e is a diagram of another operation window of a three-dimensional data block of input data.
Fig. 3a is a schematic structural diagram of a neural network chip.
FIG. 3b is a schematic diagram of another neural network chip.
Fig. 4a is a schematic diagram of matrix multiplication.
Fig. 4b is a flow chart of a method for multiplying a matrix by a matrix.
FIG. 4c is a diagram of a matrix multiplied by a vector.
FIG. 4d is a flow chart of a method for multiplying a matrix by a vector.
Fig. 5a is a schematic structural diagram of a combined processing device according to the disclosure.
Fig. 5b is a schematic view of another structure of a combined processing device disclosed in the present disclosure.
Fig. 5c is a schematic structural diagram of a neural network processor board card according to an embodiment of the present disclosure;
fig. 5d is a schematic structural diagram of a neural network chip package structure according to an embodiment of the present disclosure;
fig. 5e is a schematic structural diagram of a neural network chip according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a neural network chip package structure according to an embodiment of the disclosure;
fig. 6a is a schematic diagram of another neural network chip package structure according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those skilled in the art, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
In the method provided in the first aspect, determining a first data type of the input data and the weight data when performing the first operation according to the first complexity includes:
and comparing the first complexity with a preset threshold, if the first complexity is higher than the preset threshold, determining that the first data type is a fixed-point type, and if the first complexity is lower than or equal to the preset threshold, determining that the first data type is a floating-point type.
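By way of illustration only, the threshold rule above can be sketched as follows; the function name and the threshold value are assumptions made for the example and are not prescribed by the disclosure:

```python
# Minimal sketch of the data type selection rule described above (assumed names/values).
PRESET_THRESHOLD = 1e8  # assumed value; in practice chosen per device and per layer

def select_first_data_type(first_complexity):
    # complexity above the threshold -> fixed-point type; otherwise floating-point type
    return "fixed-point" if first_complexity > PRESET_THRESHOLD else "floating-point"
```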
In the method provided by the first aspect, after the determining the first data type of the input data and the weight data when performing the first operation according to the first complexity, the method further includes:
determining that the input data and the weight data belong to a second data type, and if the second data type is different from the first data type, converting the input data belonging to the second data type and the weight data belonging to the second data type into the input data belonging to the first data type and the weight data belonging to the first data type.
In the method provided by the first aspect, if the first operation is a convolution operation, the input data is convolution input data, the weight data is a convolution kernel,
the first complexity = α × C × KW × KH × M × N × W × C × H;
wherein α is a convolution coefficient with a value greater than 1; C, KW, KH and M are the values of the four dimensions of the convolution kernel; and N, W, C and H are the values of the four dimensions of the convolution input data;
if the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are fixed-point data; if the convolution input data and the convolution kernel are not fixed-point data, the convolution input data is converted into fixed-point data, the convolution kernel is converted into fixed-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed-point data type.
In the method provided in the first aspect, if the first operation is a matrix-multiplied-by-matrix operation, the input data is a first matrix of the matrix-by-matrix operation, and the weight is a second matrix of the matrix-by-matrix operation;
the first complexity = β × F1 × G × E × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and E and F2 are the row and column values of the second matrix;
if the first complexity is greater than the set threshold, it is determined whether the first matrix and the second matrix are fixed-point data; if the first matrix and the second matrix are not fixed-point data, the first matrix is converted into fixed-point data, the second matrix is converted into fixed-point data, and the matrix-by-matrix operation is then performed on the first matrix and the second matrix in the fixed-point data type.
In the method provided in the first aspect, if the first operation is a matrix-multiplied-by-vector operation, the input data is a first matrix of the matrix-by-vector operation, and the weight is a vector of the matrix-by-vector operation;
the first complexity = β × F1 × G × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and F2 is the column value of the vector;
if the first complexity is greater than the set threshold, it is determined whether the first matrix and the vector are fixed-point data; if the first matrix and the vector are not fixed-point data, the first matrix is converted into fixed-point data, the vector is converted into fixed-point data, and the matrix-by-vector operation is then performed on the first matrix and the vector in the fixed-point data type.
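The three complexity formulas above can be written out, purely for illustration, in the following sketch; the dimension names follow the text, and the helper functions are hypothetical rather than part of the claimed method:

```python
def conv_complexity(alpha, C, KW, KH, M, N, W, H):
    # first complexity = alpha * C * KW * KH * M * N * W * C * H
    return alpha * C * KW * KH * M * N * W * C * H

def matmul_complexity(beta, F1, G, E, F2):
    # first complexity = beta * F1 * G * E * F2
    return beta * F1 * G * E * F2

def matvec_complexity(beta, F1, G, F2):
    # first complexity = beta * F1 * G * F2
    return beta * F1 * G * F2
```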
In the method provided in the first aspect, the ith layer may further include the following operations: one or any combination of bias operation, full connection operation, GEMM operation, GEMV operation and activation operation.
In the apparatus provided in the second aspect, the processing circuit is specifically configured to compare the first complexity with a preset threshold, and if the first complexity is higher than the preset threshold, the computing device determines that the first data type is a fixed-point type, and if the first complexity is lower than or equal to the preset threshold, the computing device determines that the first data type is a floating-point type.
In the apparatus provided in the second aspect, the integrated circuit chip apparatus further includes: a data type conversion circuit;
the processing circuit is further configured to determine a second data type to which the input data and the weight data belong, and to send a conversion command to the data type conversion circuit if the second data type is different from the first data type;
the data type conversion circuit is configured to convert the input data belonging to the second data type and the weight data belonging to the second data type into the input data belonging to the first data type and the weight data belonging to the first data type according to the conversion command.
In the apparatus provided in the second aspect, if the first operation is a convolution operation, the input data is convolution input data, the weight data is a convolution kernel,
the processing circuit is configured to calculate the first complexity = α × C × KW × KH × M × N × W × C × H;
wherein α is a convolution coefficient with a value greater than 1; C, KW, KH and M are the values of the four dimensions of the convolution kernel; and N, W, C and H are the values of the four dimensions of the convolution input data;
the processing circuit is further configured to, if the first complexity is greater than a set threshold, determine whether the convolution input data and the convolution kernel are fixed-point data, and if the convolution input data and the convolution kernel are not fixed-point data, convert the convolution input data into fixed-point data, convert the convolution kernel into fixed-point data, and then perform the convolution operation on the convolution input data and the convolution kernel in the fixed-point data type.
In the apparatus provided in the second aspect, the first operation is: matrix multiplication matrix operation, wherein the input data is a first matrix of the matrix multiplication matrix operation, and the weight is a second matrix of the matrix multiplication matrix operation;
the processing circuit is used for calculating a first complexity;
the first complexity = β × F1 × G × E × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and E and F2 are the row and column values of the second matrix;
the processing circuit is further configured to, if the first complexity is greater than a set threshold, determine whether the first matrix and the second matrix are fixed-point data, and if the first matrix and the second matrix are not fixed-point data, convert the first matrix into fixed-point data, convert the second matrix into fixed-point data, and then perform the matrix-by-matrix operation on the first matrix and the second matrix in the fixed-point data type.
In the apparatus provided in the second aspect, the first operation is: performing matrix multiplication vector operation, wherein the input data is a first matrix of the matrix multiplication vector operation, and the weight is a vector of the matrix multiplication vector operation;
the processing circuit is used for calculating a first complexity;
the first complexity = β × F1 × G × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and F2 is the column value of the vector;
the processing circuit is further configured to, if the first complexity is greater than a set threshold, determine whether the first matrix and the vector are fixed-point data, and if the first matrix and the vector are not fixed-point data, convert the first matrix into fixed-point data, convert the vector into fixed-point data, and then perform the matrix-by-vector operation on the first matrix and the vector in the fixed-point data type.
In the apparatus provided in the second aspect, the ith layer may further include the following operations: one or any combination of a bias operation, a full connection operation, a GEMM operation, a GEMV operation and an activation operation.
As shown in fig. 1, for the forward operation of the neural network provided by the embodiment of the present disclosure, each layer uses its own input data and weight to calculate according to the operation rule specified by the type of the layer to obtain corresponding output data;
the forward operation process (also called inference) of the neural network is a process of processing input data of each layer by layer and obtaining output data through certain calculation, and has the following characteristics:
input to a certain layer:
the input of a certain layer can be input data of a neural network;
the input of a certain layer may be the output of other layers;
the input of a certain layer may be the output of the same layer at the previous time (the case of a recurrent neural network);
a layer may obtain input from a plurality of said input sources simultaneously;
output of a certain layer:
the output of a certain layer can be used as the output result of the neural network;
the output of a certain layer may be the input of other layers;
the output of a layer may be the input of the layer at the next time (in the case of a recurrent neural network);
the output of a certain layer can output results to the plurality of output directions;
specifically, the types of operations of the layers in the neural network include, but are not limited to, the following:
convolutional layers (i.e., performing convolution operations);
fully-connected layers (i.e., performing fully-connected operations);
normalization (regularization) layers: including the LRN (Local Response Normalization) layer, the BN (Batch Normalization) layer, etc.;
a pooling layer;
activation layers: including but not limited to the following types: Sigmoid layer, ReLU layer, PReLU layer, LeakyReLU layer, Tanh layer;
the backward operations of the layers: each layer's backward operation needs to perform two parts of computation. One part uses the output data gradients (which may be sparsely represented) and the input data (which may be sparsely represented) to calculate the gradients of the weights (used to update the weights of this layer in the "weight update" step); the other part uses the output data gradients (which may be sparsely represented) and the weights (which may be sparsely represented) to calculate the gradients of the input data (used as the output data gradients of the next layer of the backward operation, so that it can perform its own backward operation);
the backward operation propagates the gradients back from the last layer, in the reverse order of the forward operation.
In one alternative, the inverse calculated output data gradient for a layer may be from:
gradient returned by last loss function (lost function or cost function) of the neural network;
input data gradients for other layers;
the input data gradient of this layer at the previous time (the case of a recurrent neural network);
a layer may simultaneously acquire output data gradients from a plurality of said sources;
after the backward operation of the neural network is executed, the weight gradient of each layer has been calculated; in this step, the first input cache and the second input cache of the device store the weights of the layer and the weight gradients respectively, and the weight gradients are then used in the operation unit to update the weights;
in the forward operation, after the artificial neural network of the previous layer has finished executing, the operation instruction of the next layer takes the output data calculated in the operation unit as the input data of the next layer (or performs some operation on the output data and then uses it as the input data of the next layer), and at the same time the weights are replaced by the weights of the next layer; in the backward operation, after the backward operation of the artificial neural network of the previous layer has finished, the operation instruction of the next layer takes the input data gradients calculated in the operation unit as the output data gradients of the next layer (or performs some operation on the input data gradients and then uses them as the output data gradients of the next layer), and at the same time the weights are replaced by the weights of the next layer. (In the following drawings, dashed arrows indicate the backward operation, solid arrows indicate the forward operation, and the labels below the drawings indicate their meanings.)
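As a purely illustrative sketch (not part of the disclosure), the two parts of a layer's backward operation can be written for a simple fully-connected layer y = x·w as follows; the variable names are assumptions:

```python
import numpy as np

def fc_backward(x, w, dy):
    """Backward operation of a fully-connected layer y = x @ w (illustrative only).

    x:  input data of the layer,              shape (batch, in_features)
    w:  weights of the layer,                 shape (in_features, out_features)
    dy: gradient of the layer's output data,  shape (batch, out_features)
    """
    dw = x.T @ dy   # gradient of the weights, used in the "weight update" step
    dx = dy @ w.T   # gradient of the input data, passed back as the previous
                    # layer's output data gradient
    return dw, dx
```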
Fixed point data representation method
The fixed-point method is a method of converting the data representation of a certain data block in the network into a representation with a specific, fixed decimal point position (which determines how the 0/1 bits of the data are mapped onto the circuit device);
in one alternative scheme, a plurality of data are combined into a data block as a whole to be represented in a fixed point mode by using the same fixed point representation method;
FIG. 1a illustrates a specific representation of the short-bit fixed-point data structure used for storing data according to an embodiment of the present disclosure. 1 bit is used to represent the sign, M bits to represent the integer part, and N bits to represent the fractional part. Compared with a 32-bit floating-point data representation, the short-bit fixed-point representation adopted here occupies fewer bits, and, for data of the same layer and the same type in the neural network (for example, all the weight data of the first convolution layer), a flag bit, Point Location, is additionally provided to record the position of the decimal point, so that the precision of the data representation and the representable data range can be adjusted according to the distribution of the actual data.
A floating-point number is represented with 32 bits; in this technical scheme, using fixed-point numbers reduces the number of bits of a single value, thereby reducing the amount of data transmitted and the amount of data operated on.
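A minimal sketch of such a short-bit fixed-point conversion is given below; the bit width and Point Location values are assumptions chosen for illustration, not values prescribed by the disclosure:

```python
def to_fixed(value, total_bits=16, point_location=8):
    # Quantize a floating-point value to a short-bit fixed-point integer.
    # point_location = number of fractional bits (the recorded decimal point position);
    # total_bits includes the 1 sign bit.
    q = round(value * (1 << point_location))
    max_q = (1 << (total_bits - 1)) - 1
    min_q = -(1 << (total_bits - 1))
    return max(min_q, min(max_q, q))   # saturate to the representable range

def from_fixed(q, point_location=8):
    # Recover the approximate floating-point value.
    return q / (1 << point_location)
```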
The input data are represented in fig. 2a (N samples, each sample having C channels, the feature map of each channel having height H and width W), and the weights, i.e. the convolution kernels, are represented in fig. 2b (M convolution kernels, each convolution kernel having C channels, with height KH and width KW). The rule of the convolution operation is the same for the N samples of input data; the following explains the process of performing the convolution operation on one sample. Each of the M convolution kernels performs the same operation: each convolution kernel produces one planar feature map, so the M convolution kernels finally produce M planar feature maps (for one sample, the output of the convolution is M feature maps). For one convolution kernel, an inner product operation is performed at each planar position of a sample, and the kernel then slides along the H and W directions. For example, fig. 2c shows a convolution kernel performing an inner product operation at the lower-right corner position of one sample of the input data; fig. 2d shows the position after the convolution kernel slides one grid to the left, and fig. 2e shows the position after it slides one grid upward.
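The sliding-window inner product described above corresponds, in plain (unoptimized) form, to the following sketch; it is a reference description of the arithmetic only, not of the chip's data flow:

```python
import numpy as np

def conv_forward(x, kernels, stride=1):
    """x: (N, C, H, W) input samples; kernels: (M, C, KH, KW); no padding."""
    N, C, H, W = x.shape
    M, _, KH, KW = kernels.shape
    H_out = (H - KH) // stride + 1
    W_out = (W - KW) // stride + 1
    out = np.zeros((N, M, H_out, W_out))
    for n in range(N):                  # every sample follows the same rule
        for m in range(M):              # each kernel yields one planar feature map
            for i in range(H_out):      # slide along the H direction
                for j in range(W_out):  # slide along the W direction
                    window = x[n, :, i*stride:i*stride+KH, j*stride:j*stride+KW]
                    out[n, m, i, j] = np.sum(window * kernels[m])  # inner product
    return out                          # M feature maps per sample
```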
When the first operation is a convolution operation, the input data is convolution input data, the weight data is a convolution kernel,
the first complexity = α × C × KW × KH × M × N × W × C × H;
wherein α is a convolution coefficient with a value greater than 1; C, KW, KH and M are the values of the four dimensions of the convolution kernel; and N, W, C and H are the values of the four dimensions of the convolution input data;
if the first complexity is greater than the set threshold, it is determined whether the convolution input data and the convolution kernel are fixed-point data; if the convolution input data and the convolution kernel are not fixed-point data, the convolution input data is converted into fixed-point data, the convolution kernel is converted into fixed-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the fixed-point data type.
Specifically, the convolution processing may be performed with a chip structure as shown in fig. 3a or fig. 3b. When the first complexity is greater than the set threshold, the data type conversion circuit of the main processing circuit (which may also be referred to as the master unit) converts the data of some or all of the convolution kernels of the weights into fixed-point data, and the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights to the basic processing circuits (which may also be referred to as base units; fig. 3b takes the master unit and base units as an example, the base units being denoted base unit 0, base unit 1, ..., base unit 15) that are directly connected to the main processing circuit (for example, via the uppermost gray-filled vertical data path in fig. 3b).
In one alternative scheme, the control circuit of the main processing circuit sends the data of a certain convolution kernel in the weights to a certain basic processing circuit one number, or one group of numbers, at a time. (For example, for a given basic processing circuit, the 1st number of row 3 is sent the 1st time, the 2nd number of row 3 the 2nd time, the 3rd number of row 3 the 3rd time, ...; or the first two numbers of row 3 are sent the 1st time, the 3rd and 4th numbers of row 3 the second time, the 5th and 6th numbers of row 3 the third time, ....)
In another alternative, the control circuit of the main processing circuit sends the data of several convolution kernels in the weights to a certain basic processing circuit, one number or one group of numbers of each at a time. (For example, for a given basic processing circuit, the 1st numbers of rows 3, 4 and 5 are sent the 1st time, the 2nd numbers of rows 3, 4 and 5 the 2nd time, the 3rd numbers of rows 3, 4 and 5 the 3rd time, ...; or the first two numbers of rows 3, 4 and 5 are sent the 1st time, the 3rd and 4th numbers of rows 3, 4 and 5 the second time, the 5th and 6th numbers of rows 3, 4 and 5 the third time, ....)
The control circuit of the main processing circuit divides the input data according to the convolution position, and the control circuit of the main processing circuit transmits the data in partial or all convolution positions in the input data to the basic processing circuits (for example, a gray-filled transverse data path on the left side of the basic processing circuit array in fig. 3 b) which are directly connected with the main processing circuit through the vertical data input interface;
In one alternative, the control circuit of the main processing circuit sends the data at a certain convolution position in the input data to a certain basic processing circuit one number, or one group of numbers, at a time. (For example, for a given basic processing circuit, the 1st number of column 3 is sent the 1st time, the 2nd number of column 3 the 2nd time, the 3rd number of column 3 the 3rd time, ...; or the first two numbers of column 3 are sent the 1st time, the 3rd and 4th numbers of column 3 the second time, the 5th and 6th numbers of column 3 the third time, ....)
In another alternative, the control circuit of the main processing circuit sends the data at a certain number of convolution positions in the input data to a certain basic processing circuit, one number or one group of numbers per position at a time. (For example, for a given basic processing circuit, the 1st numbers of columns 3, 4 and 5 are sent the 1st time, the 2nd numbers of columns 3, 4 and 5 the 2nd time, the 3rd numbers of columns 3, 4 and 5 the 3rd time, ...; or the first two numbers of columns 3, 4 and 5 are sent the 1st time, the 3rd and 4th numbers of columns 3, 4 and 5 the second time, the 5th and 6th numbers of columns 3, 4 and 5 the third time, ....)
After a basic processing circuit receives the weight data, it transmits the data to the next basic processing circuit connected to it through its horizontal data output interface (for example, a white-filled horizontal data path in the middle of the basic processing circuit array in fig. 3b); after a basic processing circuit receives the input data, it transmits the data to the next basic processing circuit connected to it through its vertical data output interface (for example, a white-filled vertical data path in the middle of the basic processing circuit array in fig. 3b);
each basic processing circuit operates on the received data;
in one alternative, the base processing circuitry computes a multiplication of one or more sets of two data at a time, and then accumulates the results onto registers and/or on-chip caches;
in one alternative, the base processing circuitry computes the inner product of one or more sets of two vectors at a time, and then accumulates the results onto a register and/or on-chip cache;
after the basic processing circuit calculates the result, the result can be transmitted out from the data output interface;
in one alternative, the calculation result may be the final result or an intermediate result of the inner product operation;
specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from the interface, and if not, the result is output in the direction of the basic processing circuit capable of directly outputting to the main processing circuit (for example, in fig. 3b, the lowermost row of basic processing circuits directly outputs the output result thereof to the main processing circuit, and the other basic processing circuits transmit the operation result downward from the vertical output interface).
After receiving the calculation results from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or main processing circuits connected with the basic processing circuit;
outputting the result in a direction capable of being directly output to the main processing circuit (for example, the bottom row of basic processing circuits directly outputs the output result to the main processing circuit, and the other basic processing circuits transmit the operation result downwards from the vertical output interface);
the main processing circuit receives the inner product operation result of each basic processing circuit, and the output result can be obtained.
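Functionally, the array described above computes a grid of inner products whose results are gathered by the main processing circuit. The following sketch models only that computation, with one operand streamed per row of the array and the other per column; it is a simplification under stated assumptions and does not model the actual interfaces, timing or data paths of the chip:

```python
def array_inner_products(row_streams, col_streams):
    """row_streams[i]: the sequence of numbers delivered along row i of the array.
    col_streams[j]:  the sequence of numbers delivered along column j of the array.
    Each basic processing circuit (i, j) multiplies the pairs it receives and
    accumulates them in its register/on-chip cache; the accumulated inner
    products are then returned toward the main processing circuit."""
    results = []
    for i, row in enumerate(row_streams):
        results.append([])
        for j, col in enumerate(col_streams):
            acc = 0.0
            for a, b in zip(row, col):
                acc += a * b          # multiply-accumulate inside circuit (i, j)
            results[i].append(acc)    # transmitted back to the main processing circuit
    return results
```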
Referring to fig. 4a, fig. 4a shows a matrix-multiplied-by-matrix operation. For example, the first operation is a matrix-by-matrix operation, the input data is a first matrix of the matrix-by-matrix operation, and the weight is a second matrix of the matrix-by-matrix operation;
the first complexity = β × F1 × G × E × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and E and F2 are the row and column values of the second matrix;
if the first complexity is greater than the set threshold, it is determined whether the first matrix and the second matrix are fixed-point data; if the first matrix and the second matrix are not fixed-point data, the first matrix is converted into fixed-point data, the second matrix is converted into fixed-point data, and the matrix-by-matrix operation is then performed on the first matrix and the second matrix in the fixed-point data type.
Referring to FIG. 4b, the matrix multiplication operation is performed using the apparatus shown in FIG. 3 b;
the following describes the operation of calculating the multiplication of a matrix S of size M rows and L columns and a matrix P of size L rows and N columns, (each row in the matrix S being the same length as each column of the matrix P, as shown in fig. 2 d) the neural network computing device possesses K basic processing circuits:
step S401b, when the first complexity is larger than the set threshold, the main processing circuit converts the matrix S and the matrix P into fixed point type data, the control circuit of the main processing circuit distributes each row of data in the matrix S to one of K basic processing circuits, and the basic processing circuits store the received data in an on-chip cache and/or a register; specifically, the K basic processing circuits may be sent to the basic processing circuit connected to the main processing circuit.
In one alternative, if the number of rows M < ═ K of S, the control circuit of the main processing circuit distributes one row of the S matrix to the M basic processing circuits, respectively;
in an alternative, the control circuit of the main processing circuit distributes data of one or more rows in the S matrix to each of the elementary processing circuits, respectively, if the number of rows M > K of S.
In S, Mi rows are distributed to the ith basic processing circuit, and the set of Mi rows is called Ai, as shown in fig. 2e, which represents the calculation to be performed on the ith basic processing circuit.
In one alternative, in each base processing circuit, for example, in the ith base processing circuit:
the matrix Ai distributed by the main processing circuit is received and stored in the register and/or on-chip cache of the ith basic processing circuit; this has the advantages of reducing the amount of subsequent data transmission, improving calculation efficiency and reducing power consumption.
Step S402b, the control circuit of the main processing circuit transmits each part in the matrix P to each basic processing circuit in a broadcasting mode;
In one alternative scheme, each part of the matrix P may be broadcast only once to the register or on-chip cache of each basic processing circuit, and the ith basic processing circuit fully multiplexes the data of the matrix P obtained this time to complete the inner product operations corresponding to each row of the matrix Ai; multiplexing in this embodiment specifically means that data is used repeatedly in the calculation, for example, multiplexing the data of the matrix P means that the data of the matrix P is used multiple times.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit does not multiplex the data of the matrix P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai for multiple times;
in an alternative, the control circuit of the main processing circuit may broadcast each part of the matrix P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit performs partial multiplexing on the data of the matrix P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai;
in one alternative, each basic processing circuit, for example the ith basic processing circuit, calculates the inner product of the data of matrix Ai and the data of matrix P;
in step S403b, the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit.
In one alternative, the base processing circuit may transmit the partial sums obtained by performing the inner product operation each time back to the main processing circuit for accumulation;
in an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time can be stored in a register and/or an on-chip cache of the basic processing circuit, and the partial sum is transmitted back to the main processing circuit after the accumulation is finished;
in an alternative, the partial sum obtained by the inner product operation performed by the basic processing circuit each time may be stored in a register and/or an on-chip buffer of the basic processing circuit in some cases for accumulation, and transmitted to the main processing circuit for accumulation in some cases, and transmitted back to the main processing circuit after the accumulation is finished.
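Steps S401b to S403b can be summarized by the following functional sketch; the round-robin row distribution and the omission of the fixed-point conversion are simplifying assumptions made only for illustration:

```python
import numpy as np

def matmul_on_k_circuits(S, P, K):
    """S: M x L matrix, P: L x N matrix, K: number of basic processing circuits."""
    M, L = S.shape
    result = np.zeros((M, P.shape[1]))
    for i in range(K):
        Ai_rows = range(i, M, K)      # the row set Ai held by the ith basic circuit
        for r in Ai_rows:
            # P is broadcast to every basic circuit; each circuit computes the
            # inner products for its own rows, and the accumulated results are
            # transmitted back to the main processing circuit.
            result[r, :] = S[r, :] @ P
    return result
```

For any K, matmul_on_k_circuits(S, P, K) returns the same M × N product as S @ P; only the distribution of the work over the basic processing circuits differs.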
Fig. 4c is a schematic diagram of a matrix multiplied by a vector. If the first operation is a matrix-multiplied-by-vector operation, the input data is a first matrix of the matrix-by-vector operation, and the weight is a vector of the matrix-by-vector operation;
the first complexity = β × F1 × G × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and F2 is the column value of the vector;
if the first complexity is greater than the set threshold, it is determined whether the first matrix and the vector are fixed-point data; if the first matrix and the vector are not fixed-point data, the first matrix is converted into fixed-point data, the vector is converted into fixed-point data, and the matrix-by-vector operation is then performed on the first matrix and the vector in the fixed-point data type.
Referring to fig. 4d, fig. 4d provides an implementation method of matrix multiplication vector, which may specifically include:
step S401, each row of data in the matrix S is converted into fixed-point type data by a data conversion operation circuit of a main processing circuit, a control circuit of the main processing circuit distributes the data to one of K basic processing circuits, and the basic processing circuits store the received distributed data in an on-chip cache and/or a register of the basic processing circuit;
in an alternative, if the number of rows M of the matrix S satisfies M ≤ K, the control circuit of the main processing circuit distributes one row of the matrix S to each of the K basic processing circuits, respectively;
in an alternative, the control circuit of the main processing circuit distributes data of one or more rows of the S matrix to each of the elementary processing circuits, respectively, if the number of rows M > K of the matrix S.
The set of rows in S distributed to the ith basic processing circuit is Ai, and there are Mi rows in total, as fig. 2c shows the calculations to be performed on the ith basic processing circuit.
In one alternative, in each base processing circuit, e.g., the ith base processing circuit, the received dispatch data, e.g., the matrix Ai, may be stored in a register and/or on-chip cache of the ith base processing circuit; the method has the advantages of reducing the data transmission quantity of the subsequent distribution data, improving the calculation efficiency and reducing the power consumption.
Step S402, a data type arithmetic circuit of the main processing circuit converts the vector P into fixed point type data, and a control circuit of the main processing circuit transmits each part in the fixed point type vector P to K basic processing circuits in a broadcasting mode;
in an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once to the register or on-chip buffer of each basic processing circuit, and the ith basic processing circuit may fully multiplex the data of the vector P obtained this time, and perform the inner product operation corresponding to each row in the matrix Ai. The method has the advantages of reducing the data transmission quantity of repeated transmission of the vector P from the main processing circuit to the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit does not multiplex the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai for multiple times; the method has the advantages of reducing the data transmission quantity of the vector P of single transmission in the basic processing circuit, reducing the capacity of the cache and/or the register of the basic processing circuit, improving the execution efficiency, reducing the transmission power consumption and reducing the cost.
In an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip cache of each basic processing circuit for multiple times, and the ith basic processing circuit performs partial multiplexing on the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai; the method has the advantages of reducing the data transmission quantity from the main processing circuit to the basic processing circuit, reducing the data transmission quantity in the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
Step S403: the inner product operator circuits of the K basic processing circuits calculate the inner products of the data of the matrix S and the vector P; for example, the ith basic processing circuit calculates the inner product of the data of the matrix Ai and the data of the vector P;
and S404, accumulating the results of the inner product operation by the accumulator circuits of the K basic processing circuits to obtain accumulated results, and transmitting the accumulated results back to the main processing circuit in a fixed-point type mode.
In an alternative, the partial sum (i.e., a portion of the accumulated result; e.g., if the accumulated result is F1×G1 + F2×G2 + F3×G3 + F4×G4 + F5×G5, a partial sum may be the value of F1×G1 + F2×G2 + F3×G3) obtained from each inner product operation performed by the basic processing circuit may be transmitted back to the main processing circuit for accumulation; this has the advantages of reducing the amount of computation inside the basic processing circuit and improving the operation efficiency of the basic processing circuit.
In an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time can be stored in a register and/or an on-chip cache of the basic processing circuit, and the partial sum is transmitted back to the main processing circuit after the accumulation is finished; the method has the advantages of reducing the data transmission quantity between the basic processing circuit and the main processing circuit, improving the operation efficiency and reducing the data transmission power consumption.
In an alternative, the partial sum obtained by the inner product operation executed by the basic processing circuit each time is stored in a register and/or an on-chip cache of the basic processing circuit for accumulation in partial cases, and is transmitted to the main processing circuit for accumulation in partial cases, and is transmitted back to the main processing circuit after the accumulation is finished; the method has the advantages of reducing the data transmission quantity between the basic processing circuit and the main processing circuit, improving the operation efficiency, reducing the data transmission power consumption, reducing the operation quantity in the basic processing circuit and improving the operation efficiency of the basic processing circuit.
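The accumulation alternatives described above trade data traffic against work inside the basic processing circuit; a schematic comparison of the two extreme options is sketched below (names and structure are assumptions made for illustration only):

```python
def accumulate(partial_products, on_chip=True):
    """partial_products: the products produced by one basic processing circuit."""
    if on_chip:
        # accumulate in the basic circuit's register/on-chip cache,
        # then send a single value back to the main processing circuit
        total = 0.0
        for p in partial_products:
            total += p
        sent_back = [total]
    else:
        # send every partial result back; the main processing circuit accumulates
        sent_back = list(partial_products)
    return sum(sent_back)  # the main processing circuit obtains the same final value
```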
The present disclosure also provides an integrated circuit chip device for performing a forward operation of a neural network, the neural network including a plurality of layers, the device comprising: a processing circuit and an external interface;
the external interface is used for receiving a first calculation instruction;
the processing circuit is used for analyzing a first calculation instruction to obtain a first operation contained in the ith layer of the forward operation by the first calculation instruction, input data corresponding to the first calculation instruction and weight data; the value of i may be 1, for example, when i is 1, the input data may be original input data, and when i is greater than or equal to 2, the input data may be output data of a previous layer, for example, output data of an i-1 layer.
The processing circuit is further configured to determine a first complexity of a first operation according to the input data, the weight data, and the first operation, and determine a first data type of the input data and the weight data when the first operation is performed according to the first complexity, where the first data type includes: a floating point type or a fixed point type;
the processing circuit is further used for executing a first operation contained in the ith layer of the forward operation on the input data and the weight data according to a first data type.
The disclosure also discloses a neural network computing device, which includes one or more chips as shown in fig. 3a or fig. 3b, and is used for acquiring data to be computed and control information from other processing devices, executing a specified neural network operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral devices include, for example, a camera, display, mouse, keyboard, network card, Wi-Fi interface or server. When more than one chip as shown in fig. 3a or fig. 3b is included, the chips can be linked and can transmit data through a specific structure, for example interconnected via a PCIE bus, to support larger-scale neural network operations. In this case, the chips may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.
The neural network arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The disclosure also discloses a combined processing device, which includes the above neural network computing device, the universal interconnect interface, and other processing devices (i.e., general processing devices). The neural network arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. FIG. 5a is a schematic view of a combined processing apparatus.
The other processing devices include one or more types of general-purpose/special-purpose processors such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a neural network processor, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network arithmetic device and external data and control, performing data transfer and basic control such as starting and stopping the neural network arithmetic device; the other processing devices can also cooperate with the neural network arithmetic device to complete an arithmetic task.
And the universal interconnection interface is used for transmitting data and control instructions between the neural network arithmetic device and other processing devices. The neural network arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the neural network arithmetic device chip; control instructions can be obtained from other processing devices and written into a control cache on a neural network arithmetic device chip; the data in the storage module of the neural network arithmetic device can also be read and transmitted to other processing devices.
As shown in fig. 5b, the structure may further include a storage device for storing data required by the present arithmetic unit/arithmetic device or other arithmetic units, and is particularly suitable for data that is required to be calculated and cannot be stored in the internal storage of the present neural network arithmetic device or other processing devices.
The combined processing device can be used as the SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card or Wi-Fi interface.
Referring to fig. 5c, fig. 5c is a schematic structural diagram of a neural network processor board card according to an embodiment of the disclosure. As shown in fig. 5c, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate (substrate) 13.
The present disclosure does not limit the specific structure of the neural network chip package structure 11, and optionally, as shown in fig. 5d, the neural network chip package structure 11 includes: a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.
The specific form of the neural network chip 111 related to the present disclosure is not limited, and the neural network chip 111 includes, but is not limited to, a neural network chip integrating a neural network processor, and the neural network chip may be made of silicon material, germanium material, quantum material, molecular material, or the like. The neural network chip can be packaged according to practical conditions (such as a severer environment) and different application requirements, so that most of the neural network chip is wrapped, and the pins on the neural network chip are connected to the outer side of the packaging structure through conductors such as gold wires and the like for circuit connection with a further outer layer.
The present disclosure is not limited to the specific structure of the neural network chip 111, and please refer to the apparatus shown in fig. 3a or fig. 3b as an alternative.
The type of the first substrate 13 and the second substrate 113 is not limited in this disclosure, and may be a Printed Circuit Board (PCB) or a Printed Wiring Board (PWB), and may be other circuit boards. The material of the PCB is not limited.
The second substrate 113 according to the present disclosure is used for carrying the neural network chip 111, and the neural network chip package structure 11 obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112 is used for protecting the neural network chip 111, so as to further package the neural network chip package structure 11 and the first substrate 13.
The specific packaging method and the corresponding structure of the second electrical and non-electrical connection device 112 are not limited; an appropriate packaging method can be selected and simply adapted according to actual conditions and different application requirements, for example: a Flip Chip Ball Grid Array Package (FCBGAP), a Low-profile Quad Flat Package (LQFP), a Quad Flat Package with Heat sink (HQFP), a Quad Flat Non-leaded Package (QFN), or a Fine-pitch Ball Grid Array (FBGA) package.
Flip Chip packaging is suitable when the area after packaging must be small or when the design is sensitive to lead inductance and signal transmission time. In addition, Wire Bonding can be used, which reduces cost and increases the flexibility of the package structure.
Ball Grid Array (BGA) packaging provides more pins with a short average lead length and supports high-speed signal transmission; the package may alternatively be a Pin Grid Array (PGA), Zero Insertion Force (ZIF) socket, Single Edge Contact Cartridge (SECC), Land Grid Array (LGA), or the like.
Optionally, the neural network chip 111 and the second substrate 113 are packaged in a Flip Chip Ball Grid Array (FCBGA) manner; a schematic diagram of a specific neural network chip package structure is shown in fig. 6. As shown in fig. 6, the neural network chip package structure includes: the neural network chip 21, bonding pads 22, solder balls 23, the second substrate 24, connection points 25 on the second substrate 24, and pins 26.
The bonding pads 22 are connected to the neural network chip 21, and the solder balls 23 are formed between the bonding pads 22 and the connection points 25 on the second substrate 24 by soldering, so that the neural network chip 21 and the second substrate 24 are connected, that is, the package of the neural network chip 21 is realized.
The pins 26 are used for connection with an external circuit of the package structure (for example, the first substrate 13 of the neural network processor board card 10), enabling the transmission of external and internal data and facilitating data processing by the neural network chip 21 or by the neural network processor corresponding to the neural network chip 21. The present disclosure likewise does not limit the type or number of pins; different pin types can be selected according to different packaging technologies and arranged according to certain rules.
Optionally, the neural network chip package structure further includes an insulating filler, disposed in the gaps between the bonding pads 22, the solder balls 23, and the connection points 25, for preventing interference between adjacent solder balls.
The material of the insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.
Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal plate with good thermal conductivity, a heat sink, or an active cooling device such as a fan.
For example, as shown in fig. 6a, the neural network chip package structure 11 includes: the neural network chip 21, bonding pads 22, solder balls 23, the second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal grease 28, and a metal-housing heat sink 29. The thermal grease 28 and the metal-housing heat sink 29 are used to dissipate the heat generated during operation of the neural network chip 21.
Optionally, the neural network chip package structure 11 further includes a reinforcing structure connected to the bonding pad 22 and embedded in the solder ball 23 to enhance the connection strength between the solder ball 23 and the bonding pad 22.
The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited herein.
The present disclosure does not limit the specific form of the first electrical and non-electrical connection device 12; reference may be made to the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or a connection wire or plug-in connection may be used to connect the second substrate 113 and the first substrate 13, which facilitates subsequent replacement of the first substrate 13 or of the neural network chip package structure 11.
Optionally, the first substrate 13 includes an interface for a memory unit to expand the storage capacity, for example: Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate SDRAM (DDR), and the like; expanding the memory improves the processing capability of the neural network processor.
The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a Small Form-factor Pluggable (SFP) interface, an Ethernet interface, a Controller Area Network (CAN) interface, and the like for data transmission between the package structure and external circuits, which can improve the operation speed and ease of operation.
The neural network processor is packaged into the neural network chip 111, the neural network chip 111 is packaged into the neural network chip package structure 11, and the neural network chip package structure 11 is packaged into the neural network processor board card 10, which exchanges data with an external circuit (for example, a computer motherboard) through an interface (a slot or connector) on the board card; that is, the function of the neural network processor is realized directly by the neural network processor board card 10, while the neural network chip 111 is protected. Other modules can also be added to the neural network processor board card 10, which broadens the application range and improves the operation efficiency of the neural network processor.
In one embodiment, the present disclosure discloses an electronic device comprising the above neural network processor board card 10 or the above neural network chip package structure 11.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, cell phones, tachographs, navigators, sensors, webcams, servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
The vehicles include an airplane, a ship, and/or a motor vehicle; the household appliances include a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
The above embodiments further describe in detail the objects, technical solutions, and advantages of the present disclosure. It should be understood that the above are merely illustrative embodiments of the present disclosure and are not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present disclosure shall be included within its scope of protection.
Claims (18)
1. A method of forward operation of a neural network implemented on an integrated circuit chip device, the neural network comprising a plurality of layers, the method being applied to the integrated circuit chip device, the device comprising: a processing circuit, the processing circuit comprising: a main processing circuit and a basic processing circuit, wherein the main processing circuit is connected with the basic processing circuit and the connection is used for transmitting input data and weight data, the method comprising the following steps:
the processing circuit receives a first calculation instruction and analyzes the first calculation instruction to obtain a first operation contained by the first calculation instruction in an ith layer of the forward operation, input data corresponding to the first calculation instruction, and weight data, wherein i is an integer greater than or equal to 1, and if i is greater than or equal to 2, the input data is the output data of the (i-1)th layer;
the processing circuit determines a first complexity of the first operation according to the input data, the weight data, and the first operation, and determines, according to the first complexity, a first data type of the input data and the weight data when the first operation is executed, wherein the first data type includes: a floating-point type or a fixed-point type;
when the first complexity is greater than a preset threshold, the main processing circuit converts part or all of the weight data and the input data into fixed-point data, and the control circuit of the main processing circuit sends part or all of the weight data to the basic processing circuit connected with the main processing circuit; the main processing circuit broadcasts part or all of the input data to the basic processing circuit;
and executing the first operation contained in the ith layer of the forward operation using the input data and the weight data in the first data type.
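Claim 1 reads naturally as a control flow: parse the instruction, score the operation's complexity, pick a numeric type, then distribute weights and broadcast inputs to the basic processing circuit. The following is a minimal software sketch of that flow, not the claimed hardware; the threshold value, the number of basic circuits, the choice of a matrix-by-matrix first operation, and all function names are assumptions for illustration only (the actual numeric conversion is omitted here).

```python
import numpy as np

PRESET_THRESHOLD = 1e9   # hypothetical value; the claim only requires "a preset threshold"
NUM_BASIC_CIRCUITS = 4   # hypothetical number of basic processing circuits

def forward_first_operation(input_data: np.ndarray, weight_data: np.ndarray):
    # First complexity for a matrix-by-matrix first operation (the claim-5 form,
    # with the coefficient beta taken as 1 here).
    f1, g = input_data.shape
    e, f2 = weight_data.shape
    complexity = f1 * g * e * f2

    # Above the threshold the first data type is fixed point, otherwise floating point.
    first_data_type = "fixed" if complexity > PRESET_THRESHOLD else "float"

    # The control circuit of the main processing circuit sends a slice of the
    # weight data to each basic processing circuit; the input data is broadcast.
    weight_slices = np.array_split(weight_data, NUM_BASIC_CIRCUITS, axis=1)
    partial = [input_data @ w for w in weight_slices]  # work of the basic circuits
    return first_data_type, np.concatenate(partial, axis=1)
```

For example, `forward_first_operation(np.ones((4, 8)), np.ones((8, 6)))` returns the pair `("float", ...)`, since 4*8*8*6 = 1536 is far below the assumed threshold.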
2. The method of claim 1, wherein determining the first data type of the input data and the weight data when performing the first operation according to the first complexity comprises:
comparing the first complexity with the preset threshold; if the first complexity is higher than the preset threshold, determining that the first data type is the fixed-point type, and if the first complexity is lower than or equal to the preset threshold, determining that the first data type is the floating-point type.
3. The method of claim 2, further comprising, after the determining of the first data type of the input data and the weight data when performing the first operation according to the first complexity:
determining a second data type to which the input data and the weight data belong, and if the second data type is different from the first data type, converting the input data belonging to the second data type and the weight data belonging to the second data type into input data belonging to the first data type and weight data belonging to the first data type.
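The conversion step of claim 3, turning operands of a second data type into the first data type, can be pictured with a simple fixed-point encoding. The sketch below is only illustrative; the 8-bit fraction width and the function names are assumptions, and the disclosure's actual conversion circuit is not specified here.

```python
import numpy as np

FRACTION_BITS = 8  # assumed fraction width; the claim does not fix a fixed-point format

def float_to_fixed(x: np.ndarray) -> np.ndarray:
    # Second data type (float32) converted to the first data type
    # (an int32-backed fixed-point representation).
    return np.round(x * (1 << FRACTION_BITS)).astype(np.int32)

def fixed_to_float(x_fixed: np.ndarray) -> np.ndarray:
    # Inverse conversion, used when the first data type is floating point.
    return x_fixed.astype(np.float32) / (1 << FRACTION_BITS)

weights = np.array([0.5, -1.25, 3.0], dtype=np.float32)
assert np.allclose(fixed_to_float(float_to_fixed(weights)), weights)
```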
4. The method of claim 1, wherein if the first operation is a convolution operation, the input data is convolution input data, the weight data is a convolution kernel,
the first complexity is α × C × KW × KH × M × N × W × C × H;
wherein α is a convolution coefficient with a value greater than 1, C, KW, KH, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data;
if the first complexity is greater than the set threshold, determining whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, converting the convolution input data into floating-point data, converting the convolution kernel into floating-point data, and then performing the convolution operation on the convolution input data and the convolution kernel in the floating-point data type.
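As a worked example of the claim-4 test, the snippet below plugs hypothetical dimension values into the stated formula; the concrete numbers and the threshold are assumptions, not values taken from the disclosure.

```python
alpha = 2.0                         # convolution coefficient, greater than 1
C, KW, KH, M = 64, 3, 3, 128        # the four dimensions of the convolution kernel
N, W, C_in, H = 1, 224, 64, 224     # the four dimensions of the convolution input data

first_complexity = alpha * C * KW * KH * M * N * W * C_in * H
set_threshold = 1e9                 # assumed value of the set threshold

# Per claim 4, a complexity above the threshold triggers the check that both
# operands are floating point (and a conversion when they are not).
print(first_complexity, first_complexity > set_threshold)  # ~4.7e11, True
```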
5. The method of claim 1, wherein if the first operation is: a matrix-by-matrix multiplication operation, the input data is the first matrix of the matrix-by-matrix operation, and the weight data is the second matrix of the matrix-by-matrix operation;
the first complexity is β × F1 × G × E × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and E and F2 are the row and column values of the second matrix;
if the first complexity is greater than the set threshold, determining whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are not floating-point data, converting the first matrix into floating-point data, converting the second matrix into floating-point data, and then performing the matrix-by-matrix multiplication on the first matrix and the second matrix in the floating-point data type.
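A corresponding worked example for the matrix-by-matrix case of claim 5, again with hypothetical sizes and threshold:

```python
beta = 1.0            # matrix coefficient, greater than or equal to 1
F1, G = 512, 1024     # row and column values of the first matrix
E, F2 = 1024, 256     # row and column values of the second matrix

first_complexity = beta * F1 * G * E * F2
set_threshold = 1e9   # assumed value of the set threshold

# A complexity above the threshold means the matrices are converted to
# floating-point data (if needed) before the matrix-by-matrix multiplication.
print(first_complexity, first_complexity > set_threshold)  # ~1.4e11, True
```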
6. The method of claim 1, wherein if the first operation is: a matrix-by-vector multiplication operation, the input data is the first matrix of the matrix-by-vector operation, and the weight data is the vector of the matrix-by-vector operation;
the first complexity is β × F1 × G × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and F2 is the column value of the vector;
if the first complexity is greater than the set threshold, determining whether the first matrix and the vector are floating-point data; if the first matrix and the vector are not floating-point data, converting the first matrix into floating-point data, converting the vector into floating-point data, and then performing the matrix-by-vector operation on the first matrix and the vector in the floating-point data type.
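And the matrix-by-vector case of claim 6, with the same caveat that all numbers are illustrative assumptions:

```python
beta = 1.0            # matrix coefficient, greater than or equal to 1
F1, G = 2048, 1024    # row and column values of the first matrix
F2 = 1024             # column value of the vector

first_complexity = beta * F1 * G * F2
set_threshold = 1e9   # assumed value of the set threshold

print(first_complexity, first_complexity > set_threshold)  # ~2.1e9, True
```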
7. The method according to any one of claims 1 to 6,
the ith layer further comprises: one or any combination of a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
8. An integrated circuit chip apparatus for performing a forward operation of a neural network, the neural network comprising a plurality of layers, the apparatus comprising: a processing circuit and an external interface; the processing circuit comprising: a main processing circuit and a basic processing circuit, wherein the main processing circuit is connected with the basic processing circuit and the connection is used for transmitting input data and weight data,
the external interface is used for receiving a first calculation instruction;
the processing circuit is configured to analyze the first calculation instruction to obtain a first operation contained by the first calculation instruction in an ith layer of the forward operation, input data corresponding to the first calculation instruction, and weight data, wherein i is an integer greater than or equal to 1, and if i is greater than or equal to 2, the input data is the output data of the (i-1)th layer;
the processing circuit is further configured to determine a first complexity of the first operation according to the input data, the weight data, and the first operation, and to determine, according to the first complexity, a first data type of the input data and the weight data when the first operation is performed, wherein the first data type includes: a floating-point type or a fixed-point type;
the main processing circuit is configured to convert part or all of the weight data and the input data into fixed-point data when the first complexity is greater than a preset threshold,
the control circuit of the main processing circuit is configured to send part or all of the weight data to the basic processing circuit connected with the main processing circuit;
the main processing circuit is also used for broadcasting part or all of the input data to the basic processing circuit;
the processing circuit is further configured to execute the first operation contained in the ith layer of the forward operation on the input data and the weight data in the first data type.
9. The integrated circuit chip apparatus of claim 8,
the processing circuit is specifically configured to compare the first complexity with the preset threshold; if the first complexity is higher than the preset threshold, the first data type is determined to be the fixed-point type, and if the first complexity is lower than or equal to the preset threshold, the first data type is determined to be the floating-point type.
10. The integrated circuit chip apparatus of claim 9, further comprising: a data type conversion circuit;
the processing circuit is further configured to determine a second data type to which the input data and the weight data belong, and to send a conversion command to the data type conversion circuit if the second data type is different from the first data type,
the data type conversion circuit is configured to convert the input data belonging to the second data type and the weight data belonging to the second data type into the input data belonging to the first data type and the weight data belonging to the first data type according to the conversion command.
11. The integrated circuit chip apparatus of claim 8, wherein if the first operation is a convolution operation, the input data is convolution input data, the weight data is a convolution kernel,
the processing circuit is configured to calculate the first complexity, α × C × KW × KH × M × N × W × C × H;
wherein α is a convolution coefficient with a value greater than 1, C, KW, KH, and M are the values of the four dimensions of the convolution kernel, and N, W, C, and H are the values of the four dimensions of the convolution input data;
the processing circuit is further configured to, if the first complexity is greater than the set threshold, determine whether the convolution input data and the convolution kernel are floating-point data; if the convolution input data and the convolution kernel are not floating-point data, the convolution input data is converted into floating-point data, the convolution kernel is converted into floating-point data, and the convolution operation is then performed on the convolution input data and the convolution kernel in the floating-point data type.
12. The integrated circuit chip device of claim 8, wherein if the first operation is: a matrix-by-matrix multiplication operation, the input data is the first matrix of the matrix-by-matrix operation, and the weight data is the second matrix of the matrix-by-matrix operation;
the processing circuit is configured to calculate the first complexity;
the first complexity is β × F1 × G × E × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and E and F2 are the row and column values of the second matrix;
the processing circuit is further configured to, if the first complexity is greater than the set threshold, determine whether the first matrix and the second matrix are floating-point data; if the first matrix and the second matrix are not floating-point data, the first matrix is converted into floating-point data, the second matrix is converted into floating-point data, and the matrix-by-matrix multiplication is then performed on the first matrix and the second matrix in the floating-point data type.
13. The integrated circuit chip device of claim 8, wherein if the first operation is: a matrix-by-vector multiplication operation, the input data is the first matrix of the matrix-by-vector operation, and the weight data is the vector of the matrix-by-vector operation;
the processing circuit is configured to calculate the first complexity;
the first complexity is β × F1 × G × F2, wherein β is a matrix coefficient with a value greater than or equal to 1, F1 and G are the row and column values of the first matrix, and F2 is the column value of the vector;
the processing circuit is further configured to, if the first complexity is greater than the set threshold, determine whether the first matrix and the vector are floating-point data; if the first matrix and the vector are not floating-point data, the first matrix is converted into floating-point data, the vector is converted into floating-point data, and the matrix-by-vector operation is then performed on the first matrix and the vector in the floating-point data type.
14. The integrated circuit chip apparatus of any one of claims 8-13,
the ith layer further comprises: one or any combination of a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
15. A neural network operation device, comprising one or more integrated circuit chip devices as claimed in any one of claims 8 to 14.
16. A combined processing apparatus, characterized in that the combined processing apparatus comprises: the neural network operation device according to claim 15, a universal interconnection interface, and a general-purpose processing device;
the neural network operation device is connected with the general-purpose processing device through the universal interconnection interface.
17. A chip incorporating the device of any one of claims 8-14.
18. An electronic device, characterized in that the electronic device comprises a chip according to claim 17.
Priority Applications (16)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711347407.1A CN109961131B (en) | 2017-12-14 | 2017-12-14 | Neural network forward operation method and related product |
TW107144040A TWI767098B (en) | 2017-12-14 | 2018-12-07 | Method for neural network forward computation and related product |
PCT/CN2019/073453 WO2019114842A1 (en) | 2017-12-14 | 2019-01-28 | Integrated circuit chip apparatus |
US16/721,885 US11308389B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,883 US20200192632A1 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,888 US11704545B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,879 US11507809B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,892 US11507810B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,882 US11586891B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US16/721,875 US11562216B2 (en) | 2017-12-14 | 2019-12-19 | Integrated circuit chip apparatus |
US17/010,761 US11562219B2 (en) | 2017-12-14 | 2020-09-02 | Integrated circuit chip apparatus |
US17/688,853 US11900242B2 (en) | 2017-12-14 | 2022-03-07 | Integrated circuit chip apparatus |
US17/688,844 US11900241B2 (en) | 2017-12-14 | 2022-03-07 | Integrated circuit chip apparatus |
US18/085,273 US20230120704A1 (en) | 2017-12-14 | 2022-12-20 | Integrated circuit chip apparatus |
US18/085,332 US12136029B2 (en) | 2017-12-14 | 2022-12-20 | Integrated circuit chip apparatus |
US18/404,878 US20240152741A1 (en) | 2017-12-14 | 2024-01-04 | Integrated circuit chip apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711347407.1A CN109961131B (en) | 2017-12-14 | 2017-12-14 | Neural network forward operation method and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109961131A CN109961131A (en) | 2019-07-02 |
CN109961131B true CN109961131B (en) | 2020-05-08 |
Family
ID=67018648
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711347407.1A Active CN109961131B (en) | 2017-12-14 | 2017-12-14 | Neural network forward operation method and related product |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109961131B (en) |
TW (1) | TWI767098B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490315B (en) * | 2019-08-14 | 2023-05-23 | 中科寒武纪科技股份有限公司 | Reverse operation sparse method of neural network and related products |
TWI737228B (en) * | 2020-03-20 | 2021-08-21 | 國立清華大學 | Quantization method based on hardware of in-memory computing and system thereof |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126481A (en) * | 2016-06-29 | 2016-11-16 | 华为技术有限公司 | A kind of computing engines and electronic equipment |
CN107330515A (en) * | 2016-04-29 | 2017-11-07 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing artificial neural network forward operation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170061279A1 (en) * | 2015-01-14 | 2017-03-02 | Intel Corporation | Updating an artificial neural network using flexible fixed point representation |
CN108427990B (en) * | 2016-01-20 | 2020-05-22 | 中科寒武纪科技股份有限公司 | Neural network computing system and method |
- 2017-12-14: CN CN201711347407.1A patent/CN109961131B/en active Active
- 2018-12-07: TW TW107144040A patent/TWI767098B/en active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107330515A (en) * | 2016-04-29 | 2017-11-07 | 北京中科寒武纪科技有限公司 | A kind of apparatus and method for performing artificial neural network forward operation |
CN106126481A (en) * | 2016-06-29 | 2016-11-16 | 华为技术有限公司 | A kind of computing engines and electronic equipment |
Non-Patent Citations (1)
Title |
---|
"深度卷积神经网络的数据表示方法分析与实践";王佩琪 等;《计算机研究与发展》;20170630;第1348-1356页 * |
Also Published As
Publication number | Publication date |
---|---|
TW201928793A (en) | 2019-07-16 |
TWI767098B (en) | 2022-06-11 |
CN109961131A (en) | 2019-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109961138B (en) | Neural network training method and related product | |
US11748604B2 (en) | Integrated circuit chip device | |
US11507810B2 (en) | Integrated circuit chip apparatus | |
CN109978131B (en) | Integrated circuit chip apparatus, method and related product | |
CN109961136B (en) | Integrated circuit chip device and related product | |
CN109961134B (en) | Integrated circuit chip device and related product | |
CN109961131B (en) | Neural network forward operation method and related product | |
CN109977446B (en) | Integrated circuit chip device and related product | |
CN109978157B (en) | Integrated circuit chip device and related product | |
CN109961135B (en) | Integrated circuit chip device and related product | |
CN109978148B (en) | Integrated circuit chip device and related product | |
CN109978156B (en) | Integrated circuit chip device and related product | |
CN109960673B (en) | Integrated circuit chip device and related product | |
CN109978158B (en) | Integrated circuit chip device and related product | |
CN109978152B (en) | Integrated circuit chip device and related product | |
CN109961133B (en) | Integrated circuit chip device and related product | |
CN109977071A (en) | Neural network processor board and Related product | |
CN110197267A (en) | Neural network processor board and Related product | |
CN109961137B (en) | Integrated circuit chip device and related product | |
CN109978130A (en) | Integrated circuit chip device and Related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100000 Room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant after: Zhongke Cambrian Technology Co., Ltd
Address before: 100000 Room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences
Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||