
WO2017185418A1 - Device and method for performing neural network computation and matrix/vector computation - Google Patents

Device and method for performing neural network computation and matrix/vector computation

Info

Publication number
WO2017185418A1
WO2017185418A1 · PCT/CN2016/082015 · CN2016082015W
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
vector
matrix
neural network
unit
Application number
PCT/CN2016/082015
Other languages
French (fr)
Chinese (zh)
Inventor
陶劲桦
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Application filed by 北京中科寒武纪科技有限公司
Publication of WO2017185418A1

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 9/00: Arrangements for program control, e.g. control units
                    • G06F 9/06: using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
                            • G06F 9/30003: Arrangements for executing specific machine instructions
                                • G06F 9/30007: to perform operations on data operands
                                    • G06F 9/3001: Arithmetic instructions
                            • G06F 9/30098: Register arrangements
                                • G06F 9/3012: Organisation of register space, e.g. banked or distributed register file
                                    • G06F 9/30134: Register stacks; shift registers
                            • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
                • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
                    • G06F 17/10: Complex mathematical operations
                        • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                            • G06N 3/063: Physical realisation using electronic means

Definitions

  • The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing neural network operations and matrix/vector operations.
  • Artificial neural networks (ANNs), or simply neural networks (NNs), are algorithmic mathematical models that mimic the behavioral characteristics of animal neural networks and perform distributed, parallel information processing. Such a network relies on the complexity of the system, processing information by adjusting the interconnections among a large number of internal nodes.
  • Neural networks have made great progress in many fields such as intelligent control and machine learning. Because a neural network is an algorithmic mathematical model involving a large number of mathematical operations, performing neural network operations quickly and accurately is an urgent problem to be solved.
  • It is therefore an object of the present invention to provide an apparatus and method for performing neural network operations and matrix/vector operations, so as to achieve efficient neural network and matrix/vector computation.
  • To that end, the present invention provides an apparatus for performing neural network operations and matrix/vector operations, comprising a storage unit, a register unit, a control unit, an operation unit, and a scratchpad memory, wherein:
  • the storage unit stores neurons/matrices/vectors;
  • the register unit stores neuron addresses/matrix addresses/vector addresses, where a neuron address, matrix address, or vector address is the address at which the corresponding neuron, matrix, or vector is stored in the storage unit;
  • the control unit performs decoding operations and controls each unit module according to the instructions it reads;
  • the operation unit obtains a neuron address/matrix address/vector address from the register unit according to an instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, and operates on the neurons/matrices/vectors so obtained and/or on data carried in the instruction, producing an operation result;
  • the apparatus is characterized in that the neuron/matrix/vector data participating in the operation unit's computation are temporarily stored in the scratchpad memory, from which the operation unit reads them when needed.
  • The scratchpad memory can support neuron/matrix/vector data of different sizes.
  • The register unit is a scalar register file that provides the scalar registers required during computation.
  • The operation unit comprises a vector multiplication component, an accumulation component, and a scalar multiplication component.
  • The operation unit is responsible for the device's neural network/matrix/vector operations, including convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product, vector inner product, element-wise vector arithmetic, and vector logical operations.
  • The device further includes an instruction cache unit for storing operation instructions awaiting execution; the instruction cache unit is preferably a reorder buffer.
  • The apparatus also includes an instruction queue that buffers decoded instructions in order and sends them to the dependency processing unit.
  • The device further includes a dependency processing unit and a store queue. Before the operation unit obtains an instruction, the dependency processing unit determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction; if so, the operation instruction is stored in the store queue and is provided to the operation unit only after the preceding instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly. A minimal sketch of this check follows.
  • The store queue stores instructions that have data dependencies on earlier instructions, and commits them once the dependencies are resolved.
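A minimal sketch of the dependency check described above, assuming each instruction's scratchpad accesses can be modeled as [base, base + len) byte ranges; the types and function names are hypothetical rather than taken from the disclosure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each instruction reads/writes ranges of the
 * scratchpad, described by a base address and a length in bytes. */
typedef struct {
    size_t base;
    size_t len;
} MemRange;

typedef struct {
    MemRange src[2];   /* ranges read by the instruction   */
    MemRange dst;      /* range written by the instruction */
} Access;

static bool ranges_overlap(MemRange a, MemRange b) {
    return a.base < b.base + b.len && b.base < a.base + a.len;
}

/* True if `cur` must wait in the store queue: it reads storage the
 * still-executing predecessor writes, or writes storage the predecessor
 * reads or writes (RAW/WAR/WAW hazards). */
bool has_dependency(const Access *prev, const Access *cur) {
    for (int i = 0; i < 2; i++) {
        if (ranges_overlap(prev->dst, cur->src[i])) return true;  /* RAW */
        if (ranges_overlap(cur->dst, prev->src[i])) return true;  /* WAR */
    }
    return ranges_overlap(prev->dst, cur->dst);                   /* WAW */
}
```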
  • The device's instruction set adopts a Load/Store architecture, so the operation unit does not operate on data in memory.
  • The instruction set preferably adopts a very long instruction word (VLIW) architecture, and preferably uses fixed-length instructions.
  • An operation instruction executed by the operation unit includes at least one opcode and at least three operands. The opcode indicates the function of the operation instruction, and the operation unit performs different operations by recognizing one or more opcodes.
  • The operands indicate the data of the operation instruction; each operand is either an immediate value or a register number. A hypothetical encoding is sketched below.
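The sketch below gives one possible fixed-length encoding consistent with this description; the field widths and the 16-slot operand array are assumptions, since the text only fixes the operand counts.

```c
#include <stdint.h>

/* Hypothetical fixed-length instruction word. The disclosure specifies
 * only "at least one opcode and at least three operands", each operand
 * being an immediate or a register number; the widths chosen here are
 * illustrative, not taken from the patent. */
typedef struct {
    uint16_t opcode;      /* selects the operation to perform */
    struct {
        uint8_t  is_imm;  /* 1: `value` is an immediate; 0: a register number */
        uint32_t value;
    } operand[16];        /* 16 slots: enough for a neural network instruction;
                             matrix-matrix uses >= 4, vector-vector >= 3,
                             matrix-vector >= 6 of them */
} Instruction;
```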
  • When the operation instruction is a neural network operation instruction, it includes at least one opcode and 16 operands.
  • When the operation instruction is a matrix-matrix operation instruction, it includes at least one opcode and at least 4 operands.
  • When the operation instruction is a vector-vector operation instruction, it includes at least one opcode and at least 3 operands.
  • When the operation instruction is a matrix-vector operation instruction, it includes at least one opcode and at least 6 operands.
  • As another aspect, the present invention further provides an apparatus for performing neural network operations and matrix/vector operations, comprising:
  • an instruction fetch module, which fetches the next instruction to be executed from the instruction sequence and passes it to the decoding module;
  • a decoding module, which decodes the instruction and passes the decoded instruction to the instruction queue;
  • an instruction queue, which buffers the decoding module's decoded instructions in order and sends them to the dependency processing unit;
  • a scalar register file, which provides scalar registers for use during computation;
  • a dependency processing unit, which determines whether the current instruction has a data dependency on the preceding instruction and, if so, stores the current instruction in the store queue;
  • a store queue, which buffers a current instruction that has a data dependency on the preceding instruction and issues it once that dependency is resolved;
  • a reorder buffer, which caches each instruction while it executes and, after execution, determines whether the instruction is the oldest uncommitted instruction in the reorder buffer, committing it if so;
  • an operation unit, which performs all neural network operations and matrix/vector operations;
  • a scratchpad memory, which temporarily stores the neuron/matrix/vector data participating in the operation unit's computation, from which the operation unit reads when needed; the scratchpad memory preferably supports data of different sizes;
  • an IO memory access module, which directly accesses the scratchpad memory and is responsible for reading data from and writing data to it (a sketch follows).
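A minimal sketch of the scratchpad plus IO access module, assuming the scratchpad is modeled as a flat byte array so that data blocks of any size (neurons, matrices, vectors of different widths) can be moved in and out; names and sizes are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flat scratchpad: addressing is by byte offset, so blocks
 * of arbitrary size can be read or written at a given address. */
enum { SCRATCHPAD_BYTES = 1 << 20 };
static uint8_t scratchpad[SCRATCHPAD_BYTES];

/* Copy `len` bytes out of the scratchpad (scratchpad -> host). */
void io_read(void *dst, size_t addr, size_t len) {
    memcpy(dst, &scratchpad[addr], len);
}

/* Copy `len` bytes into the scratchpad (host -> scratchpad). */
void io_write(size_t addr, const void *src, size_t len) {
    memcpy(&scratchpad[addr], src, len);
}
```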
  • As a further aspect, the present invention also provides a method for executing neural network operation and matrix/vector instructions, comprising the following steps (a toy walk-through of the step ordering follows the list):
  • Step S1: the instruction fetch module fetches a neural network operation and matrix/vector instruction and sends it to the decoding module.
  • Step S2: the decoding module decodes the instruction and sends it to the instruction queue.
  • Step S3: within the decoding module, the instruction is sent to the instruction accepting module.
  • Step S4: the instruction accepting module sends the instruction to the micro-instruction generating module for micro-instruction generation.
  • Step S5: the micro-instruction generating module obtains the instruction's neural network opcode and operands from the scalar register file, decodes the instruction into the micro-instructions that control each functional component, and sends them to the micro-instruction issue queue.
  • Step S6: after the required data have been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if it does, the instruction waits in the store queue until the dependency no longer exists.
  • Step S7: the micro-instructions corresponding to the instruction are sent to the operation unit.
  • Step S8: the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then completes the neural network operation and/or matrix/vector operation corresponding to the instruction.
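A toy, single-instruction walk through steps S1 to S8. Every function is a logging stand-in for a hardware module, not an API from the disclosure; the sketch only makes the stage ordering concrete.

```c
#include <stdio.h>

typedef struct { int opcode; int operands[16]; } Instr;

static Instr fetch(void)                 { puts("S1: fetch instruction");              return (Instr){0}; }
static Instr decode(Instr i)             { puts("S2-S4: decode, accept, generate uops"); return i; }
static void  read_scalar_regs(Instr *i)  { puts("S5: read opcode/operands from scalar register file"); }
static int   depends_on_pending(Instr *i){ puts("S6: dependency check");               return 0; }
static void  issue(Instr i)              { puts("S7: issue micro-instructions to operation unit"); }
static void  execute(Instr i)            { puts("S8: load operands from scratchpad and compute"); }

int main(void) {
    Instr i = fetch();
    i = decode(i);
    read_scalar_regs(&i);
    while (depends_on_pending(&i)) { /* would wait in the store queue */ }
    issue(i);
    execute(i);
    return 0;
}
```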
  • The neural network operation and matrix/vector operation apparatus and method of the present invention have the following beneficial effects: the data participating in a computation are temporarily stored in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during neural network and matrix/vector operations, while the customized neural network and matrix/vector operation module implements the various neural network and matrix/vector operations more efficiently, improving the execution performance of computation tasks.
  • The instructions used in the present invention have a very long instruction word format.
  • FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention.
  • FIG. 2 is a schematic diagram of the format of the instruction set of the present invention.
  • FIG. 3 is a schematic diagram of the format of a neural network operation instruction of the present invention.
  • FIG. 4 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention.
  • FIG. 5 is a schematic diagram of the format of a vector-vector operation instruction of the present invention.
  • FIG. 6 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention.
  • FIG. 7 is a schematic structural diagram of a neural network operation and matrix/vector operation device according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of the decoding module in a neural network operation and matrix/vector operation device according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of a neural network operation and matrix/vector operation device executing a neural network operation and matrix/vector instruction according to an embodiment of the present invention.
  • The invention discloses a neural network operation and matrix/vector operation device comprising a storage unit, a register unit, a control unit, and an operation unit.
  • The storage unit stores neurons/matrices/vectors.
  • The register unit stores the addresses at which the neurons/matrices/vectors are stored, along with other parameters.
  • The control unit performs decoding operations and controls each module according to the instructions it reads.
  • The operation unit, according to the neural network operation and matrix/vector operation instruction, obtains the neuron/matrix/vector addresses and other parameters from the instruction or from the register unit, fetches the corresponding neurons/matrices/vectors from the storage unit at those addresses, and then operates on the fetched neurons/matrices/vectors to produce an operation result.
  • The invention temporarily stores the neuron/matrix/vector data participating in a computation in the scratchpad memory, so that the computation can support data of different widths more flexibly and effectively, improving the execution performance of computation tasks.
  • FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention. As shown in FIG. 1, the device includes:
  • a storage unit for storing neurons/matrices/vectors. In one embodiment, the storage unit may be a scratchpad memory capable of supporting neuron/matrix/vector data of different sizes; the present invention temporarily stores the necessary computation data in a scratchpad memory, enabling the computing device to support data of different widths more flexibly and effectively during neural network and matrix/vector operations;
  • a register unit, which may be a scalar register file providing the scalar registers required during computation. The scalar registers store not only neuron/matrix/vector addresses but also scalar data. When an operation involves both a matrix/vector and a scalar, the operation unit must obtain both the matrix/vector address and the corresponding scalar from the register unit;
  • a control unit for controlling the behavior of each module in the device. In one embodiment, the control unit reads the prepared instruction, decodes it into several micro-instructions, and sends them to the other modules in the device, which perform the corresponding operations according to the micro-instructions they receive;
  • an operation unit for executing the various neural network operation and matrix/vector operation instructions: it obtains a neuron/matrix/vector address from the register unit according to the instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, performs the operation on the fetched data, and stores the operation result back in the storage unit.
  • The neural network operation and matrix/vector operation unit includes a vector multiplication component, an accumulation component, and a scalar multiplication component.
  • The unit is responsible for the device's neural network/matrix/vector operations, including but not limited to: convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution.
  • Operation instructions are sent to this operation unit for execution; the sketch below shows how one of the listed operations decomposes onto the unit's components.
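Since the operation unit is built from a vector multiplication component, an accumulation component, and a scalar multiplication component, one of the listed operations, matrix-vector multiplication, can be decomposed onto those primitives as sketched below; the row-major layout and float element type are assumptions for the sketch.

```c
#include <stddef.h>

/* Element-wise vector multiply: the vector multiplication component. */
static void vec_mul(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
}

/* Sum reduction: the accumulation component. */
static float vec_acc(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Matrix-vector multiply y = M x, with M stored row-major (rows x cols):
 * each output element is one vector multiply followed by one accumulate.
 * `scratch` must hold `cols` floats. */
void mat_vec_mul(const float *M, const float *x, float *y,
                 size_t rows, size_t cols, float *scratch) {
    for (size_t r = 0; r < rows; r++) {
        vec_mul(&M[r * cols], x, scratch, cols);
        y[r] = vec_acc(scratch, cols);
    }
}
```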
  • According to an embodiment of the invention, the apparatus further comprises an instruction cache unit for storing operation instructions awaiting execution. An instruction is also cached in the instruction cache unit while it executes. The instruction cache unit may be a reorder buffer.
  • According to an embodiment of the invention, the apparatus further comprises an instruction queue for storing, in order, the decoded neural network operation and matrix/vector operation instructions. Because different instructions may have dependencies on the registers they use, the queue buffers decoded instructions and issues an instruction once its dependencies are satisfied.
  • According to an embodiment of the invention, the apparatus further comprises a dependency processing unit which, before the operation unit obtains an instruction, determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction. If so, the operation instruction is stored in the store queue and is provided to the operation unit only after the preceding instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly.
  • Specifically, when operation instructions access the scratchpad memory, consecutive instructions may access the same block of storage. To guarantee the correctness of execution results, if the current instruction is detected to have a data dependency on an earlier instruction, it must wait in the store queue until that dependency is eliminated.
  • According to an embodiment of the invention, the apparatus further comprises an input/output unit for storing neurons/matrices/vectors into the storage unit and for retrieving operation results from it. The input/output unit can access the storage unit directly and is responsible for reading data from memory and writing data to it.
  • According to an embodiment of the invention, the instruction set used in the apparatus adopts a Load/Store architecture, so the operation unit does not operate on data in memory.
  • The instruction set adopts a very long instruction word architecture; by configuring the instructions differently, it can express complex neural network operations as well as simple matrix/vector operations.
  • The instruction set also uses fixed-length instructions, so that the neural network operation and matrix/vector operation device of the present invention can fetch the next instruction during the decode stage of the previous one.
  • An operation instruction includes at least one opcode and at least three operands, where the opcode indicates the function of the operation instruction (the operation unit performs different operations by recognizing one or more opcodes) and the operands indicate the instruction's data, each being either an immediate value or a register number.
  • For example, to fetch a matrix, the matrix's start address and matrix length are read from the register named by the register number, and the matrix stored at that address is then fetched from the storage unit.
  • A neural network operation instruction includes at least one opcode and 16 operands, where the opcode indicates the function of the neural network operation instruction.
  • The operation unit can perform different neural network operations by recognizing one or more opcodes; the operands indicate the instruction's data, each being either an immediate value or a register number.
  • A matrix-matrix operation instruction includes at least one opcode, which indicates the function of the matrix-matrix operation instruction, and at least four operands; the operation unit can perform different matrix operations by recognizing one or more opcodes.
  • The operands indicate the data of the matrix-matrix operation instruction, each being either an immediate value or a register number.
  • A vector-vector operation instruction includes at least one opcode and at least three operands, where the opcode indicates the function of the vector-vector operation instruction.
  • The operation unit can perform different vector operations by recognizing one or more opcodes; the operands indicate the instruction's data, each being either an immediate value or a register number.
  • A matrix-vector operation instruction includes at least one opcode and at least six operands, where the opcode indicates the function of the matrix-vector operation instruction.
  • The operation unit can perform different matrix-and-vector operations by recognizing one or more opcodes; the operands indicate the instruction's data, each being either an immediate value or a register number. A dispatch sketch follows.
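A minimal sketch of the operation unit selecting an operation by recognizing the opcode, as described above. The opcode names and values are invented for illustration; only the operand counts come from the text.

```c
#include <stdio.h>

enum Opcode {
    OP_NN = 1,  /* neural network instruction: 1 opcode + 16 operands */
    OP_MM = 2,  /* matrix-matrix instruction:  >= 4 operands          */
    OP_VV = 3,  /* vector-vector instruction:  >= 3 operands          */
    OP_MV = 4   /* matrix-vector instruction:  >= 6 operands          */
};

/* The operation unit performs a different operation per recognized opcode. */
void dispatch(enum Opcode op) {
    switch (op) {
    case OP_NN: puts("perform neural network operation"); break;
    case OP_MM: puts("perform matrix-matrix operation");  break;
    case OP_VV: puts("perform vector-vector operation");  break;
    case OP_MV: puts("perform matrix-vector operation");  break;
    default:    puts("unrecognized opcode");              break;
    }
}
```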
  • In one embodiment, the device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a store queue, a reorder buffer, an operation unit, a scratchpad memory, and an IO memory access module:
  • the instruction fetch module is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
  • the decoding module is responsible for decoding the instruction and passing the decoded instruction to the instruction queue. As shown in FIG. 8, the decoding module comprises an instruction accepting module, a micro-instruction generating module, a micro-instruction queue, and a micro-instruction issue module: the instruction accepting module accepts the instruction fetched by the fetch module; the micro-instruction generating module decodes the accepted instruction into the micro-instructions that control each functional component; the micro-instruction queue stores the micro-instructions produced by the micro-instruction generating module; and the micro-instruction issue module issues the micro-instructions to the functional components;
  • the instruction queue buffers the decoded instructions in order and sends them to the dependency processing unit;
  • the scalar register file provides the scalar registers the device requires during computation;
  • the dependency processing unit handles the storage dependencies an instruction may have on the preceding instruction.
  • Matrix operation instructions access the scratchpad memory, and consecutive instructions may access the same block of memory.
  • To guarantee the correctness of execution results, if the current instruction is detected to have a data dependency on an earlier instruction, it must wait in the store queue until the dependency is eliminated.
  • The store queue is an ordered queue: an instruction with a data dependency on an earlier instruction is held in the queue until the dependency is eliminated, after which the instruction is submitted.
  • The reorder buffer caches each instruction while it executes. When an instruction finishes executing, if it is also the oldest uncommitted instruction in the reorder buffer, it is committed; once committed, the changes the instruction makes to the device state can no longer be undone. The entries in the reorder buffer act as placeholders: while the oldest instruction it holds still has an unresolved data dependency, that instruction is not committed (released). Later instructions continue to enter, but only up to the reorder buffer's capacity; once the oldest instruction commits, execution as a whole proceeds smoothly. A toy commit routine is sketched below.
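A toy model of the commit rule just described: an entry commits only when it is both finished and the oldest uncommitted instruction, so younger finished instructions wait as placeholders. The sizes and field names are illustrative, not from the disclosure.

```c
#include <stdbool.h>

/* Toy reorder buffer: a circular queue of in-flight instructions. */
enum { ROB_SIZE = 16 };

typedef struct {
    bool busy;   /* slot holds an in-flight instruction */
    bool done;   /* execution has finished              */
} RobEntry;

typedef struct {
    RobEntry e[ROB_SIZE];
    int head;    /* oldest uncommitted instruction */
    int tail;    /* next free slot                  */
} Rob;

/* Commit as many instructions as possible, oldest first. A finished
 * instruction that is not at the head waits: its results stay buffered
 * until every older instruction has committed. */
void rob_commit(Rob *r) {
    while (r->e[r->head].busy && r->e[r->head].done) {
        r->e[r->head].busy = false;            /* effects become permanent */
        r->head = (r->head + 1) % ROB_SIZE;
    }
}
```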
  • The operation unit is responsible for all neural network operations and matrix/vector operations of the device, including but not limited to: convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution. Operation instructions are sent to this unit for execution.
  • The scratchpad memory is a temporary storage device dedicated to data and can support data of different sizes.
  • The IO memory access module directly accesses the scratchpad memory and is responsible for reading data from it and writing data to it.
  • FIG. 9 is a flowchart of executing a neural network operation and matrix/vector instruction. As shown in FIG. 9, the process includes the following:
  • the instruction fetch module fetches the neural network operation and matrix/vector instruction and sends it to the decoding module;
  • the decoding module decodes the instruction and sends it to the instruction queue;
  • the instruction accepting module sends the instruction to the micro-instruction generating module for micro-instruction generation;
  • the micro-instruction generating module obtains the instruction's neural network opcode and operands from the scalar register file, decodes the instruction into the micro-instructions that control each functional component, and sends them to the micro-instruction issue queue;
  • the instruction is sent to the dependency processing unit;
  • the dependency processing unit analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if it does, the instruction waits in the store queue until the dependency no longer exists;
  • the micro-instructions corresponding to the neural network operation and matrix/vector instruction are sent to the functional components, including the operation unit;
  • the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then performs the neural network operation and/or matrix/vector operation corresponding to the instruction.
  • In summary, the present invention discloses a device and method for neural network operations and matrix/vector operations which, together with the corresponding instructions, addresses the problem in current computing of executing neural network algorithms and large numbers of matrix/vector operations efficiently.
  • The present invention has the advantages of instruction configurability, ease of use, support for flexible neural network and matrix/vector scales, and sufficient on-chip buffering.


Abstract

A device and a method for performing neural network computation and matrix/vector computation, the device comprising a storage unit, a register unit, a control unit, a computation unit, and a scratchpad memory. Neuron/matrix/vector data participating in a computation are temporarily stored in the scratchpad memory, so that data of different widths can be supported more flexibly and efficiently during computation, and the customized neural network computation and matrix/vector computation module can accomplish the various neural network computations and matrix/vector computations more efficiently, improving the execution performance of computation tasks. The instructions used are in a very long instruction word format.

Description

Apparatus and Method for Performing Neural Network Operations and Matrix/Vector Operations

Technical Field
The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing neural network operations and matrix/vector operations.
Background
Artificial neural networks (ANNs), or simply neural networks (NNs), are algorithmic mathematical models that mimic the behavioral characteristics of animal neural networks and perform distributed, parallel information processing. Such a network relies on the complexity of the system, processing information by adjusting the interconnections among a large number of internal nodes. At present, neural networks have made great progress in many fields such as intelligent control and machine learning. Because a neural network is an algorithmic mathematical model involving a large number of mathematical operations, performing neural network operations quickly and accurately is an urgent problem to be solved.
Summary of the Invention
In view of the above, it is an object of the present invention to provide an apparatus and method for performing neural network operations and matrix/vector operations, so as to achieve efficient neural network and matrix/vector computation.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing neural network operations and matrix/vector operations, comprising a storage unit, a register unit, a control unit, an operation unit, and a scratchpad memory, wherein:

the storage unit is for storing neurons/matrices/vectors;

the register unit is for storing neuron addresses/matrix addresses/vector addresses, where a neuron address is the address at which a neuron is stored in the storage unit, a matrix address is the address at which a matrix is stored in the storage unit, and a vector address is the address at which a vector is stored in the storage unit;

the control unit is for performing decoding operations and controlling each unit module according to the instructions it reads;

the operation unit is for obtaining a neuron address/matrix address/vector address from the register unit according to an instruction, fetching the corresponding neuron/matrix/vector from the storage unit at that address, and operating on the neurons/matrices/vectors so obtained and/or on data carried in the instruction, to produce an operation result;

the apparatus being characterized in that the neuron/matrix/vector data participating in the operation unit's computation are temporarily stored in the scratchpad memory, from which the operation unit reads them when needed.
The scratchpad memory can support neuron/matrix/vector data of different sizes.

The register unit is a scalar register file that provides the scalar registers required during computation.

The operation unit comprises a vector multiplication component, an accumulation component, and a scalar multiplication component; and

the operation unit is responsible for the device's neural network/matrix/vector operations, including convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution.
The apparatus further includes an instruction cache unit for storing operation instructions awaiting execution; the instruction cache unit is preferably a reorder buffer. The apparatus also includes an instruction queue for buffering decoded instructions in order and sending them to the dependency processing unit.

The apparatus further includes a dependency processing unit and a store queue. Before the operation unit obtains an instruction, the dependency processing unit determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction; if so, the operation instruction is stored in the store queue and is provided to the operation unit only after the preceding instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly. The store queue stores instructions that have data dependencies on earlier instructions and commits them once the dependencies are resolved.

The instruction set of the apparatus adopts a Load/Store architecture, so the operation unit does not operate on data in memory; the instruction set preferably adopts a very long instruction word architecture and preferably uses fixed-length instructions.
An operation instruction executed by the operation unit includes at least one opcode and at least three operands, where the opcode indicates the function of the instruction (the operation unit performs different operations by recognizing one or more opcodes) and the operands indicate the instruction's data, each operand being either an immediate value or a register number.

Preferably, when the operation instruction is a neural network operation instruction, it includes at least one opcode and 16 operands; when it is a matrix-matrix operation instruction, at least one opcode and at least 4 operands; when it is a vector-vector operation instruction, at least one opcode and at least 3 operands; and when it is a matrix-vector operation instruction, at least one opcode and at least 6 operands.
As another aspect of the present invention, there is also provided an apparatus for performing neural network operations and matrix/vector operations, comprising:

an instruction fetch module for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;

a decoding module for decoding the instruction and passing the decoded instruction to the instruction queue;

an instruction queue for buffering the decoded instructions in order and sending them to the dependency processing unit;

a scalar register file for providing scalar registers for use during computation;

a dependency processing unit for determining whether the current instruction has a data dependency on the preceding instruction and, if so, storing the current instruction in the store queue;

a store queue for buffering a current instruction that has a data dependency on the preceding instruction and issuing it once that dependency is resolved;

a reorder buffer for caching each instruction while it executes and, after execution, determining whether the instruction is the oldest uncommitted instruction in the reorder buffer, committing it if so;

an operation unit for performing all neural network operations and matrix/vector operations;

a scratchpad memory for temporarily storing the neuron/matrix/vector data participating in the operation unit's computation, from which the operation unit reads when needed, the scratchpad memory preferably being able to support data of different sizes;

an IO memory access module for directly accessing the scratchpad memory and reading data from or writing data to it.
As a further aspect of the present invention, there is also provided a method for executing neural network operation and matrix/vector instructions, comprising the following steps:

Step S1: the instruction fetch module fetches a neural network operation and matrix/vector instruction and sends it to the decoding module.

Step S2: the decoding module decodes the instruction and sends it to the instruction queue.

Step S3: within the decoding module, the instruction is sent to the instruction accepting module.

Step S4: the instruction accepting module sends the instruction to the micro-instruction generating module for micro-instruction generation.

Step S5: the micro-instruction generating module obtains the instruction's neural network opcode and operands from the scalar register file, decodes the instruction into the micro-instructions that control each functional component, and sends them to the micro-instruction issue queue.

Step S6: after the required data have been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if it does, the instruction waits in the store queue until the dependency no longer exists.

Step S7: the micro-instructions corresponding to the instruction are sent to the operation unit.

Step S8: the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then completes the neural network operation and/or matrix/vector operation corresponding to the instruction.
Based on the above technical solutions, the neural network operation and matrix/vector operation apparatus and method of the present invention have the following beneficial effects: the data participating in a computation are temporarily stored in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during neural network and matrix/vector operations, while the customized neural network and matrix/vector operation modules implement the various neural network and matrix/vector operations more efficiently, improving the execution performance of computation tasks. The instructions used in the present invention have a very long instruction word format.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention;

FIG. 2 is a schematic diagram of the format of the instruction set of the present invention;

FIG. 3 is a schematic diagram of the format of a neural network operation instruction of the present invention;

FIG. 4 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention;

FIG. 5 is a schematic diagram of the format of a vector-vector operation instruction of the present invention;

FIG. 6 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention;

FIG. 7 is a schematic structural diagram of a neural network operation and matrix/vector operation device according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of the decoding module in a neural network operation and matrix/vector operation device according to an embodiment of the present invention;

FIG. 9 is a flowchart of a neural network operation and matrix/vector operation device executing a neural network operation and matrix/vector instruction according to an embodiment of the present invention.
Detailed Description
The present invention discloses a neural network operation and matrix/vector operation device comprising a storage unit, a register unit, a control unit, and an operation unit. The storage unit stores neurons/matrices/vectors; the register unit stores the addresses at which the neurons/matrices/vectors are stored, along with other parameters; the control unit performs decoding operations and controls each module according to the instructions it reads; and the operation unit, according to the neural network operation and matrix/vector operation instruction, obtains the neuron/matrix/vector addresses and other parameters from the instruction or from the register unit, fetches the corresponding neurons/matrices/vectors from the storage unit at those addresses, and then operates on them to produce an operation result. The present invention temporarily stores the neuron/matrix/vector data participating in a computation in the scratchpad memory, so that the computation can support data of different widths more flexibly and effectively, improving the execution performance of computation tasks.
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention. As shown in FIG. 1, the device includes:

a storage unit for storing neurons/matrices/vectors. In one embodiment, the storage unit may be a scratchpad memory capable of supporting neuron/matrix/vector data of different sizes; the present invention temporarily stores the necessary computation data in a scratchpad memory, enabling the computing device to support data of different widths more flexibly and effectively during neural network and matrix/vector operations;

a register unit for storing neuron/matrix/vector addresses, where a neuron address is the address at which a neuron is stored in the storage unit, a matrix address is the address at which a matrix is stored in the storage unit, and a vector address is the address at which a vector is stored in the storage unit. In one embodiment, the register unit may be a scalar register file providing the scalar registers required during computation; the scalar registers store not only neuron/matrix/vector addresses but also scalar data. When an operation involves both a matrix/vector and a scalar, the operation unit must obtain both the matrix/vector address and the corresponding scalar from the register unit;
a control unit for controlling the behavior of each module in the device. In one embodiment, the control unit reads the prepared instruction, decodes it into several micro-instructions, and sends them to the other modules in the device, which perform the corresponding operations according to the micro-instructions they receive;

an operation unit for executing the various neural network operation and matrix/vector operation instructions: it obtains a neuron/matrix/vector address from the register unit according to the instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, performs the operation on the fetched data, and stores the operation result back in the storage unit. The neural network operation and matrix/vector operation unit includes a vector multiplication component, an accumulation component, and a scalar multiplication component, and is responsible for the device's neural network/matrix/vector operations, including but not limited to: convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution. Operation instructions are sent to this operation unit for execution.
根据本发明的一实施方式,该装置还包括:指令缓存单元,用于存储待执行的运算指令。指令在执行过程中,同时也被缓存在指令缓存单元中,当一条指令执行完之后,如果该指令同时也是指令缓存单元中未被提交指令中最早的一条指令,该指令将被提交,一旦提交,该条指令进行的操作对装置状态的改变将无法撤销。在一实施方式中,指令缓存单元可以是重排序缓存。According to an embodiment of the invention, the apparatus further comprises: an instruction buffer unit for storing the operation instruction to be executed. The instruction is also cached in the instruction cache unit during execution. When an instruction is executed, if the instruction is also the earliest instruction in the uncommitted instruction in the instruction cache unit, the instruction will be submitted once submitted. The operation of this instruction will not be able to cancel the change of the device status. In an embodiment, the instruction cache unit may be a reordering cache.
According to an embodiment of the invention, the device further includes an instruction queue for storing decoded neural network operation and matrix/vector operation instructions in order. Because different instructions may have dependencies on the registers they use, the queue caches decoded instructions and issues an instruction only after its dependencies have been satisfied.
According to an embodiment of the invention, the device further includes a dependency processing unit which, before the operation unit fetches an instruction, determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction. If so, the operation instruction is stored in a store queue and provided to the operation unit only after the preceding operation instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly. Specifically, when operation instructions access the scratchpad memory, consecutive instructions may access the same block of storage. To guarantee the correctness of execution results, if the current instruction is detected to depend on the data of a preceding instruction, it must wait in the store queue until the dependency has been cleared.
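A minimal sketch of this address-overlap check follows, assuming each instruction's scratchpad access can be summarized as a (start address, length) pair; the field names and interval test are hypothetical.

```python
# An instruction is held in the store queue while its access region overlaps
# that of any earlier, unfinished instruction.
def regions_overlap(a_start, a_len, b_start, b_len):
    return a_start < b_start + b_len and b_start < a_start + a_len

def has_dependency(instr, in_flight):
    return any(regions_overlap(instr["addr"], instr["len"],
                               prev["addr"], prev["len"])
               for prev in in_flight)

in_flight = [{"addr": 0x100, "len": 64}]      # earlier, unfinished access
new_instr = {"addr": 0x120, "len": 16}
print(has_dependency(new_instr, in_flight))   # True -> wait in the store queue
```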
According to an embodiment of the invention, the device further includes an input/output unit for storing neurons/matrices/vectors in the storage unit, or for retrieving operation results from the storage unit. The input/output unit can access the storage unit directly and is responsible for reading data from, and writing data to, memory.
According to an embodiment of the invention, the instruction set used by the device adopts a Load/Store architecture, so the operation unit does not operate on data in memory. The instruction set adopts a very long instruction word (VLIW) architecture; with different instruction configurations it can carry out complex neural network operations as well as simple matrix/vector operations. In addition, the instruction set uses fixed-length instructions, so the neural network and matrix/vector operation device of the invention can fetch the next instruction during the decoding stage of the previous instruction.
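As a rough illustration of why fixed-length instructions permit this overlap, the sketch below models a two-stage fetch/decode pipeline in Python: because every instruction occupies the same number of bytes (16 is an assumed width), the next fetch address is known without decoding the current instruction. The simulation is a didactic stand-in, not a description of the actual hardware.

```python
INSTR_BYTES = 16  # assumed fixed instruction width

def pipeline(program):
    """Two-stage overlap: instruction i is decoded while i+1 is fetched."""
    in_decode = None
    pc, cycle = 0, 0
    end = len(program) * INSTR_BYTES
    while in_decode is not None or pc < end:
        if in_decode is not None:
            print(f"cycle {cycle}: decode {program[in_decode // INSTR_BYTES]}")
        if pc < end:
            print(f"cycle {cycle}: fetch  pc={pc:#04x}")
            # next pc is known without decoding -- the payoff of fixed length
            in_decode, pc = pc, pc + INSTR_BYTES
        else:
            in_decode = None
        cycle += 1

pipeline(["CONV_FWD", "MV_MUL", "VV_ADD"])
```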
Fig. 2 is a schematic diagram of the format of an operation instruction of the present invention. As shown in Fig. 2, an operation instruction includes at least one opcode and at least three operands. The opcode indicates the function of the operation instruction, and the operation unit performs different operations by recognizing one or more opcodes. An operand indicates the data information of the operation instruction, where the data information may be an immediate value or a register number. For example, to fetch a matrix, the register number identifies the register from which the matrix start address and matrix length are read, and the matrix stored at the corresponding address is then fetched from the storage unit according to that start address and length.
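A minimal sketch of this operand resolution follows; the register contents, addresses, and list-based stand-in for the storage unit are placeholder assumptions.

```python
storage = list(range(100))       # stand-in for the storage unit
regs = {3: (10, 6)}              # register 3 -> (matrix start address, length)

def fetch_matrix(reg_no):
    start, length = regs[reg_no]              # resolve the register operand
    return storage[start:start + length]      # read the matrix from storage

print(fetch_matrix(3))           # -> [10, 11, 12, 13, 14, 15]
```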
Fig. 3 is a schematic diagram of the format of a neural network operation instruction of the present invention. As shown in Fig. 3, a neural network operation instruction includes at least one opcode and 16 operands. The opcode indicates the function of the neural network operation instruction, and the operation unit performs different neural network operations by recognizing one or more opcodes. The operands indicate the data information of the neural network operation instruction, where the data information may be an immediate value or a register number.
Fig. 4 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention. As shown in Fig. 4, a matrix-matrix operation instruction includes at least one opcode and at least 4 operands. The opcode indicates the function of the matrix-matrix operation instruction, and the operation unit performs different matrix operations by recognizing one or more opcodes. The operands indicate the data information of the matrix-matrix operation instruction, where the data information may be an immediate value or a register number.
Fig. 5 is a schematic diagram of the format of a vector-vector operation instruction of the present invention. As shown in Fig. 5, a vector-vector operation instruction includes at least one opcode and at least 3 operands. The opcode indicates the function of the vector-vector operation instruction, and the operation unit performs different vector operations by recognizing one or more opcodes. The operands indicate the data information of the vector-vector operation instruction, where the data information may be an immediate value or a register number.
Fig. 6 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention. As shown in Fig. 6, a matrix-vector operation instruction includes at least one opcode and at least 6 operands. The opcode indicates the function of the matrix-vector operation instruction, and the operation unit performs different matrix and vector operations by recognizing one or more opcodes. The operands indicate the data information of the matrix-vector operation instruction, where the data information may be an immediate value or a register number.
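Pulling the four formats together, the sketch below encodes one hypothetical instruction of each kind as an (opcode, operands) record that respects the operand counts stated above; the opcode mnemonics and field layout are illustrative assumptions, not encodings taken from the figures.

```python
from collections import namedtuple

Instr = namedtuple("Instr", ["opcode", "operands"])

nn_instr = Instr("NN_CONV_FWD", tuple(range(16)))    # 16 operands
mm_instr = Instr("MM_ADD",      (1, 2, 3, 4))        # >= 4 operands
vv_instr = Instr("VV_DOT",      (1, 2, 3))           # >= 3 operands
mv_instr = Instr("MV_MUL",      (1, 2, 3, 4, 5, 6))  # >= 6 operands

for ins in (nn_instr, mm_instr, vv_instr, mv_instr):
    assert len(ins.operands) >= 3                    # common minimum (Fig. 2)
    print(ins.opcode, len(ins.operands), "operands")
```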
Fig. 7 is a schematic structural diagram of the neural network operation and matrix/vector operation device according to a preferred embodiment of the present invention. As shown in Fig. 7, the device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a store queue, a reorder buffer, an operation unit, a scratchpad, and an IO memory access module;
An instruction fetch module, responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
A decoding module, responsible for decoding instructions and passing the decoded instructions to the instruction queue. As shown in Fig. 8, the decoding module includes an instruction receiving module, a micro-instruction generation module, a micro-instruction queue, and a micro-instruction issue module. The instruction receiving module accepts the instructions fetched by the instruction fetch module; the micro-instruction generation module decodes the instructions received by the instruction receiving module into micro-instructions that control the individual functional components; the micro-instruction queue stores the micro-instructions sent by the micro-instruction generation module; and the micro-instruction issue module issues the micro-instructions to the functional components;
An instruction queue, for caching the decoded instructions in order and sending them to the dependency processing unit;
A scalar register file, providing the scalar registers the device needs during computation;
A dependency processing unit, which handles possible storage dependencies between an instruction and the preceding instruction. A matrix operation instruction accesses the scratchpad memory, and consecutive instructions may access the same block of storage. To guarantee the correctness of execution results, if the current instruction is detected to depend on the data of a preceding instruction, it must wait in the store queue until the dependency has been cleared.
A store queue, which is an ordered queue; an instruction that has a data dependency on a preceding instruction is held in this queue until the dependency is cleared, after which the instruction is submitted.
A reorder buffer, in which instructions are also cached while they execute. When an instruction finishes executing, if it is also the oldest uncommitted instruction in the reorder buffer, it is committed; once committed, the changes the instruction makes to the device state can no longer be undone. Entries in the reorder buffer act as placeholders: when the first instruction it holds has a data dependency, that instruction is not committed (released). Although many later instructions keep arriving, only some of them can be accepted (limited by the reorder buffer size); the overall computation proceeds smoothly only once that first instruction has been committed.
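The in-order commit rule described above can be sketched as follows, assuming each reorder-buffer entry records only an instruction id and a completion flag; this is a didactic model, not the hardware mechanism.

```python
from collections import deque

def commit(rob):
    """Commit from the head only: a stalled head blocks everything younger."""
    committed = []
    while rob and rob[0]["done"]:           # head must be finished
        committed.append(rob.popleft()["id"])
    return committed                        # younger finished entries behind a
                                            # stalled head stay put, as above

rob = deque([{"id": 0, "done": False}, {"id": 1, "done": True}])
print(commit(rob))        # -> []   : head not done, nothing commits
rob[0]["done"] = True
print(commit(rob))        # -> [0, 1]
```

Keeping commit strictly in order is what makes the placeholder behavior safe: state changes become irrevocable only in program order, however instructions complete internally.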
An operation unit, responsible for all of the device's neural network operations and matrix/vector operations, including but not limited to: convolutional neural network forward operations, convolutional neural network training operations, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training operations, batch normalization operations, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product operations, the four elementary vector arithmetic operations, vector logic operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector circular shift operations, and generation of random vectors obeying a given distribution. Operation instructions are sent to this unit for execution;
A scratchpad, a temporary storage device dedicated to data, able to support data of different sizes;
An IO memory access module, used to access the scratchpad memory directly, responsible for reading data from, and writing data to, the scratchpad memory.
Fig. 9 is a flowchart of the operation device executing a neural network operation and matrix/vector operation instruction according to a preferred embodiment of the present invention. As shown in Fig. 9, the process of executing a neural network operation and matrix/vector instruction comprises the following steps (a condensed sketch of the whole flow follows the steps):
S1: the instruction fetch module fetches the neural network operation and matrix/vector instruction and sends it to the decoding module.
S2: the decoding module decodes the instruction and sends it to the instruction queue.
S3: within the decoding module, the instruction is sent to the instruction receiving module.
S4: the instruction receiving module sends the instruction to the micro-instruction generation module for micro-instruction generation.
S5: the micro-instruction generation module obtains the instruction's neural network operation opcode and neural network operation operands from the scalar register file, decodes the instruction into micro-instructions that control the individual functional components, and sends them to the micro-instruction issue queue.
S6: after the required data has been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing. If so, the instruction waits in the store queue until it no longer has any data dependency on unfinished earlier instructions.
S7: once no dependency remains, the micro-instructions corresponding to the neural network operation and matrix/vector instruction are sent to the functional components, such as the operation unit.
S8: the operation unit fetches the required data from the scratchpad according to the data's address and size, and then completes the neural network operation and matrix/vector operation in the operation unit.
S9: when the operation completes, the output data is written back to the specified address in the scratchpad memory, and the instruction in the reorder buffer is committed.
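The condensed sketch promised above ties steps S1 through S9 together as plain Python functions; every piece here (the scratchpad dictionary, the decode helper, the dependency stall, the vector-add body) is a simplified stand-in under assumed data layouts, not the patent's implementation.

```python
scratchpad = {0x10: [1.0, 2.0, 3.0], 0x20: [4.0, 5.0, 6.0]}

def fetch_and_decode(instr):                      # S1-S5: fetch, decode to u-op
    op, dst, a, b = instr
    return {"op": op, "dst": dst, "srcs": (a, b)}

def wait_for_dependencies(u_op, in_flight):       # S6: stall while overlapping
    while any(addr in in_flight for addr in u_op["srcs"] + (u_op["dst"],)):
        in_flight.clear()                         # pretend earlier writes drain

def execute(instr, in_flight):
    u_op = fetch_and_decode(instr)
    wait_for_dependencies(u_op, in_flight)        # S6
    a, b = (scratchpad[s] for s in u_op["srcs"])  # S7-S8: issue, read operands
    result = [x + y for x, y in zip(a, b)]        # S8: e.g. a vector add
    scratchpad[u_op["dst"]] = result              # S9: write back, then commit
    return result

print(execute(("VV_ADD", 0x30, 0x10, 0x20), set()))   # -> [5.0, 7.0, 9.0]
```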
In summary, the present invention discloses a device and method for neural network operations and matrix/vector operations which, together with the corresponding instructions, effectively address the demands of neural network algorithms and the large volume of matrix/vector operations in current computing. Compared with existing conventional solutions, the invention offers configurable instructions, ease of use, flexibility in the supported neural network and matrix/vector sizes, and ample on-chip caching.
The specific embodiments described above further explain the objectives, technical solutions, and benefits of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A device for performing neural network operations and matrix/vector operations, comprising a storage unit, a register unit, a control unit, an operation unit, and a scratchpad memory, wherein:
    the storage unit is configured to store neurons/matrices/vectors;
    the register unit is configured to store neuron addresses/matrix addresses/vector addresses, wherein a neuron address is the address at which a neuron is stored in the storage unit, a matrix address is the address at which a matrix is stored in the storage unit, and a vector address is the address at which a vector is stored in the storage unit;
    the control unit is configured to perform decoding operations and to control the unit modules according to the instructions it reads;
    the operation unit is configured to obtain a neuron address/matrix address/vector address from the register unit according to an instruction, to fetch the corresponding neuron/matrix/vector from the storage unit according to that neuron address/matrix address/vector address, and to perform an operation on the neuron/matrix/vector so obtained and/or the data carried in the instruction to obtain an operation result;
    characterized in that the neuron/matrix/vector data participating in the operation unit's computation is temporarily stored in the scratchpad memory, from which the operation unit reads it when needed.
  2. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the scratchpad memory can support neuron/matrix/vector data of different sizes.
  3. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the register unit is a scalar register file, providing the scalar registers required during computation.
  4. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the operation unit includes a vector multiplication component, an accumulation component, and a scalar multiplication component; and
    the operation unit is responsible for the device's neural network/matrix/vector operations, including convolutional neural network forward operations, convolutional neural network training operations, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training operations, batch normalization operations, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product operations, vector inner product operations, the four elementary vector arithmetic operations, vector logic operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector circular shift operations, and generation of random vectors obeying a given distribution.
  5. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the device further comprises an instruction cache unit for storing operation instructions awaiting execution, the instruction cache unit preferably being a reorder buffer; and
    the device further comprises an instruction queue for caching the decoded instructions in order and sending them to the dependency processing unit.
  6. The device for performing neural network operations and matrix/vector operations according to claim 5, characterized in that the device further comprises a dependency processing unit and a store queue, the dependency processing unit being configured to determine, before the operation unit fetches an instruction, whether the operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction, and if so, to store the operation instruction in the store queue and, after the preceding operation instruction has finished executing, to provide the operation instruction in the store queue to the operation unit, and otherwise to provide the operation instruction to the operation unit directly; and the store queue being configured to store instructions that have data dependencies on preceding instructions, and to submit such an instruction once the dependency has been cleared.
  7. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the instruction set of the device adopts a Load/Store architecture, and the operation unit does not operate on data in memory; and
    the instruction set of the device preferably adopts a very long instruction word architecture and preferably uses fixed-length instructions.
  8. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that an operation instruction executed by the operation unit includes at least one opcode and at least 3 operands, wherein the opcode indicates the function of the operation instruction, the operation unit performs different operations by recognizing one or more opcodes, and the operands indicate the data information of the operation instruction, the data information being an immediate value or a register number.
    Preferably, when the operation instruction is a neural network operation instruction, the neural network operation instruction includes at least one opcode and 16 operands;
    preferably, when the operation instruction is a matrix-matrix operation instruction, the matrix-matrix operation instruction includes at least one opcode and at least 4 operands;
    preferably, when the operation instruction is a vector-vector operation instruction, the vector-vector operation instruction includes at least one opcode and at least 3 operands;
    preferably, when the operation instruction is a matrix-vector operation instruction, the matrix-vector operation instruction includes at least one opcode and at least 6 operands.
  9. A device for performing neural network operations and matrix/vector operations, characterized by comprising:
    an instruction fetch module, for fetching the next instruction to be executed from the instruction sequence and passing the instruction to the decoding module;
    a decoding module, for decoding the instruction and passing the decoded instruction to the instruction queue;
    an instruction queue, for caching the instructions decoded by the decoding module in order and sending them to the dependency processing unit;
    a scalar register file, for providing scalar registers for use in operations;
    a dependency processing unit, for determining whether the current instruction has a data dependency on the preceding instruction and, if so, storing the current instruction in the store queue;
    a store queue, for caching a current instruction that has a data dependency on the preceding instruction, and issuing the current instruction once its dependency on the preceding instruction has been cleared;
    a reorder buffer, for caching instructions while they execute and, once an instruction has finished executing, determining whether it is the oldest uncommitted instruction in the reorder buffer and, if so, committing it;
    an operation unit, for performing all neural network operations and matrix/vector operations;
    a scratchpad memory, for temporarily storing the neuron/matrix/vector data participating in the operation unit's computation, from which the operation unit reads it when needed, the scratchpad memory preferably being able to support data of different sizes;
    an IO memory access module, for directly accessing the scratchpad memory, responsible for reading data from, and writing data to, the scratchpad memory.
  10. A method of executing neural network operations and matrix/vector instructions, characterized by comprising the following steps:
    step S1: an instruction fetch module fetches a neural network operation and matrix/vector instruction and sends the instruction to a decoding module;
    step S2: the decoding module decodes the instruction and sends the instruction to an instruction queue;
    step S3: within the decoding module, the instruction is sent to an instruction receiving module;
    step S4: the instruction receiving module sends the instruction to a micro-instruction generation module for micro-instruction generation;
    step S5: the micro-instruction generation module obtains the instruction's neural network operation opcode and neural network operation operands from a scalar register file, and decodes the instruction into micro-instructions that control the individual functional components, sending them to a micro-instruction issue queue;
    step S6: after the required data has been obtained, the instruction is sent to a dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if so, the instruction waits in a store queue until it no longer has any data dependency on unfinished earlier instructions;
    step S7: the micro-instructions corresponding to the instruction are sent to an operation unit;
    step S8: the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then completes the neural network operation and/or matrix/vector operation corresponding to the instruction in the operation unit.