
WO2017185418A1 - Device and method for performing neural network computation and matrix/vector computation - Google Patents

Device and method for performing neural network computation and matrix/vector computation

Info

Publication number
WO2017185418A1
WO2017185418A1 · PCT/CN2016/082015 · CN2016082015W
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
vector
matrix
neural network
unit
Application number
PCT/CN2016/082015
Other languages
French (fr)
Chinese (zh)
Inventor
陶劲桦
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Application filed by 北京中科寒武纪科技有限公司
Publication of WO2017185418A1

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 9/00: Arrangements for program control, e.g. control units
                    • G06F 9/06: using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
                            • G06F 9/30003: Arrangements for executing specific machine instructions
                                • G06F 9/30007: to perform operations on data operands
                                    • G06F 9/3001: Arithmetic instructions
                            • G06F 9/30098: Register arrangements
                                • G06F 9/3012: Organisation of register space, e.g. banked or distributed register file
                                    • G06F 9/30134: Register stacks; shift registers
                            • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
                • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
                    • G06F 17/10: Complex mathematical operations
                        • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                            • G06N 3/063: Physical realisation using electronic means

Definitions

  • The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing neural network operations and matrix/vector operations.
  • Artificial neural networks (ANNs), or simply neural networks (NNs), are algorithmic mathematical models that mimic the behavioral characteristics of animal neural networks and perform distributed, parallel information processing. Such a network relies on the complexity of the system, processing information by adjusting the interconnections among a large number of internal nodes.
  • Neural networks have made great progress in many fields such as intelligent control and machine learning. Because a neural network is an algorithmic mathematical model involving a large number of mathematical operations, performing neural network operations quickly and accurately is an urgent problem to be solved.
  • It is therefore an object of the present invention to provide an apparatus and method for performing neural network operations and matrix/vector operations, so as to achieve efficient neural network and matrix/vector computation.
  • To that end, the present invention provides an apparatus for performing neural network operations and matrix/vector operations, comprising a storage unit, a register unit, a control unit, an operation unit, and a scratchpad memory, wherein:
  • the storage unit stores neurons/matrices/vectors;
  • the register unit stores neuron addresses/matrix addresses/vector addresses, where a neuron address, matrix address, or vector address is the address at which the corresponding neuron, matrix, or vector is stored in the storage unit;
  • the control unit performs decoding operations and controls each unit module according to the instructions it reads;
  • the operation unit obtains a neuron address/matrix address/vector address from the register unit according to an instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, and operates on the neurons/matrices/vectors so obtained and/or on data carried in the instruction, producing an operation result;
  • the apparatus is characterized in that the neuron/matrix/vector data participating in the operation unit's computation are temporarily stored in the scratchpad memory, from which the operation unit reads them when needed.
  • The scratchpad memory can support neuron/matrix/vector data of different sizes.
  • The register unit is a scalar register file that provides the scalar registers required during computation.
  • The operation unit comprises a vector multiplication component, an accumulation component, and a scalar multiplication component.
  • The operation unit is responsible for the device's neural network/matrix/vector operations, including convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product, vector inner product, element-wise vector arithmetic, and vector logical operations.
  • The device further includes an instruction cache unit for storing operation instructions awaiting execution; the instruction cache unit is preferably a reorder buffer.
  • The apparatus also includes an instruction queue that buffers decoded instructions in order and sends them to the dependency processing unit.
  • The device further includes a dependency processing unit and a store queue. Before the operation unit obtains an instruction, the dependency processing unit determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction; if so, the operation instruction is stored in the store queue and is provided to the operation unit only after the preceding instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly. A minimal sketch of this check follows.
  • The store queue stores instructions that have data dependencies on earlier instructions, and commits them once the dependencies are resolved.
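A minimal sketch of the dependency check described above, assuming each instruction's scratchpad accesses can be modeled as [base, base + len) byte ranges; the types and function names are hypothetical rather than taken from the disclosure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each instruction reads/writes ranges of the
 * scratchpad, described by a base address and a length in bytes. */
typedef struct {
    size_t base;
    size_t len;
} MemRange;

typedef struct {
    MemRange src[2];   /* ranges read by the instruction   */
    MemRange dst;      /* range written by the instruction */
} Access;

static bool ranges_overlap(MemRange a, MemRange b) {
    return a.base < b.base + b.len && b.base < a.base + a.len;
}

/* True if `cur` must wait in the store queue: it reads storage the
 * still-executing predecessor writes, or writes storage the predecessor
 * reads or writes (RAW/WAR/WAW hazards). */
bool has_dependency(const Access *prev, const Access *cur) {
    for (int i = 0; i < 2; i++) {
        if (ranges_overlap(prev->dst, cur->src[i])) return true;  /* RAW */
        if (ranges_overlap(cur->dst, prev->src[i])) return true;  /* WAR */
    }
    return ranges_overlap(prev->dst, cur->dst);                   /* WAW */
}
```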
  • The device's instruction set adopts a Load/Store architecture, so the operation unit does not operate on data in memory.
  • The instruction set preferably adopts a very long instruction word (VLIW) architecture, and preferably uses fixed-length instructions.
  • An operation instruction executed by the operation unit includes at least one opcode and at least three operands. The opcode indicates the function of the operation instruction, and the operation unit performs different operations by recognizing one or more opcodes.
  • The operands indicate the data of the operation instruction; each operand is either an immediate value or a register number. A hypothetical encoding is sketched below.
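The sketch below gives one possible fixed-length encoding consistent with this description; the field widths and the 16-slot operand array are assumptions, since the text only fixes the operand counts.

```c
#include <stdint.h>

/* Hypothetical fixed-length instruction word. The disclosure specifies
 * only "at least one opcode and at least three operands", each operand
 * being an immediate or a register number; the widths chosen here are
 * illustrative, not taken from the patent. */
typedef struct {
    uint16_t opcode;      /* selects the operation to perform */
    struct {
        uint8_t  is_imm;  /* 1: `value` is an immediate; 0: a register number */
        uint32_t value;
    } operand[16];        /* 16 slots: enough for a neural network instruction;
                             matrix-matrix uses >= 4, vector-vector >= 3,
                             matrix-vector >= 6 of them */
} Instruction;
```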
  • When the operation instruction is a neural network operation instruction, it includes at least one opcode and 16 operands.
  • When the operation instruction is a matrix-matrix operation instruction, it includes at least one opcode and at least 4 operands.
  • When the operation instruction is a vector-vector operation instruction, it includes at least one opcode and at least 3 operands.
  • When the operation instruction is a matrix-vector operation instruction, it includes at least one opcode and at least 6 operands.
  • As another aspect, the present invention further provides an apparatus for performing neural network operations and matrix/vector operations, comprising:
  • an instruction fetch module, which fetches the next instruction to be executed from the instruction sequence and passes it to the decoding module;
  • a decoding module, which decodes the instruction and passes the decoded instruction to the instruction queue;
  • an instruction queue, which buffers the decoding module's decoded instructions in order and sends them to the dependency processing unit;
  • a scalar register file, which provides scalar registers for use during computation;
  • a dependency processing unit, which determines whether the current instruction has a data dependency on the preceding instruction and, if so, stores the current instruction in the store queue;
  • a store queue, which buffers a current instruction that has a data dependency on the preceding instruction and issues it once that dependency is resolved;
  • a reorder buffer, which caches each instruction while it executes and, after execution, determines whether the instruction is the oldest uncommitted instruction in the reorder buffer, committing it if so;
  • an operation unit, which performs all neural network operations and matrix/vector operations;
  • a scratchpad memory, which temporarily stores the neuron/matrix/vector data participating in the operation unit's computation, from which the operation unit reads when needed; the scratchpad memory preferably supports data of different sizes;
  • an IO memory access module, which directly accesses the scratchpad memory and is responsible for reading data from and writing data to it (a sketch follows).
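A minimal sketch of the scratchpad plus IO access module, assuming the scratchpad is modeled as a flat byte array so that data blocks of any size (neurons, matrices, vectors of different widths) can be moved in and out; names and sizes are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flat scratchpad: addressing is by byte offset, so blocks
 * of arbitrary size can be read or written at a given address. */
enum { SCRATCHPAD_BYTES = 1 << 20 };
static uint8_t scratchpad[SCRATCHPAD_BYTES];

/* Copy `len` bytes out of the scratchpad (scratchpad -> host). */
void io_read(void *dst, size_t addr, size_t len) {
    memcpy(dst, &scratchpad[addr], len);
}

/* Copy `len` bytes into the scratchpad (host -> scratchpad). */
void io_write(size_t addr, const void *src, size_t len) {
    memcpy(&scratchpad[addr], src, len);
}
```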
  • As a further aspect, the present invention also provides a method for executing neural network operation and matrix/vector instructions, comprising the following steps (a toy walk-through of the step ordering follows the list):
  • Step S1: the instruction fetch module fetches a neural network operation and matrix/vector instruction and sends it to the decoding module.
  • Step S2: the decoding module decodes the instruction and sends it to the instruction queue.
  • Step S3: within the decoding module, the instruction is sent to the instruction accepting module.
  • Step S4: the instruction accepting module sends the instruction to the micro-instruction generating module for micro-instruction generation.
  • Step S5: the micro-instruction generating module obtains the instruction's neural network opcode and operands from the scalar register file, decodes the instruction into the micro-instructions that control each functional component, and sends them to the micro-instruction issue queue.
  • Step S6: after the required data have been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if it does, the instruction waits in the store queue until the dependency no longer exists.
  • Step S7: the micro-instructions corresponding to the instruction are sent to the operation unit.
  • Step S8: the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then completes the neural network operation and/or matrix/vector operation corresponding to the instruction.
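A toy, single-instruction walk through steps S1 to S8. Every function is a logging stand-in for a hardware module, not an API from the disclosure; the sketch only makes the stage ordering concrete.

```c
#include <stdio.h>

typedef struct { int opcode; int operands[16]; } Instr;

static Instr fetch(void)                 { puts("S1: fetch instruction");              return (Instr){0}; }
static Instr decode(Instr i)             { puts("S2-S4: decode, accept, generate uops"); return i; }
static void  read_scalar_regs(Instr *i)  { puts("S5: read opcode/operands from scalar register file"); }
static int   depends_on_pending(Instr *i){ puts("S6: dependency check");               return 0; }
static void  issue(Instr i)              { puts("S7: issue micro-instructions to operation unit"); }
static void  execute(Instr i)            { puts("S8: load operands from scratchpad and compute"); }

int main(void) {
    Instr i = fetch();
    i = decode(i);
    read_scalar_regs(&i);
    while (depends_on_pending(&i)) { /* would wait in the store queue */ }
    issue(i);
    execute(i);
    return 0;
}
```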
  • The neural network operation and matrix/vector operation apparatus and method of the present invention have the following beneficial effects: the data participating in a computation are temporarily stored in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during neural network and matrix/vector operations, while the customized neural network and matrix/vector operation module implements the various neural network and matrix/vector operations more efficiently, improving the execution performance of computation tasks.
  • The instructions used in the present invention have a very long instruction word format.
  • FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention.
  • FIG. 2 is a schematic diagram of the format of the instruction set of the present invention.
  • FIG. 3 is a schematic diagram of the format of a neural network operation instruction of the present invention.
  • FIG. 4 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention.
  • FIG. 5 is a schematic diagram of the format of a vector-vector operation instruction of the present invention.
  • FIG. 6 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention.
  • FIG. 7 is a schematic structural diagram of a neural network operation and matrix/vector operation device according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of the decoding module in a neural network operation and matrix/vector operation device according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of a neural network operation and matrix/vector operation device executing a neural network operation and matrix/vector instruction according to an embodiment of the present invention.
  • The invention discloses a neural network operation and matrix/vector operation device comprising a storage unit, a register unit, a control unit, and an operation unit.
  • The storage unit stores neurons/matrices/vectors.
  • The register unit stores the addresses at which the neurons/matrices/vectors are stored, along with other parameters.
  • The control unit performs decoding operations and controls each module according to the instructions it reads.
  • The operation unit, according to the neural network operation and matrix/vector operation instruction, obtains the neuron/matrix/vector addresses and other parameters from the instruction or from the register unit, fetches the corresponding neurons/matrices/vectors from the storage unit at those addresses, and then operates on the fetched neurons/matrices/vectors to produce an operation result.
  • The invention temporarily stores the neuron/matrix/vector data participating in a computation in the scratchpad memory, so that the computation can support data of different widths more flexibly and effectively, improving the execution performance of computation tasks.
  • FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention. As shown in FIG. 1, the device includes:
  • a storage unit for storing neurons/matrices/vectors. In one embodiment, the storage unit may be a scratchpad memory capable of supporting neuron/matrix/vector data of different sizes; the present invention temporarily stores the necessary computation data in a scratchpad memory, enabling the computing device to support data of different widths more flexibly and effectively during neural network and matrix/vector operations;
  • a register unit, which may be a scalar register file providing the scalar registers required during computation. The scalar registers store not only neuron/matrix/vector addresses but also scalar data. When an operation involves both a matrix/vector and a scalar, the operation unit must obtain both the matrix/vector address and the corresponding scalar from the register unit;
  • a control unit for controlling the behavior of each module in the device. In one embodiment, the control unit reads the prepared instruction, decodes it into several micro-instructions, and sends them to the other modules in the device, which perform the corresponding operations according to the micro-instructions they receive;
  • an operation unit for executing the various neural network operation and matrix/vector operation instructions: it obtains a neuron/matrix/vector address from the register unit according to the instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, performs the operation on the fetched data, and stores the operation result back in the storage unit.
  • The neural network operation and matrix/vector operation unit includes a vector multiplication component, an accumulation component, and a scalar multiplication component.
  • The unit is responsible for the device's neural network/matrix/vector operations, including but not limited to: convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution.
  • Operation instructions are sent to this operation unit for execution; the sketch below shows how one of the listed operations decomposes onto the unit's components.
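Since the operation unit is built from a vector multiplication component, an accumulation component, and a scalar multiplication component, one of the listed operations, matrix-vector multiplication, can be decomposed onto those primitives as sketched below; the row-major layout and float element type are assumptions for the sketch.

```c
#include <stddef.h>

/* Element-wise vector multiply: the vector multiplication component. */
static void vec_mul(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] * b[i];
}

/* Sum reduction: the accumulation component. */
static float vec_acc(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Matrix-vector multiply y = M x, with M stored row-major (rows x cols):
 * each output element is one vector multiply followed by one accumulate.
 * `scratch` must hold `cols` floats. */
void mat_vec_mul(const float *M, const float *x, float *y,
                 size_t rows, size_t cols, float *scratch) {
    for (size_t r = 0; r < rows; r++) {
        vec_mul(&M[r * cols], x, scratch, cols);
        y[r] = vec_acc(scratch, cols);
    }
}
```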
  • According to an embodiment of the invention, the apparatus further comprises an instruction cache unit for storing operation instructions awaiting execution. An instruction is also cached in the instruction cache unit while it executes. The instruction cache unit may be a reorder buffer.
  • According to an embodiment of the invention, the apparatus further comprises an instruction queue for storing, in order, the decoded neural network operation and matrix/vector operation instructions. Because different instructions may have dependencies on the registers they use, the queue buffers decoded instructions and issues an instruction once its dependencies are satisfied.
  • According to an embodiment of the invention, the apparatus further comprises a dependency processing unit which, before the operation unit obtains an instruction, determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction. If so, the operation instruction is stored in the store queue and is provided to the operation unit only after the preceding instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly.
  • Specifically, when operation instructions access the scratchpad memory, consecutive instructions may access the same block of storage. To guarantee the correctness of execution results, if the current instruction is detected to have a data dependency on an earlier instruction, it must wait in the store queue until that dependency is eliminated.
  • According to an embodiment of the invention, the apparatus further comprises an input/output unit for storing neurons/matrices/vectors into the storage unit and for retrieving operation results from it. The input/output unit can access the storage unit directly and is responsible for reading data from memory and writing data to it.
  • According to an embodiment of the invention, the instruction set used in the apparatus adopts a Load/Store architecture, so the operation unit does not operate on data in memory.
  • The instruction set adopts a very long instruction word architecture; by configuring the instructions differently, it can express complex neural network operations as well as simple matrix/vector operations.
  • The instruction set also uses fixed-length instructions, so that the neural network operation and matrix/vector operation device of the present invention can fetch the next instruction during the decode stage of the previous one.
  • An operation instruction includes at least one opcode and at least three operands, where the opcode indicates the function of the operation instruction (the operation unit performs different operations by recognizing one or more opcodes) and the operands indicate the instruction's data, each being either an immediate value or a register number.
  • For example, to fetch a matrix, the matrix's start address and matrix length are read from the register named by the register number, and the matrix stored at that address is then fetched from the storage unit.
  • A neural network operation instruction includes at least one opcode and 16 operands, where the opcode indicates the function of the neural network operation instruction.
  • The operation unit can perform different neural network operations by recognizing one or more opcodes; the operands indicate the instruction's data, each being either an immediate value or a register number.
  • A matrix-matrix operation instruction includes at least one opcode, which indicates the function of the matrix-matrix operation instruction, and at least four operands; the operation unit can perform different matrix operations by recognizing one or more opcodes.
  • The operands indicate the data of the matrix-matrix operation instruction, each being either an immediate value or a register number.
  • A vector-vector operation instruction includes at least one opcode and at least three operands, where the opcode indicates the function of the vector-vector operation instruction.
  • The operation unit can perform different vector operations by recognizing one or more opcodes; the operands indicate the instruction's data, each being either an immediate value or a register number.
  • A matrix-vector operation instruction includes at least one opcode and at least six operands, where the opcode indicates the function of the matrix-vector operation instruction.
  • The operation unit can perform different matrix-and-vector operations by recognizing one or more opcodes; the operands indicate the instruction's data, each being either an immediate value or a register number. A dispatch sketch follows.
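A minimal sketch of the operation unit selecting an operation by recognizing the opcode, as described above. The opcode names and values are invented for illustration; only the operand counts come from the text.

```c
#include <stdio.h>

enum Opcode {
    OP_NN = 1,  /* neural network instruction: 1 opcode + 16 operands */
    OP_MM = 2,  /* matrix-matrix instruction:  >= 4 operands          */
    OP_VV = 3,  /* vector-vector instruction:  >= 3 operands          */
    OP_MV = 4   /* matrix-vector instruction:  >= 6 operands          */
};

/* The operation unit performs a different operation per recognized opcode. */
void dispatch(enum Opcode op) {
    switch (op) {
    case OP_NN: puts("perform neural network operation"); break;
    case OP_MM: puts("perform matrix-matrix operation");  break;
    case OP_VV: puts("perform vector-vector operation");  break;
    case OP_MV: puts("perform matrix-vector operation");  break;
    default:    puts("unrecognized opcode");              break;
    }
}
```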
  • In one embodiment, the device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a store queue, a reorder buffer, an operation unit, a scratchpad memory, and an IO memory access module:
  • the instruction fetch module is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
  • the decoding module is responsible for decoding the instruction and passing the decoded instruction to the instruction queue. As shown in FIG. 8, the decoding module comprises an instruction accepting module, a micro-instruction generating module, a micro-instruction queue, and a micro-instruction issue module: the instruction accepting module accepts the instruction fetched by the fetch module; the micro-instruction generating module decodes the accepted instruction into the micro-instructions that control each functional component; the micro-instruction queue stores the micro-instructions produced by the micro-instruction generating module; and the micro-instruction issue module issues the micro-instructions to the functional components;
  • the instruction queue buffers the decoded instructions in order and sends them to the dependency processing unit;
  • the scalar register file provides the scalar registers the device requires during computation;
  • the dependency processing unit handles the storage dependencies an instruction may have on the preceding instruction.
  • Matrix operation instructions access the scratchpad memory, and consecutive instructions may access the same block of memory.
  • To guarantee the correctness of execution results, if the current instruction is detected to have a data dependency on an earlier instruction, it must wait in the store queue until the dependency is eliminated.
  • The store queue is an ordered queue: an instruction with a data dependency on an earlier instruction is held in the queue until the dependency is eliminated, after which the instruction is submitted.
  • The reorder buffer caches each instruction while it executes. When an instruction finishes executing, if it is also the oldest uncommitted instruction in the reorder buffer, it is committed; once committed, the changes the instruction makes to the device state can no longer be undone. The entries in the reorder buffer act as placeholders: while the oldest instruction it holds still has an unresolved data dependency, that instruction is not committed (released). Later instructions continue to enter, but only up to the reorder buffer's capacity; once the oldest instruction commits, execution as a whole proceeds smoothly. A toy commit routine is sketched below.
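A toy model of the commit rule just described: an entry commits only when it is both finished and the oldest uncommitted instruction, so younger finished instructions wait as placeholders. The sizes and field names are illustrative, not from the disclosure.

```c
#include <stdbool.h>

/* Toy reorder buffer: a circular queue of in-flight instructions. */
enum { ROB_SIZE = 16 };

typedef struct {
    bool busy;   /* slot holds an in-flight instruction */
    bool done;   /* execution has finished              */
} RobEntry;

typedef struct {
    RobEntry e[ROB_SIZE];
    int head;    /* oldest uncommitted instruction */
    int tail;    /* next free slot                  */
} Rob;

/* Commit as many instructions as possible, oldest first. A finished
 * instruction that is not at the head waits: its results stay buffered
 * until every older instruction has committed. */
void rob_commit(Rob *r) {
    while (r->e[r->head].busy && r->e[r->head].done) {
        r->e[r->head].busy = false;            /* effects become permanent */
        r->head = (r->head + 1) % ROB_SIZE;
    }
}
```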
  • The operation unit is responsible for all neural network operations and matrix/vector operations of the device, including but not limited to: convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution. Operation instructions are sent to this unit for execution.
  • The scratchpad memory is a temporary storage device dedicated to data and can support data of different sizes.
  • The IO memory access module directly accesses the scratchpad memory and is responsible for reading data from it and writing data to it.
  • FIG. 9 is a flowchart of executing a neural network operation and matrix/vector instruction. As shown in FIG. 9, the process includes the following:
  • the instruction fetch module fetches the neural network operation and matrix/vector instruction and sends it to the decoding module;
  • the decoding module decodes the instruction and sends it to the instruction queue;
  • the instruction accepting module sends the instruction to the micro-instruction generating module for micro-instruction generation;
  • the micro-instruction generating module obtains the instruction's neural network opcode and operands from the scalar register file, decodes the instruction into the micro-instructions that control each functional component, and sends them to the micro-instruction issue queue;
  • the instruction is sent to the dependency processing unit;
  • the dependency processing unit analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if it does, the instruction waits in the store queue until the dependency no longer exists;
  • the micro-instructions corresponding to the neural network operation and matrix/vector instruction are sent to the functional components, including the operation unit;
  • the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then performs the neural network operation and/or matrix/vector operation corresponding to the instruction.
  • In summary, the present invention discloses a device and method for neural network operations and matrix/vector operations which, together with the corresponding instructions, addresses the problem in current computing of executing neural network algorithms and large numbers of matrix/vector operations efficiently.
  • The present invention has the advantages of instruction configurability, ease of use, support for flexible neural network and matrix/vector scales, and sufficient on-chip buffering.


Abstract

A device and a method for performing neural network computation and matrix/vector computation, the device comprising a storage unit, a register unit, a control unit, a computation unit, and a scratchpad memory. Neuron/matrix/vector data participating in a computation are temporarily stored in the scratchpad memory, so that data of different widths can be supported more flexibly and efficiently during computation, and the customized neural network computation and matrix/vector computation module can accomplish the various neural network computations and matrix/vector computations more efficiently, improving the execution performance of computation tasks. The instructions used are in a very long instruction word format.

Description

Apparatus and Method for Performing Neural Network Operations and Matrix/Vector Operations

Technical Field
The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing neural network operations and matrix/vector operations.
Background
Artificial neural networks (ANNs), or simply neural networks (NNs), are algorithmic mathematical models that mimic the behavioral characteristics of animal neural networks and perform distributed, parallel information processing. Such a network relies on the complexity of the system, processing information by adjusting the interconnections among a large number of internal nodes. At present, neural networks have made great progress in many fields such as intelligent control and machine learning. Because a neural network is an algorithmic mathematical model involving a large number of mathematical operations, performing neural network operations quickly and accurately is an urgent problem to be solved.
Summary of the Invention
In view of the above, it is an object of the present invention to provide an apparatus and method for performing neural network operations and matrix/vector operations, so as to achieve efficient neural network and matrix/vector computation.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing neural network operations and matrix/vector operations, comprising a storage unit, a register unit, a control unit, an operation unit, and a scratchpad memory, wherein:

the storage unit is for storing neurons/matrices/vectors;

the register unit is for storing neuron addresses/matrix addresses/vector addresses, where a neuron address is the address at which a neuron is stored in the storage unit, a matrix address is the address at which a matrix is stored in the storage unit, and a vector address is the address at which a vector is stored in the storage unit;

the control unit is for performing decoding operations and controlling each unit module according to the instructions it reads;

the operation unit is for obtaining a neuron address/matrix address/vector address from the register unit according to an instruction, fetching the corresponding neuron/matrix/vector from the storage unit at that address, and operating on the neurons/matrices/vectors so obtained and/or on data carried in the instruction, to produce an operation result;

the apparatus being characterized in that the neuron/matrix/vector data participating in the operation unit's computation are temporarily stored in the scratchpad memory, from which the operation unit reads them when needed.
The scratchpad memory can support neuron/matrix/vector data of different sizes.

The register unit is a scalar register file that provides the scalar registers required during computation.

The operation unit comprises a vector multiplication component, an accumulation component, and a scalar multiplication component; and

the operation unit is responsible for the device's neural network/matrix/vector operations, including convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution.
The apparatus further includes an instruction cache unit for storing operation instructions awaiting execution; the instruction cache unit is preferably a reorder buffer. The apparatus also includes an instruction queue for buffering decoded instructions in order and sending them to the dependency processing unit.

The apparatus further includes a dependency processing unit and a store queue. Before the operation unit obtains an instruction, the dependency processing unit determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction; if so, the operation instruction is stored in the store queue and is provided to the operation unit only after the preceding instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly. The store queue stores instructions that have data dependencies on earlier instructions and commits them once the dependencies are resolved.

The instruction set of the apparatus adopts a Load/Store architecture, so the operation unit does not operate on data in memory; the instruction set preferably adopts a very long instruction word architecture and preferably uses fixed-length instructions.
An operation instruction executed by the operation unit includes at least one opcode and at least three operands, where the opcode indicates the function of the instruction (the operation unit performs different operations by recognizing one or more opcodes) and the operands indicate the instruction's data, each operand being either an immediate value or a register number.

Preferably, when the operation instruction is a neural network operation instruction, it includes at least one opcode and 16 operands; when it is a matrix-matrix operation instruction, at least one opcode and at least 4 operands; when it is a vector-vector operation instruction, at least one opcode and at least 3 operands; and when it is a matrix-vector operation instruction, at least one opcode and at least 6 operands.
As another aspect of the present invention, there is also provided an apparatus for performing neural network operations and matrix/vector operations, comprising:

an instruction fetch module for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;

a decoding module for decoding the instruction and passing the decoded instruction to the instruction queue;

an instruction queue for buffering the decoded instructions in order and sending them to the dependency processing unit;

a scalar register file for providing scalar registers for use during computation;

a dependency processing unit for determining whether the current instruction has a data dependency on the preceding instruction and, if so, storing the current instruction in the store queue;

a store queue for buffering a current instruction that has a data dependency on the preceding instruction and issuing it once that dependency is resolved;

a reorder buffer for caching each instruction while it executes and, after execution, determining whether the instruction is the oldest uncommitted instruction in the reorder buffer, committing it if so;

an operation unit for performing all neural network operations and matrix/vector operations;

a scratchpad memory for temporarily storing the neuron/matrix/vector data participating in the operation unit's computation, from which the operation unit reads when needed, the scratchpad memory preferably being able to support data of different sizes;

an IO memory access module for directly accessing the scratchpad memory and reading data from or writing data to it.
As a further aspect of the present invention, there is also provided a method for executing neural network operation and matrix/vector instructions, comprising the following steps:

Step S1: the instruction fetch module fetches a neural network operation and matrix/vector instruction and sends it to the decoding module.

Step S2: the decoding module decodes the instruction and sends it to the instruction queue.

Step S3: within the decoding module, the instruction is sent to the instruction accepting module.

Step S4: the instruction accepting module sends the instruction to the micro-instruction generating module for micro-instruction generation.

Step S5: the micro-instruction generating module obtains the instruction's neural network opcode and operands from the scalar register file, decodes the instruction into the micro-instructions that control each functional component, and sends them to the micro-instruction issue queue.

Step S6: after the required data have been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if it does, the instruction waits in the store queue until the dependency no longer exists.

Step S7: the micro-instructions corresponding to the instruction are sent to the operation unit.

Step S8: the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then completes the neural network operation and/or matrix/vector operation corresponding to the instruction.
Based on the above technical solutions, the neural network operation and matrix/vector operation apparatus and method of the present invention have the following beneficial effects: the data participating in a computation are temporarily stored in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during neural network and matrix/vector operations, while the customized neural network and matrix/vector operation modules implement the various neural network and matrix/vector operations more efficiently, improving the execution performance of computation tasks. The instructions used in the present invention have a very long instruction word format.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention;

FIG. 2 is a schematic diagram of the format of the instruction set of the present invention;

FIG. 3 is a schematic diagram of the format of a neural network operation instruction of the present invention;

FIG. 4 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention;

FIG. 5 is a schematic diagram of the format of a vector-vector operation instruction of the present invention;

FIG. 6 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention;

FIG. 7 is a schematic structural diagram of a neural network operation and matrix/vector operation device according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of the decoding module in a neural network operation and matrix/vector operation device according to an embodiment of the present invention;

FIG. 9 is a flowchart of a neural network operation and matrix/vector operation device executing a neural network operation and matrix/vector instruction according to an embodiment of the present invention.
Detailed Description
The present invention discloses a neural network operation and matrix/vector operation device comprising a storage unit, a register unit, a control unit, and an operation unit. The storage unit stores neurons/matrices/vectors; the register unit stores the addresses at which the neurons/matrices/vectors are stored, along with other parameters; the control unit performs decoding operations and controls each module according to the instructions it reads; and the operation unit, according to the neural network operation and matrix/vector operation instruction, obtains the neuron/matrix/vector addresses and other parameters from the instruction or from the register unit, fetches the corresponding neurons/matrices/vectors from the storage unit at those addresses, and then operates on them to produce an operation result. The present invention temporarily stores the neuron/matrix/vector data participating in a computation in the scratchpad memory, so that the computation can support data of different widths more flexibly and effectively, improving the execution performance of computation tasks.
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
FIG. 1 is a schematic structural diagram of the neural network operation and matrix/vector operation device of the present invention. As shown in FIG. 1, the device includes:

a storage unit for storing neurons/matrices/vectors. In one embodiment, the storage unit may be a scratchpad memory capable of supporting neuron/matrix/vector data of different sizes; the present invention temporarily stores the necessary computation data in a scratchpad memory, enabling the computing device to support data of different widths more flexibly and effectively during neural network and matrix/vector operations;

a register unit for storing neuron/matrix/vector addresses, where a neuron address is the address at which a neuron is stored in the storage unit, a matrix address is the address at which a matrix is stored in the storage unit, and a vector address is the address at which a vector is stored in the storage unit. In one embodiment, the register unit may be a scalar register file providing the scalar registers required during computation; the scalar registers store not only neuron/matrix/vector addresses but also scalar data. When an operation involves both a matrix/vector and a scalar, the operation unit must obtain both the matrix/vector address and the corresponding scalar from the register unit;
a control unit for controlling the behavior of each module in the device. In one embodiment, the control unit reads the prepared instruction, decodes it into several micro-instructions, and sends them to the other modules in the device, which perform the corresponding operations according to the micro-instructions they receive;

an operation unit for executing the various neural network operation and matrix/vector operation instructions: it obtains a neuron/matrix/vector address from the register unit according to the instruction, fetches the corresponding neuron/matrix/vector from the storage unit at that address, performs the operation on the fetched data, and stores the operation result back in the storage unit. The neural network operation and matrix/vector operation unit includes a vector multiplication component, an accumulation component, and a scalar multiplication component, and is responsible for the device's neural network/matrix/vector operations, including but not limited to: convolutional neural network forward operations, convolutional neural network training, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training, batch normalization, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product, element-wise vector arithmetic, vector logical operations, vector transcendental functions, vector comparison, vector maximum/minimum, vector cyclic shift, and generation of random vectors obeying a given distribution. Operation instructions are sent to this operation unit for execution.
根据本发明的一实施方式,该装置还包括:指令缓存单元,用于存储待执行的运算指令。指令在执行过程中,同时也被缓存在指令缓存单元中,当一条指令执行完之后,如果该指令同时也是指令缓存单元中未被提交指令中最早的一条指令,该指令将被提交,一旦提交,该条指令进行的操作对装置状态的改变将无法撤销。在一实施方式中,指令缓存单元可以是重排序缓存。According to an embodiment of the invention, the apparatus further comprises: an instruction buffer unit for storing the operation instruction to be executed. The instruction is also cached in the instruction cache unit during execution. When an instruction is executed, if the instruction is also the earliest instruction in the uncommitted instruction in the instruction cache unit, the instruction will be submitted once submitted. The operation of this instruction will not be able to cancel the change of the device status. In an embodiment, the instruction cache unit may be a reordering cache.
According to an embodiment of the invention, the device further includes an instruction queue for storing decoded neural network operation and matrix/vector operation instructions in order. Because different instructions may have dependencies on the registers they use, the queue caches decoded instructions and issues an instruction only after its dependencies have been satisfied.
According to an embodiment of the invention, the device further includes a dependency processing unit which, before the operation unit fetches an instruction, determines whether that operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction. If so, the operation instruction is stored in a store queue and provided to the operation unit only after the preceding operation instruction has finished executing; otherwise, the operation instruction is provided to the operation unit directly. Specifically, when operation instructions access the scratchpad memory, consecutive instructions may access the same block of storage. To guarantee the correctness of execution results, if the current instruction is detected to depend on the data of a preceding instruction, it must wait in the store queue until the dependency has been cleared.
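A minimal sketch of this address-overlap check follows, assuming each instruction's scratchpad access can be summarized as a (start address, length) pair; the field names and interval test are hypothetical.

```python
# An instruction is held in the store queue while its access region overlaps
# that of any earlier, unfinished instruction.
def regions_overlap(a_start, a_len, b_start, b_len):
    return a_start < b_start + b_len and b_start < a_start + a_len

def has_dependency(instr, in_flight):
    return any(regions_overlap(instr["addr"], instr["len"],
                               prev["addr"], prev["len"])
               for prev in in_flight)

in_flight = [{"addr": 0x100, "len": 64}]      # earlier, unfinished access
new_instr = {"addr": 0x120, "len": 16}
print(has_dependency(new_instr, in_flight))   # True -> wait in the store queue
```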
According to an embodiment of the invention, the device further includes an input/output unit for storing neurons/matrices/vectors in the storage unit, or for retrieving operation results from the storage unit. The input/output unit can access the storage unit directly and is responsible for reading data from, and writing data to, memory.
According to an embodiment of the invention, the instruction set used by the device adopts a Load/Store architecture, so the operation unit does not operate on data in memory. The instruction set adopts a very long instruction word (VLIW) architecture; with different instruction configurations it can carry out complex neural network operations as well as simple matrix/vector operations. In addition, the instruction set uses fixed-length instructions, so the neural network and matrix/vector operation device of the invention can fetch the next instruction during the decoding stage of the previous instruction.
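As a rough illustration of why fixed-length instructions permit this overlap, the sketch below models a two-stage fetch/decode pipeline in Python: because every instruction occupies the same number of bytes (16 is an assumed width), the next fetch address is known without decoding the current instruction. The simulation is a didactic stand-in, not a description of the actual hardware.

```python
INSTR_BYTES = 16  # assumed fixed instruction width

def pipeline(program):
    """Two-stage overlap: instruction i is decoded while i+1 is fetched."""
    in_decode = None
    pc, cycle = 0, 0
    end = len(program) * INSTR_BYTES
    while in_decode is not None or pc < end:
        if in_decode is not None:
            print(f"cycle {cycle}: decode {program[in_decode // INSTR_BYTES]}")
        if pc < end:
            print(f"cycle {cycle}: fetch  pc={pc:#04x}")
            # next pc is known without decoding -- the payoff of fixed length
            in_decode, pc = pc, pc + INSTR_BYTES
        else:
            in_decode = None
        cycle += 1

pipeline(["CONV_FWD", "MV_MUL", "VV_ADD"])
```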
Fig. 2 is a schematic diagram of the format of an operation instruction of the present invention. As shown in Fig. 2, an operation instruction includes at least one opcode and at least three operands. The opcode indicates the function of the operation instruction, and the operation unit performs different operations by recognizing one or more opcodes. An operand indicates the data information of the operation instruction, where the data information may be an immediate value or a register number. For example, to fetch a matrix, the register number identifies the register from which the matrix start address and matrix length are read, and the matrix stored at the corresponding address is then fetched from the storage unit according to that start address and length.
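A minimal sketch of this operand resolution follows; the register contents, addresses, and list-based stand-in for the storage unit are placeholder assumptions.

```python
storage = list(range(100))       # stand-in for the storage unit
regs = {3: (10, 6)}              # register 3 -> (matrix start address, length)

def fetch_matrix(reg_no):
    start, length = regs[reg_no]              # resolve the register operand
    return storage[start:start + length]      # read the matrix from storage

print(fetch_matrix(3))           # -> [10, 11, 12, 13, 14, 15]
```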
Fig. 3 is a schematic diagram of the format of a neural network operation instruction of the present invention. As shown in Fig. 3, a neural network operation instruction includes at least one opcode and 16 operands. The opcode indicates the function of the neural network operation instruction, and the operation unit performs different neural network operations by recognizing one or more opcodes. The operands indicate the data information of the neural network operation instruction, where the data information may be an immediate value or a register number.
Fig. 4 is a schematic diagram of the format of a matrix-matrix operation instruction of the present invention. As shown in Fig. 4, a matrix-matrix operation instruction includes at least one opcode and at least 4 operands. The opcode indicates the function of the matrix-matrix operation instruction, and the operation unit performs different matrix operations by recognizing one or more opcodes. The operands indicate the data information of the matrix-matrix operation instruction, where the data information may be an immediate value or a register number.
Fig. 5 is a schematic diagram of the format of a vector-vector operation instruction of the present invention. As shown in Fig. 5, a vector-vector operation instruction includes at least one opcode and at least 3 operands. The opcode indicates the function of the vector-vector operation instruction, and the operation unit performs different vector operations by recognizing one or more opcodes. The operands indicate the data information of the vector-vector operation instruction, where the data information may be an immediate value or a register number.
Fig. 6 is a schematic diagram of the format of a matrix-vector operation instruction of the present invention. As shown in Fig. 6, a matrix-vector operation instruction includes at least one opcode and at least 6 operands. The opcode indicates the function of the matrix-vector operation instruction, and the operation unit performs different matrix and vector operations by recognizing one or more opcodes. The operands indicate the data information of the matrix-vector operation instruction, where the data information may be an immediate value or a register number.
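Pulling the four formats together, the sketch below encodes one hypothetical instruction of each kind as an (opcode, operands) record that respects the operand counts stated above; the opcode mnemonics and field layout are illustrative assumptions, not encodings taken from the figures.

```python
from collections import namedtuple

Instr = namedtuple("Instr", ["opcode", "operands"])

nn_instr = Instr("NN_CONV_FWD", tuple(range(16)))    # 16 operands
mm_instr = Instr("MM_ADD",      (1, 2, 3, 4))        # >= 4 operands
vv_instr = Instr("VV_DOT",      (1, 2, 3))           # >= 3 operands
mv_instr = Instr("MV_MUL",      (1, 2, 3, 4, 5, 6))  # >= 6 operands

for ins in (nn_instr, mm_instr, vv_instr, mv_instr):
    assert len(ins.operands) >= 3                    # common minimum (Fig. 2)
    print(ins.opcode, len(ins.operands), "operands")
```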
Fig. 7 is a schematic structural diagram of the neural network operation and matrix/vector operation device according to a preferred embodiment of the present invention. As shown in Fig. 7, the device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a store queue, a reorder buffer, an operation unit, a scratchpad, and an IO memory access module;
An instruction fetch module, responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
A decoding module, responsible for decoding instructions and passing the decoded instructions to the instruction queue. As shown in Fig. 8, the decoding module includes an instruction receiving module, a micro-instruction generation module, a micro-instruction queue, and a micro-instruction issue module. The instruction receiving module accepts the instructions fetched by the instruction fetch module; the micro-instruction generation module decodes the instructions received by the instruction receiving module into micro-instructions that control the individual functional components; the micro-instruction queue stores the micro-instructions sent by the micro-instruction generation module; and the micro-instruction issue module issues the micro-instructions to the functional components;
An instruction queue, for caching the decoded instructions in order and sending them to the dependency processing unit;
A scalar register file, providing the scalar registers the device needs during computation;
A dependency processing unit, which handles possible storage dependencies between an instruction and the preceding instruction. A matrix operation instruction accesses the scratchpad memory, and consecutive instructions may access the same block of storage. To guarantee the correctness of execution results, if the current instruction is detected to depend on the data of a preceding instruction, it must wait in the store queue until the dependency has been cleared.
A store queue, which is an ordered queue; an instruction that has a data dependency on a preceding instruction is held in this queue until the dependency is cleared, after which the instruction is submitted.
A reorder buffer, in which instructions are also cached while they execute. When an instruction finishes executing, if it is also the oldest uncommitted instruction in the reorder buffer, it is committed; once committed, the changes the instruction makes to the device state can no longer be undone. Entries in the reorder buffer act as placeholders: when the first instruction it holds has a data dependency, that instruction is not committed (released). Although many later instructions keep arriving, only some of them can be accepted (limited by the reorder buffer size); the overall computation proceeds smoothly only once that first instruction has been committed.
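The in-order commit rule described above can be sketched as follows, assuming each reorder-buffer entry records only an instruction id and a completion flag; this is a didactic model, not the hardware mechanism.

```python
from collections import deque

def commit(rob):
    """Commit from the head only: a stalled head blocks everything younger."""
    committed = []
    while rob and rob[0]["done"]:           # head must be finished
        committed.append(rob.popleft()["id"])
    return committed                        # younger finished entries behind a
                                            # stalled head stay put, as above

rob = deque([{"id": 0, "done": False}, {"id": 1, "done": True}])
print(commit(rob))        # -> []   : head not done, nothing commits
rob[0]["done"] = True
print(commit(rob))        # -> [0, 1]
```

Keeping commit strictly in order is what makes the placeholder behavior safe: state changes become irrevocable only in program order, however instructions complete internally.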
An operation unit, responsible for all of the device's neural network operations and matrix/vector operations, including but not limited to: convolutional neural network forward operations, convolutional neural network training operations, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training operations, batch normalization operations, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product (tensor) operations, vector inner product operations, the four elementary vector arithmetic operations, vector logic operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector circular shift operations, and generation of random vectors obeying a given distribution. Operation instructions are sent to this unit for execution;
A scratchpad, a temporary storage device dedicated to data, able to support data of different sizes;
An IO memory access module, used to access the scratchpad memory directly, responsible for reading data from, and writing data to, the scratchpad memory.
Fig. 9 is a flowchart of the operation device executing a neural network operation and matrix/vector operation instruction according to a preferred embodiment of the present invention. As shown in Fig. 9, the process of executing a neural network operation and matrix/vector instruction comprises the following steps (a condensed sketch of the whole flow follows the steps):
S1: the instruction fetch module fetches the neural network operation and matrix/vector instruction and sends it to the decoding module.
S2: the decoding module decodes the instruction and sends it to the instruction queue.
S3: within the decoding module, the instruction is sent to the instruction receiving module.
S4: the instruction receiving module sends the instruction to the micro-instruction generation module for micro-instruction generation.
S5: the micro-instruction generation module obtains the instruction's neural network operation opcode and neural network operation operands from the scalar register file, decodes the instruction into micro-instructions that control the individual functional components, and sends them to the micro-instruction issue queue.
S6: after the required data has been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing. If so, the instruction waits in the store queue until it no longer has any data dependency on unfinished earlier instructions.
S7: once no dependency remains, the micro-instructions corresponding to the neural network operation and matrix/vector instruction are sent to the functional components, such as the operation unit.
S8: the operation unit fetches the required data from the scratchpad according to the data's address and size, and then completes the neural network operation and matrix/vector operation in the operation unit.
S9: when the operation completes, the output data is written back to the specified address in the scratchpad memory, and the instruction in the reorder buffer is committed.
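The condensed sketch promised above ties steps S1 through S9 together as plain Python functions; every piece here (the scratchpad dictionary, the decode helper, the dependency stall, the vector-add body) is a simplified stand-in under assumed data layouts, not the patent's implementation.

```python
scratchpad = {0x10: [1.0, 2.0, 3.0], 0x20: [4.0, 5.0, 6.0]}

def fetch_and_decode(instr):                      # S1-S5: fetch, decode to u-op
    op, dst, a, b = instr
    return {"op": op, "dst": dst, "srcs": (a, b)}

def wait_for_dependencies(u_op, in_flight):       # S6: stall while overlapping
    while any(addr in in_flight for addr in u_op["srcs"] + (u_op["dst"],)):
        in_flight.clear()                         # pretend earlier writes drain

def execute(instr, in_flight):
    u_op = fetch_and_decode(instr)
    wait_for_dependencies(u_op, in_flight)        # S6
    a, b = (scratchpad[s] for s in u_op["srcs"])  # S7-S8: issue, read operands
    result = [x + y for x, y in zip(a, b)]        # S8: e.g. a vector add
    scratchpad[u_op["dst"]] = result              # S9: write back, then commit
    return result

print(execute(("VV_ADD", 0x30, 0x10, 0x20), set()))   # -> [5.0, 7.0, 9.0]
```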
In summary, the present invention discloses a device and method for neural network operations and matrix/vector operations which, together with the corresponding instructions, effectively address the demands of neural network algorithms and the large volume of matrix/vector operations in current computing. Compared with existing conventional solutions, the invention offers configurable instructions, ease of use, flexibility in the supported neural network and matrix/vector sizes, and ample on-chip caching.
The specific embodiments described above further explain the objectives, technical solutions, and benefits of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A device for performing neural network operations and matrix/vector operations, comprising a storage unit, a register unit, a control unit, an operation unit, and a scratchpad memory, wherein:
    the storage unit is configured to store neurons/matrices/vectors;
    the register unit is configured to store neuron addresses/matrix addresses/vector addresses, wherein a neuron address is the address at which a neuron is stored in the storage unit, a matrix address is the address at which a matrix is stored in the storage unit, and a vector address is the address at which a vector is stored in the storage unit;
    the control unit is configured to perform decoding operations and to control the unit modules according to the instructions it reads;
    the operation unit is configured to obtain a neuron address/matrix address/vector address from the register unit according to an instruction, to fetch the corresponding neuron/matrix/vector from the storage unit according to that neuron address/matrix address/vector address, and to perform an operation on the neuron/matrix/vector so obtained and/or the data carried in the instruction to obtain an operation result;
    characterized in that the neuron/matrix/vector data participating in the operation unit's computation is temporarily stored in the scratchpad memory, from which the operation unit reads it when needed.
  2. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the scratchpad memory can support neuron/matrix/vector data of different sizes.
  3. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the register unit is a scalar register file, providing the scalar registers required during computation.
  4. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the operation unit includes a vector multiplication component, an accumulation component, and a scalar multiplication component; and
    the operation unit is responsible for the device's neural network/matrix/vector operations, including convolutional neural network forward operations, convolutional neural network training operations, neural network pooling operations, fully connected neural network forward operations, fully connected neural network training operations, batch normalization operations, RBM neural network operations, matrix-vector multiplication, matrix-matrix addition/subtraction, vector outer product operations, vector inner product operations, the four elementary vector arithmetic operations, vector logic operations, vector transcendental function operations, vector comparison operations, vector maximum/minimum operations, vector circular shift operations, and generation of random vectors obeying a given distribution.
  5. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the device further comprises an instruction cache unit for storing operation instructions awaiting execution, the instruction cache unit preferably being a reorder buffer; and
    the device further comprises an instruction queue for caching the decoded instructions in order and sending them to the dependency processing unit.
  6. The device for performing neural network operations and matrix/vector operations according to claim 5, characterized in that the device further comprises a dependency processing unit and a store queue, the dependency processing unit being configured to determine, before the operation unit fetches an instruction, whether the operation instruction accesses the same neuron/matrix/vector storage address as the preceding operation instruction, and if so, to store the operation instruction in the store queue and, after the preceding operation instruction has finished executing, to provide the operation instruction in the store queue to the operation unit, and otherwise to provide the operation instruction to the operation unit directly; and the store queue being configured to store instructions that have data dependencies on preceding instructions, and to submit such an instruction once the dependency has been cleared.
  7. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that the instruction set of the device adopts a Load/Store architecture, and the operation unit does not operate on data in memory; and
    the instruction set of the device preferably adopts a very long instruction word architecture and preferably uses fixed-length instructions.
  8. The device for performing neural network operations and matrix/vector operations according to claim 1, characterized in that an operation instruction executed by the operation unit includes at least one opcode and at least 3 operands, wherein the opcode indicates the function of the operation instruction, the operation unit performs different operations by recognizing one or more opcodes, and the operands indicate the data information of the operation instruction, the data information being an immediate value or a register number.
    Preferably, when the operation instruction is a neural network operation instruction, the neural network operation instruction includes at least one opcode and 16 operands;
    preferably, when the operation instruction is a matrix-matrix operation instruction, the matrix-matrix operation instruction includes at least one opcode and at least 4 operands;
    preferably, when the operation instruction is a vector-vector operation instruction, the vector-vector operation instruction includes at least one opcode and at least 3 operands;
    preferably, when the operation instruction is a matrix-vector operation instruction, the matrix-vector operation instruction includes at least one opcode and at least 6 operands.
  9. A device for performing neural network operations and matrix/vector operations, characterized by comprising:
    an instruction fetch module, for fetching the next instruction to be executed from the instruction sequence and passing the instruction to the decoding module;
    a decoding module, for decoding the instruction and passing the decoded instruction to the instruction queue;
    an instruction queue, for caching the instructions decoded by the decoding module in order and sending them to the dependency processing unit;
    a scalar register file, for providing scalar registers for use in operations;
    a dependency processing unit, for determining whether the current instruction has a data dependency on the preceding instruction and, if so, storing the current instruction in the store queue;
    a store queue, for caching a current instruction that has a data dependency on the preceding instruction, and issuing the current instruction once its dependency on the preceding instruction has been cleared;
    a reorder buffer, for caching instructions while they execute and, once an instruction has finished executing, determining whether it is the oldest uncommitted instruction in the reorder buffer and, if so, committing it;
    an operation unit, for performing all neural network operations and matrix/vector operations;
    a scratchpad memory, for temporarily storing the neuron/matrix/vector data participating in the operation unit's computation, from which the operation unit reads it when needed, the scratchpad memory preferably being able to support data of different sizes;
    an IO memory access module, for directly accessing the scratchpad memory, responsible for reading data from, and writing data to, the scratchpad memory.
  10. A method of executing neural network operations and matrix/vector instructions, characterized by comprising the following steps:
    step S1: an instruction fetch module fetches a neural network operation and matrix/vector instruction and sends the instruction to a decoding module;
    step S2: the decoding module decodes the instruction and sends the instruction to an instruction queue;
    step S3: within the decoding module, the instruction is sent to an instruction receiving module;
    step S4: the instruction receiving module sends the instruction to a micro-instruction generation module for micro-instruction generation;
    step S5: the micro-instruction generation module obtains the instruction's neural network operation opcode and neural network operation operands from a scalar register file, and decodes the instruction into micro-instructions that control the individual functional components, sending them to a micro-instruction issue queue;
    step S6: after the required data has been obtained, the instruction is sent to a dependency processing unit, which analyzes whether the instruction has a data dependency on any earlier instruction that has not yet finished executing; if so, the instruction waits in a store queue until it no longer has any data dependency on unfinished earlier instructions;
    step S7: the micro-instructions corresponding to the instruction are sent to an operation unit;
    step S8: the operation unit fetches the required data from the scratchpad memory according to the data's address and size, and then completes the neural network operation and/or matrix/vector operation corresponding to the instruction in the operation unit.