
CN111831339B - Instruction execution method and device for intelligent processor and electronic equipment - Google Patents


Info

Publication number
CN111831339B
Authority
CN
China
Prior art keywords
instruction
fractal
execution
unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010688860.4A
Other languages
Chinese (zh)
Other versions
CN111831339A (en
Inventor
支天
赵永威
李威
张士锦
杜子东
郭崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010688860.4A priority Critical patent/CN111831339B/en
Publication of CN111831339A publication Critical patent/CN111831339A/en
Application granted granted Critical
Publication of CN111831339B publication Critical patent/CN111831339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3869Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The disclosure provides an instruction execution method and device for an intelligent processor. The method comprises: instruction decoding, in which a serial decomposition sub-instruction for executing a fractal operation is decoded into a local instruction and a fractal operation instruction; data loading, in which the data required by the fractal operation is read from an external storage unit into a local storage unit of the intelligent processor; operation execution, in which the fractal operation is performed on the data according to the fractal operation instruction; reduction execution, in which a reduction operation is performed on the fractal operation result according to the local instruction; and data write-back, in which the reduction result stored in the local storage unit is written to the external storage unit. Instruction decoding, data loading, operation execution, reduction execution, and data write-back are performed in a pipelined fashion. The method keeps the modules at every level busy at all times and improves the data throughput of the intelligent processor, thereby improving its execution efficiency.

Description

Instruction execution method and device for intelligent processor and electronic equipment
Technical Field
The disclosure relates to the field of computer technology, and in particular, to an instruction execution method and device for an intelligent processor, and electronic equipment.
Background
Machine learning algorithms are an emerging tool with a growing number of applications in industry, including image recognition, speech recognition, face recognition, video analysis, intelligent recommendation, game playing, and other fields. In recent years, as machine learning loads have become more widely used, dedicated machine learning computers of many different scales have emerged in industry. For example, on the mobile side, some smartphones use a machine learning processor for face recognition, while on the cloud server side, machine learning computers are used for acceleration.
Machine learning algorithms have broad prospects, but their application is constrained by programming challenges. Applications span a wide range of fields and run on hardware platforms of very different scales. If every application had to be programmed separately for every piece of hardware, the resulting programs would depend on the scale of the hardware, which makes programming difficult. Developers therefore use programming frameworks (e.g., TensorFlow, PyTorch, MXNet) as a bridging layer between applications and hardware to ameliorate this problem.
However, programming frameworks only alleviate the challenges faced by users; for hardware vendors, the challenges become more severe. Hardware vendors must now not only provide a programming interface for each hardware product, but also port each programming framework to each hardware product, which incurs a huge software development cost. TensorFlow alone has more than one thousand operators, and optimizing a single operator for one piece of hardware takes a senior software engineer several months.
Disclosure of Invention
In view of the foregoing drawbacks, an object of the present disclosure is to provide an instruction execution method, apparatus and electronic device for an intelligent processor, which are used for at least partially solving the above technical problems.
According to a first aspect of the present disclosure, there is provided an instruction execution method for an intelligent processor, the instruction execution method comprising: instruction decoding, in which a serial decomposition sub-instruction for executing a fractal operation is decoded into a local instruction and a fractal operation instruction; data loading, in which the data required by the fractal operation is read from an external storage unit into a local storage unit of the intelligent processor; operation execution, in which the fractal operation is performed on the data according to the fractal operation instruction; reduction execution, in which a reduction operation is performed on the fractal operation result according to the local instruction; and data write-back, in which the reduction result stored in the local storage unit is written to the external storage unit; wherein the instruction decoding, data loading, operation execution, reduction execution, and data write-back are performed in a pipelined fashion.
In some embodiments, the method further comprises: serial decomposition, in which an original fractal instruction set is decomposed into the serial decomposition sub-instructions.
In some embodiments, the serial decomposition is performed asynchronously with the pipelined instruction decoding, data loading, operation execution, reduction execution, and data write-back.
In some embodiments, the instructions of the fractal calculation subunit at each level of the intelligent processor are executed in an instruction decoding, data loading, operation execution, reduction execution, and data write-back pipeline; and the instructions of the fractal calculation subunits across the levels of the intelligent processor are executed in a recursively nested fractal pipeline.
In some embodiments, the method further comprises: temporarily storing the serial decomposition sub-instructions.
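The pipelined execution of the five steps of the method can be illustrated with a minimal timing sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls occur; the stage abbreviations (ID, LD, EX, RD, WB) and both function names are hypothetical and are not taken from the disclosure.

```python
# Illustrative stage names (assumed): decode, load, execute, reduce, write back.
STAGES = ["ID", "LD", "EX", "RD", "WB"]


def pipeline_schedule(num_instructions):
    """Map each cycle to the (stage, instruction index) pairs active in it.

    Assumes an ideal pipeline: one cycle per stage, no stalls.
    """
    schedule = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(STAGES):
            cycle = instr + depth  # instruction i occupies stage d at cycle i + d
            schedule.setdefault(cycle, []).append((stage, instr))
    return schedule


def total_cycles(num_instructions):
    """Cycles needed when one new instruction enters the pipeline per cycle."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1
```

Under these assumptions, four sub-instructions finish in 8 cycles instead of the 20 cycles that strictly sequential execution of all five steps would take, and in cycle 4 four different stages work on four different sub-instructions at once.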
According to a second aspect of the present disclosure, there is provided an instruction execution apparatus for an intelligent processor, the instruction execution apparatus comprising: an instruction decoding unit, which decodes a serial decomposition sub-instruction for executing a fractal operation into a local instruction and a fractal operation instruction; a data loading unit, which reads the data required by the fractal operation from an external storage unit into a local storage unit of the intelligent processor; an operation execution unit, which performs the fractal operation on the data according to the fractal operation instruction; a reduction execution unit, which performs a reduction operation on the fractal operation result according to the local instruction; and a data write-back unit, which writes the reduction result stored in the local storage unit to the external storage unit; wherein the instruction decoding unit, the data loading unit, the operation execution unit, the reduction execution unit, and the data write-back unit operate in a pipelined fashion.
In some embodiments, the apparatus further comprises: a serial decomposition unit, which decomposes an original fractal instruction set into the serial decomposition sub-instructions.
In some embodiments, the serial decomposition unit operates asynchronously with the pipelined instruction decoding unit, data loading unit, operation execution unit, reduction execution unit, and data write-back unit.
In some embodiments, the instructions of the fractal calculation subunit at each level of the intelligent processor are executed in a pipeline formed by the instruction decoding unit, the data loading unit, the operation execution unit, the reduction execution unit, and the data write-back unit; and the instructions of the fractal calculation subunits across the levels of the intelligent processor are executed in a recursively nested fractal pipeline.
In some embodiments, the apparatus further comprises: an instruction temporary storage unit, which temporarily stores the serial decomposition sub-instructions.
According to a third aspect of the present disclosure, there is provided an electronic device comprising the instruction execution apparatus described above.
Drawings
Fig. 1 schematically illustrates a fractal von Neumann architecture provided by a first embodiment of the present disclosure;
Fig. 2 schematically illustrates the structure of a control system for an intelligent processor provided by the first embodiment of the present disclosure;
Fig. 3 schematically illustrates a flowchart of a control method provided by the first embodiment of the present disclosure;
Fig. 4 schematically illustrates a flowchart of an instruction decomposition method provided by a second embodiment of the present disclosure;
Fig. 5 schematically illustrates a logic diagram of a specific example of the instruction decomposition method provided by the second embodiment of the present disclosure;
Fig. 6 schematically illustrates a block diagram of an instruction decomposition apparatus provided by the second embodiment of the present disclosure;
Fig. 7 schematically illustrates a fractal pipeline formed by an intelligent processor with a two-layer architecture provided by a third embodiment of the present disclosure;
Fig. 8 schematically illustrates a block diagram of an instruction execution apparatus provided by the third embodiment of the present disclosure;
Fig. 9 schematically illustrates the structure of a memory management device provided by a fourth embodiment of the present disclosure;
Fig. 10 schematically illustrates a flowchart of a memory management method provided by the fourth embodiment of the present disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
In the drawings or description, like or identical parts are provided with the same reference numerals. Implementations not shown or described in the drawings are forms known to those of ordinary skill in the art. Additionally, although examples of parameters including particular values may be provided herein, it should be appreciated that the parameters need not be exactly equal to the corresponding values, but may be approximated to the corresponding values within acceptable error margins or design constraints. In addition, directional terms such as "upper", "lower", "front", "rear", "left", "right", and the like, which are mentioned in the following embodiments, are only directions referring to the drawings. Thus, directional terminology is used for purposes of illustration and is not intended to be limiting of the disclosure.
It has been found that an ideal machine learning computer should be homogeneous, serial, and hierarchical in order to simplify programming (including writing machine learning applications and porting programming frameworks). If all machine learning computers, even those of entirely different scales, employ the same instruction set architecture, then programs do not have to be ported again for each new product, which significantly frees up programmer productivity. Based on the above, the embodiments of the disclosure construct a fractal machine learning computer by introducing the idea of an intelligent processor, so as to solve the above technical problems. The following is a detailed description.
To construct a fractal machine learning computer from the idea of an intelligent processor, it must first be confirmed that machine learning application loads can suitably be expressed in fractal form. The disclosed embodiments study the computational primitives shared by several typical machine learning application loads and find that these application loads can be described using a small set of computational primitives (vector inner product, vector distance, sorting, activation function, counting, etc.).
Machine learning application loads are typically compute-intensive and memory-intensive, but they differ considerably in control flow, learning approach, training method, and so on. However, all machine learning application loads have a high degree of concurrency at some granularity, so many heterogeneous machine learning computers use dedicated hardware to exploit this feature for acceleration. Examples of such dedicated hardware include GPUs, FPGAs, and ASIC chips. Embodiments of the present disclosure first decompose these application loads into computational primitives and then express the primitives in fractal form.
Specifically, the disclosed embodiments select six representative machine learning application loads, execute them on classical datasets, and break down the execution time by computational primitive.
TABLE 1
As shown in Table 1, the disclosed embodiments select the following loads:
CNN - in view of the popularity of deep learning, the AlexNet algorithm and the ImageNet dataset were chosen as the representative application load for convolutional neural networks (CNNs).
DNN - also for deep learning, a 3-layer multi-layer perceptron (MLP) was chosen as the representative application load for deep neural networks (DNNs).
K-Means - the k-means algorithm, a classical machine learning clustering algorithm.
K-NN - the k-nearest neighbor algorithm, a classical machine learning classification algorithm.
SVM - the support vector machine, a classical machine learning classification algorithm.
LVQ - learning vector quantization, a classical machine learning classification algorithm.
Based on this, the machine learning application loads are decomposed into matrix operations and vector operations. Operations such as vector-matrix multiplication and matrix-vector multiplication are merged into matrix multiplication, and operations such as matrix-matrix addition/subtraction, matrix-scalar multiplication, and element-by-element vector operations are merged into element-by-element transformation. The decomposition thus yields 7 main computational primitives: inner product, convolution, pooling, matrix multiplication, element-by-element transformation, sorting, and counting. To keep the expression of deep learning applications simple, dedicated convolution and pooling operations are provided in addition to matrix multiplication; the inner product is effectively a vector-vector multiplication and can also be used to represent the fully connected layers of a deep neural network. It can be observed that these 7 common computational primitives essentially suffice to express the machine learning application loads.
Next, the embodiments of the present disclosure use fractal operations to describe the above 7 common computational primitives.
TABLE 2
As shown in Table 2, each computational primitive may have multiple k-decomposition patterns. Some operations produce partial results after decomposition, which must be reduced to obtain the final result; the required reduction operations are listed in Table 2. For some operations, the fractal operations obtained after decomposition share input data, in which case data redundancy must be introduced; the redundant parts are also listed in Table 2. It is readily seen that, by introducing reduction operations and data redundancy, all 7 common computational primitives can be represented as fractal operations. Thus, to design a new dedicated architecture that performs these fractal operations efficiently, embodiments of the present disclosure need to address the following three key challenges:
1. Reduction operations - to process reduction operations efficiently, embodiments of the present disclosure introduce a lightweight local processing unit (LFU) into the architecture. After retrieving partial result data from the fractal processing units (FFUs), the local processing unit can efficiently perform a reduction operation on it.
2. Data redundancy - the execution of fractal operations requires data redundancy. For this reason, the storage hierarchy in the fractal machine learning computer needs to ensure data consistency and exploit opportunities for data reuse.
3. Data communication - communication between different nodes of a fractal machine learning computer can require complex physical wiring, resulting in area, delay, and energy overhead. Embodiments of the disclosure find that, during the execution of fractal operations, only parent and child nodes need to communicate data, which greatly simplifies the design of the data path; the designer can design the fractal machine learning computer by iterative modularization, and all wires are confined to parent-child links, which reduces wiring congestion.
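The roles of reduction and data redundancy in a k-decomposition can be illustrated with two small examples. This is a minimal sketch under stated assumptions: the decomposition patterns below (splitting an inner product along the vector length, and splitting a one-dimensional convolution along its output) are chosen for illustration and are not copied from Table 2, and all function names are hypothetical.

```python
def split(seq, k):
    """Split seq into k contiguous chunks of near-equal size."""
    n = len(seq)
    bounds = [n * i // k for i in range(k + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(k)]


def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))


def fractal_inner_product(x, y, k):
    """k-decomposition that yields partial results needing a reduction."""
    partials = [inner_product(xs, ys) for xs, ys in zip(split(x, k), split(y, k))]
    return sum(partials)  # reduction step: add the partial inner products


def conv1d(signal, kernel):
    """'Valid' 1-D convolution as used in CNNs (no kernel flip)."""
    m = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(m))
            for i in range(len(signal) - m + 1)]


def fractal_conv1d(signal, kernel, k):
    """k-decomposition that needs redundant input data but no reduction."""
    m = len(kernel)
    out_len = len(signal) - m + 1
    bounds = [out_len * i // k for i in range(k + 1)]
    out = []
    for i in range(k):
        lo, hi = bounds[i], bounds[i + 1]
        piece = signal[lo:hi + m - 1]  # shares m - 1 inputs with the next piece
        out.extend(conv1d(piece, kernel))
    return out
```

In the first case each sub-operation reads disjoint inputs and the final result is obtained by a sum reduction; in the second case neighbouring sub-operations read overlapping inputs, so the same data must be available to more than one sub-operation.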
The following describes in detail the technical solutions of the embodiments of the present disclosure to solve the above-mentioned key challenges.
A first embodiment of the present disclosure provides a control system for an intelligent processor, in which the fractal calculation subunit at each layer of the intelligent processor includes the control system. The control system includes: a serial decomposition module, which serially decomposes the fractal instruction set corresponding to the fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions and temporarily stores them; a degradation module, which degrades each serial decomposition sub-instruction, rewriting it from a sub-instruction issued by the previous-layer fractal calculation subunit to the current-layer fractal calculation subunit into a sub-instruction issued by the current-layer fractal calculation subunit to the next-layer fractal calculation subunit; and a parallel decomposition module, which decomposes the degraded serial decomposition sub-instructions in parallel to obtain parallel decomposition sub-instructions that satisfy the concurrency requirements of all the fractal calculation subunits operating concurrently in the intelligent processor.
Fig. 1 schematically illustrates a fractal von Neumann architecture provided by the first embodiment of the present disclosure. The intelligent processor described in the embodiments of the present disclosure is a computing system constructed using the fractal von Neumann architecture.
In geometry, a fractal is a geometric figure that is self-similar at different scales. The fractal concept therefore includes a scale invariant that describes the figure: the figure is defined by a set of simple generation rules, and repeatedly replacing part of the figure according to the same pattern can generate a complex figure of any scale. The replacement rule of the figure is the scale invariant. The disclosed embodiments adopt a similar idea, taking the system description as the scale invariant, which yields the fractal von Neumann architecture.
As shown in Fig. 1, the fractal von Neumann architecture is an architecture that can be designed in an iterative, modular manner by replicating copies of itself. A minimal fractal von Neumann architecture consists of a memory, a controller, and arithmetic units (LFU and FFU); together with an input/output module, it forms a computing system of minimal scale, namely a fractal calculation subunit. A larger fractal von Neumann architecture uses smaller-scale fractal von Neumann architectures as its arithmetic units and consists of a plurality of concurrent arithmetic units, a controller, a memory, and an input/output module. By analogy, the fractal von Neumann architecture can build computing systems of arbitrary scale through iterative modular design. Each layer of the fractal von Neumann architecture employs a controller of the same structure. The iterative modular design of the fractal von Neumann architecture can therefore greatly reduce the design and verification effort for the control logic when designing hardware circuits.
The fractal von Neumann architecture employs the same instruction set architecture on each layer, known as the Fractal Instruction Set Architecture (FISA). The fractal instruction set architecture comprises two kinds of instructions: local instructions and fractal instructions.
This embodiment gives a formal definition of the fractal instruction set architecture:
Definition 3.1 (FISA instruction). A FISA instruction $I$ is a triple $\langle O, P, G \rangle$, where $O$ is an operation, $P$ is a finite set of operands, and $G$ is a granularity identifier.
Definition 3.2 (fractal instruction). A FISA instruction $I\langle O, P, G \rangle$ is a fractal instruction if and only if there exists a set of granularity identifiers $G'_1, G'_2, \ldots, G'_n$ (with $G'_i \leq G$, where $\leq$ is a partial order defined on the space of granularity identifiers) such that the execution behavior of $I$ can be obtained by executing $I'_1(G'_1), I'_2(G'_2), \ldots, I'_n(G'_n)$ sequentially together with other FISA instructions.
Definition 3.3 (FISA instruction set). An instruction set is a FISA instruction set if and only if it contains at least one fractal instruction.
Definition 3.4 (fractal computer). A computer $M$ whose instruction set is a FISA instruction set is a fractal computer if and only if at least one fractal instruction is executed fractally on $M$.
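Definitions 3.1 and 3.2 can be made concrete with a small data structure. The sketch below is illustrative only: it assumes that the granularity identifier is simply an element count, that the partial order is the usual order on integers, and that the only operation is a hypothetical element-wise vector addition named "vadd"; none of these choices is prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FisaInstruction:
    op: str                                # O: the operation
    operands: Tuple[Tuple[int, ...], ...]  # P: finite set of operands
    granularity: int                       # G: assumed here to be an element count


def execute(instr):
    """Execute an instruction directly (only the assumed 'vadd' is supported)."""
    if instr.op == "vadd":
        a, b = instr.operands
        return tuple(x + y for x, y in zip(a, b))
    raise ValueError(f"unsupported operation: {instr.op}")


def decompose(instr, max_granularity):
    """Produce sub-instructions whose granularity does not exceed max_granularity."""
    a, b = instr.operands
    subs = []
    for start in range(0, instr.granularity, max_granularity):
        sa = a[start:start + max_granularity]
        sb = b[start:start + max_granularity]
        subs.append(FisaInstruction(instr.op, (sa, sb), len(sa)))
    return subs


def execute_fractally(instr, max_granularity):
    """Obtain the behavior of instr by executing its sub-instructions in sequence."""
    result = ()
    for sub in decompose(instr, max_granularity):
        assert sub.granularity <= instr.granularity  # G'_i <= G
        result += execute(sub)
    return result
```

Here the vector addition is a fractal instruction in the sense of Definition 3.2, because executing its finer-grained sub-instructions one after another yields the same result as executing it directly.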
TABLE 3
The FISA instruction set of the intelligent processor of the embodiments of the disclosure is designed at a relatively high level of abstraction, which improves programming productivity and achieves a high ratio of computation to memory access. As shown in Table 3, high-level operations such as convolution and sorting can be expressed directly by a single instruction. Lower-level operations with a lower ratio of computation to memory access are also included in the instruction set, which gives better programming flexibility. These low-level operations are typically treated as local instructions, and the intelligent processor tends to execute them on the LFU to reduce data movement.
Further, local instructions describe reduction operations; they are sent by the controller to the local processing unit (LFU) and executed on the local processing unit of the fractal von Neumann architecture. Fractal instructions describe fractal operations; after receiving a fractal instruction, the controller performs k-decomposition on it, producing sub-instructions and local instructions. The sub-instructions still have the form of fractal instructions and are sent to the fractal processing units (FFUs) for execution. Thus, the programmer need only consider a single, serial instruction set architecture when programming a fractal von Neumann architecture. The heterogeneity between the LFU and the FFUs and the parallelism among the FFUs are handled by the controller. Because every node (fractal processing unit) of the fractal von Neumann architecture has the same instruction set architecture at every level, programmers do not need to consider the differences between levels, nor write different programs for fractal von Neumann computers of different scales. With fractal von Neumann architectures of the same series, a supercomputer can even execute the same program as an intelligent Internet-of-Things end device, so that one set of code runs everywhere from cloud to terminal without modification.
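The way a single serial program runs unchanged on machines of different scales can be sketched with a recursive node model. This is a simplified illustration under stated assumptions: the class `FractalNode`, the helper `build`, and the two operations "sum" and "max" are hypothetical, the k-decomposition simply splits the data evenly among the child nodes, and storage and timing are not modeled.

```python
# Assumed per-operation behavior: how a leaf computes, and how partials are reduced.
LEAF_OPS = {"sum": sum, "max": max}
REDUCE_OPS = {"sum": sum, "max": max}


class FractalNode:
    """One node of the hierarchy; every level uses this same class."""

    def __init__(self, children=()):
        self.children = list(children)  # the FFUs of this node; empty for a leaf

    def run(self, op, data):
        if not self.children or len(data) < 2:
            return LEAF_OPS[op](data)  # smallest unit: compute directly
        k = min(len(self.children), len(data))  # never hand out an empty chunk
        bounds = [len(data) * i // k for i in range(k + 1)]
        # Sub-instructions keep the form of the parent instruction (same op).
        partials = [
            self.children[i].run(op, data[bounds[i]:bounds[i + 1]])
            for i in range(k)
        ]
        return REDUCE_OPS[op](partials)  # reduction, the role of the LFU


def build(levels, fanout):
    """Build a hierarchy with the given number of levels below the root."""
    if levels == 0:
        return FractalNode()
    return FractalNode(build(levels - 1, fanout) for _ in range(fanout))
```

Calling `build(1, 2).run("sum", data)` and `build(3, 4).run("sum", data)` issues the same instruction to a small and a large machine; only the hierarchy behind the root node differs.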
The fractal von Neumann architecture builds a storage hierarchy and manages memory in two categories: external storage and local storage. Only the outermost external storage is visible to the programmer (and must be managed in the program). In the fractal von Neumann architecture, the local storage of the current level is treated as the external storage of the next level and is shared by all fractal processing units (FFUs). Unlike the design principle of a Reduced Instruction Set Computer (RISC), in the fractal instruction set architecture all storage space that the programmer can operate on is located in external storage, and the controller of each layer is responsible for data communication between the external storage and the local storage. The controller of the current layer generates instructions for the next layer and thus acts as the programmer of the next layer, so the controller likewise manages only the local storage of its own layer and not the memory inside the next layer. With this design, every storage in the fractal von Neumann architecture is managed by the controller of the layer it belongs to, responsibilities are clearly divided, and programming is simplified.
Fig. 2 schematically illustrates a control system structure diagram for an intelligent processor according to a first embodiment of the present disclosure.
As shown in fig. 2, each node (i.e., each layer of fractal calculation subunits) of the intelligent processor has the same controller for managing the child nodes such that the entire intelligent processor operates in a fractal manner. Each controller comprises a serial decomposition module, a degradation module and a parallel decomposition module.
The serial decomposition module comprises a first instruction queue temporary storage unit (IQ), a serial decomposition unit (SD) and a second instruction queue temporary storage unit (SQ).
In the serial decomposition stage, the input fractal instruction set is first temporarily stored in the IQ and then fetched by the SD. According to the hardware capacity limits of the intelligent processor, the SD serially decomposes the fractal instruction set into serial decomposition sub-instructions that are executed in sequence, where the granularity of each serial decomposition sub-instruction does not exceed what the hardware capacity allows, and writes the serial decomposition sub-instructions into the SQ for temporary storage. Since the serial decomposition module has two first-in-first-out queues, the IQ and the SQ, as buffers, the serial decomposition stage need not run in lockstep with the pipeline; instead it runs asynchronously on its own until the IQ is empty or the SQ is full.
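The serial decomposition stage can be sketched with two first-in-first-out queues. The sketch assumes a hypothetical instruction format `(op, start, length)` and a single scalar capacity limit; the actual instruction encoding and capacity rules of the SD are not specified here.

```python
from collections import deque


def serial_decompose(iq, sq, capacity, sq_limit):
    """Move work from the IQ to the SQ until the IQ is empty or the SQ is full.

    iq, sq    : deques of (op, start, length) tuples (assumed format)
    capacity  : largest length one sub-instruction may cover (assumed limit)
    sq_limit  : number of entries the SQ can hold
    """
    while iq and len(sq) < sq_limit:
        op, start, length = iq.popleft()
        piece = min(length, capacity)
        sq.append((op, start, piece))  # sub-instruction within hardware capacity
        if length > piece:
            # The remainder stays at the front of the IQ so that order is preserved.
            iq.appendleft((op, start + piece, length - piece))
```

Because the function stops as soon as the SQ is full and can be called again later, it can run at its own pace, independently of the stages that consume the SQ.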
The degradation module (DD) comprises a checking unit, an allocation unit, a DMA, and a replacement unit. The DD takes a serial decomposition sub-instruction out of the SQ and "degrades" it, that is, rewrites it from an instruction issued by the previous-level node to this node into an instruction issued by this node to the next-level node. The specific operations are as follows:
The checking unit checks whether data dependencies are satisfied, schedules when instructions are issued into the pipeline, and inserts pipeline bubbles where necessary.
The allocation unit allocates local memory space for the operands of the serial decomposition sub-instruction that are located in external memory.
The DMA (Direct Memory Access) unit: DMAC instructions are generated to control the DMA to transfer data before and after execution of the instruction, forming a local backup of the external data for access by the next-level node.
The replacement unit replaces the operands of the serial decomposition sub-instruction with the operands of the local backup.
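The four degradation operations can be sketched as one function plus a dependency check. The sketch makes several assumptions that are not part of the disclosure: an instruction is a dictionary whose operands are `(address, size)` pairs, local memory is allocated by a simple bump allocator, and the transfer commands are represented by hypothetical `DMA_LOAD` and `DMA_STORE` tuples.

```python
def overlaps(a, b):
    """True if two (address, size) ranges share at least one address."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def has_dependency(instr, pending_writes):
    """Checking unit: does an input overlap a range that is still being written?"""
    return any(overlaps(i, w) for i in instr["inputs"] for w in pending_writes)


def degrade(instr, local_base=0):
    """Rewrite external operands into local ones and emit the DMA transfers."""
    next_free = local_base
    loads, stores = [], []
    new_inputs, new_outputs = [], []
    for ext_addr, size in instr["inputs"]:
        local_addr, next_free = next_free, next_free + size  # allocation unit
        loads.append(("DMA_LOAD", ext_addr, local_addr, size))  # before execution
        new_inputs.append((local_addr, size))  # replacement unit
    for ext_addr, size in instr["outputs"]:
        local_addr, next_free = next_free, next_free + size
        stores.append(("DMA_STORE", local_addr, ext_addr, size))  # after execution
        new_outputs.append((local_addr, size))
    degraded = {"op": instr["op"], "inputs": new_inputs, "outputs": new_outputs}
    return loads, degraded, stores
```

The degraded instruction refers only to addresses in the local storage of this node, which is exactly the storage that the next-level node sees as its external storage.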
The parallel decomposition module comprises a parallel decomposition unit (PD) and a reduction control unit (RC). The serial decomposition sub-instruction obtained by decomposition comprises a fractal instruction and a local instruction. The PD performs k-decomposition on the fractal instruction to obtain fractal sub-instructions and sends them to the fractal processing units in the fractal calculation subunit at each layer of the intelligent processor to execute the fractal operation. The RC performs k-decomposition on the local instruction to obtain local sub-instructions and sends them to the local processing unit in the fractal calculation subunit at each layer of the intelligent processor, so as to perform the reduction operation on the fractal operation results of that layer.
The RC can also decide to commission a local instruction to the fractal processing units for execution instead; it may choose to do so when a node with a weak LFU encounters a local instruction with large operands. In that case the RC does not send the local instruction to the LFU but to a commission register (CMR) of the control system, where it is held for one beat; in the next beat, the local instruction is treated as a fractal instruction, sent to the PD for decomposition, and then sent to the FFUs for execution. Because the LFU in the pipeline always works one beat after the FFUs, holding the instruction in the CMR does not change the data dependencies in the pipeline, so correct execution is ensured.
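The commission mechanism can be sketched as a one-entry register that delays an instruction by one beat. The class name, the dictionary instruction format, and the size threshold `lfu_limit` used to decide when to commission are assumptions made for illustration; the disclosure does not specify the decision rule.

```python
class ReductionController:
    """Routes local instructions either to the LFU or, via the CMR, to the PD."""

    def __init__(self, lfu_limit):
        self.lfu_limit = lfu_limit  # assumed threshold on operand size
        self.cmr = None             # commission register: holds one instruction

    def beat(self, local_instr=None):
        """Advance one beat and report what is dispatched in it."""
        # Whatever was held in the CMR during the previous beat goes to the PD now.
        dispatched = {"LFU": None, "PD": self.cmr}
        self.cmr = None
        if local_instr is not None:
            if local_instr["size"] <= self.lfu_limit:
                dispatched["LFU"] = local_instr  # normal path
            else:
                self.cmr = local_instr  # commissioned: wait one beat in the CMR
        return dispatched
```

A small reduction is dispatched to the LFU in the beat in which it arrives, whereas a large one is dispatched to the PD one beat later and is then handled like a fractal instruction.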
In summary, the present embodiment provides a smart processor based on the fractal von neumann architecture by introducing a lightweight local processing unit (LFU). After retrieving a portion of the result data from the fractal processing unit (FFU), the local processing unit can efficiently perform a reduction operation thereon. Meanwhile, the structure of the intelligent processor controller is reasonably designed, so that the intelligent processor can be efficiently and accurately controlled to execute fractal operation.
The first embodiment of the present disclosure further provides a control method for an intelligent processor, by which the fractal calculation subunits of each layer of the intelligent processor can be controlled to perform a fractal operation. Fig. 3 schematically illustrates a flowchart of the control method provided by the first embodiment of the present disclosure. As shown in Fig. 3, the control method includes:
S301, serially decomposing a fractal instruction set corresponding to the fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions, and temporarily storing the serial decomposition sub-instructions.
S302, degrading the serial decomposition sub-instruction, and modifying the serial decomposition sub-instruction issued by the previous layer of fractal calculation sub-unit to the current layer of fractal calculation sub-unit into the serial decomposition sub-instruction issued by the current layer of fractal calculation sub-unit to the next layer of fractal calculation sub-unit.
S303, carrying out parallel decomposition on the degraded serial decomposition sub-instruction to obtain a parallel decomposition sub-instruction meeting the concurrency requirement of the concurrency operation of all fractal calculation sub-units in the intelligent processor, so that the fractal calculation sub-units execute fractal operation according to the parallel decomposition sub-instruction.
For details of the control method embodiment, refer to the control system embodiment above. The control method brings the same technical effects as the control system embodiment, which are not repeated here.
In order to improve the efficiency and accuracy of instruction decomposition, a second embodiment of the present disclosure provides an instruction decomposition method for the control system and method provided in the first embodiment. Fig. 4 schematically illustrates a flowchart of the instruction decomposition method provided in the second embodiment of the present disclosure. As shown in Fig. 4, the method may include, for example:
S401, determining decomposition priority of dimensions for decomposing operands of the fractal instruction.
S402, selecting the dimension of the current decomposition according to the decomposition priority.
S403, serially decomposing the operands of the fractal instruction in the dimension of the current decomposition.
Fig. 5 schematically illustrates a logic diagram of a specific example of an instruction decomposition method according to a second embodiment of the present disclosure, as shown in fig. 5, the specific logic is as follows:
First, the serial decomposition unit records the dimensions $t_1, t_2, \ldots, t_N$ along which each fractal instruction can be decomposed, arranged in order of priority.
Then, the serial decomposition unit determines, according to the priority, the dimension in which to decompose. The decision is made as follows: for a given dimension, set that dimension and all dimensions with lower priority to atomic granularity, and keep the original granularity of the dimensions with higher priority, to obtain a first instruction identifier; decompose the operands according to the first instruction identifier; judge whether the memory capacity required by the decomposed operands is smaller than the capacity of the memory component of the intelligent processor; if so, select this dimension as the dimension of the current decomposition, and if not, move on to the next dimension and repeat the judgment. That is, for each $i = 0, 1, 2, \ldots, N$, the dimensions $t_1, t_2, \ldots, t_i$ are set to atomic granularity to form a new granularity identifier.
Finally, the operands of the fractal instruction are serially decomposed in the dimension of the current decomposition: the decomposition granularity of the dimensions with priority lower than the current dimension is atomic, the granularity of the dimensions with higher priority is kept unchanged, and the maximum granularity of the current dimension that meets the capacity limit of the memory component of the intelligent processor is determined, giving a second instruction identifier. The operands of the fractal instruction are then serially decomposed according to the second instruction identifier. That is, if serial decomposition in dimension $t_i$ is selected, $t_1, t_2, \ldots, t_{i-1}$ are all decomposed to atomic granularity (granularity 1), while $t_{i+1}, t_{i+2}, \ldots, t_N$ keep their original granularity. A binary search finds the maximum granularity $t'_i$ that meets the capacity limit, which determines the granularity identifier of the final output instruction.
Further, determining the maximum granularity $t'_i$ that meets the capacity limit by binary search comprises:
setting the minimum decomposition granularity min to 0 and the maximum decomposition granularity max to $t_i$, and then decomposing along dimension $t_i$ with a granularity of (max-min)/2;
judging whether the memory capacity required by the decomposed operands is larger than the capacity of the memory component of the intelligent processor; if so, setting the maximum decomposition granularity max to (max-min)/2, and if not, setting the minimum decomposition granularity min to (max-min)/2;
judging whether (max-min) is equal to 1; if so, selecting the decomposition granularity (max-min)/2 in dimension $t_i$ for the decomposition.
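The dimension selection and binary search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the cost model `fits` (one operand tile of 4-byte elements), the function names, and the treatment of an exactly full memory as acceptable are all hypothetical. The sketch also uses the conventional midpoint `lo + (hi - lo) // 2` and returns the lower bound once the bounds differ by 1, which is one reading of the steps above.

```python
from math import prod


def fits(granularity, capacity, elem_bytes=4):
    # Hypothetical cost model: a single operand tile with prod(granularity)
    # elements of elem_bytes each must fit into `capacity` bytes.
    return prod(granularity) * elem_bytes <= capacity


def serial_granularity(t, capacity):
    """Return (granularity identifier, chosen dimension number).

    t lists the dimension sizes t_1..t_N in the priority order of the text.
    Dimension number 0 means the instruction fits without decomposition.
    """
    n = len(t)
    # Step 1: for i = 0..N, set t_1..t_i to atomic granularity and test capacity.
    for i in range(n + 1):
        if fits([1] * i + list(t[i:]), capacity):
            break
    else:
        raise ValueError("even fully atomic granularity exceeds the capacity")
    if i == 0:
        return list(t), 0
    # Step 2: binary search for the largest granularity of t_i that still fits.
    # Invariant: granularity lo fits (or lo == 0), granularity hi does not.
    lo, hi = 0, t[i - 1]
    while hi - lo > 1:
        mid = lo + (hi - lo) // 2
        if fits([1] * (i - 1) + [mid] + list(t[i:]), capacity):
            lo = mid
        else:
            hi = mid
    return [1] * (i - 1) + [lo] + list(t[i:]), i


print(serial_granularity([8, 16, 32], capacity=1024))  # ([1, 8, 32], 2)
```

In the example, a tile of 8 x 16 x 32 elements does not fit into 1024 bytes, nor does 1 x 16 x 32, but 1 x 1 x 32 does, so the second dimension is chosen and the search settles on granularity 8 (8 x 32 x 4 bytes = 1024 bytes).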
The number of judgments required by the serial decomposition process is at most N + log M, where N is the number of dimensions and M is the maximum capacity of the hardware. Assuming that the serial decomposer can perform one judgment per hardware clock cycle, serially decomposing a fractal instruction with 10 dimensions on a node with 4 GB of storage takes at most 42 clock cycles, so the optimal decomposition scheme can be found within a reasonable time. After finding the optimal decomposition scheme, the serial decomposer outputs the instruction template in a loop according to the granularity, and computes the addresses of the operands in the decomposed sub-instructions by accumulation.
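The figure of 42 cycles can be reproduced with a short calculation. The sketch below assumes that the logarithm is base 2 and that 4 GB means 2^32 bytes; both are inferred from the numbers in the text rather than stated in it, and the function name is hypothetical.

```python
from math import ceil, log2


def max_judgments(num_dims, capacity_bytes):
    # At most N judgments to pick the dimension, plus at most log2(M)
    # binary-search judgments to pick the granularity within it.
    return num_dims + ceil(log2(capacity_bytes))


print(max_judgments(10, 4 * 2**30))  # 10 + 32 = 42
```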
Furthermore, the parallel decomposer, which processes the serial sub-instructions produced by serial decomposition, can be implemented as follows: perform k-decomposition on the input instruction and push the instructions obtained back onto the input stack; repeat this loop until the number of instructions in the stack exceeds the number of FFUs in the node.
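A minimal sketch of this loop is given below. It models an instruction as a hypothetical `(start, length)` range along one dimension and k-decomposition as a split into k contiguous pieces. The text does not say which stack entry is decomposed next; the sketch splits the largest pending piece to keep the pieces balanced, which is an assumption, as are the function names.

```python
def k_decompose(instr, k):
    """Split a (start, length) range into up to k contiguous non-empty pieces."""
    start, length = instr
    k = min(k, length)
    base, extra = divmod(length, k)
    pieces = []
    for j in range(k):
        size = base + (1 if j < extra else 0)
        pieces.append((start, size))
        start += size
    return pieces


def parallel_decompose(instr, k, num_ffus):
    if k < 2:
        raise ValueError("k must be at least 2 for the loop to make progress")
    stack = [instr]
    # Loop until the number of instructions exceeds the number of FFUs.
    while len(stack) <= num_ffus:
        # Assumption: pick the largest pending piece (first one on ties).
        idx = max(range(len(stack)), key=lambda j: stack[j][1])
        if stack[idx][1] <= 1:
            break  # every piece is atomic; nothing left to split
        stack.extend(k_decompose(stack.pop(idx), k))
    return sorted(stack)


print(parallel_decompose((0, 16), k=2, num_ffus=4))
# [(0, 2), (2, 2), (4, 4), (8, 4), (12, 4)]
```

Following the wording of the text literally, the loop stops only once there are more pieces than FFUs, so four FFUs receive five pieces in this example.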
The DMA controller (DMAC) accepts a relatively high-level instruction form (DMAC instructions) and can move data according to high-level data structures (e.g., n-dimensional tensors). Internally, the DMAC translates DMAC instructions into low-level DMA control primitives by generating loops, thereby controlling DMA execution.
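To illustrate how a high-level tensor transfer might be lowered into loops of DMA primitives, the sketch below copies a rectangular sub-block of a row-major n-dimensional tensor into a contiguous local buffer. The primitive format `(src_addr, dst_addr, nbytes)`, the row-major layout, and all names are assumptions made for illustration; the text does not specify the DMAC instruction or primitive formats.

```python
from itertools import product


def dmac_to_primitives(shape, origin, extent,
                       src_base=0, dst_base=0, elem_bytes=4):
    """Yield (src_addr, dst_addr, nbytes) primitives for one sub-block copy."""
    # Row-major strides, counted in elements.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    # The innermost dimension is contiguous, so it becomes one primitive.
    run = extent[-1] * elem_bytes
    dst = dst_base
    outer = (range(o, o + e) for o, e in zip(origin[:-1], extent[:-1]))
    for idx in product(*outer):
        offset = sum(i * s for i, s in zip(idx, strides[:-1])) + origin[-1]
        yield (src_base + offset * elem_bytes, dst, run)
        dst += run


# Copy rows 1..2, columns 2..4 of a 4 x 8 tensor of 4-byte elements.
print(list(dmac_to_primitives((4, 8), origin=(1, 2), extent=(2, 3))))
# [(40, 0, 12), (72, 12, 12)]
```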
According to the instruction decomposition method provided by the embodiment, the optimal decomposition scheme can be found within a reasonable time range, the serial decomposer circularly outputs an instruction template according to granularity according to the optimal decomposition scheme, and addresses of operands in the decomposed sub-instructions are calculated through accumulation, so that the parallel efficiency of fractal operation is improved.
The second embodiment of the present disclosure further provides an instruction decomposition apparatus for the control system and method provided by the first embodiment, fig. 6 schematically illustrates a block diagram of the instruction decomposition apparatus provided by the second embodiment of the present disclosure, and as shown in fig. 6, the apparatus 600 may include, for example:
A determination module 610, configured to determine the decomposition priority of the dimensions for decomposing the operands of the fractal instruction.
A selection module 620, configured to select a dimension of the current decomposition according to the decomposition priority.
And the decomposition module 630 is configured to serially decompose the operands of the fractal instruction in the dimension of the current decomposition.
For details of the instruction decomposition apparatus embodiment, refer to the instruction decomposition method embodiment above. The apparatus brings the same technical effects as the method embodiment, which are not repeated here.
When the intelligent processor performs a fractal operation, the root node decodes the instructions of the fractal instruction set and sends them to its FFUs, and each FFU repeats the same execution mode until the leaf nodes are reached. The leaf nodes complete the actual operation and send the results back to their parent nodes, and each node repeats the same execution mode until the final result is gathered at the root node. In this process, an FFU spends most of its time waiting for data and instructions to arrive and, after completing its operation, waiting for the data to return to the root node. Thus, the intelligent processor cannot achieve the desired execution efficiency unless it executes in a pipelined manner.
In order to improve the throughput of the intelligent processor, a third embodiment of the present disclosure provides an instruction execution method for an intelligent processor, the instruction execution method including: instruction decoding, namely decoding a serial decomposition sub-instruction for performing a fractal operation into a local instruction and a fractal operation instruction; data loading, namely reading the data required by the fractal operation from an external storage unit into a local storage unit of the intelligent processor; operation execution, namely completing the fractal operation on the data according to the fractal operation instruction; reduction execution, namely performing a reduction operation on the fractal operation result according to the local instruction; and data write-back, namely writing the reduction operation result stored in the local memory to the external memory. Instruction decoding, data loading, operation execution, reduction execution, and data write-back are performed in a pipelined fashion.
With continued reference to FIG. 2, the FISA instructions are executed in five pipeline stages: an instruction decode stage (ID), a data load stage (LD), an operation execution stage (EX), a reduction execution stage (RD), and a data write back stage (WB). In the ID stage, a serial decomposition sub-instruction is decoded into three control signals of a local instruction, a fractal instruction and a DMAC instruction by a controller; in the LD phase, DMA transfers data from external storage to local storage for FFU and LFU access; in the EX stage, FFU completes fractal operation; in the RD stage, the LFU completes reduction operation; in WB stage, DMA transfers the operation result from local memory to external memory to complete the execution of a serial decomposition sub-instruction.
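The overlap of the five stages can be visualized with a small schedule generator. This is an idealized sketch that assumes every stage takes exactly one beat and that no stalls occur; the function name and the dictionary layout are hypothetical.

```python
STAGES = ["ID", "LD", "EX", "RD", "WB"]


def pipeline_schedule(num_instructions):
    """For each beat, map every busy stage to the sub-instruction it handles."""
    total_beats = num_instructions + len(STAGES) - 1
    schedule = []
    for beat in range(total_beats):
        active = {}
        for depth, stage in enumerate(STAGES):
            instr = beat - depth  # sub-instruction that entered ID `depth` beats ago
            if 0 <= instr < num_instructions:
                active[stage] = instr
        schedule.append(active)
    return schedule


for beat, active in enumerate(pipeline_schedule(3)):
    print(beat, active)
```

With three serial decomposition sub-instructions, the schedule spans seven beats, and in beat 2 the ID, LD, and EX stages work on sub-instructions 2, 1, and 0 respectively.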
Further, before the ID stage, the instruction execution method also includes serial decomposition, in which the SD decomposes the instructions of the original fractal instruction set (FISA) into serial decomposition sub-instructions. Serial decomposition runs asynchronously with respect to the pipelined instruction decoding, data loading, operation execution, reduction execution, and data write-back; that is, independently of the pipeline, the fractal instructions in the IQ are continuously decomposed into serial decomposition sub-instructions and written into the SQ for temporary storage.
Because the computing system of the embodiments of the present disclosure adopts a fractal von Neumann architecture, at a single level the instructions of each fractal calculation subunit are executed through the pipeline of instruction decoding, data loading, operation execution, reduction execution, and data write-back. Across the overall architecture, the five-stage pipelines formed at the individual levels constitute a recursively nested fractal pipeline. Fig. 7 schematically illustrates the fractal pipeline formed by an intelligent processor in a two-layer system; as shown in Fig. 7, different types of grids represent the execution of a fractal instruction, and each block represents one execution stage of a serial decomposition sub-instruction. Within one EX stage of the upper level, the next level runs its own pipeline. Thus, except during the start-up and drain phases of the pipeline, the intelligent processor can keep all modules at all levels busy at all times.
According to the instruction execution method provided by this embodiment, the execution of an instruction is divided into an instruction decoding stage, a data loading stage, an operation execution stage, a reduction execution stage, and a data write-back stage that are executed as a multi-stage pipeline, while the serial decomposition of instructions is performed asynchronously, independently of the pipeline. All modules at all levels can thus be kept busy at all times, which improves the data throughput of the intelligent processor and hence its execution efficiency.
The third embodiment of the present disclosure further provides an instruction execution apparatus for an intelligent processor, fig. 8 schematically illustrates a block diagram of the instruction execution apparatus provided by the third embodiment of the present disclosure, and as shown in fig. 8, the apparatus 800 may include, for example:
An instruction decoding unit 810, configured to decode a serial decomposition sub-instruction for performing a fractal operation into a local instruction and a fractal operation instruction.
A data loading unit 820, configured to read the data required by the fractal operation from an external storage unit into a local storage unit of the intelligent processor.
An operation execution unit 830, configured to complete the fractal operation on the data according to the fractal operation instruction.
A reduction execution unit 840, configured to perform a reduction operation on the fractal operation result according to the local instruction.
A data write-back unit 850, configured to write the reduction operation result stored in the local memory to the external memory.
The instruction decoding unit, the data loading unit, the operation execution unit, the reduction execution unit, and the data write-back unit execute in a pipelined manner.
For details of the instruction execution apparatus embodiment, refer to the instruction execution method embodiment above. The apparatus brings the same technical effects as the method embodiment, which are not repeated here.
During operation of the controller, the SD, DD, and PD may all need to allocate memory space, so memory management of the intelligent processor is critical to overall efficiency. The space allocated by the PD typically lives only across two adjacent pipeline stages, EX and RD; the space allocated by the DD lives for one complete serial decomposition sub-instruction cycle; and the space allocated by the SD spans multiple serial decomposition sub-instruction cycles.
Based on the difference in instruction life cycle, the fourth embodiment of the present disclosure provides a memory management device, fig. 9 schematically illustrates a structure diagram of the memory management device provided in the fourth embodiment of the present disclosure, and as shown in fig. 9, the memory management device 900 includes:
The circular memory segment 910 is used for placing external data, calculation results, temporary intermediate results required for reduction, and the like contained in the serial decomposition sub-instruction.
Three hardware functional units may access the circular memory segment: the FFU (in the EX stage), the LFU (in the RD stage), and the DMA (in the LD and WB stages). The circular memory segment is therefore divided into three areas, a first memory area 911, a second memory area 912, and a third memory area 913, which are called during operation of the intelligent processor for the fractal operation, the reduction operation, and data loading and write-back, respectively. Each of the three functional units uses one area at a time to avoid data conflicts. The three functional units call the first memory area 911, the second memory area 912, and the third memory area 913 cyclically as the pipeline advances. The cycle is as follows: after the FFU has executed the EX stage on an area, the LFU obtains that area in the next pipeline cycle and completes the RD stage in it; after the LFU completes the RD stage, the DMA obtains the area in the next pipeline cycle, first completing the WB stage and then the LD stage of a new instruction; in the following cycle the area is returned to the FFU, and so on.
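The rotation of the three areas among the FFU, LFU, and DMA can be expressed as a modular index. The sketch below is an illustration only: area indices 0, 1, and 2 stand for memory areas 911, 912, and 913, and the starting assignment at cycle 0 is an arbitrary assumption.

```python
UNITS = ["FFU", "LFU", "DMA"]  # EX stage, RD stage, WB followed by LD


def area_owner(area, cycle):
    # Each area passes from the FFU to the LFU to the DMA and back to the FFU,
    # advancing by one unit per pipeline cycle.
    return UNITS[(cycle - area) % 3]


def owners(cycle):
    return {area: area_owner(area, cycle) for area in range(3)}


for cycle in range(4):
    print(cycle, owners(cycle))
```

In every cycle the three units hold three different areas, and any single area is seen by the FFU, then the LFU, then the DMA in consecutive cycles, which matches the cycle described above.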
The static memory segment 920 includes a fourth memory area 921 and a fifth memory area 922 and is used for storing the fractal instructions input during operation of the intelligent processor, that is, for holding data that is loaded in advance during serial decomposition and shared among multiple serial decomposition sub-instructions. Because the static memory segment is divided into two areas, the SD alternates between them for successive input fractal instructions, so as to avoid data conflicts caused by the overlapping life cycles of adjacent instructions.
Further, memory space allocated under the control of the DD and the SD is not actively released. Space is reclaimed as the pipeline advances: after a memory segment has been cycled through once, new data simply overwrites the old data. To make full use of the data temporarily held in memory, as shown in Fig. 2, the memory management device further adds a tensor replacement unit (or tensor replacement table, TTT), which records the external storage address corresponding to the data currently stored in the circular memory segment or the static memory segment. When a subsequent operation needs to access data at the same address in external memory, the external storage address is replaced, so that the copy held in the local memory of the intelligent processor is used instead of the data in external memory, which reduces external memory accesses. During operation of the intelligent processor, the first memory area 911, the second memory area 912, and the third memory area 913 are called cyclically; on entering the next round of calls, the tensor replacement unit clears the external storage addresses recorded in the current round, so as to keep the replacement data up to date. With the TTT added, the intelligent processor can forward the result of the previous serial decomposition sub-instruction (produced when the RD stage finishes) directly to the input of the next serial decomposition sub-instruction (which must be ready before the EX stage starts), without writing it back and reading it again. The TTT can significantly improve the execution efficiency of the intelligent processor while maintaining data consistency.
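The behaviour of the tensor replacement table can be sketched as a small lookup structure. This is a hypothetical software model, not the hardware design: the class and method names are invented for illustration, and clearing entries per memory area when that area is about to be overwritten is one interpretation of the clearing rule described above.

```python
class TensorReplacementTable:
    """Maps external addresses to local copies that are still valid."""

    def __init__(self):
        self._entries = {}  # external address -> (memory area, local address)

    def record(self, ext_addr, area, local_addr):
        # Called when data for ext_addr is placed in local memory.
        self._entries[ext_addr] = (area, local_addr)

    def lookup(self, ext_addr):
        # Return the local address of a resident copy, or None on a miss.
        hit = self._entries.get(ext_addr)
        return None if hit is None else hit[1]

    def invalidate_area(self, area):
        # Called when an area is reused and its old contents are overwritten.
        self._entries = {a: e for a, e in self._entries.items() if e[0] != area}


ttt = TensorReplacementTable()
ttt.record(ext_addr=0x1000, area=0, local_addr=0x40)
print(ttt.lookup(0x1000))  # 64: use the local copy instead of external memory
ttt.invalidate_area(0)
print(ttt.lookup(0x1000))  # None: the copy has been overwritten
```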
According to the embodiment, the controller memory is classified and managed based on the difference of instruction execution life cycles, so that the execution efficiency of the intelligent processor can be improved, the tensor replacement unit is added in the memory management device, the execution efficiency of the intelligent processor can be further improved remarkably, and meanwhile, the data consistency is maintained.
The fourth embodiment of the present disclosure further provides a memory management method for an intelligent processor, and fig. 10 schematically illustrates a flowchart of the memory management method provided by the fourth embodiment of the present disclosure, as shown in fig. 10, where the memory management method includes:
S1001, when an input fractal instruction is serially decomposed, using the fourth memory area and the fifth memory area of the static memory segment for storage.
S1002, during operation of the intelligent processor, the fractal operation, the reduction operation, and the data loading and write-back of the intelligent processor respectively call the first memory area, the second memory area, and the third memory area of the circular memory segment.
For details of the memory management method embodiment, refer to the memory management device embodiment above. The method brings the same technical effects as the device embodiment, which are not repeated here.
In addition, in some embodiments of the present disclosure, a chip is disclosed that includes the above-described intelligent processor.
In some embodiments of the present disclosure, a chip package structure is disclosed, which includes the chip.
In some embodiments of the present disclosure, a board card is disclosed, which includes the above chip package structure.
In some embodiments of the present disclosure, an electronic device is disclosed, which includes the above board card.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle includes an aircraft, a ship, and/or a motor vehicle; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and/or a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
While the foregoing embodiments have been described in some detail for purposes of clarity of understanding, it will be understood that the foregoing embodiments are merely illustrative of the invention and are not intended to limit the invention, and that any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (9)

1. An instruction execution method for an intelligent processor, the instruction execution method comprising:
instruction decoding, namely decoding a serial decomposition sub-instruction for performing a fractal operation into a local instruction and a fractal operation instruction;
data loading, namely reading the data required by the fractal operation from an external storage unit into a local storage unit of the intelligent processor;
operation execution, namely completing the fractal operation on the data according to the fractal operation instruction;
reduction execution, namely performing a reduction operation on the fractal operation result according to the local instruction;
data write-back, namely writing the reduction operation result stored in the local memory to an external memory;
wherein the instruction decoding, data loading, operation execution, reduction execution, and data write-back are executed in a pipelined manner;
the instructions of the fractal calculation subunit of each level of the intelligent processor are executed as a pipeline of instruction decoding, data loading, operation execution, reduction execution, and data write-back;
and the instructions of the fractal calculation subunits across all levels of the intelligent processor are executed as a recursively nested fractal pipeline.
2. The instruction execution method according to claim 1, wherein the method further comprises:
serial decomposition, namely decomposing instructions of an original fractal instruction set into the serial decomposition sub-instructions.
3. The method of claim 2, wherein the serial decomposition is performed asynchronously with respect to the pipelined instruction decoding, data loading, operation execution, reduction execution, and data write-back.
4. The instruction execution method according to claim 2, wherein the method further comprises:
and temporarily storing the serial decomposition sub-instruction.
5. An instruction execution device for an intelligent processor, the instruction execution device comprising:
an instruction decoding unit, configured to decode a serial decomposition sub-instruction for performing a fractal operation into a local instruction and a fractal operation instruction;
a data loading unit, configured to read the data required by the fractal operation from an external storage unit into a local storage unit of the intelligent processor;
an operation execution unit, configured to complete the fractal operation on the data according to the fractal operation instruction;
a reduction execution unit, configured to perform a reduction operation on the fractal operation result according to the local instruction;
a data write-back unit, configured to write the reduction operation result stored in the local memory to the external memory;
wherein the instruction decoding unit, the data loading unit, the operation execution unit, the reduction execution unit, and the data write-back unit execute in a pipelined manner;
the instructions of the fractal calculation subunit of each level of the intelligent processor are executed as a pipeline through the instruction decoding unit, the data loading unit, the operation execution unit, the reduction execution unit, and the data write-back unit;
and the instructions of the fractal calculation subunits across all levels of the intelligent processor are executed as a recursively nested fractal pipeline.
6. The instruction execution apparatus of claim 5, wherein the apparatus further comprises:
and the serial decomposition unit is used for decomposing the original fractal instruction set into the serial decomposition sub-instructions.
7. The instruction execution apparatus of claim 6, wherein the serial decomposition unit executes asynchronously with respect to the pipelined instruction decoding unit, data loading unit, operation execution unit, reduction execution unit, and data write-back unit.
8. The instruction execution apparatus of claim 6, wherein the apparatus further comprises:
and the instruction temporary storage unit is used for temporarily storing the serial decomposition sub-instruction.
9. An electronic device comprising the apparatus of any one of claims 5-8.
CN202010688860.4A 2020-07-16 2020-07-16 Instruction execution method and device for intelligent processor and electronic equipment Active CN111831339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010688860.4A CN111831339B (en) 2020-07-16 2020-07-16 Instruction execution method and device for intelligent processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010688860.4A CN111831339B (en) 2020-07-16 2020-07-16 Instruction execution method and device for intelligent processor and electronic equipment

Publications (2)

Publication Number Publication Date
CN111831339A CN111831339A (en) 2020-10-27
CN111831339B true CN111831339B (en) 2024-04-02

Family

ID=72924426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010688860.4A Active CN111831339B (en) 2020-07-16 2020-07-16 Instruction execution method and device for intelligent processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN111831339B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5715459A (en) * 1994-12-15 1998-02-03 International Business Machines Corporation Advanced graphics driver architecture
CN110489087A (en) * 2019-07-31 2019-11-22 北京字节跳动网络技术有限公司 A kind of method, apparatus, medium and electronic equipment generating fractal structure
CN110502330A (en) * 2018-05-16 2019-11-26 上海寒武纪信息科技有限公司 Processor and processing method
CN110538469A (en) * 2019-09-25 2019-12-06 杭州高低科技有限公司 tangible programming instruction building block capable of realizing instruction switching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YongweiZhao等.Cambricon-F:machine learning computers with fractal von neumann architecture.《2019ACM/IEEE 46th Annual International Symposium on Computer Architecture》.2019,788-801. *

Also Published As

Publication number Publication date
CN111831339A (en) 2020-10-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant