CN112416438A

CN112416438A - Method for realizing pre-branching of assembly line

Info

Publication number: CN112416438A
Application number: CN202011422792.3A
Authority: CN
Inventors: 王志平
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2021-02-26

Abstract

A method for realizing pipeline pre-branching. For a processor core, the complexity of the pipeline often determines the performance of the core's instruction set architecture, and even the market segment for which the processor is suitable. In fact, it is not the instruction set that determines the complexity of the pipeline hardware; rather, different market segments require core pipelines of different complexity, and instruction set architectures adapted to those pipelines follow. Therefore, the view that either RISC or CISC will replace the other is unscientific, unless the relevant application market disappears. A short pipeline architecture is better suited to the low-end processor market, while a complex, ultra-long pipeline architecture is better suited to the high-end processor market. However, the hardware structure of an ultra-long pipeline brings a very troublesome pre-branching problem. The technical scheme of the invention can better solve the pre-branching problem for the ultra-long pipeline structure, so that the ultra-long pipeline structure achieves higher efficiency.

Description

Method for realizing pipeline pre-branching

Technical Field

The invention relates to the field of integrated circuits and computers, in particular to a method for realizing pipeline pre-branching.

Background

In the current state of the art, there are two widely differing processor instruction set architectures, RISC and CISC (i.e., "reduced instruction set" and "complex instruction set", respectively). However, the real difference between these two camps lies in the hardware implementation of their instruction sets, i.e., the circuit implementation of the pipeline.

The hardware corresponding to RISC relies on a larger number of "programming registers" to implement the basic operations that programs require, such as computation, instruction/data reading, and data writing. Because all RISC operations are completed on the basis of "programming registers", the actual goal of RISC is a simpler circuit structure, and the core pipelines adopted by the RISC camp are all simple pipeline structures, i.e., the pipeline is short.

The hardware implementation of CISC relies on fewer "programming registers" and relies more on the status registers of the pipeline hardware to implement the basic operations that programs require, such as computation, instruction/data reading, and data writing. Most CISC operations implement circuit functions according to program "requirements", so CISC needs a more complex circuit structure to complete its instructions, and the core pipelines adopted by the CISC camp are all complex pipeline structures, i.e., the pipeline is long.

RISC has the advantage of a simple circuit structure, but since almost all operations are done through "programming registers", a single program "requirement" must be broken into several instructions. Thus, the actual execution efficiency of RISC is very low for "requirements" that involve many system bus operations and cache operations (although there is little pipeline rollback penalty between instructions due to jumps). RISC conceivably achieves higher actual execution efficiency once its "programming registers" have been initialized, so RISC is well suited to applications that operate on small amounts of data. But in most complex systems (e.g., PCs or computer systems more complex than PCs), and even in terminal devices with demanding audio and video processing, most program "requirements" involve computation on large amounts of data and more system bus accesses (more complex systems have more peripherals and more user programs, all of which put a heavier load on the system bus).

It is apparent that CISC is a computer instruction set architecture, or computer technology, whose market position is more complex computer systems. The original purpose of the circuit implementation of the CISC instruction set architecture is to meet system "requirements" and to process a larger number of operations, so the hardware of the CISC camp has a more complex cache structure and a more complex pipeline structure.

In terms of the technical space available for circuit implementation, CISC has more room than RISC: CISC in theory places no limit on the space of the hardware circuit implementation, whereas RISC essentially limits the space that the hardware circuit implementation can use. However, in the prior art, CISC has one disadvantage compared with RISC, namely the branch prediction problem brought by the ultra-long pipeline. For instructions that jump according to a judgment condition, an overly long pipeline causes an overly long latency during actual execution. Branch prediction is a hardware method for solving this problem: before the judgment condition produces a result, the condition is first assumed to be true or false, and execution continues according to that assumption. In this way, about 50% of the latency problem that overly long pipelines cause for conditional jumps can be solved probabilistically. Most current high-end processors support dual-threaded cores, so in theory both the "true" and the "false" branch could be assumed simultaneously: whichever assumption fails, the other can continue executing past the conditional jump instruction without the pipeline rollback penalty caused by a wrong assumption.

However, in the prior art, how the two "threads" in a kernel are actually used depends on the scheduling of the operating system, i.e., the program itself cannot be certain that all pipeline hardware resources can be used for branch prediction. Therefore, in the prior art, a program could in theory be sure whether all pipeline hardware resources are available for branch prediction only if software, especially system software, performed more "coordination" or "synchronization", but doing so would make any real efficiency gain even less meaningful. (This is in fact why, in the prior art, a high-end processing core does not implement more than 2 threads, i.e., sets of pipeline hardware resources: additional "threads" would serve too narrow a purpose.) In addition, even if the prior art realized the above theoretical possibility, it still could not solve the branch prediction problem of the ultra-long pipeline well. In practical applications, within a single instruction execution time (the time from an instruction entering the pipeline to that instruction producing output), several binary tree branch points are very likely to be in the pipeline at once. Therefore, before the conditional jump instruction obtains its result, once a second binary tree branch point has entered the pipeline, its outcome can only be assumed with a 50% probability of being correct.

Disclosure of Invention

The technical scheme of the invention solves the branch prediction problem of the processor core pipeline (ultra-long pipeline). Branch prediction is abbreviated as BTP, i.e., Binary Tree Prediction. In the technical solution of the present invention, BTP is defined as three specific technical embodiments: "passive branch prediction", "active branch prediction", and "adaptive branch prediction". Hereinafter, "passive branch prediction" is abbreviated as IBTP (Inactive BTP), "active branch prediction" is abbreviated as PBTP (Positive BTP), and "adaptive branch prediction" is abbreviated as ABTP (Adaptive BTP).

When the pipeline encounters a branch point of a binary tree formed by a condition-judging instruction (hereinafter, a "branch of the binary tree" is abbreviated as BTB, i.e., Binary Tree Branch, and a "branch point of the binary tree" is abbreviated as BTBP, i.e., BTB Point), the pipeline uses only the current pipeline hardware resources to make a hypothetical judgment on the condition of the binary tree ("true condition" or "false condition"), and executes the instructions of one branch of the binary tree based on that hypothetical judgment (hereinafter, "pipeline hardware resources" is abbreviated as PL, the BTB of the "true condition" is abbreviated as BTB_T, and the BTB of the "false condition" is abbreviated as BTB_F). When the binary tree condition-judging instruction obtains its actual result, execution continues if the actual result matches the hypothetical judgment; otherwise, all instructions in the pipeline that are executing under the hypothetical judgment are cleared, and the pipeline jumps to the other branch, opposite to the hypothetical judgment, and restarts execution there. Throughout this process, no PL other than the current PL is used to make any hypothetical prediction for the BTBP; this is IBTP.
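The IBTP behavior described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the patented circuit: the function name `run_ibtp`, the list-of-strings instruction model, and the `resolve_after` parameter (how many speculative instructions enter the pipeline before the condition resolves) are all hypothetical.

```python
from typing import List, Tuple


def run_ibtp(assume_true: bool, actual_true: bool,
             btb_t: List[str], btb_f: List[str],
             resolve_after: int) -> Tuple[List[str], List[str], str]:
    """Simulate one BTBP handled by a single PL (IBTP).

    Returns (instructions kept in flight, instructions flushed,
    branch on which execution continues). Hypothetical model.
    """
    assumed = btb_t if assume_true else btb_f
    # Speculatively issue instructions from the assumed branch until the
    # condition-judging instruction produces its actual result.
    in_flight = list(assumed[:resolve_after])
    next_branch = "BTB_T" if actual_true else "BTB_F"

    if assume_true == actual_true:
        # The assumption was right: nothing is cleared, execution continues.
        return in_flight, [], next_branch

    # The assumption was wrong: clear everything issued under it and
    # restart on the opposite branch. No other PL was ever involved.
    return [], in_flight, next_branch
```

With `resolve_after=2`, a wrong assumption discards the two speculative instructions, which is the pipeline rollback penalty that IBTP accepts in exchange for using only one PL.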

When the pipeline encounters a BTBP formed by a condition-judging instruction, the pipeline first uses the current PL to make a hypothetical judgment on the condition of the binary tree ("true condition" or "false condition"), here assumed to be the "true condition", and executes BTB_T according to that judgment. At the same time, another PL is taken to make the opposite hypothetical judgment, i.e., the "false condition", and executes BTB_F based on that judgment, until the binary tree condition-judging instruction obtains its actual result. If the actual result is the "true condition", BTB_T continues execution, BTB_F is ignored, all instructions on BTB_F are cleared, and the PL occupied by BTB_F is released. If the actual result is the "false condition", BTB_F continues execution, BTB_T is ignored, all instructions on BTB_T are cleared, and the PL occupied by BTB_T is released. Throughout this process, each branch of the binary tree occupies one PL to perform branch prediction under its corresponding condition; this BTP is PBTP.
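A minimal sketch of the PBTP behavior described above, assuming a simplified model in which each PL is an object holding the instructions of one branch. The `PL` class, its fields, and `resolve_pbtp` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PL:
    """One set of pipeline hardware resources (hypothetical model)."""
    pli: int
    running: bool = False
    branch: Optional[str] = None            # "BTB_T" or "BTB_F"
    in_flight: List[str] = field(default_factory=list)

    def start(self, branch: str, instructions: List[str]) -> None:
        self.running, self.branch = True, branch
        self.in_flight = list(instructions)

    def clear_and_release(self) -> None:
        # Clear all instructions of the branch and release the PL.
        self.running, self.branch = False, None
        self.in_flight = []


def resolve_pbtp(true_pl: PL, false_pl: PL, actual_true: bool) -> PL:
    """Keep the PL whose assumption matches the result; release the other."""
    winner, loser = (true_pl, false_pl) if actual_true else (false_pl, true_pl)
    loser.clear_and_release()
    return winner


pl0, pl1 = PL(0), PL(1)
pl0.start("BTB_T", ["t0", "t1"])    # current PL assumes the "true condition"
pl1.start("BTB_F", ["f0", "f1"])    # another PL assumes the "false condition"
survivor = resolve_pbtp(pl0, pl1, actual_true=False)
```

In this run the actual result is the "false condition", so the PL holding BTB_F keeps its instructions and the PL holding BTB_T is cleared and released without any restart.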

It is clear that IBTP requires only one PL, whereas PBTP requires a sufficient number of PLs for a pipeline of a given length. In theory, the number of PLs required for full PBTP support can be calculated for a pipeline of a given length. However, a larger number of PLs results in higher implementation cost and higher power consumption. Therefore, in practical applications, the number of PLs in the kernel should depend on the requirements of the application. Consequently, when the available PLs are not guaranteed to fully satisfy the number that PBTP requires, BTP should be completed with PBTP preferentially, and when the number of PLs is insufficient, the pipeline must adaptively switch to IBTP. Implementing PBTP when the number of PLs is sufficient and IBTP when it is insufficient is ABTP.
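The ABTP rule in the paragraph above reduces to a single decision per branch point. The following sketch assumes the PL states are available as a list of booleans and that the first idle PL found is the one used; both the representation and the function name `choose_btp` are hypothetical.

```python
from typing import List, Optional, Tuple


def choose_btp(pl_running: List[bool]) -> Tuple[str, Optional[int]]:
    """Pick the BTP mode for one branch point under ABTP.

    pl_running[i] is True when the PL with index i is busy.
    Returns ("PBTP", index of the idle PL to use) when an idle PL
    exists, otherwise ("IBTP", None).
    """
    for pli, running in enumerate(pl_running):
        if not running:
            # Assumption: take the lowest-indexed idle PL.
            return "PBTP", pli
    return "IBTP", None
```

For example, `choose_btp([True, False, False, False])` selects PBTP on the PL with index 1, while `choose_btp([True, True, True, True])` falls back to IBTP.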

The technical scheme of the invention implements ABTP. In specific kernel hardware that implements the technical solution of the present invention, the number of PLs may be fixed, but software does not need to know the specific number. In a particular hardware implementation, each PL in the kernel has a unique "pipeline index", hereinafter abbreviated as PLI (PL Index). The PL whose PLI is 0 is abbreviated as PL₀, the PL whose PLI is 1 is abbreviated as PL₁, and so on. Every PL has two states, an "idle state" and a "running state"; a pipeline in the "idle state" is hereinafter abbreviated as PL_off, and a pipeline in the "running state" is abbreviated as PL_on.

In the kernel hardware architecture, in addition to all the PLs, two further hardware modules are implemented: a "pipeline resource center" module and a "pipeline connection matrix" module. Hereinafter, the "pipeline resource center" is abbreviated as PRC (Pipeline Resource Center), and the "pipeline connection matrix" is abbreviated as PCM (Pipeline Connection Matrix). The PRC is responsible for managing all PLs; when a PL is needed during program execution, the PRC provides the PLI of an available PL. The PCM is responsible for the interconnection between different PLs, i.e., when a PL implements PBTP, it must pass BTB_T or BTB_F and other related information through the PCM to the target PL (i.e., the PL pointed to by the PLI provided by the PRC).
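The division of labor between the PRC and the PCM can be sketched as two small classes. This is an illustrative software model only: the class layout, the `BranchContext` fields, the `inbox` dictionary, and the lowest-index allocation policy are assumptions that the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BranchContext:
    """What the PCM carries to the target PL (hypothetical fields)."""
    address: int                 # instruction address of BTB_T or BTB_F
    state: Dict[str, int]        # associated state information


class PRC:
    """Pipeline Resource Center: tracks which PLs are PL_on or PL_off."""

    def __init__(self, num_pls: int) -> None:
        self.on: List[bool] = [False] * num_pls

    def available_pli(self) -> Optional[int]:
        # Assumption: report the lowest-indexed PL_off, if any.
        for pli, running in enumerate(self.on):
            if not running:
                return pli
        return None


class PCM:
    """Pipeline Connection Matrix: delivers a branch to a target PL."""

    def __init__(self, prc: PRC) -> None:
        self.prc = prc
        self.inbox: Dict[int, BranchContext] = {}

    def transfer(self, target_pli: int, ctx: BranchContext) -> None:
        self.inbox[target_pli] = ctx
        self.prc.on[target_pli] = True    # the target PL becomes PL_on


prc = PRC(4)
prc.on[0] = True                          # PL0 is running the program
pcm = PCM(prc)
target = prc.available_pli()              # the PRC provides a PLI
assert target is not None
pcm.transfer(target, BranchContext(address=0x2000, state={"r1": 7}))
```

After the transfer, the PRC no longer reports the target PL as available, so the next request is answered with a different PLI.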

Specifically, the technical solution of the present invention is implemented in software and hardware by means of the following kernel instructions:

1. A "pipeline tree mode" instruction, with the characters "PMT" as its designator (the designator is not limited to "PMT"). "Tree mode" refers to "binary tree mode", so executing the PMT instruction causes all PLs in the kernel, and also the PRC, to enter the binary tree mode of operation. In binary tree mode, the PLs in the kernel can only be used for BTBs and cannot be used by programs as threads (i.e., the kernel in effect operates in single-threaded mode).

2. A "passive branch jump" instruction, with the characters "JMPI" as its designator (the designator is not limited to "JMPI"). The JMPI instruction indicates that the jump according to the "judgment condition" uses the IBTP mode, i.e., only passive branch prediction is used when the current instruction is branch predicted, and no other PL is occupied. The "judgment condition" on which JMPI depends may be the value of a programming register of the pipeline hardware, but the technical solution of the present invention does not limit the "judgment condition" to a specific fixed address or a specific programming register. The JMPI instruction parameter must specify the instruction address of the desired jump for either BTB_F or BTB_T; the other branch, which is not assigned a jump address, continues from the address of the current instruction. The address given in the instruction parameter may use "direct addressing", in which the value of the instruction parameter is itself the instruction address of the desired jump, or "indirect addressing", in which the instruction parameter is a pointer to a variable whose stored value is the instruction address of the desired jump.

3. An "active branch jump" instruction, with the characters "JMPP" as its designator (the designator is not limited to "JMPP"). The JMPP instruction indicates that the jump according to the "judgment condition" uses the ABTP mode, i.e., adaptive branch prediction is used when the current instruction is branch predicted, and other PLs may be occupied. The "judgment condition" on which JMPP depends may be the value of a programming register of the pipeline hardware. The JMPP instruction parameter must specify the instruction address of the desired jump for either BTB_F or BTB_T; the other branch, which is not assigned a jump address, continues from the address of the current instruction. The address given in the instruction parameter may use "direct addressing" or "indirect addressing".

4. A "tree mode end" instruction, with the characters "PMR" as its designator (the designator is not limited to "PMR"). The PMR instruction causes the PLs in the kernel and the PRC to exit the "binary tree mode" of operation, i.e., the PLs in the kernel are no longer available for BTBs but can be used by programs as threads. When the kernel hardware executes PMR, if the PL executing the current program is not PL₀, the current program is first migrated to PL₀, and the PMR instruction is then actually executed.
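The effect of the four instructions on the kernel's mode can be sketched as a small state machine. This is a hedged illustration: the `Kernel` class and its method names are hypothetical, and raising an error for JMPI or JMPP outside binary tree mode is an assumption of the sketch, since the text does not describe that case.

```python
class Kernel:
    """Hypothetical model of the mode changes caused by PMT and PMR."""

    def __init__(self, num_pls: int = 4) -> None:
        self.num_pls = num_pls
        self.tree_mode = False
        self.current_pli = 0        # PL executing the current program

    def pmt(self) -> None:
        # Enter binary tree mode: PLs serve BTBs only, not program threads.
        self.tree_mode = True

    def jump_mode(self, designator: str, pl_off_available: bool) -> str:
        """Return the BTP mode a conditional jump instruction uses."""
        if not self.tree_mode:
            # Assumption: behavior outside tree mode is not modelled.
            raise RuntimeError("jump modelled only in binary tree mode")
        if designator == "JMPI":
            return "IBTP"           # never occupies another PL
        if designator == "JMPP":
            # ABTP: PBTP when a PL_off exists, otherwise IBTP.
            return "PBTP" if pl_off_available else "IBTP"
        raise ValueError(f"unknown designator: {designator}")

    def pmr(self) -> None:
        # Migrate the current program to PL0 first, then leave tree mode.
        if self.current_pli != 0:
            self.current_pli = 0
        self.tree_mode = False
```

A program would bracket a branch-heavy region with `pmt()` and `pmr()`; inside that region, JMPI always stays on one PL, while JMPP uses another PL only when one is idle.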

Drawings

Fig. 1 is a schematic diagram of one possible hardware implementation of the technical solution of the present invention. It shows a basic scheme and does not mean that the technical solution must be fixed to the illustrated structure. The figure illustrates a simple kernel hardware structure with 4 PLs in order to show the physical relationships between the hardware related to the technical solution; an actual kernel hardware implementation would contain other modules not shown in the figure. The PRC shown in the figure is the PRC module described above, the PCM is the PCM module described above, and PL₀ to PL₃ are the PLs whose PLI is 0 to 3;

As shown in Fig. 1, the "code-coupled cache" and the "data-coupled cache" are cache modules for caching code and data, respectively, and are physically connected to the PLs;

As shown in Fig. 1, line A illustrates the direction and path along which a PL transmits branch prediction related instructions;

As shown in Fig. 1, line B indicates whether a PL_off is currently available, as provided by the PRC, and the PLI of that PL_off;

As shown in Fig. 1, line C and line D illustrate the access paths of the PLs to code and data, and the direction of transmission of the code and data, respectively;

As shown in Fig. 1, line E illustrates the information that each PL provides to the PRC regarding its operating state;

As shown in Fig. 1, lines F to I illustrate the interconnection between PLs, used to transfer the instructions and information needed to complete PBTP or ABTP.

Fig. 2 is a schematic diagram of one possible hardware implementation of the technical solution of the present invention. It shows a basic scheme and does not mean that the technical solution must be fixed to the illustrated structure. The figure illustrates a simple internal structure of a PL in order to show the physical relationships between the internal hardware of the PL when the PL implements IBTP; an actual PL hardware implementation would contain other modules not shown in the figure;

As shown in Fig. 2, EIF_A and EIF_B represent the two paths, A and B, along which the PL prefetches instructions. Hereinafter they are collectively referred to as "instruction prefetch paths", abbreviated as EIFC (EIF Channel). EIF is an abbreviation of Early Instruction Fetch, meaning "early instruction prefetch". The EIF performs first-level instruction decoding, determining the length of each instruction and whether it is a jump-type instruction, and stores a certain number of instructions in the EIF;

As shown in Fig. 2, the BTPC is the "branch prediction" control module;

As shown in Fig. 2, when the PRC is unable to provide a PL_off for implementing PBTP, the BTPC controls, according to the finally obtained judgment condition of the conditional jump instruction (as shown by line C), which of EIF_A and EIF_B is the pipeline's "current EIFC", i.e., IBTP is implemented. The current EIFC is BTB_T (or BTB_F; the technical solution of the present invention does not require a particular choice between BTB_T and BTB_F here, but an actual hardware implementation must fix it to one of the two. Hereinafter, unless otherwise noted, the "current EIFC" defaults to BTB_T). Whether it holds BTB_T or BTB_F, each of EIF_A and EIF_B regards itself as the current EIFC while prefetching, and keeps prefetching along the branch direction fixed for the current EIFC until no more instructions can be cached. While EIF_A and EIF_B each prefetch instructions, every time a BTBP is encountered, the instruction address belonging to what that channel regards as the "non-current EIFC" is recorded, but instruction prefetching for that "non-current EIFC" is not continued. Thus, when the BTPC swaps the roles of EIF_A and EIF_B, the starting address for instruction prefetching can be initialized for the real "non-current EIFC";

As shown in Fig. 2, when the PRC is able to provide a PL_off for implementing PBTP, PBTP is executed. That is, when the BTP analyzes a BTBP of the current EIFC, it passes BTB_F (or BTB_T; the technical solution of the present invention does not require a particular choice between BTB_F and BTB_T here, but an actual hardware implementation must fix it to one of the two) through the PCM of Fig. 1 to the PL_off specified by the PLI provided by the PRC, so that this PL_off starts executing BTB_F.
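The role swap between the two EIF channels in the IBTP case can be sketched as follows. This is a simplified software model, not the hardware in Fig. 2: the class names, the list-based instruction buffers, and the choice to empty the unused channel after a correct assumption are assumptions made for the sketch. The current EIFC is fixed to BTB_T, the default stated in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EIF:
    """One early-instruction-fetch channel (hypothetical model)."""
    name: str
    prefetched: List[str] = field(default_factory=list)


class BTPC:
    """Branch prediction control: selects which EIF is the current EIFC."""

    def __init__(self) -> None:
        self.current = EIF("EIF_A")
        self.other = EIF("EIF_B")

    def on_btbp(self, btb_t: List[str], btb_f: List[str]) -> None:
        # The current EIFC keeps prefetching BTB_T; the non-current
        # EIFC starts prefetching BTB_F from its recorded address.
        self.current.prefetched.extend(btb_t)
        self.other.prefetched = list(btb_f)

    def on_result(self, actual_true: bool) -> List[str]:
        """Apply the judgment result; return the instructions to execute."""
        if actual_true:
            # Assumption: the unused BTB_F prefetch is simply dropped.
            self.other.prefetched = []
        else:
            # Clear the mispredicted channel, then swap the two roles.
            self.current.prefetched = []
            self.current, self.other = self.other, self.current
        return self.current.prefetched


ctl = BTPC()
ctl.on_btbp(btb_t=["t0", "t1"], btb_f=["f0", "f1"])
to_execute = ctl.on_result(actual_true=False)
```

Because the non-current EIFC already holds prefetched BTB_F instructions, a wrong assumption costs a channel swap rather than a fresh instruction fetch.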

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following describes specific embodiments of the present invention with reference to the drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be obtained from these drawings without inventive effort.

For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure or flow of the product. In addition, to keep the drawings simple and easy to understand, where several components in a drawing have the same structure or function, only one of them is schematically depicted or labeled. In the present specification, "a" or "an" does not mean only "only one"; it can also mean "more than one".

Example 1

In one embodiment of the present invention, the kernel enters a "binary tree mode" to perform ABTP processing on the first BTBP, which includes the steps of:

Step 1, the program executes the PMT instruction; all PLs enter "binary tree mode" and the PRC enters "binary tree mode". As shown in Fig. 1, except for PL₀, the PLs PL₁, PL₂, and PL₃ all enter the PL_off state. Line B indicates that a PL_off is available and that the PLI of the available PL_off is 1;

Step 2, as shown in Fig. 2, the BTP encounters the first BTBP (hereinafter abbreviated as BTBP₀) and passes the instruction address information and associated status information of BTB_F to PL₁. PL₁ starts executing BTB_F (hereinafter, this BTB_F is abbreviated as BTB₁), and PL₀ executes BTB_T (hereinafter, this BTB_T is abbreviated as BTB₀). At the same time, the PRC updates its information record about the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 2.

Example 2

In one embodiment of the present invention, on the basis of Example 1, PL₀ has not yet obtained the judgment result of BTBP₀ when PL₁ encounters a BTBP (hereinafter abbreviated as BTBP₁), and PL₁ performs ABTP processing, comprising the following steps:

Step 1, PL₁ passes the instruction address information and associated status information of its current BTB_F to PL₂. PL₂ starts executing BTB_F (hereinafter, this BTB_F is abbreviated as BTB₂), and PL₁ executes BTB_T (in this case, the BTB_T is the continuation of BTB₁). At the same time, the PRC updates its information record about the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 3.

Example 3

In one embodiment of the present invention, on the basis of Example 2, PL₀ has not yet obtained the judgment result of BTBP₀ and PL₁ has not yet obtained the judgment result of BTBP₁ when PL₀ encounters a BTBP (hereinafter abbreviated as BTBP₂), and PL₀ performs ABTP processing, comprising the following steps:

Step 1, PL₀ passes the instruction address information and associated status information of its current BTB_F to PL₃. PL₃ starts executing BTB_F (hereinafter, this BTB_F is abbreviated as BTB₃), and PL₀ executes BTB_T (in this case, the BTB_T is the continuation of BTB₀). At the same time, the PRC updates its information record about the PLs; line B indicates that no PL_off is available.
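The allocation sequence of Examples 1 to 3 can be replayed with a short simulation that tracks only which PLs are running and what line B reports. The `TreeModeKernel` class is hypothetical, and the lowest-index allocation policy is an assumption; it was chosen because it reproduces the PLI values stated in these examples.

```python
from typing import Dict, List, Optional


class TreeModeKernel:
    """Hypothetical 4-PL kernel in binary tree mode (after PMT)."""

    def __init__(self, num_pls: int = 4) -> None:
        # PL0 runs the program; every other PL starts as PL_off.
        self.on: List[bool] = [True] + [False] * (num_pls - 1)
        self.spawned_from: Dict[int, int] = {}   # target PLI -> source PLI

    def line_b(self) -> Optional[int]:
        # Assumption: the PRC reports the lowest-indexed PL_off.
        for pli, running in enumerate(self.on):
            if not running:
                return pli
        return None

    def btbp(self, source_pli: int) -> str:
        """ABTP at a BTBP met by the PL with index source_pli."""
        target = self.line_b()
        if target is None:
            return "IBTP"             # no PL_off: handled inside the PL
        self.on[target] = True        # BTB_F is passed on via the PCM
        self.spawned_from[target] = source_pli
        return "PBTP"


kernel = TreeModeKernel()
trace = [kernel.line_b()]             # Example 1, step 1
for source in (0, 1, 0):              # BTBP0 on PL0, BTBP1 on PL1, BTBP2 on PL0
    kernel.btbp(source)
    trace.append(kernel.line_b())
mode_for_next_btbp = kernel.btbp(2)   # a further BTBP met by PL2
```

The trace of line B values is 1, 2, 3, and then no PL available, matching the steps above; with all four PLs running, the next branch point can only be handled by IBTP.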

Example 4

In one embodiment of the present invention, on the basis of Example 3, PL₀ has not yet obtained the judgment result of BTBP₀ and PL₁ has not yet obtained the judgment result of BTBP₁ when PL₂ encounters a BTBP (hereinafter abbreviated as BTBP₃), and PL₂ performs IBTP processing, comprising the following steps:

Step 1, PL₂ performs IBTP processing for BTBP₃: the other EIFC in PL₂ (the non-current EIFC) starts prefetching the instructions of BTB_F, and the current EIFC continues prefetching BTB_T (in this case, the BTB_T is the continuation of BTB₂).

Example 5

In one embodiment of the present invention, on the basis of Example 4, PL₀ obtains the judgment result of BTBP₀, which is that BTB_F continues to run, and PL₀ is released, comprising the following steps:

Step 1, clear all instructions in PL₀; PL₀ enters the PL_off state;

Step 2, clear all instructions in PL₃; PL₃ enters the PL_off state;

Step 3, the PRC updates its information record about the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 0.

Example 6

In one embodiment of the present invention, on the basis of Example 5, PL₁ obtains the judgment result of BTBP₁, which is that BTB_F continues to run, and PL₁ is released, comprising the following steps:

Step 1, clear all instructions in PL₁; PL₁ enters the PL_off state;

Step 2, the PRC updates its information record about the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 0.
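The release steps of Examples 5 and 6 can be replayed from the state reached after Example 4. This sketch is an interpretation rather than a stated mechanism: it assumes that when a BTBP resolves to BTB_F, the PL running BTB_T is cleared together with any PL that it spawned at a later branch point, which reproduces the PLs cleared in both examples. The `records` layout and the function names are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# State after Examples 1 to 4, in the order the branch points were met:
# (BTBP index, PLI running BTB_T, PLI running BTB_F).
# BTBP3 is absent because PL2 handles it internally by IBTP.
records: List[Tuple[int, int, int]] = [(0, 0, 1), (1, 1, 2), (2, 0, 3)]
on: List[bool] = [True, True, True, True]


def resolve_false(btbp: int) -> Set[int]:
    """A BTBP resolves to BTB_F: clear the BTB_T side and release its PLs."""
    idx = next(i for i, rec in enumerate(records) if rec[0] == btbp)
    flushed = {records[idx][1]}
    for _, true_pl, false_pl in records[idx + 1:]:
        if true_pl in flushed:        # branch spawned from a cleared path
            flushed.add(false_pl)
    for pli in flushed:
        on[pli] = False               # the PL enters the PL_off state
    return flushed


def line_b() -> Optional[int]:
    # Assumption: the PRC reports the lowest-indexed PL_off.
    return next((pli for pli, running in enumerate(on) if not running), None)


ex5 = (resolve_false(0), line_b())    # Example 5: BTBP0 resolves to BTB_F
ex6 = (resolve_false(1), line_b())    # Example 6: BTBP1 resolves to BTB_F
```

Example 5 clears PL₀ and PL₃, Example 6 clears PL₁, and in both cases line B then reports PLI 0, leaving only PL₂ running.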

Example 7

In one embodiment of the present invention, on the basis of Example 6, PL₂ obtains the judgment result of BTBP₃, which is that BTB_F continues to run, and PL₂ switches its "current EIFC", comprising the following steps:

Step 1, except for the instructions in the "non-current EIFC", clear all other instructions in PL₂ and reset the "current EIFC";

Step 2, as shown in Fig. 2, the BTPC switches the "non-current EIFC" to be the "current EIFC" and begins executing the instructions that have already been prefetched in it.

Example 8

In one embodiment of the present invention, on the basis of Example 3, PL₀ obtains the judgment result of BTBP₀, which is that BTB_T continues to run, and PL₁ to PL₃ are released, comprising the following steps:

Step 1, clear all instructions in PL₁ to PL₃; PL₁ to PL₃ enter the PL_off state;

Step 2, the PRC updates its information record about the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 1.

It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention. For those skilled in the art, various modifications and refinements can be made without departing from the principle of the invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A circuit structure that establishes associations between multiple PLs, characterized by:

defining and implementing a PCM module in a kernel;

and/or

Different PLs may be interconnected via the PCM to transmit the instruction address and associated state information of a program; after the transmission is complete, the program starts to run in the target PL;

and/or

A program in one PL may, via the PCM, terminate a program running in another PL and change the state of that PL from the PL_on state to the PL_off state.

2. A method for causing all PLs and other related circuits of a kernel to enter the "binary tree mode" of operation, characterized in that:

the PMT instruction is executed in the program.

3. A branch prediction strategy according to claim 1 and claim 2, in which the PL implements IBTP, characterized in that:

the JMPI instruction is executed in a program.

4. An ABTP branch prediction strategy according to claim 1 and claim 2, in which the PL implements both the IBTP and PBTP branch prediction strategies, characterized in that:

a JMPP instruction is executed in the program.

5. A method according to claim 1 and claim 2 for causing all PLs and other related circuitry of the core to exit "binary tree mode" and for migrating a program from a non-specific PL to a specific PL for execution, characterized in that:

the PMR instruction is executed in a program.

6. A method according to claim 4 for causing, during program execution, different branches of a "binary tree" formed in a program by conditional jump instructions to run simultaneously in different PLs, wherein:

when the number of PLs in a core is greater than the number of all BTBs in the core, the BTPC in a PL passes BTB_F (or BTB_T) to another PL in the PL_off state to run concurrently.