CN112416438A - Method for realizing pre-branching of assembly line - Google Patents
- Publication number
- CN112416438A
- Authority
- CN
- China
- Prior art keywords
- instruction
- pipeline
- btb
- program
- pls
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
Abstract
A method for realizing pipeline pre-branching (branch prediction). For a processor core, the complexity of the pipeline often determines the performance of the core's instruction set architecture, and even the market segment the processor suits. In fact, the instruction set does not determine the complexity of the pipeline hardware; rather, different market segments require core pipelines of different complexity, and the instruction set architecture is then adapted to the pipeline. It is therefore unscientific to claim that either RISC or CISC will replace the other, unless the corresponding application market disappears. A short pipeline suits the low-end processor market, while a complex, ultra-long pipeline suits the high-end processor market. However, the hardware structure of an ultra-long pipeline brings a very troublesome pre-branching problem. The technical solution of the invention better solves the pre-branching problem for ultra-long pipelines, so that they run more efficiently.
Description
Technical Field
The invention relates to the field of integrated circuits and computers, in particular to a method for realizing pipeline pre-branching.
Background
There are currently two widely differing processor instruction set architectures, RISC and CISC ("reduced instruction set" and "complex instruction set", respectively). The real difference, however, lies in the hardware implementation of these two camps of instruction sets, that is, the circuit implementation of the pipeline.
RISC hardware implements the basic operations a program needs, such as computation, instruction/data reads, and data writes, on top of a larger number of "programming registers". Because all RISC operations are completed through programming registers, the aim of RISC is in effect a simpler circuit structure, and the core pipelines used by the RISC camp are all simple, that is, short.
CISC hardware relies on fewer "programming registers" and implements those basic operations mainly on the basis of pipeline hardware status registers. Most CISC operations implement circuit functions driven by program "requirements", so CISC needs a more complex circuit structure to complete its instructions, and the core pipelines used by the CISC camp are all complex, that is, long.
RISC has the advantage of a simple circuit structure, but because almost all operations go through "programming registers", a single program "requirement" has to be broken into several instructions. For requirements that involve many system bus operations and cache operations, the actual execution efficiency of RISC is therefore very low (even though jumps cause almost no pipeline rollback penalty between instructions). RISC does execute efficiently once its programming registers have been initialized, so it is well suited to applications that operate on small amounts of data. In most complex systems, however (for example PCs or computer systems more complex than PCs), and even in terminal devices with high audio and video processing requirements, most program requirements involve processing large amounts of data and frequent system bus accesses (a more complex system has more peripherals and more user programs, all of which place a heavier load on the system bus).
It is apparent that CISC is an instruction set architecture, or computer technology, positioned for the market of more complex computer systems. The circuit implementation of CISC was originally intended to meet system "requirements" and process larger numbers of operations, so the hardware of the CISC camp has a more complex cache structure and a more complex pipeline structure.
In terms of the technical space available for circuit implementation, CISC has more room than RISC: in theory CISC places no limit on the hardware circuit implementation, whereas RISC essentially limits the room the hardware can use. In the prior art, however, CISC has one disadvantage compared with RISC: the branch prediction problem brought by the ultra-long pipeline. For instructions that jump on a judgment condition, an overly long pipeline means an overly long wait during actual execution. Branch prediction is a hardware method for solving this problem: before the judgment condition produces a result, the condition is assumed to be true or false, and execution continues under that assumption. In this way, the wait that an overly long pipeline imposes on a conditional jump is avoided with a probability of 50%. At present most high-end processors support dual-threaded cores, so in theory both the "true" and the "false" assumption could be pursued at the same time: whichever assumption fails, the other can continue executing past the conditional jump instruction without a pipeline rollback penalty caused by a wrong assumption.
In the prior art, however, how the two "threads" in the core are actually used depends on the scheduling of the operating system, so a program cannot be certain that all pipeline hardware resources are available for branch prediction. Only if software, especially system software, performs more "coordination" or "synchronization" is it theoretically possible for a program to be sure whether all pipeline hardware resources can be used for branch prediction, but doing so defeats the purpose of improving efficiency. (This is why, in the prior art, high-end processor cores do not implement more than 2 threads, i.e. pipeline hardware resources: the additional "threads" would serve too narrow a purpose.) Moreover, even if the prior art realized the theoretical possibility described above, it still could not solve the branch prediction problem of the ultra-long pipeline well. In practice, within a single instruction execution time (the time from an instruction entering the pipeline to the instruction producing its output), several binary-tree branch points are very likely to be in the pipeline at once, so once a second binary-tree branch point enters the pipeline before the first conditional jump instruction has produced its result, the assumption made for it can only be correct with a probability of 50%.
Disclosure of Invention
The technical solution of the invention solves the branch prediction problem of the processor core pipeline (an ultra-long pipeline). Branch prediction is abbreviated below as BTP, short for Binary Tree Prediction. In the technical solution of the present invention, BTP is defined as three specific technical embodiments: "passive branch prediction", "active branch prediction", and "adaptive branch prediction". Hereinafter, "passive branch prediction" is abbreviated as IBTP (Inactive BTP), "active branch prediction" as PBTP (Positive BTP), and "adaptive branch prediction" as ABTP (Adaptive BTP).
When the pipeline encounters a branch point of a binary tree formed by a conditional instruction (hereinafter, a "branch of the binary tree" is abbreviated as BTB, short for Binary Tree Branch, and a "branch point of the binary tree" as BTBP, short for BTB Point), the pipeline uses only the current pipeline hardware resource to assume one outcome of the binary-tree condition (the "true condition" or the "false condition") and executes the instructions of the corresponding branch under that assumption (hereinafter, a "pipeline hardware resource" is abbreviated as PL, the BTB of the "true condition" as BTB_T, and the BTB of the "false condition" as BTB_F). When the conditional instruction produces its actual result, execution continues if the result matches the assumption; otherwise, all instructions in the pipeline that were executing under the assumption are cleared, and the pipeline jumps to the other branch and restarts execution there. Throughout this process, no PL other than the current PL is used to make any assumption about the BTBP. This BTP is IBTP.
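The IBTP behaviour described above can be sketched in a few lines of Python. This is only an illustrative software model, not the patented hardware: the `Pipeline` class, the `run_ibtp` function and the instruction lists are hypothetical names introduced here, and a "branch" is reduced to a list of instruction labels.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical model of one PL: the in-flight instructions and a flush counter."""
    in_flight: list = field(default_factory=list)
    flushes: int = 0


def run_ibtp(pl, btb_t, btb_f, assume_true, actual):
    """Speculate on one branch using only `pl`; flush and restart on a wrong guess."""
    guessed = btb_t if assume_true else btb_f
    other = btb_f if assume_true else btb_t
    # Speculatively issue the instructions of the assumed branch.
    pl.in_flight = list(guessed)
    if actual == assume_true:
        return pl.in_flight  # the assumption held, so execution simply continues
    # The assumption failed: clear everything issued under it and restart
    # on the opposite branch, still on the same PL.
    pl.flushes += 1
    pl.in_flight = list(other)
    return pl.in_flight
```

Calling `run_ibtp(Pipeline(), ["t1", "t2"], ["f1"], assume_true=True, actual=False)` returns the false-branch instructions after one flush, which is the penalty IBTP pays for a wrong assumption.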
When the pipeline encounters a BTBP formed by a conditional instruction, the pipeline first uses the current PL to assume one outcome of the binary-tree condition, say the "true condition", and executes BTB_T under that assumption. At the same time, another PL is used to make the opposite assumption, i.e. the "false condition", and executes BTB_F under it, until the conditional instruction produces its actual result. If the actual result is the "true condition", BTB_T continues execution, BTB_F is abandoned, all instructions on BTB_F are cleared, and the PL occupied by BTB_F is released. If the actual result is the "false condition", BTB_F continues execution, BTB_T is abandoned, all instructions on BTB_T are cleared, and the PL occupied by BTB_T is released. Throughout this process, each branch of the binary tree occupies one PL to perform branch prediction under its own condition. This BTP is PBTP.
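A minimal sketch of PBTP under the same simplifying assumptions may help: both branches are issued at once on two PLs, and the PL on the losing branch is cleared and released when the condition resolves. The function name `run_pbtp` and the dictionary layout are hypothetical and chosen only for illustration.

```python
def run_pbtp(btb_t, btb_f, actual):
    """Run both branches on two PLs and keep the one matching the resolved condition."""
    pls = {
        # The current PL assumes the "true condition" and executes BTB_T.
        "current": {"branch": "BTB_T", "in_flight": list(btb_t), "state": "on"},
        # A second PL assumes the "false condition" and executes BTB_F.
        "other": {"branch": "BTB_F", "in_flight": list(btb_f), "state": "on"},
    }
    winner = "current" if actual else "other"
    loser = "other" if actual else "current"
    # Clear all instructions on the abandoned branch and release its PL.
    pls[loser]["in_flight"].clear()
    pls[loser]["state"] = "off"
    return pls[winner]["in_flight"], pls
```

Unlike the IBTP case, neither outcome needs a restart: the surviving PL already holds the instructions of the correct branch.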
Clearly, IBTP requires only one PL, whereas PBTP requires a sufficient number of PLs for a pipeline of a given length. In theory, the number of PLs needed to fully support PBTP can be calculated for a pipeline of a given length, but more PLs mean higher implementation cost and higher power consumption. In practice, therefore, the number of PLs in the kernel should depend on the requirements of the application. Because the available PLs are not guaranteed to satisfy the number that PBTP requires, BTP should use PBTP preferentially and switch adaptively to IBTP when the PLs run out. Implementing PBTP when enough PLs are available and IBTP when they are not is ABTP.
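The adaptive choice can be reduced to a single decision, sketched below. The function `choose_strategy` and the representation of PL states as a list of `"on"`/`"off"` strings are assumptions made for this example; picking the lowest-numbered idle PL is also an assumption, not something the text prescribes.

```python
def choose_strategy(pl_states):
    """Return ("PBTP", pli) if some PL is idle, otherwise ("IBTP", None)."""
    for pli, state in enumerate(pl_states):
        if state == "off":
            # An idle PL exists, so the opposite branch can run on it (PBTP).
            return "PBTP", pli
    # No idle PL is left, so the current PL must speculate alone (IBTP).
    return "IBTP", None
```

With `["on", "off", "off", "off"]` the sketch selects PBTP on PL 1; once every entry is `"on"` it falls back to IBTP.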
The technical solution of the invention implements ABTP. In specific kernel hardware that implements it, the number of PLs may be fixed, but software does not need to know the specific number. In a specific hardware implementation, each PL in the kernel has a unique "pipeline index", hereinafter abbreviated as PLI, short for PL Index. The PL whose PLI is 0 is abbreviated as PL0, the PL whose PLI is 1 as PL1, and so on. Every PL has two states, an "idle state" and a "running state"; hereinafter a pipeline in the idle state is abbreviated as PL_off and a pipeline in the running state as PL_on.
In the kernel hardware architecture, two hardware modules are implemented in addition to the PLs: a "pipeline resource center" module and a "pipeline interconnection matrix" module. Hereinafter, the "pipeline resource center" is abbreviated as PRC, short for Pipeline Resource Center, and the "pipeline interconnection matrix" as PCM, short for Pipeline Connection Matrix. The PRC manages all PLs: when a program needs a PL during execution, the PRC provides the PLI of an available PL. The PCM interconnects the PLs: when a PL implements PBTP, it transfers BTB_T or BTB_F and other information through the PCM to the target PL (i.e. the PL identified by the PLI that the PRC provided).
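The division of labour between the two modules might be modelled as follows. Both classes are hypothetical software stand-ins: the method names (`acquire`, `release`, `available`, `transfer`) are invented here, and handing out the lowest-numbered idle PL is an assumption that merely matches the PLI values reported in the examples later in the description.

```python
class PRC:
    """Hypothetical Pipeline Resource Center: tracks which PLs are idle."""

    def __init__(self, n_pl):
        self.state = ["off"] * n_pl

    def available(self):
        """Return the PLI of the lowest-numbered idle PL, or None if none is idle."""
        for pli, state in enumerate(self.state):
            if state == "off":
                return pli
        return None

    def acquire(self):
        """Mark the lowest-numbered idle PL as running and return its PLI."""
        pli = self.available()
        if pli is not None:
            self.state[pli] = "on"
        return pli

    def release(self, pli):
        """Return a PL to the idle state."""
        self.state[pli] = "off"


class PCM:
    """Hypothetical Pipeline Connection Matrix: hands a branch to a target PL."""

    def __init__(self, n_pl):
        self.inbox = [None] * n_pl

    def transfer(self, target_pli, address, status):
        """Deliver a branch's start address and state information to the target PL."""
        self.inbox[target_pli] = (address, status)
```

A PL that wants to start PBTP would first ask the PRC for a PLI and then use the PCM to send the branch address and state to that PL.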
Specifically, the technical solution of the present invention is implemented, in both software and hardware, through the following kernel instructions:
1. A "pipeline tree mode" instruction, which uses the characters "PMT" as its mnemonic but is not limited to "PMT". "Tree mode" means "binary tree mode": executing the PMT instruction puts all PLs in the kernel, and the PRC, into the binary-tree working mode. In binary tree mode, the PLs in the kernel can be used only for BTBs and cannot be used by programs as threads (in effect, the kernel runs in single-threaded mode).
2. A "passive branch jump" instruction, which uses the characters "JMPI" as its mnemonic but is not limited to "JMPI". The JMPI instruction jumps on a "judgment condition" in IBTP mode, i.e. only passive branch prediction is used when the current instruction is predicted, and no other PL is occupied. The judgment condition on which JMPI depends may be the value of a programming register of the pipeline hardware, but the technical solution of the present invention is not limited to the judgment condition being a specific fixed address or a specific use of a programming register. The JMPI instruction parameter must specify the instruction address to jump to for either BTB_F or BTB_T; the other branch, which is not assigned a jump address, continues from the address of the current instruction. The address given in the instruction parameter may use "direct addressing", where the parameter value is itself the target instruction address, or "indirect addressing", where the parameter is a pointer to a variable whose stored value is the target instruction address.
3. An "active branch jump" instruction, which uses the characters "JMPP" as its mnemonic but is not limited to "JMPP". The JMPP instruction jumps on a "judgment condition" in ABTP mode, i.e. adaptive branch prediction is used when the current instruction is predicted, and other PLs may be occupied. The judgment condition on which JMPP depends may be a programming register of the pipeline hardware. The JMPP instruction parameter must specify the instruction address to jump to for either BTB_F or BTB_T; the other branch, which is not assigned a jump address, continues from the address of the current instruction. The address given in the instruction parameter may use "direct addressing" or "indirect addressing".
4. A "tree mode end" instruction, which uses the characters "PMR" as its mnemonic but is not limited to "PMR". The PMR instruction causes the PLs in the kernel and the PRC to exit the "binary tree mode" working mode; that is, the PLs in the kernel are no longer available for BTBs but can be used by programs as threads. When the kernel hardware executes PMR and the PL executing the current program is not PL0, the current program is first migrated to PL0, and only then is the PMR instruction actually executed.
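The effect of the four instructions on the kernel state can be summarised with a small model. The `Kernel` class and its fields are hypothetical; in particular, having JMPP fall back to IBTP when the kernel is not in binary tree mode, and choosing the lowest-numbered idle PL, are assumptions made only to keep the sketch complete, since the text does not specify them.

```python
class Kernel:
    """Hypothetical model of how PMT, JMPI, JMPP and PMR change kernel state."""

    def __init__(self, n_pl=4):
        self.n_pl = n_pl
        self.tree_mode = False
        self.current_pli = 0  # PL that runs the current program
        self.idle = set()     # PLIs usable for BTBs while in binary tree mode

    def pmt(self):
        """Enter binary tree mode: every PL except the current one becomes idle."""
        self.tree_mode = True
        self.idle = set(range(self.n_pl)) - {self.current_pli}

    def jmpi(self):
        """Passive branch jump: always IBTP, never occupies another PL."""
        return "IBTP", None

    def jmpp(self):
        """Active branch jump (ABTP): PBTP if an idle PL exists, otherwise IBTP."""
        if self.tree_mode and self.idle:
            pli = min(self.idle)
            self.idle.remove(pli)
            return "PBTP", pli
        return "IBTP", None

    def pmr(self):
        """Leave binary tree mode, migrating the program to PL0 first."""
        self.current_pli = 0
        self.tree_mode = False
        self.idle = set()
```

A program would bracket a branch-heavy region with `pmt()` and `pmr()` and use `jmpp()` for the branches where occupying extra PLs is worthwhile.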
Drawings
Fig. 1 is a schematic diagram of one possible hardware implementation of the technical solution of the present invention. It shows a basic scheme and does not mean that the technical solution is restricted to the illustrated structure. The figure shows a simplified kernel with 4 PLs in order to illustrate the physical relationships between the hardware involved in the technical solution; an actual kernel implementation would contain other modules that are not shown. The PRC in the figure is the PRC module described above, the PCM is the PCM module described above, and PL0 to PL3 are the PLs whose PLIs are 0 to 3;
as shown in Fig. 1, the "code-coupled cache" and the "data-coupled cache" are cache modules for caching code and data, respectively, and are physically connected to the PLs;
as shown in Fig. 1, line A illustrates the path and direction along which a PL propagates branch-prediction-related instructions;
as shown in Fig. 1, line B indicates whether the PRC currently has an available PL_off and, if so, the PLI of that PL_off;
as shown in Fig. 1, line C and line D illustrate the paths through which the PLs access code and data, respectively, and the direction in which the code and data are transmitted;
as shown in Fig. 1, line E illustrates the operating-state information that each PL provides to the PRC;
as shown in Fig. 1, lines F to I illustrate the interconnections between PLs, which carry the instructions and information needed to complete PBTP or ABTP.
Fig. 2 is a schematic diagram of one possible hardware implementation of the technical solution of the present invention. It shows a basic scheme and does not mean that the technical solution is restricted to the illustrated structure. The figure shows a simplified internal structure of a PL in order to illustrate the physical relationships between the PL's internal hardware when the PL implements IBTP; an actual PL implementation would contain other modules that are not shown;
as shown in Fig. 2, EIF_A and EIF_B represent the two paths, A and B, through which the PL prefetches instructions. They are collectively called "instruction prefetch channels" and abbreviated as EIFC, short for EIF Channel. EIF is short for Early Instruction Fetch, meaning "early instruction prefetch". An EIF performs first-level instruction decoding, determining the length of each instruction and whether it is a jump-type instruction, and stores a certain number of instructions;
as shown in Fig. 2, the BTPC is the "branch prediction" control module;
as shown in Fig. 2, when the PRC cannot provide a PL_off for PBTP, the BTPC decides, according to the finally obtained judgment condition of the conditional jump instruction (line C in the figure), which of EIF_A and EIF_B is the pipeline's "current EIFC"; that is, IBTP is implemented. The current EIFC carries BTB_T (or BTB_F; the technical solution of the present invention does not fix which, but an actual hardware implementation must fix exactly one of BTB_T and BTB_F. Hereinafter, unless otherwise noted, the "current EIFC" defaults to BTB_T). Whichever branch it carries, each of EIF_A and EIF_B regards itself as the current EIFC while prefetching, and prefetches along the branch direction fixed for the current EIFC until it can cache no more instructions. While EIF_A and EIF_B prefetch, each time one of them encounters a BTBP it records the instruction address from which the "non-current EIFC" would start, but does not go on to prefetch for the non-current EIFC. Thus, when the BTPC swaps the roles of EIF_A and EIF_B, the real non-current EIFC can be initialized with the starting address for its prefetch;
as shown in Fig. 2, when the PRC can provide a PL_off for PBTP, PBTP is executed. That is, when a BTBP in the current EIFC is analyzed for BTP, BTB_F (or BTB_T; the technical solution of the present invention does not fix which, but an actual hardware implementation must fix exactly one of BTB_F and BTB_T) is transferred through the PCM of Fig. 1 to the PL_off identified by the PLI that the PRC provided, and that PL_off starts executing BTB_F.
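The two-channel prefetch arrangement used for IBTP can be pictured with the following sketch. `EIFPair` is a hypothetical model: it keeps one buffer per channel, records the restart address for the non-current channel when a BTBP is met, and swaps roles when the condition resolves against the current channel. Treating the current channel as BTB_T follows the default stated above; everything else about the class is an assumption.

```python
class EIFPair:
    """Hypothetical model of EIF_A and EIF_B under control of the BTPC."""

    def __init__(self):
        self.current = "A"
        self.buffer = {"A": [], "B": []}     # prefetched instruction addresses
        self.start = {"A": None, "B": None}  # recorded restart address per channel

    def other(self):
        """Name of the non-current channel."""
        return "B" if self.current == "A" else "A"

    def prefetch(self, addresses, btb_f_address):
        """Prefetch BTB_T on the current channel and record where BTB_F starts."""
        self.buffer[self.current] = list(addresses)
        self.start[self.other()] = btb_f_address

    def resolve(self, condition_true):
        """Keep the current channel on a true result, swap channels on a false one."""
        if not condition_true:
            # The prefetched BTB_T instructions are wrong: drop them and let
            # the other channel become current, starting at the recorded address.
            self.buffer[self.current] = []
            self.current = self.other()
        return self.current
```

The point of recording the address in advance is that the swap needs no extra lookup: the new current channel already knows where to begin prefetching.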
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following describes specific embodiments of the present invention with reference to the drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be obtained from these drawings without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention and do not represent the actual structure or flow of a product. In addition, to keep the drawings simple and easy to understand, where several components have the same structure or function, only one of them is schematically depicted or labeled. In the present specification, "a" or "an" does not mean "only one" but may also mean "more than one".
Example 1
In one embodiment of the present invention, the kernel enters "binary tree mode" and performs ABTP processing on the first BTBP, comprising the following steps:
step 1, the program executes the PMT instruction; all PLs enter "binary tree mode" and the PRC enters "binary tree mode". As shown in FIG. 1, all PLs except PL0, namely PL1, PL2 and PL3, enter the PL_off state; line B indicates that a PL_off is available and that the PLI of the available PL_off is 1;
step 2, as shown in FIG. 2, the BTPC encounters the first BTBP (hereinafter abbreviated as BTBP0) and passes the instruction address information and associated state information of BTB_F to PL1. PL1 starts executing BTB_F (hereinafter this BTB_F is abbreviated as BTB1), and PL0 executes BTB_T (hereinafter this BTB_T is abbreviated as BTB0). At the same time, the PRC updates its record of the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 2.
Example 2
In an embodiment of the present invention, on the basis of Example 1, PL0 has not yet obtained the judgment result of BTBP0 when PL1 encounters a BTBP (hereinafter abbreviated as BTBP1), and PL1 performs ABTP processing, comprising the following steps:
step 1, PL1 passes the instruction address information and associated state information of its current BTB_F to PL2. PL2 starts executing BTB_F (hereinafter this BTB_F is abbreviated as BTB2), and PL1 executes BTB_T (this BTB_T is the continuation of BTB1). At the same time, the PRC updates its record of the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 3.
Example 3
In an embodiment of the present invention, on the basis of Example 2, PL0 has not yet obtained the judgment result of BTBP0 and PL1 has not yet obtained the judgment result of BTBP1 when PL0 encounters a BTBP (hereinafter abbreviated as BTBP2), and PL0 performs ABTP processing, comprising the following steps:
step 1, PL0 passes the instruction address information and associated state information of its current BTB_F to PL3. PL3 starts executing BTB_F (hereinafter this BTB_F is abbreviated as BTB3), and PL0 executes BTB_T (this BTB_T is the continuation of BTB0). At the same time, the PRC updates its record of the PLs; line B indicates that no PL_off is available.
Example 4
In an embodiment of the present invention, on the basis of Example 3, PL0 has not yet obtained the judgment result of BTBP0 and PL1 has not yet obtained the judgment result of BTBP1 when PL2 encounters a BTBP (hereinafter abbreviated as BTBP3), and PL2 performs IBTP processing, comprising the following steps:
step 1, PL2 performs IBTP processing on BTBP3: the other EIFC in PL2 (the non-current EIFC) starts prefetching the instructions of BTB_F, while the current EIFC continues prefetching BTB_T (this BTB_T is the continuation of BTB2).
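Examples 1 to 4 can be replayed as a short trace. The sketch below assumes a 4-PL kernel in which PL0 runs the program after PMT, and it assumes that the PRC always hands out the lowest-numbered idle PL, which matches the PLI values reported on line B in the examples. The helper `spawn` and the `tree` bookkeeping are hypothetical names.

```python
def spawn(state, parent, tree):
    """ABTP step at a BTBP: give BTB_F to an idle PL, or fall back to IBTP."""
    idle = [pli for pli, s in enumerate(state) if s == "off"]
    if not idle:
        return "IBTP", None  # no PL_off left, as in Example 4
    child = idle[0]
    state[child] = "on"
    tree.setdefault(parent, []).append(child)  # remember who spawned whom
    return "PBTP", child


# After PMT: PL0 runs the program, PL1 to PL3 are PL_off.
state = ["on", "off", "off", "off"]
tree = {}
trace = [
    spawn(state, 0, tree),  # Example 1: BTBP0 on PL0
    spawn(state, 1, tree),  # Example 2: BTBP1 on PL1
    spawn(state, 0, tree),  # Example 3: BTBP2 on PL0
    spawn(state, 2, tree),  # Example 4: BTBP3 on PL2
]
```

The first three branch points are handled with PBTP on PL1, PL2 and PL3 in turn; the fourth finds no idle PL and is handled with IBTP inside PL2.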
Example 5
In an embodiment of the present invention, on the basis of Example 4, PL0 obtains the judgment result of BTBP0, which is that BTB_F continues to run, and PL0 is released, comprising the following steps:
step 1, all instructions in PL0 are cleared, and PL0 enters the PL_off state;
step 2, all instructions in PL3 are cleared, and PL3 enters the PL_off state;
step 3, the PRC updates its record of the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 0.
Example 6
In an embodiment of the present invention, on the basis of Example 5, PL1 obtains the judgment result of BTBP1, which is that BTB_F continues to run, and PL1 is released, comprising the following steps:
step 1, all instructions in PL1 are cleared, and PL1 enters the PL_off state;
step 2, the PRC updates its record of the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 0.
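The release steps of Examples 5 and 6 follow one pattern: when a branch point resolves to the false condition, the PL that was running the true branch is cleared, together with any PL it spawned after that branch point. The sketch below models only this false-result case. The `spawned` table, which records for each PL the branch points at which it handed BTB_F to another PL, is a hypothetical bookkeeping structure, and reporting the lowest-numbered idle PL on line B is an assumption consistent with the examples.

```python
def subtree(spawned, pl):
    """Return `pl` plus every PL reachable through its spawn records."""
    members = [pl]
    for _, child in spawned.get(pl, []):
        members.extend(subtree(spawned, child))
    return members


def resolve_false(state, spawned, pl, bp):
    """BTBP `bp` on `pl` resolved to BTB_F: release `pl` and its later spawns."""
    records = spawned.get(pl, [])
    position = [point for point, _ in records].index(bp)
    doomed = [pl]
    # PLs spawned by `pl` after `bp` were running on the discarded BTB_T path.
    for _, child in records[position + 1:]:
        doomed.extend(subtree(spawned, child))
    for pli in doomed:
        state[pli] = "off"
        spawned.pop(pli, None)
    idle = [pli for pli, s in enumerate(state) if s == "off"]
    return idle[0] if idle else None  # PLI that line B would report


# State at the end of Example 4: every PL is running.
state = ["on", "on", "on", "on"]
spawned = {0: [("BTBP0", 1), ("BTBP2", 3)], 1: [("BTBP1", 2)]}

line_b_after_ex5 = resolve_false(state, spawned, 0, "BTBP0")  # Example 5
state_after_ex5 = list(state)
line_b_after_ex6 = resolve_false(state, spawned, 1, "BTBP1")  # Example 6
```

After Example 5 only PL1 and PL2 remain running; after Example 6 only PL2 does, and in both cases PL0 is the idle PL reported on line B.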
Example 7
In an embodiment of the present invention, on the basis of Example 6, PL2 obtains the judgment result of BTBP3, which is that BTB_F continues to run, and PL2 switches its "current EIFC", comprising the following steps:
step 1, except for the instructions in the "non-current EIFC" of PL2, all other instructions in PL2 are cleared, and the "current EIFC" is reset;
step 2, as shown in FIG. 2, the BTPC turns the "non-current EIFC" into the "current EIFC" and begins executing the instructions that have already been prefetched in it.
Example 8
In an embodiment of the present invention, on the basis of Example 3, PL0 obtains the judgment result of BTBP0, which is that BTB_T continues to run, and PL1 to PL3 are released, comprising the following steps:
step 1, all instructions in PL1 to PL3 are cleared, and PL1 to PL3 enter the PL_off state;
step 2, the PRC updates its record of the PLs; line B indicates that a PL_off is available and that the PLI of the available PL_off is 1.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention. For those skilled in the art, various modifications and refinements can be made without departing from the principle of the invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (6)
1. A circuit structure that establishes associations between multiple PLs, characterized in that:
a PCM module is defined and implemented in the kernel;
and/or
different PLs can be interconnected through the PCM to transmit the instruction address and associated state information of a program, and after the transmission completes, the program starts to run in the target PL;
and/or
a program in one PL can, through the PCM, terminate a program running in another PL and change the state of that PL from the PL_on state to the PL_off state.
2. A method for causing all PLs and other related circuits of a kernel to enter the "binary tree mode" working mode, characterized in that:
the PMT instruction is executed in the program.
3. According to claim 1 and claim 2, a branch prediction strategy in which a PL implements IBTP, characterized in that:
the JMPI instruction is executed in the program.
4. According to claim 1 and claim 2, an ABTP branch prediction strategy in which a PL implements both the IBTP and the PBTP branch prediction strategies, characterized in that:
the JMPP instruction is executed in the program.
5. According to claim 1 and claim 2, a method for causing all PLs and other related circuits of the kernel to exit the "binary tree mode" working mode and for migrating the program from a non-specific PL to a specific PL for execution, characterized in that:
the PMR instruction is executed in the program.
6. According to claim 4, during program execution, the different branches of a "binary tree" formed in the program by conditional jump instructions run simultaneously in different PLs, wherein:
when the number of PLs in the kernel is greater than the number of all BTBs in the kernel, the BTPC in a PL transfers BTB_F (or BTB_T) to another PL in the PL_off state to run synchronously.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011422792.3A CN112416438A (en) | 2020-12-08 | 2020-12-08 | Method for realizing pre-branching of assembly line |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011422792.3A CN112416438A (en) | 2020-12-08 | 2020-12-08 | Method for realizing pre-branching of assembly line |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112416438A (en) | 2021-02-26 |
Family
ID=74776010
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011422792.3A (Pending), CN112416438A (en) | 2020-12-08 | 2020-12-08 | Method for realizing pre-branching of assembly line |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112416438A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1349160A (en) * | 2001-11-28 | 2002-05-15 | National University of Defense Technology of the Chinese People's Liberation Army | Correlation delay eliminating method for streamline control |
US20080140996A1 (en) * | 2006-12-08 | 2008-06-12 | Michael William Morrow | Apparatus and methods for low-complexity instruction prefetch system |
US20120284463A1 (en) * | 2011-05-02 | 2012-11-08 | International Business Machines Corporation | Predicting cache misses using data access behavior and instruction address |
CN108804205A (en) * | 2017-04-28 | 2018-11-13 | Intel Corporation | The intelligent thread dispatch of atomic operation and vectorization |
Non-Patent Citations (2)
Title |
---|
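人生看淡不服就干: "Operating Systems: Blocking, Sleeping, Suspending" (操作系统--阻塞,睡眠,挂起), 《HTTPS://WWW.JIANSHU.COM/P/AD29C92324A1》, 28 June 2017 (2017-06-28), pages 1 - 2 * |
Fan Yanbin (范延滨), chief editor: "Principles, Interfaces and EDA Design Technology of Microcomputer Systems" (《微型计算机系统原理、接口与EDA设计技术》), Beijing University of Posts and Telecommunications Press (北京邮电大学出版社), pages: 79 * |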
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107092467B (en) | Instruction sequence buffer for enhancing branch prediction efficiency | |
US20160132354A1 (en) | Application scheduling in heterogeneous multiprocessor computing platforms | |
US20140095847A1 (en) | Instruction and highly efficient micro-architecture to enable instant context switch for user-level threading | |
US20040073772A1 (en) | Method and apparatus for thread-based memory access in a multithreaded processor | |
CN115562729A (en) | Data processing apparatus having a stream engine with read and read/forward operand encoding | |
JP2009110509A (en) | System and method for selecting optimal processor performance level by using processor hardware feedback mechanism | |
CN105814538B (en) | Floating point enabled pipeline for emulating a shared memory architecture | |
CN102799418B (en) | Processor architecture and instruction execution method integrating sequence and VLIW (Very Long Instruction Word) | |
CN112241288A (en) | Dynamic control flow reunion point for detecting conditional branches in hardware | |
CN115374923A (en) | RISC-V expansion based universal neural network processor micro-architecture | |
CN101201732A (en) | Multi-mode microprocessor with 32 bits | |
Wallace et al. | Modeled and measured instruction fetching performance for superscalar microprocessors | |
US7620804B2 (en) | Central processing unit architecture with multiple pipelines which decodes but does not execute both branch paths | |
US20010016899A1 (en) | Data-processing device | |
US7290157B2 (en) | Configurable processor with main controller to increase activity of at least one of a plurality of processing units having local program counters | |
WO2022036690A1 (en) | Graph computing apparatus, processing method, and related device | |
CN112416438A (en) | Method for realizing pre-branching of assembly line | |
US8332596B2 (en) | Multiple error management in a multiprocessor computer system | |
US7328327B2 (en) | Technique for reducing traffic in an instruction fetch unit of a chip multiprocessor | |
US6119220A (en) | Method of and apparatus for supplying multiple instruction strings whose addresses are discontinued by branch instructions | |
CN111078289A (en) | Method for executing sub-threads of a multi-threaded system and multi-threaded system | |
CN104636207A (en) | Collaborative scheduling method and system based on GPGPU system structure | |
CN109408118A (en) | MHP heterogeneous multiple-pipeline processor | |
US20110231637A1 (en) | Central processing unit and method for workload dependent optimization thereof | |
JP2005108086A (en) | Data processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240802 |