US20050278517A1 - Systems and methods for performing branch prediction in a variable length instruction set microprocessor

Info

Publication number: US20050278517A1
Application number: US11/132,428
Authority: US
Inventors: Kar-Lik Wong; James Hakewill; Nigel Topham; Rich Fuhler
Original assignee: ARC International
Current assignee: ARC International
Priority date: 2004-05-19
Filing date: 2005-05-19
Publication date: 2005-12-15
Also published as: US9003422B2; GB2428842A; GB0622477D0; WO2005114441A2; CN101002169A; WO2005114441A3; US20140208087A1; US20050273559A1; US20050289323A1; US20050289321A1; TW200602974A; US8719837B2; US20050278505A1; US20050278513A1

Abstract

A method of performing branch prediction in a microprocessor using variable length instructions is provided. An instruction is fetched from memory based on a specified fetch address and a branch prediction is made based on the address. The prediction is selectively discarded if the look-up was based on a non-sequential fetch to an unaligned instruction address and a branch target alignment cache (BTAC) bit of the instruction is equal to zero. In order to remove the inherent latency of branch prediction, an instruction prior to a branch instruction may be fetched concurrently with a branch prediction unit look-up table entry containing prediction information for a next instruction word. Then, the branch instruction is fetched and a prediction is made on this branch instruction based on information fetched in the previous cycle. The predicted target instruction is fetched on the next clock cycle. If zero-overhead loops are used, a look-up table of a branch prediction unit is updated whenever the zero-overhead loop mechanism is updated. The last fetch address of the last instruction of the loop body of a zero-overhead loop is stored in the branch prediction look-up table. Then, whenever an instruction fetch hits the end of the loop body, the instruction fetch is predictively redirected to the start of the loop body. The last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.

Description

CROSS REFERENCE TO RELATED APPLICATION(S)

This application claims priority to provisional application No. 60/572,238 filed May 19, 2004, entitled “Microprocessor Architecture,” hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

This invention relates generally to microprocessor architecture and more specifically to an improved architecture and mode of operation of a microprocessor for performing branch prediction.

BACKGROUND OF THE INVENTION

A typical component of a multistage microprocessor pipeline is the branch prediction unit (BPU). Usually located in or near a fetch stage of the pipeline, the branch prediction unit increases effective processing speed by predicting whether a branch to a non-sequential instruction will be taken based upon past instruction processing history. The branch prediction unit contains a branch look-up or prediction table that stores the address of branch instructions, an indication as to whether the branch was taken and a speculative target address for a taken branch. When an instruction is fetched, if the instruction is a conditional branch, the result of the conditional branch is speculatively predicted based on past branch history. This speculative or predictive result is injected into the pipeline. Thus, referencing a branch history table, the next instruction is speculatively loaded into the pipeline. Whether the prediction is correct will not be known until a later stage of the pipeline. However, if the prediction is correct, clock cycles will be saved by not having to go back to get the next instruction address. Otherwise, the current pipeline behind the stage in which the actual address of the next instruction is determined must be flushed and the correct branch inserted back in the first stage. While this may seem like a harsh penalty for incorrect predictions, in applications where the instruction set is limited and small loops are repeated many times, such as, for example, applications typically implemented with embedded processors, branch prediction is usually accurate enough that the benefits associated with correct predictions outweigh the cost of occasional incorrect predictions, i.e., a pipeline flush. In these types of applications branch prediction can be accurate more than ninety percent of the time. Thus, the risk of predicting an incorrect branch resulting in a pipeline flush is outweighed by the benefit of saved clock cycles.
While branch prediction is effective at increasing effective processing speed, problems may arise that reduce or eliminate these efficiency gains when dealing with a variable length microprocessor instruction set. For example, if the look-up table is comprised of entries associated with 32-bit wide fetch entities and instructions have lengths varying from 16 to 64 bits, a specific look-up table address entry may not be sufficient to reference a particular instruction.
The description herein of various advantages and disadvantages associated with known apparatus, methods, and materials is not intended to limit the scope of the invention to their exclusion. Indeed, various embodiments of the invention may include one or more of the known apparatus, methods, and materials without suffering from their disadvantages.
As background to the techniques discussed herein, the following references are incorporated herein by reference: U.S. Pat. No. 6,862,563 issued Mar. 1, 2005 entitled “Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design” (Hakewill et al.); U.S. Ser. No. 10/423,745 filed Apr. 25, 2003, entitled “Apparatus and Method for Managing Integrated Circuit Designs”; and U.S. Ser. No. 10/651,560 filed Aug. 29, 2003, entitled “Improved Computerized Extension Apparatus and Methods”, all assigned to the assignee of the present invention.

SUMMARY OF THE INVENTION

Thus, there exists a need for microprocessor architecture with reduced power consumption, improved performance, reduction of silicon footprint and improved branch prediction as compared with state of the art microprocessors.
In various embodiments of this invention, a microprocessor architecture is disclosed in which branch prediction information is selectively ignored by the instruction pipeline in order to avoid injection of erroneous instructions into the pipeline. These embodiments are particularly useful for branch prediction schemes in which variable length instructions are predictively fetched. In various exemplary embodiments, a 32-bit word is fetched based on the address in the branch prediction table. However, in branch prediction systems based on addresses of 32-bit fetch objects, because the instruction memory is comprised of 32-bit entries, regardless of instruction length, this address may reference a word comprising two 16-bit instruction words, or a 16-bit instruction word and an unaligned instruction word of larger length (32, 48 or 64 bits) or parts of two unaligned instruction words of such larger lengths.
In various embodiments, the branch prediction table may contain a tag coupled to the lower bits of a fetch instruction address. If the entry at the location specified by the branch prediction table contains more than one instruction, for example, two 16-bit instructions, or a 16-bit instruction and a portion of a 32, 48 or 64-bit instruction, a prediction may be made based on an instruction that will ultimately be discarded. Though the instruction aligner will discard the incorrect instruction, a predicted branch will already have been injected into the pipeline and will not be discovered until branch resolution in a later stage of the pipeline causing a pipeline flush.
Thus, in various exemplary embodiments, to prevent such an incorrect prediction from being made, a prediction will be discarded beforehand if two conditions are satisfied. In various embodiments, a prediction will be discarded if a branch prediction look-up is based on a non-sequential fetch to an unaligned address, and secondly, if the branch target alignment cache (BTAC) bit is equal to zero. This second condition will only be satisfied if the prediction is based on an instruction having an aligned instruction address. In various exemplary embodiments, an alignment bit of zero will indicate that the prediction information is for an aligned branch. This will prevent the predictions based on incorrect instructions from being injected into the pipeline.
In various embodiments of this invention, a microprocessor architecture is disclosed which utilizes dynamic branch prediction while removing the inherent latency involved in branch prediction. In this embodiment, an instruction fetch address is used to look up in a BPU table recording historical program flow to predict when a non-sequential program flow is to occur. However, instead of using the instruction address of the branch instruction to index the branch table, the address of the instruction prior to the branch instruction in the program flow is used to index the branch in the branch table. Thus, fetching the instruction prior to the branch instruction will cause a prediction to be made and eliminate the inherent one-step latency in the process of dynamic branch prediction caused by fetching the address of the branch instruction itself. In the above embodiment, it should be noted that in some cases, a delay slot instruction may be inserted after a conditional branch such that the conditional branch is not the last sequential instruction. In such a case, because the delay slot instruction is the actual sequential departure point, the instruction prior to the non-sequential program flow would actually be the branch instruction. Thus, the BPU would index such an entry by the address of the conditional branch instruction itself, since it would be the instruction prior to the non-sequential instruction.
In various embodiments, use of a delay slot instruction will also affect branch resolution in the selection stage. In various exemplary embodiments, if a delay slot instruction is utilized, update of the BPU must be deferred for one execution cycle after the branch instruction. This process is further complicated by the use of variable length instructions. Performance of branch resolution after execution requires updating of the BPU table. However, when the processor instruction set includes variable length instructions it becomes essential to determine the last fetch address of the current instruction as well as the update address, i.e., the fetch address prior to the sequential departure point. In various exemplary embodiments, if the current instruction is an aligned or non-aligned 16-bit or an aligned 32-bit instruction, the last fetch address will be the instruction fetch address of the current instruction. The update address of an aligned 16-bit or aligned 32-bit instruction will be the last fetch address of the prior instruction. For a non-aligned 16-bit instruction, if it was arrived at sequentially, the update address will be the update address of the prior instruction. Otherwise, the update address will be the last fetch address of the prior instruction.
In the same embodiment, if the current instruction is a non-aligned 32-bit or an aligned 48-bit instruction, the last fetch address will simply be the address of the next instruction. The update address will be the current instruction address. The last fetch address of a non-aligned 48-bit instruction or an aligned 64-bit instruction will be the address of the next instruction minus one and the update address will be the current instruction address. If the current instruction is a non-aligned 64-bit instruction, the last fetch address will be the same as the next instruction address and the update address will be the next instruction address minus one.
In exemplary embodiments of this invention, a microprocessor architecture is disclosed which employs dynamic branch prediction and zero-overhead loops. In such a processor, the BPU is updated whenever the zero-overhead loop mechanism is updated. Specifically, the BPU needs to store the last fetch address of the last instruction of the loop body. This allows the BPU to predictively re-direct instruction fetch to the start of the loop body whenever an instruction fetch hits the end of the loop body. In this embodiment, the last fetch address of the loop body can be derived from the address of the first instruction after the end of the loop, despite the use of variable length instructions, by exploiting the fact that instructions are fetched in 32-bit word chunks and that instruction sizes are in general integer multiples of 16 bits. Therefore, in this embodiment, if the next instruction after the end of the loop body has an aligned address, the last instruction of the loop body has a last fetch address immediately preceding the address of the next instruction after the end of the loop body. Otherwise, if the next instruction after the end of the loop body has an unaligned address, the last instruction of the loop body has the same fetch address as the next instruction after the loop body.
At least one exemplary embodiment of the invention provides a method of performing branch prediction in a microprocessor using variable length instructions. The method of performing branch prediction in a microprocessor using variable length instructions according to this embodiment comprises fetching an instruction from memory based on a specified fetch address, making a branch prediction based on the address of the fetched instruction, and discarding the branch prediction if (1) the branch prediction look-up was based on a non-sequential fetch to an unaligned instruction address and (2) if a branch target alignment cache (BTAC) bit of the instruction is equal to zero.
At least one additional exemplary embodiment provides a method of performing dynamic branch prediction in a microprocessor. The method of performing dynamic branch prediction in a microprocessor according to this embodiment may comprise fetching the penultimate instruction word prior to a non-sequential program flow and a branch prediction unit look-up table entry containing prediction information for a next instruction word on a first clock cycle, fetching the last instruction word prior to a non-sequential program flow and making a prediction on non-sequential program flow based on information fetched in the previous cycle on a second clock cycle, and fetching the predicted target instruction on a third clock cycle.
Yet an additional exemplary embodiment provides a method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor. The method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor may comprise storing a last fetch address of a last instruction of a loop body of a zero overhead loop in the branch prediction look-up table, and predictively re-directing an instruction fetch to the start of the loop body whenever an instruction fetch hits the end of a loop body, wherein the last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.
Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating the contents of a 32-bit instruction memory and a corresponding table illustrating the location of particular instructions within the instruction memory in connection with a technique for selectively ignoring branch prediction information in accordance with at least one exemplary embodiment of this invention;
FIG. 2 is a flow chart illustrating the steps of a method for selectively discarding branch predictions corresponding to aligned 16-bit instructions having the same fetch address as a non-aligned 16-bit target instruction in accordance with at least one exemplary embodiment of this invention;
FIG. 3 is a flow chart illustrating a prior art method of performing branch prediction by storing non-sequential branch instructions in a branch prediction unit table that is indexed by the fetch address of the non-sequential branch instruction;
FIG. 4 is a flow chart illustrating a method for performing branch prediction by storing non-sequential branch instructions in a branch prediction table that is indexed by the fetch address of the instruction prior to the non-sequential branch instruction in accordance with at least one exemplary embodiment of this invention;
FIG. 5 is a diagram illustrating possible scenarios encountered during branch resolution when 32-bit words are fetched from memory in a system incorporating a variable length instruction architecture including instructions of 16-bits, 32-bits, 48-bits or 64-bits in length; and
FIGS. 6 and 7 are tables illustrating a method for computing the last instruction fetch address of a zero-overhead loop for dynamic branch prediction in a variable-length instruction set architecture processor.

DETAILED DESCRIPTION OF THE DISCLOSURE

The following description is intended to convey a thorough understanding of the invention by providing specific embodiments and details involving various aspects of a new and useful microprocessor architecture. It is understood, however, that the invention is not limited to these specific embodiments and details, which are exemplary only. It further is understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
FIG. 1 is a diagram illustrating the contents of a 32-bit instruction memory and a corresponding table illustrating the location of particular instructions within the instruction memory in connection with a technique for selectively ignoring branch prediction information in accordance with at least one exemplary embodiment of this invention. When branch prediction is done in a microprocessor employing a variable length instruction set, a performance problem is created when a branch is made to an unaligned target address that is packed with an aligned instruction in the same 32-bit word that is predicted to be a branch.
In FIG. 1 a sequence of 32-bit wide memory words is shown containing instructions instr_1 through instr_4 in sequential locations in memory. Instr_2 is the target of a non-sequential instruction fetch. The BPU stores prediction information in its tables based only on the 32-bit fetch address of the start of the instruction. There can be more than one instruction in any 32-bit word in memory; however, only one prediction can be made per 32-bit word. Thus, the performance problem can be seen by referring to FIG. 1. The instruction address of instr_2 is actually 0x2; however, the fetch address is 0x0, and a fetch of this address will cause the entire 32-bit word comprised of 16 bits of instr_1 and 16 bits of instr_2 to be fetched. Under a simple BPU configuration, a branch prediction will be made for instr_1 based on the instruction fetch of the 32-bit word at address 0x0. The branch predictor does not take into account the fact that the instr_1 at 0x0 will be discarded by the aligner before it can be issued; the prediction nevertheless remains. The prediction would be correct if instr_1 is fetched as the result of a sequential fetch of 0x0, or if a branch was made to 0x0, but, in this case, where a branch is made to instr_2 at 0x2, the prediction is wrong. As a result, the prediction is wrong for instr_2, causing an incorrect instruction to hit the backstop and a pipeline flush, a severe performance penalty, to occur.
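The relationship between an instruction address and its fetch address can be sketched in a few lines of Python. This is a minimal illustration only: the helper names `fetch_address` and `is_aligned` are hypothetical and do not appear in the disclosure, and byte addressing with 32-bit fetch words is assumed from the 0x0 and 0x2 addresses used above.

```python
WORD_BYTES = 4  # assumption: instruction memory is fetched in 32-bit (4-byte) words


def fetch_address(instr_addr: int) -> int:
    """Byte address of the 32-bit word that contains instr_addr (hypothetical helper)."""
    return instr_addr & ~(WORD_BYTES - 1)


def is_aligned(instr_addr: int) -> bool:
    """True when the instruction starts on a 32-bit word boundary (hypothetical helper)."""
    return instr_addr % WORD_BYTES == 0


# instr_1 at 0x0 and instr_2 at 0x2, the two addresses discussed for FIG. 1.
for name, addr in (("instr_1", 0x0), ("instr_2", 0x2)):
    print(name, hex(addr), hex(fetch_address(addr)), is_aligned(addr))
```

Both instructions map to fetch address 0x0, which is why a table keyed only by fetch address cannot tell a prediction stored for instr_1 from one stored for instr_2.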
FIG. 2 is a flow chart outlining the steps of a method for solving the aforementioned problem by selectively discarding branch prediction information in accordance with various embodiments of the invention. Operation of the method begins at step 200 and proceeds to step 205 where a 32-bit word is read from memory at the specified fetch address of the target instruction. Next, in step 210, a prediction is made based on this fetched instruction. This prediction is based on the aligned instruction fetch location. Operation of the method then proceeds to step 215, where the first part of a two-part test determines whether the branch prediction lookup is based on a non-sequential fetch to an unaligned instruction address. In the context of FIG. 1, this condition would be satisfied by instr_2 because it is non-aligned (it does not start at the beginning of line 0x0, but rather after the first 16 bits). However, this condition alone is not sufficient because a valid branch prediction lookup can be based on a branch located at an unaligned instruction address. This is the case, for example, if in FIG. 1 instr_1 is not a branch and instr_2 is a branch. If, in step 215, it is determined that the branch prediction lookup is based on a non-sequential fetch to an unaligned instruction address, operation of the method proceeds to the next step of the test, step 220. Otherwise, operation of the method jumps to step 225, where the prediction is assumed valid and passed.
Returning to step 220, in this step a second determination is made as to whether the branch target address cache (BTAC) alignment bit is 0, indicating that the prediction information is for an aligned branch. This bit will be 0 for all aligned branches and will be 1 for all unaligned branches because it is derived from the instruction address. The second bit of the instruction address will always be 0 for aligned branches (i.e., 0, 4, 8, c, etc.) and will always be 1 for unaligned branches (i.e., 2, 6, a, etc.). If, in step 220, it is determined that the branch target address cache (BTAC) alignment bit is not 0, operation proceeds to step 225 where the prediction is passed. Otherwise, if in step 220 it is determined that the BTAC alignment bit is 0, operation of the method proceeds to step 230, where the prediction is discarded. Thus, rather than causing an incorrect instruction to be injected into the pipeline which will ultimately cause a pipeline flush, the next sequential instruction will be correctly fetched. After step 230, operation of the method is the same as after step 225, where the next fetch address is updated in step 235 based on whether a branch was predicted and returns to step 205 where the next fetch occurs.
As discussed above, dynamic branch prediction is an effective technique to reduce branch penalty in a pipeline processor architecture. This technique uses the instruction fetch address to look up in internal tables recording program flow history to predict the target of a non-sequential program flow. Also, as discussed above, branch prediction is complicated when a variable-length instruction architecture is used. In a variable-length instruction architecture, the instruction fetch address cannot be assumed to be identical to the actual instruction address. This makes it difficult for the branch prediction algorithm to guarantee sufficient instruction words are fetched and at the same time minimize unnecessary fetches.
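The two-part test of steps 215 through 230 can be illustrated with a short Python sketch. It is not taken from the disclosure: the function and parameter names are hypothetical, and the alignment bit is assumed to be bit 1 of a byte address, as the description of aligned (0, 4, 8) and unaligned (2, 6, a) addresses suggests.

```python
def btac_alignment_bit(branch_addr: int) -> int:
    """Bit 1 of the branch's instruction address: 0 = aligned, 1 = unaligned."""
    return (branch_addr >> 1) & 1


def keep_prediction(non_sequential: bool, fetch_target_addr: int,
                    btac_align_bit: int) -> bool:
    """Return True to pass the prediction (step 225), False to discard it (step 230)."""
    unaligned_target = ((fetch_target_addr >> 1) & 1) == 1
    # Step 215: only a non-sequential fetch to an unaligned address is suspect.
    if not (non_sequential and unaligned_target):
        return True
    # Step 220: a stored alignment bit of 1 means the prediction belongs to the
    # unaligned branch that is actually being executed.
    if btac_align_bit != 0:
        return True
    # Step 230: the prediction belongs to the aligned instruction that the
    # aligner will throw away, so it is discarded.
    return False


# Branch to instr_2 at 0x2 while the table holds a prediction for instr_1 at 0x0.
print(keep_prediction(True, 0x2, btac_alignment_bit(0x0)))   # discarded
# Same branch, but the stored prediction was recorded for instr_2 itself.
print(keep_prediction(True, 0x2, btac_alignment_bit(0x2)))   # passed
```

A sequential fetch of the word at 0x0, or a branch to 0x0, falls through the first check and keeps whatever prediction is stored.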
One known method of ameliorating this problem is to add extra pipeline stages to the front of the processor pipeline to perform branch prediction prior to the instruction fetch to allow more time for the prediction mechanism to make a better decision. A negative consequence of this approach is that extra pipeline stages increase the penalty to correct an incorrect prediction. Alternatively, the extra pipeline stages would not be needed if prediction could be performed concurrent to instruction fetch. However, such a design has an inherent latency in which extra instructions are already fetched by the time a prediction is made.
Traditional branch prediction schemes use the instruction address of a branch instruction (non-sequential program instruction) to index its internal tables. FIG. 3 illustrates such a conventional indexing method in which two instructions are sequentially fetched, the first instruction being a branch instruction, and the second being the next sequential instruction word. In step 300, the branch instruction is fetched with the associated BPU table entry. In the next clock cycle, in step 305, this instruction is propagated in the pipeline to the next stage where it is detected as a predicted branch while the next instruction is fetched. Then, at step 310, in the next clock cycle, the target instruction is fetched based on the branch prediction made in the last cycle. Thus, a latency is introduced because three steps are required to fetch the branch instruction, make a prediction and fetch the target instruction. If the instruction word fetched in 305 is not part of the branch nor of its delay slot, then the word is discarded and as a result a “bubble” is injected into the pipeline.
FIG. 4 illustrates a novel and improved method for making a branch prediction in accordance with various embodiments of the invention. The method depicted in FIG. 4 is characterized in that the instruction address of the instruction preceding the branch instruction is used to index the BPU table rather than the instruction address of the branch instruction itself. As a result, by fetching the instruction just prior to the branch instruction, a prediction can be made from the address of this instruction while the branch instruction itself is being fetched.
Referring specifically to FIG. 4, the method begins in step 400 where the instruction prior to the branch instruction is fetched together with the BPU entry containing prediction information of the next instruction. Next, in step 405, the branch instruction is fetched while, concurrently, a prediction on this branch can be made based on information fetched in the previous cycle. Then, in step 410, in the next clock cycle, the target instruction is fetched. As illustrated, no extra instruction word is fetched between the branch and the target instructions. Hence, no bubble will be injected into the pipeline and overall performance of the processor is improved.
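The difference between the indexing schemes of FIG. 3 and FIG. 4 can be seen in a toy fetch model, sketched below in Python. The model is an assumption made for illustration only: addresses are counted in 32-bit fetch words, every instruction occupies exactly one word, there is no delay slot, and a table hit found during one cycle redirects the fetch two cycles later. The function name `fetch_trace` and the word addresses 10, 11 and 40 are hypothetical.

```python
def fetch_trace(start, cycles, bpu_table):
    """Return the word addresses fetched over the given number of cycles.

    bpu_table maps a fetch word address to a predicted target word address.
    The look-up runs concurrently with the fetch, so its result can only
    steer the fetch that follows the next one.
    """
    trace = []
    addr = start
    pending = None  # target found by the look-up of the previous cycle
    for _ in range(cycles):
        trace.append(addr)
        hit = bpu_table.get(addr)
        next_addr = pending if pending is not None else addr + 1
        pending = hit
        addr = next_addr
    return trace


# Word 10 holds the instruction before the branch, word 11 the branch,
# and word 40 the branch target.
conventional = fetch_trace(10, 4, {11: 40})  # table indexed by the branch itself
improved = fetch_trace(10, 4, {10: 40})      # table indexed by the prior instruction
print(conventional)
print(improved)
```

With the conventional index the word after the branch (word 12) is fetched and must be discarded, whereas indexing by the prior instruction lets the target follow the branch directly.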
It should be noted that in some cases, due to the use of delay slot instructions, the branch instruction may not be the departure point (the instruction prior to non-sequential flow). Rather another instruction may appear after the branch instruction. Therefore, though the non-sequential jump is dictated by the branch instruction, the last instruction to be executed may not be the branch instruction, but may rather be the delay slot instruction. A delay slot is used in some processor architectures with short pipelines to hide branch resolution latency. Processors with dynamic branch prediction might still have to support the concept of delay slots to be compatible with legacy code. Where a delay slot instruction is used after the branch instruction, utilizing the above branch prediction scheme will cause the instruction address of the branch instruction, not the instruction before the branch instruction, to be used to index the BPU tables, because this instruction is actually the instruction before the last instruction. This fact has significant consequences for branch resolution as will be discussed below. Namely, in order to effectively perform branch resolution, we must know the last fetch address of the previous instruction.
As stated above, branch resolution occurs in the selection stage of the pipeline and causes the BPU to be updated to reflect the outcome of the conditional branch during the write-back stage. FIG. 5 illustrates five potential scenarios encountered when performing branch resolution. These scenarios may be grouped into two groups by the way in which they are handled. Group one comprises two scenarios: a non-aligned 16-bit instruction, and an aligned 16 or 32-bit instruction. Group two comprises three scenarios: a non-aligned 32-bit or an aligned 48-bit instruction, a non-aligned 48-bit or an aligned 64-bit instruction, and a non-aligned 64-bit instruction.
Two pieces of information need to be computed for every instruction under this scheme: namely the last fetch address of the current instruction, L₀, and the update address of the current instruction, U₀. In the case of the scenarios of group one, it is also necessary to know L₋₁, the last fetch address of the previous instruction, and U₋₁, the update address of the previous instruction. Looking at the first and second scenarios, a non-aligned 16-bit instruction and an aligned 16 or 32-bit instruction respectively, L₀ is simply the 30 most significant bits of the fetch address, denoted as instr_addr[31:2]. However, because in both of these scenarios the instruction address spans only one fetch address line, the update address U₀ depends on whether these instructions were arrived at sequentially or as the result of a non-sequential instruction. In keeping with the method discussed in the context of FIG. 4, we know the last fetch address of the prior instruction, also known as L₋₁. This information is stored internally and is available as a variable to the current instruction in the select stage of the pipeline. In the first scenario, if the current instruction is arrived at through sequential program flow, it has the same departure address as the prior instruction and hence U₀ will be U₋₁. Otherwise, the update address will be the last fetch address of the prior non-sequential instruction, L₋₁. In the second scenario, the update address of a 16 or 32-bit aligned instruction, U₀, will be the last fetch address of the prior instruction, L₋₁, irrespective of whether the prior instruction was sequential or not.
Scenarios 3-5 can be handled in the same manner by taking advantage of the fact that each instruction fetch fetches a contiguous 32-bit word. Therefore, when the instruction is sufficiently long and/or unaligned to span two or more consecutive fetched instruction words in memory, we know with certainty that L₀, the last fetch address, can be derived from the instruction address of the next sequential instruction, denoted as next_addr[31:2] in FIG. 5. In scenarios 3 and 5, covering non-aligned 32-bit, aligned 48-bit and non-aligned 64-bit instructions, the last portion of the current instruction shares the same fetch address with the start of the next sequential instruction. Hence, L₀ will be next_addr[31:2]. In scenario 4, covering non-aligned 48-bit or aligned 64-bit instructions, the fetch address of the last portion of the current instruction is one less than the start address of the next sequential instruction. Hence, L₀ = next_addr[31:2]−1. On the other hand, in scenarios 3 and 4, the current instruction spans two consecutive 32-bit fetched instruction words. The fetch address prior to the last portion of the current instruction is always the fetch address of the start of the instruction. Therefore, U₀ will be inst_addr[31:2]. In scenario 5, the last portion of the current instruction shares the same fetch address as the start of the next sequential instruction. Hence, U₀ will be next_addr[31:2]−1. In the scheme just described, the update address U₀ and last fetch address L₀ are computed based on 4 values that are provided to the selection stage as early arriving signals directly from registers. These signals are namely inst_addr, next_addr, L₋₁ and U₋₁. Only one multiplexer is required to compute U₀ in scenario 1, and one decrementer is required to compute L₀ in scenario 4 and U₀ in scenario 5. The overall complexity of the novel and improved branch prediction method being disclosed is only marginally increased compared with traditional methods.
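The five scenarios can be collected into one function, sketched here in Python as an illustration rather than as the disclosed hardware. The function name `last_and_update` and its parameters are hypothetical; byte addresses that are multiples of two are assumed, and the [31:2] bit-slice is modeled as a right shift by two.

```python
def last_and_update(inst_addr, size_bits, prev_last, prev_update, sequential):
    """Return (L0, U0) for the current instruction as fetch word addresses.

    prev_last and prev_update stand for L-1 and U-1 of the prior instruction;
    sequential tells whether the current instruction was reached sequentially.
    """
    aligned = (inst_addr & 0x2) == 0
    next_addr = inst_addr + size_bits // 8
    inst_word = inst_addr >> 2   # models inst_addr[31:2]
    next_word = next_addr >> 2   # models next_addr[31:2]

    if size_bits == 16 and not aligned:                     # scenario 1
        return inst_word, (prev_update if sequential else prev_last)
    if (size_bits, aligned) in ((16, True), (32, True)):    # scenario 2
        return inst_word, prev_last
    if (size_bits, aligned) in ((32, False), (48, True)):   # scenario 3
        return next_word, inst_word
    if (size_bits, aligned) in ((48, False), (64, True)):   # scenario 4
        return next_word - 1, inst_word
    if (size_bits, aligned) == (64, False):                 # scenario 5
        return next_word, next_word - 1
    raise ValueError("unsupported instruction size")


# A non-aligned 64-bit instruction at byte address 0x2 occupies words 0, 1 and 2.
print(last_and_update(0x2, 64, prev_last=0, prev_update=0, sequential=True))
```

In every scenario the computed L₀ equals the word address of the last byte of the instruction, which gives a simple way to cross-check the table against a brute-force calculation.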
In yet another embodiment of the invention, a method and apparatus are provided for computing the last instruction fetch of a zero-overhead loop for dynamic branch prediction in a variable length instruction set microprocessor. Zero-overhead loops, as well as the previously discussed dynamic branch prediction, are both powerful techniques for improving effective processor performance. In a microprocessor employing both techniques, the BPU has to be updated whenever the zero-overhead loop mechanism is updated. In particular, the BPU needs the last instruction fetch address of the loop body. This allows the BPU to re-direct instruction fetch to the start of the loop body whenever an instruction fetch hits the end of the loop body. However, in a variable-length instruction architecture, determining the last fetch address of a loop body is not trivial. Typically, a processor with a variable-length instruction set only keeps track of the first address an instruction is fetched from. However, the last fetch address of a loop body is the fetch address of the last portion of the last instruction of the loop body and is not readily available.
Typically, a zero-overhead loop mechanism requires an address related to the end of the loop body to be stored as part of the architectural state. In various exemplary embodiments, this address can be denoted as LP_END. If LP_END is assigned the address of the next instruction after the last instruction of the loop body, the last fetch address of the loop body, designated in various exemplary embodiments as LP_LAST, can be derived by exploiting two facts. Firstly, despite the variable length nature of the instruction set, instructions are fetched in fixed size chunks, namely 32-bit words. The BPU works only with the fetch addresses of these fixed size chunks. Secondly, the sizes of variable-length instructions are usually an integer multiple of a fixed size, namely 16 bits. Based on these facts, an instruction can be classified as aligned if the start address of the instruction is the same as the fetch address. If LP_END is an aligned address, LP_LAST must be the fetch address that precedes that of LP_END. If LP_END is non-aligned, LP_LAST is the fetch address of LP_END. Thus, the equation LP_LAST = LP_END[31:2] − (~LP_END[1]), where ~LP_END[1] is the single-bit inversion of bit 1 of LP_END, can be used to derive LP_LAST whether or not LP_END is aligned.
Referring to FIGS. 6 and 7, two examples are illustrated in which LP_END is non-aligned and aligned, respectively. In both cases, the instruction “sub” is the last instruction of the loop body. In the first case, LP_END is located at 0xA. In this case, LP_END is unaligned and LP_END[1] is 1; thus, the inversion of LP_END[1] is 0 and the last fetch address of the loop body, LP_LAST, is LP_END[31:2], which is 0x8. In the second case, LP_END is aligned and located at 0x18. LP_END[1] is 0, as with all aligned instructions; thus, the inversion of LP_END[1] is 1 and LP_LAST is LP_END[31:2]−1, or the line above LP_END, line 0x14. Note that in the above calculations the least significant bits of addresses that are known to be zero are ignored for the sake of simplifying the description.
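By way of illustration only, the LP_LAST derivation can be checked against the two examples with a minimal sketch. The function name lp_last is hypothetical, LP_END is assumed to be a byte address, and the result is a fetch (word) address, so it is shifted left by two bits to compare with the byte addresses quoted in the text.

```python
def lp_last(lp_end):
    """LP_LAST = LP_END[31:2] - (~LP_END[1]), as a fetch (word) address."""
    word = lp_end >> 2                    # LP_END[31:2]
    not_bit1 = ((lp_end >> 1) & 1) ^ 1    # single-bit inversion of LP_END[1]
    return word - not_bit1


# Non-aligned LP_END at 0xA: the last fetch line is LP_END's own line, 0x8.
print(hex(lp_last(0xA) << 2))   # 0x8
# Aligned LP_END at 0x18: the last fetch line is the preceding line, 0x14.
print(hex(lp_last(0x18) << 2))  # 0x14
```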
While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only, and are not to be interpreted as limitations of the present invention. Many modifications to the embodiments described above can be made without departing from the spirit and scope of the invention.

Claims

1. In a microprocessor, a method of performing branch prediction using variable length instructions, the method comprising:

fetching an instruction from memory based on a specified fetch address;

making a branch prediction based on the address of the fetched instruction; and

discarding the branch prediction if:

(1) a branch prediction look-up was based on a non-sequential fetch to an unaligned instruction address; and

(2) a branch target alignment cache (BTAC) bit of the instruction is equal to a predefined value.

2. The method according to claim 1, further comprising passing a predicted instruction associated with the branch prediction if either (1) or (2) is false.

3. The method according to claim 2, further comprising updating a next fetch address if a branch prediction is incorrect.

4. The method according to claim 3, wherein the microprocessor comprises an instruction pipeline having a select stage, and updating comprises, after resolving a branch in the select stage, updating the branch prediction unit (BPU) with the address of the next instruction resulting from that branch.

5. The method according to claim 1, wherein making a branch prediction comprises parsing a branch look-up table of a branch prediction unit (BPU) that indexes non-sequential branch instructions by their addresses in association with the next instruction taken.

6. The method according to claim 1, wherein an instruction is determined to be unaligned if it does not start at the beginning of a memory address line.

7. The method according to claim 1, wherein a BTAC alignment bit will be one of a 0 or a 1 for an aligned branch instruction and the other of a 0 or a 1 for an unaligned branch instruction.

8. In a microprocessor, a method of performing dynamic branch prediction comprising:

fetching the penultimate instruction word prior to a non-sequential program flow and a branch prediction unit look-up table entry containing prediction information for a next instruction on a first clock cycle;

fetching the last instruction word prior to a non-sequential program flow and making a prediction on this non-sequential program flow based on information fetched in the previous cycle on a second clock cycle; and

fetching the predicted target instruction on a third clock cycle.

9. The method according to claim 8, wherein fetching an instruction prior to a branch instruction and a branch prediction look-up table entry comprises using the instruction address of the instruction just prior to the branch instruction in the program flow to index the branch in the branch table.

10. The method according to claim 9, wherein if a delay slot instruction appears after the branch instruction, fetching an instruction prior to a branch instruction and a branch prediction look-up table entry comprises using the instruction address of the branch instruction, not the instruction before the branch instruction, to index the BPU tables.

11. A method of updating a look-up table of a branch prediction unit in a variable length instruction set microprocessor, the method comprising:

storing a last fetch address of a last instruction of a loop body of a zero overhead loop in the branch prediction look-up table; and

predictively re-directing an instruction fetch to the start of the loop body whenever an instruction fetch hits the end of a loop body, wherein the last fetch address of the loop body is derived from the address of the first instruction after the end of the loop.

12. The method according to claim 11, wherein storing comprises, if the next instruction after the end of the loop body has an aligned address, the last instruction of the loop body has a last fetch address immediately preceding the address of the next instruction after the end of the loop body, otherwise, if the next instruction after the end of the loop body has an unaligned address, the last instruction of the loop body has a last fetch address the same as the address of the next instruction after the loop body.