ILP2 (Unit4)
ILP2 (Unit4)
ILP2 (Unit4)
Hardware-Based Speculation
Speculating on Branch
Outcomes
To optimally exploit ILP (instruction-level parallelism) – e.g. with
pipelining, Tomasulo, etc. – it is critical to efficiently maintain
control dependencies (=branch dependencies)
Key idea: Speculate on the outcome of branches(=predict) and
execute instructions as if the predictions are correct
of course, we must proceed in such a manner as to be able to
recover if our speculation turns out wrong
Three components of hardware-based speculation
dynamic branch prediction to pick branch outcome
speculation to allow instructions to execute before control
dependencies are resolved, i.e., before branch outcomes become
known – with ability to undo in case of incorrect speculation
dynamic scheduling
Speculating with Tomasulo
Modern processors such as PowerPC 603/604, MIPS R10000,
Intel Pentium II/III/4, Alpha 21264 extend Tomasulo’s
approach to support speculation
Key ideas:
separate execution from completion: allow instructions to execute
speculatively but do not let instructions update registers or
memory until they are no longer speculative
therefore, add a final step – after an instruction is no longer
speculative – when it is allowed to make register and memory
updates, called instruction commit
allow instructions to execute and complete out of order but force
them to commit in order
add a hardware buffer, called the reorder buffer (ROB), with
registers to hold the result of an instruction between completion
and commit
Tomasulo Hardware with
Speculation
Basic structure of MIPS floating-point unit based on Tomasulo and extended
to handle speculation: ROB is added and store buffer in original Tomasulo is
eliminated as its functionality is integrated into the ROB
ROB is a queue!
Tomasulo’s Algorithm with
Speculation: Four Stages
1. Issue: get instruction from Instruction Queue
if reservation station and ROB slot free (no structural hazard),
control issues instruction to reservation station and ROB, and
sends to reservation station operand values (or reservation
station source for values) as well as allocated ROB slot number
2. Execution: operate on operands (EX)
when both operands ready then execute;
if not ready, watch CDB for result
3. Write result: finish execution (WB)
write on CDB to all awaiting units and ROB;
mark reservation station available
4. Commit: update register or memory with ROB result
when instruction reaches head of ROB and results present,
update register with result or store to memory and remove
instruction from ROB
if an incorrectly predicted branch reaches the head of ROB,
flush the ROB, and restart at correct successor of branch
ROB Data Structure
ROB entry fields
Instruction type: branch, store, register operation (i.e., ALU
or load)
State: indicates if instruction has completed and value is ready
Destination: where result is to be written – register number
for register operation (i.e. ALU or load), memory address for
store
branch has no destination result
Value: holds the value of instruction result till time to commit
Load1 no
Load2 no
Add1 no
Add2 no
Add3 no
Addi indicates ith reservation station for the FP add unit, etc.
Reorder Buffer
Entry Busy Instruction State Destination Value
At the time MUL.D is ready to commit only the two L.D instructions have
already committed, though others have completed execution
Actually, the MUL.D is at the head of the ROB – the L.D instructions are
shown only for understanding purposes
#X represents value field of ROB entry number X
Registers
Field F0 F1 F2 F3 F4 F5 F6 F7 F8 F10
Reorder# 3 6 4 5
Busy yes no no no no no yes … yes yes
Floating point registers
Example
Loop: LD F0 0 R1
MULTD F4 F0 F2
SD F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop
wide
Need compiling technique that schedules across several branches
16
VLIW: Very Long Instruction Word
18
Recall: Unrolled Loop
that Minimizes Scalar Stalls
1 Loop: L.D F0,0(R1) L.D to ADD.D: 1 Cycle
2 L.D F6,-8(R1) ADD.D to S.D: 2 Cycles
3 L.D F10,-16(R1)
4 L.D F14,-24(R1)
5 ADD.D F4,F0,F2
6 ADD.D F8,F6,F2
7 ADD.D F12,F10,F2
8 ADD.D F16,F14,F2
9 S.D 0(R1),F4
10 S.D -8(R1),F8
11 S.D -16(R1),F12
12 DSUBUI R1,R1,#32
13 BNEZ R1,LOOP
14 S.D 8(R1),F16 ; 8-32 = -24
22
23
More Instruction-Fetch Bandwidth
24
Speculation:
Register Renaming vs. ROB
Alternative to ROB is larger physical set of registers combined
with register renaming
Extended registers replace function of both ROB and
reservation stations
Instruction issue maps names of architectural registers to
physical register numbers in extended register set
On issue, allocates a new unused register for destination
26
In Conclusion …
interrupting
Machines with precise exceptions provide one single point in program
to restart execution
All instructions before that point have completed
27