CAQA5e ch3
CAQA5e ch3
CAQA5e ch3
Instruction-Level Parallelism
and Its Exploitation
Introduction
Introduction
Pipelining become universal technique in 1985
Overlaps execution of instructions
Exploits Instruction Level Parallelism
Challenges:
Data dependency
Instruction j is data dependent on instruction i if
Instruction i produces a result that may be used by instruction j
Instruction j is data dependent on instruction k and instruction k
is data dependent on instruction i
Control Dependence
Ordering of instruction i with respect to a branch
instruction
Instruction control dependent on a branch cannot be moved
before the branch so that its execution is no longer controller
by the branch
An instruction not control dependent on a branch cannot be
moved after the branch so that its execution is controlled by
the branch
Introduction
Examples
Example 1: OR instruction dependent
DADDU R1,R2,R3 on DADDU and DSUBU
BEQZ R4,L
DSUBU R1,R1,R6
L:
OR R7,R1,R8
Advantages:
Compiler doesnt need to have knowledge of
microarchitecture
Handles cases where dependencies are unknown at
compile time
Disadvantage:
Substantial increase in hardware complexity
Complicates exceptions
Branch Prediction
Dynamic Scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Tomasulos Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Branch Prediction
Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
antidependence
S.D F6,0(R1)
SUB.D F8,F10,F14 antidependence
MUL.D F6,F10,F8
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
operand values
RS fetches and buffers an operand as soon as it becomes
available (not necessarily involving register file)
Pending instructions designate the RS to which they will send
their output
Result values broadcast on a result bus, called the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers are renamed
with the reservation station
May be more reservation stations than registers
Branch Prediction
Tomasulos Algorithm
Load and store buffers
Contain data and addresses, act like reservation
stations
Top-level design:
Branch Prediction
Tomasulos Algorithm
Three Steps:
Issue
Get next instruction from FIFO queue
If available RS, issue the instruction to the RS with operand values if
available
If operand values not available, stall the instruction
Execute
When operand becomes available, store it in any reservation
stations waiting for it
When all operands are ready, issue the instruction
Loads and store maintained in program order through effective
address
No instruction allowed to initiate execution until all branches that
proceed it in program order have completed
Write result
Write result on CDB into reservation stations and store buffers
(Stores must wait until address and value are received)
Branch Prediction
Example
Branch Prediction
Hardware-Based Speculation
Execute instructions along predicted execution
paths but only commit the results if prediction
was correct
Instruction commit: allowing an instruction to
update the register file when instruction is no
longer speculative
Need an additional piece of hardware to prevent
any irrevocable action until an instruction
commits
I.e. updating state or taking an execution
Branch Prediction
Reorder Buffer
Reorder buffer holds the result of instruction
between completion and commit
Four fields:
Instruction type: branch/store/register
Destination field: register number
Value field: output value
Ready field: completed execution?
Exceptions:
Not recognized until it is ready to commit
Multiple Issue and Static Scheduling
Multiple Issue and Static Scheduling
To achieve CPI < 1, need to complete multiple
instructions per clock
Solutions:
Statically scheduled superscalar processors
VLIW (very long instruction word) processors
dynamically scheduled superscalar processors
Multiple Issue and Static Scheduling
Multiple Issue
Multiple Issue and Static Scheduling
VLIW Processors
Package multiple operations into one instruction
Modern microarchitectures:
Dynamic scheduling + multiple issue + speculation
Two approaches:
Assign reservation stations and update pipeline
control table in half clock cycles
Only supports 2 instructions/clock
Design logic to handle any possible dependencies
between the instructions
Hybrid approaches
Value prediction
Uses:
Loads that load from a constant pool
Instruction that produces a value from a small set of values
Not been incorporated into modern processors
Similar idea--address aliasing prediction--is used on
some processors