Instruction Level Pipelining
Chapter 3
Instruction-Level Parallelism
and Its Exploitation
dcm 4
Basic superscalar 5-stage pipeline
A superscalar processor executes more than one instruction per
clock cycle by simultaneously dispatching multiple instructions to
redundant functional units on the processor.
The hardware determines (statically or dynamically) which of a block of
n instructions will be executed next.
More about hazards
Data hazards: RAW, WAR, WAW
Structural stalls
Control stalls
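The three data-hazard classes above can be stated mechanically: given the register sets an earlier instruction i and a later instruction j read and write, each hazard is a set intersection. A minimal sketch (not from the slides; register names are illustrative):

```python
# Classify data hazards between instruction i (earlier) and j (later),
# given the sets of registers each one reads and writes.
def classify_hazards(i_writes, i_reads, j_writes, j_reads):
    hazards = set()
    if i_writes & j_reads:
        hazards.add("RAW")   # j reads what i writes (true dependence)
    if i_reads & j_writes:
        hazards.add("WAR")   # j overwrites a source of i (antidependence)
    if i_writes & j_writes:
        hazards.add("WAW")   # both write the same register (output dependence)
    return hazards

# DADDU R1,R2,R3 followed by DSUBU R1,R1,R6:
# DSUBU both reads and writes R1, so RAW and WAW
hazards = classify_hazards({"R1"}, {"R2", "R3"}, {"R1"}, {"R1", "R6"})
```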
Challenges:
Data dependence
Instruction j is data dependent on instruction i if:
instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i.
Introduction
Examples
• Example 1: the OR instruction depends on DADDU and DSUBU
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R1,R6
L: …
OR R7,R1,R8
Scheduled code:
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
stall
stall
S.D F4,8(R1)
BNE R1,R2,Loop
Compiler Techniques
Unrolled loop
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) % drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) % drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) % drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
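The unrolled MIPS loop above can be mirrored in a high-level language: four elements are processed per iteration, so the index update and branch are amortized over four bodies. A sketch (the function and its cleanup loop are illustrative, not from the slides):

```python
# Add a scalar to every element of x, with the loop body unrolled by 4.
def add_scalar(x, s):
    n = len(x)
    result = [0.0] * n
    i = 0
    while i + 4 <= n:                 # unrolled by 4, like the MIPS loop
        result[i]     = x[i]     + s
        result[i + 1] = x[i + 1] + s
        result[i + 2] = x[i + 2] + s
        result[i + 3] = x[i + 3] + s
        i += 4                        # one index update per 4 elements
    while i < n:                      # cleanup loop for leftover elements
        result[i] = x[i] + s
        i += 1
    return result
```

The cleanup loop handles trip counts that are not a multiple of the unroll factor, which is the same concern strip mining addresses.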
Compiler Techniques
3. Strip mining
Note: not to be confused with branch target prediction, which guesses the target of a
taken conditional or unconditional jump before it is computed by decoding and
executing the instruction itself. Both are often combined in the same circuitry.
Advantages:
The compiler does not need knowledge of the
microarchitecture
Handles cases where dependencies are unknown at
compile time
Disadvantages:
Substantial increase in hardware complexity
Complicates exceptions
Tomasulo’s Approach
Tracks when operands are available
Introduces register renaming in hardware
Minimizes WAW and WAR hazards
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8      % antidependence (WAR) on F8 with SUB.D
S.D   F6,0(R1)      % antidependence (WAR) on F6 with MUL.D
SUB.D F8,F10,F14
MUL.D F6,F10,F8     % output dependence (WAW) on F6 with ADD.D
i:   DIV.D F0,F2,F4
i+1: ADD.D S,F0,F8   (instead of ADD.D F6,F0,F8)
i+2: S.D   S,0(R1)   (instead of S.D F6,0(R1))
i+3: SUB.D T,F10,F14 (instead of SUB.D F8,F10,F14)
i+4: MUL.D F6,F10,T  (instead of MUL.D F6,F10,F8)
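The renaming above (F6 → S, F8 → T) can be done mechanically: give every destination a fresh name and make later readers use the latest name. A compiler-style sketch, not the Tomasulo hardware itself; the instruction encoding is illustrative:

```python
# Rename destination registers to fresh temporaries, removing WAR/WAW
# hazards while preserving true (RAW) dependences.
def rename(prog, temps):
    """prog: list of (dest, [srcs]); temps: iterable of fresh names."""
    current = {}                                   # arch reg -> latest name
    fresh = iter(temps)
    out = []
    for dest, srcs in prog:
        srcs = [current.get(r, r) for r in srcs]   # read latest definition
        new = next(fresh)                          # fresh dest kills WAR/WAW
        current[dest] = new
        out.append((new, srcs))
    return out

# ADD.D F6,...; SUB.D F8,...; MUL.D F6,...,F8 from the slide example:
prog = [("F6", ["F0", "F8"]), ("F8", ["F10", "F14"]), ("F6", ["F10", "F8"])]
renamed = rename(prog, ["S", "T", "U"])
# MUL.D now reads T (the renamed F8), so the WAR on F8 is gone
```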
A reservation station (RS) fetches and buffers an operand value as soon as it
becomes available (not necessarily from the register file)
Pending instructions designate the RS to which they will send their
output
Result values are broadcast on a result bus, the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, register specifiers are renamed to
reservation-station identifiers
There may be more reservation stations than registers
Execute
When an operand becomes available, store it in every reservation station
waiting for it
When all operands are ready, execute the instruction
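The bookkeeping just described can be sketched minimally: each reservation station holds either an operand value or the tag of the station that will produce it, and a CDB broadcast fills every waiting slot. Class and field names below are illustrative, not from any real implementation:

```python
# Minimal reservation-station / common-data-bus sketch.
class RS:
    def __init__(self, tag, op):
        self.tag, self.op = tag, op
        self.vals = {}       # operand slot -> captured value
        self.waits = {}      # operand slot -> tag of producing RS

    def ready(self):
        return not self.waits            # execute once nothing is pending

def broadcast(stations, tag, value):
    """CDB: deliver 'value' produced by 'tag' to every waiting station."""
    for rs in stations:
        for slot, t in list(rs.waits.items()):
            if t == tag:
                rs.vals[slot] = value
                del rs.waits[slot]

# MUL.D in Mult1 has one operand and waits for Load2's result:
mult1 = RS("Mult1", "MUL.D")
mult1.vals = {"Vk": 2.0}
mult1.waits = {"Vj": "Load2"}
broadcast([mult1], "Load2", 3.5)         # Load2 completes; Mult1 is now ready
```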
The second load, from memory location 45+(R3), is issued; its data will be
stored in load buffer 2 (Load2).
Multiple loads can be outstanding.
The result of Load2 is available to the MULTD executed by Mult1 and the SUBD
executed by Add1. Both can now proceed, as each has both operands.
Mult2 executes the DIVD and cannot proceed yet, as it waits for the result of
Mult1.
Issue ADDD
The result of the SUBD produced by Add1 will be available in the next
cycle.
The ADDD instruction executed by Add2 waits for it.
Only the MULTD and DIVD instructions have not completed. DIVD is waiting for
the result of MULTD before moving to the execute stage.
DIVD will finish execution in cycle 56, and the result will be in F6–F7
in cycle 57.
Exceptions:
An exception is not recognized until the instruction that caused it is
ready to commit
Multiple Issue and Static Scheduling
Multiple issue processors
Example
Modern microarchitectures:
Dynamic scheduling + multiple issue + speculation
Two approaches:
1. Assign reservation stations and update the pipeline control table in
half clock cycles; only supports 2 instructions/clock
2. Build logic wide enough to handle 2 or more instructions at once,
including any dependences between them
FP addition
Integer operations
Load/Store
Time of issue, execution, and result write-back for a dual-issue version of the pipeline.
The LD following the BNE (cycles 3, 6) cannot start execution earlier; it must
wait until the branch outcome is determined, as there is no speculation
Requires
Datapaths wide enough to the instruction cache (IC)
A branch-target buffer: a cache that stores the predicted address of
the next instruction after a branch
Optimization:
Larger branch-target buffer
Add target instructions into the buffer to deal with the longer
decoding time required by a larger buffer
("branch folding"); this improves performance
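A branch-target buffer as described above is just a small cache mapping a branch's PC to its predicted target, consulted during fetch. A sketch with a hypothetical size and addresses (the eviction policy here is a crude FIFO stand-in for real hardware indexing):

```python
# Minimal branch-target buffer: PC of a branch -> predicted target address.
class BTB:
    def __init__(self, entries=8):
        self.entries = entries
        self.table = {}                      # pc -> predicted target

    def predict(self, pc):
        return self.table.get(pc)            # None => predict fall-through

    def update(self, pc, target):
        if len(self.table) >= self.entries and pc not in self.table:
            self.table.pop(next(iter(self.table)))   # evict oldest entry
        self.table[pc] = target

btb = BTB(entries=2)
btb.update(0x400, 0x480)   # branch at 0x400 was taken to 0x480
```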
Value prediction
Hardware approaches to multithreading
1. Fine-grain multithreading: interleaves execution of multiple threads,
switching between threads on each clock cycle, round-robin.
Increases throughput, but slows down execution of each individual thread.
Examples: Sun Niagara, NVIDIA GPUs
2. Coarse-grain multithreading: switches threads only on costly stalls, e.g.,
level-2 or level-3 cache misses.
High pipeline start-up costs; no major current processors use it
3. Simultaneous multithreading (SMT): fine-grain multithreading on a
multiple-issue, dynamically scheduled processor.
The most common approach
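The fine-grain round-robin policy in item 1 can be sketched as a scheduler that picks a different thread each cycle, skipping threads with no remaining work. The thread contents and function name are illustrative only:

```python
# Fine-grain multithreading sketch: one instruction per cycle, switching
# threads round-robin every cycle.
def fine_grain_schedule(threads, cycles):
    """threads: list of instruction lists. Returns (thread_id, instr) per cycle."""
    trace, pcs = [], [0] * len(threads)
    t = 0
    for _ in range(cycles):
        for _ in range(len(threads)):        # find next thread with work
            if pcs[t] < len(threads[t]):
                break
            t = (t + 1) % len(threads)
        else:
            break                            # every thread has finished
        trace.append((t, threads[t][pcs[t]]))
        pcs[t] += 1
        t = (t + 1) % len(threads)           # switch thread every cycle
    return trace

# Two threads interleave A0, B0, A1, B1 — each thread runs at half speed,
# but the pipeline issues every cycle.
trace = fine_grain_schedule([["A0", "A1"], ["B0", "B1"]], cycles=4)
```

This shows the throughput/latency trade-off the slide states: total issue slots stay full, while each individual thread advances only every other cycle.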
Figure 3.45 The relative performance and energy efficiency for a set of single-
threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230
but that it is about 2 times less power efficient on average! Performance is shown in
the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom).
Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the
Atom in energy efficiency, although it is essentially as good on four benchmarks,
three of which are floating point. Only one core is active on the i7, and the rest are in
deep power saving mode. Turbo Boost is used on the i7, which increases its
performance advantage but slightly decreases its relative energy efficiency.
Performance is determined by the CPI and the clock rate together:
CPU time = instruction count × CPI × clock cycle time.
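The CPU-time relation can be made concrete with a worked example; the instruction counts, CPIs, and clock rates below are made up for illustration:

```python
# CPU time = instruction count * CPI / clock rate (Hz).
def cpu_time(instr_count, cpi, clock_rate_hz):
    return instr_count * cpi / clock_rate_hz

# Same 1M-instruction program on two hypothetical cores:
t_a = cpu_time(1_000_000, 1.5, 3e9)   # 3 GHz, CPI 1.5 -> 0.5 ms
t_b = cpu_time(1_000_000, 0.8, 2e9)   # 2 GHz, CPI 0.8 -> 0.4 ms
# The slower-clocked core wins because its CPI is lower: neither the
# clock rate nor the CPI alone determines performance.
```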