
Tuesday, October 31, 2023 10:53 PM: Discuss, The Schemes For Dealing With The Pipeline Stalls Caused by Branch Hazards


1. Consider the following instructions executed in a five-stage pipeline:

lw R1, 0(R2)
add R3, R4, R5
sub R6, R1, R3

2. a. Explain the flow of execution in a pipeline using a simple diagram.

b. For the instruction mix above, on which instruction results does the last instruction depend?
3. Indicate the type of hazards and data dependences for the following instructions:
(5)
b. lw R5, -16(R5)
sw R5, -16(R5)
add R5, R5, R5

4. For the above sequence, add nop instructions to eliminate the hazards. Assume that there is
no forwarding in this pipelined processor.
5. Discuss the schemes for dealing with the pipeline stalls caused by branch hazards. (3)

Discuss the schemes for dealing with the pipeline stalls caused by branch hazards.
ChatGPT
Sure, let's dive into the strategies for mitigating pipeline stalls caused by branch hazards.
1. Branch Prediction:
• Static Prediction: This involves predicting branches at compile-time based on the
program's structure. However, it may not be very accurate for dynamic and runtime-
dependent branches.
• Dynamic Prediction: Here, the processor predicts the outcome of a branch during
runtime. Two-level predictors, tournament predictors, and neural branch predictors are
examples of dynamic prediction techniques.
2. Branch Target Buffer (BTB):
• A BTB is a cache that stores target addresses of recent branches. When a branch
instruction is encountered, the BTB is checked to fetch the predicted target address.
This can help in reducing stalls as the pipeline can continue with the predicted
instruction.
3. Delayed Branching:
• The instruction slots immediately following a branch (the delay slots) are executed regardless of the branch outcome. The compiler tries to fill these slots with instructions that are useful and safe to execute either way, inserting nops when it cannot. This hides part of the branch resolution latency without flushing the pipeline, though slots that cannot be filled usefully still waste cycles.
4. Out-of-Order Execution:
• Modern processors often use out-of-order execution, allowing instructions after a
branch to be executed even before the branch outcome is determined. If the branch
prediction was correct, the results are committed; otherwise, the pipeline is rolled back.
5. Speculative Execution:
• Similar to out-of-order execution, speculative execution involves executing instructions
based on predicted branch outcomes. If the prediction is correct, the results are
committed; otherwise, the speculatively executed instructions are discarded.
6. Loop Buffer:
• A loop buffer is a small, fast cache that stores recently executed instructions in loops. It
helps in predicting loop branches more accurately, reducing the number of
mispredicted branches.
7. Software Pipelining:
• This is a compiler optimization technique that reorganizes the code to minimize
pipeline stalls. It can be used to overlap the execution of multiple iterations of a loop,
reducing the impact of branch hazards.
Each of these techniques comes with its advantages and limitations. The effectiveness of a
particular scheme may depend on the characteristics of the code being executed and the
architecture of the processor.

From <https://chat.openai.com/c/cb22d345-9d92-4035-ad45-9cbc0fb32bb6>

4.13.1: Recall that lw loads a word from memory into a register, sw stores a register into memory, and add adds its two operands together.

Instruction sequence Dependences


a
I1: lw $1,40($6)
I2: add $6,$2,$2
I3: sw $6,50($1)

RAW on $1 from I1 to I3
RAW on $6 from I2 to I3
WAR on $6 from I1 to I2 (I3 only reads $6, so there is no WAR to I3)

b
I1: lw $5,-16($5)
I2: sw $5,-16($5)
I3: add $5,$5,$5

RAW on $5 from I1 to I2 and I3


WAR on $5 from I1 and I2 to I3
WAW on $5 from I1 to I3

4.13.2: In the basic five-stage pipeline, WAR and WAW dependences do not cause any hazards. Without forwarding, any RAW dependence between an instruction and either of the next two instructions results in a hazard (given that the register write happens in the first half of the clock cycle and the register read happens in the second half). The code that eliminates these hazards by inserting nop instructions is:

Instruction sequence
a
lw $1,40($6)
add $6,$2,$2
nop
sw $6,50($1)

Delay I3 to avoid RAW hazard on $1 from I1

New Section 1 Page 1



b
lw $5,-16($5)
nop
nop
sw $5,-16($5)
add $5,$5,$5

Delay I2 to avoid RAW hazard on $5 from I1

Note: No RAW hazard from on $5 from I1 now

4.13.3: With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction without a hazard. However, a load cannot forward to the EX stage of the next instruction (but it can to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:

Instruction sequence
a
lw $1,40($6)
add $6,$2,$2
sw $6,50($1)

No RAW hazard on $1 from I1 (forwarded)

b
lw $5,-16($5)
nop
sw $5,-16($5)
add $5,$5,$5

Delay I2 to avoid RAW hazard on $5 from I1


Value for $5 is forwarded from I1 now
Note: no RAW hazard on $5 from I1 now

4.13.4: The total execution time is the clock cycle time times the number of cycles. Without any stalls, a three-instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per remaining instruction). The execution without forwarding must add a stall for every nop we had in 4.13.2, and execution with forwarding must add a stall cycle for every nop we had in 4.13.3. Overall, we get:

No forwarding With forwarding Speed-up due to forwarding


(7 + 2) × 300ps = 2700ps

7 × 400ps = 2800ps

0.96 (This is really a slowdown)

(7 + 2) × 200ps = 1800ps

(7 + 1) × 250ps = 2000ps

0.90 (This is really a slowdown)

4.13.5 With ALU-ALU-only forwarding, an ALU instruction can forward to the next instruction,
but not to the second-next instruction (because that would be forwarding from MEM to EX). A
load cannot forward at all, because it determines the data value in MEM stage, when it is too
late for ALU-ALU forwarding. We have:

Instruction sequence
a

lw $1,40($6)
nop
add $6,$2,$2
sw $6,50($1)

Delay to avoid the RAW hazard on $1 from I1; a load can't use ALU-ALU forwarding ($1 is produced in MEM). $6 is forwarded ALU-ALU from I2 to I3.

b
lw $5,-16($5)
nop
nop
sw $5,-16($5)
add $5,$5,$5

Can’t use ALU-ALU forwarding ($5 loaded in MEM)

4.13.6: Total execution times of this instruction sequence with no forwarding and with ALU-ALU forwarding only are given below.

No forwarding With ALU-ALU forwarding only Speed-up with ALU-ALU forwarding


(7 + 2) × 300ps = 2700ps

(7 + 1) × 360ps = 2880ps

0.94 (This is really a slowdown)

(7 + 2) × 200ps = 1800ps

(7 + 2) × 220ps = 1980ps

0.91 (This is really a slowdown)

2A



Assuming that N instructions are executed, and all N instructions are add instructions (each taking 4 clock cycles in the multi-cycle implementation), what is the speedup of a pipelined implementation compared to a multi-cycle implementation? Your answer should be an expression that is a function of N. (Assume the clock cycle time is 305 ps.)

4.10 In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:

sw r16, 12(r6)
lw r16, 8(r6)
beq r5, r4, Label # Assume r5!=r4
add r5, r1, r4
slt r5, r15, r4

Assume that individual pipeline stages have the following latencies:

IF ID EX MEM WB
200ps 120ps 150ps 190ps 100ps

4.10.1 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding nops to the code. Can you do the same with this structural hazard? Why?

4.10.2 For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only 4 stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speedup is achieved in this instruction sequence?

4.10.3 Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

4.10.4 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but take into account the (possible) change in clock cycle time. When EX and MEM are done in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20ps needed for the work that could not be done in parallel.

4.10.5 Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3, taking into account the (possible) change in clock cycle time. Assume that the latency of the ID stage increases by 50% and the latency of the EX stage decreases by 10ps when branch outcome resolution is moved from EX to ID.

4.10.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and execution time of this instruction sequence if beq address computation is moved to the MEM stage? What is the speedup from this change? Assume that the latency of the EX stage is reduced by 20ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM.

From <https://www.chegg.com/homework-help/questions-and-answers/410-exercise-examine-resource-hazards-control-hazards-instruction-set-architecture-isa-des-q24942268>

How to prevent hazards using Tomasulo's algorithm in Dynamic Scheduling. Explain with an example
ChatGPT

Tomasulo's algorithm is a dynamic scheduling algorithm used in superscalar processors to execute instructions out of order, maximizing the utilization of execution units. One key feature of Tomasulo's algorithm is its ability to handle data hazards through the use of reservation stations and a common data bus.
To prevent hazards using Tomasulo's algorithm, let's focus on the three types of hazards: data hazards,
structural hazards, and control hazards.
1. Data Hazards:
○ Tomasulo's algorithm addresses data hazards by using reservation stations, which act as
buffers for instructions before they are issued to the functional units. Each reservation
station is associated with a specific functional unit and monitors the availability of operands.
○ For example, consider the following instructions:

○ 1. ADD R1, R2, R3
2. SUB R4, R1, R5
3. MUL R6, R7, R1
○ In Tomasulo's algorithm, these instructions would enter reservation stations. The ADD
instruction would reserve a spot in the reservation station associated with the ADD
functional unit. The subsequent SUB and MUL instructions would also enter reservation
stations.
○ Even if the ADD instruction has not completed its execution, the SUB and MUL instructions can still be issued to their reservation stations. There they wait, monitoring the common data bus for R1, and begin executing as soon as it is broadcast; dependent instructions stall only in their stations rather than blocking the whole pipeline, while instructions whose operands are already available proceed immediately.
2. Structural Hazards:
○ Structural hazards arise when multiple instructions compete for the same functional unit.
Tomasulo's algorithm addresses this by allowing multiple reservation stations for the same
type of functional unit.
○ For example, if there are multiple ADD instructions in the instruction stream, each ADD
instruction can use a separate reservation station associated with the ADD functional unit.
3. Control Hazards:
○ Tomasulo's algorithm primarily focuses on data and structural hazards, but control hazards,
such as branches, may still exist. Techniques like branch prediction and out-of-order
execution can help mitigate control hazards.
In summary, Tomasulo's algorithm prevents hazards by:
• Using reservation stations to handle data hazards.
• Allowing multiple reservation stations for the same type of functional unit to address structural
hazards.
Remember that Tomasulo's algorithm is just one approach to dynamic scheduling, and other techniques
like scoreboarding also address similar issues in out-of-order execution architectures.

From <https://chat.openai.com/c/6f195770-72b0-41e4-a502-26ffc055ebcc>

In the following loop, find all the true dependences, output dependences, and anti-dependences. Eliminate the output dependences and anti-dependences by renaming. How many cycles would it take to execute this loop on a system with 10 processors, assuming addition takes only one cycle?

for (i=0; i<100; i++) {
A[i] = A[i] * B[i]; /* S1 */
B[i] = A[i] + c; /* S2 */
A[i] = C[i] * c; /* S3 */
C[i] = D[i] * A[i]; /* S4 */
}

From <https://www.chegg.com/homework-help/questions-and-answers/following-loop-find-true-dependences-output-dependences-anti-dependences-eliminate-output--q45515765>


An output dependence exists when two or more instructions write to a common register or memory location: here there is an output dependence between S1 and S3, through A[i].
An anti-dependence occurs whenever a succeeding instruction writes a register/memory location that a previous instruction reads: here S2 reads A[i], which S3 then writes (anti-dependence from S2 to S3 through A[i]), and S3 reads C[i], which S4 then writes (anti-dependence from S3 to S4 through C[i]).
The code, rewritten with renaming:
for (i=0; i<100; i++)
{
T[i] = A[i] * B[i]; /* S1 */
B[i] = T[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3 */
C1[i] = D[i] * A1[i]; /* S4 */
}
True dependences remain: from S1 to S2 through T[i], and from S3 to S4 through A1[i]. (S1 also reads B[i] before S2 writes it; this anti-dependence is preserved by the statement order within each iteration.)

From <http://exedustack.nl/why-112-million-rated-moiss-caicedo-shouldnt-leave-brighton-this-year/>

Explanation:
=>S1: A[i] = A[i] + B[i];
=>S2: B[i+1] = C[i] + D[i];
=>There is a loop-carried dependence from S2 to S1: S2 writes B[i+1], which S1 reads as B[i] in the next iteration. There is no dependence between S1 and S2 within a single iteration, and the dependence is not circular.
=>As written, the loop is not parallel: iteration i+1's S1 needs the B[i+1] produced by iteration i's S2, so the iterations cannot simply be distributed across cores.

Making parallel:
=>We call loops parallel when their iterations can be run by separate processes, e.g. on different CPU cores.
=>Computing all the B values first and then all the A values preserves the original semantics, and neither resulting loop has a loop-carried dependence:
for(i=0;i<100;i++){
B[i+1] = C[i] + D[i];
}
for(i=0;i<100;i++){
A[i] = A[i] + B[i];
}
=>So the iterations of each loop above can run in parallel on different CPU cores (the second loop starts after the first finishes).


From <http://exedustack.nl/why-112-million-rated-moiss-caicedo-shouldnt-leave-brighton-this-year/>

Draw and explain the basic structure of vector architecture.

https://www.lkouniv.ac.in/site/writereaddata/siteContent/202004291236420982rohit_vector_architecture.pdf
