Computer Architecture, A Quantitative Approach (Solution For 5th Edition)
Computer Architecture, A Quantitative Approach (Solution For 5th Edition)
Computer Architecture, A Quantitative Approach (Solution For 5th Edition)
Chapter 2 Solutions
Chapter 3 Solutions
Chapter 4 Solutions
Chapter 5 Solutions
Chapter 6 Solutions
Appendix A Solutions
Appendix B Solutions
Appendix C Solutions
2
6
13
33
44
50
63
83
92
Chapter 1 Solutions
Case Study 1: Chip Fabrication Cost
1.1
0.30 3.89 4
a. Yield = 1 + --------------------------- = 0.36
4.0
b. It is fabricated in a larger technology, which is an older plant. As plants age,
their process gets tuned, and the defect rate decreases.
2
1.2
30
( 30 2 )
a. Dies per wafer = ----------------------------- ------------------------------- = 471 54.4 = 416
1.5
sqrt ( 2 1.5 )
0.30 1.5 4
Yield = 1 + ------------------------ = 0.65
4.0
Profit = 416 0.65 $20 = $5408
2
30
( 30 2 )
b. Dies per wafer = ----------------------------- ------------------------------- = 283 42.1 = 240
2.5
sqrt ( 2 2.5 )
0.30 2.5 4
Yield = 1 + -------------------------- = 0.50
4.0
Profit = 240 0.50 $25 = $3000
c. The Woods chip
d. Woods chips: 50,000/416 = 120.2 wafers needed
Markon chips: 25,000/240 = 104.2 wafers needed
Therefore, the most lucrative split is 120 Woods wafers, 30 Markon wafers.
1.3
0.75 1.99 2 4
a. Defect Free single core = 1 + ---------------------------------- = 0.28
4.0
No defects = 0.282 = 0.08
One defect = 0.28 0.72 2 = 0.40
No more than one defect = 0.08 + 0.40 = 0.48
Wafer size
b. $20 = -----------------------------------old dpw 0.28
$20 0.28 = Wafer size/old dpw
Wafer size
$20 0.28
x = -------------------------------------------------- = ------------------------1/2 old dpw 0.48
1/2 0.48
= $23.33
Chapter 1 Solutions
1.5
14 KW
a. ------------------------------------------------------------ = 183
( 66 W + 2.3 W + 7.9 W )
14 KW
b. --------------------------------------------------------------------- = 166
( 66 W + 2.3 W + 2 7.9 W )
c. 200 W 11 = 2200 W
2200/(76.2) = 28 racks
Only 1 cooling door is required.
1.6
a. The IBM x346 could take less space, which would save money in real estate.
The racks might be better laid out. It could also be much cheaper. In addition,
if we were running applications that did not match the characteristics of these
benchmarks, the IBM x346 might be faster. Finally, there are no reliability
numbers shown. Although we do not know that the IBM x346 is better in any
of these areas, we do not know it is worse, either.
1.7
a. (1 8) + .8/2 = .2 + .4 = .6
2
Power new
( V 0.60 ) ( F 0.60 )
3
- = 0.6 = 0.216
b. -------------------------- = ---------------------------------------------------------2
Power old
V F
.75
c. 1 = -------------------------------- ; x = 50%
(1 x) + x 2
2
Power new
( V 0.75 ) ( F 0.60 )
2
- = 0.75 0.6 = 0.338
d. -------------------------- = ---------------------------------------------------------2
Power old
V F
Exercises
1.8
a. (1.35)10 = approximately 20
b. 3200 (1.4)12 = approximately 181,420
c. 3200 (1.01)12 = approximately 3605
d. Power density, which is the power consumed over the increasingly small area,
has created too much heat for heat sinks to dissipate. This has limited the
activity of the transistors on the chip. Instead of increasing the clock rate,
manufacturers are placing multiple cores on the chip.
Copyright 2012 Elsevier, Inc. All rights reserved.
a. 50%
b. Energy = load V2. Changing the frequency does not affect energyonly
power. So the new energy is load ( V)2, reducing it to about the old
energy.
1.10
a. 60%
b. 0.4 + 0.6 0.2 = 0.58, which reduces the energy to 58% of the original
energy.
c. newPower/oldPower = Capacitance (Voltage .8)2 (Frequency .6)/
Capacitance Voltage Frequency = 0.82 0.6 = 0.256 of the original power.
d. 0.4 + 0 .3 2 = 0.46, which reduce the energy to 46% of the original energy.
1.11
a. 109/100 = 107
b. 107/107 + 24 = 1
c. [need solution]
1.12
1.13
1.14
Chapter 1 Solutions
12
Net speedup
10
8
6
4
2
0
10
20
30 40 50 60 70
Percent vectorization
80
90
100
1.16
1.17
1.18
a. 1/(.2 + .8/N)
b. 1/(.2 + 8 0.005 + 0.8/8) = 2.94
c. 1/(.2 + 3 0.005 + 0.8/8) = 3.17
d. 1/(.2 + logN 0.005 + 0.8/N)
e. d/dN(1/((1 P) + logN 0.005 + P/N)) = 0
Chapter 2 Solutions
Case Study 1: Optimizing Cache Performance via
Advanced Techniques
2.1
a. Each element is 8B. Since a 64B cacheline has 8 elements, and each column
access will result in fetching a new line for the non-ideal matrix, we need a
minimum of 8x8 (64 elements) for each matrix. Hence, the minimum cache
size is 128 8B = 1KB.
b. The blocked version only has to fetch each input and output element once.
The unblocked version will have one cache miss for every 64B/8B = 8 row
elements. Each column requires 64Bx256 of storage, or 16KB. Thus, column
elements will be replaced in the cache before they can be used again. Hence
the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in
the blocked version.
c. for (i = 0; i < 256; i=i+B) {
for (j = 0; j < 256; j=j+B) {
for(m=0; m<B; m++) {
for(n=0; n<B; n++) {
output[j+n][i+m] = input[i+m][j+n];
}
}
}
}
d. 2-way set associative. In a direct-mapped cache the blocks could be allocated
so that they map to overlapping regions in the cache.
e. You should be able to determine the level-1 cache size by varying the block
size. The ratio of the blocked and unblocked program speeds for arrays that
do not fit in the cache in comparison to blocks that do is a function of the
cache block size, whether the machine has out-of-order issue, and the bandwidth provided by the level-2 cache. You may have discrepancies if your
machine has a write-through level-1 cache and the write buffer becomes a
limiter of performance.
2.2
Since the unblocked version is too large to fit in the cache, processing eight 8B elements requires fetching one 64B row cache block and 8 column cache blocks.
Since each iteration requires 2 cycles without misses, prefetches can be initiated
every 2 cycles, and the number of prefetches per iteration is more than one, the
memory system will be completely saturated with prefetches. Because the latency
of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.
2.3
Chapter 2 Solutions
2.5
a. Hint: This is visible in the graph above as a slight increase in L2 miss service
time for large data sets, and is 4KB for the graph above.
b. Hint: Take independent strides by the page size and look for increases in
latency not attributable to cache sizes. This may be hard to discern if the
amount of memory mapped by the TLB is almost the same as the size as a
cache level.
c. Hint: This is visible in the graph above as a slight increase in L2 miss service
time for large data sets, and is 15ns in the graph above.
d. Hint: Take independent strides that are multiples of the page size to see if the
TLB if fully-associative or set-associative. This may be hard to discern if the
amount of memory mapped by the TLB is almost the same as the size as a
cache level.
2.6
a. Hint: Look at the speed of programs that easily fit in the top-level cache as a
function of the number of threads.
b. Hint: Compare the performance of independent references as a function of
their placement in memory.
2.7
Exercises
2.8
a. The access time of the direct-mapped cache is 0.86ns, while the 2-way and
4-way are 1.12ns and 1.37ns respectively. This makes the relative access
times 1.12/.86 = 1.30 or 30% more for the 2-way and 1.37/0.86 = 1.59 or
59% more for the 4-way.
b. The access time of the 16KB cache is 1.27ns, while the 32KB and 64KB are
1.35ns and 1.37ns respectively. This makes the relative access times 1.35/
1.27 = 1.06 or 6% larger for the 32KB and 1.37/1.27 = 1.078 or 8% larger for
the 64KB.
c. Avg. access time = hit% hit time + miss% miss penalty, miss% = misses
per instruction/references per instruction = 2.2% (DM), 1.2% (2-way), 0.33%
(4-way), .09% (8-way).
Direct mapped access time = .86ns @ .5ns cycle time = 2 cycles
2-way set associative = 1.12ns @ .5ns cycle time = 3 cycles
Copyright 2012 Elsevier, Inc. All rights reserved.
a. The average memory access time of the current (4-way 64KB) cache is 1.69ns.
64KB direct mapped cache access time = .86ns @ .5 ns cycle time = 2 cycles
Way-predicted cache has cycle time and access time similar to direct mapped
cache and miss rate similar to 4-way cache.
The AMAT of the way-predicted cache has three components: miss, hit with
way prediction correct, and hit with way prediction mispredict: 0.0033 (20)
+ (0.80 2 + (1 0.80) 3) (1 0.0033) = 2.26 cycles = 1.13ns
b. The cycle time of the 64KB 4-way cache is 0.83ns, while the 64KB directmapped cache can be accessed in 0.5ns. This provides 0.83/0.5 = 1.66 or 66%
faster cache access.
c. With 1 cycle way misprediction penalty, AMAT is 1.13ns (as per part a), but
with a 15 cycle misprediction penalty, the AMAT becomes: 0.0033 20 +
(0.80 2 + (1 0.80) 15) (1 0.0033) = 4.65 cycles or 2.3ns.
d. The serial access is 2.4ns/1.59ns = 1.509 or 51% slower.
2.10
a. The access time is 1.12ns, while the cycle time is 0.51ns, which could be
potentially pipelined as finely as 1.12/.51 = 2.2 pipestages.
b. The pipelined design (not including latch area and power) has an area of
1.19 mm2 and energy per access of 0.16nJ. The banked cache has an area of
1.36 mm2 and energy per access of 0.13nJ. The banked design uses slightly
more area because it has more sense amps and other circuitry to support the
two banks, while the pipelined design burns slightly more power because the
memory arrays that are active are larger than in the banked case.
2.11
a. With critical word first, the miss service would require 120 cycles. Without
critical word first, it would require 120 cycles for the first 16B and 16 cycles
for each of the next 3 16B blocks, or 120 + (3 16) = 168 cycles.
b. It depends on the contribution to Average Memory Access Time (AMAT) of
the level-1 and level-2 cache misses and the percent reduction in miss service
times provided by critical word first and early restart. If the percentage reduction in miss service times provided by critical word first and early restart is
roughly the same for both level-1 and level-2 miss service, then if level-1
misses contribute more to AMAT, critical word first would likely be more
important for level-1 misses.
Chapter 2 Solutions
2.12
2.13
a. A 2GB DRAM with parity or ECC effectively has 9 bit bytes, and would
require 18 1Gb DRAMs. To create 72 output bits, each one would have to
output 72/18 = 4 bits.
b. A burst length of 4 reads out 32B.
c. The DDR-667 DIMM bandwidth is 667 8 = 5336 MB/s.
The DDR-533 DIMM bandwidth is 533 8 = 4264 MB/s.
2.14
a. This is similar to the scenario given in the figure, but tRCD and CL are
both 5. In addition, we are fetching two times the data in the figure. Thus it
requires 5 + 5 + 4 2 = 18 cycles of a 333MHz clock, or 18 (1/333MHz) =
54.0ns.
b. The read to an open bank requires 5 + 4 = 9 cycles of a 333MHz clock, or
27.0ns. In the case of a bank activate, this is 14 cycles, or 42.0ns. Including
20ns for miss processing on chip, this makes the two 42 + 20 = 61ns and
27.0 + 20 = 47ns. Including time on chip, the bank activate takes 61/47 = 1.30
or 30% longer.
2.15
The costs of the two systems are $2 130 + $800 = $1060 with the DDR2-667
DIMM and 2 $100 + $800 = $1000 with the DDR2-533 DIMM. The latency to
service a level-2 miss is 14 (1/333MHz) = 42ns 80% of the time and 9 (1/333
MHz) = 27ns 20% of the time with the DDR2-667 DIMM.
It is 12 (1/266MHz) = 45ns (80% of the time) and 8 (1/266MHz) = 30ns
(20% of the time) with the DDR-533 DIMM. The CPI added by the level-2
misses in the case of DDR2-667 is 0.00333 42 .8 + 0.00333 27 .2 = 0.130
giving a total of 1.5 + 0.130 = 1.63. Meanwhile the CPI added by the level-2
misses for DDR-533 is 0.00333 45 .8 + 0.00333 30 .2 = 0.140 giving a
total of 1.5 + 0.140 = 1.64. Thus the drop is only 1.64/1.63 = 1.006, or 0.6%,
while the cost is $1060/$1000 = 1.06 or 6.0% greater. The cost/performance of
the DDR2-667 system is 1.63 1060 = 1728 while the cost/performance of the
DDR2-533 system is 1.64 1000 = 1640, so the DDR2-533 system is a better
value.
2.16
10
a. The system built from 1Gb DRAMs will have twice as many banks as the
system built from 2Gb DRAMs. Thus the 1Gb-based system should provide
higher performance since it can have more banks simultaneously open.
b. The power required to drive the output lines is the same in both cases, but the
system built with the x4 DRAMs would require activating banks on 18 DRAMs,
versus only 9 DRAMs for the x8 parts. The page size activated on each x4 and
x8 part are the same, and take roughly the same activation energy. Thus since
there are fewer DRAMs being activated in the x8 design option, it would have
lower power.
2.18
a. With policy 1,
Precharge delay Trp = 5 (1/333 MHz) = 15ns
Activation delay Trcd = 5 (1/333 MHz) = 15ns
Column select delay Tcas = 4 (1/333 MHz) = 12ns
Access time when there is a row buffer hit
r ( Tcas + Tddr )
Th = -------------------------------------100
With policy 2,
Access time = Trcd + Tcas + Tddr
If A is the total number of accesses, the tip-off point will occur when the net
access time with policy 1 is equal to the total access time with policy 2.
i.e.,
r
100 r
--------- ( Tcas + Tddr )A + ----------------- ( Trp + Trcd + Tcas + Tddr )A
100
100
= (Trcd + Tcas + Tddr)A
100 Trp
r = ---------------------------Trp + Trcd
r = 100 (15)/(15 + 15) = 50%
If r is less than 50%, then we have to proactively close a page to get the best
performance, else we can keep the page open.
b. The key benefit of closing a page is to hide the precharge delay Trp from the
critical path. If the accesses are back to back, then this is not possible. This
new constrain will not impact policy 1.
Chapter 2 Solutions
11
c. For any row buffer hit rate, policy 2 requires additional r (2 + 4) nJ per
access. If r = 50%, then policy 2 requires 3nJ of additional energy.
2.19
Hibernating will be useful when the static energy saved in DRAM is at least equal
to the energy required to copy from DRAM to Flash and then back to DRAM.
DRAM dynamic energy to read/write is negligible compared to Flash and can be
ignored.
9
8 10 2 2.56 10
Time = ------------------------------------------------------------64 1.6
= 400 seconds
The factor 2 in the above equation is because to hibernate and wakeup, both Flash
and DRAM have to be read and written once.
2.20
2.21
a. Programs that do a lot of computation but have small memory working sets
and do little I/O or other system calls.
b. The slowdown above was 60% for 10%, so 20% system time would run
120% slower.
c. The median slowdown using pure virtualization is 10.3, while for para virtualization the median slowdown is 3.76.
12
d. The null call and null I/O call have the largest slowdown. These have no real
work to outweigh the virtualization overhead of changing protection levels,
so they have the largest slowdowns.
2.22
The virtual machine running on top of another virtual machine would have to emulate privilege levels as if it was running on a host without VT-x technology.
2.23
a. As of the date of the Computer paper, AMD-V adds more support for virtualizing virtual memory, so it could provide higher performance for memoryintensive applications with large memory footprints.
b. Both provide support for interrupt virtualization, but AMDs IOMMU also
adds capabilities that allow secure virtual machine guest operating system
access to selected devices.
2.24
2.25
Chapter 3 Solutions
13
Chapter 3 Solutions
Case Study 1: Exploring the Impact of Microarchitectural
Techniques
3.1
The baseline performance (in cycles, per loop iteration) of the code sequence in
Figure 3.48, if no new instructions execution could be initiated until the previous instructions execution had completed, is 40. See Figure S.2. Each instruction requires one clock cycle of execution (a clock cycle in which that
instruction, and only that instruction, is occupying the execution units; since
every instruction must execute, the loop will take at least that many clock
cycles). To that base number, we add the extra latency cycles. Dont forget the
branch shadow cycle.
Loop:
LD
F2,0(Rx)
1+4
DIVD
F8,F2,F0
1 + 12
MULTD
F2,F6,F2
1+5
LD
F4,0(Ry)
1+4
ADDD
F4,F0,F4
1+1
ADDD
F10,F8,F2
1+1
ADDI
Rx,Rx,#8
ADDI
Ry,Ry,#8
SD
F4,0(Ry)
1+1
SUB
R20,R4,Rx
BNZ
R20,Loop
1+1
____
40
Figure S.2 Baseline performance (in cycles, per loop iteration) of the code sequence
in Figure 3.48.
3.2
How many cycles would the loop body in the code sequence in Figure 3.48
require if the pipeline detected true data dependencies and only stalled on those,
rather than blindly stalling everything just because one functional unit is busy?
The answer is 25, as shown in Figure S.3. Remember, the point of the extra
latency cycles is to allow an instruction to complete whatever actions it needs, in
order to produce its correct output. Until that output is ready, no dependent
instructions can be executed. So the first LD must stall the next instruction for
three clock cycles. The MULTD produces a result for its successor, and therefore
must stall 4 more clocks, and so on.
14
Loop:
LD
F2,0(Rx)
1 + 4
DIVD
F8,F2,F0
1 + 12
MULTD
F2,F6,F2
1 + 5
LD
F4,0(Ry)
1 + 4
<stall>
<stall>
<stall>
<stall>
F4,F0,F4
1 + 1
F10,F8,F2
1 + 1
ADDI
Rx,Rx,#8
ADDI
Ry,Ry,#8
SD
F4,0(Ry)
1 + 1
SUB
R20,R4,Rx
BNZ
R20,Loop
1 + 1
25
Figure S.3 Number of cycles required by the loop body in the code sequence in
Figure 3.48.
3.3
Consider a multiple-issue design. Suppose you have two execution pipelines, each
capable of beginning execution of one instruction per cycle, and enough fetch/
decode bandwidth in the front end so that it will not stall your execution. Assume
results can be immediately forwarded from one execution unit to another, or to itself.
Further assume that the only reason an execution pipeline would stall is to observe a
true data dependency. Now how many cycles does the loop require? The answer
is 22, as shown in Figure S.4. The LD goes first, as before, and the DIVD must wait
for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run
in the second pipe along with the DIVD, since theres no dependency between them.
(Note that they both need the same input, F2, and they must both wait on F2s readiness, but there is no constraint between them.) The LD following the MULTD does not
depend on the DIVD nor the MULTD, so had this been a superscalar-order-3 machine,
Copyright 2012 Elsevier, Inc. All rights reserved.
Chapter 3 Solutions
Execution pipe 0
Loop:
LD
F2,0(Rx)
15
Execution pipe 1
;
<nop>
<nop>
<nop>
<nop>
<nop>
DIVD
F8,F2,F0
MULTD
LD
F4,0(Ry)
<nop>
<nop>
<nop>
<nop>
<nop>
ADD
<nop>
<nop>
<nop>
<nop>
<nop>
<nop>
<nop>
ADDD
F10,F8,F2
ADDI
ADDI
Ry,Ry,#8
SD
F4,0(Ry)
SUB
R20,R4,Rx
BNZ
R20,Loop
F4,F0,F4
<nop>
F2,F6,F2
Rx,Rx,#8
that LD could conceivably have been executed concurrently with the DIVD and the
MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in
the cycle following the DIVD/MULTD. The loop overhead instructions at the loops
bottom also exhibit some potential for concurrency because they do not depend on
any long-latency instructions.
3.4
Possible answers:
1. If an interrupt occurs between N and N + 1, then N + 1 must not have been
allowed to write its results to any permanent architectural state. Alternatively,
it might be permissible to delay the interrupt until N + 1 completes.
2. If N and N + 1 happen to target the same register or architectural state (say,
memory), then allowing N to overwrite what N + 1 wrote would be wrong.
3. N might be a long floating-point op that eventually traps. N + 1 cannot be
allowed to change arch state in case N is to be retried.
Copyright 2012 Elsevier, Inc. All rights reserved.
16
Long-latency ops are at highest risk of being passed by a subsequent op. The
DIVD instr will complete long after the LD F4,0(Ry), for example.
3.5
Figure S.5 demonstrates one possible way to reorder the instructions to improve the
performance of the code in Figure 3.48. The number of cycles that this reordered
code takes is 20.
Execution pipe 0
Loop: LD
Execution pipe 1
;
LD
F2,0(Rx)
F4,0(Ry)
DIVD
F8,F2,F0
ADDD
MULTD
F2,F6,F2
F4,F0,F4
SD
<nop>
<nop>
ADDI
Rx,Rx,#8
ADDI
Ry,Ry,#8
<nop>
<nop>
<nop>
<nop>
<nop>
SUB
R20,R4,Rx
ADDD
BNZ
R20,Loop
F10,F8,F2
<nop>
F4,0(Ry)
#ops:
11
#nops:
(20 2) 11 = 29
3.6
a. Fraction of all cycles, counting both pipes, wasted in the reordered code
shown in Figure S.5:
11 ops out of 2x20 opportunities.
1 11/40 = 1 0.275
= 0.725
b. Results of hand-unrolling two iterations of the loop from code shown in Figure S.6:
exec time w/o enhancement
c. Speedup = -------------------------------------------------------------------exec time with enhancement
Speedup = 20 / (22/2)
Speedup = 1.82
Copyright 2012 Elsevier, Inc. All rights reserved.
Chapter 3 Solutions
Execution pipe 0
Loop:
17
Execution pipe 1
LD
F2,0(Rx)
LD
F4,0(Ry)
LD
F2,0(Rx)
F4,0(Ry)
LD
DIVD
F8,F2,F0
ADDD
F4,F0,F4
DIVD
F8,F2,F0
ADDD
F4,F0,F4
MULTD
F2,F0,F2
SD
F4,0(Ry)
MULTD
F2,F6,F2
SD
F4,0(Ry)
<nop>
ADDI
Rx,Rx,#16
ADDI
Ry,Ry,#16
<nop>
<nop>
<nop>
<nop>
<nop>
<nop>
<nop>
ADDD
F10,F8,F2
SUB
R20,R4,Rx
ADDD
F10,F8,F2
BNZ
R20,Loop
<nop>
cycles per loop iter 22
Figure S.6 Hand-unrolling two iterations of the loop from code shown in Figure S.5.
3.7
Consider the code sequence in Figure 3.49. Every time you see a destination register in the code, substitute the next available T, beginning with T9. Then update all
the src (source) registers accordingly, so that true data dependencies are maintained. Show the resulting code. (Hint: See Figure 3.50.)
Loop:
LD
T9,0(Rx)
IO:
MULTD
T10,F0,T2
I1:
DIVD
T11,T9,T10
I2:
LD
T12,0(Ry)
I3:
ADDD
T13,F0,T12
I4:
SUBD
T14,T11,T13
I5:
SD
T14,0(Ry)
3.8
See Figure S.8. The rename table has arbitrary values at clock cycle N 1. Look at
the next two instructions (I0 and I1): I0 targets the F1 register, and I1 will write the F4
register. This means that in clock cycle N, the rename table will have had its entries 1
and 4 overwritten with the next available Temp register designators. I0 gets renamed
first, so it gets the first T reg (9). I1 then gets renamed to T10. In clock cycle N,
instructions I2 and I3 come along; I2 will overwrite F6, and I3 will write F0. This
means the rename tables entry 6 gets 11 (the next available T reg), and rename table
entry 0 is written to the T reg after that (12). In principle, you dont have to allocate T
regs sequentially, but its much easier in hardware if you do.
I0:
SUBD
F1,F2,F3
I1:
ADDD
F4,F1,F2
I2:
MULTD
F6,F4,F1
I3:
DIVD
F0,F2,F6
Renamed in cycle N
Renamed in cycle N + 1
Clock cycle
N 1
Rename table
18
N +1
12
10
11
62
62
62
62
62
62
63
63
63
63
63
63
12 11 10 9
14 13 12 11
16 15 14 13
Next avail
T reg
Figure S.8 Cycle-by-cycle state of the rename table for every instruction of the code
in Figure 3.51.
3.9
5 + 5 > 10
ADD
10 + 10 > 20
ADD
20 + 20 > 40
Chapter 3 Solutions
3.10
19
An example of an event that, in the presence of self-draining pipelines, could disrupt the pipelining and yield wrong results is shown in Figure S.10.
alu0
Clock
cycle
alu1
ld/st
ld/st
LW R4, 0(R0)
LW R4, 0(R0)
br
LW R5, 8(R1)
LW R5, 8(R1)
3
4 ADDI R10, R4, #1
5 ADDI R10, R4, #1
SW R7, 0(R6)
SW R9, 8(R8)
SW R7, 0(R6)
SW R9, 8(R8)
BNZ R4, Loop
Figure S.10 Example of an event that yields wrong results. What could go wrong
with this? If an interrupt is taken between clock cycles 1 and 4, then the results of the LW
at cycle 2 will end up in R1, instead of the LW at cycle 1. Bank stalls and ECC stalls will
cause the same effectpipes will drain, and the last writer wins, a classic WAW hazard.
All other intermediate results are lost.
3.11
See Figure S.11. The convention is that an instruction does not enter the execution
phase until all of its operands are ready. So the first instruction, LW R3,0(R0),
marches through its first three stages (F, D, E) but that M stage that comes next
requires the usual cycle plus two more for latency. Until the data from a LD is available at the execution unit, any subsequent instructions (especially that ADDI R1, R1,
#1, which depends on the 2nd LW) cannot enter the E stage, and must therefore stall
at the D stage.
Loop length
LW R3,0(R0)
(2.11a) 4 cycles lost to branch overhead
Figure S.11 Phases of each instruction per clock cycle for one iteration of the loop.
18
SW R1,0(R3)
SUB R4,R3,R2
ADDI R1,R1,#1
LW R1,0(R3)
17
11
16
10
15
13
LW R3,0(R0)
14
12
Loop:
19
...
20
a. 4 cycles lost to branch overhead. Without bypassing, the results of the SUB
instruction are not available until the SUBs W stage. That tacks on an extra 4
clock cycles at the end of the loop, because the next loops LW R1 cant begin
until the branch has completed.
b. 2 cycles lost w/ static predictor. A static branch predictor may have a heuristic
like if branch target is a negative offset, assume its a loop edge, and loops
are usually taken branches. But we still had to fetch and decode the branch
to see that, so we still lose 2 clock cycles here.
c. No cycles lost w/ correct dynamic prediction. A dynamic branch predictor
remembers that when the branch instruction was fetched in the past, it eventually turned out to be a branch, and this branch was taken. So a predicted taken
will occur in the same cycle as the branch is fetched, and the next fetch after
that will be to the presumed target. If correct, weve saved all of the latency
cycles seen in 3.11 (a) and 3.11 (b). If not, we have some cleaning up to do.
3.12
F2,0(Rx)
F8,F2,F0
F2,F8,F2
LD
F4,0(Ry)
ADDD
ADDD
F4,F0,F4
F10,F8,F2
ADDI
ADDI
SD
Rx,Rx,#8
Ry,Ry,#8
F4,0(Ry)
SUB
BNZ
R20,R4,Rx
R20,Loop
;
;
;
;
;
;
;
;
;
;
;
;
;
;
;
Chapter 3 Solutions
21
b. See Figure S.13. The number of clock cycles taken by the code sequence is 25.
Cycle op was dispatched to FU
alu0
alu1
Clock cycle 1
ADDI Rx,Rx,#8
SUB R20,R4,Rx
ld/st
ADDI Ry,Ry,#8
LD F2,0(Rx)
LD F4,0(Ry)
ncy
LD
late
4
5
6
DIVD F8,F2,F0
ADD
D la
y
latenc
...
18
19
ADDD F4,F0,F4
DIVD
tenc
SD F4,0(Ry)
MULTD F2,F8,F2
20
UL
TD
21
lat
en
22
cy
23
24
BNZ R20,Loop
25
Branch shadow
ADDD F10,F8,F2
c. See Figures S.14 and S.15. The bold instructions are those instructions that
are present in the RS, and ready for dispatch. Think of this exercise from the
Reservation Stations point of view: at any given clock cycle, it can only
see the instructions that were previously written into it, that have not
already dispatched. From that pool, the RSs job is to identify and dispatch
the two eligible instructions that will most boost machine performance.
0
LD
F2, 0(Rx)
LD
F2, 0(Rx)
LD
F2, 0(Rx)
LD
F2, 0(Rx)
LD
F2, 0(Rx)
LD
F2, 0(Rx)
DIVD
F8,F2,F0
DIVD
F8,F2,F0
DIVD
F8,F2,F0
DIVD
F8,F2,F0
DIVD
F8,F2,F0
DIVD
F8,F2,F0
MULTD
F2,F8,F2
MULTD
F2,F8,F2
MULTD
F2,F8,F2
MULTD
F2,F8,F2
MULTD
F2,F8,F2
MULTD
F2,F8,F2
LD
F4, 0(Ry)
LD
F4, 0(Ry)
LD
F4, 0(Ry)
LD
F4, 0(Ry)
LD
F4, 0(Ry)
LD
F4, 0(Ry)
ADDD
F4,F0,F4
ADDD
F4,F0,F4
ADDD
F4,F0,F4
ADDD
F4,F0,F4
ADDD
F4,F0,F4
ADDD
F4,F0,F4
ADDD
F10,F8,F2
ADDD
F10,F8,F2
ADDD
F10,F8,F2
ADDD
F10,F8,F2
ADDD
F10,F8,F2
ADDD
F10,F8,F2
ADDI
Rx,Rx,#8
ADDI
Rx,Rx,#8
ADDI
Rx,Rx,#8
ADDI
Rx,Rx,#8
ADDI
Rx,Rx,#8
ADDI
Rx,Rx,#8
ADDI
Ry,Ry,#8
ADDI
Ry,Ry,#8
ADDI
Ry,Ry,#8
ADDI
Ry,Ry,#8
ADDI
Ry,Ry,#8
ADDI
Ry,Ry,#8
SD
F4,0(Ry)
SD
F4,0(Ry)
SD
F4,0(Ry)
SD
F4,0(Ry)
SD
F4,0(Ry)
SD
F4,0(Ry)
SUB
R20,R4,Rx
SUB
R20,R4,Rx
SUB
R20,R4,Rx
SUB
R20,R4,Rx
SUB
R20,R4,Rx
SUB
R20,R4,Rx
BNZ
20,Loop
BNZ
20,Loop
BNZ
20,Loop
BNZ
20,Loop
BNZ
20,Loop
BNZ
20,Loop
22
alu0
alu1
ld/st
LD F2,0(Rx)
LD F4,0(Ry)
3
4
ADDI Rx,Rx,#8
ADDI Ry,Ry,#8
SUB R20,R4,Rx
DIVD F8,F2,F0
ADDD F4,F0,F4
7
8
SD F4,0(Ry)
Clock cycle 9
...
18
19
MULTD F2,F8,F2
20
21
22
23
BNZ R20,Loop
24
25
ADDD F10,F8,F2
Branch shadow
25 clock cycles total
alu1
ld/st
LD F2,0(Rx)
Clock cycle 2
LD F4,0(Ry)
3
4
ADDI Rx,Rx,#8
ADDI Ry,Ry,#8
SUB R20,R4,Rx
DIVD F8,F2,F0
ADDD F4,F0,F4
7
8
SD F4,0(Ry)
9
...
18
19
MULTD F2,F8,F2
20
21
22
23
BNZ R20,Loop
24
25
ADDD F10,F8,F2
Branch shadow
25 clock cycles total
Chapter 3 Solutions
23
1.
2.
3.
Full bypassing: critical path is LD -> Div -> MULT -> ADDD. Bypassing
would save 1 cycle from latency of each, so 4 cycles total
4.
alu1
ld/st
LD F2,0(Rx)
Clock cycle 2
LD F2,0(Rx)
LD F4,0(Ry)
3
4
ADDI Rx,Rx,#8
ADDI Ry,Ry,#8
SUB R20,R4,Rx
DIVD F8,F2,F0
DIVD F8,F2,F0
ADDD F4,F0,F4
SD F4,0(Ry)
...
18
19
MULTD F2,F8,F2
20
MULTD F2,F8,F2
21
22
23
24
25
ADDD F10,F8,F2
26
ADDD F10,F8,F2
BNZ R20,Loop
Branch shadow
26 clock cycles total
Figure S.17 Number of clock cycles required to do two loops worth of work. Critical
path is LD -> DIVD -> MULTD -> ADDD. If RS schedules 2nd loops critical LD in cycle 2, then
loop 2s critical dependency chain will be the same length as loop 1s is. Since were not
functional-unit-limited for this code, only one extra clock cycle is needed.
24
Exercises
3.13
Unscheduled code
Scheduled code
DADDIU
R4,R1,#800
DADDIU
R4,R1,#800
L.D
F2,0(R1)
L.D
F2,0(R1)
stall
L.D
F6,0(R2)
MUL.D
F4,F2,F0
MUL.D
F4,F2,F0
L.D
F6,0(R2)
DADDIU
R1,R1,#8
stall
DADDIU
R2,R2,#8
stall
DSLTU
R3,R1,R4
stall
stall
stall
stall
ADD.D
F6,F4,F6
stall
stall
stall
stall
10
stall
11
S.D
F6,0(R2)
12
DADDIU
R1,R1,#8
13
DADDIU
R2,R2,#8
14
DSLTU
R3,R1,R4
15
stall
16
BNEZ
17
stall
ADD.D
F6,F4,F6
BNEZ
R3,foo
S.D
F6,-8(R2)
R3,foo
Figure S.18 The execution time per element for the unscheduled code is 16 clock
cycles and for the scheduled code is 10 clock cycles. This is 60% faster, so the clock
must be 60% faster for the unscheduled code to match the performance of the scheduled code on the original hardware.
Scheduled code
DADDIU
R4,R1,#800
L.D
F2,0(R1)
L.D
F6,0(R2)
MUL.D
F4,F2,F0
Figure S.19 The code must be unrolled three times to eliminate stalls after
scheduling.
Chapter 3 Solutions
L.D
F2,8(R1)
L.D
F10,8(R2)
MUL.D
F8,F2,F0
L.D
F2,8(R1)
L.D
F14,8(R2)
10
MUL.D
F12,F2,F0
11
ADD.D
F6,F4,F6
12
DADDIU
R1,R1,#24
13
ADD.D
F10,F8,F10
14
DADDIU
R2,R2,#24
15
DSLTU
R3,R1,R4
16
ADD.D
F14,F12,F14
17
S.D
F6,-24(R2)
18
S.D
F10,-16(R2)
19
BNEZ
R3,foo
20
S.D
F14,-8(R2)
25
Memory
reference 2
L.D F1,0(R1)
L.D F2,8(R1)
L.D F3,16(R1)
L.D F4,24(R1)
L.D F5,32(R1)
L.D F6,40(R1)
MUL.D F1,F1,F0
MUL.D F2,F2,F0
L.D F7,0(R2)
L.D F8,8(R2)
MUL.D F3,F3,F0
MUL.D F4,F4,F0
L.D F9,16(R2)
L.D F10,24(R2)
MUL.D F5,F5,F0
MUL.D F6,F6,F0
L.D F11,32(R2)
L.D F12,40(R2)
Cycle
FP operation 1
FP operation 2
Integer operation/branch
DADDIU
R1,R1,48
DADDIU
R2,R2,48
Figure S.20 15 cycles for 34 operations, yielding 2.67 issues per clock, with a VLIW efficiency of 34 operations
for 75 slots = 45.3%. This schedule requires 12 floating-point registers.
26
ADD.D F7,F7,F1
ADD.D F8,F8,F2
10
ADD.D F9,F9,F3
ADD.D F10,F10,F4
11
ADD.D F11,F11,F5
ADD.D F12,F12,F6
DSLTU
R3,R1,R4
12
13
S.D F7,-48(R2)
S.D F8,-40(R2)
14
S.D F9,-32(R2)
S.D F10,-24(R2)
15
S.D F11,-16(R2)
S.D F12,-8(R2)
BNEZ R3,foo
Unrolled 10 times:
Cycle
Memory
reference 1
Memory
reference 2
FP operation 1
FP operation 2
L.D F1,0(R1)
L.D F2,8(R1)
L.D F3,16(R1)
L.D F4,24(R1)
L.D F5,32(R1)
L.D F6,40(R1)
MUL.D F1,F1,F0
MUL.D F2,F2,F0
L.D F7,48(R1)
L.D F8,56(R1)
MUL.D F3,F3,F0
MUL.D F4,F4,F0
L.D F9,64(R1)
L.D F10,72(R1)
MUL.D F5,F5,F0
MUL.D F6,F6,F0
L.D F11,0(R2)
L.D F12,8(R2)
MUL.D F7,F7,F0
MUL.D F8,F8,F0
L.D F13,16(R2)
L.D F14,24(R2)
MUL.D F9,F9,F0
MUL.D F10,F10,F0
L.D F15,32(R2)
L.D F16,40(R2)
L.D F17,48(R2)
L.D F18,56(R2)
ADD.D F11,F11,F1
ADD.D F12,F12,F2
10
L.D F19,64(R2)
L.D F20,72(R2)
ADD.D F13,F13,F3
ADD.D F14,F14,F4
11
ADD.D F15,F15,F5
ADD.D F16,F16,F6
12
ADD.D F17,F17,F7
ADD.D F18,F18,F8
13
14
15
16
17
Integer
operation/branch
DADDIU R1,R1,48
DADDIU R2,R2,48
DSLTU
R3,R1,R4
ADD.D F20,F20,F10
BNEZ R3,foo
Figure S.21 17 cycles for 54 operations, yielding 3.18 issues per clock, with a VLIW efficiency of 54 operations for
85 slots = 63.5%. This schedule requires 20 floating-point registers.
Chapter 3 Solutions
3.14
Iteration
Instruction
Issues at
Executes/
Memory
L.D F2,0(R1)
First issue
MUL.D F4,F2,F0
19
Wait for F2
Mult rs [34]
Mult use [518]
L.D F6,0(R2)
Ldbuf [4]
ADD.D F6,F4,F6
20
30
Wait for F4
Add rs [520]
Add use [2129]
S.D F6,0(R2)
31
Wait for F6
Stbuf1 [631]
DADDIU R1,R1,#8
DADDIU R2,R2,#8
DSLTU R3,R1,R4
10
BNEZ R3,foo
11
L.D F2,0(R1)
10
12
13
MUL.D F4,F2,F0
11
14
19
34
Wait for F2
Mult busy
Mult rs [1219]
Mult use [2033]
L.D F6,0(R2)
12
13
14
Ldbuf [13]
ADD.D F6,F4,F6
13
35
45
Wait for F4
Add rs [1435]
Add use [3644]
S.D F6,0(R2)
14
46
DADDIU R1,R1,#8
15
16
17
DADDIU R2,R2,#8
16
17
18
DSLTU R3,R1,R4
17
18
20
BNEZ R3,foo
18
20
L.D F2,0(R1)
19
21
22
MUL.D F4,F2,F0
20
23
34
49
Wait for F2
Mult busy
Mult rs [2134]
Mult use [3548]
L.D F6,0(R2)
21
22
23
Ldbuf [22]
ADD.D F6,F4,F6
22
50
60
Wait for F4
Add rs [2349]
Add use [5159]
Wait for R3
Wait for F6
Stbuf [1546]
Wait for R3
27
28
S.D F6,0(R2)
23
55
Wait for F6
Stbuf [2455]
DADDIU R1,R1,#8
24
25
26
DADDIU R2,R2,#8
25
26
27
DSLTU R3,R1,R4
26
27
28
BNEZ R3,foo
27
29
Wait for R3
Iteration
Instruction
Issues at
Executes/
Memory
Write CDB at
Comment
L.D F2,0(R1)
MUL.D F4,F2,F0
19
Wait for F2
Mult rs [24]
Mult use [5]
L.D F6,0(R2)
Ldbuf [3]
ADD.D F6,F4,F6
20
30
Wait for F4
Add rs [320]
Add use [21]
S.D F6,0(R2)
31
DADDIU R1,R1,#8
DADDIU R2,R2,#8
DSLTU R3,R1,R4
BNEZ R3,foo
L.D F2,0(R1)
MUL.D F4,F2,F0
10
25
Wait for F2
Mult rs [710]
Mult use [11]
L.D F6,0(R2)
10
INT busy
INT rs [89]
ADD.D F6,F4,F6
26
36
Wait for F4
Add RS [826]
Add use [27]
S.D F6,0(R2)
37
DADDIU R1,R1,#8
10
Wait for F6
Stbuf [431]
INT busy
INT rs [56]
INT busy
INT rs [67]
Wait for F6
11
INT busy
INT rs [810]
Chapter 3 Solutions
DADDIU R2,R2,#8
11
12
INT busy
INT rs [1011]
DSLTU R3,R1,R4
12
13
INT busy
INT rs [1012]
BNEZ R3,foo
10
14
L.D F2,0(R1)
11
15
16
MUL.D F4,F2,F0
11
17
32
Wait for F2
Mult rs [1217]
Mult use [17]
L.D F6,0(R2)
12
16
17
INT busy
INT rs [1316]
ADD.D F6,F4,F6
12
33
43
Wait for F4
Add rs [1333]
Add use [33]
S.D F6,0(R2)
14
44
Wait for F6
INT rs full in 15
DADDIU R1,R1,#8
15
17
DADDIU R2,R2,#8
16
18
DSLTU R3,R1,R4
20
21
INT rs full
BNEZ R3,foo
21
22
INT rs full
29
Wait for R3
3.15
Issues at
Executes/Memory
Write CDB at
ADD.D F2,F4,F6
12
ADD R1,R1,R2
ADD R1,R1,R2
ADD R1,R1,R2
ADD R1,R1,R2
10
ADD R1,R1,R2
11
12 (CDB conflict)
30
3.16
Branch PC
mod 4
Entry
Prediction
Outcome
Mispredict?
no
none
NT
NT
no
change to NT
NT
NT
no
none
NT
NT
no
none
NT
yes
no
none
NT
yes
change to NT
no
none
NT
yes
Table Update
Figure S.25 Individual branch outcomes, in order of execution. Misprediction rate = 3/9 = .33.
Local Predictor
Branch PC mod 2
Entry
Prediction
Outcome
Mispredict?
Table Update
no
change to T
NT
yes
NT
NT
no
none
NT
yes
NT
yes
change to NT
no
none
NT
NT
no
none
no
none
no
change to T
Figure S.26 Individual branch outcomes, in order of execution. Misprediction rate = 3/9 = .33.
Chapter 3 Solutions
3.17
31
For this problem we are given the base CPI without branch stalls. From this we can
compute the number of stalls given by no BTB and with the BTB: CPInoBTB and
CPIBTB and the resulting speedup given by the BTB:
CPInoBTB CPI base + Stalls base
Speedup = ------------------------- = --------------------------------------------------CPI BTB
CPIbase + Stalls BTB
StallsnoBTB = 15% 2 = 0.30
BTB Result
BTB
Prediction
Miss
Penalty
(Cycles)
Hit
Correct
Hit
Incorrect
Therefore:
Stalls BTB = ( 1.5% 3 ) + ( 12.1% 0 ) + ( 1.3% 4 ) = 1.2
1.0 + 0.30
Speedup = --------------------------- = 1.2
1.0 + 0.097
3.18
32
Chapter 4 Solutions
Case Study: Implementing a Vector Kernel on a Vector
Processor and GPU
4.1
$r1,#0
$f0,0($RtipL)
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
l.s
mul.s
mul.s
mul.s
mul.s
add.s
add.s
add.s
mul.s
mul.s
mul.s
mul.s
add.s
add.s
add.s
mul.s
st.s
add
$f1,0($RclL)
$f2,4($RtipL)
$f3,4($RclL)
$f4,8($RtipL)
$f5,8($RclL)
$f6,12($RtipL)
$f7,12($RclL)
$f8,0($RtipR)
$f9,0($RclR)
$f10,4($RtipR)
$f11,4($RclR)
$f12,8($RtipR)
$f13,8($RclR)
$f14,12($RtipR)
$f15,12($RclR)
$f16,$f0,$f1
$f17,$f2,$f3
$f18,$f4,$f5
$f19,$f6,$f7
$f20,$f16,$f17
$f20,$f20,$f18
$f20,$f20,$f19
$f16,$f8,$f9
$f17,$f10,$f11
$f18,$f12,$f13
$f19,$f14,$f15
$f21,$f16,$f17
$f21,$f21,$f18
$f21,$f21,$f19
$f20,$f20,$f21
$f20,0($RclP)
$RclP,$RclP,#4
add
$RtiPL,$RtiPL,#16
# initialize k
# load all values for first
expression
# accumulate
# accumulate
# final multiply
# store result
# increment clP for next
expression
# increment tiPL for next
expression
Chapter 4 Solutions
add
$RtiPR,$RtiPR,#16
addi
and
$r1,$r1,#1
$r2,$r2,#3
bneq
add
$r2,skip
$RclL,$RclL,#16
add
$RclR,$RclR,#16
skip: blt
$r1,$r3,loop
33
$r1,#0
$VL,#4
$v0,0($RclL)
$v1,0($RclR)
$v2,0($RtipL)
$v3,16($RtipL)
$v4,32($RtipL)
$v5,48($RtipL)
$v6,0($RtipR)
$v7,16($RtipR)
$v8,32($RtipR)
$v9,48($RtipR)
$v2,$v2,$v0
mulvv.s
mulvv.s
mulvv.s
mulvv.s
$v3,$v3,$v0
$v4,$v4,$v0
$v5,$v5,$v0
$v6,$v6,$v1
mulvv.s
mulvv.s
mulvv.s
sumr.s
sumr.s
sumr.s
sumr.s
sumr.s
$v7,$v7,$v1
$v8,$v8,$v1
$v9,$v9,$v1
$f0,$v2
$f1,$v3
$f2,$v4
$f3,$v5
$f4,$v6
sumr.s
sumr.s
sumr.s
mul.s
$f5,$v7
$f6,$v8
$f7,$v9
$f0,$f0,$f4
# initialize k
# initialize vector length
# multiply left
sub-expressions
# multiply right
sub-expression
# reduce right
sub-expressions
34
mul.s
mul.s
mul.s
s.s
s.s
s.s
s.s
add
$f1,$f1,$f5
$f2,$f2,$f6
$f3,$f3,$f7
$f0,0($Rclp)
$f1,4($Rclp)
$f2,8($Rclp)
$f3,12($Rclp)
$RtiPL,$RtiPL,#64
add
$RtiPR,$RtiPR,#64
add
$RclP,$RclP,#16
add
$RclL,$RclL,#16
add
$RclR,$RclR,#16
addi
blt
$r1,$r1,#1
$r1,$r3,loop
# store results
# increment
expression
# increment
expression
# increment
expression
# increment
expression
# increment
expression
# assume r3 = seq_length
4.2
MIPS: loop is 41 instructions, will iterate 500 4 = 2000 times, so roughly 82000
instructions
VMIPS: loop is also 41 instructions but will iterate only 500 times, so roughly
20500 instructions
4.3
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
lv
lv
lv
lv
lv
lv
lv
lv
lv
lv
sumr.s
sumr.s
sumr.s
sumr.s
sumr.s
sumr.s
sumr.s
sumr.s
mulvv.s
mulvv.s
mulvv.s
mulvv.s
mulvv.s
mulvv.s
mulvv.s
mulvv.s
#
#
#
#
#
#
#
#
#
#
clL
clR
tiPL
tiPL
tiPL
tiPL
tiPR
tiPR
tiPR
tiPR
0
1
2
3
0
1
2
3
18 chimes, 4 results, 15 FLOPS per result, 18/15 = 1.2 cycles per FLOP
Chapter 4 Solutions
35
4.4
In this case, the 16 values could be loaded into each vector register, performing vector multiplies from four iterations of the loop in single vector multiply instructions.
This could reduce the iteration count of the loop by a factor of 4. However, without
a way to perform reductions on a subset of vector elements, this technique cannot
be applied to this code.
4.5
4.6
clP[threadIdx.x*4 + blockIdx.x+12*500*4]
clP[threadIdx.x*4+1 + blockIdx.x+12*500*4]
clP[threadIdx.x*4+2+ blockIdx.x+12*500*4]
clP[threadIdx.x*4+3 + blockIdx.x+12*500*4]
clL[threadIdx.x*4+i+ blockIdx.x*2*500*4]
clR[threadIdx.x*4+i+ (blockIdx.x*2+1)*500*4]
36
tipL[threadIdx.x*16+AA + blockIdx.x*2*500*16]
tipL[threadIdx.x*16+AC + blockIdx.x*2*500*16]
tipL[threadIdx.x*16+TT + blockIdx.x*2*500*16]
tipR[threadIdx.x*16+AA + (blockIdx.x*2+1)*500*16]
tipR[threadIdx.x*16+AC +1 + (blockIdx.x*2+1)*500*16]
#
#
#
#
#
#
add.u64
mul.u64
mul.u64
add.u64
ld.param.u64
add.u64
%r2, %ctaid.x,1
%r2, %r2, 4000
%r3, %tid.x, 4
%r2, %r2, %r3
%r3, [clR]
%r2,%r2,%r3
#
#
#
#
#
#
#
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
ld.global.f32
st.shared.f32
%f1, [%r1+0]
[clL_s+0], %f1
%f1, [%r2+0]
[clR_s+0], %f1
%f1, [%r1+4]
[clL_s+4], %f1
%f1, [%r2+4]
[clR_s+4], %f1
%f1, [%r1+8]
[clL_s+8], %f1
%f1, [%r2+8]
[clR_s+8], %f1
%f1, [%r1+12]
[clL_s+12], %f1
%f1, [%r2+12]
[clR_s+12], %f1
mul.u64
mul.u64
add.u64
Chapter 4 Solutions
ld.param.u64
add.u64
%r2, [tipL]
%r1,%r2,%r2
add.u64
mul.u64
mul.u64
add.u64
ld.param.u64
add.u64
%r2, %ctaid.x,1
%r2, %r2, 16000
%r3, %tid.x, 64
%r2, %r2, %r3
%r3, [tipR]
%r2,%r2,%r3
#
#
#
#
#
#
mul.u64
mul.u64
add.u64
ld.param.u64
add.u64
#
#
#
#
#
#
ld.global.f32
ld.global.f32
ld.global.f32
%f1,[%r1]
%f2,[%r1+4]
# load tiPL[0]
# load tiPL[1]
%f16,[%r1+60]
# load tiPL[15]
ld.global.f32
ld.global.f32
ld.global.f32
%f17,[%r2]
%f18,[%r2+4]
# load tiPR[0]
# load tiPR[1]
%f32,[%r1+60]
# load tiPR[15]
ld.shared.f32
ld.shared.f32
ld.shared.f32
ld.shared.f32
ld.shared.f32
ld.shared.f32
ld.shared.f32
ld.shared.f32
%f33,[clL_s]
%f34,[clL_s+4]
%f35,[clL_s+8]
%f36,[clL_s+12]
%f37,[clR_s]
%f38,[clR_s+4]
%f39,[clR_s+8]
%f40,[clR_s+12]
# load clL
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
st.global.f32
%f1,%f1,%f33
%f2,%f2,%f34
%f3,%f3,%f35
%f4,%f4,%f36
%f1,%f1,%f2
%f1,%f1,%f3
%f1,%f1,%f4
%f17,%f17,%f37
%f18,%f18,%f38
%f19,%f19,%f39
%f20,%f20,%f40
%f17,%f17,%f18
%f17,%f17,%f19
%f17,%f17,%f20
[%r3],%f17
# first expression
# load clR
# store result
37
38
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
st.global.f32
%f5,%f5,%f33
%f6,%f6,%f34
%f7,%f7,%f35
%f8,%f8,%f36
%f5,%f5,%f6
%f5,%f5,%f7
%f5,%f5,%f8
%f21,%f21,%f37
%f22,%f22,%f38
%f23,%f23,%f39
%f24,%f24,%f40
%f21,%f21,%f22
%f21,%f21,%f23
%f21,%f21,%f24
[%r3+4],%f21
# second expression
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
st.global.f32
%f9,%f9,%f33
%f10,%f10,%f34
%f11,%11,%f35
%f12,%f12,%f36
%f9,%f9,%f10
%f9,%f9,%f11
%f9,%f9,%f12
%f25,%f25,%f37
%f26,%f26,%f38
%f27,%f27,%f39
%f28,%f28,%f40
%f25,%f26,%f22
%f25,%f27,%f23
%f25,%f28,%f24
[%r3+8],%f25
# third expression
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
mul.f32
mul.f32
mul.f32
mul.f32
add.f32
add.f32
add.f32
st.global.f32
%f13,%f13,%f33
%f14,%f14,%f34
%f15,%f15,%f35
%f16,%f16,%f36
%f13,%f14,%f6
%f13,%f15,%f7
%f13,%f16,%f8
%f29,%f29,%f37
%f30,%f30,%f38
%f31,%f31,%f39
%f32,%f32,%f40
%f29,%f29,%f30
%f29,%f29,%f31
%f29,%f29,%f32
[%r3+12],%f29
# fourth expression
# store result
# store result
# store result
Chapter 4 Solutions
4.8
39
It will perform well, since there are no branch divergences, all memory references
are coalesced, and there are 500 threads spread across 6 blocks (3000 total threads),
which provides many instructions to hide memory latency.
Exercises
4.9
a. This code reads four floats and writes two floats for every six FLOPs, so
arithmetic intensity = 6/6 = 1.
b. Assume MVL = 64:
li
li
loop: lv
lv
mulvv.s
lv
$VL,44
$r1,0
$v1,a_re+$r1
$v3,b_re+$r1
$v5,$v1,$v3
$v2,a_im+$r1
lv
mulvv.s
subvv.s
sv
mulvv.s
mulvv.s
addvv.s
sv
bne
addi
$v4,b_im+$r1
$v6,$v2,$v4
$v5,$v5,$v6
$v5,c_re+$r1
$v5,$v1,$v4
$v6,$v2,$v3
$v5,$v5,$v6
$v5,c_im+$r1
$r1,0,else
$r1,$r1,#44
#
#
#
#
#
#
skip: blt
# load b_im
# a+im*b_im
# a+re*b_re - a+im*b_im
# store c_re
# a+re*b_im
# a+im*b_re
# a+re*b_im + a+im*b_re
# store c_im
# check if first iteration
# first iteration,
increment by 44
# guaranteed next iteration
$r1,$r1,#256 # not first iteration,
increment by 256
$r1,1200,loop # next iteration?
1.
mulvv.s
lv
2.
3.
4.
5.
6.
lv
subvv.s
mulvv.s
mulvv.s
addvv.s
mulvv.s
sv
lv
lv
sv
j loop
else: addi
c.
#
#
#
#
#
#
#
6 chimes
40
mulvv.s
mulvv.s
subvv.s
mulvv.s
mulvv.s
addvv.s
sv
lv
sv
lv
lv
lv
#
#
#
#
#
#
a_re*b_re
a_im*b_im
subtract and store c_re
a_re*b_im
a_im*b_re, load next a_re
add, store c_im, load next b_re,a_im,b_im
Same cycles per result as in part c. Adding additional load/store units did not
improve performance.
4.10
Assuming that vector computation can be overlapped with memory access, total
time = 410 ms.
The hybrid system requires:
Even if host I/O can be overlapped with GPU execution, the GPU will require
430 ms and therefore will achieve lower performance than the host.
4.11
$VL,4
$v0(0),$v0(4)
$v0(8),$v0(12)
$v0(16),$v0(20)
$v0(24),$v0(28)
$v0(32),$v0(36)
$v0(40),$v0(44)
$v0(48),$v0(52)
$v0(56),$v0(60)
Chapter 4 Solutions
41
a. Reads 40 bytes and writes 4 bytes for every 8 FLOPs, thus 8/44 FLOPs/byte.
b. This code performs indirect references through the Ca and Cb arrays, as they
are indexed using the contents of the IDx array, which can only be performed
at runtime. While this complicates SIMD implementation, it is still possible
to perform type of indexing using gather-type load instructions. The innermost loop (iterates on z) can be vectorized: the values for Ex, dH1, dH2, Ca,
and Cb could be operated on as SIMD registers or vectors. Thus this code is
amenable to SIMD and vector execution.
c. Having an arithmetic intensity of 0.18, if the processor has a peak floatingpoint throughout > (30 GB/s) (0.18 FLOPs/byte) = 5.4 GFLOPs/s, then this
code is likely to be memory-bound, unless the working set fits well within the
processors cache.
d. The single precision arithmetic intensity corresponding to the edge of the roof
is 85/4 = 21.25 FLOPs/byte.
4.13
4.14
a. Using the GCD test, a dependency exists if GCD (2,4) must divide 5 4. In
this case, a loop-carried dependency does exist.
b. Output dependencies
S1 and S3 cause through A[i]
Anti-dependencies
S4 and S3 cause an anti-dependency through C[i]
Re-written code
for (i=0;i<100;i++) {
T[i] = A[i] * B[i]; /* S1 */
B[i] = T[i] + c; /* S2 */
A1[i] = C[i] * c; /* S3 */
C1[i] = D[i] * A1[i]; /* S4 */}
Copyright 2012 Elsevier, Inc. All rights reserved.
42
True dependencies
S4 and S3 through A[i]
S2 and S1 through T[i]
c. There is an anti-dependence between iteration i and i+1 for array B. This can
be avoided by renaming the B array in S2.
4.15
#include
#include
#include
#include
4.16
This GPU has a peak throughput of 1.5 16 16 = 384 GFLOPS/s of singleprecision throughput. However, assuming each single precision operation requires
four-byte two operands and outputs one four-byte result, sustaining this throughput
(assuming no temporal locality) would require 12 bytes/FLOP 384 GFLOPs/s =
4.6 TB/s of memory bandwidth. As such, this throughput is not sustainable, but can
still be achieved in short bursts when using on-chip memory.
4.17
<stdio.h>
<stdlib.h>
<sys/time.h>
<cuda.h>
*
*
*
*
blockDim.y + threadIdx.y;
blockDim.x + threadIdx.x;
blockDim.y;
blockDim.x;
state = d_board[(row)*cols+(col)];
for (i=0;i<iterations;i++) {
neighbors=0;
if (row!=0) {
if (col!=0) if (d_board[(row-1)*cols+(col-1)]==1) neighbors++;
if (d_board[(row-1)*cols+(col)]==1) neighbors++;
if (col!=(cols-1)) if (d_board[(row-1)*cols+(col+1)]==1) neighbors++;
}
if (col!=0) if (d_board[(row)*cols+(col-1)]==1) neighbors++;
Chapter 4 Solutions
43
44
Chapter 5 Solutions
Case Study 1: Single-Chip Multicore Multiprocessor
5.1
f.
5.3
Chapter 5 Solutions
45
Shared
C
P
Pl U
on ac wri
bu e in te
s va
lid
a
k
ac
y
oc
bl
or
is
m
th
k me
r
oc t
fo
bl or
e
s ab
at
hi
lid
r t k;
va
fo loc
In
s
b
is
m ck
a
e
rit teb
ri
W
W
Writeback block;
abort memory access
te
Invalid
CPU write
Read miss
Modified
ss
ce
Owned
CPU write
Place invalidate on bus
CPU write hit
CPU read hit
5.4
rit
R
e
W ad
ab rite mi
ac or ba ss
ce t m ck
ss e b
m lo
or c
y k;
k
oc
bl
s
hi
rt
fo
e
at
us
es
lid
ar n b
va
sh o
in
o ss
or
, n mi
s
is
ad d
m
re ea
e
r
PU ce
a
Pl
Read miss
Modified
Shared
C
P PU
on lac wr
bu e in ite
s va
lid
at
e
Invalid
CPU write
Place write miss on bus
46
Excl.
5.6
a. p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E
p0: write 100 40, MSI: send invalidate, MESI: silent transition from E to M
MSI: 100 + 15 = 115 stall cycles
MESI: 100 + 0 = 100 stall cycles
b. p0: read 120, Read miss, satisfied in memory, sharers both to S
p0: write 120 60, Both send invalidates
Both: 100 + 15 = 115 stall cycles
c. p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E
p0: read 120, Read miss, memory, silently replace 120 from S or E
Both: 100 + 100 = 200 stall cycles, silent replacement from E
Copyright 2012 Elsevier, Inc. All rights reserved.
Chapter 5 Solutions
47
d. p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E
p1: write 100 60, Write miss, satisfied in memory regardless of protocol
Both: 100 + 100 = 200 stall cycles, dont supply data in E state (some
protocols do)
e. p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E
p0: write 100 60, MSI: send invalidate, MESI: silent transition from E to M
p1: write 100 40, Write miss, P0s cache, writeback data to memory
MSI: 100 + 15 + 40 + 10 = 165 stall cycles
MESI: 100 + 0 + 40 + 10 = 150 stall cycles
5.7
a. Assume the processors acquire the lock in order. P0 will acquire it first, incurring 100 stall cycles to retrieve the block from memory. P1 and P3 will stall
until P0s critical section ends (ping-ponging the block back and forth) 1000
cycles later. P0 will stall for (about) 40 cycles while it fetches the block to
invalidate it; then P1 takes 40 cycles to acquire it. P1s critical section is 1000
cycles, plus 40 to handle the write miss at release. Finally, P3 grabs the block
for a final 40 cycles of stall. So, P0 stalls for 100 cycles to acquire, 10 to give
it to P1, 40 to release the lock, and a final 10 to hand it off to P1, for a total of
160 stall cycles. P1 essentially stalls until P0 releases the lock, which will be
100 + 1000 + 10 + 40 = 1150 cycles, plus 40 to get the lock, 10 to give it to
P3, 40 to get it back to release the lock, and a final 10 to hand it back to P3.
This is a total of 1250 stall cycles. P3 stalls until P1 hands it off the released
lock, which will be 1150 + 40 + 10 + 1000 + 40 = 2240 cycles. Finally, P3
gets the lock 40 cycles later, so it stalls a total of 2280 cycles.
b. The optimized spin lock will have many fewer stall cycles than the regular
spin lock because it spends most of the critical section sitting in a spin loop
(which while useless, is not defined as a stall cycle). Using the analysis below
for the interconnect transactions, the stall cycles will be 3 read memory misses
(300), 1 upgrade (15) and 1 write miss to a cache (40 + 10) and 1 write miss to
memory (100), 1 read cache miss to cache (40 + 10), 1 write miss to memory
(100), 1 read miss to cache and 1 read miss to memory (40 + 10 + 100),
followed by an upgrade (15) and a write miss to cache (40 + 10), and finally a
write miss to cache (40 + 10) followed by a read miss to cache (40 + 10) and
an upgrade (15). So approximately 945 cycles total.
c. Approximately 31 interconnect transactions. The first processor to win arbitration for the interconnect gets the block on its first try (1); the other two
ping-pong the block back and forth during the critical section. Because the
latency is 40 cycles, this will occur about 25 times (25). The first processor
does a write to release the lock, causing another bus transaction (1), and the
second processor does a transaction to perform its test and set (1). The last
processor gets the block (1) and spins on it until the second processor releases
it (1). Finally the last processor grabs the block (1).
48
Chapter 5 Solutions
49
5.10
b. P0,0: write 108 88, Write upgrade received by P0,0; invalidate received
by P3,1
c. P0,0: write 118 90, Write miss received by P0,0; invalidate received by P1,0
d. P1,0: write 128 98, Write miss received by P1,0.
5.11
C
S PU
m end wr
es i ite
sa nv
ge ali
da
te
Modified
ba
ck
CPU write
F
W etch
rit
e
da
t
Shared
e
e
at k
at
lid ac
lid
va b
va
in ta
In
h da
tc
Fe rite
W
Fetch invalidate
Write data back
Invalid
Read miss
Send data
Owned
CPU write
Send invalidate message
CPU write hit
CPU read hit
5.12
The Exclusive state (E) combines properties of Modified (M) and Shared (S).
The E state allows silent upgrades to M, allowing the processor to write the block
without communicating this fact to memory. It also allows silent downgrades to I,
allowing the processor to discard its copy with notifying memory. The memory
must have a way of inferring either of these transitions. In a directory-based system,
this is typically done by having the directory assume that the node is in state M and
forwarding all misses to that node. If a node has silently downgraded to I, then it
sends a NACK (Negative Acknowledgment) back to the directory, which then
infers that the downgrade occurred. However, this results in a race with other messages, which can cause other problems.
Read miss
Data value reply,
Sharers = {P}
Write miss
Data value reply
Sharers = {P}
Invalid
Shared
Read miss
Data value reply
Sharers = sharers + {P}
Read miss
Fetch; Data value reply
Sharers = sharers + {P}
Modified
Write miss
Fetch invalidate
Data value response
Sharers = {P}
W
Se rite
m n m
D es d in iss
Sh ata sag va
ar va e lid
er lu to at
s = e sh e
{P rep ar
} ly ers
50
Owned
Read miss
Fetch
Data value response
Sharers = sharers + {P}
Write miss
Fetch invalidate
Data value response
Sharers = {P}
Read hit
Miss, will replace modified data (B0) and get new line
in shared state
P0,0: M MIA I ISD S Dir: DM {P0,0} DI {}
c. P0,0: write 120 80 Miss will replace modified data (B0) and get new line
in modified state
P0,0: M MIA I IMAD IMA M
P3,1: S I
Dir: DS {P3,0} DM {P0,0}
d, e, f: steps similar to parts a, b, and c
5.14
Miss, will replace modified data (B0) and get new line
in shared state
P0,0: M MIA I ISD S
Chapter 5 Solutions
51
Miss, will replace modified data (B0) and get new line
in shared state
P1,0: M MIA I ISD S
Dir: DS {P3,0} DS {P3,0; P0,0} DS {P3,0;
P0,0; P1,0}
Miss, will replace modified data (B0) and get new line
in shared state
P0,0: M MIA I ISD S
P1,0: write 120 80 Miss will replace modified data (B0) and get new line
in modified state
P1,0: M MIA I IMAD IMA M
P3,1: S I
Dir: DS {P3,1} DS {P3,0; P1,0} DM {P1,0}
c, d, e: steps similar to parts a and b
5.15
All protocols must ensure forward progress, even under worst-case memory access
patterns. It is crucial that the protocol implementation guarantee (at least with a
probabilistic argument) that a processor will be able to perform at least one memory operation each time it completes a cache miss. Otherwise, starvation might
result. Consider the simple spin lock code:
tas:
DADDUI R2, R0, #1
lockit:
EXCH R2, 0(R1)
BNEZ R2, lockit
If all processors are spinning on the same loop, they will all repeatedly issue
GetM messages. If a processor is not guaranteed to be able to perform at least one
instruction, then each could steal the block from the other repeatedly. In the worst
case, no processor could ever successfully perform the exchange.
52
5.17
Forwarded_
GetS
Forwarded_
GetM
PutM_
Ack
Data
Last ACK
send
Ack/I
error
error
error
error
error
send
Ack/I
error
error
error
error
error
send
GetM/OM
send
PutM/OI
error
send Data
send Data/I
error
do Read
do Write
send
PutM/MI
error
send Data/O
send Data/I
error
error
error
IS
send
Ack/ISI
error
error
error
save Data,
do Read/S
error
ISI
send Ack
error
error
error
save Data,
do Read/I
error
IM
send Ack
IMO
IMIA
error
save Data
do Write/M
IMI
error
error
error
error
save Data
do Write,
send Data/I
IMO
send
Ack/IMI
IMOI
error
save Data
do Write,
send Data/O
IMOI
error
error
error
error
save Data
do Write,
send Data/I
State
Read
Write
send
GetS/IS
send
GetM/IM
error
do Read
send
GetM/IM
do Read
INV
OI
error
send Data
send Data
/I
error
error
MI
error
send Data
send Data
/I
error
error
OM
error
send Data
send Data/IM
error
save Data
do Write/M
State
Read
Write
Replacement
(owner)
INV
(nonowner)
DI
send Data,
add to sharers/DS
error
send PutM_Ack
DS
send Data,
add to sharers
error
send PutM_Ack
DO
forward GetS,
add to sharers
send PutM_Ack
DM
forward GetS,
add to requester and
owner to sharers/DO
send PutM_Ack
Chapter 5 Solutions
5.18
53
Exercises
5.19
f ( i,p )
i = 1 -----------i
p
Substituting this value for Execution timenew into the speedup equation makes
Amdahls Law a function of the available processors, p.
5.20
54
To keep the figures from becoming cluttered, the coherence protocol is split into
two parts as was done in Figure 5.6 in the text. Figure S.34 presents the
CPU portion of the coherence protocol, and Figure S.35 presents the bus portion
of the protocol. In both of these figures, the arcs indicate transitions and the text
along each arc indicates the stimulus (in normal text) and bus action (in bold text)
that occurs during the transition between states. Finally, like the text, we assume a
write hit is handled as a write miss.
Figure S.34 presents the behavior of state transitions caused by the CPU itself. In
this case, a write to a block in either the invalid or shared state causes us to broadcast a write invalidate to flush the block from any other caches that hold the
block and move to the exclusive state. We can leave the exclusive state through
either an invalidate from another processor (which occurs on the bus side of the
coherence protocol state diagram), or a read miss generated by the CPU (which
occurs when an exclusive block of data is displaced from the cache by a second
block). In the shared state only a write by the CPU or an invalidate from another
processor can move us out of this state. In the case of transitions caused by events
external to the CPU, the state diagram is fairly simple, as shown in Figure S.35.
When another processor writes a block that is resident in our cache, we unconditionally invalidate the corresponding block in our cache. This ensures that the
next time we read the data, we will load the updated value of the block from
memory. Also, whenever the bus sees a read miss, it must change the state of an
exclusive block to shared as the block is no longer exclusive to a single cache.
The major change introduced in moving from a write-back to write-through
cache is the elimination of the need to access dirty blocks in another processors
caches. As a result, in the write-through protocol it is no longer necessary to provide the hardware to force write back on read accesses or to abort pending memory accesses. As memory is updated during any write on a write-through cache, a
processor that generates a read miss will always retrieve the correct information
from memory. Basically, it is not possible for valid cache blocks to be incoherent
with respect to main memory in a system with write-through caches.
Chapter 5 Solutions
CPU read
Invalid
55
Shared
(read only)
CPU
read
CPU write
Invalidate block
CPU read
miss
CPU write
Invalidate block
Exclusive
(read/write)
CPU read
hit or write
Figure S.34 CPU portion of the simple cache coherency protocol for write-through
caches.
Invalid
Write miss
Invalidate block
Shared
(read only)
Write miss
Invalidate block
Exclusive
(read/write)
Read miss
Figure S.35 Bus portion of the simple cache coherency protocol for write-through
caches.
56
5.22
To augment the snooping protocol of Figure 5.7 with a Clean Exclusive state we
assume that the cache can distinguish a read miss that will allocate a block destined
to have the Clean Exclusive state from a read miss that will deliver a Shared block.
Without further discussion we assume that there is some mechanism to do so.
The three states of Figure 5.7 and the transitions between them are unchanged,
with the possible clarifying exception of renaming the Exclusive (read/write)
state to Dirty Exclusive (read/write).
The new Clean Exclusive (read only) state should be added to the diagram along
with the following transitions.
from Clean Exclusive to Clean Exclusive in the event of a CPU read hit on
this block or a CPU read miss on a Dirty Exclusive block
from Clean Exclusive to Shared in the event of a CPU read miss on a Shared
block or on a Clean Exclusive block
from Clean Exclusive to Shared in the event of a read miss on the bus for this
block
from Clean Exclusive to Invalid in the event of a write miss on the bus for this
block
from Clean Exclusive to Dirty Exclusive in the event of a CPU write hit on
this block or a CPU write miss
from Dirty Exclusive to Clean Exclusive in the event of a CPU read miss on a
Dirty Exclusive block
from Invalid to Clean Exclusive in the event of a CPU read miss on a Dirty
Exclusive block
from Shared to Clean Exclusive in the event of a CPU read miss on a Dirty
Exclusive block
Several transitions from the original protocol must change to accommodate the
existence of the Clean Exclusive state. The following three transitions are those
that change.
5.23
from Dirty Exclusive to Shared, the label changes to CPU read miss on a
Shared block
from Invalid to Shared, the label changes to CPU miss on a Shared block
from Shared to Shared, the miss transition label changes to CPU read miss on
a Shared block
An obvious complication introduced by providing a valid bit per word is the need
to match not only the tag of the block but also the offset within the block when
snooping the bus. This is easy, involving just looking at a few more bits. In addition, however, the cache must be changed to support write-back of partial cache
blocks. When writing back a block, only those words that are valid should be written to memory because the contents of invalid words are not necessarily coherent
Chapter 5 Solutions
57
with the system. Finally, given that the state machine of Figure 5.7 is applied at
each cache block, there must be a way to allow this diagram to apply when state
can be different from word to word within a block. The easiest way to do this would
be to provide the state information of the figure for each word in the block. Doing
so would require much more than one valid bit per word, though. Without replication of state information the only solution is to change the coherence protocol
slightly.
5.24
5.25
Because false sharing occurs when both the data object size is smaller than the
granularity of cache block valid bit(s) coverage and more than one data object is
stored in the same cache block frame in memory, there are two ways to prevent
false sharing. Changing the cache block size or the amount of the cache block covered by a given valid bit are hardware changes and outside the scope of this exercise. However, the allocation of memory locations to data objects is a software
issue.
The goal is to locate data objects so that only one truly shared object occurs per
cache block frame in memory and that no non-shared objects are located in the
same cache block frame as any shared object. If this is done, then even with just a
single valid bit per cache block, false sharing is impossible. Note that shared,
read-only-access objects could be combined in a single cache block and not contribute to the false sharing problem because such a cache block can be held by
many caches and accessed as needed without an invalidations to cause unnecessary cache misses.
Copyright 2012 Elsevier, Inc. All rights reserved.
58
To the extent that shared data objects are explicitly identified in the program
source code, then the compiler should, with knowledge of memory hierarchy
details, be able to avoid placing more than one such object in a cache block frame
in memory. If shared objects are not declared, then programmer directives may
need to be added to the program. The remainder of the cache block frame should
not contain data that would cause false sharing misses. The sure solution is to pad
with block with non-referenced locations.
Padding a cache block frame containing a shared data object with unused memory locations may lead to rather inefficient use of memory space. A cache block
may contain a shared object plus objects that are read-only as a trade-off between
memory use efficiency and incurring some false-sharing misses. This optimization almost certainly requires programmer analysis to determine if it would be
worthwhile. Generally, careful attention to data distribution with respect to cache
lines and partitioning the computation across processors is needed.
5.26
The problem illustrates the complexity of cache coherence protocols. In this case,
this could mean that the processor P1 evicted that cache block from its cache and
immediately requested the block in subsequent instructions. Given that the writeback message is longer than the request message, with networks that allow out-oforder requests, the new request can arrive before the write back arrives at the directory. One solution to this problem would be to have the directory wait for the write
back and then respond to the request. Alternatively, the directory can send out a
negative acknowledgment (NACK). Note that these solutions need to be thought
out very carefully since they have potential to lead to deadlocks based on the particular implementation details of the system. Formal methods are often used to check
for races and deadlocks.
5.27
If replacement hints are used, then the CPU replacing a block would send a hint to
the home directory of the replaced block. Such hint would lead the home directory
to remove the CPU from the sharing list for the block. That would save an invalidate message when the block is to be written by some other CPU. Note that while
the replacement hint might reduce the total protocol latency incurred when writing
a block, it does not reduce the protocol traffic (hints consume as much bandwidth
as invalidates).
5.28
a. Considering first the storage requirements for nodes that are caches under the
directory subtree:
The directory at any level will have to allocate entries for all the cache blocks
cached under that directorys subtree. In the worst case (all the CPUs under
the subtree are not sharing any blocks), the directory will have to store as
many entries as the number of blocks of all the caches covered in the subtree.
That means that the root directory might have to allocate enough entries to
reference all the blocks of all the caches. Every memory block cached in a
directory will represented by an entry <block address, k-bit vector>, the k-bit
vector will have a bit specifying all the subtrees that have a copy of the block.
For example, for a binary tree an entry <m, 11> means that block m is cached
under both branches of the tree. To be more precise, one bit per subtree would
Chapter 5 Solutions
Root (Level 0)
Directory
0
Level 1
Directory
59
k-1
Level 1
Directory
Level L-1
Directory
CPU0
CPU1
CPUk-1
60
5.29
Test and set code using load linked and store conditional.
MOV R3, #1
LL R2, 0(R1)
SC R3, 0(R1)
Typically this code would be put in a loop that spins until a 1 is returned in R3.
5.30
Assume a cache line that has a synchronization variable and the data guarded by
that synchronization variable in the same cache line. Assume a two processor system with one processor performing multiple writes on the data and the other processor spinning on the synchronization variable. With an invalidate protocol, false
sharing will mean that every access to the cache line ends up being a miss resulting
in significant performance penalties.
5.31
The monitor has to be place at a point through which all memory accesses pass.
One suitable place will be in the memory controller at some point where accesses
from the 4 cores converge (since the accesses are uncached anyways). The monitor
will use some sort of a cache where the tag of each valid entry is the address
accessed by some load-linked instruction. In the data field of the entry, the core
number that produced the load-linked access -whose address is stored in the tag
field- is stored.
This is how the monitor reacts to the different memory accesses.
Checks the cache, if there is any entry with whose address matches the
read address even if there is a partial address match (for example, read
[0:7] and read [4:11] overlap match in addresses [4:7]), the matching cache
entry is invalidated and a new entry is created for the new read (recording
the core number that it belongs to). If there is no matching entry in the
cache, then a new entry is created (if there is space in the cache). In either
case the read progresses to memory and returns data to originating core.
Checks the cache, if there is any entry with whose address matches the
write address even if there is a partial address match (for example, read
[0:7] and write [4:11] overlap match in addresses [4:7]), the matching
cache entry is invalidated. The write progresses to memory and writes data
to the intended address.
Checks the cache, if there is any entry with whose address matches the
write address even if there is a partial address match (for example, read
[0:7] and write [4:11] overlap match in addresses [4:7]), the core number
in the cache entry is compared to the core that originated the write.
Chapter 5 Solutions
61
If the core numbers are the same, then the matching cache entry is invalidated, the write proceeds to memory and returns a success signal to the
originating core. In that case, we expect the address match to be perfect
not partial- as we expect that the same core will not issue load-linked/store
conditional instruction pairs that have overlapping address ranges.
If the core numbers differ, then the matching cache entry is invalidated, the
write is aborted and returns a failure signal to the originating core. This
case signifies that synchronization variable was corrupted by another core
or by some regular store operation.
5.32
5.33
Inclusion states that each higher level of cache contains all the values present in the
lower cache levels, i.e., if a block is in L1 then it is also in L2. The problem states
that L2 has equal or higher associativity than L1, both use LRU, and both have the
same block size.
When a miss is serviced from memory, the block is placed into all the caches, i.e.,
it is placed in L1 and L2. Also, a hit in L1 is recorded in L2 in terms of updating
LRU information. Another key property of LRU is the following. Let A and B
both be sets whose elements are ordered by their latest use. If A is a subset of B
such that they share their most recently used elements, then the LRU element of
B must either be the LRU element of A or not be an element of A.
This simply states that the LRU ordering is the same regardless if there are 10
entries or 100. Let us assume that we have a block, D, that is in L1, but not in L2.
Since D initially had to be resident in L2, it must have been evicted. At the time
of eviction D must have been the least recently used block. Since an L2 eviction
took place, the processor must have requested a block not resident in L1 and
obviously not in L2. The new block from memory was placed in L2 (causing the
eviction) and placed in L1 causing yet another eviction. L1 would have picked
the least recently used block to evict.
Since we know that D is in L1, it must be the LRU entry since it was the LRU
entry in L2 by the argument made in the prior paragraph. This means that L1
would have had to pick D to evict. This results in D not being in L1 which results
in a contradiction from what we assumed. If an element is in L1 it has to be in L2
(inclusion) given the problems assumptions about the cache.
62
5.34
Analytical models can be used to derive high-level insight on the behavior of the
system in a very short time. Typically, the biggest challenge is in determining the
values of the parameters. In addition, while the results from an analytical model can
give a good approximation of the relative trends to expect, there may be significant
errors in the absolute predictions.
Trace-driven simulations typically have better accuracy than analytical models,
but need greater time to produce results. The advantages are that this approach
can be fairly accurate when focusing on specific components of the system (e.g.,
cache system, memory system, etc.). However, this method does not model the
impact of aggressive processors (mispredicted path) and may not model the
actual order of accesses with reordering. Traces can also be very large, often taking gigabytes of storage, and determining sufficient trace length for trustworthy
results is important. It is also hard to generate representative traces from one class
of machines that will be valid for all the classes of simulated machines. It is also
harder to model synchronization on these systems without abstracting the synchronization in the traces to their high-level primitives.
Execution-driven simulation models all the system components in detail and is
consequently the most accurate of the three approaches. However, its speed of
simulation is much slower than that of the other models. In some cases, the extra
detail may not be necessary for the particular design parameter of interest.
5.35