CS3350B Computer Architecture
Winter 2015

Lecture 7.1: Multicore: Basics and Key Issues

Marc Moreno Maza

www.csd.uwo.ca/Courses/CS3350b

[Adapted from lectures on Computer Organization and Design,
Patterson & Hennessy, 4th or 5th edition, 2011]
Why Do We Need Multiprocessors?

• Uniprocessor performance
  • 25% annual improvement rate from 1978 to 1986
  • 52% annual improvement rate from 1986 to 2002
    - Profound impact of RISC, x86
  • 20% annual improvement rate from 2002 to present
    - Power wall: solutions for higher ILP are power-inefficient
    - ILP wall: hard to exploit more ILP
    - Memory wall: ever-increasing memory latency relative to processor speed
Beyond Implicit Parallelism

• Consider "atxpy":

double a, x[SIZE], y[SIZE], z[SIZE];

void atxpy(void) {
    int i;
    for (i = 0; i < SIZE; i++)
        z[i] = a*x[i] + y[i];
}

• Lots of instruction-level parallelism (ILP)
  • Great!
  • But how much can we really exploit? 4-issue? 8-issue?
    - Limits to (efficient) superscalar execution

• But, if SIZE is 10,000 the loop has 10,000-way parallelism!

• How do we exploit it? One option is sketched below.
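One common way to exploit this loop-level parallelism is to divide the
iterations among the cores. A minimal sketch using OpenMP (not from the
original slides; SIZE and the arrays are assumed to be declared as above):

#include <omp.h>   /* compile with -fopenmp */

#define SIZE 10000
double a, x[SIZE], y[SIZE], z[SIZE];

void atxpy(void) {
    /* The 10,000 iterations are independent (no two touch the same
       z[i]), so OpenMP can split them across the available cores. */
    #pragma omp parallel for
    for (int i = 0; i < SIZE; i++)
        z[i] = a*x[i] + y[i];
}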
Where are We Now?

• Multiprocessor – a computer system with at least two processors

[Figure: several processors, each with its own cache, connected through
an interconnection network to shared memory and I/O]

• Can deliver high throughput for independent jobs via job-level
  parallelism or process-level parallelism

• And improve the run time of a single program that has been specially
  crafted to run on a multiprocessor – a parallel processing program
Multicores Now Common

• The power challenge has forced a change in the design of microprocessors
  • Since 2002 the rate of improvement in the response time of programs
    has slowed from a factor of 1.5 per year to less than a factor of
    1.2 per year

• Today's microprocessors typically contain more than one core – chip
  multicore microprocessors (CMPs) – in a single IC
  • The number of cores is expected to double every two years

Product         AMD Barcelona   Intel Nehalem   IBM Power 6   Sun Niagara 2
Cores per chip  4               4               2             8
Clock rate      2.5 GHz         ~2.5 GHz?       4.7 GHz       1.4 GHz
Power           120 W           ~100 W?         ~100 W?       94 W
Transition to Multicore

[Figure: sequential application performance over time]
Other Multiprocessor Basics

• Some of the problems that need higher performance can be handled
  simply by using a cluster – a set of independent servers (or PCs)
  connected over a local area network (LAN) functioning as a single
  large multiprocessor
  • Search engines, Web servers, email servers, databases, …

• A key challenge is to craft parallel (concurrent) programs that have
  high performance on multiprocessors as the number of processors
  increases – i.e., that scale
  • Scheduling, load balancing, time for synchronization, overhead for
    communication
Multicore vs Multi-processor

[Figure: a multicore processor with a shared L2 cache, contrasted with a
multi-processor system whose cores share an L2 cache]
Multicore Organization Alternatives

[Figure: four multicore organizations]
(a) ARM11 MPCore
(b) AMD Opteron
(c) Intel Core Duo
(d) Intel Core i7
Itanium 2 Dual Core

[Figure: Itanium 2 dual core]
High Level Multicore Architectural View
Intel® Core™ Microarchitecture – Memory Sub-system

[Figure: Intel Core 2 Duo and Intel Core 2 Quad processors. In each core,
A = architectural state and E = execution engine & interrupt;
C = 2nd-level cache; B = bus interface connecting to main memory & I/O;
cache lines are 64 bytes. The dual core has a shared L2 cache; the quad
core has both shared and separate L2 caches.]
Intel Core i7 Block Diagram

• November 2008
• Four x86 SMT processors
• Dedicated L2, shared L3 cache
• Speculative pre-fetch for caches
• On-chip DDR3 memory controller
  • Three 8-byte channels (192 bits) giving 32 GB/s
  • No front side bus

• QuickPath Interconnect
  • Cache-coherent point-to-point link
  • High-speed communication between processor chips
  • 6.4G transfers per second, 16 bits per transfer
  • Dedicated bi-directional pairs
  • Total bandwidth 25.6 GB/s
    (6.4 GT/s × 2 bytes per transfer × 2 directions)
Simultaneous Multithreading (SMT)

• A variation on multithreading that uses the resources of a
  multiple-issue, dynamically scheduled (superscalar) processor to
  exploit both program ILP and thread-level parallelism (TLP)
  • Most superscalar processors have more machine-level parallelism than
    most programs can effectively use (i.e., than have ILP)
  • With register renaming and dynamic scheduling, multiple instructions
    from independent threads can be issued without regard to
    dependencies among them
    - Need separate rename tables for each thread, or need to be able to
      indicate which thread an entry belongs to
    - Need the capability to commit from multiple threads in one cycle

• Intel's Pentium 4 implementation of SMT is called hyperthreading
  • Supports just two threads (doubles the architectural state)
SMT

[Figure: SMT illustration]
Cache Coherence

• What is the coherence problem?
  • A core writes to a location in its L1 cache
  • Other L1 caches may hold shared copies – these will be immediately
    out of date

• The core may either
  • Write through to the L2 cache and/or memory, or
  • Copy back only when the cache line is evicted
  • In either case, because each core may have its own copy, it is not
    sufficient just to update L2 and/or memory
Cache Coherence in Multicores

• In multicore processors it is likely that the cores will share a
  common physical address space, causing a cache coherence problem

[Figure: Core 1 and Core 2 above a unified (shared) L2 cache. Both read X
(initially 0) into their L1 D$. Core 1 then writes 1 to X in its own L1;
when Core 2 reads X again it still sees the stale 0.]
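The figure's scenario can be written as a two-thread C program; a
hypothetical sketch (the thread structure is illustrative, not from the
slides). The coherence protocols described next exist precisely so that
the reading core cannot keep returning the stale 0 from its own L1 copy:

#include <pthread.h>
#include <stdio.h>

int X = 0;                       /* shared location, cached by both cores */

void *core1(void *arg) {
    (void)arg;
    X = 1;                       /* write lands in Core 1's L1 copy */
    return NULL;
}

void *core2(void *arg) {
    (void)arg;
    /* Without coherence, this could print the stale 0 still held
       in Core 2's L1 cache even after Core 1's write. */
    printf("X = %d\n", X);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, core1, NULL);
    pthread_create(&t2, NULL, core2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}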
A Coherent Memory System

• Any read of a data item should return the most recently written value
  of the data item
  • Coherence – defines what values can be returned by a read
    - Writes to the same location are serialized (two writes to the same
      location must be seen in the same order by all cores)
  • Consistency – determines when a written value will be returned by a
    read

• To enforce coherence, caches (hardware) must provide
  • Replication of shared data items in multiple cores' caches
    - Replication reduces both latency and contention for a read-shared
      data item
  • Migration of shared data items to a core's local cache
    - Migration reduces the latency of accessing the data and the
      bandwidth demand on the shared memory (L2 in our example)
Snooping Protocols

• Schemes where every core knows which other core has a copy of its
  cached data are far too complex.

• So each core (cache system) 'snoops' (i.e., watches continually) for
  activity concerned with data addresses which it has cached.

• This has normally been implemented with a bus structure which is
  'global', i.e., all communication can be seen by all.

• Snooping protocols can be implemented without a bus, but for
  simplicity the next slides assume a shared bus.

• There are 'directory-based' coherence schemes, but we will not
  consider them this year.
Snooping Protocols

Write Invalidate
1. A core wanting to write to an address grabs a bus cycle and sends a
   'write invalidate' message which contains the address
2. All snooping caches invalidate their copy of the appropriate cache
   line
3. The core writes to its cached copy (assume for now that it also
   writes through to memory)
4. Any shared read in other cores will now miss in cache and re-fetch
   the new data
Snooping Protocols

Write Update
1. A core wanting to write grabs a bus cycle and broadcasts the address
   & new data as it updates its own copy
2. All snooping caches update their copy

• Note that in both schemes, the problem of simultaneous writes is taken
  care of by bus arbitration – only one core can use the bus at any one
  time.
Update or Invalidate?

• Update looks the simplest, most obvious and fastest, but:
  • Multiple writes to the same word (with no intervening read) need
    only one invalidate message but would require an update for each
  • Writes to the same block in a (usual) multi-word cache block require
    only one invalidate but would require multiple updates.
Update or Invalidate?

• Due to both spatial and temporal locality, the previous cases occur
  often.

• Bus bandwidth is a precious commodity in shared-memory multicore
  chips.

• Experience has shown that invalidate protocols use significantly less
  bandwidth.

• We will consider implementation details only of the invalidate
  protocols. All commercial machines use write-invalidate as their
  standard coherence protocol.
Implementation Issues

• In both schemes, knowing that a cached value is not shared (no copies
  in another cache) can avoid sending any messages.

• The invalidate description assumed that a cache value update was
  written through to memory. If we used a 'copy back' scheme (usual for
  high performance), other cores could re-fetch an incorrect old value
  on a cache miss.

• We need a protocol to handle all this.
MESI Protocol (1)

A practical multicore invalidate protocol which attempts to minimize bus
usage.

• Allows usage of a 'copy back' scheme – i.e., L2/main memory is not
  updated until a 'dirty' cache line is displaced

• Extension of the usual cache tags, i.e., the invalid tag and 'dirty'
  tag of a normal copy-back cache.

• To make the description simpler, we will ignore the L2 cache and treat
  L2/main memory as a single main memory unit
MESI Protocol (2)

Any cache line can be in one of 4 states (2 bits):

• Modified – The cache line has been modified and is different from main
  memory – this is the only cached copy. (cf. 'dirty')

• Exclusive – The cache line is the same as main memory and is the only
  cached copy

• Shared – Same value as main memory, but copies may exist in other
  caches.

• Invalid – Line data is not valid (as in a simple cache)
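The four states fit in two tag bits per cache line; a small sketch in C
(the field names and the 64-byte line size are illustrative assumptions):

/* Two state bits per line replace the usual valid/dirty bits. */
enum mesi_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct cache_line {
    unsigned long   tag;        /* address tag                 */
    enum mesi_state state;      /* one of the four MESI states */
    unsigned char   data[64];   /* assumed 64-byte cache line  */
};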
MESI Protocol (3)

• Cache line state changes are a function of memory access events.

• Events may be either
  • Due to local core activity (i.e., cache access)
  • Due to bus activity – as a result of snooping

• Each cache line has its own state, affected only if the address
  matches
MESI Protocol (4)

• Operation can be described informally by looking at actions in a local
  core:
  • Read Hit
  • Read Miss
  • Write Hit
  • Write Miss

• More formally, by a state transition diagram
MESI Local Read Hit

• The line must be in one of M, E, S

• This must be the correct local value (if M, it must have been modified
  locally)

• Simply return the value

• No state change
MESI Local Read Miss

• A core makes a read request to main memory upon a read miss; the
  detailed action depends on copies in other cores

• Case 1: One cache has an E copy
  • The snooping cache puts the copy value on the bus
  • The memory access is abandoned
  • The local core caches the value
  • Both lines are set to S

• Case 2: No other copy in caches
  • The core waits for a memory response
  • The value is stored in the cache and marked E
MESI Local Read Miss

• Case 3: Several caches have a copy (S)
  • One snooping cache puts the copy value on the bus (arbitrated)
  • The memory access is abandoned
  • The local core caches the value and sets the tag to S
  • Other copies remain S
MESI Local Read Miss

• Case 4: One cache has an M (modified) copy
  • The snooping cache puts its copy of the value on the bus
  • The memory access is abandoned
  • The local core caches the value and sets the tag to S
  • The source (M) value is copied back to memory
  • The source copy changes its tag from M to S
MESI Local Write Hit

The line must be in one of M, E, S:

• M
  • The line is exclusive and already 'dirty'
  • Update the local cache value
  • No state change

• E
  • Update the local cache value
  • Change E to M

• S
  • The core broadcasts an invalidate on the bus
  • Snooping cores with an S copy change S to I
  • The local cache value is updated
  • The local state changes from S to M
MESI Local Write Miss

Detailed action depends on copies in other cores.

• Case 1: No other copies
  • Local copy state set to M

• Case 2: Other copies, either one in state E or more in state S
  • Value read from memory to local cache – bus transaction marked RWITM
    (read with intent to modify)
  • The snooping cores see this and set their tags to I
  • The local copy is updated and sets the tag to M
MESI Local Write Miss

• Case 3: Another copy in state M
  • The core issues a bus transaction marked RWITM
  • The snooping core sees this, then
    - Blocks the RWITM request
    - Takes control of the bus
    - Writes back its copy to memory
    - Sets its copy state to I
  • The original local core re-issues the RWITM request
  • This is now simply a no-copy case
    - Value read from memory to local cache
    - Local copy value updated
    - Local copy state set to M
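The local-core side of the preceding slides can be condensed into one
transition function; a simplified sketch in C (the bus helpers are stubs
standing in for hardware, and the snooped-side transitions are omitted):

enum mesi_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Stubs for bus activity; in hardware the cache controller does this. */
static int  bus_other_copies_exist(unsigned long a) { (void)a; return 0; }
static void bus_read(unsigned long a)       { (void)a; } /* plain read */
static void bus_rwitm(unsigned long a)      { (void)a; } /* read with intent to modify */
static void bus_invalidate(unsigned long a) { (void)a; } /* invalidate-only broadcast */

enum mesi_state local_read(enum mesi_state s, unsigned long addr) {
    if (s != INVALID)                      /* read hit in M, E or S: no change */
        return s;
    bus_read(addr);                        /* read miss */
    return bus_other_copies_exist(addr)    /* cases 1, 3, 4: copy elsewhere */
           ? SHARED
           : EXCLUSIVE;                    /* case 2: no other copy */
}

enum mesi_state local_write(enum mesi_state s, unsigned long addr) {
    switch (s) {
    case MODIFIED:                         /* write hit, line already dirty  */
        return MODIFIED;
    case EXCLUSIVE:                        /* write hit, silently goes dirty */
        return MODIFIED;
    case SHARED:                           /* write hit, other copies go to I */
        bus_invalidate(addr);
        return MODIFIED;
    case INVALID:                          /* write miss: snoopers set I;
                                              an M copy writes back first   */
        bus_rwitm(addr);
        return MODIFIED;
    }
    return s;
}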
Comments on MESI Protocol

• Relies on a global view of all memory activity – usually implies a
  global bus
  • The bus is a limited shared resource
  • As the number of cores increases
    - Demands on bus bandwidth increase – more total memory activity
    - The bus gets slower due to increased capacitive load

• The general consensus is that bus-based systems cannot be extended
  beyond a small number (8 or 16?) of cores
Example of Snooping Invalidation

[Figure: Core 1 and Core 2 above a unified (shared) L2 cache. Both read X
(initially 0). Core 1 writes 1 to X, invalidating Core 2's copy (state I).
Core 2 then reads X again and misses.]

• When the second miss by Core 2 occurs, Core 1 responds with the value,
  canceling the response from the L2 cache (and also updating the L2
  copy)
Another Example for Write Invalidate Snooping

[Figure: processors P1, P2, and P3 with private caches above memory and
I/O devices. P1 and P3 read u (value 5) from memory; P3 then writes
u = 7 in its cache; P1 and P2 subsequently read u again]

• Must invalidate before step 3

• All recent MPUs use write invalidate
  (invalidate protocols use significantly less bandwidth)
Block Size Effects on Multicores

• Writes to one word in a multi-word block mean that the full block is
  invalidated

• Multi-word blocks can also result in false sharing: when two cores are
  writing to two different variables that happen to fall in the same
  cache block
  • With write-invalidate, false sharing increases cache miss rates

[Figure: Core 1 writes variable A and Core 2 writes variable B; both fall
in the same 4-word cache block]
False Sharing

• A performance issue in programs where cores may write to different
  memory addresses BUT in the same cache lines
• Known as ping-ponging – the cache line is shipped between cores

[Figure: timeline of Core 0 and Core 1 alternately writing X[0] and X[1],
which live in the same cache line; each write invalidates the other
core's copy, bouncing the line between them]

• False sharing is not an issue in a shared cache; it is an issue in
  separate caches.
Coherency Misses

1. True sharing misses arise from the communication of data through the
   cache coherence mechanism
   • Invalidates due to the 1st write to a shared block
   • Reads by another CPU of a modified block in a different cache
   • Miss would still occur if the block size were 1 word

2. False sharing misses arise when a block is invalidated because some
   word in the block, other than the one being read, is written into
   • The invalidation does not cause a new value to be communicated, but
     only causes an extra cache miss
   • The block is shared, but no word in the block is actually shared
     → the miss would not occur if the block size were 1 word
MP Performance for Processor, Commercial Workload:
OLTP, Decision Support (Database), Search Engine

[Figure: (memory) cycles per instruction vs. L3 cache size]

• Uniprocessor cache misses improve with cache size increase
  (instruction, capacity/conflict, compulsory)

• True sharing and false sharing are unchanged going from 1 MB to 8 MB
  (L3 cache)
MP Performance, 2 MB Cache, Commercial Workload:
OLTP, Decision Support (Database), Search Engine

[Figure: (memory) cycles per instruction vs. processor count]

• True sharing and false sharing misses increase going from 1 to 8 CPUs
Avoiding False Sharing

Change either:

• Algorithm
  • Adjust the implementation of the algorithm (the loop stride) to
    access data in a different cache line for each thread

Or

• Data structure
  • Add some "padding" to a data structure or array (just enough
    padding, generally less than the cache line size) so that threads
    access data from different cache lines – see the sketch below.
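A hedged sketch of the data-structure fix in C, assuming a 64-byte cache
line (the real line size should be queried on the target machine;
alignas requires C11):

#include <stdalign.h>

#define CACHE_LINE 64           /* assumed line size */

/* Alignment/padding gives each thread's counter its own cache line,
   so one thread's writes no longer invalidate the line that holds
   another thread's counter. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

struct padded_counter counters[8];   /* one per thread: no false sharing */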
Summary

• Multicores are the common multiprocessors nowadays
  • Offer computing resources to improve throughput by process- and
    thread-level parallelism in addition to ILP
  • Dedicated on-chip caches and shared lower-level caches

• Key issues
  • Cache coherence
    - Popular protocol: snooping invalidation
  • False sharing
Example: True v. False Sharing v. Hit?

Assume x1 and x2 are in the same cache block.
P1 and P2 have both read x1 and x2 before.

Time  P1        P2        True, False, Hit? Why?
1     Write x1            True miss; invalidate x1 in P2
2               Read x2   False miss; x1 irrelevant to P2
3     Write x1            False miss; x1 irrelevant to P2
4               Write x2  False miss; x1 irrelevant to P2
5     Read x2             True miss; invalidate x2 in P1
Aside: Interconnection Networks

• Network topologies
  • Arrangements of processors, switches, and links

[Figure: bus, ring, 2D mesh, N-cube (N = 3), and fully connected
topologies]
Aside: Multistage Networks

[Figure: multistage interconnection networks]
Aside: Network Characteristics

• Performance
  • Latency per message (unloaded network)
  • Throughput
    - Link bandwidth
    - Total network bandwidth
    - Bisection bandwidth
  • Congestion delays (depending on traffic)

• Cost
• Power
• Routability in silicon
Next: Exploit TLP on Multicores

• Basic idea: processor resources are expensive and should not be left
  idle
  • Long latency to memory on a cache miss?
  • Hardware switches threads to bring in other useful work while
    waiting for a cache miss
  • The cost of a thread context switch must be much less than the cache
    miss latency
  • Put in redundant hardware so the context does not have to be saved
    on every thread switch:
    - PC, registers, L1 caches?

• Attractive for apps with abundant TLP
  • Commercial multi-user workloads
/* A false-sharing example: f.x and f.y fall in the same cache line,
   so the two threads below falsely share it. */

struct foo {
    int x;
    int y;
};

static struct foo f;

/* The two following functions are run concurrently by two threads: */

int sum_a(void) {
    int s = 0;
    int i;
    for (i = 0; i < 1000000; ++i)
        s += f.x;   /* read-only, yet misses whenever inc_b dirties the line */
    return s;
}

void inc_b(void) {
    int i;
    for (i = 0; i < 1000000; ++i)
        ++f.y;      /* each write invalidates sum_a's cached copy */
}
