L7 Multicore 1
Computer Architecture
Winter 2015
Uniprocessor performance
25% annual improvement rate from 1978 to 1986
52% annual improvement rate from 1986 to 2002
- Profound impact of RISC, x86
20% annual improvement rate from 2002 to present
- Power wall: solutions for higher ILP are power-inefficient
- ILP wall: hard to exploit more ILP
- Memory wall: ever-increasing memory latency relative to processor speed
Beyond Implicit Parallelism
Consider “atxpy”:
#define SIZE 1000000   /* assumed value; the slide leaves SIZE unspecified */

double a, x[SIZE], y[SIZE], z[SIZE];

void atxpy(void) {
    for (int i = 0; i < SIZE; i++)   /* i was undeclared on the slide */
        z[i] = a * x[i] + y[i];      /* z = a*x + y, element by element */
}
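Since every iteration writes a different z[i], the iterations are independent and the loop can be parallelized explicitly. A minimal sketch (mine, not from the slides) using OpenMP; atxpy_parallel is an illustrative name, and the build needs an OpenMP flag such as -fopenmp:

    #define SIZE 1000000
    double a, x[SIZE], y[SIZE], z[SIZE];

    /* The independent iterations are divided among the cores. */
    void atxpy_parallel(void) {
        #pragma omp parallel for
        for (int i = 0; i < SIZE; i++)
            z[i] = a * x[i] + y[i];
    }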
[Figure: shared-memory multiprocessor - cores connected to memory and I/O through an interconnection network]
The goal is to improve the run time of a single program that has been specially crafted to run on a multiprocessor: a parallel processing program
Multicores Now Common
The power challenge has forced a change in the design of microprocessors
Since 2002 the rate of improvement in program response time has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year
[Figure: sequential application performance over time]
Other Multiprocessor Basics
Multicore vs Multi-processor
Multicore Organization Alternatives
Itanium 2 Dual Core
High-Level Multicore Architectural View
Intel® Core™ Microarchitecture – Memory Sub-system
[Figure: block diagrams of the Intel Core 2 Duo and Intel Core 2 Quad processors, built from architectural-state (A), execution (E), cache (C), and bus-interface (B) blocks]
QuickPath Interconnect
Cache-coherent point-to-point link
High-speed communication between processor chips
6.4 gigatransfers per second, 16 bits per transfer
Dedicated bi-directional pairs
Total bandwidth 25.6 GB/s
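A quick check of that figure (my arithmetic, not from the slide): 16 bits is 2 bytes per transfer, so 6.4 GT/s x 2 B = 12.8 GB/s in each direction; with dedicated links in both directions, the total is 2 x 12.8 = 25.6 GB/s.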
Simultaneous Multithreading (SMT)
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP)
Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., than have ILP)
With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
- Need separate rename tables for each thread, or the ability to indicate which thread each entry belongs to
- Need the capability to commit from multiple threads in one cycle
SMT
Cache Coherence
Cache Coherence in Multicores
[Figure: Core 1 and Core 2, each with private L1 I$ and D$, above a unified (shared) L2. Both cores read X (initially 0) into their L1 D$; Core 1 then writes 1 to X; when Core 2 reads X again, it still sees the stale value 0.]
A Coherent Memory System
Snooping Protocols
Write Invalidate
1. A core wanting to write to an address grabs a bus cycle and sends a 'write invalidate' message which contains the address
2. All snooping caches invalidate their copy of the appropriate cache line
3. The core writes to its cached copy (assume for now that it also writes through to memory)
4. Any shared read in other cores will now miss in cache and re-fetch the new data
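A compact sketch of these four steps in C (entirely illustrative: bus_broadcast_invalidate and the flat cache/memory arrays are invented stand-ins, not a real API):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t addr_t;
    typedef uint32_t word_t;

    #define LINES 1024
    static word_t cache[LINES], memory[LINES];

    /* Step 1: grab the bus and broadcast the address to invalidate.
       Step 2 happens in the other caches: each snooping cache marks
       its copy of that line invalid. */
    static void bus_broadcast_invalidate(addr_t a) {
        printf("bus: invalidate 0x%x\n", (unsigned)a);
    }

    void local_write(addr_t a, word_t v) {
        bus_broadcast_invalidate(a);   /* steps 1 and 2 */
        cache[a % LINES] = v;          /* step 3: update the cached copy */
        memory[a % LINES] = v;         /*         ...and write through   */
        /* step 4: the next shared read in another core misses and
           re-fetches the new value from memory */
    }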
Snooping Protocols
Write Update
1. A core wanting to write grabs a bus cycle and broadcasts the address and new data as it updates its own copy
2. All snooping caches update their copy
Update or Invalidate?
Update looks the simplest, most obvious, and fastest, but:
Multiple writes to the same word (with no intervening read) need only one invalidate message, yet would require an update broadcast for each write (e.g., ten back-to-back writes cost one invalidate but ten updates)
Update or Invalidate?
Implementation Issues
MESI Protocol (1)
A practical multi-core invalidate protocol which attempts to minimize bus usage.
MESI Protocol (3)
Cacheline state changes are a function of memory access events.
MESI Protocol (4)
Operation can be described informally by looking at actions in a local core:
Read Hit
Read Miss
Write Hit
Write Miss
MESI Local Read Hit
The line must be in one of M, E, or S
No state change
MESI Local Read Miss
MESI Local Write Hit
M:
- Update the local cache value (no state change)
E:
- Update the local cache value
- Change E to M
S:
- Core broadcasts an invalidate on the bus
- Snooping cores with an S copy change S to I
- The local cache value is updated
- The local state changes from S to M
MESI Local Write Miss
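To tie the four cases together, here is a minimal, self-contained sketch of the MESI transitions in C (my own illustration, not code from the slides; the enum and function names are invented, and bus messages are reduced to printouts):

    #include <stdio.h>

    typedef enum { I, S, E, M } mesi_t;   /* one cache line's state */

    /* Local read: an M/E/S hit leaves the state unchanged; a miss
       fetches the line, in S if another cache holds a copy, else E. */
    mesi_t local_read(mesi_t st, int other_copies) {
        if (st != I) return st;            /* read hit */
        return other_copies ? S : E;       /* read miss */
    }

    /* Local write: M stays M; E upgrades silently to M; S must first
       broadcast an invalidate; I (write miss) fetches the line with
       intent to modify and ends up in M. */
    mesi_t local_write(mesi_t st) {
        if (st == S) puts("bus: invalidate");
        if (st == I) puts("bus: read line for ownership");
        return M;
    }

    /* Snooped remote accesses: a remote write invalidates our copy;
       a remote read of our M line forces a write-back and demotes
       it to S. */
    mesi_t snoop(mesi_t st, int remote_write) {
        if (remote_write) return I;
        if (st == M) puts("bus: write back dirty line");
        return (st == I) ? I : S;
    }

    int main(void) {
        mesi_t st = I;
        st = local_read(st, 0);   /* miss, no other copies: I -> E */
        st = local_write(st);     /* write hit in E: E -> M, no bus */
        st = snoop(st, 1);        /* another core writes: M -> I    */
        printf("final state: %d\n", st);
        return 0;
    }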
Comments on MESI Protocol
Relies on a global view of all memory activity – usually implies a global bus
Bus is a limited shared resource
As the number of cores increases:
- Demands on bus bandwidth increase – more total memory activity
- The bus gets slower due to increased capacitive load
Example of Snooping Invalidation
[Figure: Core 1 and Core 2 each read X (initially 0) into their private L1 D$ above a unified (shared) L2. Core 1's write of 1 to X invalidates Core 2's copy (state I); Core 2's next read of X misses and fetches the new value 1.]
[Figure: three processors P1, P2, P3 with private caches over shared memory and I/O devices. After two processors cache u = 5, a third writes u = 7; the snooping protocol invalidates the stale cached copies, so the later reads of u (u = ?) return 7.]
False Sharing
Performance issue in programs where cores write to different memory addresses BUT in the same cache line
Known as ping-ponging – the cache line is shipped between cores
[Figure: timeline of Core 0 writing X[0] (0, 1, 2) while Core 1 writes X[1] (0, 1); the two elements share one cache line, which bounces between the cores on every write.]
MP Performance, 4-Processor Commercial Workload:
OLTP, Decision Support (Database), Search Engine
[Figure: performance scaling compared with a uniprocessor]
MP Performance, 2MB Cache, Commercial Workload:
OLTP, Decision Support (Database), Search Engine
Avoiding False Sharing
Change either:
Algorithm:
adjust the implementation of the algorithm (e.g., the loop stride) so that each thread accesses data in a different cache line
Or
Data structure:
add some padding to a data structure or array (just enough, generally less than a cache line) so that threads access data from different cache lines – see the sketch below
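A minimal sketch of the padding fix in C (my example: CACHE_LINE, NTHREADS, and padded_counter are invented names, and 64 bytes is an assumed line size):

    #define CACHE_LINE 64   /* assumed cache-line size in bytes */
    #define NTHREADS   4

    /* Give each thread's counter its own cache line so concurrent
       updates by different threads do not ping-pong a shared line. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* fill out the line */
    };

    struct padded_counter counters[NTHREADS]; /* one line per thread */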
Summary
Example: True v. False Sharing v. Hit?
Aside: Interconnection Networks
Network topologies
Arrangements of processors, switches, and links
[Figure: example topologies – Bus, Ring, 2D Mesh, N-cube (N = 3), Fully connected]
Aside: Multistage Networks
Aside: Network Characteristics
Performance
  Latency per message (unloaded network)
  Throughput
  - Link bandwidth
  - Total network bandwidth
  - Bisection bandwidth
  Congestion delays (depending on traffic)
Cost
Power
Routability in silicon
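A worked example of bisection bandwidth (mine, not from the slides): cut the network into two equal halves and add up the bandwidth of the severed links. For a 4x4 2D mesh the cut crosses 4 links; for a bus it crosses just 1; for a fully connected network of 16 nodes it crosses 8 x 8 = 64 links.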
Next: Exploit TLP on Multicores
/* Example for "True v. False Sharing v. Hit?": if sum_a and inc_b
   run on different cores, f.x and f.y share one cache line, so
   every write to f.y invalidates the line the reader of f.x holds. */
struct foo {
    int x;
    int y;
};

struct foo f;   /* declaration missing on the slide */

int sum_a(void) {
    int s = 0;
    int i;
    for (i = 0; i < 1000000; ++i)
        s += f.x;       /* reader: repeatedly loads f.x   */
    return s;
}

void inc_b(void) {
    int i;
    for (i = 0; i < 1000000; ++i)
        ++f.y;          /* writer: repeatedly stores f.y  */
}