Multi Chip Module GPU
Akhil Arunkumar‡ Evgeny Bolotin† Benjamin Cho∓ Ugljesa Milic+ Eiman Ebrahimi†
Figure 3: Basic MCM-GPU architecture comprising four GPU modules (GPMs). [Diagram: four GPMs (SMs + L1$, XBAR, L2$, DRAM partition) connected by inter-GPM links.]

Figure 4: Relative performance sensitivity to inter-GPM link bandwidth for a 4-GPM, 256SM MCM-GPU system. [Chart: slowdown compared to 6TB/s inter-GPM BW for M-Intensive and C-Intensive high-parallelism workloads and for limited-parallelism workloads.]

…and data partitioning between tasks running on such an MCM-GPU that is organized as multiple independent GPUs in a single package.

3.2 MCM-GPU and GPM Architecture

As discussed in Sections 1 and 2, moving forward beyond 128 SM counts will almost certainly require at least two GPMs in a GPU. Since smaller GPMs are significantly more cost-effective [31], in this paper we evaluate building a 256 SM GPU out of four GPMs of 64 SMs each. This way, each GPM is configured very similarly to today's biggest GPUs. Area-wise, each GPM is expected to be 40%-60% smaller than today's biggest GPU, assuming the process node shrinks to 10nm or 7nm. Each GPM consists of multiple SMs along with their private L1 caches. SMs are connected through the GPM-Xbar to a GPM memory subsystem comprising a local memory-side L2 cache and DRAM partition. The GPM-Xbar also provides connectivity to adjacent GPMs via on-package GRS [45] inter-GPM links.

Figure 3 shows the high-level diagram of this 4-GPM MCM-GPU. Such an MCM-GPU is expected to be equipped with 3TB/s of total DRAM bandwidth and 16MB of total L2 cache. All DRAM partitions provide a globally shared memory address space across all GPMs. Addresses are fine-grain interleaved across all physical DRAM partitions for maximum resource utilization. GPM-Xbars route memory accesses to the proper location (either the local or a remote L2 cache bank) based on the physical address. They also collectively provide a modular on-package ring or mesh interconnect network. Such an organization provides spatial traffic locality among local SMs and memory partitions, and reduces on-package bandwidth requirements. Other network topologies are also possible, especially with a growing number of GPMs, but a full exploration of inter-GPM network topologies is outside the scope of this paper. The L2 cache is a memory-side cache, caching data only from its local DRAM partition. As such, there is only one location for each cache line, and no cache coherency is required across the L2 cache banks. In the baseline MCM-GPU architecture we employ a centralized CTA scheduler that schedules CTAs to MCM-GPU SMs globally in a round-robin manner as SMs become available for execution, as in the case of a typical monolithic GPU.

The MCM-GPU memory system is a Non-Uniform Memory Access (NUMA) architecture, as its inter-GPM links are not expected to provide full aggregated DRAM bandwidth to each GPM. Moreover, an additional latency penalty is expected when accessing memory on remote GPMs. This latency includes data movement time within the local GPM to the edge of the die, serialization and deserialization latency over the inter-GPM link, and the wire latency to the next GPM. We estimate each additional inter-GPM hop latency, for a potentially multi-hop path in the on-package interconnect, as 32 cycles. Each additional hop also adds an energy cost compared to a local DRAM access. Even though we expect the MCM-GPU architecture to incur these bandwidth, latency, and energy penalties, we expect them to be much lower compared to off-package interconnects in a multi-GPU system (see Table 2).

3.3 On-Package Bandwidth Considerations

3.3.1 Estimation of On-package Bandwidth Requirements.

We calculate the required inter-GPM bandwidth in a generic MCM-GPU. The basic principle for our analysis is that on-package links need to be sufficiently sized to allow full utilization of expensive DRAM bandwidth resources. Let us consider a 4-GPM system with an aggregate DRAM bandwidth of 4b units (3TB/s in our example), such that b units of bandwidth (768 GB/s in our example) are delivered by the local memory partition directly attached to each GPM. Assuming an L2 cache hit-rate of ∼50% for the average case, 2b units of bandwidth would be supplied from each L2 cache partition. In a statistically uniform address distribution scenario, 2b units of bandwidth out of each memory partition would be equally consumed by all four GPMs. Extending this exercise to capture inter-GPM communication to and from all memory partitions results in the total inter-GPM bandwidth requirement of the MCM-GPU. A link bandwidth of 4b would be necessary to provide 4b total DRAM bandwidth. In our 4-GPM MCM-GPU example with 3TB/s of DRAM bandwidth (4b), link bandwidth settings of less than 3TB/s are expected to result in performance degradation due to NUMA effects. Alternatively, inter-GPM bandwidth settings greater than 3TB/s are not expected to yield any additional performance.
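The sizing exercise of Section 3.3.1 can be reproduced in a few lines. This is a sketch of the paper's back-of-the-envelope reasoning, not simulator code; the function and variable names are ours:

```python
def inter_gpm_bw_analysis(num_gpms=4, total_dram_gbs=3072, l2_hit_rate=0.5):
    """Back-of-the-envelope inter-GPM bandwidth estimate (Section 3.3.1).

    b is the DRAM bandwidth of one local memory partition. With an L2
    hit rate of ~50%, each L2 partition supplies 2b to the SMs, and under
    fine-grain interleaving (num_gpms - 1) / num_gpms of that traffic is
    consumed by remote GPMs. Keeping every DRAM partition fully utilized
    requires a link bandwidth equal to the aggregate DRAM bandwidth
    (4b, i.e. 3TB/s in the paper's example).
    """
    b = total_dram_gbs / num_gpms                       # 768 GB/s per partition
    l2_supply = b / (1.0 - l2_hit_rate)                 # 2b served by each L2
    remote_per_partition = l2_supply * (num_gpms - 1) / num_gpms  # 1.5b remote
    return {
        "b": b,
        "l2_supply_per_partition": l2_supply,
        "remote_per_partition": remote_per_partition,
        "link_bw_needed": num_gpms * b,                 # 4b = full DRAM bandwidth
    }
```

With the defaults, this reproduces the numbers in the text: b = 768 GB/s, 2b = 1536 GB/s supplied per L2 partition, and a required link setting of 3072 GB/s (3TB/s).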
3.3.2 Performance Sensitivity to On-Package Bandwidth.

Figure 4 shows the performance sensitivity of a 256 SM MCM-GPU system as we decrease the inter-GPM bandwidth from an abundant 6TB/s per link all the way down to 384GB/s.
MCM-GPU: Multi-Chip-Module GPUs ISCA ’17, June 24-28, 2017, Toronto, ON, Canada
The applications are grouped into two major categories of high and low parallelism, similar to Figure 2. The scalable high-parallelism category is further subdivided into memory-intensive and compute-intensive applications (for further details about application categories and simulation methodology, see Section 4).

Our simulation results support our analytical estimations above. Increasing link bandwidth to 6TB/s yields diminishing or even no return for the entire suite of applications. As expected, MCM-GPU performance is significantly affected by inter-GPM link bandwidth settings lower than 3TB/s. For example, applications in the memory-intensive category are the most sensitive to link bandwidth, with 12%, 40%, and 57% performance degradation for the 1.5TB/s, 768GB/s, and 384GB/s settings respectively. Compute-intensive applications are also sensitive to lower link bandwidth settings, albeit with lower performance degradations. Surprisingly, even the non-scalable applications with limited parallelism and low memory intensity show performance sensitivity to the inter-GPM link bandwidth, due to increased queuing delays and growing communication latencies in the low-bandwidth scenarios.

3.3.3 On-Package Link Bandwidth Configuration.

NVIDIA's GRS technology can provide signaling rates of up to 20 Gbps per wire. The actual on-package link bandwidth settings for our 256 SM MCM-GPU can vary based on the amount of design effort and cost associated with the actual link design complexity, the choice of packaging technology, and the number of package routing layers. Therefore, based on our estimations, an inter-GPM GRS link bandwidth of 768 GB/s (equal to the local DRAM partition bandwidth) is easily realizable. Larger bandwidth settings such as 1.5 TB/s are possible, albeit harder to achieve, and a 3TB/s link would require further investment and innovations in signaling and packaging technology. Moreover, higher than necessary link bandwidth settings would result in additional silicon cost and power overheads. Even though on-package interconnect is more efficient than its on-board counterpart, it is still substantially less efficient than on-chip wires, and thus we must minimize inter-GPM link bandwidth consumption as much as possible.

In this paper we assume a low-effort, low-cost, and low-energy link design point of 768GB/s and attempt to bridge the performance gap due to this relatively lower bandwidth setting via architectural innovations that improve communication locality and essentially eliminate the need for more costly and less energy-efficient links. The rest of the paper proposes architectural mechanisms to capture data locality within GPM modules, which eliminate the need for costly inter-GPM bandwidth solutions.

4 SIMULATION METHODOLOGY

We use an NVIDIA in-house simulator to conduct our performance studies. We model the GPU to be similar to, but extrapolated in size from, the recently released NVIDIA Pascal GPU [17]. Our SMs are modeled as in-order execution processors that accurately model warp-level parallelism. We model a multi-level cache hierarchy with a private L1 cache per SM and a shared L2 cache. Caches are banked such that they can provide the necessary parallelism to saturate DRAM bandwidth. We model software-based cache coherence in the private caches, similar to state-of-the-art GPUs. Table 3 summarizes the baseline simulation parameters.

Table 3: Baseline MCM-GPU configuration.
  Number of GPMs          4
  Total number of SMs     256
  GPU frequency           1GHz
  Max number of warps     64 per SM
  Warp scheduler          Greedy then Round Robin
  L1 data cache           128 KB per SM, 128B lines, 4 ways
  Total L2 cache          16MB, 128B lines, 16 ways
  Inter-GPM interconnect  768GB/s per link, Ring, 32 cycles/hop
  Total DRAM bandwidth    3 TB/s
  DRAM latency            100ns

We study a diverse set of 48 benchmarks taken from four benchmark suites. Our evaluation includes a set of production-class HPC benchmarks from the CORAL benchmarks [6], graph applications from the Lonestar suite [43], compute applications from Rodinia [24], and a set of NVIDIA in-house CUDA benchmarks. Our application set covers a wide range of GPU application domains including machine learning, deep neural networks, fluid dynamics, medical imaging, graph search, etc. We classify our applications into two categories based on the available parallelism: high-parallelism applications (parallel efficiency >= 25%) and limited-parallelism applications (parallel efficiency < 25%). We further categorize the high-parallelism applications based on whether they are memory-intensive (M-Intensive) or compute-intensive (C-Intensive). We classify an application as memory-intensive if it suffers more than 20% performance degradation when the system memory bandwidth is halved. In the interest of space, we present detailed per-application results for the M-Intensive category workloads and present only the average numbers for the C-Intensive and limited-parallelism workloads. The set of M-Intensive benchmarks and their memory footprints are detailed in Table 4. We simulate all our benchmarks for one billion warp instructions, or to completion, whichever occurs first.

Table 4: The high parallelism, memory intensive workloads and their memory footprints².
  Benchmark                     Abbr.     Memory Footprint (MB)
  Algebraic multigrid solver    AMG       5430
  Neural Network Convolution    NN-Conv   496
  Breadth First Search          BFS       37
  CFD Euler3D                   CFD       25
  Classic Molecular Dynamics    CoMD      385
  Kmeans clustering             Kmeans    216
  Lulesh (size 150)             Lulesh1   1891
  Lulesh (size 190)             Lulesh2   4309
  Lulesh unstructured           Lulesh3   203
  Adaptive Mesh Refinement      MiniAMR   5407
  Mini Contact Solid Mechanics  MnCtct    251
  Minimum Spanning Tree         MST       73
  Nekbone solver (size 18)      Nekbone1  1746
  Nekbone solver (size 12)      Nekbone2  287
  SRAD (v2)                     Srad-v2   96
  Shortest path                 SSSP      37
  Stream Triad                  Stream    3072

² Other evaluated compute-intensive and limited-parallelism workloads are not shown in Table 4.
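The classification thresholds above (25% parallel efficiency, 20% slowdown under halved memory bandwidth) can be expressed directly. This is a minimal sketch of the paper's taxonomy; the function name and input encoding are ours:

```python
def classify_workload(parallel_efficiency, slowdown_at_half_bw):
    """Apply Section 4's categories: limited parallelism below 25%
    parallel efficiency; otherwise M-Intensive when halving memory
    bandwidth costs more than 20% performance, else C-Intensive.
    Both inputs are fractions in [0, 1]."""
    if parallel_efficiency < 0.25:
        return "Lim. Parallel"
    if slowdown_at_half_bw > 0.20:
        return "M-Intensive"
    return "C-Intensive"
```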
Figure 5: MCM-GPU architecture equipped with L1.5 GPM-side cache to capture remote data and effectively reduce inter-GPM bandwidth and data access latency. [Diagram: each GPM adds an L1.5$ between the SMs + L1$ and the XBAR/L2$/DRAM partition.]

5 OPTIMIZED MCM-GPU

We propose three mechanisms to minimize inter-GPM bandwidth by capturing data locality within a GPM. First, we revisit the MCM-GPU cache hierarchy and propose a GPM-side hardware cache. Second, we augment our architecture with distributed CTA scheduling to exploit inter-CTA data locality within the GPM-side cache and in memory. Finally, we propose data partitioning and locality-aware page placement to further reduce on-package bandwidth requirements. The three mechanisms combined significantly improve MCM-GPU performance.

5.1 Revisiting MCM-GPU Cache Architecture

5.1.1 Introducing L1.5 Cache.

The first mechanism we propose to reduce on-package link bandwidth is to enhance the MCM-GPU cache hierarchy. We propose to augment our baseline GPM architecture in Figure 3 with a GPM-side cache that resides between the L1 and L2 caches. We call this new cache level the L1.5 cache, as shown in Figure 5. Architecturally, the L1.5 cache can be viewed as an extension of the L1 cache and is shared by all SMs inside a GPM. We propose that the L1.5 cache store remote data accesses made by a GPM partition. In other words, all local memory accesses bypass the L1.5 cache. Doing so reduces both remote data access latency and inter-GPM bandwidth. Both these properties improve performance and reduce energy consumption by avoiding inter-GPM communication.

To avoid increasing the on-die transistor overhead for the L1.5 cache, we add it by rebalancing the cache capacity between the L2 and L1.5 caches in an iso-transistor manner. We extend the GPU L1 cache coherence mechanism to the GPM-side L1.5 caches as well. This way, whenever an L1 cache is flushed on a synchronization event, such as reaching a kernel execution boundary, the L1.5 cache is flushed as well. Since the L1.5 cache can receive multiple invalidation commands from GPM SMs, we make sure that the L1.5 cache is invalidated only once for each synchronization event.

5.1.2 Design Space Exploration for the L1.5 Cache.

We evaluate MCM-GPU performance for three different L1.5 cache sizes: an 8MB L1.5 cache, an iso-transistor scenario where 8MB of L2 cache capacity is moved to the L1.5 caches; a 16MB L1.5 cache, where almost all of the memory-side L2 cache is moved to the L1.5 caches³; and finally a 32MB L1.5 cache, a non-iso-transistor scenario where, in addition to moving the entire L2 cache capacity to the L1.5 caches, we add an additional 16MB of cache capacity. As the primary objective of the L1.5 cache is to reduce inter-GPM bandwidth consumption, we evaluate different cache allocation policies based on whether accesses are to the local or remote DRAM partitions.

Figure 6 summarizes MCM-GPU performance for the different L1.5 cache sizes. We report the average performance speedups for each category, and focus on the memory-intensive category by showing its individual application speedups. We observe that performance for the memory-intensive applications is sensitive to the L1.5 cache capacity, while applications in the compute-intensive and limited-parallelism categories show very little sensitivity to the various cache configurations. Focusing on the memory-intensive applications, an 8MB iso-transistor L1.5 cache achieves a 4% average performance improvement compared to the baseline MCM-GPU. A 16MB iso-transistor L1.5 cache achieves an 8% performance improvement, and a 32MB L1.5 cache that doubles the transistor budget achieves an 18.3% performance improvement. We choose the 16MB cache capacity for the L1.5 and keep the total cache area constant.

Our simulation results confirm the intuition that the best allocation policy for the L1.5 cache is to cache only remote accesses, and therefore we employ a remote-only allocation policy in this cache. From Figure 6 we can see that such a configuration achieves the highest average performance speedup among the two iso-transistor configurations. It achieves an 11.4% speedup over the baseline for the memory-intensive GPU applications. While the GPM-side L1.5 cache has minimal impact on the compute-intensive GPU applications, it is able to capture the relatively small working sets of the limited-parallelism GPU applications and provide a performance speedup of 3.5% over the baseline. Finally, Figure 6 shows that the L1.5 cache generally helps applications that incur significant performance loss when moving from a 6TB/s inter-GPM bandwidth setting to 768GB/s. This trend can be seen in the figure, as the memory-intensive applications are sorted by their inter-GPM bandwidth sensitivity from left to right.

In addition to improving MCM-GPU performance, the GPM-side L1.5 cache helps to significantly reduce the inter-GPM communication energy associated with on-package data movements. This is illustrated by Figure 7, which summarizes the total inter-GPM bandwidth with and without the L1.5 cache. Among the memory-intensive workloads, inter-GPM bandwidth is reduced by as much as 39.9% for the SSSP application, and by an average of 16.9%, 36.4%, and 32.9% for the memory-intensive, compute-intensive, and limited-parallelism workloads respectively. On average across all evaluated workloads, we observe that inter-GPM bandwidth utilization is reduced by 28% due to the introduction of the GPM-side L1.5 cache.

³ A small cache capacity of 32KB is maintained in the memory-side L2 cache to accelerate atomic operations.
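The remote-only allocation policy can be illustrated with a toy model. This is our simplified sketch, not the simulated design: it is fully associative, has no replacement policy, and the line-granularity interleaving function is our assumption:

```python
class L15Cache:
    """Toy model of the GPM-side L1.5 cache with the remote-only
    allocation policy of Section 5.1: lines homed in the local DRAM
    partition bypass the L1.5; only remote lines are allocated."""

    def __init__(self, gpm_id, num_gpms, line_bytes=128, max_lines=131072):
        self.gpm_id = gpm_id
        self.num_gpms = num_gpms
        self.line_bytes = line_bytes
        self.max_lines = max_lines     # 16MB / 128B lines = 131072
        self.lines = set()

    def home_gpm(self, addr):
        # Fine-grain interleaving of 128B lines across DRAM partitions.
        return (addr // self.line_bytes) % self.num_gpms

    def access(self, addr):
        """Return 'local' (bypasses the L1.5), 'hit', or 'miss'."""
        if self.home_gpm(addr) == self.gpm_id:
            return "local"             # local accesses bypass the L1.5
        line = addr // self.line_bytes
        if line in self.lines:
            return "hit"               # remote data served within the GPM
        if len(self.lines) < self.max_lines:
            self.lines.add(line)       # allocate remote lines only
        return "miss"

    def flush(self):
        # Flushed once per synchronization event (e.g., a kernel boundary).
        self.lines.clear()
```

Every 'hit' is a remote request that never crosses an inter-GPM link, which is exactly where the bandwidth and energy savings reported above come from.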
Figure 6: Performance of 256 SM, 768 GB/s inter-GPM BW MCM-GPU with 8MB (iso-transistor), 16 MB (iso-transistor), and 32 MB (non-iso-transistor) L1.5 caches. The M-Intensive applications are sorted by their sensitivity to inter-GPM bandwidth. [Chart: speedup over baseline MCM-GPU per M-Intensive benchmark, plus category averages, for each cache size with and without remote-only allocation.]

Figure 9: Performance of MCM-GPU system with a distributed scheduler. [Chart: speedup over baseline per benchmark and category.]

Figure 11: First Touch page mapping policy: (a) access order; (b) proposed page mapping policy. [Diagram: CTAs A…A+3 on GPMs 0 and 1 touching pages P0-P3, which are mapped to memory partitions MP 0 and MP 1.]

5.2 CTA Scheduling for GPM Locality

In a baseline MCM-GPU, similar to a monolithic GPU, a first batch of CTAs is scheduled in order to the SMs by a centralized scheduler at kernel launch. However, during kernel execution, CTAs are
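The contiguous-group CTA assignment suggested by Figure 12, where consecutive CTAs that often touch adjacent data land on the same GPM, can be sketched as follows. This is our illustrative approximation (equal-size groups, remainder CTAs on the last GPM), not the exact hardware policy:

```python
def assign_ctas(num_ctas, num_gpms):
    """Assign contiguous groups of consecutive CTAs to GPMs, in the
    spirit of the distributed scheduling of Section 5.2 / Figure 12.
    Returns a dict mapping CTA id -> GPM id."""
    group = max(1, num_ctas // num_gpms)
    return {cta: min(cta // group, num_gpms - 1) for cta in range(num_ctas)}
```

For 16 CTAs on 4 GPMs this yields CTAs 0-3 on GPM 0, 4-7 on GPM 1, and so on, matching the grouping shown in the figure.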
5.3 Data Partitioning for GPM Locality

To improve MCM-GPU performance, special care is needed for page placement to reduce inter-GPM traffic when possible. Ideally, we would like to map memory pages to physical DRAM partitions such that they incur as many local memory accesses as possible. In order to maximize DRAM bandwidth utilization and prevent camping on memory channels within the memory partitions, we will still interleave addresses at a fine granularity across the memory channels of each memory partition (analogous to the baseline
[Chart residue removed: inter-GPM BW (TB/s) per benchmark and category, comparing the Baseline MCM-GPU against 16MB Remote-Only L1.5 and DS.]
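The First Touch (FT) policy of Figure 11 maps a page to the DRAM partition of the GPM whose CTA touches it first; all later accesses, local or remote, go to that partition. A minimal sketch, where the page size and method names are our assumptions for illustration:

```python
class FirstTouchMapper:
    """First Touch (FT) page placement: the first GPM to touch a page
    becomes its home; subsequent touches from any GPM reuse that
    mapping. The real policy operates on GPU page tables."""

    PAGE_BYTES = 64 * 1024   # hypothetical 64KB pages

    def __init__(self):
        self.page_to_partition = {}

    def partition_for(self, addr, accessing_gpm):
        page = addr // self.PAGE_BYTES
        # setdefault records the mapping on first touch and is a no-op after.
        return self.page_to_partition.setdefault(page, accessing_gpm)
```

Combined with contiguous CTA groups, consecutive CTAs on one GPM first-touch (and therefore localize) the pages they share.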
Figure 12: Exploiting cross-kernel CTA locality with First Touch page placement and distributed CTA scheduling. [Diagram: CTAs 0-15 from consecutive kernel invocations (i = 0, 1, …, n-1) grouped onto GPMs.]

Figure 13: Performance of MCM-GPU with First Touch page placement. [Chart: speedup over baseline per benchmark and category.]

Figure 14: Reduction in inter-GPM bandwidth with First Touch page placement. [Chart: inter-GPM BW (TB/s) per M-Intensive benchmark, plus the M-Intensive average.]

Figure 15: S-curve summarizing the optimized MCM-GPU performance speedups for all workloads. [Chart: speedup over baseline MCM-GPU across all 48 workloads.]
hit rates, since the caches are flushed at kernel boundaries. However, we benefit significantly from more local accesses when distributed scheduling is combined with first-touch mapping.

FT also allows for much more efficient use of the cache hierarchy. Since FT page placement keeps many accesses local to the memory partition of a CTA's GPM, it reduces the need for an L1.5 cache to keep requests from going to remote memory partitions. In fact, using the first-touch policy shifts the performance bottleneck from inter-GPM bandwidth to local memory bandwidth. Figure 13 shows this effect. In this figure, we show two bars for each benchmark: FT with DS and a 16MB remote-only L1.5 cache, and FT with DS and an 8MB remote-only L1.5 cache. The 16MB L1.5 cache leaves room for only 32KB worth of L2 cache in each GPM. This results in sub-optimal performance, as insufficient cache capacity is allocated to local memory traffic. We observe that in the presence of FT, an 8MB L1.5 cache along with a larger 8MB L2 achieves better performance. The results show that with this configuration we can obtain 51% / 11.3% / 7.9% performance improvements compared to the baseline MCM-GPU in the memory-intensive, compute-intensive, and limited-parallelism applications respectively. Finally, Figure 14 shows that with FT page placement a multitude of workloads experience a drastic reduction in their inter-GPM traffic, sometimes almost eliminating it completely. On average, our proposed MCM-GPU achieves a 5× reduction in inter-GPM bandwidth compared to the baseline MCM-GPU.

5.4 MCM-GPU Performance Summary

Figure 15 shows the s-curve depicting the performance improvement of the MCM-GPU for all workloads in our study. Of the evaluated 48 workloads, 31 workloads experience performance improvement while 9 workloads suffer some performance loss. M-Intensive workloads such as CFD, CoMD, and others experience a drastic reduction in inter-GPM traffic due to our optimizations and thus experience significant performance gains of up to 3.2× and 3.5× respectively. Workloads in the C-Intensive and limited-parallelism categories that show high sensitivity to inter-GPM bandwidth also experience significant performance gains (e.g., 4.4× for SP and 3.1× for XSBench). On the flip side, we observe two side-effects of the proposed optimizations. For example, for workloads such as DWT and NN that have limited parallelism and are inherently insensitive to inter-GPM bandwidth, the additional latency introduced by the presence of the L1.5 cache can lead to performance degradation of up to 14.6%. Another reason for potential performance loss, as observed in Streamcluster, is the reduced capacity of the on-chip writeback L2 caches⁴, which leads to increased write traffic to DRAM. This results in a performance loss of up to 25.3% in this application. Finally, we observe that there are workloads (two in our evaluation set) where different CTAs perform unequal amounts of work. This leads to workload imbalance due to

⁴ L1.5 caches are set up as write-through to support the software-based GPU coherence implementation.
Figure 16: Breakdown of the sources of performance improvements of optimized MCM-GPU when applied alone and together. Three proposed architectural improvements for MCM-GPU almost close the gap with unbuildable monolithic GPU. [Chart: performance improvement (%) for the Remote-Only L1.5, Distributed Scheduling, and First Touch mechanisms, applied alone and combined.]

Figure 17: Performance comparison of MCM-GPU and Multi-GPU. [Chart: speedup normalized to the baseline Multi-GPU for the Optimized Multi-GPU (buildable), MCM-GPU (768 GB/s, buildable), MCM-GPU (6 TB/s, unbuildable), and Monolithic GPU (unbuildable).]
the coarse-grained distributed scheduling. We leave further optimizations of the MCM-GPU architecture that would take advantage of this potential opportunity for better performance to future work.

In summary, we have proposed three important microarchitectural enhancements to the baseline MCM-GPU architecture: (i) a remote-only L1.5 cache, (ii) a distributed CTA scheduler, and (iii) a first-touch data page placement policy. It is important to note that these independent optimizations work best when they are combined. Figure 16 shows the performance benefit of employing the three mechanisms individually. The introduction of the L1.5 cache alone provides a 5.2% performance improvement. Distributed scheduling and first-touch page placement, on the other hand, do not improve performance at all when applied individually. In fact, they can even lead to performance degradation, e.g., -4.7% for the first-touch page placement policy. However, when all three mechanisms are applied together, we observe that the optimized MCM-GPU achieves a speedup of 22.8%, as shown in Figure 16. We observe that combining distributed scheduling with the remote-only cache improves cache performance and reduces the inter-GPM bandwidth further. This results in an additional 4.9% performance benefit compared to having just the remote-only cache, while also reducing inter-GPM bandwidth by an additional 5%. Similarly, when first-touch page placement is employed in conjunction with the remote-only cache and distributed scheduling, it provides an additional speedup of 12.7% and reduces inter-GPM bandwidth by an additional 47.2%. These results demonstrate that our proposed enhancements not only exploit the data locality currently available within a program but also improve it. Collectively, all three locality-enhancement mechanisms achieve a 5× reduction in inter-GPM bandwidth. These optimizations enable the proposed MCM-GPU to achieve a 45.5% speedup compared to the largest implementable monolithic GPU, and to be within 10% of an equally equipped albeit unbuildable monolithic GPU.

6 MCM-GPU VS MULTI-GPU

An alternative way of scaling GPU performance is to build multi-GPU systems. This section compares the performance and energy efficiency of the MCM-GPU and two possible multi-GPU systems.

6.1 Performance vs Multi-GPU

A system with 256 SMs can also be built by interconnecting two maximally sized discrete GPUs of 128 SMs each. Similar to our MCM-GPU proposal, each GPU has a private 128KB L1 cache per SM, an 8MB memory-side cache, and 1.5 TB/s of DRAM bandwidth. We assume such a configuration as a maximally sized future monolithic GPU design. We assume that the two GPUs are interconnected via the next generation of on-board links with 256 GB/s of aggregate bandwidth, improving upon the 160 GB/s commercially available today [17]. For the sake of comparison with the MCM-GPU, we assume the multi-GPU to be fully transparent to the programmer. This is accomplished by assuming the following two features: (i) a unified memory architecture between two peer GPUs, where both GPUs can access local and remote DRAM resources with load/store semantics, and (ii) a combination of system software and hardware which automatically distributes CTAs of the same kernel across GPUs.

In such a multi-GPU system, the challenges of load imbalance, data placement, workload distribution, and interconnection bandwidth discussed in Sections 3 and 5 are amplified due to severe NUMA effects from the lower inter-GPU bandwidth. Distributed CTA scheduling together with the first-touch page allocation mechanism (described respectively in Sections 5.2 and 5.3) are also applied to the multi-GPU. We refer to this design as the baseline multi-GPU system. Although a full study of various multi-GPU design options was not performed, alternative options for CTA scheduling and page allocation were investigated. For instance, a fine-grain CTA assignment across GPUs was explored, but it performed very poorly due to the high interconnect latency across GPUs. Similarly, round-robin page allocation results in very low and inconsistent performance across our benchmark suite.

Remote memory accesses are even more expensive in a multi-GPU when compared to the MCM-GPU due to the relatively lower quality of on-board interconnect. As a result, we optimize the multi-GPU baseline by adding GPU-side hardware caching of remote GPU memory, similar to the L1.5 cache proposed for the MCM-GPU. We have explored various L1.5 cache allocation policies and configurations, and observed the best average performance with half of the L2 cache capacity moved to L1.5 caches dedicated to caching remote DRAM accesses, and the other half retained as the L2 cache for caching local DRAM accesses. We refer to this as the optimized multi-GPU.

Figure 17 summarizes the performance results for the different buildable GPU organizations and unrealizable hypothetical designs, all normalized to the baseline multi-GPU configuration.
The optimized multi-GPU, which has GPU-side caches, outperforms the baseline multi-GPU by an average of 25.1%. Our proposed MCM-GPU, on the other hand, outperforms the baseline multi-GPU by an average of 51.9%, mainly due to the higher quality on-package interconnect.

6.2 MCM-GPU Efficiency

Besides enabling performance scalability, MCM-GPUs are energy and cost efficient. MCM-GPUs are energy efficient as they enable denser integration of GPU modules on a package that would otherwise have to be connected at the PCB level, as in the multi-GPU case. In doing so, MCM-GPUs require a significantly smaller system footprint and utilize more efficient interconnect technologies, e.g., 0.5 pJ/b on-package vs 10 pJ/b on-board interconnect. Moreover, if we assume almost constant GPU and system power dissipation, the performance advantages of the MCM-GPU translate to additional energy savings. In addition, the superior transistor density achieved by the MCM-GPU approach allows lowering the GPU operating voltage and frequency. This moves the GPU to a more power-efficient operating point on the transistor voltage-frequency curve. Consequently, it allows trading off ample performance (achieved via the abundant parallelism and number of transistors in the package) for better power efficiency.

Finally, at large scale, such as in HPC clusters, the MCM-GPU improves performance density and as such reduces the number of GPUs per node and/or the number of nodes per cabinet. This leads to a smaller number of cabinets at the system level. Smaller total system size translates to a smaller number of communicating agents, smaller network size, and shorter communication distances. These result in lower system-level energy dissipation on communication, power delivery, and cooling. Similarly, higher system density also leads to total system cost advantages and lower overheads as described above. Moreover, MCM-GPUs are expected to result in lower GPU silicon cost, as they replace large dies with medium-size dies that have significantly higher silicon yield and cost advantages.
compute capabilities and future progress in many fields such as exas-
7 RELATED WORK cale computing and artificial intelligence will depend on continued
GPU performance growth. The greatest challenge towards building
Multi-Chip-Modules are an attractive design point that have been
more powerful GPUs comes from reaching the end of transistor den-
extensively used in the industry to integrate multiple heterogeneous
sity scaling, combined with the inability to further grow the area of
or homogeneous chips in the same package. For example, on the
a single monolithic GPU die. In this paper we propose MCM-GPU,
homogeneous front, IBM Power 7 [5] integrates 4 modules of 8
a novel GPU architecture that extends GPU performance scaling at
cores each, and AMD Opteron 6300 [4] integrates 2 modules of
a package level, beyond what is possible today. We do this by parti-
8 cores each. On the heterogeneous front, the IBM z196 [3] inte-
tioning the GPU into easily manufacturable basic building blocks
grates 6 processors with 4 cores each and 2 storage controller units
(GPMs), and by taking advantage of the advances in signaling tech-
in the same package. The Xenos processor used in the Microsoft
nologies developed by the circuits community to connect GPMs
Xbox360 [1] integrates a GPU and an EDRAM memory module
on-package in an energy efficient manner.
with its memory controller. Similarly, Intel offers heterogeneous
We discuss the details of the MCM-GPU architecture and show
and homogeneous MCM designs such as the Iris Pro [11] and the
that our MCM-GPU design naturally lends itself to many of the
Xeon X5365 [2] processors respectively. While MCMs are popular
historical observations that have been made in NUMA systems. We
in various domains, we are unaware of any attempt to integrate ho-
explore the interplay of hardware caches, CTA scheduling, and data
mogeneous high performance GPU modules on the same package
placement in MCM-GPUs to optimize this architecture. We show
in an OS and programmer transparent fashion. To the best of our
that with these optimizations, a 256 SMs MCM-GPU achieves 45.5%
knowledge, this is the first effort to utilize MCM technology to scale
speedup over the largest possible monolithic GPU with 128 SMs.
GPU performance.
Furthermore, it performs 26.8% better than an equally equipped
MCM package level integration requires efficient signaling tech-
discrete multi-GPU, and its performance is within 10% of that of a
nologies. Recently, Kannan et al. [31] explored various packaging
hypothetical monolithic GPU that cannot be built based on today’s
and architectural options for disintegrating multi-core CPU chips
technology roadmap.
and studied its suitability to provide cache-coherent traffic in an
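The interconnect energies quoted in this paper (0.5 pJ/b on-package vs. roughly 10 pJ/b on-board) can be put in perspective with a back-of-the-envelope power estimate. The sketch below is illustrative only: the 1 TB/s inter-GPM traffic figure is an assumption we introduce here, not a number from the paper.

```python
# Back-of-the-envelope interconnect power for a given amount of traffic.
# Assumption: 1 TB/s of inter-GPM traffic (illustrative, not from the paper).
# Energy-per-bit values are the on-package vs. on-board figures from the text.
TRAFFIC_BYTES_PER_SEC = 1e12        # assumed: 1 TB/s of inter-GPM traffic
BITS_PER_BYTE = 8

ON_PACKAGE_PJ_PER_BIT = 0.5         # on-package link (GRS-class signaling [45])
ON_BOARD_PJ_PER_BIT = 10.0          # on-board (PCB-level) interconnect

def link_power_watts(pj_per_bit: float) -> float:
    """Power = (energy per bit) * (bits per second); pJ -> J via 1e-12."""
    return pj_per_bit * 1e-12 * TRAFFIC_BYTES_PER_SEC * BITS_PER_BYTE

print(link_power_watts(ON_PACKAGE_PJ_PER_BIT))  # about 4 W
print(link_power_watts(ON_BOARD_PJ_PER_BIT))    # about 80 W, a 20x gap
```

The 20x energy-per-bit gap is what makes on-package GPM-to-GPM links viable at bandwidths where board-level links would consume a prohibitive share of the GPU power budget.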
ISCA ’17, June 24-28, 2017, Toronto, ON, Canada A. Arunkumar et al.
[45] John W. Poulton, William J. Dally, Xi Chen, John G. Eyles, Thomas H. Greer,
Stephen G. Tell, John M. Wilson, and C. Thomas Gray. 2013. A 0.54 pJ/b 20
Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS
for Advanced Packaging Applications. IEEE Journal of Solid-State Circuits 48,
12 (Dec 2013), 3206–3218. https://doi.org/10.1109/JSSC.2013.2279053
[46] Debendra D. Sharma. 2014. PCI Express 3.0 Features and Requirements
Gathering for beyond. (2014). https://www.openfabrics.org/downloads/Media/
Monterey_2011/Apr5_pcie%20gen3.pdf Accessed: 2016-06-20.
[47] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional
Networks for Large-Scale Image Recognition. ArXiv e-prints (Sept. 2014).
arXiv:cs.CV/1409.1556
[48] Bruce W. Smith and Kazuaki Suzuki. 2007. Microlithography: Science and Tech-
nology, Second Edition. https://books.google.com/books?id=_hTLDCeIYxoC
[49] Jeff A. Stuart and John D. Owens. 2009. Message Passing on Data-parallel
Architectures. In Proceedings of the IEEE International Symposium on Parallel
& Distributed Processing (IPDPS '09). IEEE, Washington, DC, USA, 1–12.
https://doi.org/10.1109/IPDPS.2009.5161065
[50] Jeff A. Stuart and John D. Owens. 2011. Multi-GPU MapReduce on GPU Clusters.
In Proceedings of the IEEE International Parallel & Distributed Processing
Symposium (IPDPS ’11). IEEE, Washington, DC, USA, 1068–1079. https://doi.
org/10.1109/IPDPS.2011.102
[51] David Tam, Reza Azimi, and Michael Stumm. 2007. Thread Clustering: Sharing-
aware Scheduling on SMP-CMP-SMT Multiprocessors. In Proceedings of the 2nd
ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys ’07).
ACM, New York, NY, USA, 47–58. https://doi.org/10.1145/1272996.1273004
[52] Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2016.
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs. In Proceed-
ings of the 43rd International Symposium on Computer Architecture (ISCA ’16).
IEEE, Piscataway, NJ, USA, 583–595. https://doi.org/10.1109/ISCA.2016.57
[53] Kenneth M. Wilson and Bob B. Aglietti. 2001. Dynamic Page Placement to
Improve Locality in CC-NUMA Multiprocessors for TPC-C. In Proceedings of
the ACM/IEEE Conference on Supercomputing (SC ’01). ACM, New York, NY,
USA, 33–33. https://doi.org/10.1145/582034.582067