1 Introduction
Future high-performance computing (HPC) systems are driven toward heterogeneity of compute and memory resources in response to the expected halt of traditional technology scaling, combined with continuous demands for increased performance [56, 78] and the wide landscape of HPC applications [69]. In the long term, many HPC systems are expected to feature a variety of graphical processing units (GPUs), partially programmable accelerators [62, 77], fixed-function accelerators [11, 61, 68], reconfigurable accelerators such as field-programmable gate arrays (FPGAs) [40, 70], and new classes of memory [80] that blur the line between memory and storage technology.
If we preserve our current method of allocating resources to applications in units of statically configured nodes where every node is identical, then future systems risk substantially underutilizing expensive resources. This is because not every application can profitably use specialized hardware, since the value of a given accelerator can be highly application-dependent. The potential for wasted resources when a given application does not use them grows with the number of new heterogeneous technologies and accelerators that might be co-integrated into future nodes.
This observation, combined with the desire to increase utilization even of "traditional" resources, has led to research on systems that can pool and compose resources of different types in a fine-grain manner to match application requirements. This capability is referred to as resource disaggregation. In datacenters, resource disaggregation has increased the utilization of GPUs and memory [19, 33, 35, 50, 64, 67]. Such approaches usually employ a full-system solution where resources can be pooled from across the system. While this approach maximizes the flexibility and range of resource disaggregation, it also increases the overhead of implementing it, for instance by requiring long-range communication that stresses bandwidth and increases latency [21, 55, 83]. As a result, some work focuses on intra-rack disaggregation [34, 75].
While resource disaggregation is regarded as a promising approach in HPC in addition to datacenters, there is currently no solid understanding of what range or flexibility of disaggregation HPC applications require [13, 47] or what improvement in resource utilization to expect from this approach. Without a data-driven analysis of the workload, we risk over-designing resource disaggregation, making it not only unnecessarily expensive but also likely to overly penalize application performance due to high latencies and limited communication bandwidth [34, 75].
To that end, we study and quantify what level of resource disaggregation is sufficient for typical HPC workloads and what efficiency increase is possible if HPC embraces this approach, to guide future research into specific technological solutions. We perform a detailed, data-driven analysis of an exemplar open-science, high-ranked, production HPC system with a diverse scientific workload and complement our analysis by profiling key machine learning (ML) applications. For our system analysis, we sample key system-wide and per-job metrics that indicate how efficiently resources are used, sampled every second for a duration of three weeks on NERSC's Cori [38]. Cori is a top 20, open-science HPC system that supports thousands of projects and multiple thousands of users, and executes a diverse set of HPC workloads from fusion energy, material science, climate research, physics, computer science, and many other science domains [3]. Because Cori has no GPUs, we also study ML applications executing on NVIDIA GPUs. For these applications, we examine a range of scales, both training and inference, while analyzing utilization of key resources.
Based on our analysis, we find that for a system configuration similar to Cori, intra-rack disaggregation suffices for the vast majority of the time, even after reducing overall resources. In particular, in a rack (cabinet) configuration similar to Cori [38] but with ideal intra-rack resource disaggregation, where network interface controllers (NICs) and memory resources can be allocated to jobs in a fine-grain manner but only within racks, we show that a central processing unit (CPU) has a 99.5% probability of finding all resources it requires inside its rack. Focusing on jobs, with 20% fewer memory modules and 20% less NIC bandwidth per rack, a job has an 11% probability of having to span more racks than its minimum possible in Cori. In addition, in our sampling time range and at worst across Haswell and KNL nodes, we could reduce memory bandwidth by 69.01%, memory capacity by 5.36%, and NIC bandwidth by 43.35% in Cori while still satisfying the worst-case average rack utilization. This quantifies the maximum resource reduction that ideal intra-rack disaggregation enables.
In summary, our contributions are as follows:
• We perform a data-driven, detailed, and well-rounded study of utilization metrics for key resources in a top 20, open-science, production HPC system.
• We profile important ML applications executing on NVIDIA GPUs, an emerging workload on HPC systems, focusing on utilization of key resources.
• We propose and measure new metrics relevant to resource disaggregation. In particular, we show how soon a job's resource utilization becomes a good predictor of its future behavior, how quickly node resource utilization changes, and how a job's utilization of different resources correlates.
• We demonstrate spatial usage imbalance of resources to motivate job scheduling policies that use resource utilization as an optimization metric.
• We quantify the probability that a job has to span more racks than the minimum as a function of how aggressively we reduce intra-rack resources in our model system (Cori).
• We quantify how many rack resources we can reduce with ideal intra-rack disaggregation.
• For HPC systems with workloads or hardware configurations substantially different from Cori, our analysis serves as a framework that shows which metrics are relevant to resource disaggregation and how to translate system measurements into the expected effectiveness of intra-rack disaggregation as a function of how aggressively resources are reduced. This helps repeat our analysis on other HPC systems.
• In summary, we show that for an HPC system similar to Cori, intra-rack disaggregation suffices for the vast majority of the time. Demonstrating this insight with concrete data is important to guide future research.
3 System-wide Analysis
3.1 Motivation for and Description of Cori
Cori is an open-science, top 20 HPC system operated by NERSC that supports the diverse set of workloads in scope for the U.S. Department of Energy (DOE) [38]. Cori supports on the order of one thousand projects and a few thousand users, and executes a vastly diverse set of representative single-core to full-machine workloads from fusion energy, material science, climate research, physics, computer science, and many other science domains [3]. In this work, we use Cori as a model system that represents requirements common to many other HPC systems in the sciences. We recognize that HPC systems with vastly different workloads or hardware configurations should consider repeating an analysis similar to ours, but the broad user base of this system should provide a microcosm of potential requirements across a wide variety of systems across the world. Furthermore, over the 45-year history of the NERSC HPC facility and 12 generations of systems with diverse architectures, the workload has evolved very slowly despite substantial changes to the underlying system architecture, because the requirements of the scientific mission of such facilities take precedence over the architecture of the facility.
Cori is an approximately 30 Pflop/s Cray XC40 system and consists of 2,388 Intel Xeon "Haswell" and 9,688 Intel Xeon Phi "Knight's Landing" (KNL) compute nodes [73]. Each Haswell node has two 2.3 GHz 16-core Intel Xeon E5-2698 v3 CPUs with two hyperthreads per core. Each KNL node has a single Intel Xeon Phi 7250 CPU with 68 cores at 1.4 GHz and four hyperthreads per core. Haswell nodes have eight (four per CPU socket) 16 GB 2,133 MHz DDR4 modules with a peak transfer rate of 17 GB/s per module [12, 84]. KNL nodes have 96 GB of 2,400 MHz DDR4 and 16 GB of MCDRAM. Therefore, Haswell nodes have 128 GB of memory while KNL nodes have 112 GB. Each XC40 cabinet housing Haswell and KNL nodes has three chassis; each chassis has 16 compute blades with four nodes per blade.
Cori also has a Lustre file system and uses the Cray Aries interconnect [9]. As in many modern systems, hard disk space is already disaggregated in Cori by placing hard disks in a common part of the system [36, 38]. Therefore, we do not study file system disaggregation. Nodes connect to Aries routers through NICs using PCIe 3.0 links with 16 GB/s sustained bandwidth per direction [1, 2, 26, 65]. The four nodes in each blade share a NIC. Cray's Aries in Cori employs a Dragonfly topology with over 45 TB/s of bisection bandwidth and support for adaptive routing. An Aries router connects eight NIC ports to 40 network ports. Network ports operate at rates of 4.7 to 5.25 GB/s per direction [1, 65].
Currently, Cori uses SLURM version 20.02.6 [46] and Cray OS 7.0.UP01. When submitting a job, users request a type of node (typically Haswell or KNL), a number of nodes, and a job duration. Users do not indicate expected memory or CPU usage, but depending on their expectations they can choose the "high memory" or "shared" job queues. The maximum job duration that can be requested is 48 h.
3.2 Methodology
We collect system-wide data using the lightweight distributed metric service (LDMS) [7] on Cori. LDMS samples statistics on every node every second and also records which job ID each node is allocated to. Statistics include information from the OS level, such as CPU idle times and memory capacity usage, as well as hardware counters at the CPUs and NICs. After collection, statistics are aggregated into data files that we process to produce our results. We also calibrate counters by comparing their values against workloads with known metric values, by comparing to the literature, and by comparing similar metrics derived through different means, such as hardware counters, the OS, or the job scheduler. For generating graphs, we use Python 3.9.1 and pandas 1.2.2.
We sample Cori from April 1 to April 20 of 2021 (UTC-7). However, NIC bandwidth and memory capacity are sampled from April 5 to April 19 due to the large volume of data. Sampling more frequently than 1 s or for longer periods would produce even more data than the few TBs our analysis already processes. For this reason, sampling for more than three weeks at a time is impractical. Still, our three-week sampling period is long enough that the workload we capture is typical of Cori [3]. Unfortunately, since we cannot capture application executable files and input data sets, and because we find job names to be an unreliable indicator of the algorithm a job executes, we do not have precise application information to identify the exact applications that executed in our captured date ranges. For the same reason, we cannot replay application execution in a system simulator.
To focus on larger jobs and disregard test or debug jobs, we also generate statistics only for nodes allocated to jobs that use at least four nodes and last at least 2 h. Our data does not include the few tens of high-memory or interactive (login) nodes in Cori. Because statistics are sampled no more frequently than every second (at second boundaries), any maximum values we report, such as in Table 1, are values calculated within each second boundary and thus do not necessarily capture peaks that last less than one second. Therefore, our analysis focuses on sustained workload behavior. Memory occupancy and NIC bandwidth are expressed as utilization percentages of the aforementioned per-node maximums. We do not express memory bandwidth as a percentage, because the memory module in which data is placed and the CPU that accesses the data affect the maximum possible bandwidth.
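To make this filtering and normalization concrete, the following minimal pandas sketch shows one way to keep only jobs with at least four nodes and at least two hours of runtime and to express NIC bandwidth as a percentage of the 16 GB/s per-direction maximum. The table layout and column names (num_nodes, duration_h, nic_bytes_per_s) are hypothetical, since the exact schema of our aggregated LDMS files is not described here.

import pandas as pd

def filter_and_normalize(samples: pd.DataFrame, jobs: pd.DataFrame) -> pd.DataFrame:
    """Keep samples of jobs with >= 4 nodes and >= 2 h duration, and express
    NIC bandwidth as a percentage of the 16 GB/s per-direction maximum.

    'samples' (one row per node per second) and 'jobs' are hypothetical tables
    aggregated from LDMS; the column names are illustrative assumptions.
    """
    nic_max_bytes_per_s = 16e9
    large_jobs = jobs[(jobs["num_nodes"] >= 4) & (jobs["duration_h"] >= 2.0)]
    kept = samples[samples["job_id"].isin(large_jobs["job_id"])].copy()
    kept["nic_util_pct"] = 100.0 * kept["nic_bytes_per_s"] / nic_max_bytes_per_s
    return kept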
We present most data using node-wide cumulative distribution function (CDF) graphs that sample each metric at each node individually at every sampling interval. Each sample from each node is added to the dataset. This includes idling nodes, but NERSC systems consistently assign well over 99% of nodes to jobs at any time. This method is useful for displaying how each metric varies across nodes and time, and is insensitive to job size and duration because it disregards job IDs. Table 1 summarizes our sampled data from Cori. Larger maximum utilizations for KNL nodes for some metrics are partly because there are more KNL nodes in Cori.
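As an illustration, a node-wide CDF of this kind can be computed directly from the pooled per-node, per-second samples. The short sketch below uses NumPy and matplotlib (an assumption, since only pandas is named above) and random placeholder data in place of the actual measurements.

import numpy as np
import matplotlib.pyplot as plt

def node_wide_cdf(values):
    """Empirical CDF over every (node, second) sample, ignoring job IDs."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Placeholder data standing in for per-node, per-second memory occupancy samples (%).
occupancy_samples = np.random.default_rng(0).uniform(0, 100, size=100_000)

x, y = node_wide_cdf(occupancy_samples)
plt.plot(x, y)
plt.xlabel("Node memory occupancy (%)")
plt.ylabel("Cumulative fraction of samples")
plt.show()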
3.3 Job Size and Duration
Fig. 1 shows area plots of the size of the jobs that nodes are assigned to, as a percentage of total nodes. For each node, we determine the size (in nodes) of the job that the node is assigned to and record that value for each node in the system individually. Results are shown as a function of time, with one sample every 30 s. We only show this data for a subset of our sampling period because of the large volume of data, particularly for KNL nodes, since there are \(4\times\) more KNL nodes than Haswell nodes. There are no KNL jobs larger than 2,048 nodes in the illustrated period.
As shown, the majority of jobs easily fit within a rack in Cori. In our 20-day sample period, 77.5% of jobs on Haswell and 40.5% on KNL nodes request only one node. Similarly, 86.4% of jobs on Haswell and 75.9% on KNL nodes request no more than four nodes. 41% of jobs on Haswell and 21% on KNL nodes use at least four nodes and execute for at least 2 h. Also, job duration (Table 1) is multiple orders of magnitude larger than the reconfiguration delay of modern hardware technologies that can implement resource disaggregation, such as photonic switch fabrics that can reconfigure in a few tens of microseconds down to a few tens of nanoseconds [20, 60]. Our data set includes a handful of jobs on both Haswell and KNL nodes that use a special reservation to exceed the normal maximum number of hours for a job. In Alibaba's systems, batch jobs are shorter, with 99% of jobs under two minutes and 50% under ten seconds [36].
3.4 Memory Capacity
Fig. 2 (left) shows a CDF of node-wide memory occupied by user applications, as calculated from "proc" reports of total minus available memory [14]. Fig. 2 (right) shows a CDF of the maximum memory occupancy among all nodes, sampled every 1 s. Occupied memory is expressed as a percentage utilization of the capacity available to user applications (total memory in "proc" reports), and thus does not include memory reserved for the firmware and kernel binary code.
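For reference, the "total minus available" calculation can be sketched as follows; the MemTotal and MemAvailable fields of /proc/meminfo are used here, which is an assumption about the exact fields the samplers read.

def memory_occupancy_pct(meminfo_path="/proc/meminfo"):
    """Occupied memory as a percentage of total, computed as total minus available.

    A minimal sketch of the 'total minus available' calculation described in the
    text; the exact fields the LDMS samplers read are not specified here.
    """
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # values are in kB
    total_kb = fields["MemTotal"]
    available_kb = fields["MemAvailable"]
    return 100.0 * (total_kb - available_kb) / total_kb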
As shown by the red example lines, three quarters of the time Haswell nodes use no more than 17.4% of on-node memory and KNL nodes 50.1%. Looking at the maximum occupancy among all nodes in each sampling period, half of the time that maximum is no more than 11.9% for Haswell nodes and 59.85% for KNL nodes. Looking at individual jobs instead of nodes (not illustrated), 74.63% of Haswell jobs and 86.04% of KNL jobs never use more than 50% of on-node memory.
Our analysis is in line with past work. In four Lawrence Livermore National Laboratory clusters, approximately 75% of the time no more than 20% of memory is used [67]. Other work observed that many HPC applications use on the order of hundreds of MBs per computation core [85]. Interestingly, Alibaba's published data [36] show that a similar observation holds only for machines that execute batch jobs, because batch jobs tend to be short and small. In contrast, for online jobs or co-located batch and online jobs, the minimum average machine memory utilization is 80%.
These observations suggest that memory capacity is overprovisioned on average. Overprovisioning of memory capacity is mostly a result of (i) sizing memory per node to satisfy memory-intensive applications or phases of applications that execute infrequently but are considered important, and (ii) allocating memory to nodes statically.
3.5 Memory Bandwidth
Fig. 3 (left) shows CDFs of bi-directional (read plus write) node-wide memory bandwidth for data transfers to and from memory. These results are calculated from the last-level cache (LLC) line size and the load and store LLC misses reported by "perf" counters [25], and crosschecked against PAPI counters [16]. Due to counter accessibility limitations, LLC statistics are only available for Haswell nodes. In Haswell, hardware prefetching predominantly occurs at cache levels below the LLC [43]. Thus, prefetches are captured by the aforementioned LLC counters we sample.
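A minimal sketch of this calculation is shown below. It uses the generic LLC-load-misses and LLC-store-misses perf aliases and a 64-byte line size, which are assumptions; the exact raw events sampled on Cori may differ.

import subprocess

CACHE_LINE_BYTES = 64  # assumed LLC line size on the Haswell nodes

def measure_memory_bandwidth_gbs(duration_s=1.0):
    """Approximate node memory bandwidth (GB/s) from LLC miss counts.

    Sketch of the calculation described above: (load + store LLC misses)
    times the cache-line size, divided by the sampling interval.
    """
    cmd = [
        "perf", "stat", "-a", "-x", ",",
        "-e", "LLC-load-misses,LLC-store-misses",
        "sleep", str(duration_s),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    misses = 0
    for line in result.stderr.splitlines():
        parts = line.split(",")
        # perf CSV fields: value, unit, event name, ...
        if len(parts) > 2 and parts[0].isdigit() and "LLC" in parts[2]:
            misses += int(parts[0])
    return misses * CACHE_LINE_BYTES / duration_s / 1e9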
As shown, three quarters of the time Haswell nodes use at most 0.46 GB/s of memory bandwidth. 16.2% of the time Haswell nodes use more than 1 GB/s. The distribution shows a long tail, indicative of bursty behavior. In addition, aggregate read bandwidth is 4.4\(\times\) larger than aggregate write bandwidth and the majority of bursty behavior comes from reads. If we focus on individual jobs (not illustrated), then 30.9% of jobs never use more than 1 GB/s per node.
To focus on the worst case, Fig. 3 (right) shows a CDF of the maximum memory bandwidth among all Haswell nodes, sampled in 1 s periods (i.e., sustained maximums). As shown, half of the time the maximum bandwidth among all Haswell nodes is no more than 17.6 GB/s and three quarters of the time no more than 31.5 GB/s. In addition, the sustained system-wide maximum memory bandwidth in 30 s time windows within our overall sampling period rarely exceeds 40 GB/s, while 30 s average values rarely exceed 2 GB/s.
Inevitably, our analysis is sensitive to the configuration of Cori's Haswell nodes. Faster or more computation cores per node would place more pressure on memory bandwidth [18], as is already evident in GPUs [29]. Still, our data hints that memory bandwidth is another resource that may be overdesigned on average in HPC. However, we cannot relate our findings to the impact on application performance if we were to reduce available memory bandwidth. That is because applications may exhibit brief but important phases of high memory bandwidth usage that may be penalized if we reduce available memory bandwidth [18].
3.6 CPU Utilization
Fig. 4 (left) shows a CDF of the average idle time among all compute cores in a node, expressed as a percentage. These statistics were generated using "proc" reports [14] of idle kernel cycles for each hardware thread. To generate a sample for each node every 1 s, we average the idle percentage of all hardware threads in each node. As shown, about half of the time Haswell nodes have at most a 49.9% CPU idle percentage and KNL nodes 76.5%. For Haswell nodes, the average system-wide CPU idle time in each sampling period never drops below 28% in a 30 s period, and for KNL below 30% (not illustrated). These statistics are largely due to the two hardware threads per compute core in Haswell and four in KNL, because in Cori 80% of the time Haswell nodes use only one hardware thread and 50% of the time KNL nodes do [3]. Similarly, many jobs reserve entire nodes but do not use all cores in those nodes. Datacenters have also reported 28%–55% CPU idle time in the case of Google trace data [66] and 20%–50% most of the time in Alibaba [36].
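The per-node averaging can be sketched as follows from the standard /proc/stat fields; whether iowait is counted as idle by the sampler is an assumption of this sketch.

import time

def _read_cpu_times():
    """Return {cpu_id: (idle_ticks, total_ticks)} from /proc/stat."""
    times = {}
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("cpu") and line[3].isdigit():
                parts = line.split()
                ticks = list(map(int, parts[1:]))
                idle = ticks[3] + ticks[4]  # idle + iowait (assumed to count as idle)
                times[parts[0]] = (idle, sum(ticks))
    return times

def node_idle_pct(interval_s=1.0):
    """Average idle percentage across all hardware threads over one interval."""
    before = _read_cpu_times()
    time.sleep(interval_s)
    after = _read_cpu_times()
    idle_pcts = []
    for cpu, (idle0, total0) in before.items():
        idle1, total1 = after[cpu]
        d_total = total1 - total0
        if d_total > 0:
            idle_pcts.append(100.0 * (idle1 - idle0) / d_total)
    return sum(idle_pcts) / len(idle_pcts)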
3.7 Network Bandwidth
We measure per-node injection bandwidth at every NIC using hardware counters in the Cray Aries interconnect [23]. Those counters record how many payload bytes each node sent to and received from the Aries network. We report results for each node as a percentage utilization of the maximum per-node NIC bandwidth of 16 GB/s per direction [2, 65]. We also verify against similar statistics generated by using NIC flit counters and multiplying by the flit size. In Cori, access to the global file system uses the Aries network, so our statistics include file system accesses.
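The normalization itself is straightforward; the sketch below illustrates it, with the counter readings left abstract since the exact Aries counter registers are not listed here.

NIC_MAX_BYTES_PER_S = 16e9  # 16 GB/s per direction per NIC

def nic_utilization_pct(bytes_before, bytes_after, interval_s=1.0):
    """Per-direction NIC utilization as a percentage of the 16 GB/s maximum.

    'bytes_before' and 'bytes_after' stand in for successive readings of the
    Aries payload-byte counters.
    """
    bandwidth = (bytes_after - bytes_before) / interval_s
    return 100.0 * bandwidth / NIC_MAX_BYTES_PER_S

def nic_bandwidth_from_flits(flits_delta, flit_bytes, interval_s=1.0):
    """Cross-check used in the text: flit count delta times the flit size."""
    return flits_delta * flit_bytes / interval_s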
Fig. 4 (right) shows a CDF of node-wide NIC bandwidth utilization. As shown, 75% of the time Haswell nodes use at most 0.5% of available NIC bandwidth; for KNL nodes the corresponding figure is 1.25%. In addition, NIC bandwidth consistently exhibits sustained bursty behavior. In particular, in a two-week period, the sustained 30 s average NIC bandwidth increased by more than \(3\times\) compared to the overall average in about 60 separate occurrences.
3.8 Variability of Job Requirements
In this section, we analyze how much metrics change across a job's lifetime. Fig. 5 shows a CDF of the standard deviation of all values throughout each job's execution, calculated separately for different metrics. This graph was generated by calculating, for each metric, the standard deviation of the values each job has throughout its execution. A high standard deviation indicates that the metric varies substantially throughout the job's execution. To normalize for the different absolute values of each job, the standard deviation is expressed as a percentage of the per-job average value for each metric. A value of 50% indicates that the job's standard deviation is half of the job's average for that metric.
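A minimal pandas sketch of this normalization is shown below, assuming a hypothetical per-(job, node, second) sample table with a column per metric.

import pandas as pd

def per_job_relative_std(samples: pd.DataFrame, metric: str) -> pd.Series:
    """Standard deviation of a metric over each job's lifetime, expressed as a
    percentage of that job's average value for the metric.

    'samples' is a hypothetical table with one row per (job_id, node, second).
    A value near 50 means the job's standard deviation is half its average.
    """
    grouped = samples.groupby("job_id")[metric]
    return 100.0 * grouped.std() / grouped.mean()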
As shown, occupied memory and CPU idle percentages do not vary highly during job execution, but memory and NIC bandwidths do. The variability of memory and NIC bandwidths is intuitive, because many applications exhibit phases of low and high memory bandwidth. Network and memory bandwidth have previously been observed to exhibit bursty behavior for many applications [45, 49, 82]. In contrast, once an application completes reserving memory capacity, the reservation's size typically does not change significantly until the application terminates.
These observations are important for provisioning resources for disaggregation. For metrics that do not considerably vary throughout a job’s execution, average system-wide or per-job measurements of those metrics are more representative of future behavior. Therefore, provisioning for average utilization, with perhaps an additional factor such as a standard deviation, likely will satisfy application requirements for those metrics for the majority of the time. In contrast, metrics that vary considerably have an average that is less representative. Therefore, for those metrics resource disaggregation should consider the maximum or near-maximum value.
3.9 Rate of Change of Metrics
As a next step, we analyze how quickly metrics change across nodes. This is a useful metric to indicate how quickly future hardware implementations of resource disaggregation should be able to reconfigure to satisfy changing resource demands. To that end, we calculate the percentage change of our metrics from one 1s sample to the next. For instance, if memory occupied in one sample is 4 GB and in the next sample it becomes 5 GB, we add 20% to our rate of change dataset. Due to the computational intensity of this analysis and our large data set, memory occupancy and NIC bandwidth graphs only consider data from April 5 to 9.
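The per-sample change can be computed as in the sketch below, which follows the example above (a 4 GB to 5 GB change counts as 20%, i.e., the absolute difference is normalized by the newer sample); the node_id and timestamp column names are assumptions.

import numpy as np
import pandas as pd

def rate_of_change_pct(samples: pd.DataFrame, metric: str) -> np.ndarray:
    """Percentage change of a metric from one 1 s sample to the next, per node.

    'samples' is a hypothetical per-(node, second) table. The absolute
    difference between consecutive samples is normalized by the newer sample.
    """
    changes = []
    for _, node_df in samples.sort_values("timestamp").groupby("node_id"):
        values = node_df[metric].to_numpy(dtype=float)
        diffs = np.abs(np.diff(values))
        with np.errstate(divide="ignore", invalid="ignore"):
            pct = 100.0 * diffs / values[1:]
        changes.append(pct[np.isfinite(pct)])
    return np.concatenate(changes) if changes else np.array([])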
Fig. 6 shows the rate of change of occupied memory, both node-wide and as the maximum across nodes. As shown, occupied node memory rarely changes substantially: 98% of the time, occupied memory changes by less than 0.1% from one second to the next. Even if we consider the maximum rate of change across all nodes every 1 s, 90% of the time that maximum is no more than 1%.
Fig. 7 shows node-wide CDFs of the rate of change of memory and NIC bandwidth. From one 1 s sample to the next, memory and NIC bandwidth do not change substantially with a probability of 80%. However, 2% (10%) of the time the change from one second to the next exceeds 100% for memory (NIC) bandwidth, indicating a burst of usage from a low, near-idle value to a high value. In fact, the CDF distributions have long tails. These results quantify the magnitude and frequency of the bursty nature of both memory and NIC bandwidth.
3.10 Spatial Load Imbalance
To show the results of current job scheduling policies that do not take resource utilization into account, Fig. 8 shows a heatmap of the average, standard deviation, and maximum memory occupancy for Haswell and KNL nodes, sampled across our 20-day sample period. As shown, memory occupancy is unequal across nodes. In fact, even the standard deviation of maximum memory occupancies is 15.22% for Haswell nodes and 4.76% for KNL nodes. KNL nodes are more balanced in memory occupancy but still show considerable imbalance. These observations motivate future work on disaggregation-aware job schedulers. Although not illustrated, heatmaps for memory bandwidth and NIC bandwidth provide similar insights. For instance, the standard deviation of the maximum memory bandwidth across Haswell nodes is 17.34 GB/s, and that of NIC bandwidth utilization is 23.14% across all nodes.
3.11 How Soon Jobs Become Predictable
Subsequently, we ask how soon a job's resource usage becomes a good predictor (i.e., indicative) of its future usage. This is useful for determining how soon a system can observe a job's behavior and make an informed decision about how many resources to assign to it. Fig. 9 illustrates the percentage of a job's runtime until each of the three metrics shown reaches at least 80% of the job's average usage over its lifetime. For instance, in the case of memory occupancy, a value of 20% means that at 20% of its runtime, a job reserves 80% or more of the job's average memory occupancy, calculated over the job's entire runtime.
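The per-job calculation can be sketched as follows, where values is a job's per-second time series for one metric.

import numpy as np

def pct_runtime_until_predictable(values, threshold=0.8):
    """Fraction of a job's runtime until a metric first reaches `threshold`
    times the job's lifetime average for that metric.

    A sketch of the per-job calculation described above; a return value of
    20.0 means the job reaches 80% of its lifetime-average usage for this
    metric at 20% of its runtime.
    """
    values = np.asarray(values, dtype=float)
    target = threshold * values.mean()
    hits = np.nonzero(values >= target)[0]
    if len(hits) == 0:
        return 100.0  # never reaches the threshold
    return 100.0 * hits[0] / len(values)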
As shown by the CDFs, the majority of jobs become predictable early. For instance, over 85% of jobs reach 80% of their average memory occupancy by 20% of their runtime. This number becomes 92% for NIC bandwidth and 79% for memory bandwidth. In fact, over 60% of jobs are predictable by 5% of their runtime for all three metrics. This observation is encouraging in that systems that rely on observing job resource usage do not have to wait long after a job starts to gain a good understanding of the job's resource usage.
3.12 Temporal Behavior
In this section, we investigate the temporal variability of the average and maximum memory bandwidth across Haswell nodes. We choose memory bandwidth to focus on memory usage and because we observe that memory occupancy does not vary substantially with time, as discussed in Section 3.9. Temporal variability is useful for indicating how many valleys and peaks appear in system-wide usage and what their spacing is.
Fig. 10 shows a time-series graph for the first 15 days of our sampling period, to better observe behavior in detail. Maximum memory bandwidth shows bursts, usually correlated with a burst in average memory bandwidth. Maximum memory bandwidth also shows noticeable 30 s variability, as indicated by the shaded region. Average memory bandwidth remains low, but does show short bursts of \(2\times\) or \(3\times\) a few times a day. These bursts are relatively more frequent and more intense than those of the maximum memory bandwidth, but not significantly so. Also, an important observation is that even though system-wide average memory bandwidth shows these bursts, it never exceeds 5 GB/s.
3.13 Correlation Between Metrics
In this section, we investigate whether there is correlation between how jobs use different resources. Fig. 11 shows scatterplots that correlate different metrics. Each data point represents a single job, defined by a separate job ID in Cori's job queue. A data point's X and Y values are that job's average resource utilization (for each metric on the axes) throughout its runtime. A job's average IPC is measured by averaging the IPC of all cores in nodes assigned to the job. A core's IPC is calculated by dividing the core's total executed instructions by the number of cycles, for the duration the core was assigned to the job. A core's instructions and cycles are reported by PAPI counters [16].
Graphs also include the calculated correlation factor between the two metrics in each graph for all job samples. The correlation factor ranges from \(-1.0\) to 1.0, where \(-1.0\) means strong negative correlation, 0 means no correlation, and 1.0 means strong positive correlation. For instance, no correlation means that an increase in one value shows no tendency for the other value to change in either direction.
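A sketch of this per-job reduction and correlation is shown below, again assuming a hypothetical per-(job, node, second) sample table; Pearson correlation (pandas' default) is an assumption, as the exact coefficient is not stated here.

import pandas as pd

def job_metric_correlation(samples: pd.DataFrame, metric_x: str, metric_y: str) -> float:
    """Correlation factor between two metrics across jobs.

    Each job is first reduced to its lifetime-average value per metric, then
    the correlation of those per-job averages is computed (Pearson, pandas'
    default, which is an assumption of this sketch).
    """
    per_job = samples.groupby("job_id")[[metric_x, metric_y]].mean()
    return per_job[metric_x].corr(per_job[metric_y])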
As shown, jobs that occupy more memory are only slightly more likely to use more memory bandwidth than other jobs. Similarly, jobs that use higher NIC bandwidth are slightly more likely to use higher memory bandwidth. Finally, there is practically no correlation between IPC and memory bandwidth, or between memory occupancy and NIC bandwidth.
4 Deep-learning Applications Analysis
Many of today's HPC systems deploy GPUs as accelerators [76]. At the same time, an increasing number of ML workloads execute on HPC systems, and cloud providers continue to increase their HPC offerings. Therefore, to supplement our system-wide data, we analyze a set of ML workloads on a GPU-accelerated system to illustrate resource underutilization and show how much disaggregation these workloads motivate. We study ML/DL workloads because we believe they will represent a significant share of applications on future HPC systems. We therefore conduct a set of controlled experiments using representative workloads on a typical GPU-accelerated system.
4.1 Methodology
The workloads we study are part of the MLPerf benchmark suite version 0.7 [59]. MLPerf is maintained by a large consortium of companies that are pioneering AI in terms of both neural network design and system architectures for training and inference. The workloads in the suite have been developed by world-leading research organizations and are representative of workloads that execute in actual production systems. We select a range of different neural networks, representing different applications:
Transformer [79]: Transformer's novel multi-head attention mechanism allows for better parallel processing of the input sequence and therefore faster training times, and it also overcomes the vanishing gradient issue that typical RNNs suffer from [39]. For these reasons, Transformers became state-of-the-art for natural language processing tasks, including machine translation, time series prediction, and text understanding and generation. Transformers are the fundamental building block for networks like bidirectional encoder representations from transformers (BERT) [24] and GPT [15]. It has also been demonstrated that Transformers can be used for vision tasks [28].
BERT [24]: The BERT network implements only the encoder and is designed to be a language model. Training is often done in two phases: the first phase is unsupervised, to learn representations, and the second phase fine-tunes the network with labeled data. Language models are deployed in translation systems and human-to-machine interactions. We focus on supervised fine-tuning.
ResNet50 [37]: Vision tasks, particularly image classification, were among the first to make DL popular. First developed by Microsoft, ResNet50 (Residual Network with 50 layers) is often regarded as a standard benchmark for DL tasks and is one of the most used DL networks for image classification and segmentation, object detection, and other vision tasks.
DLRM [63]: The last benchmark in our study is the deep-learning recommendation model (DLRM). Recommender systems differ from the other networks in that they deploy vast embedding tables. These tables are sparsely accessed before a dense representation is fed into a more classical neural network. Many companies deploy such systems to offer customers recommendations based on the items they bought or the content they enjoyed.
All workloads are run using the official MLPerf docker containers on the datasets that are also used for the official benchmark results. Given the large number of hyperparameters needed to run these networks, we refer the reader to the docker containers and scripts used to run the benchmarks. We only adapt batch sizes and denote them in our results.
4.1.1 Test System for Our Experiments.
We run our experiments on a cluster comprising NVIDIA DGX1 nodes [4]. Each node is equipped with eight Volta-class V100 GPUs with 32 GB of HBM memory. The GPUs of a single node are interconnected in a hybrid mesh-cube using NVLink, giving each GPU a total NVLink bandwidth of 1.2 Tbps. Furthermore, each node provides four ConnectX-4 InfiniBand EDR NICs for a total node-to-node bandwidth of 400 Gbps. We run experiments on up to 16 nodes. A DGX1 system is a typical server configuration for training and therefore representative of real deployments; for example, Facebook deploys similar systems [72]. We note that inference is often run on a different system and a more service-oriented architecture. However, we still run inference on the same system to highlight differences in resource allocation, showing that a disaggregated system can support both workloads efficiently.
4.1.2 Measurement of Resource Utilization.
CPU and host memory metrics are collected through psutil [6], GPU metrics (utilization, memory, NVLink, and PCIe) come from the NVML library [5], and InfiniBand statistics are collected through hardware counters provided by the NICs. NVML also provides a "memory utilization" metric, which is a measure of how many cycles the memory is busy processing requests. A utilization of 100% means that the memory is busy processing a request every cycle.
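A minimal sketch of one such 1 s sample using psutil and NVIDIA's NVML Python bindings (pynvml) is shown below; the exact fields collected by our script may differ, so the selection here is illustrative.

import psutil
import pynvml  # NVIDIA's NVML Python bindings

def sample_once():
    """One sample of CPU, host-memory, and GPU metrics.

    Illustrative sketch of the measurement setup described above; the actual
    collection script may gather additional or different fields.
    """
    pynvml.nvmlInit()
    sample = {
        "cpu_util_pct": psutil.cpu_percent(interval=1.0),
        "host_mem_pct": psutil.virtual_memory().percent,
        "gpus": [],
    }
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        sample["gpus"].append({
            "gpu_util_pct": util.gpu,      # fraction of cycles the GPU was busy
            "mem_util_pct": util.memory,   # fraction of cycles memory serviced requests
            "mem_used_bytes": mem.used,
        })
    pynvml.nvmlShutdown()
    return sample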
We launch a script that gathers the metrics at a sampling period of 1 s before starting the workload. We found the overhead of profiling to be small and therefore negligible. Although a full training run of the aforementioned neural networks can take many hours, the workloads are repetitive, so it is sufficient to limit measurements to a shorter period.
For compute utilization, we only consider utilization above 5% in the geometric mean calculations. This excludes idle periods at the beginning, for example. For memory capacity, we take the maximum capacity during the workload's runtime. Similar to our Cori results, here we also find that once memory is allocated, it remains mostly constant. Bandwidth is shown as the geometric mean. We exclude outliers by filtering out values outside of 1.5\(\times\) the interquartile range (IQR), including values of zero. It is important to note that a sampling period of 1 s is insufficient to determine peak bandwidth for high-speed links. However, it is still a valuable proxy for average utilization.
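The outlier filtering and geometric mean can be sketched as follows.

import numpy as np

def filtered_geometric_mean(values):
    """Geometric mean of bandwidth samples after IQR-based outlier filtering.

    Sketch of the post-processing described above: zeros and samples outside
    1.5x the interquartile range are discarded before taking the geometric mean.
    """
    x = np.asarray(values, dtype=float)
    x = x[x > 0]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    keep = (x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)
    x = x[keep]
    return float(np.exp(np.mean(np.log(x))))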
4.2 Training Versus Inference
Machine learning typically consists of two phases: training and inference. During training, the network learns and optimizes parameters from a carefully curated dataset. Training is a throughput-critical task, and input samples are batched together to increase efficiency. Inference, however, is done on a trained and deployed model and is often latency-sensitive. Input batches are usually smaller, and there is less computation and a lower memory footprint, as no errors need to be backpropagated and parameters are not optimized.
We measure various metrics for training and inference runs of BERT and ResNet50. Results are shown in Fig. 12 (left). GPU utilization is high for BERT during both training and inference, in terms of both compute and memory capacity utilization. However, CPU compute and memory capacity utilization is low. ResNet50, in contrast, shows high CPU compute utilization, which is even higher during inference. Inference requires significantly less computation, which means the CPU is more utilized compared to training in order to provide the data and launch the work on the GPU. Training consumes significantly more GPU resources, especially memory capacity. This is not surprising, as these workloads were designed for maximal performance on GPUs. Certain parts of the system, notably CPU resources, remain underutilized, which motivates disaggregation. Further, training and inference have different requirements, and disaggregation helps provision resources accordingly.
The need for disaggregation is also evident in NVIDIA's introduction of the multi-instance GPU [22], which allows a GPU to be partitioned into seven independent, smaller GPUs. Our work takes this further and considers disaggregation at the rack scale.
4.3 Training Resource Utilization
Fig. 12 (right) shows resource utilization during the training of various MLPerf workloads. These benchmarks are run on a single DGX1 system with 8 Volta V100 GPUs. While CPU utilization is generally low, GPU utilization is consistently high across all workloads. We also depict the bandwidth utilization of NVLink and PCIe. We note that bandwidth is shown as the average effective bandwidth across all GPUs for the entire measurement period. In addition, we can only sample in intervals of 1 s, which limits our ability to capture peaks on high-speed links like NVLink. Nonetheless, we observe that the overall effective bandwidth is low, which suggests that the links are not highly utilized on average.
All of the shown workloads are data-parallel, while DLRM also implements model parallelism. In data parallelism, parameters need to be reduced across all workers, resulting in an all-reduce operation for every optimization step. As a result, the network can be underutilized during computation of the parameter gradients. The highest bandwidth utilization is from DLRM, which is attributed to its model-parallel phase and its all-to-all communication.
4.4 Inter-node Scaling
Another factor in utilization is inter-node scaling. We run BERT and ResNet50 on up to 16 DGX1 systems connected via InfiniBand EDR. The results are depicted in Fig. 13. For ResNet50, we also distinguish between weak scaling and strong scaling; BERT is shown for weak scaling only. Weak scaling is preferred as it generally leads to higher efficiency and utilization, as shown by our results. In data parallelism, weak scaling means we keep the number of input samples per GPU, referred to as the sub-batch (SB), constant, so the effective global batch size increases as we scale out. At some point the global batch size reaches a critical limit after which the network stops converging or converges more slowly, so that any performance benefit diminishes. At this point, strong scaling becomes the only option to further reduce training time.
As Fig. 13 shows, strong scaling increases the bandwidth requirements, both intra- and inter-node, but reduces the compute and memory utilization of individual GPUs. While some under-utilization is still beneficial in terms of total training time, it eventually becomes too inefficient. Many neural networks train for hours or even days on small-scale systems, rendering large-scale training necessary. This naturally leads to underutilization of certain resources, which motivates disaggregation to allow those resources to be used for other tasks.
6 Discussion
6.1 Limitations of Sampling a Production System
Sampling a production system has significant value, because it demonstrates what users actually execute and how resources are used in practice. At the same time, sampling a production system inevitably has practical limitations. For instance, it requires privileged access and sampling infrastructure in place. In addition, even though Cori is a top 20 system that executes a wide array of open-science HPC applications, observations are affected by the set of applications and the hardware configuration of the system. Therefore, HPC systems substantially different from Cori can use our analysis as a framework to repeat a similar study. In addition, sampling typically does not capture application executable files or input data sets. Therefore, reconstructing our analysis in a system simulator is impossible, and it would also be impractical due to the vast slowdown of simulators for large-scale simulations compared to a real system. Similarly, sampling a production system provides no way to detect when application demands exceed available resources, or the resulting slowdown. For this reason, and because of our 1 s sampling period, our study focuses on sustained behavior and cannot make claims about the impact of resource disaggregation on application performance.
6.2 Hardware Implementation of Disaggregation
As we outline in Section 2.2, past work has pursued a variety of hardware implementations and job schedulers for resource disaggregation. Our study quantitatively shows the potential benefits of intra-rack disaggregation in a system similar to Cori, allowing future research on hardware implementations to perform a more informed cost-benefit analysis. In addition, the topics we study in the second half of Section 3 help guide important related decisions, such as how often a disaggregated system should be reconfigured. Therefore, future work can better decide whether to pursue a full-system, intra-rack, or hierarchical implementation of resource disaggregation, and for which resources.
6.3 Future Applications and Systems
While it is hard to speculate how important HPC applications will evolve over the next decade, we have witnessed little change in fundamental HPC algorithms during the previous decade. It is those fundamentals that currently cause the imbalance that motivates resource disaggregation. Another consideration is application resource demands relative to the resources available in future systems. For instance, if future applications require significantly more memory than is available per CPU today, this may motivate full-system disaggregation of memory, especially if there is significant variability across applications. Similarly, if a subset of future applications requests non-volatile memory (NVM), this may also motivate full-system disaggregation of NVM, similar to how file system storage is disaggregated today.
However, future systems may have larger racks or nodes with more resources, strengthening the case for intra-rack resource disaggregation. When it comes to specialized fixed-function accelerators, a key question is how much data transfer they require and how many applications can use them. This can help determine which fixed-function accelerators should be disaggregated within racks, hierarchically, or across the system. Different resources can be disaggregated at different ranges. Ultimately, the choice should be made for each resource type for a given mix of applications, following an analysis similar to our study.
6.4 Future Work
Future work should explore the performance and cost tradeoff when allocating resources to applications whose utilization is dynamic. For instance, providing only enough memory bandwidth to satisfy an application's average demand is more likely to degrade the application's performance, but increases average resource utilization. Future work should also consider the impact of resource disaggregation on application performance, taking into account the underlying hardware that implements resource disaggregation and the software stack. Job scheduling for heterogeneous HPC systems [27] should be aware of job resource usage and disaggregation hardware limitations. For instance, scheduling all nodes of an application in the same rack benefits locality but also increases the probability that all nodes will stress the same resource, thus hurting resource disaggregation.