The decline of Moore’s Law, Dennard scaling’s demise, and environmental sustainability constraints have shaped the infrastructure landscape. So far, computing systems have been judged on peak performance or idealized benchmark performance with little regard for energy use or environmental impact. To find a path to grow artificial intelligence (AI) and cloud computing efficiently and responsibly, we need new metrics to guide us. Computing systems should now be measured on how they leverage available power and on their carbon footprint.
The Table provides a roadmap of performance/cost metrics in order of increasing recency, accuracy, and importance. We conclude with an example using real workloads that illustrates a 3x–13x gain over a conventional solution by paying attention to new metrics, such as goodput for performance and data-center power and carbon emissions for cost.
Peak Performance/Purchase Price
Peak performance is the status quo, especially for emerging AI accelerators, as it is easy to calculate and it showcases maximum speed. Alas, it does not predict actual performance.4 The flaw of using purchase price is that it focuses on today's chip cost rather than lifetime system cost.
Benchmark Performance/Total Cost of Ownership
Benchmarks such as MLPerf8 were invented to improve prediction of real performance. The TPC-C benchmark added maintenance cost to purchase price,4 leading eventually to Total Cost of Ownership (TCO):1

TCO = CapEx + OpEx

where CapEx (Capital Expenditure) is the purchase price and OpEx (Operational Expenditure) is the cost paid during the lifetime of the chip and infrastructure. OpEx covers the electricity consumed, including power distribution and cooling, plus the cost of datacenter space over the server's N-year lifetime. TPC-C sets N to three years. With the slowing of Moore's Law, new server CPUs deliver in excess of 100 cores rather than much faster ones, and performance/cost these days hardly improves. Most software is not designed for numerous cores, so in practice new servers are barely faster than old servers. With such small performance/cost gains, servers are replaced less frequently, stretching N from three years a decade ago to perhaps six to eight years today. In some systems today, OpEx is half of the TCO.
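To make the arithmetic concrete, here is a minimal sketch of how the lifetime N and the annual operating cost shift TCO. The dollar figures are illustrative placeholders, not data from this column.

```python
# Hedged sketch: how lifetime (N) and operating costs shift TCO.
# All dollar figures are illustrative placeholders, not measured data.

def tco(capex, annual_opex, years):
    """Total Cost of Ownership = purchase price + operating cost over the lifetime."""
    return capex + annual_opex * years

capex = 10_000        # hypothetical purchase price of one server
annual_opex = 1_400   # hypothetical yearly cost of electricity, cooling, and DC space

for years in (3, 7):
    total = tco(capex, annual_opex, years)
    opex_share = annual_opex * years / total
    print(f"N={years} years: TCO=${total:,}  (OpEx is {opex_share:.0%} of TCO)")
```

Under these placeholder numbers, stretching N from three to seven years pushes OpEx toward half of the TCO, which is why operating-cost-oriented metrics matter more as servers age in place.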
Workload Goodput/TCO
A problem with using benchmarks for performance is that they do not age well. As benchmark results affect sales, they immediately become the target of engineering efforts that help the benchmarks but not necessarily real programs, so they lose their predictive value if refreshed infrequently. They also often target chip performance rather than system performance. Popular benchmarks such as Coremark, Dhrystone, Linpack, and SPEC2017 are all cautionary examples.
Running the actual workload, such as production AI training, is obviously more accurate than chip benchmarks. However, it is also important to capture whether computers are underutilized or computation is wasted. Goodput is a networking term that counts only the information bits actually delivered, subtracting protocol overhead and retransmissions due to failures. We borrow that term here to adjust workload performance by subtracting effort wasted on underutilization or unreliability problems.
As an example of underutilization, AI training and many other applications are bulk synchronous parallel,3 where all the computers in a system operate concurrently for one step and then exchange messages. They all wait until all messages are received, so communication speed is important. As the time per step is set by the slowest computer, load balance is critical to ensure that all computers do useful work. Stragglers can significantly degrade the overall performance of large-scale synchronous jobs, such as AI training.2 A final goodput adjustment is to remove wasted work from unreliability. The overhead of software error detection, checkpointing, and error recovery can be substantial. Worryingly, hardware errors are increasing, with many going undetected, as we follow Moore's Law to tinier transistors.5,9
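A minimal sketch of how goodput discounts raw throughput for stragglers and failures follows; the per-step times and overhead fractions are hypothetical, chosen only to make the bookkeeping concrete.

```python
# Hedged sketch: goodput = useful work delivered after subtracting waste from
# load imbalance (stragglers) and unreliability (checkpoint/restart overhead).
# All numbers below are hypothetical.

median_step_time = 1.00     # seconds of useful compute per synchronous step
straggler_step_time = 1.25  # every step waits for the slowest worker
steps_per_run = 10_000
checkpoint_overhead = 0.05  # 5% of time spent writing checkpoints
rework_fraction = 0.03      # 3% of steps recomputed after failures

ideal_time = median_step_time * steps_per_run
actual_time = straggler_step_time * steps_per_run
actual_time *= (1 + checkpoint_overhead)   # time lost to checkpointing
actual_time *= (1 + rework_fraction)       # time lost to recomputation after errors

# Goodput expressed as a fraction of the ideal throughput.
goodput_ratio = ideal_time / actual_time
print(f"Goodput is {goodput_ratio:.0%} of the ideal throughput")
```

Even with these modest placeholder overheads, roughly a quarter of the nominal throughput evaporates, which is exactly what a benchmark-only view would miss.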
Workload Goodput/Datacenter Power
The TCO formula implicitly assumes that sufficient datacenter (DC) capacity is available to house new servers, because the TCO formula includes DC provisioning cost. However, building new DCs is not always possible or affordable, especially at large scale, as some local electric utilities have practical limits to the maximum power available. Capital expenditure for DCs is also limited, in part because it needs to be spent up front. The combination of zoning, environmental regulations, competition for electricity from other customers, the desire to reduce dirty energy sources, and limited capital puts significant pressure on building new DCs. Thus, an important new metric to join goodput/TCO is the performance that fits within a current DC’s power envelope.
Not all DC power gets to the servers inside. Power Usage Effectiveness (PUE) is the industry-standard metric of DC efficiency, defined as the ratio of total energy usage, including all overheads like cooling and power distribution, to the energy directly consumed by the DC's computing equipment. If 1.5W of power must be delivered to the DC to ultimately deliver 1W to a server after accounting for distribution overheads and cooling, the PUE is 1.5. This metric rewards DCs that reduce PUE, as they can hold more servers. In 2007, the typical PUE was 2.5 (150% overhead), but by 2022 cloud providers had cut PUEs to 1.1 (10% overhead).7 Reducing energy overhead 15x in 15 years illustrates the impact of new metrics.
A simple way to calculate the maximum number of servers per DC is to divide the electrical power available by the PUE and then divide that result by the maximum electricity consumption per server. The last term is limited by the Thermal Design Power (TDP), the maximum heat a computer's cooling system is designed to dissipate. In practice, not all servers in a DC operate close to TDP simultaneously. An optimization is to instead deploy more servers assuming an over-provisioned TDP capacity, called the Oversubscription Rate (OSR).1 We only need a backoff method for the rare times when the actual power drawn by many servers in a DC is too high, which is much easier for AI workloads, such as training or bulk inference, than for traditional user-facing applications. This insight means that the number of servers per DC can be based more on average power than on the worst-case TDP, increasing the importance of reducing average power.
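The capacity arithmetic above can be written out directly. The sketch below uses made-up DC and server figures to show how PUE and an oversubscription rate change how many servers fit within a fixed power envelope.

```python
# Hedged sketch: how many servers fit in a DC's power envelope.
# Servers ~= (DC power / PUE) / per-server power, optionally boosted by an OSR.
# The DC size, PUE, TDP, and OSR values are hypothetical.

dc_power_watts = 10_000_000   # 10 MW available from the utility
pue = 1.1                     # power usage effectiveness
server_tdp_watts = 500        # worst-case (TDP) draw per server
osr = 1.3                     # oversubscription rate: size to average power, not TDP

power_to_servers = dc_power_watts / pue          # power that actually reaches IT gear

servers_at_tdp = int(power_to_servers / server_tdp_watts)
servers_with_osr = int(power_to_servers / server_tdp_watts * osr)

print(f"Sized to TDP:          {servers_at_tdp:,} servers")
print(f"With oversubscription: {servers_with_osr:,} servers")
```

Under these placeholder numbers, oversubscription adds roughly 30% more servers to the same building, which is the point of basing deployment on average rather than worst-case power.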
Workload Goodput/CO2e
Environmental sustainability is increasingly crucial, which means information technology should minimize its carbon footprint. The footprint is measured in carbon dioxide equivalent (CO2e) emissions, which also account for other greenhouse gases, such as methane. The CO2e unit is the metric ton (t), or 1,000kg. Therefore, an important metric alongside goodput/TCO and goodput/DC power is goodput/CO2e.
CO2e can be divided into the operational portion—from running computer equipment—and the embodied portion, from manufacturing it and building DCs. Embodied CO2e also includes the sourcing of raw materials, upstream energy use by suppliers to the manufacturers, and transportation.
Environmental organizations define rules to ensure that operational and embodied emissions are accounted for and allocated correctly at the proper stages of a value chain. Multiple standards specify boundaries and ensure all emissions are accounted for appropriately. Corporate accounting commonly uses the Greenhouse Gas Protocol (GHGP), while product-level carbon footprinting commonly uses ISO standards 14040 and 14044. Both GHGP and these ISO standards encompass all operational and embodied emissions (further refined as scope 1, 2, and 3).
Operational CO2e is well understood, and it depends heavily on where the energy is consumed.10 If the local grid relies primarily on renewable energy sources instead of fossil fuels, the footprint can drop 10x. The unit of grid carbon intensity is grams of CO2e per kilowatt-hour (kWh). The worldwide average today is 475, but sites can drop below 60 by using solar or wind power and rise above 700 by burning coal.
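To make the grid-intensity numbers concrete, the short sketch below estimates annual operational CO2e for one server at the three intensities quoted above. The server's average power draw is a hypothetical figure, not data from this column.

```python
# Hedged sketch: annual operational CO2e for one server at different grid intensities.
# Grid intensities (g CO2e/kWh) come from the text; the 400 W average draw is hypothetical.

avg_power_kw = 0.4                             # assumed average draw, including PUE overhead
hours_per_year = 24 * 365
kwh_per_year = avg_power_kw * hours_per_year   # ~3,504 kWh

grid_intensity = {"solar/wind site": 60, "world average": 475, "coal-heavy grid": 700}

for site, grams_per_kwh in grid_intensity.items():
    tonnes = kwh_per_year * grams_per_kwh / 1e6   # grams -> metric tons
    print(f"{site:>16}: {tonnes:.2f} t CO2e per year")
```

The same server swings from roughly 0.2t to 2.5t per year depending only on where it is plugged in, which is the 10x lever the text describes.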
Unfortunately, the embodied CO2e of CPU servers and AI accelerators is rarely published. The range of embodied CO2e per server from the limited publications is 1t to 4t.11 Given the high variance, more investigation is necessary. The embodied footprint of DC buildings is also less documented, but it is likely much less than that of the servers inside given the 20-year amortization of DCs.7 Embodied emissions, like operational emissions, also depend heavily on where the chips are manufactured, as half of them come from electricity use.12 The top countries that manufacture chips are Taiwan, South Korea, and Japan, where the grid carbon intensity is still high at 542, 457, and 594 grams/kWh, respectively. Given these drawbacks to chip manufacturing and the grid decarbonization plans of many countries that deploy most chips, embodied CO2e is fundamental to the carbon footprint of information technology and needs to be better quantified and tracked.11
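A quick comparison of amortized embodied CO2e with the operational figures sketched earlier suggests why embodied emissions matter most on a clean grid. The embodied range comes from the text; the seven-year lifetime and the operational numbers are the same hypothetical placeholders used above.

```python
# Hedged sketch: per-year embodied CO2e (amortized over the server lifetime) versus
# the operational CO2e estimated earlier. Embodied range (1-4 t) is from the text;
# the 7-year lifetime and operational figures are hypothetical placeholders.

lifetime_years = 7
operational_t_per_year = {"solar/wind site": 0.21, "world average": 1.66}

for embodied_t in (1.0, 4.0):
    embodied_per_year = embodied_t / lifetime_years
    for site, op in operational_t_per_year.items():
        share = embodied_per_year / (embodied_per_year + op)
        print(f"embodied {embodied_t:.0f} t, {site:>15}: "
              f"embodied is {share:.0%} of total yearly CO2e")
```

Under these assumptions, embodied emissions are a small slice on an average grid but can dominate the footprint of a server running on solar or wind power, reinforcing why they need to be quantified.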
An Example
To illustrate the metrics and their importance, the last column of the Table compares a hypothetical deployment of two recent AI accelerators: Google's TPU v3 and TPU v4. Public data6 records:
• The performance ratio of peak floating point operations per second is 2.24.
• The MLPerf benchmark performance ratio of TPU v4 over TPU v3 is 3.14.
• The average performance ratio for the training of representative production AI models is 2.10.
A few assumptions supply the missing parameters:
• The hypothetical relative purchase price is 1.2x for TPU v4, given its larger chip size.4,6
• The hypothetical relative OpEx is 0.8, based on TPU v4's lower average power.6
• The hypothetical relative TCO is ~1.0 (assuming a 50-50 split between price and OpEx).
• The hypothetical goodput of TPU v4 is another 1.2x due to its optical circuit switches,6 which improve communication speed and reduce downtime by quickly substituting spares for failed TPUs.
Performance/cost in the Table swings from 1.9x for TPU v4 over TPU v3 under the peak performance/purchase price metric to 3.2x under the benchmark/TCO metric.
The other metrics offer much larger gains. By taking advantage of oversubscription for TPU v4, as opposed to a standard allocation for TPU v3, the goodput/DC power metric becomes 6x for TPU v4. Even after accounting for the larger energy use and embodied carbon footprint of the additional servers that oversubscription enables, building a TPU v4 DC near green energy instead of at an average location raises the goodput/CO2e metric value to 26x.
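The simpler Table ratios can be roughly recomputed from the numbers listed above; the sketch below does that arithmetic. The exact Table entries may differ slightly because they were presumably derived from more precise underlying data, and the 6x and 26x values additionally fold in oversubscription and a cleaner grid, which the listed numbers alone do not determine.

```python
# Hedged sketch: recompute the simpler Table ratios (TPU v4 over TPU v3) from the
# public numbers and hypothetical assumptions listed above. The 6x and 26x values
# also fold in oversubscription and a cleaner grid, which are not derived here.

peak_ratio = 2.24        # peak FLOPS, v4/v3 (public data)
mlperf_ratio = 3.14      # MLPerf benchmark, v4/v3 (public data)
workload_ratio = 2.10    # production AI training, v4/v3 (public data)

price_ratio = 1.2        # hypothetical purchase price, v4/v3
opex_ratio = 0.8         # hypothetical OpEx, v4/v3
power_ratio = 0.8        # assumed to track OpEx, since OpEx reflects lower average power
tco_ratio = (price_ratio + opex_ratio) / 2   # ~1.0 with a 50-50 price/OpEx split
goodput_bonus = 1.2      # optical circuit switches: better utilization and reliability

goodput_ratio = workload_ratio * goodput_bonus

print(f"peak/purchase price : {peak_ratio / price_ratio:.1f}x")    # ~1.9x
print(f"benchmark/TCO       : {mlperf_ratio / tco_ratio:.1f}x")    # ~3.1x (Table: 3.2x)
print(f"goodput/TCO         : {goodput_ratio / tco_ratio:.1f}x")
print(f"goodput/DC power    : {goodput_ratio / power_ratio:.1f}x (before oversubscription)")
```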
Conclusion
It's hard to improve what you do not measure. The proposed metrics upgrade the conventional performance/cost equation in both the numerator and the denominator. The former improved from peak performance to benchmark performance to goodput, while the latter advanced from purchase price to TCO, DC power capacity, and carbon emissions. To find the best solution, infrastructure architects should co-optimize three metrics: goodput/TCO, goodput/DC power, and goodput/CO2e.
Given where we are with Moore’s Law, Dennard scaling, and environmental sustainability, our information technology community should:
• Reduce average power consumption of hardware we design or purchase.
• Request tools to measure a program's energy use and CO2e and then reduce its footprint.
• Refine manufacturing processes so that all computers and components can eventually be labeled with their embodied CO2e.
• Recruit clean energy sites for new data centers and then favor their use.
• Research how to lower the embodied CO2e associated with computer and semiconductor manufacturing.
David Patterson ([email protected]) is a Distinguished Engineer at Google in Mountain View, CA, USA.
Amin Vahdat is a Fellow and senior vice president of Engineering at Google.
Xiaoyu Ma is an engineer at Google.