Managing Correctable Memory Errors On Cisco UCS Servers: White Paper
2016 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.
Contents
Field Recommendations: Correctable Errors and Threshold Policies
Overview of Memory Errors
Classification of Memory Errors
Correctable Versus Uncorrectable Errors
failures. Cisco Unified Computing System (Cisco UCS) servers use ECC memory. Powerful error-correcting
codes, such as those provided by the Intel Xeon processors in Cisco UCS servers, detect memory
errors so that silent data corruption does not occur.
Hard Versus Soft Errors
Errors that are caused by a persistent physical defect are traditionally referred to as hard errors. A hard error may
be caused by an assembly defect such as a solder bridge or cracked solder joint, or may be the result of a defect in
the memory chip itself. Rewriting the memory location and retrying the read access will not eliminate a hard error.
This error will continue to repeat.
Errors caused by a brief electrical disturbance, either inside the DRAM chip or on an external interface, are referred
to as soft errors. Soft errors are transient and do not continue to repeat. If the soft error was the result of a
disturbance during the read operation, then simply retrying the read may yield correct data. If the soft error was
caused by a disturbance that upset the contents of the memory array, then rewriting the memory location will
correct the error.
Hard errors are typically detected by memory tests run by the Cisco UCS BIOS at boot time, and any modules
containing hard errors are mapped out so that they cannot cause errors during runtime. Cisco UCS servers employ
memory patrol scrubbing to automatically detect and correct soft errors during runtime.
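The rewrite-and-retry distinction between hard and soft errors can be illustrated with a minimal Python sketch. The SimulatedCell class, its stuck_at parameter, and the classify_error function are hypothetical names invented for this illustration. They model the reasoning only and do not represent Cisco UCS BIOS or memory-controller behavior.

```python
class SimulatedCell:
    """Toy one-bit memory cell; `stuck_at` models a persistent physical defect."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at  # None means the cell is healthy
        self.value = 0

    def write(self, bit):
        self.value = bit

    def read(self):
        # A stuck cell returns its stuck value no matter what was written.
        return self.value if self.stuck_at is None else self.stuck_at

    def upset(self):
        # Model a transient disturbance that flips the stored bit once.
        self.value ^= 1


def classify_error(cell, expected):
    """Return 'none', 'soft', or 'hard' for a cell that should hold `expected`."""
    if cell.read() == expected:
        return "none"
    cell.write(expected)          # rewrite the memory location...
    if cell.read() == expected:   # ...and retry the read
        return "soft"             # the error did not repeat
    return "hard"                 # the error persists after the rewrite


healthy = SimulatedCell()
healthy.write(1)
healthy.upset()
print(classify_error(healthy, expected=1))

defective = SimulatedCell(stuck_at=0)
defective.write(1)
print(classify_error(defective, expected=1))
```

In this model the upset cell is reported as soft because the rewrite restores the expected value, while the stuck cell is reported as hard because the error repeats after the rewrite.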
program processing. Uncorrectable errors generally cannot be fixed and may make it impossible for the application
or operating system to continue processing.
Increased Capacity
A primary reason for increased error rates is the rapid increase in the size of memory systems. As more bits of
memory are added to the system, the likelihood that any one of them will encounter an error increases. Such
increases in system memory capacity are the result of shrinking DRAM cell geometries (that is, more bits can
be packed on a single die). Since 2008, DRAM capacities have increased 16x, from 512 megabits to 8 gigabits. As
chip capacity has increased, individual bit cells have been getting smaller. As the bit cell gets smaller, the number
of stored charges per bit decreases, making it more difficult to distinguish between a stored 1 and a stored 0. The
basic storage element, or bit cell, in a DRAM chip is a tiny capacitor. DRAM bit cells are inherently leaky, and
smaller bit cells storing fewer charges are less tolerant of this leakage. Additionally, smaller bit cells are more
easily upset by external sources such as alpha particles and cosmic rays. Today's advanced DRAM technologies
deliver up to 8 gigabits of memory on a single die and up to 64 gigabytes of memory on a single memory module.
In addition, today's processors incorporate multiple memory channels on each processor socket and multiple
modules on each channel.
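The effect of capacity on error likelihood can be made concrete with a short calculation. The sketch below assumes that bit errors are independent and uses a purely hypothetical per-bit error probability (P_BIT); neither assumption comes from measured data in this paper. Under those assumptions, the probability that at least one of n bits is in error is 1 - (1 - p)^n.

```python
import math


def p_any_error(bits, p_bit):
    """Probability that at least one of `bits` independent bits is in error."""
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits * math.log1p(-p_bit))


P_BIT = 1e-15   # hypothetical per-bit error probability, for illustration only
MEBI = 2**20
GIBI = 2**30

capacities = [
    ("512-megabit die", 512 * MEBI),
    ("8-gigabit die", 8 * GIBI),
    ("64-gigabyte module", 64 * GIBI * 8),
]

for label, bits in capacities:
    print(f"{label}: {p_any_error(bits, P_BIT):.3e}")
```

For probabilities this small the result scales almost linearly with the number of bits, so the 16x increase in die capacity translates into roughly a 16x increase in the chance that some bit on the die is in error.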
Increased Bandwidth
Memory-system bandwidth has also been increasing steadily. Not only does each processor socket have multiple
memory channels, but the speed of those channels has increased. Just a few years ago, the top speed for DDR2
memory interfaces was 800 mega-transfers per second (Mtps). Using advanced DDR4 memory, the Cisco UCS
B200 M4 Blade Server supports memory channels operating at 2133 Mtps. Ever-increasing operating frequencies,
while providing higher bandwidth, also result in smaller bit times. As individual bit times decrease, timing margin
also decreases, making it more difficult for receiving circuitry to separate each bit from those that precede and
follow it.
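The shrinking bit time can be worked out directly from the transfer rates quoted above. The following sketch is a simple unit conversion, not a timing model; the function name bit_time_ns is introduced here only for illustration.

```python
def bit_time_ns(mega_transfers_per_second):
    """Time available for one transfer on a data line, in nanoseconds."""
    # 1 / (rate * 1e6 transfers per second), expressed in nanoseconds.
    return 1_000.0 / mega_transfers_per_second


for name, rate in [("DDR2 at 800 Mtps", 800), ("DDR4 at 2133 Mtps", 2133)]:
    print(f"{name}: {bit_time_ns(rate):.3f} ns per bit")
```

At 800 Mtps each bit occupies 1.25 ns on the interface, while at 2133 Mtps it occupies about 0.469 ns, leaving well under half the time for the receiver to sample each bit.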
Scrub Protocol
During every normal memory read access, the memory controller checks for and corrects single-bit errors. However,
because of data locality, normal accesses may leave parts of the memory array unread for long periods. Patrol
scrubbing therefore extends the coverage of the usual single-error-correcting, double-error-detecting (SECDED)
ECC to locations that are rarely accessed. The scrub patrol routine reads the entire memory array and corrects
any single-bit errors. This patrol routine
occurs periodically at a predetermined interval, usually once every 24 hours.
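The interaction between SECDED ECC and patrol scrubbing can be sketched in Python. This is a toy model under stated assumptions: it uses a small extended Hamming code with 4 data bits and 4 check bits, whereas real memory controllers protect much wider words, and the specific code used in Cisco UCS servers is not described in this paper. The function names encode, check, and patrol_scrub are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED word (extended Hamming code)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    w = [0] * 8                     # w[1..7] is Hamming(7,4); w[0] is overall parity
    w[3], w[5], w[6], w[7] = d
    w[1] = w[3] ^ w[5] ^ w[7]       # covers positions 1, 3, 5, 7
    w[2] = w[3] ^ w[6] ^ w[7]       # covers positions 2, 3, 6, 7
    w[4] = w[5] ^ w[6] ^ w[7]       # covers positions 4, 5, 6, 7
    w[0] = sum(w[1:]) & 1           # makes the total number of 1 bits even
    return w


def check(w):
    """Return (status, word); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos         # XOR of the positions that hold a 1
    parity_ok = (sum(w) & 1) == 0
    if syndrome == 0 and parity_ok:
        return "ok", w
    if not parity_ok:
        # Odd overall parity: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed = list(w)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    # Nonzero syndrome with even parity: two bits flipped, detect only.
    return "uncorrectable", w


def patrol_scrub(memory):
    """Read every word, write back corrected data, and count the outcomes."""
    counts = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        status, fixed = check(word)
        counts[status] += 1
        if status == "corrected":
            memory[addr] = fixed    # rewriting the location clears the soft error
    return counts


memory = [encode(n) for n in range(16)]
memory[3][5] ^= 1                        # single-bit upset in a data bit
memory[12][2] ^= 1; memory[12][6] ^= 1   # double-bit upset in one word
print(patrol_scrub(memory))
```

Because the scrub pass visits every address, the single-bit upset is found and repaired even if no application ever reads that word, while the double-bit upset is detected and reported but left uncorrected.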
Historically, memory modules with correctable errors were treated the same as modules with uncorrectable errors
based on the following two premises:
Even though several of these modules ran more than 10 weeks with a large number of correctable errors
every week, they never had a single uncorrectable error during or after. This finding suggests that
incidences of correctable errors are not a reliable indicator of subsequent uncorrectable errors.
Six modules experienced an average of 66 uncorrectable errors per week, but had no prior correctable
errors. This data suggests that there is no correlation between incidences of correctable errors and
uncorrectable errors.
Figure 1.
Table 1.
Error Types
Count of Errors
Count of Modules
296
144
5,950,523
16
66
In seven cases (see the red dots in Figure 1), modules experienced both correctable and uncorrectable errors
during the same week. However, the average correctable error count on these modules was more than an order of
magnitude lower than on modules with correctable errors only (157,000 versus 5.9 million, a factor of roughly 38).
Hence, this data debunks the premise that there is a strong correlation between correctable errors and
uncorrectable errors.
Given that there is no correlation between correctable errors and uncorrectable errors based on this study, flagging
a module with only correctable errors for immediate replacement to avoid uncorrectable errors may create
unnecessary server disruptions without actually reducing the risk that an uncorrectable error will occur.
Conclusion
Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to increased memory
error rates. Traditionally, the industry has treated correctable errors in the same way as uncorrectable errors,
requiring the module to be replaced immediately upon alert. Given extensive research showing that correctable
errors are not correlated with uncorrectable errors, and that correctable errors do not degrade system performance,
the Cisco UCS team recommends against immediate replacement of modules with correctable errors. Customers who
experience a degraded memory alert for correctable errors should upgrade to the latest UCSM 2.2(7) or 3.1(1), or
to CIMC 2.0(9) for rack servers in standalone mode. Customers who do not wish to upgrade should reset the memory
error and resume operation. Following this recommendation will avoid unnecessary server disruption.
DIMMs: Reasons to Use Only Cisco Qualified Memory on Cisco UCS Servers
For additional information about Cisco UCS servers, please visit the Cisco UCS server webpage.
Printed in USA
C11-736116-03
04/16