1 Introduction
Persistent memory (PM) technologies offer attractive features for developing storage systems and applications. For example, phase-change memory (PCM) [95], spin-transfer torque RAM (STT-RAM) [62], Intel's Optane DCPMM [10], and the promising vendor-neutral Compute Express Link (CXL)-based PM technologies [34, 91] can support byte-granularity accesses with close-to-DRAM latencies, while also providing durability guarantees. Such new properties have inspired a wide range of PM-based software optimizations [8, 41, 52, 70, 98, 114].
Unfortunately, building correct PM-based software systems is challenging [69, 93]. For example, to ensure persistence, PM writes must be flushed from the CPU cache explicitly via specific instructions (e.g., clflushopt); to ensure ordering, memory fences must be inserted (e.g., mfence). Moreover, to manage PM devices and support PM programming libraries (e.g., PMDK [19]), multiple operating system (OS) kernel subsystems must be revised (e.g., dax, libnvdimm). Such complexity can lead to obscure bugs that hurt system reliability and security.
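To illustrate the flush-then-fence idiom mentioned above, here is a minimal user-space sketch (our own code, not from any PM library): it uses the baseline clflush instruction, which is available on all x86-64 CPUs, rather than clflushopt, so it compiles everywhere; the function name is ours.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (SSE2, baseline on x86-64) */
#include <stdint.h>

/* Illustrative sketch: store a value, explicitly flush its cache line
 * toward the persistence domain, then fence so that later stores cannot
 * be reordered before the flush completes. */
static void persist_store(uint64_t *dst, uint64_t val)
{
    *dst = val;          /* the store alone only reaches the CPU cache */
    _mm_clflush(dst);    /* evict the cache line; clflushopt/clwb are
                          * higher-performance variants on newer CPUs */
    _mm_sfence();        /* order the flush before subsequent stores */
}
```

Forgetting either the flush or the fence is precisely the kind of persistence/ordering bug that PM bug detectors target; on real PM the destination would be a mapped PM region rather than ordinary DRAM, as used here for illustration.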
Addressing this challenge will require cohesive efforts in multiple related directions, including PM bug detection [38, 76, 77, 88, 89], PM programming support [19], and PM specifications [46], among others. All of these directions will benefit from a better understanding of real-world PM-related bug characteristics.
Many studies have been conducted to understand and guide the improvement of software [39, 43, 54, 68, 78, 80]. For example, Lu et al. [78] studied 5,079 patches of six Linux file systems and derived various patterns of file system evolution; the study has inspired various follow-up research on file system reliability [51, 86], and its dataset of file system bug patches has been directly used to evaluate the effectiveness of new bug detection tools [86]. While influential, this study does not cover PM-related issues, as the direct-access (DAX) feature of file systems was introduced after the study was performed. More recently, researchers have studied PM-related bug cases. For example, Neal et al. [88] studied 63 PM bugs (mostly from the PMDK library [19]) and identified two general patterns of PM misuse. While these existing efforts have generated valuable insights for their targets, they do not cover the potential PM-related issues in the Linux kernel.
In this work, we perform the first comprehensive study on PM-related bugs in the Linux kernel. We focus on the Linux kernel for its prime importance in supporting PM programming [14, 16, 20]. Our study is based on 1,553 PM-related patches committed to Linux between January 2011 and December 2021, spanning over 10 years. For each patch, we carefully examine its purpose and logic, which enables us to gain quantitative insights along multiple dimensions:
First, we observe that a large number of PM patches (38.9%) are for maintenance purposes, and a similar portion (38.4%) add new features or improve the efficiency of existing ones. These two major categories reflect the significant effort needed to add PM devices to the Linux ecosystem and to keep the kernel well maintained. Meanwhile, a non-negligible portion (22.7%) are bug patches that fix correctness issues.
Next, we analyze the PM bug patches in depth. We find that the majority of kernel subsystems have been involved in the bug patches (e.g., “arch,” “fs,” “drivers,” “block,” and “mm”), with drivers and file systems being the most “buggy” ones. This reflects the complexity of implementing the PM support correctly in the kernel, especially the nvdimm driver and the DAX file system support.
In terms of bug patterns, we find that classic semantic and concurrency bugs remain pervasive in our dataset (49.7% and 14.8%, respectively), although their root causes differ. Also, many PM bugs are uniquely hardware-dependent (19.0%); these may be caused by misunderstanding of specifications, miscalculation of addresses, and so on, and may lead to missing devices, inaccessible devices, or even security issues.
In terms of bug fixes, we find that PM bugs tend to require more lines of code to fix than the non-PM bugs reported in previous studies [78]. Also, 20.8% of the bugs require modifying multiple kernel subsystems to fix, which reflects their complexity. In extreme cases (0.9%), developers may temporarily "fix" a PM bug by disabling a PM feature, hoping for a major rework in the future. On the other hand, we observe that different PM bugs may be fixed in a similar way by refining sanity checks.
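As a hypothetical illustration of this recurring "refine the sanity check" fix pattern (the function name and constants below are ours, not actual kernel code), many such fixes boil down to adding or tightening a validity check on the input before acting on it:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sketch of the "refine the sanity check" fix pattern:
 * reject a namespace size that is not aligned to the required boundary
 * instead of silently proceeding (which could corrupt metadata). */
static int namespace_set_size(uint64_t size, uint64_t align)
{
    if (align == 0 || (size % align) != 0)
        return -EINVAL;   /* the fix: fail early on misaligned input */
    /* ... proceed with resizing the namespace ... */
    return 0;
}
```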
Moreover, to better understand the conditions for manifesting the issues and help develop effective remedy solutions, we identify a subset of bug patches with relatively complete information and attempt to reproduce them experimentally. We find that configuration parameters in different utilities (e.g., mkfs for creating a file system and ndctl for managing the libnvdimm subsystem) are critically important for manifesting the issues, which suggests that it is necessary to take into account the configuration states when building bug detection tools.
Finally, we look into potential solutions to address the PM-related issues in our study. We examine multiple representative PM bug detectors [76, 77, 88] and find that they are largely inadequate for addressing the PM bugs in our study. However, a few recently proposed non-PM bug detectors [63, 66, 99, 101] could potentially be applied to detect a large portion of PM bugs if a few common challenges (e.g., precise PM emulation, PM-specific configuration, and workload support) are addressed. To better understand the feasibility of extending existing tools for PM bug detection, we further extend an existing bug detector called Dr.Checker [82] to analyze PM kernel modules. With PM-specific modifications, the extended Dr.Checker, which we call Dr.Checker+, can successfully analyze the major PM code paths in the Linux kernel. While its effectiveness is still limited by the capability of the vanilla Dr.Checker, we believe that extending existing tools to work for the PM subsystem is an important first step toward addressing the PM-related challenges exposed in our study.
Note that this article is extended from a conference version [106]. The major changes include (1) collecting and analyzing one additional year of PM-related patches (i.e., January to December 2021), (2) conducting reproducibility experiments and identifying manifestation conditions, (3) analyzing more existing tools and extending Dr.Checker for analyzing PM driver modules, and (4) adding background, bug examples, and so on, to make the article clearer and more complete. We have released our study results, including the dataset and the extended Dr.Checker+, publicly on Git [3]. We hope our study can contribute to the development of effective PM bug detectors and the enhancement of robust PM-based systems.
The rest of the article is organized as follows: Section 2 provides background on the unique hardware and software characteristics of PM devices; Section 3 describes the study methodology; Section 4 presents an overview of PM patches; Section 5 characterizes PM bugs in detail; Section 6 presents our experiments on PM bug reproducibility; Section 7 discusses the implications for bug detection, including our extension of Dr.Checker for analyzing the PM subsystem in the Linux kernel; Section 8 discusses related work; and Section 10 concludes the article.
4 PM Patch Overview
We classify all PM-related patches into three categories, as shown in Table 1: (1) "Bug" means fixing existing correctness issues (e.g., misalignment of NVDIMM namespaces); (2) "Feature" means adding new features (e.g., extending device flags) or improving the efficiency of existing designs; (3) "Maintenance" means code refactoring, compilation, documentation updates, and so on.
Overall, the largest category in our dataset is "Maintenance" (38.9%), which is consistent with previous studies on Linux file system patches [78]. This reflects the significant effort needed to keep PM-related kernel components well maintained. We find that the majority of maintenance patches are related to code refactoring (e.g., removing deprecated functions or driver entry points), and occasionally the refactored code may introduce new correctness issues that need to be fixed by other patches.
The second largest category is "Feature" (38.4%). This reflects the significant changes needed to add PM to a Linux ecosystem that has been optimized for non-PM devices for decades. One interesting observation is that many (40+) feature patches are proactive (e.g., "In preparation for adding more flags, convert the existing flag to a bit-flag" [5]), which may imply that PM-based extensions tend to be well planned in advance. Also, most recent feature patches are related to supporting PM devices on the CXL interface [4], which indicates the rapid evolution of PM-relevant techniques.
The “Bug” patches, which directly represent confirmed and resolved correctness issues in the kernel, account for a non-negligible portion (i.e., 22.7% overall). We analyze this important set of patches further in the next section.
6 PM Bug Reproducibility
To better understand the conditions for manifesting the issues and help derive effective remedy solutions, we further perform a set of reproducibility experiments. We identify a subset of bug patches with relatively complete information and attempt to reproduce them experimentally. We find that configuration parameters in different storage utilities (e.g., mkfs for creating a file system, ndctl for managing the libnvdimm subsystem) are critically important for manifesting the issues, which suggests that it is necessary to take into account the configuration states when developing relevant testing or debugging tools.
As shown in Table 5, we look for five pieces of information from the bug patches when selecting candidates for the reproducibility experiments: (1) configuration parameters (Config), (2) hardware information (HW), (3) emulation platform (Emu), (4) test suite information (Test), and (5) reproducing steps (Step). Based on these five metrics, we identify 39 bug candidates that contain relatively complete information for the experiments.
Table 6 summarizes the reproducibility results. At the time of this writing, we are able to reproduce 11 of the 39 candidates based on the information derived from the patches (✓ in the last column). Most interestingly, we find that the configuration parameters of storage utilities (i.e., mkfs, mount, ndctl) and workloads are critically important for triggering the bugs; these are summarized in the four "Critical Configurations" columns.
For example, in bug #15, when the file system is mounted with both DAX and read-only parameters, accessing the file system results in a “Segmentation fault,” because the DAX fault handler attempts to write to the journal in read-only mode. As another example, the ndctl utility enables configuring the libnvdimm subsystem of the Linux kernel. In bug #35, a resource leak was observed when an existing NVDIMM namespace was reconfigured to “device-dax” mode via ndctl.
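For reference, the trigger conditions for these two cases can be approximated with standard ndctl/mkfs/mount invocations like the following sketch (the device names, namespace name, and mount point are assumptions for illustration; running these requires a system with NVDIMM/PM support):

```shell
# Bug #15 (sketch): a DAX-capable file system mounted read-only;
# accessing it then exercises the DAX fault handler's journal write path.
mkfs.ext4 /dev/pmem0
mount -o dax,ro /dev/pmem0 /mnt/pmem

# Bug #35 (sketch): reconfigure an existing namespace to device-dax
# mode, which leaked resources in the affected kernels.
ndctl create-namespace --mode=fsdax
ndctl create-namespace --reconfig=namespace0.0 --mode=devdax --force
```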
Note that Table 6 only shows the configurations necessary for triggering the bug cases, which may not be sufficient. In fact, many other factors can make a case unreproducible in our experiments. For example, we are unable to reproduce cases that require test cases from the ndctl test suite [17]; these test cases require the "nfit_test.ko" kernel module, which is not available in the generic Linux kernel source tree. Also, some cases depend on specific architectures (e.g., PowerPC) that are incompatible with our experimental systems. In addition, many concurrency bugs such as data races cannot be easily reproduced due to the nondeterminism of thread interleavings.
Overall, we can see that most of the bug cases in Table 6 require specific configurations to trigger, which suggests the importance of considering configurations in testing and debugging [40, 84, 92, 100]. We summarize all the reproducible bug cases, including the necessary triggering conditions and scripts, in a public dataset to facilitate follow-up research [3].
7 Implications on PM Bug Detection
Our study has exposed a variety of PM-related issues, which may help develop effective PM bug detectors and build robust PM systems. For example, since 20.8% of PM bug patches involve multiple kernel subsystems, simply focusing on one subsystem is unlikely to be enough; instead, a full-stack approach is needed, and identifying the potential dependencies among components is critical. On the other hand, since many bugs in different subsystems follow similar patterns, capturing one bug pattern may benefit multiple subsystems (see Section 9.1 for more discussion).
As one step toward addressing the PM-related issues identified in the study, we analyze a few state-of-the-art bug detection tools in this section. We discuss the limitations as well as the opportunities for both PM bug detectors and non-PM bug detectors (Sections 7.1 and 7.2, respectively). Moreover (Section 7.3), we present our efforts and results on extending one state-of-the-art static bug detector (i.e., Dr.Checker [82]) for analyzing PM drivers, which account for the majority of bug cases in our dataset (Figure 2).
7.1 PM Bug Detectors
Multiple PM-specific bug detection tools have been proposed recently, including PMTest [77], XFDetector [76], and AGAMOTTO [88]. These tools mostly focus on user-level PM programs. We have performed bug detection experiments using these tools and are able to verify their effectiveness by reproducing most of the bug detection results reported in the literature. Unfortunately, we find that they are fundamentally limited in capturing the PM bugs in our dataset. For example, XFDetector [76] relies on Intel Pin [21], which can only instrument user-level programs. PMTest [77] can be applied to kernel modules, but it requires manual annotations, which is impractical for major kernel subsystems. AGAMOTTO [88] relies on KLEE [35] to symbolically explore user-level PM programs. While it is possible to integrate KLEE with virtual machines to enable full-stack symbolic execution (as in S2E [42]), novel PM-specific path reduction algorithms are likely needed to avoid the state explosion problem [67]. One recent work, Jaaru [53], leverages commit stores, a common coding pattern in user-level PM programs, to reduce the number of execution paths that need to be explored for model checking. Nevertheless, such an elegant pattern has not been observed in our dataset due to the complexity of kernel-level PM optimizations (Section 5). Therefore, additional innovations in path reduction are likely needed to apply model checking effectively to diverse PM-related bugs in the kernel.
7.2 Non-PM Bug Detectors
Great efforts have been made to detect non-PM bugs in the kernel [13, 25, 27, 63, 66, 85, 99, 101]. For example, CrashMonkey [85] logs the bio requests and emulates crashed disk states to test the crash consistency of traditional file systems. As discussed in Section 5.3, such tricky crash consistency issues exist in PM subsystems, too. Nevertheless, extending CrashMonkey to detect PM bugs would require substantial modifications, including tracking PM accesses and PM-critical instructions (e.g., mfence) and designing PM-specific workloads, among others.
Similarly, fuzzing-based tools have proven effective for kernel bug detection [24, 63, 66, 99, 101]. For example, Syzkaller [24] is a kernel fuzzer that executes kernel code paths by randomizing inputs for various system calls and has been the foundation for building other fuzzers; Razzer [63] combines fuzzing with static analysis and detects data races in multiple kernel subsystems (e.g., "driver," "fs," and "mm"), which could potentially be extended to cover a large portion of the concurrency PM bugs in our dataset. Since Syzkaller, Razzer, and similar fuzzers rely heavily on virtualized (e.g., QEMU [33]) or simplified (e.g., LKL [15]) environments to achieve high efficiency for kernel fuzzing, one common challenge and opportunity in extending them is to emulate PM devices and interfaces precisely to ensure fidelity.
Also, Linux kernel developers have incorporated tools such as the Kernel Address Sanitizer [25], the Undefined Behavior Sanitizer [27], and memory leak detectors (Kmemcheck) [13] within the kernel code to detect various memory bugs (e.g., null pointers, use-after-free, resource leaks). These sanitizers instrument the kernel code during compilation and examine bug patterns at runtime. Like other dynamic tools, they can only detect issues on the executed code paths; in other words, their effectiveness heavily depends on the quality of the inputs. As discussed in Section 6, many PM issues in our dataset require specific configuration parameters from utilities (e.g., mkfs, ndctl) and workloads (e.g., mmap(MAP_SHARED), open(O_TRUNC)) to trigger, so we believe it is important to consider such PM-critical configurations when leveraging existing kernel sanitizers to detect the issues exposed in our study.
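For instance, these sanitizers are enabled at kernel build time via configuration options; the following is a representative (illustrative) fragment, noting that exact option availability varies by kernel version:

```shell
# .config fragment (illustrative): enable in-kernel bug detectors
CONFIG_KASAN=y            # Kernel Address Sanitizer
CONFIG_UBSAN=y            # Undefined Behavior Sanitizer
CONFIG_DEBUG_KMEMLEAK=y   # kernel memory leak detector
```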
As discussed above, while various bug detectors have been proposed and used in practice, addressing PM-related issues in our dataset will likely require PM-oriented innovations, including precise PM emulation, PM-specific configuration and workload support, and so on, which we leave as future work.
7.3 Extending Dr.Checker for Analyzing PM Kernel Modules
To better understand the feasibility of extending existing tools for PM bug detection, we present our efforts and results on extending one existing bug detector, Dr.Checker [82], in this section. We select Dr.Checker for three main reasons: First, it has proven effective for analyzing kernel-level drivers [82], and, as shown in Figure 2 (Section 5.1), drivers account for the majority of bugs in our dataset. Second, Dr.Checker is based on static analysis without dynamic execution, which makes it less sensitive to the limitations discussed in the previous sections (e.g., device emulation, input generation). Third, Dr.Checker employs multiple bug detection algorithms that can detect multiple bug patterns identified in our study (Section 5.2). We name our extension Dr.Checker+ and release it on Git to facilitate follow-up research on PM bug detection [3].
About Dr.Checker and Its Limitations. Dr.Checker [82] mainly uses two static analysis techniques (i.e., points-to and taint analysis) to detect memory-related bugs (e.g., buffer overflows) in generic Linux drivers. It performs flow-sensitive and context-sensitive code traversal to achieve high precision for driver code analysis. One special requirement for applying Dr.Checker is to identify the correct entry points (i.e., functions invoking the driver code) as well as the argument types of those entry points. For example, Figure 10 shows a VFS interface ("struct file_operations") with function pointers that allow userspace programs to invoke operations on the libnvdimm driver module. The functions included in the structure (e.g., "nvdimm_ioctl") are entry points that enable us to manipulate the underlying NVDIMM devices through the driver code. The entry points and their argument types collectively determine the entry types, which in turn determine the taint sources for initiating the relevant analysis. For example, the vanilla Dr.Checker describes an IOCTL entry type where the first argument of the entry point function is marked as PointerArgs (i.e., the argument points to a kernel location that contains the tainted data) and the second argument is marked as TaintedArgs (i.e., the argument contains tainted data and is referenced directly). This entry type is applicable to entry point functions that have a similar signature. In the case of the libnvdimm kernel module, the IOCTL entry type is applicable to two entry point functions (i.e., nd_ioctl and nvdimm_ioctl). In addition, Dr.Checker includes a number of detectors that check specific bug patterns based on the taint analysis (e.g., IntegerOverflowDetector and GlobalVariableRaceDetector).
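To make the entry-point notion concrete, the following self-contained sketch mimics the dispatch-table pattern of "struct file_operations" in plain C. Note that the kernel's real structure has many more fields, and fake_nvdimm_ioctl is our stand-in, not the kernel's actual nvdimm_ioctl:

```c
#include <stddef.h>

/* Opaque stand-in for the kernel's struct file. */
struct file;

/* A miniature dispatch table in the style of struct file_operations:
 * userspace never calls driver functions directly; the VFS reaches them
 * through function pointers, which is why a static analyzer must be told
 * which functions are entry points and which arguments carry taint. */
struct file_operations {
    long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
};

/* Stand-in entry point: cmd and arg arrive from userspace, so a taint
 * analysis would treat them as taint sources (TaintedArgs in Dr.Checker's
 * terminology for directly passed values). */
static long fake_nvdimm_ioctl(struct file *file, unsigned int cmd,
                              unsigned long arg)
{
    (void)file;
    return (long)cmd + (long)arg;   /* placeholder logic */
}

static const struct file_operations nvdimm_fops = {
    .unlocked_ioctl = fake_nvdimm_ioctl,
};
```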
While Dr.Checker has been applied to analyze a number of Linux device drivers [82], it does not support major PM drivers directly. Table 7 summarizes the seven PM driver modules involved in our study: nfit.ko, nd_pmem.ko, nd_blk.ko, nd_btt.ko, libnvdimm.ko, device_dax.ko, and dax.ko. These driver modules provide various support to make PM devices usable on Linux-based systems. For example, dax.ko provides generic support for direct access to PM, which is critical for implementing the DAX feature for Linux file systems including Ext4 and XFS. As discussed in Sections 5 and 6, these PM drivers tend to contain the most PM bugs in our dataset, and many PM bugs require PM driver features to trigger (e.g., -o dax). Based on our study of the bug patterns and relevant source code, we find that these PM drivers have much more diverse entry types compared to the embedded drivers analyzed by the vanilla Dr.Checker [82]. As shown in the second-to-last column of Table 7, the vanilla Dr.Checker [82] supports only one entry type (i.e., ioctl(file)) used in the libnvdimm module, leaving the majority of the PM driver code unattended.
Extending Dr.Checker to Dr.Checker+. To make Dr.Checker work for the major PM drivers, we manually examine the source code of the PM drivers and identify critical entry points. As summarized in Table 8, we have added five new entry types (i.e., DEV_OP, DAX_OP, BLK_OP, GETGEO, MOUNT) besides the original IOCTL entry type. The new entry types include the major entry point functions used in the PM driver modules. Moreover, we identify the critical input arguments and map them to the appropriate taint types defined by Dr.Checker (i.e., TaintedArgs for arguments directly passed by userspace, and PointerArgs and TaintedArgsData for arguments pointing to a kernel memory location that contains tainted data). In addition, to make Dr.Checker work with newer versions of the Linux kernel, we have ported the implementation to the latest version of the LLVM/Clang compiler infrastructure [26]. Overall, as summarized in the last column of Table 7, the enhanced Dr.Checker+ is able to support bug detection in the seven major PM driver modules based on our comprehensive study of the PM driver bugs and the relevant source code.
Experimental Results of Dr.Checker+. We have applied the extended Dr.Checker+ to analyze the seven major PM kernel modules in Linux kernel v5.2. In this set of experiments, we applied four detectors: (1) IntegerOverflowDetector checks for tainted data used in operations (e.g., add, sub, or mul) that may cause an integer overflow or underflow; (2) TaintedPointerDereferenceChecker detects pointers that are tainted and directly dereferenced; (3) GlobalVariableRaceDetector checks for global variables that are accessed without a mutex; and (4) TaintedSizeDetector checks for tainted data used as a size argument in any of the copy_to_ or copy_from_ functions, which may result in information leaks or buffer overflows. Table 9 summarizes the experimental results. Overall, Dr.Checker+ can process all the target kernel modules successfully, and we have observed warnings (i.e., potential issues) reported by its four detectors in five of the seven kernel modules (i.e., nfit.ko, nd_pmem.ko, nd_btt.ko, libnvdimm.ko, and dax.ko). For example, the TaintedPointerDereferenceChecker identified a potential null pointer dereference in the nd_btt driver module, where the pointer variable bdev in the btt_getgeo entry point of nd_btt was accessed without a check for its validity. If the routines invoking the entry point function do not check for the null value before using it, this code may lead to a kernel crash.
Note that the warnings reported by Dr.Checker+ do not necessarily imply PM bugs, due to the conservative static analysis used in Dr.Checker. For example, the GlobalVariableRaceDetector in Dr.Checker falsely reports a warning for any access to a global variable outside of a critical section. Looking into the warnings reported in our experiments, we observed a similar false alarm: Dr.Checker+ may report a warning when the driver code invokes the macro "WARN_ON" to print to the kernel error log, which is benign.
Also, since Dr.Checker's detectors are stateless, they may report a warning for every occurrence of the same issue. For example, the page offset in pmem_dax_direct_access is used to calculate the physical address of a page, which involves bit-manipulation operations whose resulting value may overflow the range of representable integer values; the IntegerOverflowDetector may report a warning whenever the method is invoked.
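The flagged overflow pattern can be reproduced in isolation: shifting a narrow page offset left by PAGE_SHIFT before widening it loses the high bits. The functions below are our own illustration, not the actual pmem_dax_direct_access code:

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KiB pages */

/* Buggy: the shift happens in 32-bit arithmetic, so any page offset at
 * or above 2^20 pages (4 GiB of address space) silently wraps around. */
static uint64_t pgoff_to_phys_buggy(uint32_t pgoff)
{
    return pgoff << PAGE_SHIFT;
}

/* Fixed: widen to 64 bits before shifting. */
static uint64_t pgoff_to_phys_fixed(uint32_t pgoff)
{
    return (uint64_t)pgoff << PAGE_SHIFT;
}
```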
Overall, our experience is that extending Dr.Checker for analyzing PM-related issues in the Linux kernel is non-trivial and time-consuming due to the complexity of the PM subsystem. However, the actual code modification is minor: We only need to modify about 100 LoC in Dr.Checker to cover the major PM driver modules. While the effectiveness is still fundamentally limited by the capability of the core techniques used in the existing tools (e.g., Dr.Checker's static analysis may report false alarms), we believe that extending existing tools to make them work for the PM subsystem in the kernel is an important first step toward addressing the PM-related challenges exposed in our study. We leave the investigation of improving Dr.Checker further (e.g., improving the static analysis algorithms, adding diagnosis support for understanding the root causes of warnings, or fixing the detected warnings) and building new detection tools as future work.
8 Related Work
Studies of Software Bugs. Many researchers have performed empirical studies on bugs in open source software [39, 43, 49, 54, 68, 72, 78, 80, 83, 84, 107, 109]. For example, Lu et al. [80] studied 105 concurrency bugs from four applications and found that atomicity violations and order violations are two common bug patterns; Lu et al. [78] studied 5,079 file system patches (including 1,800 bugs fixed between December 2003 and May 2011) and identified the trends of six file systems; Mahmud et al. [83, 84] studied bug patterns in file systems and utilities and extracted configuration dependencies that may affect the manifestation of bugs. Our study is complementary to these existing efforts, as we focus on bugs related to the latest PM technology, which may involve issues beyond existing foci (e.g., user-level concurrency bugs [49, 80, 109, 110], non-PM file systems [78], cryptographic modules [68], and configurations [83, 84]).
Studies of Production System Failures. Researchers have also studied various failure incidents in production systems [32, 55, 56, 57, 59, 60, 75, 96, 97, 108], many of which are caused by software bugs. For example, Gunawi et al. [55] studied 597 cloud service outages and derived multiple lessons including the outage impacts, causes, and so on; they found that many root causes were not described clearly. Similarly, Liu et al. [75] studied hundreds of high-severity incidents in Microsoft Azure. Due to the nature of the data sources, these studies typically focus on high-level insights (e.g., whether an incident was caused by hardware, software, or human factors) instead of the source-code-level bug patterns described in this study. Since PM-based servers are emerging in production systems [30], and many production systems are based on the Linux kernel, our study may help understand PM-related incidents in the real world.
Tools for Testing and Debugging Storage Systems. Many tools have been created to test storage systems [31, 36, 37, 51, 59, 83, 86, 87, 102, 111, 112] or to help debug system failures [58, 64, 65, 71, 90, 104, 105, 108]. For example, EXPLODE [102], B\(^3\) [87], and Zheng et al. [111] apply fault injection to emulate crash images and detect crash-consistency bugs in file systems, a bug pattern also observed in our dataset. PFault [36, 59] applies fault injection to test the Lustre and BeeGFS parallel file systems built on top of the Linux kernel. Gist [65] applies program slicing to help pinpoint the root cause of failures in storage software including SQLite, Memcached, and so on. Duo et al. [108] extended PANDA [48] to track device commands to help diagnosis. In general, these tools rely on a detailed understanding of the underlying bug patterns to be effective. We believe our study on PM-related issues can contribute to building more effective bug detectors and debugging tools for PM-based storage systems, which we leave as future work.