research-article

Open access

Efficient Crash Consistency for NVMe over PCIe and RDMA

Authors:

Xiaojian Liao,

Youyou Lu,

Zhe Yang,

Jiwu ShuAuthors Info & Claims

ACM Transactions on Storage, Volume 19, Issue 1

Article No.: 7, Pages 1 - 35

https://doi.org/10.1145/3568428

Published: 11 January 2023 Publication History

All formats PDF

Abstract

This article presents crash-consistent Non-Volatile Memory Express (ccNVMe), a novel extension of the NVMe that defines how host software communicates with the non-volatile memory (e.g., solid-state drive) across a PCI Express bus and RDMA-capable networks with both crash consistency and performance efficiency. Existing storage systems pay a huge tax on crash consistency, and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe and RDMA interfaces. ccNVMe alleviates this major bottleneck by coupling the crash consistency to the data dissemination. This new idea allows the storage system to achieve crash consistency by taking the free rides of the data dissemination mechanism of NVMe, using only two lightweight memory-mapped I/Os (MMIOs), unlike traditional systems that use complex update protocol and synchronized block I/Os. ccNVMe introduces a series of techniques including transaction-aware MMIO/doorbell and I/O command coalescing to reduce the PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system named MQFS atop ccNVMe. We experimentally show that MQFS increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and Ext4 without journaling, respectively.

1 Introduction

The storage hardware has improved significantly over the last decade, e.g., an off-the-shelf solid-state drive (SSD) [14] can deliver over 7 GB/s bandwidth and provide 5 microseconds I/O latency. To better utilize high-performance SSDs, the Non-Volatile Memory express (NVMe) is introduced at the device driver layer to offer fast accesses over PCIe. With these changes, the performance bottlenecks are shifted back to the software stack.

The crash consistency (i.e., consistently update the persistent data structures despite a sudden system crash such as a power outage) is one fundamental and challenging issue faced by the storage systems. Providing crash consistency incurs expensive performance overhead, and further prevents the system software from taking full advantage of the fast storage devices. Responding to this challenge, recent works [5, 7, 17, 21, 27, 28, 32, 33, 34, 48, 49, 56, 60] have been made to enhance the software stack.

Although the hardware and software stack can be continuously advanced, there still remains one critical issue: the inefficiency from the boundary of the hardware and software (i.e., the NVMe driver) prevents the software stack from further providing higher performance (Section 3). For example, as presented in Table 1, to guarantee crash consistency of a transaction that consists of N individual 4 KB user data blocks, existing file systems (e.g., Ext4 and HoraeFS [27]) built atop NVMe over PCIe need to wait for the completion of these data blocks which involve several memory-mapped I/Os (MMIOs), DMAs, block I/Os, and IRQs. This consumes the available bandwidth of the PCIe links and the SSD, and increases the transaction latency, therefore lowering the application performance. Similar issues exist in the NVMe over RDMA stack since the local and networked NVMe stacks fail to cooperate with the hardware transactions (e.g., hardware doorbells of NVMe and RDMA) and software transactions (e.g., journaling) efficiently.

Table 1.

System	Software overhead	PCIe traffic
System	Software overhead	MMIO	DMA(Q)	Block I/O	IRQ
Ext4/NVMe	High	2(N+2)	2(N+2)	N+2	N+2
HoraeFS/NVMe	Medium	2(N+2)	2(N+2)	N+2	N+2
MQFS/ccNVMe	Low	4	N+1	N+1	N+1
MQFS-A/ccNVMe	Low	2	0	0	0

Table 1. Software Overhead and PCIe Traffic of Different Systems for Ensuring Crash Consistency

The number represents the count of operations needed for ensuring crash consistency of a transaction that consists of N individual 4-KB data blocks. MMIO: memory-mapped I/O over PCIe. DMA(Q): device transfers queue entries from/to host using DMA. Block I/O: 4-KB data blocks transferred via PCIe. IRQ: interrupt request. Existing file systems atop NVMe require (N+2) block I/Os and IRQs over PCIe in the common case; the number 2 indicates the extra journal description and commit record requests. Built atop ccNVMe, MQFS removes the commit record by taking the free rides of the doorbell operations, thereby reducing the number of block I/O and IRQ by 1. ccNVMe also reduces the number of MMIO and DMA(Q) via the transaction-aware MMIO and doorbell techniques. By further decoupling atomicity from durability, MQFS-A atop ccNVMe does not need to wait for the completion of DMAs(Q), block I/Os, and IRQs.

In this article, we propose ccNVMe, a novel extension of NVMe to define how host software communicates with the SSD across a PCIe bus with both crash consistency and performance efficiency (Section 4). The key idea of ccNVMe is to couple the crash consistency to the data dissemination; a transaction (a set of requests that must be executed atomically) is guaranteed to be crash consistent when it is about to be dispatched over PCIe. The data dissemination mechanism of the original NVMe already tracks the lifecycle (e.g., submitted or completed) of each request in the hardware queues and doorbells. ccNVMe leverages this feature to submit and complete the transaction in an ordered and atomic fashion, and makes the tracked lifecycles persistent for recovery, thereby letting the software ensure crash consistency by taking the free rides of the data dissemination MMIOs. Specifically, a transaction is crash consistent once ccNVMe rings the submission or completion doorbells; in the face of a sudden crash, the system is recovered by the tracked lifecycles from ccNVMe and versioned data blocks of software transactions (e.g., journaling).

ccNVMe communicates with the SSD over PCIe in a transaction-aware fashion, rather than the eager per-request basis of the original NVMe; this reduces the number of MMIOs, block I/Os, and interrupt requests (see MQFS/ccNVMe of Table 1), and thus increases the maximum performance that the file system can achieve. By further decoupling atomicity from durability, ccNVMe ensures crash consistency just after ringing (notifying) the SSD’s submission queue doorbell, with only two MMIOs (see MQFS-A/ccNVMe of Table 1).

We extend ccNVMe over PCIe (our previous work [29]) to RDMA-capable networks, enabling CPU-efficient and crash-consistent remote storage access (Section 5). The key idea is to align each RDMA queue pair to each NVMe hardware queue and leverage the in-order delivery property of RDMA networks, which seamlessly integrates NVMe over RDMA stack into ccNVMe over PCIe stack, thereby allowing applications to access a remote ccNVMe drive as if using a local ccNVMe drive. We further employ transaction-aware doorbells and I/O command coalescing techniques in ccNVMe over RDMA to reduce unnecessary CPU-initiated MMIOs, NIC-initiated DMAs, and RDMA operations for a transaction.

ccNVMe is pluggable and agnostic to storage systems; any storage system demanding crash consistency can enable ccNVMe and explicitly mark the request as an atomic one. Here, we design and implement MQFS to exploit the fast atomicity and high parallelism of ccNVMe (Section 6). We further introduce a range of techniques including multi-queue journaling and metadata shadow paging to reduce the software overhead.

We implement ccNVMe and MQFS in the Linux kernel (Section 7). ccNVMe places the submission queues along with its head and tail values on the persistent memory region (PMR) [45] of the NVMe SSDs, and embeds the transaction order in the reserved fields of the NVMe I/O command. As a result, ccNVMe provides failure atomicity without any logic changes to the hardware, and is compatible with the original NVMe.

We experimentally compare ccNVMe and MQFS against Ext4 [19], HoraeFS [27], which is a state-of-the-art journaling file system, and Ext4-NJ (Section 8); Ext4-NJ does not perform journaling and we assume it to be the ideal upper bound of the Ext4 on modern NVMe SSDs. We find MQFS performs significantly better than Ext4 and HoraeFS for a range of workloads. MQFS even surpasses Ext4-NJ when the workload is not severely bounded by I/O operations. In particular, MQFS increases the throughput of RocksDB by 66%, 36%, and 28%, compared to Ext4, HoraeFS, and Ext4-NJ, respectively. Through the crash consistency test of CrashMonkey [37], we demonstrate that MQFS can recover to a correct state after a crash.

In summary, we make the following contributions:

–

We propose ccNVMe to achieve high performance and crash consistency by coupling the crash consistency to the data dissemination, decoupling atomicity from durability and introducing transaction-aware MMIO and doorbell.

–

We propose MQFS to fully exploit ccNVMe, along with a range of techniques to reduce software overhead.

–

We implement and evaluate ccNVMe and MQFS in the Linux kernel, demonstrating that ccNVMe and MQFS outperform state-of-the-art systems.

2 NVMe

NVMe [39] is an interface like serial ATA (SATA) for software to communicate with high-performance SSDs. It is much more efficient than legacy interfaces due to its low latency and high parallelism of its high-performance queuing mechanism. The NVMe supports 65,535 I/O hardware queues each with 65,535 commands (i.e., queue depth). Each hardware queue is mapped to each CPU core to deliver scalable performance. The host operating system or application can connect to the SSD via two kinds of physical links: local PCIe (referred to as NVMe over PCIe) and network fabrics (referred to as NVMe over Fabrics). Here, we use Figures 1 and 2 to briefly introduce the data dissemination mechanism of NVMe over PCIe and RDMA, respectively.

Fig. 1.

Fig. 2.

2.1 NVMe Over PCIe

Figure 1 presents the procedure of command processing of NVMe over PCIe. Each host CPU core has its own independent submission queue (SQ), completion queue (CQ) and associated doorbells (SQDB and CQDB). The SQ and CQ are essentially circular buffers stored in host memory; the SQDB stores the tail value of SQ while the CQDB stores the head value of CQ. The host first places the I/O command in the free SQ slot (①), followed by updating the SQDB with the new tail value (②) to notify the SSD of the incoming command. The SSD then fetches the command (③) and transfers the data from host (④). After a command has completed execution, the SSD places a CQ entry in the free slot of the CQ (⑤), followed by generating an interrupt to the host (⑥). The host consumes the new CQ entry and then writes the CQDB with a new head value to indicate that the CQ entry has been consumed (⑦). As we can see, an I/O request requires at least two MMIOs, two DMAs of the queues, one block I/O, and one interrupt request (e.g., MSI-X).

PMR is a new feature of NVMe released in the 1.4 spec circa June 2019 [42]. It is a region of general purpose read/write persistent memory of the SSD. The SSD can enable this feature by exposing a portion of persistent memory (e.g., capacitor-backed DRAM or Optane Memory) which can be accessed by the CPU load and store instructions (e.g., MMIO). In this article, to implement and evaluate ccNVMe on a variety of commercial SSDs that do not support PMR, we use 2 MB in-SSD capacitor-backed DRAM to package the tested SSDs as PMR-enabled ones (details in Section 7).

2.2 NVMe Over RDMA

Storage disaggregation has become a common practice in data centers to allow flexible and independent scaling of compute and storage resources. Compared to traditional hard disk drives, modern NVMe SSDs are orders-of-magnitude faster, which makes network latencies and CPU overhead much more pronounced. NVMe over Fabrics (NVMe-oF) is introduced as an alternative to traditional iSCSI protocol, so as to mitigate protocol overhead of remote storage access.

NVMe-oF defines a common architecture that supports remote storage access to NVMe block storage over a range of networking fabrics (e.g., TCP and RDMA) [40]. NVMe-oF consists of two major components: the initiator (or client) and the target (or server). To write a remote NVMe SSD, the initiator of NVMe-oF first sends the NVMe I/O command and associated data blocks to the target using network protocols. Receiving the I/O command and data blocks, the target generates NVMe over PCIe commands to access the NVMe SSD by the same steps of NVMe over PCIe described in Section 2.1. When the write request completes, the target generates a completion response and passes it to the initiator. Read and administration requests follow a similar procedure of the write request.

Since this work is based on NVMe over RDMA, we introduce the details of a typical implementation of NVMe over RDMA. In a nutshell, NVMe over RDMA uses reliable connected (RC) transport; it uses two-sided RDMA SEND operations to exchange I/O commands and completion responses, and one-sided RDMA WRITE or READ operations to transfer data blocks. Figure 2 presents the organization of NVMe over RDMA and steps of a write request. Each CPU core of the initiator server has independent RDMA SQ and receive queue (RQ) for each NVMe SSD. The completion response of an RDMA request is placed in the RDMA CQ.

To process a write request, the initiator first posts the NVMe I/O command to the target by starting an RDMA SEND (①) operation and ringing the SQDB. Receiving the RDMA SEND operation, the target NIC consumes an RQ entry to find the appropriate memory location to store the NVMe I/O command (②). After parsing the I/O command, the target fetches data blocks from the initiator using the RDMA READ operation (③). When all data blocks arrive at the target, the target generates local NVMe over PCIe I/O command to store data blocks (④), which is the same as the local NVMe write command mentioned in Section 2.1. When data blocks reach the NVMe SSD and a completion response is reported by the SSD (⑤), the target generates a completion response and sends it to the initiator by an RDMA SEND operation (⑥). The write request is finally finished after the completion response is placed in the initiator DRAM by the NIC (⑦) and is acknowledged by the initiator.

Besides the PCIe traffic of the NVMe over PCIe stack, a 4-KB write request of NVMe over RDMA requires at least two RDMA SEND operations (① and ⑥), three MMIO operations on the SQDB, one DMA for the NIC to pull the I/O command, and one RDMA READ operation (③) to fetch the data block. Moreover, the NIC needs to consume RQ entries (② and ⑦) which will be reclaimed by the initiator and target; the initiator and target need to process the CQ entries to check if the RDMA operation successes (①, ③, and ⑥) or if new RDMA SEND operations arrive (② and ⑦).

3 Motivation

3.1 Evaluation of Journaling atop NVMe over PCIe

In this section, we revisit the crash consistency on modern NVMe SSDs. Journaling (a.a, write-ahead logging) is a popular solution used by many file systems including Linux Ext4 [19], IBM’s JFS [4], SGI’s XFS [54], and Windows NTFS [35] to solve the crash consistency issue. Hence, we perform experiments on journaling file systems, in particular Ext4, Ext4 without journaling (Ext4-NJ), and a recently proposed HoraeFS [27] to understand the crash consistency overhead. In the Ext4-NJ setup, we disable journaling of Ext4, and assume it to be the ideal upper bound of Ext4 on modern NVMe SSDs. Using the FIO [2] tool, we launch up to 24 threads, and each performs 4 KB append writes to its private file followed by fsync independently. We choose this workload as the massive small synchronous updates can stress the crash consistency machinery (i.e., journaling). We consider three NVMe SSDs that were introduced over the last 6 years, including flash and Optane SSDs; the performance matrix of these SSDs is presented in Table 3; the other configurations of the testbed are described in Section 8.1. Figure 3 shows the overall results. The gap between Ext4-NJ and Ext4 (or HoraeFS) quantifies the crash consistency overhead.

Fig. 3.

In the older NVMe drive, as shown in Figure 3(a), the journaling setups (i.e., Ext4 and HoraeFS) perform comparably against the no-journaling setup (i.e., Ext4-NJ), and even outperform Ext4-NJ. Using journaling to take advantage of the higher sequential bandwidth of the SSD, and optimizing journaling as in HoraeFS to reduce the software overhead, delivers significant improvements on throughput; the SSD’s bandwidth is therefore saturated (see Figure 3(d)).

However, as NVMe SSDs evolve, the crash consistency overhead becomes significant and tends to be more severe, as presented in Figure 3(b) and (c). Notably in the 24-core case of Figure 3(c), the crash consistency overhead (i.e., the ratio of (Ext4-NJ - HoraeFS) to HoraeFS) is nearly 66%. Except for Ext-NJ, all file systems fail to fully exploit the available bandwidth. Further analyses suggest that the inefficiency comes from the software overhead and PCIe traffic.

Software overhead. Many Ext4-based file systems including HoraeFS and BarrierFS [56] use a separate thread to dispatch the journal blocks for ordering and consistency. The computing power of a single CPU core is sufficient for old drives, but is inadequate for newer fast drives. Moreover, the context switches between the application and journaling thread introduce non-negligible CPU overhead. Efficiently utilizing multi-cores to perform journaling in the application’s context becomes important, as we will show in MQFS.

PCIe traffic. To achieve atomicity of N 4-KB data blocks, the journaling generates two extra blocks (i.e., the journal description and commit block) for a single transaction. This approach requires 2 \(\times\) (N+2) MMIOs, 2 \(\times\) (N+2) DMAs from/to the queue entries, (N+2) block I/Os, and (N+2) interrupt requests if block merging is disabled. When the SSD is fully driven by the software stack with enough CPU cores, the application performance is instead bottlenecked by the boundary of the software and hardware. With a large bandwidth consumed at the device driver, the available bandwidth provided to the file system is therefore limited. Moreover, the file system needs to wait for the completion of these I/Os and requests to ensure the atomicity of a transaction. This increases the transaction latency, leaves the CPU in an idle state, and thus lowers the throughput.

3.2 Issues of Journaling atop NVMe over RDMA

When the initiator accesses NVMe storage over RDMA, the crash consistency overhead becomes worse as RDMA opearations are serialized and some of the CPU and I/O resources are wasted. We use Figure 4 that shows the Ext4 journaling atop Linux NVMe over RDMA to illustrate the crash consistency overhead in terms of the network stack (excluding the NVMe over PCIe stack).

Fig. 4.

Issue 1: PCIe traffic of the RDMA doorbell. A transaction (TX) consists of one or more transaction data requests and a transaction commit record. The file system can persist multiple transaction data requests in parallel; it dispatches the next transaction data request immediately after the device driver rings the RDMA doorbell. The Linux device driver rings the doorbell for each RDMA operation of a single transaction data request. This eager doorbell (i.e., CPU-generated MMIO over PCIe) without batching incurs non-negligible PCIe overhead, as the PCIe header size of doorbell is 20–26 bytes, which is comparable to the common size of data payloads atop PCIe. On one hand, the doorbell essentially modifies an 8-byte value that is smaller than the PCIe header; eager doorbell costs a large fraction of PCIe bandwidth on the PCIe header. On the other hand, the size of the doorbell (CPU-generated MMIO) value should align to the cache line size (e.g., 64 bytes) due to the CPU write combining; the size of the RDMA header is 36 bytes for RC mode and the size of the NVMe I/O command is 64 bytes, both of which are comparable to the doorbell size; with this, eager doorbell reduces the rate at which the CPU dispatches NVMe I/O commands to the NIC as the time spent on doorbell is non-negligible in the overall I/O dispatch phase.

Issue 2: PCIe and network traffic of the RDMA READ and journal commit record. The journaling needs a commit record to ensure the atomicity of the transaction data, which requires extra RDMA operations and consumes the network bandwidth. Moreover, the commit record cannot be issued until the transaction data becomes durable; this increases the latency of the overall I/O path.

Issue 3: PCIe and network traffic of the RDMA SEND and NVMe I/O command. For each transaction data request, the Linux device driver generates an NVMe I/O command and posts the I/O command by RDMA SEND without merging. This potentially generates more NVMe I/O commands than needed and costs more network bandwidth, especially when transaction data requests that target on continuous logical block addresses can be consolidated.

Issue 4: Serialization of dependent transactions. For two dependent transactions, the file system cannot start the next transaction until the former transaction completes. This serializes the persistence of dependent transactions and therefore decreases the transaction concurrency.

This work is dedicated to addressing the aforementioned issues by ccNVMe over RDMA.

4 ccNVMe OVER PCIe

To reduce the PCIe traffic and improve the performance efficiency for crash-consistent storage systems, we propose ccNVMe to provide efficient atomicity and ordering. This section presents the design and implementation of ccNVMe over PCIe.

The key idea of ccNVMe is to couple the crash consistency to the data dissemination. The original NVMe already records the requests in the submission queues and their states in the doorbells; ringing the SQDB indicates that the requests are about to (but not yet) be submitted to the SSD while ringing the CQDB suggests that these requests are completed. These two doorbells (states) naturally represent the “0” (nothing) and “1” (all) states of the atomicity.

Based on this observation, we extend NVMe over PCIe to be atomic and further crash consistent. ccNVMe makes the submission queues durable in case of a sudden crash, and rings the doorbells in the unit of a transaction rather than a request, to let the requests of a transaction reach the same state (e.g., all or nothing), thereby achieving atomicity. We show how to provide the transaction abstraction atop the original request notion in Section 4.2. However, as NVMe does not prescribe any ordering constraint nor persistence of the submission queue, it is non-trivial to persist the submission queues entries over the PCIe link and ring the doorbell efficiently and correctly. We introduce transaction-aware MMIO and doorbell for efficient persistence in Section 4.3, and present how to ring the doorbell to enforce ordering guarantees in Section 4.4.

4.1 Overview

Figure 5 presents an overview of ccNVMe. In the leftmost of the figure, ccNVMe is a device driver that sits between the block layer and the storage devices. But unlike traditional NVMe, ccNVMe moves further by providing the atomicity guarantees at the boundary of the hardware and software layers. This design has two major advantages. First, ccNVMe lets the atomicity take the free rides of fast NVMe queuing and doorbell operations, and thus accelerates the crash consistency guarantees. Second, ccNVMe provides generic atomic primitives, which can free the upper layer systems from the need to implement a complex update protocol and to reason about correctness and consistency. For example, the applications can directly issue atomic operations, or use the classic file systems APIs (e.g., write) followed by fsync or a new file system primitive fatomic proposed by this work to ccNVMe to ensure failure atomicity.

Fig. 5.

In the right part of Figure 5, ccNVMe keeps the multi-queue design of the original NVMe intact; each CPU core has its own independent I/O queues (i.e., SQs and a CQ), doorbells, and hardware context (e.g., interrupt handler) if the underlying SSD offers enough hardware resources. The only difference is the (optional) ccNVMe extension added to each core. In particular, ccNVMe creates persistent submission queues (P-SQs) and corresponding doorbells (P-SQDB) in the PMR of the NVMe SSD. When receiving atomic operations, ccNVMe generates ccNVMe I/O commands to the P-SQ and rings the P-SQDB. Now, the atomicity is achieved by only two MMIOs (i.e., ① and ②) in the common case. We design the ccNVMe I/O command by using the reserved fields of the NVMe common command; this makes ccNVMe compatible with NVMe. Consequently, the storage device can directly fetch the I/O commands from the P-SQ without any logic changes.

The other procedures of an I/O command of ccNVMe, including the data transfer (③), interrupt (④ and ⑤), and command completion (⑥), is almost similar to NVMe, except that the basic operational unit is the transaction (a set of operations that need to be executed atomically) rather than the request from each slot of the queues.

4.2 Transaction: The Basic Operational Unit

Each entry of an SQ represents a request to a continuous range of logical block addresses. ccNVMe distinguishes the atomic request from the non-atomic request via a special attribute REQ_TX. A special atomic request with REQ_TX_COMMIT serves as a commit point for a transaction. Hence, the commit request implicitly flushes the device to ensure durability, by issuing a flush command first and setting the FUA bit in the I/O command, if the volatile write cache is present in the SSD. ccNVMe embeds these attributes in a reserved field in the I/O command (Table 2), and handles these atomic requests differently based on their category (Sections 4.3 and 4.4).

Table 2.

Dword	Bits	NVMe	ccNVMe
2	00:31	reserved	transaction ID
12	16:19	reserved	REQ_TX or REQ_TX_COMMIT

Table 2. The Command Format of ccNVMe over PCIe

Each command is 64 bytes. Dword: 4 bytes.

ccNVMe groups a set of requests as a transaction and assigns each transaction a unique transaction ID. The transaction ID can be generated by the applications or file systems by a logical or physical timestamp (e.g., RDTSCP instruction). This ID is used for the unique identification of a transaction as well as deciding the persistence order across multiple submission queues. The transaction ID is stored in a reserved field of the command (Table 2).

4.3 Transaction-Aware MMIO and Doorbell

ccNVMe uses persistent MMIO writes to insert atomic requests to P-SQ and ring P-SQDB, which is different from NVMe that uses non-persistent MMIO writes. Figure 6(a) illustrates the persistent MMIO write. MMIO write is performed directly by the CPU store instruction. Here, since the P-SQ structure is organized in a circular log, ccNVMe leverages the write combining (WC) buffer mode of the CPU to consolidate consecutive writes into a larger write burst, thereby improving the memory and PCIe accesses efficiency. To ensure persistence, ccNVMe uses MMIO flushing via two steps. First, clflush followed by mfence is used to flush the MMIO writes to the PCIe Root Complex. Second, exploiting the PCIe ordering that a read request must not pass a posted request (e.g., write) (Table 2-39 in PCIe 3.1a spec [46]), ccNVMe issues an extra MMIO read request of zero-byte length to ensure that the MMIO writes finally reach PMR.

Fig. 6.

Unfortunately, persistent MMIO write is significantly slower than the non-persistent one. As shown in Figure 7, when issuing 64 byte payloads, the latency of persistent write (i.e., write+sync) is 2.5\(\times\) higher than that of non-persistent write (i.e., write). We also notice that the bandwidth and latency of the persistent write are approaching non-persistent write, especially when the MMIO size is larger than 512 bytes.

Fig. 7.

The original NVMe uses non-persistent MMIOs and can place the submission queues in the host memory. As a result, it updates the submission queues and rings the doorbells in a relatively eager fashion: whenever a request is inserted into the NVMe submission queue, it rings the doorbell immediately. However, in ccNVMe that requires persistent MMIOs and operates at the unit of a transaction, ringing the doorbell on a per-request basis results in considerable overhead, for two reasons. First, issuing persistent MMIO writes without batching prevents the CPU from exploiting the coalescing potentials in the WC buffer, lowering the performance. Second, per-request doorbell incurs unnecessary MMIOs over PCIe, as a transaction is completed only if all of its requests are finished; only one doorbell operation is needed.

ccNVMe introduces transaction-aware MMIO and doorbell for dispatching the requests and ringing the doorbell. The key idea here is to postpone the MMIO flushing (i.e., ② and ③ in Figure 6(a)) and doorbell until a transaction is being committed. Figure 6(b) depicts an example. Suppose a transaction consists of two requests, \(W_{x-1}\) and \(W_{x-2}\). In step 1, \(W_{x-1}\) is a normal atomic request with REQ_TX that comes first, and ccNVMe stores it using the CPU store instruction. When receiving a commit request with REQ_TX_COMMIT, ccNVMe triggers MMIO flushing. In step 2, ccNVMe uses cache line flush and PCIe read as presented in Figure 6(a) to persist the queue entries to P-SQ. Finally in step 3, ccNVMe rings the P-SQDB by setting the tail pointer of the P-SQ. Compared to the naïve approach which requires N MMIO flushings and N doorbell operations for a transaction that contains N requests, ccNVMe only requires one MMIO flushing and one doorbell operation, regardless of the size of the transaction.

4.4 Correctness and Crash Recovery

Crash consistency includes atomicity and ordering guarantees. We present how ccNVMe provides these two guarantees during normal execution and after a crash. The key idea here is to complete the dependent transaction atomically and in order and track the lifecycle of transactions in the face of a sudden crash. During crash recovery, ccNVMe uses the lifecycles from PMR to redo the completed transactions while dropping non-atomic or out-of-order transactions.

Normal execution. In ccNVMe, the requests are submitted and completed in the unit of transaction. This is achieved via the transaction-aware doorbell mechanism. In particular, ccNVMe holds the same assumption that ringing the doorbell (i.e., writing a value to a 4 B address) itself is an atomic operation as in the original NVMe. ccNVMe rings the doorbell atomically (i.e., steps ② and ⑥ of Figure 5) after updating the entries of P-SQ and CQ. Consequently, the transactions are submitted and completed atomically.

The ordering here means the completion order of dependent transactions during normal execution. ccNVMe maintains only the “first-come-first-complete” order of each hardware queue although the original NVMe does not prescribe any ordering constraint. ccNVMe allows the device controller to process the I/O commands from the submission queue in any order, the same as the original NVMe. Yet, ccNVMe completes the I/O commands in order by chaining the completion doorbell (i.e., updating P-SQ-head and ringing the CQDB sequentially). This ensures that a transaction is made complete only when its preceding ones finish; the upper layer systems thus see the completion states of dependent transactions in order.

Crash recovery. During crash recovery, ccNVMe finds the unfinished transactions and leaves the specific recovery algorithms (e.g., rollback) to upper layer systems. In particular, in the face of a sudden power outage, the data of PMR including P-SQ, P-SQDB, and P-SQ-head are saved to a backup region of the persistent media (e.g., flash memory) of the SSD. When power resumes, the data is loaded back onto the PMR. Then, ccNVMe performs crash recovery during the NVMe probe; it provides the upper layer system with the unfinished transactions for recovery. Specifically, the transactions of the P-SQ that range from the P-SQ-head to P-SQDB are unfinished ones. ccNVMe makes an in-memory copy of these unfinished transactions; the upper layer systems can thus use this copy for recovery logic (e.g., replay finished transactions and discard unfinished ones). As ccNVMe always completes the transactions atomically and in order, it keeps the correct persistence order after crash recovery.

ccNVMe does not guarantee any global order across multiple hardware queues; it only assists upper layer systems with the global order by providing the persistent transaction ID field. Upper layer systems can embed the global order in this field to decide the persistence order during recovery. We show a file system crash recovery in Section 6.5 and experimentally study its correctness in Section 8.6.

4.5 Programming Model

ccNVMe is a generic device driver that does not change the interfaces for upper layer systems. Specifically, the kernel file system can use the intact submit_bio function to submit the write requests that require crash consistency; the application can use the original nvme command or the ioctl system call to submit raw ccNVMe commands. The only exception is that upper layer systems must explicitly mark the request (e.g., tag the bio structure with REQ_TX) and control the ordering across multiple hardware queues (e.g., write the transaction ID to a new field of the original bio).

In the current design, ccNVMe does not control the ordering and atomicity across multiple hardware queues for the consideration of CPU and I/O concurrency. Therefore, ccNVMe does not allow the requests of a transaction to be distributed to different hardware queues. This requires that the thread queuing atomic requests to ccNVMe cannot change its running core until it commits the transaction (i.e., marks a request with REQ_TX_COMMIT), as the current NVMe storage stack follows the principle of assigning a dedicated hardware queue to each core as much as possible. As we will show in Section 8.5.2, queuing a transaction consumes only microsecond-scale latency, and this is not a big limitation. We leave the solution to this limitation for future work.

4.6 Discussion

ccNVMe does not need any change in hardware logic, though, additional changes in the NVMe SSD controller are highly likely to significantly boost the performance.

Transaction-aware scheduling. The responsiveness of a transaction is determined by its slowest request. The NVMe SSD controller can leverage the transaction notion of ccNVMe to dispatch and schedule requests to different channels and chips, to achieve low transaction latency.

Transaction-aware interrupt coalescing. The NVMe has standardized interrupt coalescing to mitigate host interrupt overhead by reducing the rate at which interrupt requests (e.g., MSI-X) are generated by a controller. Nonetheless, the suitable aggregation granularity of the interrupt coalescing is hard to decide due to the semantic gap and workload change. Using ccNVMe, the controller can send only one interrupt to the host only when a transaction is completed.

5 ccNVMe OVER RDMA

We extend ccNVMe over PCIe to ccNVMe over RDMA, enabling efficient and crash-consistent remote storage access. The key idea of ccNVMe over RDMA is similar to ccNVMe over PCIe: the initiator and target communicate with the NICs and ccNVMe over PCIe stack by the transaction abstraction rather than the original request notion. This design batches MMIO and RDMA operations, which improves the access efficiency of the PCIe and RDMA while not sacrificing the transaction-level latency. Moreover, this design can seamlessly integrate NVMe over RDMA into ccNVMe over PCIe stack, enjoying the fast crash consistency in the device driver. In this section, we first present an overview of ccNVMe over RDMA, and then introduce key techniques.

5.1 Overview

Figure 8 shows an overview of ccNVMe over RDMA. ccNVMe over RDMA is a device driver that spans over the initiator and target, providing crash-consistent access to NVMe block storage (Figure 8(a)). ccNVMe over RDMA automatically converts user requests that demand crash consistency to a set of RDMA operations; upper layer systems (e.g., the file system) and applications can thus issue atomic writes to ccNVMe over RDMA, just as accessing a local ccNVMe over PCIe drive.

Fig. 8.

To deliver scalable performance and integrate ccNVMe over RDMA to ccNVMe over PCIe stack, ccNVMe creates a dedicated RDMA queue pair for each NVMe hardware queue (Figure 8(b)); all NVMe requests from an RDMA send queue (e.g., RDMA queue 0) are dispatched to its dedicated NVMe hardware queue (e.g., NVMe queue 0). For RDMA SEND and WRITE operations, the RC transport of RDMA guarantees in-order delivery of messages (work requests) from each queue pair. With the in-order delivery property, for two NVMe write I/O commands, the command that arrives at the RDMA send queue earlier is dispatched to the associated NVMe hardware queue earlier; this ensures that write requests from different transactions will not interleave with each other. Hence, the target driver can dispatch atomic write requests to local ccNVMe over PCIe stack correctly.

The number of CPU cores in the initiator can be larger than that of NVMe hardware queues. In such a case, some cores (e.g., cores 1 and 2 of Figure 8(b)) share an RDMA queue pair and NVMe hardware queue, and thus transactions from different cores may mix in the same NVMe hardware queue, which can lead to an inconsistency where a transaction is committed by another transaction’s commit record. To distinguish transactions from different cores, we use the stream notion which consists of a set of dependent transactions; individual streams do not have dependency; ccNVMe over RDMA embeds a stream ID into each atomic write request. For example, if transactions from cores 0 and 1 have dependency, they share the same stream ID; otherwise, each core assigns different stream IDs to its requests. In this section, to simplify the description, we assume that each CPU core has a dedicated RDMA queue pair and NVMe hardware queue.

Figure 9 shows the workflow of ccNVMe over RDMA for each CPU core. Instead of directly dispatching an RDMA SEND operation to the NIC like classic Linux NVMe over RDMA, the initiator first stages RDMA SEND operations and associated NVMe I/O commands (①). It rings the SQDB until it encounters a special write request that marks the end of a transaction (i.e., request with a REQ_TX_COMMIT flag); this reduces the number of doorbell operations to 1 (②). Next, the initiator NIC sends the I/O commands to the target (③) where a set of RDMA READ operations are generated and staged (④). Receiving all I/O commands of a transaction, the target generates RDMA READ operations and rings the SQDB (⑤) to fetch data blocks (⑥) in parallel. Then, the target generates local block I/Os and dispatches them to the ccNVMe over PCIe stack (⑦). As the ccNVMe over RDMA dispatches transactions to ccNVMe over PCIe atomically (i.e., staging all write requests within a transaction before dispatching) and in order (i.e., the in-order delivery property of the RDMA RC transport), it keeps the intact crash consistency from ccNVMe over PCIe. When all blocks I/Os complete, the target generates completion responses and sends them back to the initiator by RDMA SEND operations and a transaction-aware doorbell (⑧–⑩).

Fig. 9.

In the next sections, we first present the details of transaction-aware doorbell (Section 5.2) on the RDMA send queues. We then show how to coalesce NVMe I/O commands to further improve the access efficiency of RDMA (Section 5.3), followed by introducing techniques of separation of atomicity from durability (Section 5.4). We finally present a solution to adopt the design of ccNVMe over RDMA to TCP transport (Section 5.5).

5.2 Transaction-Aware Doorbell

The standard of NVMe over Fabrics does not prescribe how to ring doorbells; the storage stack of Linux NVMe over RDMA simply updates the doorbell register in the NIC whenever a work request (wr) is inserted into the RDMA send queue. This traditional approach reduces the efficiency of CPU-initiated doorbells and potentially slows NIC-initiated DMAs. ccNVMe over RDMA uses the same transaction-aware doorbell technique as the ccNVMe over PCIe to connect the initiator/target and the NIC, therefore boosting the performance as well as guaranteeing atomicity.

Specifically, ccNVMe rings the doorbell when all requests within a transaction arrive or when the number of work requests exceeds a pre-defined threshold (half of the queue depth by default), whichever happens earlier; this reduces the number of CPU-initiated doorbells and resolves issue 1 from Section 3.2. Moreover, the doorbell of ccNVMe over RDMA indicates a commit point for the transaction and thus upper layer systems can remove the commit record, thereby addressing issue 2 from Section 3.2.

The transaction-aware doorbell can also improve the performance of NIC-initiated DMAs. In the original NVMe over RDMA, the locations of the payloads (e.g., the NVMe I/O commands) of the RDMA SEND operations are usually non-continuous. NICs that cannot access non-continuous addresses with a single DMA command need to issue multiple DMA commands to fetch the payloads. NICs that can access non-continuous addresses with a single DMA command need to put more metadata that points to scattered memory locations in the DMA header. Both types of NICs pay much more overhead of the DMA header when fetching non-continuous payloads than when fetching continuous payloads, and this leads to lower DMA efficiency. As the doorbell in ccNVMe is batched, ccNVMe places the batched payloads (i.e., the NVMe I/O commands) in continuous memory addresses, to facilitate NICs’ DMA and achieve higher transfer efficiency of the payloads.

5.3 I/O Command Coalescing

Based on transaction-aware doorbell, ccNVMe over RDMA further coalesces NVMe I/O commands whose data blocks target consecutive logical block addresses, to reduce the number of RDMA SEND operations (① and ⑧ of Figure 8) and NVMe block I/Os (⑦ of Figure 8), therefore mitigating issue 3 from Section 3.2. I/O command coalescing is done in the unit of a transaction. For example, as shown in Figure 10, suppose the logical block addresses of the write request W1 range from 100 to 102 and W2 ranges from 103 to 104. ccNVMe over RDMA merges two NVMe I/O commands of W1 and W2 into a sole command before sending the write requests to an RDMA send queue; the merging process changes the starting address of the merged I/O command of W1-2 to 100 and length to 5. If two write requests which can be merged vary in type (e.g., W5 and W6 of Figure 10), the merged write request needs to keep the commit point.

Fig. 10.

Coalescing NVMe I/O commands reduces the number of operations on the associated CQ and RQ, which further reduces the CPU load on the initiator/target and PCIe traffic between initiator/target and NICs. In the common implementation, the original NVMe over RDMA generates a completion entry to the RDMA CQ for an RDMA operation, to let the initiator or target check whether the operation fails or not. The NIC needs to inform the initiator or target of the arrival of the completion entry by interruption; the initiator or target needs to consume the completion entry. As a result, reducing the number of RDMA operations alleviates the overhead of generating and consuming completion entries. For RDMA SEND operations, the NIC needs to consume the entries of RQ from the initiator and target. Consuming RQ entries also generates completion entries; the initiator and target need to process these entries and reclaim free memory locations and fill them into RDMA RQ. Hence, reducing the number of RDMA SEND operations decreases the overhead of fetching and cleaning the entries of RQs.

Coalescing NVMe I/O commands itself requires some investment on CPU cycles and increases the latency of the overall I/O path. But this is worthwhile for networked storage, especially for fast RDMA networks whose speed is sensitive to the overhead of protocol implementation [16]. First, as emerging ultra-fast NVMe SSDs and RDMA NICs deliver a 4-KB I/O latency of sub-10 microseconds, recent studies suggest that the I/O stack should accelerate or even disable request merging [25, 59]. ccNVMe over RDMA sequentially scans write requests within a transaction only once for coalescing; this simple coalescing algorithm does not introduce too much CPU overhead and latency compared to the hardware latency. Second, some crash-consistent storage systems (e.g., journaling-based and log-structured file systems) usually place the data blocks in continuous logical block addresses. In these systems, coalescing is always successful and beneficial as reducing NVMe I/O commands improves the performance to a greater extent compared to the coalescing overhead itself. For systems where the probability to find adjacent write requests is low, the coalescing mechanism can be disabled.

Coalescing NVMe I/O commands can also improve the performance of RDMA READ operations by allocating continuous memory regions for data blocks. This optimization, however, is not implemented in ccNVMe over RDMA and left for future work as it requires substantial changes to the memory allocation policy of upper layer systems. An RDMA READ operation fetches data blocks from continuous memory addresses. Although NVMe I/O commands are coalesced, the data blocks are distributed randomly in the memory and thus the target needs to issue multiple RDMA READ operations to simultaneously fetch data blocks. If the memory addresses of data blocks are continuous, only one RDMA READ operation is needed.

5.4 Separating Atomicity from Durability

Recall that ccNVMe over PCIe separates atomicity from durability: whenever a transaction rings the submission queue doorbell of NVMe, it becomes atomic since ccNVMe recovers the system to a consistent state by the information from PMR in the face of a sudden system crash. ccNVMe over RDMA pushes this separation further: whenever a transaction rings the send queue doorbell of RDMA, it becomes atomic. This extended separation is straightforward: as the RC-based RDMA guarantees in-order delivery for each queue pair and the target always dispatches write requests from network in order and atomically, the I/O path from the initiator to the ccNVMe over PCIe stack is atomic and ordered. If the system fails when write requests are transferred through the network, all requests are dropped and will not be dispatched to ccNVMe over PCIe stack, and this keeps an empty storage state and ensures crash consistency. Otherwise, all write requests reach the target and the crash consistency issue regresses to the condition of ccNVMe over PCIe stack.

Similar to ccNVMe over PCIe, the separation of ccNVMe over RDMA also parallelizes the delivery and persistence of dependent transactions within a RDMA queue pair: the following transaction can be dispatched to the RDMA send queue immediately after the previous transactions are dispatched. This technique solves issue 4 from Section 3.2.

5.5 Discussion of Porting ccNVMe Over PCIe to TCP

The idea of ccNVMe over RDMA is not restricted to the RDMA transport, but can be applied to any lossless networks with in-order delivery property (e.g., TCP). This section presents the guidelines of porting ccNVMe over PCIe to TCP transport.

The TCP is a network protocol and provides reliable in-order delivery within a TCP connection (e.g., socket). NVMe over TCP [44] is an application layer protocol atop TCP and uses a TCP connection to exchange NVMe I/O commands, responses, and data blocks between the initiator and target. Similar to NVMe over RDMA, each core of the initiator of Linux NVMe over TCP can establish a long-lived TCP connection to the target to offer scalable performance. NVMe over TCP uses a dedicated kernel thread (work queue) to accept write requests from the block layer and dispatches them to the TCP stack. An instant doorbell is introduced to wake up the kernel thread to process the incoming requests; the overhead of this eager doorbell is high as it incurs frequent context switches [13]. The transaction-aware doorbell (Section 5.2) can be used to alleviate the overhead. Similar to ccNVMe over RDMA, the kernel thread of NVMe over TCP can coalesce NVMe I/O commands (Section 5.3) to reduce the size of protocol data units (PDUs) and further the TCP payloads. To enable separation of atomicity from durability (Section 5.4), the kernel thread needs to preserve the dispatch order from the block layer and thus the atomicity can be guaranteed when write requests reach the kernel thread.

6 MQFS: the Multi-Queue File System

ccNVMe is file system and application agnostic; any file system or application desiring crash consistency can be adopted to ccNVMe, by explicitly marking the atomic requests and assigning the same transaction ID to the requests from a transaction. Recall that our study in Section 3 shows modern Linux file systems still suffer from software overhead and thus are unable to take full advantage of ccNVMe. In this article, we introduce multi-queue file system (MQFS) to fully exploit the atomicity guarantee and multi-queue parallelism of ccNVMe. MQFS can be deployed atop both the ccNVMe over PCIe and RDMA, since both systems offer a block device abstraction via the block layer.

6.1 Overview

We develop MQFS based on Ext4 [19], reusing some of its techniques including in-memory indexing and directory/file structure. The major difference is the multi-queue journaling introduced to replace the traditional journaling module (i.e., JBD2), along with a range of techniques to ensure both high performance and strong consistency. Here, we present how the critical functions of the crash consistency perform and interact with ccNVMe, followed by introducing each technique at length in the next sections.

MQFS divides the logical address space of the device into a file system area and several journal areas; the file system area remains intact as in Ext4; MQFS partitions the journal area into multiple portions, and each portion is mapped to a hardware queue. By tagging persistent updates as atomic ones, each core performs journaling on its own hardware queue and journal area, and thus reduces the synchronization from multiple CPU cores.

Synchronization primitives. The fsync of MQFS guarantees both atomicity and durability. Every time a transaction is needed (e.g., create, fsync), the linearization point is incremented atomically and assigned as the transaction ID. When fsync is called, MQFS tags the updates with REQ_TX and the final journal description block with REQ_TX_COMMIT, followed by sending these blocks to the journal area for atomicity and recovery. Note that compared to JBD2, MQFS eliminates the commit block and removes the ordering points (e.g., FLUSH), thereby reducing the write traffic and boosting performance; ringing the P-SQDB actually plays the same role as the commit block. The fsync returns successfully until the transaction is made durable, i.e., all updates have experienced steps ①–⑥ of Figure 5.

Atomicity primitives. MQFS decouples the atomicity from durability, and introduces two new interfaces, fatomic and fdataatomic, to separately support the atomicity guarantee. fatomic synchronizes the same set of blocks of fsync, but returns without ensuring durability, i.e., all updates have experienced steps ① and ② of Figures 5 and 9 for ccNVMe over PCIe and RDMA, respectively. fdataatomic is similar to fatomic, except that it does not flush the file metadata (e.g., timestamps) if the file size is unchanged. Refer to the following code. write(file1, “Hello”);write(file1, “World”);fatomic(file1); using fatomic, the application can ensure that the file content is either empty or “Hello World”; no intermediate result (e.g., “Hello”) will be persisted.

We present the I/O path of the synchronization and atomicity primitives, and the advantages of separation of atomicity from durability using detailed graphs and numbers in Sections 8.5.2 and 9.5.

6.2 Multi-Queue Journaling

Each core writes the journaled data blocks to its dedicated journal area with less coordination between other cores at runtime. Conflicts are resolved by using the global transaction ID among transactions during checkpointing. A simple way of checkpointing is to suspend all logging requests to each hardware queue, and then checkpoint the journaled data from the journal areas in the order determined by the transaction ID. MQFS instead introduces multi-queue journaling to allow one core to perform checkpointing without suspending the logging requests of other cores.

The key idea of multi-queue journaling is to use per-core in-memory indexes to coordinate the logging and checkpointing, while using on-disk transaction IDs to decide the persistence order. Specifically, the index is a radix tree, which manages the state and newest version of a portion of journaled data blocks (Figure 11). MQFS distributes the journaled blocks to the radix trees with different strategies based on the journaling mode. In data journaling, MQFS distributes the journaled blocks by hashing the final location of the journaled data, e.g., logical block address % the number of trees. In metadata journaling mode, as only the metadata is journaled and the metadata is scattered over multiple block groups (a portion of file system area), MQFS finds the radix tree by hashing the block group ID of the journaled metadata. Each radix tree takes the logical block address of the journaled block as the key, and outputs the journal description entry (JH) recording the mapping from journal block address to final block address along with the transaction ID (TxID) and its current state (state).

Fig. 11.

MQFS uses these indexes to checkpoint the newest data block and append (but not suspend) the incoming conflicting logging requests. In Figure 11, suppose journal area 1 runs out of space and checkpointing is triggered; note that JH with the same subscript indicates a data block written to the same logical block address. MQFS replays the log sequentially; for \(\rm JH_{1}\), it finds its TxID is lower than the newest one from tree 1, i.e., another journal area contains a newer block, and thus skips the journaled data (JD). \(\rm JH_{2}\) in its log is the newest one and therefore can be checkpointed; before checkpointing, the state field is set to chp, indicating that this block is being checkpointed. Now, suppose journal area 2 receives a new \(\rm JH_{2}\). By searching tree 2, MQFS finds that another journal area is checkpointing this block. It then appends the new \(\rm JH_{2}\) after the old \(\rm JH_{2}\) of tree 2, marks this entry as log, and continues to write \(\rm JH_{2}\) and JD. Using the in-memory indexes to carefully control the concurrency, and the transaction ID to correctly enforce the checkpointing order, MQFS can process logging and checkpointing with higher runtime concurrency.

6.3 Metadata Shadow Paging

The metadata is small (e.g., 256 B) and the file system usually stitches metadata from different files into a single shared metadata block. Though accessing different parts of the metadata block, operations from different threads are executed serially. For example, as shown in Figure 12(a), two threads T1 and T2 update the same block \(\rm D_{1}\). Although the two threads update disjoint parts, they are serialized by the page lock, due to the access granularity of the virtual memory subsystem.

Fig. 12.

To ease this overhead and to construct the journal entries in parallel, we introduce metadata shadow paging to further parallelize I/O operations. The main idea is to update the metadata page sequentially while making a local copy for journaling. MQFS uses this technique to fully exploit the concurrent logging of the multi-queue journaling.

For example in Figure 12(b), T1 updates the in-memory \(\rm D_{1}\), makes a copy, and then journals that copy \(\rm D_{1-1}\). Immediately after T1 has made a copy, T2 can start processing \(\rm D_{1}\) with the same procedure, i.e., copy and journal \(\rm D_{1-2}\).

In MQFS, only the metadata blocks use shadow paging because only a few metadata blocks (e.g., 1–3) are needed for a fsync call in the common case. Data blocks still use the typical lock-based approach because (1) the data blocks are aligned with page granularity without page-level contention from different files, and (2) the number of data blocks is usually non-deterministic; a request with enormous user data blocks can consume a large portion of memory space.

MQFS uses metadata shadow paging to journal the file system directory tree. Specifically, MQFS first takes a snapshot of the updated metadata, and then journals the read-only snapshot. For example in Figure 12(c), T1 creates a new file B and persists it to the storage. Meanwhile, T2 creates and syncs a new file C. Assume A is new to /root and T1 goes first. T1 performs shadow paging on path /root/A/B, and then releases the lock on the directory entries of A and /root. After that, T2 performs shadow paging on path A/C. Finally, T1 and T2 journal individual path snapshots in parallel, thereby increasing concurrency; MQFS merges the two paths at the checkpoint phase, applying the newest file system directory tree to the file system area.

6.4 Handling Block Reuse Across Multi-Queue

A challenge of MQFS is handling block reuse. Like Ext4, MQFS supports both data and ordered metadata journaling. The tricky block reuse case becomes more challenging in MQFS. In metadata journaling, the file system journals only the file system metadata and lets the user data blocks bypass the journal area. Problems arise from the bifurcated paths.

For example, the file system first journals a directory entry (the content of directories is considered metadata) \(\rm JD_{1}\). Then the user deletes the directory, freeing the block of \(\rm D_{1}\). Later, the file system reuses \(\rm D_{1}\) and writes some user data to it, bypassing journal. Assume a crash happens at this time. The recovery replays the log which overwrites \(\rm D_{1}\) with the old directory entry \(\rm JD_{1}\). As a result, the user data is filled with the content of the directory entry and is thus corrupt. To address this block reuse problem, classic journaling adds revocation record JR to avoid replaying the revoked \(\rm JD_{1}\).

Unfortunately, directly applying the JR to MQFS cannot solve this problem. Using the same example, as shown in Figure 13(a), suppose journal area 1 runs out of space and performs a checkpoint on \(\rm JD_{1}\). At this time, journal area 2 receives a \(\rm JR\) on \(\rm D_{1}\), indicating the previous \(\rm JD_{1}\) of journal area 1 cannot be replayed. After that, the file system reuses the \(\rm D_{1}\) and submits a user block directly to the file system area, bypassing the journal. Assume the \(\rm JD_{1}\) is successfully checkpointed and a crash happens before the persistence of the later \(\rm D_{1}\). During crash recovery, the JR does not take any effect because the \(\rm JD_{1}\) is already written back to the file system area. As a result, the user still sees the incorrect data block (i.e., the old directory entry).

Fig. 13.

The root cause of this issue is the following: though the JR synchronizes the journal area and file system area, it is unable to coordinate the journal areas across multiple queues. Hence, MQFS uses the per-core radix trees for synchronization, writing the JR record selectively.

Specifically, as shown in Figure 13(b), there are two cases when the file system is about to submit a JR record: (1) the reused block is being checkpointed by the journal area 1 and (2) the reused block is not yet checkpointed. In the first case, the JR record is canceled and MQFS regresses to data journaling mode for \(\rm JD_{1}\) and journals the \(\rm JD_{1}\) for correctness; even \(\rm D_{1}\) is a user data block. In the second case, the JR record is accepted by journal area 2; the radix tree removes associated JH that is older than the JR. The next checkpoint of journal area 1 therefore ignores \(\rm JD_{1}\).

6.5 Crash Recovery

Graceful shutdown. At a graceful shutdown (e.g., umount), MQFS waits for the completion of all in-progress transactions before detaching from ccNVMe. This ensures that MQFS does not rely on any information from ccNVMe for replaying the journal and ensuring crash consistency.

Sudden crash. MQFS performs crash recovery in the unit of transaction. It first reads the P-SQ from ccNVMe to find the unfinished transactions; MQFS discards these transactions. For committed transactions, MQFS links the transactions ordered by the transaction ID from all journal areas, and replays them sequentially, the same as in the classic single-compound journaling of Ext4.

7 Implementation Details

We implement ccNVMe in the Linux kernel 4.18.20 as a loadable kernel module by extending the nvme module, which is based on the NVMe 1.2 spec [41] (circa Nov. 2014). We look at the newest 1.4c spec [43] (circa Mar. 2021); there is no change in the command processing steps, common command fields, nor in the specific write command in this version, and thus the ccNVMe design can be also applied to the newest NVMe.

Using ioremap_wc, ccNVMe remaps the PMR region from a PMR-enabled SSD to enable the write combining feature of the CPU on this region. ccNVMe uses memcpy_fromio for MMIO read and memcpy_toio for MMIO write. As our tested SSDs have not enabled PMR yet, we use an indirect approach to evaluate ccNVMe, as depicted in Figure 14.

Fig. 14.

In the ideal implementation (Figure 14(a)), a request requires one round trip to the Test SSD. Our implementation (Figure 14(b)) uses a PMR SSD to wrap a Test NVMe SSD as a PMR-enabled one. In particular, MQFS first submits the request to ccNVMe (① of Figure 14). ccNVMe then forwards the request to the Test SSD through the original NVMe driver (② of Figure 14), after it performs queue and doorbell operations on the PMR SSD. When the request finishes (③ of Figure 14), ccNVMe rings the doorbell (if desired) on the PMR SSD (④ of Figure 14) before returning to MQFS. In our implementation, the MMIO operations over PCIe (i.e., ①, ②, and ⑥ of Figure 5) are duplicated; one to PMR SSD and another to the Test SSD. The block I/O and MSI-X (i.e., ③, ④, and ⑤ of Figure 5) remains one from the Test SSD. Therefore, the evaluation atop our implementation can reflect the least performance and the same consistency of the ideal implementation.

Recall that ccNVMe over RDMA uses stream to distinguish transactions that come from different cores (Section 5.1). A key implementation is how to transfer the stream ID over the network. We use the third dword (4 bytes) which is a reserved field of NVMe I/O command to deliver the stream ID. The target can thus fetch the stream ID from the reserved field.

8 EVALUATION OF ccNVMe OVER PCIe

In this section, we first describe the setups of our test environment (Section 8.1). Next, we examine the performance of transaction processing of ccNVMe (Section 8.2) and evaluate MQFS against the state-of-the-art journaling file systems through microbenchmark (Section 8.3) and macrobenchmark (Section 8.4). Then, we perform a deep dive into understanding how different aspects of ccNVMe and MQFS contribute to its performance gains (Section 8.5). Finally, we verify the crash consistency of MQFS in the face of a series of complex crash scenarios (Section 8.6).

8.1 Experimental Setup

Hardware. We conduct all experiments in a server with two Intel E5-2680 V3 CPUs; each CPU has 12 physical cores and its base and max turbo frequency are 2.50 GHz and 3.50 GHz, respectively. We use three SSDs; their performance is presented in Table 3. The PMR SSD has 2 MB PMR and its PMR performance is presented in Figure 7.

Table 3.

Name	Seq. Bandwidth	Rand. IOPS	4-KB Latency
Intel flash 750 NVMe	Read: 2.2 GB/s Write: 0.95 GB/s	Read: 430K Write: 230K	Read: 20 \(\mu\)s Write: 20 \(\mu\)s
Intel Optane 905P NVMe	Read: 2.6 GB/s Write: 2.2 GB/s	Read: 575K Write: 550K	Read: 10 \(\mu\)s Write: 10 \(\mu\)s
Intel Optane DC P5800X\(^{1}\)	Read: 7.2 GB/s Write: 6.2 GB/s	Read: 1.5M Write: 1.5M	Read: 5 \(\mu\)s Write: 5 \(\mu\)s

Table 3. NVMe SSDs Performance

\(^{1}\)This is a PCIe 4.0 SSD. On our PCIe 3.0 server, its sequential read/write bandwidth and random read/write IOPS are 3.3 GB/s, 3.3 GB/s, 850K, and 820K, respectively. Its 4-KB random read/write latency through the Linux kernel NVMe stack is 8 \(\mu\)s and 9 \(\mu\)s, respectively.

Compared systems. For the performance of atomicity guarantees, we compare ccNVMe against the classic approach (e.g., JBD2) and Horae’s approach [27]. For file system and application performance, we compare MQFS against Ext4, Ext4-NJ, and HoraeFS [27], a state-of-the-art journaling file system optimized for NVMe SSDs. Ext4 is mounted with default options. To show the ideal performance upper bound, we disable the journaling in Ext4 and refer this setup to Ext4-NJ. Note that all the tested file systems are based on the same code base of the Ext4, share the same OS, use metadata journaling, and use 1 GB journal space in total.

8.2 Transaction Performance

This section evaluates the transaction performance of different approaches: the classic that writes a journal description block and journaled blocks followed by writing a commit record; the Horae one that removes the ordering points of the classic one; and the ccNVMe one that packs the journal description block and the journaled blocks as a single transaction. During the test, we vary the number of threads and the size of a transaction. Each transaction consists of several random 4 KB requests. Each thread performs its own transactions independently. Figure 15 reports the results.

Fig. 15.

Single-core performance. ccNVMe-atomic outperforms the classic and Horae by 3\(\times\) and 2.2\(\times\) averagely in a single core, as shown in Figure 15(a). Compared to the classic and Horae, ccNVMe achieves 1.5\(\times\) and 1.2\(\times\) throughput when we wait for the durability of the transactions. Through inspecting the I/O utilization via the iostat tool, we observe that even with a 64 KB write size, the classic and Horae drives only 62% and 63% the bandwidth while ccNVMe achieves 93% (Figure 15(b)). ccNVMe-atomic does not expose the latency of block I/O and interrupt handler to the transaction processing, thus achieving higher throughput and I/O utilization. Moreover, ccNVMe removes the ordering points in transaction processing as in Horae, and reduces the traffic (i.e., the commit block and doorbell MMIOs) over PCIe, therefore outperforming its peers.

Multi-core performance. We extend the single-core measurements to use up to 12 threads, and each issues 4 KB atomic writes. Figure 15(b) presents the results. We highlight two takeaways here. First, by decoupling atomicity from durability, ccNVMe-atomic saturates the bandwidth using only two cores, while others need at least eight cores (Figure 15(d)). Second, when the load is high (i.e., over eight cores), all approaches are able to saturate the bandwidth by issuing independent transactions. However, as ccNVMe eliminates the commit block and reduces the MMIOs, ccNVMe still brings 50% TPS gain against the classic and Horae (Figure 15(c)).

8.3 File System Performance

We examine the throughput and latency of the file systems. Here, we use FIO [2] to issue append write followed by fsync or fdataatomic, which always trigger metadata journaling. During the test, we vary the size of each write request and the number of threads. Figure 16 shows the results.

Fig. 16.

Single-core performance. From Figure 16(a), we observe that MQFS exhibits 2.1\(\times\), 1.9\(\times\), and 1.2\(\times\) throughput averagely against Ext4, HoraeFS, and Ext4-NJ, respectively, in a single core. As presented in Figure 16(b), the fsync latency decreases by 56%, 41%, and 24% on average, when we use MQFS against when we use Ext4, HoraeFS, and Ext4-NJ, respectively. From the error bar, we find that MQFS delivers more stable latency. Here, we find that the SSD’s bandwidth is not fully saturated by the single thread. Unlike HoraeFS that uses a dedicated thread to perform journaling, MQFS performs journaling at the application’s context to avoid context switch, and scales the journaling to multi-queue, thereby increasing the throughput and decreasing the latency. Due to the elimination of the ordering points in journaling, MQFS overlaps the CPU and I/O processing, and thus prevails Ext4-NJ.

Multi-core performance. In Figure 16(c), when the number of threads is lower than 12, MQFS exhibits up to 2.4\(\times\), 1.5\(\times\), and 1.1\(\times\) throughput gain against Ext4, HoraeFS, and Ext4-NJ, respectively. As shown in Figure 16(d), MQFS reduces the average fsync latency by 55.6% and 28% averagely when compared to Ext4 and HoraeFS, respectively. Here, the reasons are a little different. First, fsync calls from different threads are likely to contend for the same metadata block. In Ext4 and HoraeFS, the accesses to the same block are serialized by the block-level lock. MQFS copies out the metadata block for journaling and thus improves the I/O concurrency. Second, in MQFS, when a thread performs checkpointing, except for necessary version comparison of its local transaction ID with the global one on the global radix trees, it does not block other threads. When the number of threads grows over 12, MQFS saturates the throughput; it achieves 68% the throughput of Ext4-NJ, and outperforms Ext4 and HoraeFS by 2\(\times\) and 1.5\(\times\), respectively. The major bottleneck here is shifted to the write traffic over PCIe. As MQFS does not need the journal commit block and reduces the MMIOs using write combining, it provides higher throughput.

Decoupling atomicity from durability. From Figure 16, we also observe MQFS-atomic further improves performance over MQFS and Ext4-NJ. The improvements come from two aspects. First, ccNVMe itself decouples atomicity from durability; built atop ccNVMe, MQFS guarantees atomicity once the atomic requests are inserted into the hardware queue (i.e., ① and ② of Figure 5), which is very fast (more details in Section 8.5.2). Second, compared to Ext4-NJ, the threads of MQFS need not synchronize on the shared page, and thus insert the requests independently, efficiently using the CPU cycles.

8.4 Application Performance

We now evaluate MQFS performance over the I/O intensive Varmail [55], and both CPU and I/O intensive RocksDB [9].

Varmail. Varmail is a metadata and fsync intensive workload from Filebench [55]. Here, we use the default configuration of Varmail. Figure 17(a) plots the results.

Fig. 17.

In SSD A, MQFS achieves 2.4\(\times\), 1.2\(\times\), and 0.9\(\times\) the throughput of Ext4, HoraeFS, and Ext4-NJ, respectively. In the faster SSD B, MQFS outperforms Ext4 and HoraeFS by 2.6\(\times\) and 1.1\(\times ,\) respectively; MQFS achieves comparable throughput compared to Ext4-NJ. The improvement of MQFS comes from the following aspects. First, in SSD A, all HoraeFS, Ext4-NJ, and MQFS are bounded by the I/O. Compared to HoraeFS, MQFS eliminates the journaling commit block and reduces the persistent MMIOs and thus provides higher throughput. Second, in the faster SSD B, I/O is no longer the bottleneck for HoraeFS and Ext4-NJ. Varmail contains many persistent metadata operations such as creat and unlink followed by fsync. MQFS parallelizes the I/O processing of the metadata blocks by metadata shadow paging while Ext4-NJ and HoraeFS serialize the accesses to the shared metadata blocks. Consequently, MQFS utilizes the CPU more efficiently to fully drive the SSD and thus provides higher throughput.

RocksDB. RocksDB is a popular key-value store deployed in several production clusters [9]. We deploy RocksDB atop the tested file systems and measure the throughput of the user requests. Here, we use db_bench, a benchmark tool of RocksDB to evaluate the file system performance under the fillsync workload, which represents the random write-dominant case. During the test, the benchmark launches 24 threads, and each issues 16-byte key and 1024-byte value to a 20 GB dataset. Figure 17(b) shows the result.

In SSD A, MQFS prevails EXT4-NJ and HoraeFS by 40%. In SSD B, the throughput increases by 66%, 36%, and 28% when we use MQFS against when we use Ext4, HoraeFS, and Ext4-NJ, respectively. MQFS overlaps the I/O processing of the data, metadata, and journaled blocks, and reduces the cache line flushings over the PCIe. Therefore, MQFS significantly reduces the CPU cycles spent on idle-waiting (i.e., block I/O) or busy-waiting (i.e., MMIO) for I/O completion. This in turn reserves more CPU cycles for RocksDB and file system logic. During the test, we observe that MQFS has 5\(\times\) higher CPU utilization (i.e., the CPU cycles consumed in kernel space) and RocksDB atop MQFS has 2\(\times\) higher CPU utilization (i.e., the CPU cycles consumed in user space). Moreover, the MQFS does not need a commit record, which not only reduces the number of block I/Os that need extra CPU operations (e.g., memory allocation), but also removes the context switches introduced by the interrupt handler. As a result of higher CPU and I/O efficiency, MQFS outperforms its peers on RocksDB which is both CPU and I/O intensive.

8.5 Understanding the Performance

In this section, we evaluate how various design techniques of MQFS contribute to its performance improvement.

8.5.1 Performance Contribution.

Now, we show that each of the design techniques of MQFS, i.e., ccNVMe (Section 4), the multi-queue journaling (Section 6.2), and the metadata shadow paging (Section 6.3), are essential to improve the performance. The test increases the number of threads, and each issues 4 KB write followed by fsync on a private file. We choose Ext4 as the baseline because MQFS is implemented atop Ext4.

Figure 18(a) shows the result on an Optane 905P SSD. The ccNVMe (+ccNVMe) contributes to performance significantly; it achieves approximately 1.4\(\times\) the throughput of the baseline. Atop the ccNVMe, the multi-queue journaling (+MQJournal) further makes around 47% gains averagely. The metadata shadow paging (+MetaPaging) shows a further 23% throughput improvement. The above results therefore indicate that all of the three building blocks are indeed necessary to improve the performance.

Fig. 18.

We further quantify the effect of each technique in a faster SSD. The results are shown in Figure 18(b). We find that the advantages of ccNVMe become more obvious; the ccNVMe increases the throughput by up to 2.1\(\times\). This is because when I/O becomes faster, both the CPU and I/O efficiency become dominant factors affecting performance. ccNVMe removes context switches and decouples atomicity from the durability, thereby accelerating the rate at which the CPU dispatches requests to the device. Moreover, ccNVMe reduces the block I/O and MMIO traffic over PCIe, making more bandwidth for file system usage. The MQJournal also boosts the throughput by 53% averagely with varying the threads. When enabling MetaPaging, the throughput increases by 20% on average. This suggests that scaling the I/Os of the journaling and decoupling the atomicity from the durability to parallelize CPU and I/Os bring significant performance improvement. We also notice that the benefit of ccNVMe becomes narrow beyond eight threads. This is because when the number of threads exceeds eight, the multicore scalability of the traditional journaling becomes the major performance bottleneck that hides the benefits of ccNVMe.

8.5.2 Decomposing the Latency.

In this section, we investigate the file system internal procedure to understand the performance of MQFS against Ext4-NJ. The test initiates one thread and repeats the following operations: it first creates a file, and then writes 4 KB data to the file, ending with calling fsync on the file. As shown in the topmost of Figure 19(a), for each fsync, MQFS starts a transaction, searches the dirty data blocks, and allocates space for them (S-iD), followed by sending the blocks to ccNVMe. After that, MQFS processes the file metadata (S-iM) and the parent directory (S-pM) with similar procedures. Next, it constructs and submits the journal description block that contains the transaction ID and the mapping from the home logical block address to the journal logical block address of the journaled data (S-JH). It finally waits for the completion and durability of these blocks (W-x). The table below presents the average time (in nanoseconds) spent on each function. Similarly, Figure 19(b) presents the fsync path of Ext4-NJ, which synchronously processes each type of data block without journaling.

Fig. 19.

MQFS decreases the overall latency by 42% compared to Ext4-NJ. This improvement comes from two aspects: higher CPU and I/O efficiency. First, the CPU is used more efficiently in MQFS. With ccNVMe, MQFS continuously submits the iD, iM, pM, and JH, without leaving the CPU in an idle state waiting for the I/O like Ext4-NJ. From the figure, we can see that atomicity guarantee (i.e., fatomic) costs only around 10 \(\mu\)s. Second, the I/O in MQFS is performed with higher efficiency. MQFS queues more I/Os to the storage, taking full advantage of the internal data parallelism of the SSD. Nevertheless, the fatomic and fsync can be further improved according to our analysis. The first is the block layer which is still relatively heavyweight for today’s ultra-low latency SSD; for example, S-iM still costs more than 1 \(\mu\)s to pass the request. The second is in S-iD, where Ext4 introduces non-negligible overhead for searching the dirty blocks and allocating space.

8.6 Crash Consistency

We test if MQFS recovers correctly in the face of unexpected system failures. We use CrashMonkey [37], a black-box crash test tool, to automatically generate and perform 1,000 tests for each workload. We run four workloads to cover several error-prone file system calls including rename; the generic workloads are from xfstest. Table 4 reports the results. As MQFS always packs the target files of a file operation into a single transaction for atomicity, it passes all 1,000 test cases.

Table 4.

Workload	Brief Description	Crash Points
Workload	Brief Description	Total	Passed
create_delete	create() and remove() on files.	1,000	1,000
generic 035	rename() overwrite on existing files and directories. From xfstest 035.	1,000	1,000
generic 106	link() and unlink() on files, remove() directory. From xfstest 106.	1,000	1,000
generic 321	Various directory fsync() tests. From xfstest 321.	1,000	1,000

Table 4. Crash Consistency Test

9 EVALUATION OF ccNVMe OVER RDMA

In this section, we first describe the setups of our test environment for ccNVMe over RDMA (Section 9.1). Next, we examine the performance of transaction processing of ccNVMe (Section 9.2) and evaluate MQFS against the state-of-the-art journaling file systems through microbenchmark (Section 9.3) and macrobenchmark (Section 9.4). We finally perform a dive into understanding performance gains (Section 9.5).

9.1 Experimental Setup

Hardware. We conduct all experiments in two physical servers; one is the initiator and the other is the target server. Each server has two Intel Xeon Gold 5220 CPUs. Each CPU has 18 physical cores and its base and max turbo frequency are 2.20 GHz and 3.90 GHz, respectively. We test two kinds of SSDs, the Intel Optane 905P and Optane DC P5800X SSDs. We use 2 MB PMR for each SSD. The servers connect to each other with a 200 Gbps Mellanox ConnectX-6 RDMA NIC.

Compared systems. For the performance of atomicity guarantees, we compare ccNVMe against the classic journaling approach (e.g., JBD2). For file system and application performance atop NVMe over RDMA, we compare MQFS against Ext4 and Ext4-NJ which disables journaling in Ext4. Ext4 is mounted with default options. As Horae does not support RDMA networks, we do not include its performance on the NVMe over RDMA setup. Note that all the tested file systems are based on the same code base of the Ext4, share the same OS, use metadata journaling, and use 1 GB journal space in total.

9.2 Transaction Performance

This section evaluates the transaction performance of different approaches: the classic that writes a journal description block and journaled blocks followed by writing a commit record; the ccNVMe one that persists the journal description block and the journaled blocks in parallel. During the test, we keep the size of the transaction data to 4 KB and vary the number of threads. Each thread performs its own transactions independently. We also increase the number of target SSDs to two and dispatch dependent transactions across two SSDs. As shown in Figure 20(a), on Optane 905P SSD, ccNVMe outperforms the classic by 2.6\(\times\) and 1.8\(\times\) when the number of thread is 1 and 24, respectively. By decoupling atomicity from durability, ccNVMe-atomic further increases the throughput by 11.9\(\times\) compared to ccNVMe when the number of thread is 1. A similar trend of the performance improvement can also be found in Optane P5800X SSD (Figure 20(b)) and multiple SSDs (Figure 20(c)).

Fig. 20.

Through inspecting the I/O utilization via the iostat tool, we observe that even with 24 threads, the classic drives only half the bandwidth while ccNVMe achieves 99% of the hardware bandwidth with a single thread (Figure 20(d)). These improvements come from both the ccNVMe over PCIe and RDMA designs: first, ccNVMe removes the ordering points and commit record within a transaction, thus achieving higher I/O throughput; second, ccNVMe reduces unnecessary PCIe traffic (i.e., MMIO operations and DMAs) by transaction-aware doorbell and I/O coalescing techniques, thereby improving the CPU efficiency (the maximum I/O throughput each CPU core can serve). An interesting observation is that the CPU becomes the bottleneck in ccNVMe-atomic when I/O becomes dramatically faster. For example, in the slower CPU, a single thread achieves 350K Tx/s atop ccNVMe-atomic over PCIe (Figure 15(c)). However, in the faster CPU, a single thread achieves 550K Tx/s atop ccNVMe-atomic over RDMA which requires more CPU cycles to process the network stack (Figure 20(c)). Moreover, ccNVMe-atomic fully exploits a single SSD with only one thread (Figure 20(a) and (b)) while it requires two threads for two SSDs (Figure 20(c)). These results indicate that reducing unnecessary PCIe and network traffic which requires more CPU cycles is indeed important for emerging fast networks and SSDs as in ccNVMe over RDMA.

9.3 File System Performance

This section uses FIO [2] to measure the throughput and latency of the file systems. To always trigger metadata journaling, the tests issue append write followed by fsync or fdataatomic. The tests vary the size of each write request and the number of threads. Figure 21 shows the results. Single-core performance. From Figure 21(a), we observe that MQFS exhibits 2.1\(\times\) and 1.4\(\times\) throughput averagely against Ext4 and Ext4-NJ, respectively, in a single core. As presented in Figure 21(b), the fsync latency of MQFS decreases by 54% and 28% on average compared to Ext4 and Ext4-NJ, respectively. During test, we find that the SSD’s bandwidth is not fully utilized by the single CPU core. MQFS performs better than its peers due to higher CPU efficiency, i.e., CPU is the bottleneck. In particular, the transaction-aware doorbell and I/O command coalescing techniques of ccNVMe over RDMA reduce unnecessary waste of CPU cycles. Moreover, MQFS performs journaling at the application’s thread which avoids context switches. These improvements in the CPU efficiency increase the throughput and decrease the average latency of MQFS; when atomicity is decoupled from atomicity (MQFS-atomic), these improvements are more pronounced.

Fig. 21.

From the error bar, we find that MQFS offers more stable latency compared to Ext4. We also notice that the standard deviation of the latency of MQFS-atomic becomes higher when the write size is large (Figure 21(b)). The major reason is the queuing delay in the RDMA send queue. As MQFS-atomic continuously pushes I/O requests to the RDMA queue without any rate limiter, the queue quickly becomes ful and thus some later requests must wait for a relatively long period to get free queue entries. By adding a congestion control mechanism in the future work, the latency of MQFS-atomic can be less variable.

Multi-core performance. Figure 21(c) shows that MQFS exhibits 3.2\(\times\) and 1.3\(\times\) throughput gain averagely against Ext4 and Ext4-NJ, respectively. As shown in Figure 21(d), MQFS reduces the average fsync latency by 69.2% and 20.3% averagely compared to Ext4 and Ext4-NJ, respectively. The main reasons come from two aspects: concurrent journaling and overlap between CPU and I/O processing. First, MQFS scales the journaling processing to multiple cores, reducing the contention on the shared metadata block; this improves the CPU concurrency and I/O throughput. Second, due to the separation of atomicity from durability, MQFS dispatches the next transaction immediately after previous transactions are inserted into the hardware queue; this overlaps the CPU and I/O processing and increases the I/O throughput especially when the number of threads is low.

9.4 Application Performance

This section evaluates the performance of MQFS atop applications including Varmail [55] and RocksDB [9].

Varmail. Varmail is workload from Filebench [55] that contains intensive metadata and fsync operations. We use the default configuration of Varmail and plot the results in Figure 22(a).

Fig. 22.

In SSD A, MQFS achieves 2.63\(\times\) the throughput of Ext4. In the faster SSD B, MQFS outperforms Ext4 by 2.91\(\times\). MQFS achieves comparable throughput compared to Ext4-NJ in both SSDs. Compared to Ext4, MQFS eliminates the journaling commit block and reduces the PCIe traffic and hence provides higher throughput. Compared to Ext4-NJ, MQFS parallelizes the I/O processing of the metadata blocks by metadata shadow paging while Ext4-NJ serializes the accesses to the shared metadata blocks. Consequently, MQFS utilizes the CPU more efficiently to drive the SSD and thus offers higher throughput.

RocksDB. RocksDB is a popular key-value store deployed in several production clusters [9]. We deploy RocksDB atop the tested file systems and measure the throughput of the user requests. We use db_bench, a benchmark tool of RocksDB, to evaluate the file system performance under the fillsync workload which generates random write file operations. During the test, the benchmark launches 24 threads, and each issues 16-byte key and 1,024-byte value to a 20 GB dataset. Figure 22(b) shows the result.

MQFS outperforms Ext4 by 1.81\(\times\) and 1.62\(\times\) in SSD A and B, respectively. MQFS delivers similar throughput against Ext4-NJ. RocksDB demands both CPU and I/O resources. MQFS provides higher CPU and I/O efficiency and thus outperforms its peers. First, ccNVMe reduces the CPU cycles spent on the device driver by transaction-aware doorbell and I/O coalescing; this in turn reserves more CPU cycles for the RocksDB and file system logic. Second, MQFS overlaps the I/O processing of the data, metadata, and journaled blocks, improving the I/O concurrency of the RocksDB compaction.

9.5 Understanding the Improvements

This section investigates the file system internal procedure to understand the performance of MQFS against Ext4-NJ. Only one thread is initiated and repeats the following operations: it first creates a file, and then writes 4 KB data to the file, ending with calling fsync on the file. As shown in the topmost of Figure 23(a), each fsync request of MQFS starts a transaction, searches the dirty data blocks, and allocates space for them (S-iD), followed by sending the blocks to ccNVMe over RDMA. After that, MQFS processes the file metadata (S-iM) and the parent directory (S-pM) with similar procedures. Next, it constructs and submits the journal description block that contains the transaction ID and the mapping from the home logical block address to the journal logical block address of the journaled data (S-JH). It finally waits for the completion and durability of these blocks (W-x). The table below presents the average time (in nanoseconds) spent on each function. Similarly, Figure 23(b) presents the fsync path of Ext4-NJ, which synchronously processes each type of data block without journaling.

Fig. 23.

MQFS decreases the overall latency by 24% compared to Ext4-NJ. When atomicity is decoupled from durability, the latency of fatomic of the MQFS call is only 22% of the fsync call. The reasons for performance improvements are similar to the ccNVMe over PCIe. We also notice that the benefits of decoupling atomicity from durability are more significant in the networked storage stack compared to the local storage stack.

10 Related Work

Crash-consistent file systems. Recent works have optimized crash-consistent storage systems, in particular the journaling file systems [5, 7, 8, 10, 12, 17, 20, 24, 25, 27, 30, 38, 48, 53, 56]. These systems rely on the classic journaling over NVMe to provide failure atomicity, which waits for the completion of several PCIe round trips. Built atop ccNVMe, MQFS, however, achieves the atomicity guarantee by using only two persistent MMIOs. This increases the throughput as well as decreases the latency, since ccNVMe conceals the PCIe transfer and interrupt handler overhead to the file system for atomicity, and reduces the traffic (e.g., the commit record) over PCIe. We next discuss the comparison of the techniques (i.e., multi-queue journaling and metadata shadow paging) against these journaling file systems at length.

One category of these works [17, 30, 48, 49] is to improve the multicore scalability by partitioning the journal into multiple micro journals, which is similar to multi-queue journaling. The differences lie in the control flow of each micro journal and the coordination among micro journals.

First, IceFS [30], SpanFS [17], CCFS [49], and iJournaling [48] introduce extra write traffic (e.g., the commit record) and expensive ordering points (e.g., the FLUSH). Partitioning amplifies the extra write traffic, as it prevents multiple transactions from sharing a commit record to amortize the write traffic. MQFS, however, does not need ordering points and extra write traffic, by taking the free rides of the NVMe doorbell operations.

Second, the virtual journals of IceFS share a single physical log and need to suspend logging and serialize checkpointing when making free space. SpanFS allows each domain to perform checkpointing in parallel, but needs to maintain a global consistency over multiple domains (i.e., building connections across domains) at logging phase; this introduces extra synchronization overhead and write traffic. iJournaling preserves the legacy single-compound journal, and may need to synchronize the compound journal and the file-level journals during checkpointing. MQFS instead uses scalable in-memory indexes for higher runtime scalability, and detects conflicts during checkpointing and recovery.

Another category [7, 27, 56] is to decouple the ordering from durability, thereby removing the ordering points of journaling. ccNVMe naturally decouples the transaction ordering from durability when queuing requests (the in-order completion in Section 4.4). ccNVMe further decouples a stronger property, the atomicity, from the durability, providing a clearer post-crash state. In addition, MQFS differs from them in the multicore scalability and the handling of page conflicts.

First, OptFS, BarrierFS, and HoraeFS use only one thread to commit transactions. Hence, the throughput is bounded by the single thread and the latency increases due to communication (e.g., context switches) between the application and JBD2 thread. In contrast, MQFS scales the journaling to multiple hardware queues and performs it in the applications’ context, to fully exploit the concurrency provided by the SSD and the multi-core CPUs of modern computers.

Second, in BarrierFS and HoraeFS, a running transaction with a conflict page cannot be committed until the dependent transactions have made this page durable. This serializes the committing phase on transactions sharing the same page. MQFS uses metadata shadow paging to handle page conflict, thereby increasing the I/O concurrency of the committing phase.

ScaleFS [5] logically records the directory changes in per-core operation logs for running transactions and merges these logs during committing. The hybrid-granularity journaling of CCFS associates byte-range changes with the running transaction and super-imposes the delta on the global block when transaction committing starts. These designs are orthogonal to the metadata shadow paging, and can be applied to MQFS to concurrently buffer the in-memory changes for the running transaction before committing.

ccNVMe does not provide isolation as in TxOS [50]. Instead, we leave the isolation to upper layer systems since there are various levels of isolation and different systems have their own demand for isolation. Providing isolation at upper layer systems is orthogonal to ccNVMe’s design.

Transactional storage. A school of works provide atomicity interfaces at disk level [6, 15, 18, 31, 47, 51, 52]. They can achieve higher performance than ccNVMe by leveraging the features of storage media (e.g., copy-on-write of NAND flash). Yet, they require extensive hardware changes and it remains unclear whether similar designs can be applied to emerging Optane memory-based SSDs. ccNVMe requires the storage to enable the standard PMR, which is relatively simple (by using capacitor-backed DRAM or directly exposing a portion of the persistent Optane memory). ccNVMe does not limit the number of concurrent atomic writes as long as the hardware queue is available, while in transactional SSD this is limited by the internal resources (e.g., device-side CPU).

Byte-addressable SSD. Flatflash [1] exploits the byte addressability of SSD for a unified memory-storage hierarchy. Bae et al. [3] design an SSD with dual byte and block interfaces and simply put the database logging atop its SSD. Coinpurse [57] uses the PMR to expedite non-aligned writes from the file systems. Horae [27] builds dedicated ordering interfaces atop PMR. The SPDK community supported the PMR feature on April 4, 2021 [58]. In this work, we use PMR to extend NVMe for efficient crash consistency.

Remote storage access. Researchers [11, 13, 22, 23, 26, 36] have studied and improved the normal (non-crash-consistent) I/O path of the networked storage. To guarantee the crash consistency, their works still rely on the traditional approaches (e.g., journaling atop NVMe) as in the Linux NVMe over RDMA stack. ReFlex [23] leverages the hardware virtualization techniques to provide a kernel-bypass data path. i10 [13] improves the throughput of NVMe over TCP stack by dedicated end-to-end I/O path and delayed doorbells. LeapIO [26] uses ARM-based SOCs to offload the storage stack and expose NVMe SSDs to both the local and remote virtual machines. Gimbal [36] uses SmartNIC-based JBOFs to design a software storage switch that supports efficient multi-tenancy. Our design can be applied to their work to accelerate the crash-consistent I/O path (e.g., fsync).

11 Conclusion

We present ccNVMe, a new approach to achieve high performance and crash consistency simultaneously in storage systems. By coupling the crash consistency to the data dissemination and decoupling atomicity from durability, ccNVMe ensures atomicity guarantee with only two lightweight MMIOs and therefore improves the performance. We introduce MQFS to fully exploit the ccNVMe, showing that MQFS successfully saturates the SSD’s bandwidth with fewer CPU cores and outperforms state-of-the-art file systems.

Acknowledgments

We sincerely thank the anonymous reviewers for their valuable feedback.

References

[1]

Ahmed Abulila, Vikram Sharma Mailthody, Zaid Qureshi, Jian Huang, Nam Sung Kim, Jinjun Xiong, and Wen-mei Hwu. 2019. FlatFlash: Exploiting the byte-accessibility of SSDs within a unified memory-storage hierarchy. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). Association for Computing Machinery, New York, NY, 971–985. DOI:

Abstract

1 Introduction

2 NVMe

2.1 NVMe Over PCIe

2.2 NVMe Over RDMA

3 Motivation

3.1 Evaluation of Journaling atop NVMe over PCIe

3.2 Issues of Journaling atop NVMe over RDMA

4 ccNVMe OVER PCIe

4.1 Overview

4.2 Transaction: The Basic Operational Unit

4.3 Transaction-Aware MMIO and Doorbell

4.4 Correctness and Crash Recovery

4.5 Programming Model

4.6 Discussion

5 ccNVMe OVER RDMA

5.1 Overview

5.2 Transaction-Aware Doorbell

5.3 I/O Command Coalescing

5.4 Separating Atomicity from Durability

5.5 Discussion of Porting ccNVMe Over PCIe to TCP

6 MQFS: the Multi-Queue File System

6.1 Overview

6.2 Multi-Queue Journaling

6.3 Metadata Shadow Paging

6.4 Handling Block Reuse Across Multi-Queue

6.5 Crash Recovery

7 Implementation Details

8 EVALUATION OF ccNVMe OVER PCIe

8.1 Experimental Setup

8.2 Transaction Performance

8.3 File System Performance

8.4 Application Performance

8.5 Understanding the Performance

8.5.1 Performance Contribution.

8.5.2 Decomposing the Latency.

8.6 Crash Consistency

9 EVALUATION OF ccNVMe OVER RDMA

9.1 Experimental Setup

9.2 Transaction Performance

9.3 File System Performance

9.4 Application Performance

9.5 Understanding the Improvements

10 Related Work

11 Conclusion

Acknowledgments

References

Cited By

Index Terms

Recommendations

Crash Consistent Non-Volatile Memory Express

Optimization of RocksDB for Redis on Flash

montage: NVM-based scalable synchronization framework for crash-consistent file systems

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

HTML Format

Get Access

Login options

Full Access

Figures

Other

Share