
D2Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage

Published: 14 September 2024

Abstract

LSM-based key-value stores suffer from sub-optimal performance due to their slow and heavy background compactions. Compaction brings severe CPU and network overhead on high-speed disaggregated storage. This article further reveals that data-intensive compression in compaction consumes a significant portion of CPU power. Moreover, multi-threaded compactions cause substantial CPU contention and network traffic during high-load periods. Based on the above observations, we propose fine-grained dynamic compaction offloading that leverages the modern Data Processing Unit (DPU) to alleviate the CPU and network overhead. To achieve this, we first customize a file system to enable efficient data access for the DPU. We then leverage the Arm cores on the DPU to absorb the burst CPU and network requirements, reducing resource contention and data movement. We further employ dedicated hardware-based accelerators on the DPU to speed up the compression in compactions. We integrate our DPU-offloaded compaction with RocksDB and evaluate it with NVIDIA's latest BlueField-2 DPU on a real system. The evaluation shows that the DPU is an effective solution to solve the CPU bottleneck and reduce the data traffic of compaction. The results show that compaction performance is accelerated by 2.86 to 4.03 times, system write and read throughput is improved by up to 3.2 times and 1.4 times respectively, and host CPU contention and network traffic are effectively reduced compared to the fine-tuned CPU-only baseline.

1 Introduction

Persistent key-value stores (KVSs) are widely used as back-end storage engines for various data center applications [24, 34]. Popular KVSs, such as LevelDB [20], and RocksDB [18], adopt a log-structured merge tree (LSM-tree) [37] as their backbone index. LSM-tree achieves excellent write performance by batching writes in memory and then flushing them to storage sequentially. To guarantee read performance and space efficiency, LSM-tree runs periodic compactions, which merge multiple files into ordered files and delete duplicates. However, this comes at the cost of dramatic performance fluctuations and long-tail latency during compactions, since they compete with foreground workloads for CPU and IO resources and cause write stalls. Furthermore, with the prevalence of disaggregated storage in data centers, compaction introduces a large amount of data movement between storage and compute nodes, further burdening the network (see Section 2.2.2).
Previous works [3, 6, 7, 38, 43, 44] have attempted to overcome this problem either by restricting IO resources for compaction, scheduling compaction during low-load periods, or offloading compaction to remote CPU servers. However, their attempts were based on the assumption that IO was the bottleneck. As high-speed storage devices and networks are deployed in the data center, the compaction bottleneck gradually shifts from IO to CPU [17, 29, 46]. The compaction performance is bounded by the computations, and leveraging multi-threaded programming consumes a lot of CPU cycles and results in performance interference with foreground workloads. To address this issue, X-engine [46] employs FPGA to accelerate compactions. However, FPGA significantly increases the cost and the development challenges. It also lacks the flexibility to adapt to complex data formats and diverse compaction strategies. Furthermore, this solution still performs the compaction on the host, which introduces a lot of data movement under the disaggregated architecture. Hence, the community dreams of a more cost-effective and flexible computational device for offloading in production environments.
Fortunately, cloud vendors have been incorporating data processing units (DPUs) [9, 10, 33] into their storage infrastructure in recent years to improve energy efficiency. As a new member of the heterogeneous processor family, the DPU balances cost, programmability, and performance, which provides new opportunities for compaction offloading. The general-purpose Arm cores on the DPU facilitate compaction offloading, releasing host CPU resources for user functions. Meanwhile, efficient ASIC accelerators can speed up data-intensive computations to further boost compaction performance. Furthermore, the low cost and low power consumption of DPUs make them ideal hardware for building energy-efficient storage servers [22, 35]. This provides an opportunity for near-data compaction, where we can offload compaction to the storage-side DPU to reduce data movement in the network.
Nevertheless, the DPU, which provides less reconfigurable hardware, raises several challenges in balancing the benefits and costs it brings. (1) Limited Processing Power. The Arm cores in the DPU are not powerful enough to process compaction as efficiently as the host CPU. (2) Unconfigurable Hardware Accelerators. The primary way the DPU improves processing efficiency is by integrating domain-specific processors that accelerate selected algorithms, but LSM compaction is typically not among them. (3) No Efficient Access to Data Files. As a low-power PCIe card, the DPU cannot directly access data files on disk without consulting the file system on the host, which can result in additional data movement. The less powerful and more isolated execution environment implies that naive offloading may lead to performance degradation.
The D \({^2}\) Comp Approach. This article presents D \({^2}\) Comp, the first solution for offloading LSM compaction to DPUs. D \({^2}\) Comp investigates LSM compaction, dissects the DPU’s hardware capabilities, and introduces several optimizations to address the aforementioned challenges. Firstly, a file system is customized for the LSM-based KVS to efficiently offload compaction to the DPU. With guaranteed data consistency, this file system enables the DPU to directly fetch files from the storage device for compaction, thereby eliminating dependence on the host CPU for data transfer and reducing redundant data movement. Additionally, a mechanism is implemented on top of the file system to synchronize the runtime context, further supporting the offload of compaction. Secondly, a resource-aware compaction scheduler is designed to adapt to the DPU with limited computing power. The scheduler prioritizes performance-critical \(L_0\) - \(L_1\) compaction on fast host cores to ensure optimal performance. It dynamically offloads \(L_2\) - \(L_n\) compaction to slow DPU cores based on their computation headroom, efficiently reducing host CPU consumption and network traffic while minimizing offload slowdown. Finally, domain-specific accelerators on the DPU are leveraged to accelerate data-intensive (de)compression during compaction and read operations. This approach reduces write stalls and enhances read performance.
We integrate D \({^2}\) Comp with RocksDB, a popular LSM-tree-based KV store [18]. We evaluate RocksDB with db_bench and YCSB workloads with varying compositions. Experimental results show that the D \({^2}\) Comp speeds up the compaction by 2.86 to 4.03 times, improves the overall write throughput and read throughput of the storage engine by up to 3.2 times and 1.3 times respectively, and it effectively reduces network traffic and CPU contention compared with the best CPU baseline. Overall, we make the following contributions:
Compaction breakdown. A detailed analysis of CPU and network overhead of LSM compaction on high-speed disaggregated storage.
DPU-aware and LSM-specialized file system (DALFS). A file system for DPU to efficiently offload LSM compaction. The file system allows the DPU to fetch files directly from the disk for compaction.
Resource-aware dispatching. A scheduler that dynamically offloads different levels of compaction to the CPU and DPU based on their compute headroom to maximize offloading revenues.
Hardware-assisted compression. A hardware acceleration module for efficiently accelerating the (de)compression step of LSM compaction and read.
Implementation and evaluation. An integration with RocksDB [18] and in-depth evaluation of D \({^2}\) Comp compared to the fine-tuned CPU-only baseline. Evaluation results using both micro- and macro-benchmarks show that our proposal increases the overall throughput and reduces network traffic and host CPU consumption.
This article is organized as follows. Section 2 introduces the background of LSM-tree KV stores, provides compaction overhead results, and discusses the opportunities and challenges of offloading compaction with DPUs. Section 3 explains the design principles and implementation details. We evaluate our design in Section 4 and provide a qualitative comparison against other hardware approaches in Section 5. Finally, we discuss related work in Section 6 and conclude in Section 7.

2 Background and Motivation

2.1 LSM-based KV Store

The LSM-tree is a data structure that is widely used in write-intensive scenarios. Here we use a widely deployed LSM-tree implementation, RocksDB [18], as an example. Figure 1 shows the single-node and disaggregated setups, respectively. Both setups consist of DRAM and SSD components, with the difference that in the disaggregated setup, the DRAM and SSD components are decoupled and reside in the compute and storage nodes, respectively. The compute node accesses the storage node via the NVMe-over-Fabrics (NVMe-oF) protocol [1].
Fig. 1. Illustration of LSM-based KV store with single node and disaggregated setup.
The DRAM component holds two skip lists called Memtable and Immutable Memtable. The SSD component contains multiple sorted string tables (SSTables) and log files. Incoming data is first inserted into a Memtable. Once the Memtable is full, it becomes a read-only Immutable Memtable that will be flushed to storage as a new SSTable. The KV pairs in an SSTable are sorted. On-disk SSTables are organized in levels, denoted as \(L_0\) , \(L_1, \ldots , L_n\) , with exponentially growing capacity between adjacent levels. The SSTables have no overlapped key ranges except those in \(L_0\) . When a given \(L_i\) becomes full, the KV pairs at \(L_i\) will be merged into \(L_{i+1}\) through compactions.
Compactions run in separate background threads. To execute a compaction at \(L_i\) , (1) the overlapping SSTables in \(L_{i}\) and \(L_{i+1}\) are read into memory. (2) These SSTables are decompressed and decoded into full KV pairs. (3) These KV pairs are sorted and merged. (4) The merged KV pairs are encoded and compressed to generate new SSTables. (5) The new SSTables are written back to \(L_{i+1}\) . Compaction deletes duplicate keys and puts unordered keys into order, so the space usage and read performance of the LSM-tree are kept at an acceptable level. In the disaggregated setup, steps (1) and (5) become network read/write operations; that is, the compute node reads SSTables from the storage node into local memory to perform merging and sorting, and then writes them back.
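To make step (3) concrete, the following minimal, self-contained C++ sketch merges several sorted runs while keeping only the newest version of each key. It illustrates the merge-and-deduplicate logic only; the IO and (de)compression of steps (1), (2), (4), and (5) are omitted, and the names are ours rather than RocksDB's internals.

// Minimal sketch of the merge-and-deduplicate step of compaction. The newest
// version of a key (largest sequence number) wins; key ordering falls out of
// std::map's sorted iteration. IO and (de)compression are omitted.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct KV { std::string key, value; uint64_t seq = 0; };  // larger seq = newer

std::vector<KV> MergeRuns(const std::vector<std::vector<KV>>& runs) {
  std::map<std::string, KV> newest;                 // key -> newest version
  for (const auto& run : runs)
    for (const auto& kv : run)
      if (!newest.count(kv.key) || newest[kv.key].seq < kv.seq)
        newest[kv.key] = kv;
  std::vector<KV> merged;
  for (const auto& entry : newest)                  // iterates in key order
    merged.push_back(entry.second);
  return merged;
}

int main() {
  // Two overlapping "SSTables": key "a" appears in both; the newer one wins.
  std::vector<std::vector<KV>> runs = {
      {{"a", "old", 1}, {"c", "3", 1}},
      {{"a", "new", 2}, {"b", "2", 2}}};
  for (const auto& kv : MergeRuns(runs))
    std::cout << kv.key << " -> " << kv.value << "\n";  // a -> new, b -> 2, c -> 3
}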
To serve read requests, RocksDB searches the MemTable first, the Immutable MemTable next, and then SSTables in \(L_0\) through \(L_n\) in order. Since SSTables in \(L_0\) contain overlapping keys, a lookup may search multiple files at \(L_0\) [6, 13, 16]. To speed up read operations, RocksDB also adds Bloom filters to each SSTable to reduce the amount of IO, which benefits both the single node and the disaggregated setup.
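The lookup order can be sketched as follows; the types and helper functions are illustrative stand-ins rather than RocksDB's actual interfaces, and the Bloom-filter probe is modeled as a simple predicate per SSTable.

// Illustrative sketch of the read path: MemTable, then Immutable MemTable,
// then levels L0..Ln. A Bloom-filter probe lets the lookup skip SSTables that
// cannot contain the key, saving IO (and network round trips when disaggregated).
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

struct SSTable {
  std::function<bool(const std::string&)> may_contain;                // Bloom filter
  std::function<std::optional<std::string>(const std::string&)> get;  // block read
};

std::optional<std::string> Get(
    const std::string& key,
    const std::map<std::string, std::string>& memtable,
    const std::map<std::string, std::string>& immutable,
    const std::vector<std::vector<SSTable>>& levels) {   // levels[0] = L0
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  if (auto it = immutable.find(key); it != immutable.end()) return it->second;
  for (const auto& level : levels) {
    // In L0 several files may overlap the key; in L1..Ln at most one does
    // (a real implementation binary-searches the per-file key ranges).
    for (const auto& sst : level) {
      if (!sst.may_contain(key)) continue;   // filtered out without any IO
      if (auto v = sst.get(key)) return v;
    }
  }
  return std::nullopt;                       // key not found
}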

2.2 Compaction Overhead

LSM-based KVSs mainly target write-intensive scenarios where compaction is frequently triggered [17]. However, compaction is an expensive operation in terms of both CPU and I/O. As storage and network performance continues to improve, the compaction overhead becomes more apparent. On the one hand, compaction consumes many CPU cycles and causes performance interference with foreground workloads [29, 46]. On the other hand, with the prevalence of disaggregated architectures in data centers, compaction leads to a lot of data movement between the compute and storage nodes, further increasing the network burden [43]. We next provide a detailed analysis of the CPU and network overheads of LSM compaction.

2.2.1 CPU Overhead.

In this section, we evaluate the CPU overhead of both an individual compaction task and the multi-threaded case. We conduct comprehensive investigations to break down compactions on the widely deployed KV store RocksDB using the open-source benchmark db_bench.
Breakdown of a single compaction task. In this experiment, we prepare 10M KV records with varying key and value sizes and manually trigger a compaction to merge them. By recording the execution time of the essential steps, we find that the (de)compression steps dominate the compaction, as shown in Figure 2(a). At all KV sizes, (de)compression takes the lion's share, accounting for 79%-85% of the total time. In contrast, sorting and merging only account for about 13%-19% of the time, and IO operations only account for about 1%-3% of the time.
Fig. 2. Illustration of the CPU overhead of LSM compaction.
Data compression is an effective technique for reducing the total cost of ownership (TCO). For LSM KV stores, compression is naturally integrated into the background compaction processes that organize the on-disk SSTable files. During RocksDB's compaction, each read block needs to be decompressed before joining the merge process, and newly generated blocks need to be compressed before being written back to storage. As a result, many compression operations are generated. Although compression benefits I/O performance due to the reduced read/write volume, it comes at the cost of significant CPU overhead. On fast storage devices, data compression may become the bottleneck of the system. Through our experiments, we conclude that data-intensive (de)compression is a key factor in causing the CPU bottleneck of compaction. Therefore, speeding up these steps is critical to improving compaction performance.
Breakdown of multi-threaded compactions. Although multi-threaded compaction is an intuitive way to improve performance, it only produces modest gains and may result in CPU contention [46]. To figure out the reason, we evaluate multi-threaded compactions by recording the overall CPU utilization and the periods of compactions at different levels. In this experiment, we first load a 100GB dataset of key-value items with 16-byte keys and 1KB values into RocksDB in uniformly random order and then perform 20M random write requests with the compaction thread pool set to 16.
Unlike common expectations, the CPU utilization shows periodic bursts rather than always remaining at a high level, as shown in Figure 2(b). By comparing the periods of compaction jobs with the CPU utilization curves, we find that the troughs correspond to the \(L_0\) - \(L_1\) compactions, while the CPU bursts are mainly sparked by the upper-level compactions (i.e., \(L_2\) - \(L_n\) compactions). The red lines in the figure demonstrate this, where the length along the x-axis represents the execution time of the \(L_0\) - \(L_1\) compaction and the right y-axis shows the amount of data processed in the compaction. As we elaborate in Section 2.1, since \(L_0\) allows overlapping key ranges between SSTables, almost all SSTables in both levels join the \(L_0\) - \(L_1\) compaction. As a result, the \(L_0\) - \(L_1\) compaction is very large and takes a long time to execute. Afterwards, a burst of upper-level compactions is triggered by the large amount of ingested data, resulting in the CPU burst. This breakdown gives us the insight that multi-threaded compaction performance is limited by the \(L_0\) - \(L_1\) compaction, while CPU bursts occur mainly during the execution of upper-level compactions.

2.2.2 Network Overhead.

Another problem posed by LSM-tree compaction is the significant increase in network traffic in the disaggregated architectures that are becoming more common in data centers. By decoupling compute and storage resources, disaggregated storage can provide better resource resiliency and cost effectiveness. As high-speed networks are deployed, more and more applications are being deployed on disaggregated storage architectures, including LSM-tree KV stores [7, 25, 43]. However, for the LSM-tree, executing compactions on compute nodes leads to significant data movement, which increases energy consumption and degrades performance. To demonstrate this, we quantify the network traffic to RocksDB on the disaggregated storage using db_bench.
Network amplification. In this experiment, we deployed a RocksDB instance on a compute node, which accesses the remote storage node via the NVMe-oF protocol. The compute nodes and storage nodes are connected to each other through a 100GbE switch. We randomly write different amounts of data into RocksDB and record the Network Amplification (NA) of the system. NA refers to the ratio of the total network traffic generated at run-time to the amount of data actually written by the user. A larger NA means higher network traffic. As shown in Figure 3(a), the NA of the system increases significantly as the amount of data written increases. When loading 500 million 1KB KV pairs, the system generates about 26 times network amplification, which means about 13 TB of network traffic is generated. By further breaking down the network traffic, we can see that compactions are the main cause of network amplification, which leads to about 23 times network amplification when loading a 500GB dataset. As the amount of written data increases, more and more compaction jobs are generated, which will read a large number of files from the storage node to perform a sort and merge operation and then write them back, resulting in significant data traffic.
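Written out, the metric is \(\mathrm{NA} = \text{total network traffic at run time} / \text{data written by the user}\) ; for the 500GB load above, NA \(\approx 26\) corresponds to roughly \(26 \times 500\,\mathrm{GB} \approx 13\) TB of traffic, matching the breakdown in Figure 3(a).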
Fig. 3. Illustration of the network overhead of LSM compaction.
Data traffic over time. Figure 3(b) further shows the network traffic over time when randomly loading a 100GB dataset. Similar to the CPU utilization burst, network traffic shows periodic bursts that are caused by concurrent \(L_{2}\) - \(L_{n}\) compactions. For a single RocksDB instance, the maximum network throughput is around 7.3 Gbps. In a real production environment, a compute node may deploy multiple RocksDB instances [17], and the background compaction of these instances may consume tens of Gbps of network bandwidth. In a disaggregated architecture, network bandwidth is a scarce resource since it is shared by different compute and storage nodes. Therefore, reducing the network traffic of compaction is necessary to improve the overall performance.
In order to alleviate the CPU and network overhead, we choose to offload compaction to DPUs, which have recently been widely deployed in cloud vendors' infrastructure and which balance cost, programmability, and efficiency better than FPGAs.

2.3 DPU Offloading Opportunity & Challenge

Recently, DPUs have been introduced as a new infrastructure in data centers to enable offloading of CPU functions, providing opportunities to solve the above-mentioned problems. A typical DPU contains the following components, as shown in Figure 4: (1) a number of general-purpose compute cores with Arm or MIPS architecture, which provide high programming flexibility; (2) a set of dedicated ASIC accelerators that can efficiently handle data-intensive computations such as (de)compression, deduplication, and hashing; and (3) a set of interconnect units for networking and storage that can be used to offload data transfer protocols such as NVMe-oF [1]. These components provide new opportunities and challenges for the offloading of compaction.
Fig. 4. Typical architecture of DPU.
Efficient compression accelerator. Compression is typically a data-intensive computing operation that CPUs cannot handle efficiently [23], so many DPU vendors (e.g., Nvidia [10], Intel [26], Marvell [33]) have integrated a dedicated compression accelerator into their products. These accelerators typically have much higher compression throughput than their software alternatives, providing a new opportunity to break the compaction bottleneck. Figure 5 shows the benefits of leveraging the deflate accelerator of the BlueField-2 DPU [36] to substitute the zlib library used by RocksDB compaction. After using this hardware accelerator (denoted by the “Host+ASIC”), the time used for (de)compression in a single compaction is significantly reduced. Although using multi-cores for parallel compression can further improve the throughput of software algorithms, it significantly increases the CPU burden and could cause severe performance interference. Therefore, offloading compression to domain-specific accelerators is a more promising way to solve the compaction bottleneck.
Fig. 5. Offloading opportunity and challenge.
Generic but slow Arm cores. Another opportunity is that there are additional generic Arm cores that can be used to absorb CPU bursts caused by compaction. CPU bursts can hurt overall performance when there are constrained CPU resources on the host, a phenomenon that is common in modern cloud servers as IO speeds continue to increase [28, 42]. Nevertheless, naively offloading all compaction to the DPU leads to a reduced performance gain or even degrades performance due to the wimpy DPU architecture with a limited number of low-frequency Arm cores and a small amount of cache memory. As shown in Figure 5, the execution time increases significantly after offloading the compaction to the Arm cores (denoted by "Arm"). Even when leveraging the accelerator to speed up the (de)compression steps (denoted by "Arm+ASIC"), the time for other steps, such as merging and sorting, still inevitably increases due to the slow Arm cores. As a result, the performance gains are not as good as using the accelerator directly on the host ("Host+ASIC"). Therefore, we must both use idle Arm cores to relieve the host CPU burden and carefully offload compaction to these cores to minimize the slowdown.
Effective direct storage access. Furthermore, the DPU also provides a chance to increase offloading effectiveness by enabling direct storage access. The RDMA NIC on the DPU offloads the NVMe-oF protocol, thus allowing the Arm subsystem to read and write data directly from the storage device without going through the CPU. However, this offload infrastructure cannot be exploited directly, since compaction works on top of file semantics. Although existing file-sharing services such as NFS [41] can provide file semantics, they do not take this hardware feature into account, and using them naively for compaction offloading causes inefficiency. As shown in Figure 5, the IO time of compaction after offloading with such a service ("Arm", "Arm+ASIC") is increased compared with running it on the host ("Host", "Host+ASIC") due to the redundant data copies from CPU to DPU. Therefore, a file system is required to offer file-level data sharing and take advantage of this powerful infrastructure to increase offloading efficiency.
Energy-efficient storage infrastructure. Finally, DPU helps improve the energy efficiency of disaggregated storage. Compared to traditional x86 storage servers, DPU storage servers have lower energy consumption and can achieve similar performance [22, 35], so many vendors are starting to produce DPU-based storage servers [8, 14, 36, 39]. With the NVMe-oF protocol target offloaded to the RDMA NIC, the Arm cores of the storage-side DPU are further freed up, and we can utilize these idle Arm cores for near-data compaction to reduce data movement in the network. Nevertheless, given the limited number of Arm cores on the storage-side DPU, simply offloading all compaction tasks to the storage-side DPU may lead to performance degradation, and thus a dynamic offloading strategy is necessary to balance performance and data movement.

3 D \({^2}\) Comp Design

In this section, we introduce D \({^2}\) Comp, which aims to reduce the CPU and network overhead of LSM compaction on high-speed disaggregated storage by leveraging DPUs. D \({^2}\) Comp has the following design goals:
Improved compression efficiency. Compaction plays a crucial role in enhancing the write performance of the LSM-tree. Nevertheless, its performance is constrained by the data-intensive (de)compression process. To enhance compaction efficiency and alleviate the CPU burden, D \({^2}\) Comp needs to speed up the (de)compression step.
Minimize CPU contention. The performance predictability of applications and storage systems suffers from competition for shared resources. Multi-threaded compactions cause severe CPU bursts and affect overall performance. D \({^2}\) Comp must minimize competition for CPUs.
Minimize data movement. Performing compactions on compute nodes results in a large amount of data movement, which consumes energy and leads to contention for network bandwidth. In addition, the reliance on the host CPU to transfer data further exacerbates the data movement overhead, which affects the IO efficiency. D \({^2}\) Comp shall minimize the data movement across the network.
Maximize offload revenues. Considering the limited computing power on the DPU, naively offloading all tasks to the DPU does not result in optimal acceleration. D \({^2}\) Comp needs to dynamically dispatch compaction jobs to different computing devices based on their compute headroom to maximize performance gains from offloading.
System Overview. Figure 6 shows the system overview of the storage engine with D \({^2}\) Comp on disaggregated storage. It consists of three key techniques to achieve the proposed design goals. The first technique is a DPU-aware and LSM-specialized file system (Section 3.1). This file system includes an FS service on the host CPU to ensure global data consistency and an FS library for the DPU to directly fetch files from the SSD to perform compaction. With this file system, we eliminate the DPU's dependency on the host CPU to transfer data, thus avoiding redundant data movement. The second technique is a resource-aware dispatcher (Section 3.2). This scheduler dynamically offloads compaction to the DPU based on the priority of the compaction jobs and the compute headroom of the compute devices, which effectively reduces the computational burden on the host CPU and the data movement in the network. The third technique is a hardware-assisted compression module (Section 3.3). The module fully exploits the capabilities of the DPU's (de)compression hardware accelerator and integrates it into LSM-based KVs to improve compaction efficiency and read performance. Next, we describe these three techniques in detail in turn.
Fig. 6. System overview of storage engine with D \({^2}\) Comp on disaggregated storage.

3.1 DPU-Aware and LSM-Specialized File System

To minimize data movement across the network and host CPU contention, D \({^2}\) Comp should offload compactions to DPU. However, the compaction process is based on file semantics and DPUs are not aware of it. A naive idea would be to run a file-sharing service (e.g., NFS [41]) on the host CPU so that DPUs can grab SSTable files through the CPU as shown in Figure 7(a). However, this approach leads to redundant data movement and consumes CPU resources because all data transfers must pass through the host CPU. A better way is to let the DPU directly read/write files from SSD to perform compactions as shown in Figure 7(b). However, achieving this goal is non-trivial and data consistency must be carefully considered. We next describe how D \({^2}\) Comp achieves a balance of efficiency and correctness by using a DPU-aware and LSM-specialized file system (DALFS).
Fig. 7. Comparison of file access for compaction offloading using NFS and DALFS.
Recall that the LSM-tree batches writes in memory and flushes the newly generated SSTable to storage in an append-only fashion. Once persisted, the SSTable becomes read-only. Background compaction only operates on old read-only SSTables, so there are no write conflicts with foreground flushes. As for reads, they can still be served during compaction because the old SSTables still exist on the SSDs. When compaction is complete, new SSTables are generated, read requests are redirected to these new files, and the old SSTables are then deleted, so compaction does not cause read conflicts either. In summary, there is no conflict between background compaction and foreground operations reading or writing the same file; a conflict only exists for concurrent access to file system metadata (e.g., creating and deleting SSTables). Therefore, D \({^2}\) Comp only needs to ensure that the file system metadata are consistent.
To ensure file system metadata consistency, a global metadata manager is required. This manager can be placed on the host CPU or on the DPU. D \({^2}\) Comp chooses the first approach as shown in Figure 7(b). The DALFS service on the host is responsible for metadata consistency, and all metadata operations (e.g., opens, creates, deletes) from KV instance and DALFS library are forwarded to it for processing. This scheme has two benefits: (1) handing off metadata services to different hosts avoids a global metadata bottleneck, and (2) it frees up more DPU Arm cores for compaction jobs. Under this structure, metadata requests for DALFS libraries still require network transfers and host CPU processing. Nevertheless, for LSM KVs, metadata operations are much less frequent than data operations, and thus there is essentially little impact on overall performance.
Data operations, such as reads and writes, can be processed locally. To execute a read or write request, the DALFS library first fetches the file metadata from the local cache, which records the corresponding logical block addresses (LBAs) of the file, then generates block requests and sends them to the RDMA NIC via the NVMe-oF initiator. The RDMA NIC of the storage node includes an offloaded NVMe-oF target that converts incoming block requests into NVMe commands and sends them to the SSD for processing. With the separation of the metadata path and the data path, the data transfer for both file reads and writes can be done directly by the DPU without going through the host CPU, thus avoiding unnecessary data copies. The performance of these data operations is also improved because the data is transferred through the hardware.
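The data path just described can be sketched as follows. The types are hypothetical stand-ins (the real DALFS library differs in detail), and the 4KB block size is an illustrative assumption.

// Sketch of the DALFS-library read path (hypothetical types). The file-to-LBA
// mapping is served from a local metadata cache, so the block IO bypasses the
// host CPU entirely and goes straight to the NVMe-oF target on the NIC.
#include <cstdint>
#include <string>
#include <vector>

struct Extent { uint64_t lba; uint32_t num_blocks; };

class MetadataCache {            // populated through the host-side DALFS service
 public:
  std::vector<Extent> Lookup(const std::string& file, uint64_t off, uint32_t len);
};

class NvmeofInitiator {          // talks to the NVMe-oF target offloaded on the NIC
 public:
  void ReadBlocks(uint64_t lba, uint32_t num_blocks, void* buf);
};

// Read `len` bytes of an SSTable starting at `off`, directly from the SSD.
void DalfsRead(MetadataCache& meta, NvmeofInitiator& nvmf,
               const std::string& file, uint64_t off, uint32_t len, char* out) {
  // 1. Translate the file range into logical block addresses locally; only a
  //    cache miss would require a metadata RPC to the host DALFS service.
  for (const Extent& e : meta.Lookup(file, off, len)) {
    // 2. Issue block reads through the RDMA NIC's NVMe-oF initiator; the
    //    storage-side target turns them into NVMe commands for the SSD.
    nvmf.ReadBlocks(e.lba, e.num_blocks, out);
    out += static_cast<uint64_t>(e.num_blocks) * 4096;   // assume 4KB blocks
  }
}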
D \({^2}\) Comp implements the DALFS service and DALFS library in user space. The DALFS service and library are customized for the LSM KV, providing the necessary file interfaces for running the LSM KV instance and the compactor. We omit many complex features found in traditional file systems, thus simplifying the file system development process. Nevertheless, it is sufficient for running LSM KV and verifying our idea. The idea of customizing the file system for LSM KVs is also widely used in industrial products [2, 45] and research articles [7]. Unlike these customizations, our DALFS further takes into account DPU characteristics and compaction offloading requirements. With our DALFS, the DPU can directly fetch files from the SSD to perform compaction, which avoids unnecessary data movement and improves IO efficiency for offloading tasks.

3.2 Resource-Aware Dispatching

Although DALFS provides efficient file access and ensures data consistency, the DPU still lacks the runtime context to run compaction. Furthermore, the Arm cores on the DPU are much slower than the host x86 cores and are limited in number, so naively offloading all compaction jobs to the DPU does not maximize offload revenues. Next, we describe how D \({^2}\) Comp addresses these challenges.
D \({^2}\) Comp synchronizes the execution context between the compactor and the primary instance by leveraging the on-disk manifest file. Figure 8 shows the detailed offload mechanism. To offload a compaction job, (1) the primary instance first sends an RPC request, which records the compaction parameters; (2) after the compactor receives the request, it first synchronizes with the state of the primary by reading and applying the manifest file. This file records the version changes of the LSM KV, such as the creation and deletion of SSTable files. Once the synchronization is complete, the compactor has a complete storage view of the LSM-tree; (3) then, the compactor executes the same compaction logic as the primary; (4) after the compaction is completed, the compactor returns the results to the primary, which is responsible for deleting the old SSTable files, installing the newly generated SSTable files, and applying the version changes to the manifest file. Through the above mechanism, the compactor on the DPU can obtain the latest metadata of the LSM-tree to perform the compaction. This mechanism imposes little overhead, since the RPC packet is much smaller than the SSTables to compact and the catch-up process requires only a few incremental modifications.
Fig. 8. The compaction offload mechanism.
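The four steps above can be sketched from the compactor's side as follows; the types and helper names are illustrative, and the real implementation reuses the KV store's own version (manifest) machinery and compaction code.

// Sketch of the compactor-side handler for the offload handshake in Figure 8.
#include <string>
#include <vector>

struct CompactionRequest {               // (1) sent by the primary over RPC
  int level;                             // merge level -> level + 1
  std::vector<std::string> inputs;       // SSTable files to compact
};

struct CompactionResult {                // (4) returned to the primary
  std::vector<std::string> new_files;    // newly generated SSTables
  std::vector<std::string> obsolete;     // inputs that can now be deleted
};

void ReplayManifest();                   // apply version changes from the shared
                                         // on-disk manifest (incremental catch-up)
std::vector<std::string> RunCompaction(int level,
                                       const std::vector<std::string>& inputs);

CompactionResult HandleCompactionRpc(const CompactionRequest& req) {
  ReplayManifest();                      // (2) obtain the primary's storage view
  CompactionResult res;
  res.new_files = RunCompaction(req.level, req.inputs);   // (3) same merge logic
  res.obsolete  = req.inputs;
  return res;                            // (4) the primary installs the new files,
                                         //     deletes the old ones, and logs the
                                         //     version change to the manifest
}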
With DALFS and synchronization mechanisms, it is now possible to run compaction on the DPU. But it is still a challenge to efficiently schedule compaction jobs between CPU and DPU to maximize offload revenues. The biggest difficulty here is the balance of performance and data movement. The x86 cores on the host are much faster than the Arm cores on the DPU, but they are further away from the storage device, and performing compaction on them causes data movement across the network. On the other hand, the DPU is closer to the storage devices but has limited computational power, and offloading the entire compaction to it does not result in optimal performance gains, as shown in Section 2.3.
D \({^2}\) Comp addresses the second challenge by exploiting the different characteristics of compaction at different levels. Recall that the impact of compaction jobs at different levels on overall performance and resource consumption is different. \(L_0\) - \(L_1\) compaction has a greater impact on overall performance and may cause write stalls and affect read performance if it is executed too slowly. However, due to the overlap of the key ranges of SSTables in \(L_0\) , the \(L_0\) - \(L_1\) compaction is generally not highly parallelized and consumes limited CPU resources. As for \(L_2\) - \(L_n\) compaction, it has very little impact on read and write performance, but it usually has higher parallelism and is the main cause of CPU consumption and network traffic. Therefore, D \({^2}\) Comp gives different offloading priorities to the \(L_0\) - \(L_1\) compaction and the \(L_2\) - \(L_n\) compaction. It leaves performance-critical \(L_0\) - \(L_1\) compaction on the host CPU to ensure performance, and prioritizes offloading resource-consuming \(L_2\) - \(L_n\) compaction to storage-side DPUs to reduce network traffic and host CPU consumption.
Despite the guaranteed \(L_0\) - \(L_1\) performance, a potential problem with the above mechanism is that too many \(L_2\) - \(L_n\) compactions may overload the DPU, especially considering that multiple instances on the host may share the same storage node. To address this challenge, D \({^2}\) Comp employs a resource-aware dispatching policy to dynamically schedule compaction tasks on these computing devices. It monitors the workload status of the DPU in real time and checks whether it is overloaded. When the DPU is detected to be overloaded, a newly generated \(L_2\) - \(L_n\) compaction is placed on the host for execution. The detailed design is shown in Figure 9. By default, D \({^2}\) Comp determines the maximum number of compactions the DPU can execute in parallel by offline profiling and sets it as the overload threshold. This detection is effective because the compaction jobs at the \(L_2\) - \(L_n\) levels are of similar size, about 3-4 SSTables, and thus they consume similar CPU resources. With this dynamic priority scheduling, network traffic and CPU contention are reduced while the performance gains are preserved.
Fig. 9. The resource-aware dispatching policy.
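A minimal sketch of this dispatching rule is shown below; the class and threshold are our illustration, with max_dpu_jobs standing in for the profiling-derived overload threshold.

// Sketch of the resource-aware dispatching rule in Figure 9.
#include <atomic>

enum class Executor { HostCPU, StorageDPU };

class Dispatcher {
 public:
  explicit Dispatcher(int max_dpu_jobs) : max_dpu_jobs_(max_dpu_jobs) {}

  // Called when a new compaction job at `level` is generated.
  Executor Pick(int level) {
    if (level == 0) return Executor::HostCPU;       // performance-critical L0-L1
    // L2-Ln jobs go to the storage-side DPU unless it is already saturated.
    if (dpu_inflight_.fetch_add(1) < max_dpu_jobs_) return Executor::StorageDPU;
    dpu_inflight_.fetch_sub(1);                     // overloaded: undo, fall back
    return Executor::HostCPU;
  }

  void OnDpuJobDone() { dpu_inflight_.fetch_sub(1); }

 private:
  const int max_dpu_jobs_;              // overload threshold from offline profiling
  std::atomic<int> dpu_inflight_{0};    // L2-Ln jobs currently running on the DPU
};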

3.3 Hardware-Assisted Compression

Offloading compaction to the DPU eliminates the footprint on the host CPU and the redundant data movement in the network. However, it comes at the cost of decreased performance because the Arm cores in the DPU are much slower than the host x86 cores. D \({^2}\) Comp addresses this problem by exploiting the deflate accelerator on the DPU to speed up the most time-consuming (de)compression steps of compaction. Although it may seem that using the hardware accelerator only requires replacing the corresponding (de)compression API, several points need to be considered to exploit the accelerator well: (1) an easy-to-use interface should be provided for the LSM KV with as few invasive modifications as possible; (2) the performance potential of the deflate accelerator should be fully unlocked; and (3) a fault-tolerance mechanism should be provided since the hardware accelerator may fail.
Figure 10 shows the detailed design of hardware-assisted compression services. It integrates several software algorithms and the hardware offload module into a compression service and provides a unified interface for the applications. This reduces invasive modifications to the applications and facilitates the user to call different algorithms based on their needs. The interface offers two APIs: \(compress\_block()\) and \(uncompress\_block()\) . The caller only needs to pass in the input/output buffers, the type of algorithm, and the (de)compression parameters. This pluggable design does not introduce any additional modifications to the internal operations of LSM KV and provides enhanced usability.
Fig. 10. Design of hardware-assisted compression service.
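As a concrete illustration of this interface, the sketch below shows the two APIs and the fallback path; the signatures and helper functions are hypothetical, built around the two calls named above rather than copied from the implementation.

// Sketch of the unified compression-service interface with software fallback.
#include <cstddef>
#include <cstdint>

enum class Algo { kDeflateHW, kZlibSW, kLZ4SW };      // hardware and software paths

struct CompressParams { int level = 6; uint32_t qpair_id = 0; };

size_t HwDeflateCompress(uint32_t qpair_id, const char* in, size_t in_len,
                         char* out, size_t out_cap);  // accelerator path (see below)
size_t SwCompress(Algo algo, const CompressParams& p, const char* in,
                  size_t in_len, char* out, size_t out_cap);  // zlib/lz4 libraries

// Returns the compressed size, or 0 on failure. A runtime error on the deflate
// accelerator transparently falls back to the software zlib implementation.
size_t compress_block(Algo algo, const CompressParams& p, const char* in,
                      size_t in_len, char* out, size_t out_cap) {
  if (algo == Algo::kDeflateHW) {
    if (size_t n = HwDeflateCompress(p.qpair_id, in, in_len, out, out_cap)) return n;
    algo = Algo::kZlibSW;                             // accelerator failed
  }
  return SwCompress(algo, p, in, in_len, out, out_cap);
}
// uncompress_block() mirrors this structure for decompression.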
As for the performance, D \({^2}\) Comp uses polling-based IO mode to take full advantage of the deflate accelerator. Specifically, when the hardware offload module is selected, it will pack the compression parameters and the addresses of user-passed input and output buffers, generate a (de)compression operation, and submit the operation to the hardware’s submission queue. It then immediately polls the hardware’s completion queue for the result. This synchronous polling approach, although it wastes some CPU cycles, significantly reduces (de)compression latency. For a 4KB block of data, the round-trip latency of this approach is less than 10us, whereas using the asynchronous call method incurs significant thread synchronization and context switching overhead. To distinguish between requests submitted by different threads, we use a priority queue to assign the qpair_id of the hardware accelerator to these threads. In addition, we move the time-consuming memory allocation operations out of the critical path by pre-allocating the buffer and reusing it later, which further reduces the overhead of compression.
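The hardware path used above can be sketched as follows. HwQueuePair is a hypothetical stand-in for the accelerator queue pair handed to each thread via the qpair_id assignment; the real implementation drives the BlueField-2 deflate engine through its vendor SDK.

// Sketch of the polling-based hardware (de)compression path. Buffers are
// pre-allocated and reused elsewhere to keep allocation off the critical path.
#include <cstddef>
#include <cstdint>

struct CompressOp { const char* src; size_t src_len; char* dst; size_t dst_cap; };

class HwQueuePair {
 public:
  bool Submit(const CompressOp& op);      // enqueue on the submission queue
  int  PollCompletion(size_t* out_len);   // >0 done, 0 pending, <0 device error
};

HwQueuePair& QueuePairFor(uint32_t qpair_id);   // per-thread qpair from the pool

size_t HwDeflateCompress(uint32_t qpair_id, const char* in, size_t in_len,
                         char* out, size_t out_cap) {
  HwQueuePair& qp = QueuePairFor(qpair_id);
  if (!qp.Submit({in, in_len, out, out_cap})) return 0;   // queue full or error
  // Busy-poll the completion queue: for a 4KB block the round trip is under
  // 10us, so polling beats sleeping plus a wake-up and context switch.
  for (;;) {
    size_t out_len = 0;
    int rc = qp.PollCompletion(&out_len);
    if (rc > 0) return out_len;           // success
    if (rc < 0) return 0;                 // error: caller falls back to software
  }
}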
To further improve the availability of hardware compression algorithms, D \({^2}\) Comp uses an algorithm selector. In case of runtime errors in the hardware accelerator, the algorithm selector can seamlessly switch to the corresponding software alternatives to provide fault tolerance and high availability. D \({^2}\) Comp employs the DPU-accelerated zlib library as the default setting, since the BlueField-2 currently only supports acceleration for the deflate algorithm [15] used by zlib and gzip. However, as the hardware continues to evolve, more hardware-assisted algorithms can be added to the algorithm selector for more flexible configurations.

4 Evaluation

We implement D \({^2}\) Comp based on RocksDB, with around 1K LOC (C++) used for the compression module and integration with RocksDB, 2K LOC used for the compactor module and resource-aware dispatch logic, and 4K LOC used for the file system.

4.1 Goals

We evaluate DPU-offloaded compaction using micro and macro benchmarks on the popular KV storage engine RocksDB [18]. Our evaluation sets out to answer the following questions:
(1)
How effective is the D \({^2}\) Comp in accelerating single compaction at different KV sizes? (Section 4.3)
(2)
What is the impact of the D \({^2}\) Comp on the overall performance? How effective is the D \({^2}\) Comp in reducing host CPU consumption and network traffic? How scalable is D \({^2}\) Comp under high write loads? (Section 4.4)
(3)
How does the D \({^2}\) Comp perform under various macro benchmarks? (Section 4.5)
(4)
What is the impact of different configurations (e.g., turning off compression) on the performance gains of D \({^2}\) Comp? (Section 4.6)

4.2 Experimental Setup

Our evaluation testbed consists of two dual-socket Intel Xeon Gold 5218 servers (2.3GHz, 32 cores, 64GB DDR4 DRAM). All nodes run Ubuntu 20.04 with Linux kernel version 5.4. Each node is equipped with an NVIDIA BlueField-2 DPU. The DPU has eight Armv8 A72 cores with a 6MB shared L3 cache, 16GB DRAM, and a 100Gbps RDMA NIC. For the storage nodes, we equip four 480GB Samsung 983 ZET SSDs and configure the DPU to offload the NVMe-oF targets to the RDMA NIC. The two nodes are connected to a 100GbE switch and we use RoCE for RDMA.
System configurations. Our natural Baseline is vanilla RocksDB (v. 6.4.9), which runs all compactions on the compute node’s CPU. Unless otherwise stated, all tests with RocksDB use the following configurations. The compression algorithm is set to zlib. The bloom filter is turned on. Direct IO is turned on to eliminate the impact of the OS page cache. We vary multiple parameters in our experiments, including the number of compaction threads and the key-value size. The remaining parameters are set to RocksDB defaults.
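For reference, the baseline configuration above maps onto RocksDB's public options API roughly as follows; the Bloom filter's 10 bits per key and the other specifics not stated in the text are illustrative choices, not the exact tuning used.

// Baseline RocksDB configuration (a sketch; unstated values are illustrative).
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/table.h>

rocksdb::Options MakeBaselineOptions(int compaction_threads) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compression = rocksdb::kZlibCompression;          // zlib compression

  rocksdb::BlockBasedTableOptions table_options;             // Bloom filter on
  table_options.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10 /* bits per key, illustrative */));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  options.use_direct_reads = true;                           // bypass OS page cache
  options.use_direct_io_for_flush_and_compaction = true;
  options.max_background_compactions = compaction_threads;   // varied per test
  return options;
}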
Comparison groups. We also set up several comparison groups to verify our optimization techniques. Baseline+HC indicates that the (de)compression steps of compaction are offloaded to the accelerator of the DPU, while the rest of the steps still run on the host CPU. Naive-Off uses the NFS service to share data files across the CPU and DPU, and it naively offloads all compaction jobs to the DPU without hardware-assisted compression (HC). DComp utilizes the DPU-aware file system to support offloading and turns on HC, but it only offloads compaction to the compute node's DPU. D \({^2}\) Comp further supports offloading compactions to the storage node's DPU and implements a resource-aware dispatching (RD) policy across the host CPU, storage-side DPU, and compute-side DPU, as described in Section 3.2.
Workloads and Datasets. We run tests using two popular KV workloads, YCSB and db_bench. For most of the tests with db_bench, we follow the common practice of loading a 100GB database with a randomly generated key using a 16-byte key size and a 1KB value size as the initial state. The run-time phase makes 20M requests.

4.3 Evaluating the Single Compaction Task

To evaluate the impact of D \({^2}\) Comp on the single compaction performance, we prepare 10M KV records with varying key and value sizes and manually trigger a compaction to merge them. In this test, DComp represents the case of offloading the entire compaction job to the compute node’s DPU, while D \({^2}\) Comp represents the case of offloading the entire compaction job to the storage node’s DPU, and both cases turn on the HC. Figure 11 compares the efficiency of different comparison groups in executing a single compaction job.
Fig. 11. Evaluation of the efficiency of different comparison groups in executing a single compaction job.
Compaction throughput. Figure 11(a) shows the compaction throughput in terms of merged input KV records per second. We observe that with hardware-assisted compression (HC) turned on, the compaction performance is significantly higher across all KV sizes. Taking the commonly used key-value size of 16bytes-1KB as an example, the performance improvement of Baseline+HC over Baseline is \(3.81\times\) . This performance gain mainly comes from the hardware acceleration of the (de)compression steps in the compaction. Figure 11(b) shows the execution time breakdown; the time used for (de)compression is significantly decreased after offloading it to the deflate accelerator. Naive-Off has lower performance than Baseline because, after offloading to the DPU, the execution time of all computational steps increases considerably due to the slower Arm cores. DComp and D \({^2}\) Comp outperform Naive-Off because the accelerator of the DPU speeds up the (de)compression steps. Additionally, compared to Naive-Off, both DComp and D \({^2}\) Comp exhibit shorter I/O times. This is primarily because the DPU-aware file system reduces redundant data movement. However, this performance gain is not as pronounced as that achieved by the accelerators, as the compaction performance is mainly constrained by computation. From this experiment, we conclude that hardware-assisted compression is critical to improving compaction performance.
Impact of KV size. From Figure 11(a), we can also observe that larger KV sizes benefit more from HC than small KV sizes. For example, with the 16bytes-128bytes and 16bytes-32bytes key-value sizes, Baseline+HC only achieves improvements of \(3.34\times\) and \(2.86\times\) in throughput over Baseline, respectively. This is mainly because an SSTable contains more KV pairs at smaller KV sizes, and thus more time is spent on data merging and reordering. For the (de)compression steps, the KV size has little effect on execution time because they operate at a fixed-size block granularity.
Resource consumption. Figure 11(c) further shows the host CPU utilization and network traffic of the different comparison groups. Executing a single compaction task on the host CPU consumes an entire core and generates about 7.6 GB of network traffic, as indicated by the Baseline. Baseline+HC has similar data traffic with lower host CPU usage compared to the Baseline because the (de)compression operations are offloaded to the accelerator. Naive-Off eliminates the host CPU footprint by offloading the compaction to the DPU. However, it causes double the network traffic, since the data needs to be read to the CPU first before being passed to the DPU. This is also confirmed in Figure 11(b), which shows that the IO time of Naive-Off is significantly higher than that of the other groups. DComp eliminates the redundant data traffic by enabling the DPU to read and write data directly from the SSD. However, it still needs to fetch data from the storage node to the compute-side DPU, so network data movement remains unavoidable. D \({^2}\) Comp further eliminates this data movement by offloading the compaction to the storage node's DPU.

4.4 Evaluating the Overall Performance

In this section, we use the micro-benchmark db_bench to evaluate the impact of different comparison groups on overall read and write throughput, latency, and resource consumption. In this test, Naive-Off+HC denotes offloading all compaction tasks to the DPU with HC turned on, while DComp and D \(^2\) Comp leave \(L_0\) - \(L_1\) compaction on the host CPU and offload \(L_2\) - \(L_n\) compaction to the DPUs of the compute node and storage node, respectively.
Write throughput. Figure 12(a) shows the scalability of random write throughput as the number of background compaction threads increases. We can see that the benefits of hardware-assisted compression on a single compaction largely carry over to the throughput of this write-intensive workload. It brings \(2.98\times\) to \(3.95\times\) throughput improvement over the Baseline under various numbers of compaction threads. In our tests, performance no longer increases when the number of compaction threads reaches 8 for the Baseline. This is because adding CPU threads only reduces CPU competition among concurrent compaction jobs at different levels. When the CPU resources are enough to cope with the burst of compaction jobs (e.g., each compaction job runs on a different CPU core), continuing to increase them brings no further benefits. The ultimate performance is bound by the execution speed of a single compaction job on a single CPU core, which is limited by data-intensive compression. With our HC, this limitation is broken, and therefore the overall performance is significantly improved. Nevertheless, naively offloading all compaction jobs to the DPU (i.e., Naive-Off+HC) reduces the performance gains because the other computational steps are slowed down by the slow Arm cores. DComp and D \({^2}\) Comp overcome this problem by leaving performance-critical \(L_0\) - \(L_1\) compaction jobs on the fast host cores, which retains most of the performance gains at the cost of a little host CPU consumption and network traffic. Baseline+HC slightly outperforms D \({^2}\) Comp because it performs all compaction jobs on fast host cores. However, this approach consumes a large amount of host CPU resources and causes significant network amplification, as shown in Figure 12(c).
Fig. 12. Comparing the CPU-only baseline and different DPU-offloaded schemes using the DBBench.
Scalability. Figure 12(b) further shows the scalability of the different comparison groups with respect to the number of RocksDB instances. The y-axis represents the average write throughput of all instances. Each instance sets the number of compaction threads to 16. The Baseline and Baseline+HC maintain a steady average throughput as the number of instances increases, since there are enough host CPU cores for compaction. However, executing all compaction jobs on the host can cause severe CPU contention with the foreground workload and generate network traffic. Naive-Off+HC eliminates the host CPU footprint and reduces network traffic by offloading all compaction jobs to DPUs. However, its average throughput decreases gradually as the number of instances increases. This is mainly because the compaction jobs of multiple instances compete for the limited number of Arm cores on the DPU. In contrast, D \({^2}\) Comp and DComp maintain the performance gains by leaving performance-critical \(L_0\) - \(L_1\) compaction jobs on the host CPU and dynamically offloading \(L_2\) - \(L_n\) compaction jobs to DPUs. In this test, we can also observe that even with four instances, each having 16 concurrent compaction threads, D \({^2}\) Comp can still maintain a high write throughput. This demonstrates that the DPU is capable of handling a large number of concurrent \(L_2\) - \(L_n\) compaction jobs with little impact on write performance. In addition, when higher write loads cause the DPU to become overloaded, newly generated tasks are scheduled to be executed on the compute nodes, thus ensuring the performance benefits.
Network amplification. Figure 12(c) shows the network amplification of different comparison groups when randomly loading datasets of different sizes with 8 compaction threads. The network amplification of Baseline, Baseline+HC and DComp gets higher as the amount of written data increases because more and more generated compaction jobs need to read data from the storage node and write it back. Naively offloading all compaction jobs to the storage-side DPU (i.e., Naive-Off+HC) can even lead to higher network amplification, almost \(2\times\) as much as Baseline at 500GB datasets. This is because these jobs still need to fetch files through the NFS service on the host CPU, resulting in redundant data movement. In contrast, D \({^2}\) Comp eliminates these data movements by allowing the DPU to read and write files from the SSD directly and reduces network traffic by offloading the \(L_{2}\) - \(L_{n}\) compactions to the DPU of the storage node.
Read throughput. For read-only workloads, since compaction is not triggered, the offloading policy does not have an impact on their performance, and their performance is mainly affected by HC. Our test shows that D \({^2}\) Comp increases sequential and random read throughput by \(1.13\times\) and \(1.28\times\) , respectively, compared to baseline. This improvement mainly comes from the hardware acceleration of the decompression step of read operations. Overall, this experiment demonstrates that decompression also influences the read performance of LSM-tree on fast storage devices. D \({^2}\) Comp reduces this overhead by offloading it to the accelerator, thus benefiting the read performance.
Latency. Table 1 presents the latency of fillrandom and readrandom benchmarks. D \({^2}\) Comp effectively reduces the average latency of write and read operations because it speeds up the (de)compression steps of compaction and read operations. The reduction in the p99 latency of the write operation mainly owes to the reduced write stalls, since the \(L_{0}\) - \(L_{1}\) compaction is sped up [6].
Table 1. Write and Read Latency ( \(\mu\) s)

                      Write latency                   Read latency
                      Avg.     99th     99.9th        Avg.     99th     99.9th
Baseline              30.31    152      1163          24.5     64       90
D \({^2}\) Comp       8.15     101      154           19.2     55       71

4.5 YCSB Benchmarks

Now we compare the CPU-only Baseline and D \({^2}\) Comp using the YCSB benchmarks. We set the background compaction thread pool to 8. Figure 13 reports the YCSB results using the following workloads: workload a (50% read and 50% update), workload b (95% read and 5% update), workload c (read-only), workload d (95% read and 5% insert), workload e (95% scan and 5% insert) and workload f (50% read-modify-write latest records and 50% random reads).
Fig. 13. Comparing the CPU-only baseline and the DPU-offloaded compaction using the YCSB benchmark.
In these tests, our proposal manages to improve the throughput and reduce the host CPU utilization. Reads dominate all these workloads, resulting in infrequent compactions in the underlying KV store. Hence, the throughput improvement is not as high as in the write-intensive workloads. We observe 30.8%, 20.2%, 15.6%, and 23.0% throughput gains for workload a, workload b, workload d, and workload f, respectively; all of these workloads contain non-trivial writes. For the read-only workload c and the scan-intensive workload e, our proposal also achieves 47.0% and 22.1% throughput gains, respectively, due to the hardware-assisted acceleration of the decompression step during reads. Since all these workloads are read-driven, RD is rarely triggered, and the improvement mainly comes from HC. In addition, our proposal reduces the tail latency under all these workloads by up to 16.9%. Overall, D \({^2}\) Comp also benefits the macro benchmarks.

4.6 The Impact of Configurations

In this section, we investigate the impact of turning off compression and of using a hierarchical compression strategy [34], i.e., no compression for the top two levels, fast lz4 compression for the middle two levels, and slow zlib compression for the last level. We set the number of compaction threads to 16 and evaluate the overall throughput of various compression strategies when the host CPU resources are sufficient (16 cores for compaction) and insufficient (1 core for compaction).
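In RocksDB this hierarchical strategy can be expressed with the per-level compression option, as sketched below; the number of levels and the assignment of the unnamed intermediate levels are assumptions for illustration.

// Sketch of the TierCompress setting using RocksDB's compression_per_level.
#include <rocksdb/options.h>

void SetTierCompress(rocksdb::Options& options) {
  options.compression_per_level = {
      rocksdb::kNoCompression,  rocksdb::kNoCompression,   // top two levels
      rocksdb::kLZ4Compression, rocksdb::kLZ4Compression,  // middle two levels
      rocksdb::kZlibCompression};                          // last level(s); deeper
                                                           // levels are assumed to
                                                           // reuse the final entry
}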
Table 2 shows the results. Turning off compression improves overall performance because it removes the computational overhead of compression. However, this comes at the cost of increased storage costs, with a space consumption of 77.34GB. Given the high cost of fast NVMe SSDs, reducing the storage footprint is necessary [11, 17, 23]. Using a hierarchical compression strategy reduces the storage footprint to 42.33GB with a small degradation in performance, since the multi-threaded performance is mainly bounded by the \(L_{0}\) - \(L_{1}\) compaction. However, both the NoCompress and TierCompress strategies suffer significant performance reductions with limited host cores. This is because operations such as data reordering and merging still cause high computational overhead when compression is turned off. They consume significant CPU resources and cause resource contention. TierCompress drops more than NoCompress because it enables compression for the \(L_{2}\) - \(L_{n}\) levels, further exacerbating the CPU load. D \({^2}\) Comp mitigates the performance slowdown by offloading overloaded compaction jobs to DPUs, reducing host CPU contention and maintaining the overall throughput. Overall, D \({^2}\) Comp still benefits the write performance of the LSM-tree when compression is turned off, since it effectively mitigates host CPU contention.
Compress strategy    | Storage cost | Throughput (Ops/s), 16 cores | Throughput (Ops/s), 1 core
NoCompress           | 77.34GB      | 163202                       | 28600
NoCompress +RD       | 76.69GB      | 163745                       | 120229
TierCompress         | 42.33GB      | 123734                       | 10386
TierCompress +HC+RD  | 40.47GB      | 125526                       | 96978
Table 2. The Impact of Compression
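To make the dispatching behavior in this experiment easier to follow, the sketch below illustrates a resource-aware dispatch decision driven by two assumed signals, host CPU utilization and the number of in-flight DPU jobs; the names, threshold, and structure are hypothetical simplifications rather than D\(^2\)Comp's exact policy.

```cpp
// Illustrative sketch of a resource-aware dispatch decision; the threshold,
// helper structure, and job cap are hypothetical, not taken from the paper.
enum class Target { kHostCPU, kDPU };

struct DispatchState {
  double host_cpu_util;    // 0.0-1.0, sampled periodically
  int    dpu_jobs_inflight;
  int    dpu_job_cap;      // bounded by DPU Arm cores / accelerator queues
};

Target DispatchCompaction(const DispatchState& s) {
  // Offload only when the host is overloaded and the DPU still has headroom,
  // so that light load keeps compactions local and avoids offload overhead.
  constexpr double kHostBusyThreshold = 0.8;  // hypothetical value
  if (s.host_cpu_util > kHostBusyThreshold &&
      s.dpu_jobs_inflight < s.dpu_job_cap) {
    return Target::kDPU;
  }
  return Target::kHostCPU;
}
```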

5 Discussion

In this section, we provide a qualitative comparison of different hardware acceleration solutions: the BlueField-2 DPU and the Alveo U200 FPGA. Table 3 shows the (de)compression throughput, cost, and power consumption of the two options. The (de)compression throughput in the table is specific to zlib, as the BlueField-2 DPU currently accelerates only the deflate algorithm. Data for the Alveo U200 FPGA is sourced from the AMD official website [4].
Hardware                          | Compression throughput | Decompression throughput | Cost           | Power consumption
BlueField-2 DPU                   | 6 GB/s                 | 10 GB/s                  | \(\sim\)$1857  | 75W/150W
Alveo U200 FPGA (892K LUTs total) | 2 GB/s (61K LUTs)      | 518 MB/s (6.7K LUTs)     | \(\sim\)$5341  | 225W
Table 3. Comparison of BlueField-2 DPU and Alveo U200 FPGA
From a performance perspective, the FPGA can offer higher (de)compression throughput: it consumes 61K LUTs to provide 2 GB/s of compression throughput and 6.7K LUTs to provide 518 MB/s of decompression throughput [5]. Since the U200 FPGA has 892K LUTs in total, the card could potentially reach (de)compression throughput in the tens of GB/s. Practical usage, however, rarely reaches such levels, because other operations also consume LUTs, such as RDMA network stack processing, I/O stack processing, and the encoding/decoding and sort-merge computations. According to Alibaba Cloud's deployment experience in production environments, their FPGAs achieved up to 7.3 GB/s compression throughput [47], only slightly higher than the BlueField-2 DPU (6 GB/s for compression and 10 GB/s for decompression).
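The "tens of GB/s" estimate follows from scaling the per-instance figures linearly with the available LUTs, an idealized upper bound that ignores routing, on-chip memory, and the other logic listed above:
\[
\left\lfloor \frac{892\text{K}}{61\text{K}} \right\rfloor \times 2\ \text{GB/s} \approx 28\ \text{GB/s of compression},\qquad
\left\lfloor \frac{892\text{K}}{6.7\text{K}} \right\rfloor \times 518\ \text{MB/s} \approx 69\ \text{GB/s of decompression}.
\]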
Despite the performance advantage, FPGA solutions significantly increase cost and power consumption, and they lack the flexibility to adapt quickly to complex data formats and diverse compaction strategies. Furthermore, FPGAs are sensitive to environmental factors and therefore less stable: according to Alibaba Cloud's operational experience, about 33% of data corruption incidents and 27% of operational downtime are caused by FPGAs [47]. Consequently, they are not well suited for large-scale deployment. In contrast, the DPU's ASIC+Arm design balances performance with significantly lower cost and power consumption, and it is more stable and flexible than FPGAs, which has earned it favor with mainstream cloud vendors. Although the current (de)compression throughput of the BlueField-2 DPU is below the theoretical peak of the FPGA, it meets the (de)compression bandwidth requirements of concurrent compaction jobs. For scenarios with higher performance demands, deploying additional DPUs in storage nodes can cover those needs. Nevertheless, we recommend that DPU vendors further improve the hardware (de)compression performance in subsequent products to match rapidly evolving network and storage bandwidths, and add accelerators for more compression algorithms.

6 Related Work

Compaction Offloading. Many systems have attempted to offload the resource-intensive compaction of the LSM-tree. Ahmad and Kemme [3], RocksDB-Cloud [38], and dLSM [43] propose to offload compaction to remote CPU servers, which eliminates resource contention between compaction and foreground operations. However, compaction performance remains bounded by the CPU, which handles data-intensive compression inefficiently. To break this bottleneck, X-engine [46] uses FPGAs to accelerate compaction, which yields good performance gains but introduces expensive hardware costs that hinder large-scale deployment. FaaS [44] offloads compactions to a remote FaaS cluster, which provides elastic processing power and alleviates CPU contention, but it still performs the compactions at remote compute nodes and therefore causes a lot of data movement. In contrast, D\(^2\)Comp offloads compaction to the more cost-effective DPU, achieving respectable performance with lower resource consumption.
DPU Offloading. With the recent increase in DPU capabilities, many studies have started to explore the use of DPUs to offload cloud services and data center applications. Examples include network functions [19, 32], HPC applications [27], file systems [21, 28, 30], and databases [31, 40, 42]. Our work follows this direction by focusing on offloading and accelerating LSM-tree-based databases. iKnowFirst [12] also explores this direction: it uses the DPU to separate hot and cold data and to analyze workload characteristics in order to adjust RocksDB's compaction strategy on the host. However, since compaction is still executed on the host, the host CPU consumption and network data movement caused by compaction remain unavoidable. In contrast, D\(^2\)Comp offloads compaction execution to the DPU, almost eliminating the compaction footprint on the host and reducing data movement across the network. In addition, D\(^2\)Comp exploits the efficient ASIC accelerators to speed up the data-intensive computations of compaction (e.g., data compression), greatly improving compaction performance and reducing write stalls.

7 Conclusion

In this article, we argue that LSM-tree compaction incurs serious CPU and network overhead on high-speed disaggregated storage. With a fine-grained breakdown, we find that the performance of a single compaction can be bounded by data-intensive compression, and that multi-threaded compaction brings severe CPU bursts and network traffic. We therefore propose to offload compactions to DPUs and integrate our proposal with RocksDB, a state-of-the-art LSM-tree KV store. To facilitate offloading, we design and implement a hardware-assisted compression module, a resource-aware dispatching strategy, and a DPU-aware file system. Evaluations against a fine-tuned CPU-only baseline show that the proposed D\(^2\)Comp improves system throughput, reduces tail latency, and effectively mitigates host CPU contention and data movement.

Acknowledgments

This work was sponsored by the National Key Research and Development Program of China under grant No. 2023YFB4502701 and No. 2022YFB4501100, the National Natural Science Foundation of China under Grant No. 62072196 and No. 62102155, the Key Research and Development Program of Guangdong Province under Grant No. 2021B0101400003, the Creative Research Group Project of NSFC No. 61821003, the Fundamental Research Funds for the Central Universities (HUST: 021XXJS034), and the Youth Foundation Project of Zhejiang Lab (No. K2023PI0AA04).

References

[1]
[2]
Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. 2019. File systems unfit as distributed storage backends: Lessons from 10 years of Ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 353–369.
[3]
Muhammad Yousuf Ahmad and Bettina Kemme. 2015. Compaction management in distributed key-value datastores. Proceedings of the VLDB Endowment 8, 8 (2015), 850–861.
[6]
Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. 2019. SILK: Preventing latency spikes in log-structured merge key-value stores. In 2019 USENIX Annual Technical Conference (USENIX ATC’19). 753–766.
[7]
Laurent Bindschaedler, Ashvin Goel, and Willy Zwaenepoel. 2020. Hailstorm: Disaggregated compute and storage for distributed LSM-based databases. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 301–316.
[9]
Brad Burres, Dan Daly, Mark Debbage, Eliel Louzoun, Christine Severns-Williams, Naru Sundar, Nadav Turbovich, Barry Wolford, and Yadong Li. 2021. Intel’s hyperscale-ready infrastructure processing unit (IPU). In 2021 IEEE Hot Chips 33 Symposium (HCS’21). IEEE, 1–16.
[10]
Idan Burstein. 2021. Nvidia data center processing unit (DPU) architecture. In 2021 IEEE Hot Chips 33 Symposium (HCS’21). IEEE, 1–20.
[11]
Hao Chen, Chaoyi Ruan, Cheng Li, Xiaosong Ma, and Yinlong Xu. 2021. SpanDB: A fast, cost-effective LSM-tree based KV store on hybrid storage. In 19th USENIX Conference on File and Storage Technologies (FAST’21). 17–32.
[12]
Jiahong Chen, Shengzhe Wang, Zhihao Zhang, Suzhen Wu, and Bo Mao. 2023. iKnowFirst: An efficient DPU-assisted compaction for LSM-tree-based key-value stores. In 2023 IEEE 34th International Conference on Application-specific Systems, Architectures and Processors (ASAP’23). IEEE, 53–60.
[13]
Yifan Dai, Yien Xu, Aishwarya Ganesan, Ramnatthan Alagappan, Brian Kroth, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2020. From WiscKey to Bourbon: A learned index for log-structured merge trees. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 155–171.
[15]
Peter Deutsch. 1996. RFC 1951: DEFLATE Compressed Data Format Specification Version 1.3. (1996).
[16]
Chen Ding, Ting Yao, Hong Jiang, Qiu Cui, Liu Tang, Yiwen Zhang, Jiguang Wan, and Zhihu Tan. 2022. TriangleKV: Reducing write stalls and write amplification in LSM-tree based KV stores with triangle container in NVM. IEEE Transactions on Parallel and Distributed Systems 33, 12 (2022), 4339–4352.
[17]
Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. 2021. Evolution of development priorities in key-value stores serving large-scale applications: The RocksDB experience. In 19th USENIX Conference on File and Storage Technologies (FAST’21). 33–49.
[19]
Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure accelerated networking: SmartNICs in the public cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI’18). USENIX Association, Renton, WA, 51–66. https://www.usenix.org/conference/nsdi18/presentation/firestone
[21]
Peter-Jan Gootzen, Jonas Pfefferle, Radu Stoica, and Animesh Trivedi. 2023. DPFS: DPU-powered file system virtualization. In Proceedings of the 16th ACM International Conference on Systems and Storage (Haifa, Israel) (SYSTOR’23). Association for Computing Machinery, New York, NY, USA.
[22]
Zerui Guo, Hua Zhang, Chenxingyu Zhao, Yuebin Bai, Michael Swift, and Ming Liu. 2023. LEED: A low-power, fast persistent key-value store on SmartNIC JBOFs. In Proceedings of the ACM SIGCOMM 2023 Conference. 1012–1027.
[23]
Xiaokang Hu, Fuzong Wang, Weigang Li, Jian Li, and Haibing Guan. 2019. QZFS: QAT accelerated compression in file system for application agnostic and cost efficient data storage. In 2019 USENIX Annual Technical Conference (USENIX ATC’19). 163–176.
[24]
Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. 2020. TiDB: a Raft-based HTAP database. Proceedings of the VLDB Endowment 13, 12 (2020), 3072–3084.
[25]
Haoyu Huang and Shahram Ghandeharizadeh. 2021. Nova-LSM: A distributed, component-based LSM-tree key-value store. In Proceedings of the 2021 International Conference on Management of Data. 749–763.
[27]
Arpan Jain, Nawras Alnaasan, Aamir Shafi, Hari Subramoni, and Dhabaleswar K. Panda. 2021. Accelerating CPU-based distributed DNN training on modern HPC clusters using BlueField-2 DPUs. In 2021 IEEE Symposium on High-Performance Interconnects (HOTI’21). IEEE, 17–24.
[28]
Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, Marco Canini, Dejan Kostić, Youngjin Kwon, Simon Peter, and Emmett Witchel. 2021. LineFS: Efficient SmartNIC offload of a distributed file system with pipeline parallelism. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 756–771.
[29]
Baptiste Lepers, Oana Balmau, Karan Gupta, and Willy Zwaenepoel. 2019. KVell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 447–461.
[30]
Qiang Li, Lulu Chen, Xiaoliang Wang, Shuo Huang, Qiao Xiang, Yuanyuan Dong, Wenhui Yao, Minfei Huang, Puyuan Yang, Shanyang Liu, Zhaosheng Zhu, Huayong Wang, Haonan Qiu, Derui Liu, Shaozong Liu, Yujie Zhou, Yaohui Wu, Zhiwu Wu, Shang Gao, Chao Han, Zicheng Luo, Yuchao Shao, Gexiao Tian, Zhongjie Wu, Zheng Cao, Jinbo Wu, Jiwu Shu, Jie Wu, and Jiesheng Wu. 2023. Fisc: A Large-scale cloud-native-oriented file system. In 21st USENIX Conference on File and Storage Technologies (FAST’23). USENIX Association, Santa Clara, CA, 231–246. https://www.usenix.org/conference/fast23/presentation/li-qiang-fisc
[31]
Jiaxin Lin, Tao Ji, Xiangpeng Hao, Hokeun Cha, Yanfang Le, Xiangyao Yu, and Aditya Akella. 2023. Towards accelerating data intensive application’s shuffle process using SmartNICs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7, 2 (2023), 1–23.
[32]
Ming Liu, Tianyi Cui, Henry Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta. 2019. Offloading distributed applications onto SmartNICs using iPipe. In Proceedings of the ACM Special Interest Group on Data Communication. 318–333.
[34]
Yoshinori Matsunobu, Siying Dong, and Herman Lee. 2020. MyRocks: LSM-tree database storage engine serving Facebook’s Social Graph. Proceedings of the VLDB Endowment 13, 12 (2020), 3217–3230.
[35]
Jaehong Min, Ming Liu, Tapan Chugh, Chenxingyu Zhao, Andrew Wei, In Hwan Doh, and Arvind Krishnamurthy. 2021. Gimbal: Enabling multi-tenant storage disaggregation on SmartNIC JBOFs. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference. 106–122.
[37]
Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.
[38]
Facebook. 2023. RocksDB, a persistent key-value store for fast storage environments. http://rocksdb.org/
[39]
[40]
Henry N. Schuh, Weihao Liang, Ming Liu, Jacob Nelson, and Arvind Krishnamurthy. 2021. Xenic: SmartNIC-accelerated distributed transactions. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 740–755.
[41]
Spencer Shepler, Brent Callaghan, David Robinson, Robert Thurlow, Carl Beame, Mike Eisler, and David Noveck. 2003. Network File System (NFS) Version 4 Protocol. Technical Report.
[42]
Shangyi Sun, Rui Zhang, Ming Yan, and Jie Wu. 2022. SKV: A SmartNIC-offloaded distributed key-value store. In 2022 IEEE International Conference on Cluster Computing (CLUSTER’22). IEEE, 1–11.
[43]
Ruihong Wang, Jianguo Wang, Prishita Kadam, M. Tamer Özsu, and Walid G. Aref. 2023. dLSM: An LSM-based index for memory disaggregation. In 2023 IEEE 39th International Conference on Data Engineering (ICDE’23). IEEE, 2835–2849.
[44]
Xiaoliang Wang, Jianchuan Li, Peiquan Jin, Kuankuan Guo, Yuanjin Lin, and Ming Zhao. 2021. Supporting elastic compaction of LSM-tree with a FaaS cluster. In 2021 IEEE International Conference on Cluster Computing (CLUSTER’21). IEEE, 819–820.
[45]
Ziye Yang, James R. Harris, Benjamin Walker, Daniel Verkamp, Changpeng Liu, Cunyin Chang, Gang Cao, Jonathan Stern, Vishal Verma, and Luse E. Paul. 2017. SPDK: A development kit to build high performance storage applications. In 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom’17). IEEE, 154–161.
[46]
Teng Zhang, Jianying Wang, Xuntao Cheng, Hao Xu, Nanlong Yu, Gui Huang, Tieying Zhang, Dengcheng He, Feifei Li, Wei Cao, Zhongdong Huang, and Jianling Sun. 2020. FPGA-accelerated compactions for LSM-based Key-value store. In 18th USENIX Conference on File and Storage Technologies (FAST’20). USENIX Association, Santa Clara, CA, 225–237. https://www.usenix.org/conference/fast20/presentation/zhang-teng
[47]
Weidong Zhang, Erci Xu, Qiuping Wang, Xiaolu Zhang, Yuesheng Gu, Zhenwei Lu, Tao Ouyang, Guanqun Dai, Wenwen Peng, Zhe Xu, Shuo Zhang, Dong Wu, Yilei Peng, Tianyun Wang, Haoran Zhang, Jiasheng Wang, Wenyuan Yan, Yuanyuan Dong, Wenhui Yao, Zhongjie Wu, Lingjun Zhu, Chao Shi, Yinhu Wang, Rong Liu, Junping Wu, Jiaji Zhu, and Jiesheng Wu. 2024. What’s the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store. In 22nd USENIX Conference on File and Storage Technologies (FAST’24). USENIX Association, Santa Clara, CA, 277–291. https://www.usenix.org/conference/fast24/presentation/zhang-weidong
