CN114741028A

CN114741028A - A persistent key-value storage method, device and system based on OCSSD

Info

Publication number: CN114741028A
Application number: CN202210272087.2A
Authority: CN
Inventors: 童薇; 冯丹; 詹天奇; 黄栋
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2022-07-12
Anticipated expiration: 2042-03-18
Also published as: CN114741028B

Abstract

The invention discloses a method, a device and a system for storing persistent key values based on OCSSD, belonging to the field of key value storage, wherein the compact operation comprises the following steps: reading SSTable selected from the L layer and SSTable which is overlapped with the selected SSTable in the L +1 layer in a key value range to a memory, and sequencing key value pairs to obtain an ordered key value pair sequence; identifying the key value pair subsequences in the ordered key value pair sequence which are completely the same as the read original data blocks as reusable data blocks, and organizing the rest key value pairs into data blocks to be written back; for the reusable data blocks, allocating first logical addresses for the reusable data blocks, and mapping the first logical addresses to first physical addresses of the corresponding original data blocks in the OCSSD; and for the data block to be written back, writing the data block into a free second physical address in the L +1 layer, allocating a second logical address to the data block, and mapping the data block to the mapping relation of the second physical address. The invention can effectively reduce the data writing amount in the compact operation process, thereby reducing the influence on the writing performance and the SSD life.

Description

A persistent key-value storage method, device and system based on OCSSD

技术领域technical field

本发明属于键值存储领域，更具体地，涉及一种基于OCSSD的持久键值存储方法、设备及系统。The invention belongs to the field of key-value storage, and more particularly, relates to an OCSSD-based persistent key-value storage method, device and system.

背景技术Background technique

持久键值(KV)存储是现代大规模存储基础结构不可或缺的一部分，用于存储大量的非结构化数据。日志结构合并树(LSM-tree)是键值存储实现中最受欢迎的数据结构之一，因为它将随机写转换为顺序写，同时支持高效的单点查询和范围查询。为了加快读取操作，LSM-tree键值存储系统将键值对保持在多层结构中。在处理写入和更新操作时，LSM-tree键值存储系统首先将传入的更新缓存在内存缓冲区中，并在缓冲区已满时将整个缓冲区批量地转储到持久性存储中。LSM-tree键值存储系统的写入和更新操作均以日志结构方式进行追加写，旧数据不会被新写入数据覆盖。因此，LSM-tree键值存储系统被广泛用于大规模生产环境中，包括Google的BigTable和LevelDB，Facebook中的Cassandra和RocksDB，Twitter中的HBase。Persistent key-value (KV) stores are an integral part of modern large-scale storage infrastructure for storing large amounts of unstructured data. A log-structured merge-tree (LSM-tree) is one of the most popular data structures in key-value store implementations because it converts random writes to sequential writes while supporting efficient single-point and range queries. To speed up read operations, the LSM-tree key-value store system keeps key-value pairs in a multi-layered structure. When processing write and update operations, the LSM-tree key-value store system first caches incoming updates in an in-memory buffer and dumps the entire buffer into persistent storage in batches when the buffer is full. The write and update operations of the LSM-tree key-value storage system are additionally written in a log-structured manner, and the old data will not be overwritten by the newly written data. Therefore, LSM-tree key-value storage systems are widely used in large-scale production environments, including Google's BigTable and LevelDB, Facebook's Cassandra and RocksDB, and Twitter's HBase.

LSM-tree键值存储以SSTable作为树结构中的节点，SSTable即硬盘上的存储文件，包括多个数据块，每个数据块又包括多个键值对；SSTable将键值对有序存放，LSM-tree的树结构中，level0层的SSTable由内存中的Immutable Memtable直接持久化生成，因为没有和当前层的其他文件合并过，因此level0层的SSTable里的键值会发生重叠，其余层的SSTable文件均由当前层和前一层的SSTable文件合并而来，键值不会重叠。在LSM-tree中，当一层的键值对数据量达到其容量上限时，键值对将会与下一层的键值对数据合并，该操作称为compaction操作。如图1所示，这一操作分为三个步骤：首先，它在Li层(例如L2层)指定一个SSTable，再将Li+1层中与之存在键值范围重叠的SSTable读出到内存中；其次，对读入内存的SSTable中的键值对进行合并排序，在这个过程中会合并键相同的键值对，丢弃过期的键值对，保留最新的键值对，以及根据删除标记丢弃失效的键值对，将有效的键值对按照键的某种排序(如字典排序)重新写入新的SSTable文件中；最后，将所有的SSTable写回Li+1。可以看到，LSM-tree键值存储中的compaction操作会引入大量额外的I/O，导致I/O放大的问题。LSM-tree key-value storage uses SSTable as a node in the tree structure. SSTable is a storage file on the hard disk, including multiple data blocks, and each data block includes multiple key-value pairs; SSTable stores key-value pairs in an orderly manner. In the tree structure of LSM-tree, the SSTable of level 0 is directly and persistently generated by the Immutable Memtable in memory. Because it has not been merged with other files of the current layer, the key values in the SSTable of level 0 will overlap, and the values of the other layers will overlap. SSTable files are merged from the SSTable files of the current layer and the previous layer, and the key values will not overlap. In LSM-tree, when the amount of key-value pair data in one layer reaches its upper limit, the key-value pair will be merged with the key-value pair data in the next layer, which is called a compaction operation. As shown in Figure 1, this operation is divided into three steps: first, it specifies an SSTable in the Li layer (such as the L2 layer), and then reads out the SSTable in the Li+1 layer with overlapping key value ranges to the memory Second, merge and sort the key-value pairs in the SSTable read into the memory. In this process, the key-value pairs with the same key will be merged, the expired key-value pairs will be discarded, the latest key-value pairs will be retained, and according to the deletion mark Discard the invalid key-value pairs, and rewrite the valid key-value pairs into a new SSTable file according to some sort of key (such as lexicographical sorting); finally, write all SSTables back to Li+1. It can be seen that the compaction operation in the LSM-tree key-value store will introduce a lot of extra I/O, resulting in the problem of I/O amplification.

随着NAND闪存技术的飞速发展，SSD的低访问延迟、高带宽和低成本的特性使得固态磁盘(SSD)正在逐渐取代LSM-tree键值存储中的HDD，然而，由于LSM-tree键值存储接收到大量的写入/更新操作时，将触发频繁的compaction操作，因此，SSD上的LSM-tree键值存储在写入密集型工作负载中存在高I/O放大的问题，影响键值存储引擎的写性能，并且额外的数据写入也会影响SSD的寿命。With the rapid development of NAND flash technology, the low access latency, high bandwidth and low cost of SSD make solid-state disk (SSD) gradually replace HDD in LSM-tree key-value storage. However, due to LSM-tree key-value storage When a large number of write/update operations are received, frequent compaction operations will be triggered. Therefore, LSM-tree key-value storage on SSD has high I/O amplification problems in write-intensive workloads, affecting key-value storage. The write performance of the engine, and additional data writes can also affect the life of the SSD.

发明内容SUMMARY OF THE INVENTION

针对现有技术的缺陷和改进需求，本发明提供了一种基于OCSSD的持久键值存储方法、设备及系统，其目的在于，改善LSM-tree键值存储的compaction操作因存在高I/O放大而影响键值存储引擎写性能和SSD寿命的问题。In view of the defects and improvement requirements of the prior art, the present invention provides a persistent key-value storage method, device and system based on OCSSD, the purpose of which is to improve the compaction operation of LSM-tree key-value storage due to high I/O amplification. And the problem that affects the write performance of the key-value storage engine and the life of the SSD.

为实现上述目的，按照本发明的一个方面，提供了一种基于OCSSD的持久键值存储方法，利用LSM-tree组织数据，其中，compaction操作包括如下步骤；In order to achieve the above object, according to an aspect of the present invention, a persistent key-value storage method based on OCSSD is provided, which utilizes LSM-tree to organize data, wherein the compaction operation includes the following steps;

键值对排序步骤：将从第L层选中的SSTable以及第L+1层中与所选中的SSTable存在键值范围重叠的SSTable均从OCSSD中读取到内存，并对读入内存的SSTable中的键值对进行排序，得到有序键值对序列；所读取的SSTable中的数据块记为原数据块；L和L+1为LSM-tree中的层序号；Key-value pair sorting steps: The SSTable selected from the L layer and the SSTable in the L+1 layer whose key value range overlaps with the selected SSTable are read from the OCSSD to the memory, and the SSTable read into the memory is read into the memory. Sort the key-value pairs to obtain an ordered sequence of key-value pairs; the data block in the read SSTable is recorded as the original data block; L and L+1 are the layer sequence numbers in the LSM-tree;

数据检测步骤：将有序键值对序列中与原数据块完全相同的键值对子序列识别为可重用数据块，并将其余键值对组织为待写回数据块；Data detection step: Identify the key-value pair subsequence that is exactly the same as the original data block in the ordered key-value pair sequence as reusable data blocks, and organize the remaining key-value pairs into data blocks to be written back;

重映射步骤：对于可重用数据块，为其分配新的第一逻辑地址，并获得与之内容相同的原数据块在OCSSD中的第一物理地址，建立第一逻辑地址到第一物理地址的映射关系；Remapping step: for the reusable data block, assign a new first logical address to it, obtain the first physical address of the original data block with the same content in the OCSSD, and establish a relationship between the first logical address and the first physical address. Mapping relations;

写回步骤：对于待写回数据块，为其分配新的第二逻辑地址，将其写入第L+1层中空闲的第二物理地址，并建立第二逻辑地址到第二物理地址的映射关系。Write back step: for the data block to be written back, assign a new second logical address to it, write it into the free second physical address in the L+1 layer, and establish a link between the second logical address and the second physical address. Mapping relations.

本发明在compaction操作对所读取的SSTable进行键值对排序后，会进行数据检测，从排序后的键值对序列中识别出与读取的原数据块内容完全相同的数据块，作为可重用数据块，对于这些可重用数据块，则直接通过数据重映射的方式，将新分配的逻辑地址映射到原数据块的物理地址，而无需将这些可重用数据块写入OCSSD中，由此能够有效减少compaction操作过程中数据的写入量，进而减小I/O放大，从而减小对键值存储引擎写性能和SSD寿命的影响。In the present invention, after the key-value pairs are sorted on the read SSTable by the compaction operation, data detection is performed, and the data blocks with the same content as the read original data blocks are identified from the sorted key-value pairs sequence, as possible Reuse data blocks. For these reusable data blocks, the newly allocated logical address is mapped to the physical address of the original data block directly through data remapping, without writing these reusable data blocks into OCSSD. It can effectively reduce the amount of data written during the compaction operation, thereby reducing the I/O amplification, thereby reducing the impact on the write performance of the key-value storage engine and the life of the SSD.

进一步地，将有序键值对序列中与原数据块完全相同的键值对子序列识别为可重用数据块，包括：Further, identify the key-value pair subsequence that is exactly the same as the original data block in the ordered key-value pair sequence as a reusable data block, including:

利用与数据块大小相同的滑动窗口在有序键值对序列上滑动，每滑动至一个位置后，判断是否存在与滑动窗口中的键值对子序列相同的原数据块，若是，则将滑动窗口中的键值对子序列识别为可重用数据块，并使滑动窗口以数据块大小为滑动长度滑动至下一个位置；若否，则使滑动窗口以1为滑动长度滑动至下一个位置。Use a sliding window with the same size as the data block to slide on the sequence of ordered key-value pairs. After sliding to a position, determine whether there is an original data block that is the same as the key-value pair subsequence in the sliding window. The key-value pair subsequence in the window is identified as a reusable data block, and the sliding window is made to slide to the next position with the size of the data block as the sliding length; if not, the sliding window is made to slide to the next position with the sliding length of 1.

本发明利用滑动窗口识从排序后的键值对序列中的识别出可重用数据块，即与原数据块内容完全相同的数据块，相比于传统的基于去重的方法，本发明在识别过程中无需进行复杂的哈希值计算，也避免了哈希值的管理开销，因此，本发明能够高效识别出可重用数据块，并且计算成本低、计算开销小。The present invention uses a sliding window to identify reusable data blocks in the sequence of sorted key-value pairs, that is, data blocks with the exact same content as the original data block. There is no need to perform complex hash value calculation in the process, and the management overhead of the hash value is also avoided. Therefore, the present invention can efficiently identify reusable data blocks, and has low computational cost and low computational cost.

进一步地，滑动窗口中的键值对子序列与原数据块是否相同的判断方式包括：Further, the way of judging whether the key-value pair subsequence in the sliding window is the same as the original data block includes:

将滑动窗口中的首、尾两个键值对的键值分别与原数据块的首、尾两个键值对的键值进行比较，若两个键值对的键值对应相等，则判断滑动窗口中的键值对子序列与该原数据块相同。Compare the key values of the first and last key-value pairs in the sliding window with the key values of the first and last two key-value pairs of the original data block. If the corresponding key values of the two key-value pairs are equal, then judge The subsequence of key-value pairs in the sliding window is the same as the original data block.

由于SSTable中的键值对有序，因此，当滑动窗口内的有序键值对序列的首、尾键值对与从SSTable中读取的某一个原数据块的首、尾键值对的键值对应相等时，滑动窗口内的有序键值对序列与该原数据块中的键值对完全相同，本发明以此作为判断依据，能够进一步提高可重用数据块的识别效率。Since the key-value pairs in the SSTable are ordered, when the first and last key-value pairs of the ordered key-value pair sequence in the sliding window are different from the first and last key-value pairs of an original data block read from the SSTable When the key-value correspondences are equal, the sequence of the ordered key-value pairs in the sliding window is exactly the same as the key-value pairs in the original data block, and the present invention uses this as the judgment basis, and can further improve the identification efficiency of the reusable data block.

进一步地，本发明提供的基于OCSSD的持久键值存储方法，还包括：Further, the OCSSD-based persistent key-value storage method provided by the present invention also includes:

为OCSSD中的每一个物理地址维护一个引用计数，用于记录指向该物理地址的逻辑地址的计数；Maintain a reference count for each physical address in the OCSSD, which is used to record the count of logical addresses pointing to the physical address;

当某个物理地址的引用计数为0时，将该物理地址置为无效。When the reference count of a physical address is 0, the physical address is invalidated.

由于本发明在执行compaction操作过程，对于所识别出可重用数据块，与之对应的原数据块仍然有效，本发明对物理地址维护引用计数，并仅在某物理地址的引用技术为0时将该物理地址置为无效，保证了被引用的重复数据不会被其他删除操作删除而失效。Since the present invention is performing the compaction operation process, for the identified reusable data block, the corresponding original data block is still valid, the present invention maintains the reference count for the physical address, and only when the reference technology of a certain physical address is 0 The physical address is invalidated to ensure that the referenced duplicate data will not be invalidated by other deletion operations.

进一步地，数据检测步骤中，识别出可重用数据块之后，将其余键值对组织为待写回数据块之前，还包括：Further, in the data detection step, after identifying the reusable data block, before organizing the remaining key-value pairs into the data blocks to be written back, the method further includes:

按照闪存页大小将可重用数据块和其余键值对进行数据对齐。Data alignment of reusable data blocks and remaining key-value pairs according to the flash page size.

本发明按照闪存页大小对排序后的有序键值对中的可重用数据块和其余键值对进行数据对齐，能够避免逻辑页写入时出现跨物理页的情况，从而能够避免因为不对齐导致的额外读写开销。The present invention aligns the reusable data blocks and other key-value pairs in the sorted ordered key-value pairs according to the size of the flash memory pages, which can avoid the situation of spanning physical pages when the logical page is written, thereby avoiding the misalignment Additional read and write overhead caused.

进一步地，SSTable中的数据块大小为与OCSSD中的物理页大小一致。Further, the size of the data block in the SSTable is the same as the size of the physical page in the OCSSD.

本发明将SSTable中的数据块大小设置成与OCSSD中的物理页大小一致的方式，能够减少额外的读写闪存次数。The present invention sets the size of the data block in the SSTable in a manner consistent with the size of the physical page in the OCSSD, which can reduce the extra times of reading and writing flash memory.

进一步地，利用超级块管理OCSSD中的闪存物理块。Further, the physical blocks of flash memory in the OCSSD are managed by superblocks.

本发明以超级块来管理整个闪存，每个写请求依次发送到每个并行单元，这样可以最大化利用OCSSD的并行性。The present invention manages the entire flash memory with a super block, and each write request is sent to each parallel unit in turn, so that the parallelism of the OCSSD can be maximized.

按照本发明的另一个方面，提供了一种基于OCSSD的持久键值存储设备，利用LSM-tree组织数据，包括：在LSM-tree KV存储引擎中实现的键值对排序模块和数据检测模块，以及在主机FTL中实现的数据重映射模块和数据写回模块；According to another aspect of the present invention, a persistent key-value storage device based on OCSSD is provided, which uses LSM-tree to organize data, including: a key-value pair sorting module and a data detection module implemented in the LSM-tree KV storage engine, And the data remapping module and data write-back module implemented in the host FTL;

键值对排序模块，用于在compaction操作中将从第L层选中的SSTable以及第L+1层中与所选中的SSTable存在键值范围重叠的SSTable均从OCSSD中读取到内存，并对读入内存的SSTable中的键值对进行排序，得到有序键值对序列；所读取的SSTable中的数据块记为原数据块；L和L+1为LSM-tree中的层序号；The key-value pair sorting module is used to read the SSTable selected from the L layer and the SSTable in the L+1 layer with overlapping key value ranges with the selected SSTable in the compaction operation from the OCSSD. Sort the key-value pairs in the SSTable read into the memory to obtain an ordered sequence of key-value pairs; the data block in the read SSTable is recorded as the original data block; L and L+1 are the layer numbers in the LSM-tree;

数据检测模块，用于将有序键值对序列中与原数据块完全相同的键值对子序列识别为可重用数据块，并将其余键值对组织为待写回数据块；The data detection module is used to identify the subsequence of key-value pairs in the sequence of ordered key-value pairs that are exactly the same as the original data block as reusable data blocks, and organize the remaining key-value pairs into data blocks to be written back;

重映射模块，用于为可重用数据块分配新的第一逻辑地址，并获得与可重用数据块内容相同的原数据块在OCSSD中的第一物理地址，建立第一逻辑地址到第一物理地址的映射关系；The remapping module is configured to assign a new first logical address to the reusable data block, obtain the first physical address in the OCSSD of the original data block with the same content as the reusable data block, and establish the first logical address to the first physical address. The mapping relationship of the address;

写回模块，用于为待写回数据块分配新的第二逻辑地址，将待写回数据块写入第L+1层中空闲的第二物理地址，并建立第二逻辑地址到第二物理地址的映射关系。The write-back module is used to assign a new second logical address to the data block to be written back, write the data block to be written back to the second physical address that is free in the L+1 layer, and establish the second logical address to the second physical address. Physical address mapping.

进一步地，LSM-tree KV存储引擎中采用用户态文件系统BlobFS，且主机FTL采用用户态驱动SPDK。Further, the user-mode file system BlobFS is used in the LSM-tree KV storage engine, and the host FTL uses the user-mode driver SPDK.

本发明采用用户态文件系统BlobFS与用户态驱动SPDK来管理主机端FTL，缩短了软件IO栈，减少主机端延迟。The present invention adopts the user-mode file system BlobFS and the user-mode driver SPDK to manage the host-side FTL, shortens the software IO stack, and reduces the host-side delay.

按照本发明的又一个方面，提供了一种持久键值存储系统，包括OCSSD以及本发明提供的基于OCSSD的持久键值存储设备。According to another aspect of the present invention, a persistent key-value storage system is provided, including an OCSSD and an OCSSD-based persistent key-value storage device provided by the present invention.

总体而言，通过本发明所构思的以上技术方案，能够取得以下有益效果：In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be achieved:

(1)本发明在compaction操作过程中，从已排序的有序键值对序列中识别出可重用数据块，并通过地址重映射的方式，将可重用数据块的逻辑地址映射到OCSSD中原数据块的物理地址，而不将这些可重用数据块写入OCSSD中，由此能够有效减少compaction操作过程中数据的写入量，进而减小I/O放大，从而减小对键值存储引擎写性能和SSD寿命的影响。(1) In the process of compaction operation, the present invention identifies reusable data blocks from the sorted sequence of key-value pairs, and maps the logical addresses of the reusable data blocks to the original data in OCSSD by means of address remapping. The physical address of the block, instead of writing these reusable data blocks into the OCSSD, can effectively reduce the amount of data written during the compaction operation, thereby reducing the I/O amplification, thereby reducing the write to the key-value storage engine. Impact on performance and SSD life.

(2)本发明基于OCSSD实现主机端FTL对重复数据的重映射与管理，消除去重与KV存储引擎之间的语义隔离。(2) The present invention realizes remapping and management of duplicate data by FTL on the host side based on OCSSD, and eliminates the semantic isolation between deduplication and KV storage engine.

(3)本发明采用用户态文件系统BlobFS与用户态驱动SPDK来管理主机端FTL，缩短了软件IO栈，减少主机端延迟。(3) The present invention adopts the user-mode file system BlobFS and the user-mode driver SPDK to manage the host-side FTL, shortens the software IO stack, and reduces the host-side delay.

附图说明Description of drawings

图1为现有的LSM-tree键值存储中compaction操作的执行过程示意图；Figure 1 is a schematic diagram of the execution process of the compaction operation in the existing LSM-tree key-value store;

图2为本发明实施例1提供的基于OCSSD的持久键值存储方法流程图；2 is a flowchart of an OCSSD-based persistent key-value storage method provided in Embodiment 1 of the present invention;

图3为本发明实施例1提供的基于OCSSD的持久键值存储方法逻辑图；3 is a logical diagram of an OCSSD-based persistent key-value storage method provided in Embodiment 1 of the present invention;

图4为本发明实施例1提供的可重用数据块识别方法示意图；4 is a schematic diagram of a method for identifying a reusable data block provided in Embodiment 1 of the present invention;

图5为本发明实施例1提供的数据对齐示意图；5 is a schematic diagram of data alignment provided in Embodiment 1 of the present invention;

图6为本发明实施例1提供的针对可重用数据块进行地址重映射过程的示意图；6 is a schematic diagram of an address remapping process for a reusable data block provided by Embodiment 1 of the present invention;

图7为本发明实施例1提供的地址重映射的IO栈；Fig. 7 is the IO stack of address remapping provided by Embodiment 1 of the present invention;

图8为本发明实施例2提供的基于OCSSD的持久键值存储设备及实施例3提供的持久键值存储系统示意图。8 is a schematic diagram of an OCSSD-based persistent key-value storage device provided by Embodiment 2 of the present invention and a persistent key-value storage system provided by Embodiment 3.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。此外，下面所描述的本发明各个实施方式中所涉及到的技术特征只要彼此之间未构成冲突就可以相互组合。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.

在本发明中，本发明及附图中的术语“第一”、“第二”等(如果存在)是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。In the present invention, the terms "first", "second" and the like (if present) in the present invention and the accompanying drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

LSM-tree键值存储中的compaction操作会引入大量额外的I/O，导致I/O放大的问题，影响键值存储引擎的写性能和SSD的使用寿命，为了改善该问题，本发明提供了一种基于OCSSD的持久键值存储方法、设备及系统，其整体思路在于：在compaction操作的过程中，识别出可重用的数据块，即排序后的有序键值对序列中与读取的原数据块的键值对内容完全一致的数据块，对于这些数据块，保留SSD中原数据块的有效性，而不将这些数据块重新写回SSD中，与传统的compaction操作相比，有效减少了数据写入量，从而改善了写放大的问题，也减少了对SSD使用寿命的影响。The compaction operation in the LSM-tree key-value storage will introduce a large amount of extra I/O, leading to the problem of I/O amplification, affecting the write performance of the key-value storage engine and the service life of the SSD. In order to improve this problem, the present invention provides A persistent key-value storage method, device and system based on OCSSD, the overall idea of which is: in the process of compaction operation, reusable data blocks are identified, that is, the sorted sequence of key-value pairs is the same as the read data block. Data blocks whose key-value pairs of the original data blocks are exactly the same. For these data blocks, the validity of the original data blocks in the SSD is retained, and these data blocks are not rewritten to the SSD. Compared with the traditional compaction operation, the reduction is effectively reduced. This reduces the amount of data written, thereby improving the problem of write amplification and reducing the impact on SSD life.

在详细解释本发明的技术方案之前，先对本发明所涉及的相关技术术语进行简要介绍如下：Before explaining the technical solutions of the present invention in detail, the related technical terms involved in the present invention are briefly introduced as follows:

FTL(Flash translation layer)，即闪存转换层，建立了闪存存储介质与设备主控器之间的连接关系，用于管理闪存存储介质；FTL (Flash translation layer), that is, the flash memory translation layer, establishes the connection relationship between the flash memory storage medium and the device master, and is used to manage the flash memory storage medium;

OCSSD(Open-Channel SSD)，即开放通道SSD，其开放内部通道，在主机端实现FTL；OCSSD (Open-Channel SSD), that is, open channel SSD, which opens internal channels and implements FTL on the host side;

SDPK(Storage Performance Development Kit)，是Intel发表的用户态NVMe驱动；SPDK支持OCSSD以及主机端闪存转换层；SDPK (Storage Performance Development Kit) is a user-mode NVMe driver published by Intel; SPDK supports OCSSD and host-side flash conversion layer;

BlobFS，是一个简单、扁平的文件系统，可以安装和卸载，并且可以通过“open”、“read”、“stat”和“mmap”等操作访问blob。BlobFS, is a simple, flat filesystem that can be mounted and unmounted, and blobs can be accessed via operations such as "open", "read", "stat", and "mmap".

以下为实施例。The following are examples.

实施例1：Example 1:

一种基于OCSSD的持久键值存储方法，利用LSM-tree组织数据。A persistent key-value storage method based on OCSSD, using LSM-tree to organize data.

参阅图2和图3，本实施例中，compaction操作包括如下步骤；Referring to FIG. 2 and FIG. 3, in this embodiment, the compaction operation includes the following steps;

在图1所示的传统的compaction操作的过程中，以SSTable为粒度选择与所选中的Li层SSTable有键值范围重叠的Li+1层的SSTable进行合并，这是一种粗粒度的选择方式，这种粗粒度的选择方式导致Li+1层被选择SSTable中存在大量不与Li层SSTable的键值范围重叠的键值对，即无关键值对，这些无关键值对只参与排序，然后直接写回Li+1层，导致大量数据被重复写入；这种跨层级的重写会导致大量额外的I/O，损害键值存储引擎写性能和SSD的寿命。In the process of the traditional compaction operation shown in Figure 1, the SSTable of the Li+1 layer that has the key value range overlapping with the selected Li-layer SSTable is selected and merged with the SSTable as the granularity. This is a coarse-grained selection method. , this coarse-grained selection method leads to a large number of key-value pairs that do not overlap with the key-value range of the Li-layer SSTable in the selected SSTable of the Li+1 layer, that is, no key-value pairs. These non-key-value pairs only participate in the sorting, and then Directly writing back to the Li+1 layer causes a large amount of data to be repeatedly written; this cross-layer rewriting will cause a lot of extra I/O, hurting the write performance of the key-value storage engine and the lifespan of the SSD.

图3所示，为采用本实施例所提供的步骤执行compaction操作的一个示例，其中被选取的SSTable位于LSM-tree的第2层；参阅图3可知，本实施例中，通过数据检测步骤和重映射步骤，会从排序后的键值对序列中识别出与读取的原数据块内容完全相同的数据块，作为可重用数据块，这些可重用数据块中的键值对就是无关键值对，对于所识别出的可重用数据块，则直接通过数据重映射的方式，将新分配的逻辑地址映射到原数据块的物理地址，而无需将这些可重用数据块写入OCSSD中；As shown in FIG. 3, it is an example of using the steps provided by this embodiment to perform the compaction operation, wherein the selected SSTable is located at the second layer of the LSM-tree; referring to FIG. 3, it can be seen that in this embodiment, through the data detection step and The remapping step will identify the data blocks with the same content as the original data block read from the sorted sequence of key-value pairs as reusable data blocks. The key-value pairs in these reusable data blocks are no key values. Yes, for the identified reusable data blocks, the newly allocated logical address is directly mapped to the physical address of the original data block through data remapping without writing these reusable data blocks into OCSSD;

对比图1和图3所示的操作过程可知，本实施例此能够有效减少compaction操作过程中数据的写入量，进而减小I/O放大，从而减小对键值存储引擎写性能和SSD寿命的影响。Comparing the operation process shown in Figure 1 and Figure 3, it can be seen that this embodiment can effectively reduce the amount of data written during the compaction operation, thereby reducing the I/O amplification, thereby reducing the write performance of the key-value storage engine and SSD. impact on lifespan.

为了高效识别出可重用数据块，本实施例中，将有序键值对序列中与原数据块完全相同的键值对子序列识别为可重用数据块，包括：In order to efficiently identify reusable data blocks, in this embodiment, a key-value pair subsequence that is exactly the same as the original data block in the ordered key-value pair sequence is identified as a reusable data block, including:

利用与数据块大小相同的滑动窗口在有序键值对序列上滑动，每滑动至一个位置后，判断是否存在与滑动窗口中的键值对子序列相同的原数据块，若是，则将滑动窗口中的键值对子序列识别为可重用数据块，并使滑动窗口以数据块大小为滑动长度滑动至下一个位置；若否，则使滑动窗口以1为滑动长度滑动至下一个位置；Use a sliding window with the same size as the data block to slide on the sequence of ordered key-value pairs. After sliding to a position, determine whether there is an original data block that is the same as the key-value pair subsequence in the sliding window. The key-value pair subsequence in the window is identified as a reusable data block, and the sliding window is made to slide to the next position with the size of the data block as the sliding length; if not, the sliding window is made to slide to the next position with 1 as the sliding length;

考虑到SSTable中的键值对有序，因此，当滑动窗口内的有序键值对序列的首、尾键值对与从SSTable中读取的某一个原数据块的首、尾键值对的键值对应相等时，滑动窗口内的有序键值对序列与该原数据块中的键值对完全相同，基于此，本实施例以此为判断依据，判断滑动窗口中的键值对子序列与原数据块是否相同，具体如下：Considering that the key-value pairs in the SSTable are ordered, therefore, when the first and last key-value pairs of the ordered key-value pair sequence in the sliding window are the same as the first and last key-value pairs of an original data block read from the SSTable When the corresponding key-values are equal, the sequence of ordered key-value pairs in the sliding window is exactly the same as the key-value pairs in the original data block. Based on this, this embodiment uses this as a judgment basis to judge the key-value pairs in the sliding window. Whether the subsequence is the same as the original data block, as follows:

以图4为例对上述可重用数据块的识别过程做进一步的说明。如图4所示，每个数据块包括四个键值对；在compaction操作中，所读出四个原数据块的键值分别是D1＝(A0、A1、A2、A3)、D2＝(B0、B1、B2、B3)、D3＝(A1、A2、A3、A4)和D4＝(A5、A6、A7、A8)，进行键值对合并排序后，所得到的有序键值对序列为：Taking FIG. 4 as an example, the above-mentioned identification process of the reusable data block is further described. As shown in Figure 4, each data block includes four key-value pairs; in the compaction operation, the key values of the four original data blocks read out are D1=(A0, A1, A2, A3), D2=( B0, B1, B2, B3), D3=(A1, A2, A3, A4) and D4=(A5, A6, A7, A8), after merging and sorting key-value pairs, the resulting sequence of ordered key-value pairs for:

(A0、A1、A2、A3、A4、A5、A6、A7、A8、B0、B1、B2、B3)；(A0, A1, A2, A3, A4, A5, A6, A7, A8, B0, B1, B2, B3);

利用长度为4的滑动窗口在该有序键值对序列上滑动；Use a sliding window of length 4 to slide over the sequence of ordered key-value pairs;

初始时刻，滑动窗口内的键值对子序列为(A0、A1、A2、A3)，该键值对子序列的首、尾键值对的键值分别与原数据块D1中首、尾键值对的键值相等，将此时滑动窗口内的子序列识别为一个可重用数据块；At the initial moment, the key-value pair subsequence in the sliding window is (A0, A1, A2, A3). The key value of the value pair is equal, and the subsequence in the sliding window at this time is identified as a reusable data block;

使滑动窗口向后滑动4个键值对，此时滑动窗口内的键值对子序列为(A4、A5、A6、A7)，不存在与之相同的原数据块；The sliding window is slid backward by 4 key-value pairs. At this time, the subsequence of key-value pairs in the sliding window is (A4, A5, A6, A7), and there is no original data block identical to it;

使滑动窗口向后滑动1个键值对，此时滑动窗口内的键值对子序列为(A5、A6、A7、A8)，该键值对子序列的首、尾键值对的键值分别与原数据块D4中首、尾键值对的键值相等，将此时滑动窗口内的子序列识别为一个可重用数据块；Make the sliding window slide backward by 1 key-value pair. At this time, the key-value pair subsequence in the sliding window is (A5, A6, A7, A8), and the key value of the first and last key-value pair of the key-value pair subsequence is The key values of the first and last key-value pairs in the original data block D4 are respectively equal, and the subsequence in the sliding window at this time is identified as a reusable data block;

使滑动窗口向后滑动4个键值对，此时滑动窗口内的键值对子序列为(B0、B1、B2、B3)，该键值对子序列的首、尾键值对的键值分别与原数据块D2中首、尾键值对的键值相等，将此时滑动窗口内的子序列识别为一个可重用数据块。Make the sliding window slide backward by 4 key-value pairs. At this time, the key-value pair subsequence in the sliding window is (B0, B1, B2, B3), and the key value of the first and last key-value pairs of the key-value pair subsequence The key values of the first and last key-value pairs in the original data block D2 are respectively equal, and the subsequence in the sliding window at this time is identified as a reusable data block.

本实施例利用滑动窗口识从排序后的键值对序列中的识别出可重用数据块，即与原数据块内容完全相同的数据块，相比于传统的基于去重的方法，本实施例在识别过程中无需进行复杂的哈希值计算，也避免了哈希值的管理开销，因此，本实施例能够高效识别出可重用数据块，并且计算成本低、计算开销小。This embodiment uses a sliding window to identify reusable data blocks in the sorted sequence of key-value pairs, that is, data blocks with exactly the same content as the original data block. Compared with the traditional method based on deduplication, this embodiment There is no need to perform complex hash value calculation in the identification process, and the management overhead of the hash value is also avoided. Therefore, this embodiment can efficiently identify reusable data blocks, and has low computational cost and low computational overhead.

在检测出可重用数据块之后，本实施例会进一步进行数据对齐，以保证可重用数据块和待写回数据块在新、旧SSTable中的起始偏移和长度一定是按照闪存页大小大小进行对齐的；After the reusable data block is detected, this embodiment will further perform data alignment to ensure that the starting offset and length of the reusable data block and the data block to be written back in the new and old SSTables must be based on the size of the flash memory page. aligned;

图5为例，读出的原数据块分别是(A、B、C、E)、(G、H、I、J)、(R、T、X、Z)、(A、C、D、E)、(L、M、N、P)和(Q、S、U、Y)，经过键值合并排序之后，得到的有序键值对序列为：Figure 5 is an example, the original data blocks read are (A, B, C, E), (G, H, I, J), (R, T, X, Z), (A, C, D, E), (L, M, N, P) and (Q, S, U, Y), after key-value merging and sorting, the sequence of ordered key-value pairs obtained is:

(A、B、C、D、E、G、H、I、J、L、M、N、P、Q、S、U、Y、Z)；(A, B, C, D, E, G, H, I, J, L, M, N, P, Q, S, U, Y, Z);

识别出的可重用数据块分别是(G、H、I、J)和(L、M、N、P)，按照闪存页大小进行数据对齐之后如图5所示，其中，空白位置以0进行填充；The identified reusable data blocks are (G, H, I, J) and (L, M, N, P), respectively. After data alignment is performed according to the flash page size, as shown in Figure 5, the blank position is 0. filling;

按照闪存页大小对排序后的有序键值对中的可重用数据块和其余键值对进行数据对齐，能够避免逻辑页写入时出现跨物理页的情况，从而能够避免因为不对齐导致的额外读写开销。Aligning the reusable data blocks and other key-value pairs in the sorted ordered key-value pairs according to the flash page size can avoid the situation of spanning physical pages when writing logical pages, thereby avoiding the misalignment caused by Additional read and write overhead.

进行数据对齐之后，对于可重用数据块，直接将其逻辑地址映射到相对应的原数据块在OCSSD中的物理地址即可；对于待写回数据块，则在OCSSD中分配新的物理地址，将待写回数据块写回至该物理地址，并将待写回数据块的逻辑地址映射到该新分配的物理地址，如图6所示；After data alignment, for the reusable data block, directly map its logical address to the physical address of the corresponding original data block in the OCSSD; for the data block to be written back, assign a new physical address in the OCSSD, Write the data block to be written back to the physical address, and map the logical address of the data block to be written back to the newly allocated physical address, as shown in Figure 6;

在图6中，从逻辑层面来看，本实施例中的compaction操作与传统的compaction操作相同，都会使得原文件，即原SSTable无效，但是从物理层面来看，并不会发生针对可重用数据块的写回操作。In FIG. 6 , from the logical point of view, the compaction operation in this embodiment is the same as the traditional compaction operation, which will invalidate the original file, that is, the original SSTable, but from the physical point of view, the reusable data will not be targeted. Writeback operation of the block.

由于原SSTable在compaction操作执行完成后会被置为无效，为了避免其中的可重用数据块因其他删除操作而失效，本实施例进一步会为OCSSD中的每一个物理地址维护一个引用计数，用于记录指向该物理地址的逻辑地址的计数；Since the original SSTable will be invalidated after the compaction operation is completed, in order to prevent the reusable data blocks from becoming invalid due to other delete operations, this embodiment further maintains a reference count for each physical address in the OCSSD, which is used for record the count of logical addresses pointing to this physical address;

作为一种优选的实施方式，本实施例中，SSTable中的数据块大小设置成与OCSSD中的物理页大小一致的方式，能够减少额外的读写闪存次数；As a preferred implementation, in this embodiment, the size of the data block in the SSTable is set to be consistent with the size of the physical page in the OCSSD, which can reduce the number of additional reads and writes of the flash memory;

并且，利用超级块管理OCSSD中的闪存物理块；And, use the super block to manage the flash physical blocks in the OCSSD;

本实施例以超级块来管理整个闪存，会将多个闪存物理块组织为一个超级块进行，从而每个写请求依次发送到每个并行单元，这样可以最大化利用OCSSD的并行性；In this embodiment, the super block is used to manage the entire flash memory, and multiple physical blocks of the flash memory are organized into one super block, so that each write request is sent to each parallel unit in turn, so that the parallelism of the OCSSD can be maximized;

可选地，本实施例中，当空闲的超级块少于5％时，开始垃圾回收；具体采用贪心的垃圾回收策略，优先对无效数据最多的超级块进行垃圾回收。Optionally, in this embodiment, when the free super blocks are less than 5%, garbage collection is started; specifically, a greedy garbage collection strategy is adopted, and garbage collection is preferentially performed on the super blocks with the most invalid data.

作为一种可选的实施方式，本实施中，利用用户态文件系统BlobFS与用户态驱动SPDK来管理主机端FTL，以实现上述各种操作；其中，执行数据重映射步骤时的IO栈如图7所示，可以看到，本实施例采用BlobFS与SPDK相结合的实现方式，有效缩短了软件IO栈，减少主机端延迟。As an optional implementation manner, in this implementation, the user-mode file system BlobFS and the user-mode driver SPDK are used to manage the host-side FTL, so as to realize the above-mentioned various operations; wherein, the IO stack when performing the data remapping step is as shown in the figure As shown in Fig. 7, it can be seen that this embodiment adopts the combination of BlobFS and SPDK, which effectively shortens the software IO stack and reduces the delay on the host side.

实施例2：Example 2:

一种基于OCSSD的持久键值存储设备，利用LSM-tree组织数据。An OCSSD-based persistent key-value storage device that utilizes LSM-tree to organize data.

参阅图8，该键值存储设备包括LSM-tree KV存储引擎和主机FTL，并且LSM-treeKV存储引擎中实现了键值对排序模块和数据检测模块，主机FTL中实现了数据重映射模块和数据写回模块；其中：Referring to Figure 8, the key-value storage device includes an LSM-tree KV storage engine and a host FTL, and the LSM-treeKV storage engine implements a key-value pair sorting module and a data detection module, and the host FTL implements a data remapping module and data Write back to the module; where:

写回模块，用于为待写回数据块分配新的第二逻辑地址，将待写回数据块写入第L+1层中空闲的第二物理地址，并建立第二逻辑地址到第二物理地址的映射关系；The write-back module is used to assign a new second logical address to the data block to be written back, write the data block to be written back to the second physical address that is free in the L+1 layer, and establish the second logical address to the second physical address. The mapping relationship of physical addresses;

作为一种优选的实施方式，本实施例中，LSM-tree KV存储引擎中采用用户态文件系统BlobFS，且主机FTL采用用户态驱动SPDK；As a preferred implementation, in this embodiment, the user-mode file system BlobFS is used in the LSM-tree KV storage engine, and the host FTL uses the user-mode driver SPDK;

本实施例中，各模块的具体实施方式可参考上述实施例1中的描述，在此将不做复述。In this embodiment, reference may be made to the description in Embodiment 1 above for the specific implementation of each module, which will not be repeated here.

实施例3：Example 3:

一种持久键值存储系统，如图8所示，包括OCSSD以及上述实施例2提供的基于OCSSD的持久键值存储设备。A persistent key-value storage system, as shown in FIG. 8 , includes an OCSSD and the OCSSD-based persistent key-value storage device provided in Embodiment 2 above.

本领域的技术人员容易理解，以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。Those skilled in the art can easily understand that the above are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention, etc., All should be included within the protection scope of the present invention.

Claims

1. A persistent key value storage method based on OCSD organizes data by using LSM-tree, and is characterized in that the compact operation comprises the following steps:

key value pair sorting step: reading the SSTable selected from the L layer and the SSTable which is overlapped with the selected SSTable in the L +1 layer in the key value range from the OCSSD to a memory, and sequencing the key value pairs in the SSTable read into the memory to obtain an ordered key value pair sequence; the read data blocks in the SSTable are marked as original data blocks; l and L +1 are layer serial numbers in the LSM-tree;

a data detection step: identifying key value pair subsequences which are completely the same as the original data blocks in the ordered key value pair sequences as reusable data blocks, and organizing the rest key value pairs into data blocks to be written back;

a remapping step: for the reusable data block, allocating a new first logical address to the reusable data block, obtaining a first physical address of an original data block with the same content in the OCSSD, and establishing a mapping relation from the first logical address to the first physical address;

a write-back step: and for the data block to be written back, allocating a new second logical address to the data block to be written back, writing the new second logical address into a free second physical address in the L + 1-th layer, and establishing a mapping relation from the second logical address to the second physical address.

2. The OCSSD-based persistent key value storage method of claim 1, wherein identifying a pair sequence of key value pairs in the ordered sequence of key value pairs that is identical to an original data block as a reusable data block comprises:

sliding the ordered key value pair sequences by using a sliding window with the same size as the data blocks, judging whether an original data block which is the same as the key value pair subsequence in the sliding window exists after the original data block slides to one position, if so, identifying the key value pair subsequence in the sliding window as a reusable data block, and sliding the sliding window to the next position by using the size of the data block as the sliding length; if not, the sliding window is made to slide to the next position by taking 1 as the sliding length.

3. The OCSSD-based persistent key value storage method of claim 2, wherein the manner of determining whether the key value pair subsequence in the sliding window is the same as the original data block comprises:

and comparing the key values of the head and tail key value pairs in the sliding window with the key values of the head and tail key value pairs of the original data block respectively, and if the key values of the two key value pairs are correspondingly equal, judging that the key value pair subsequence in the sliding window is the same as the original data block.

4. The ocsd-based persistent key value storage method of claim 1, further comprising:

maintaining a reference count for each physical address in the OCSSD for recording a count of logical addresses pointing to the physical address;

when the reference count of a certain physical address is 0, the physical address is set to be invalid.

5. The OCSSD-based persistent key value storage method according to any one of claims 1 to 4, wherein in the data detection step, after identifying the reusable data block and before organizing the remaining key value pairs to be written back to the data block, further comprising:

and performing data alignment on the reusable data blocks and the rest key value pairs according to the size of the flash page.

6. The OCSSD-based persistent key value storage method according to any one of claims 1 to 4, wherein a data block size in SSTable is consistent with a physical page size in the OCSSD.

7. The OCSSD-based persistent key value storage method according to any one of claims 1 to 4, wherein a flash memory physical block in the OCSSD is managed by using a super block.

8. An OCSD-based persistent key value storage device that organizes data using an LSM-tree, comprising: the system comprises a key value pair sorting module and a data detection module which are realized in an LSM-tree KV storage engine, and a data remapping module and a data write-back module which are realized in a host FTL;

the key value pair sequencing module is used for reading the SSTable selected from the L layer and the SSTable with the key value range overlapping with the selected SSTable in the L +1 layer from the OCSSD to the memory in the compact operation, and sequencing the key value pairs in the SSTable read into the memory to obtain an ordered key value pair sequence; the read data blocks in the SSTable are marked as original data blocks; l and L +1 are layer serial numbers in the LSM-tree;

the data detection module is used for identifying the key value pair subsequences which are completely the same as the original data blocks in the ordered key value pair sequences as reusable data blocks and organizing the rest key value pairs into data blocks to be written back;

the remapping module is configured to allocate a new first logical address to the reusable data block, obtain a first physical address of an original data block having the same content as the reusable data block in the ocsd, and establish a mapping relationship between the first logical address and the first physical address;

the write-back module is configured to allocate a new second logical address to the data block to be written back, write the data block to be written back into a free second physical address in the L +1 th layer, and establish a mapping relationship between the second logical address and the second physical address.

9. The ocsd-based persistent key-value storage device of claim 8, wherein a user-state file system blob fs is employed in the LSM-tree KV storage engine, and the host FTL employs a user-state driver SPDK.

10. A persistent key-value storage system comprising an ocsd and the ocsd-based persistent key-value storage device of claim 8 or 9.