CN107391761B - Data management method and device based on repeated data deletion technology - Google Patents

Info

Publication number
CN107391761B
Authority
CN
China
Prior art keywords
data
stored
metadata information
length
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710750609.4A
Other languages
Chinese (zh)
Other versions
CN107391761A (en)
Inventor
胡永刚
王利朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201710750609.4A
Publication of CN107391761A
Application granted
Publication of CN107391761B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data management method and device based on deduplication technology. The method calculates a fingerprint value of target data through a HASH algorithm and determines the storage position corresponding to that fingerprint value through CRUSH mapping. The target data is then taken as the data to be stored, and the method judges whether data is already stored at its storage position. If so, the reference count of the data to be stored is incremented by one; if not, the data to be stored is written to that position and its reference count is set to one. Finally, the first metadata information of the data to be stored is saved. Repeated storage of data is therefore avoided during data storage, and working efficiency is improved; at the same time, data management based on deduplication is realized, which saves cost and prolongs the service life of the storage system. The data management device based on deduplication technology provided by the embodiment of the invention has the same technical effects.

Description

Data management method and device based on repeated data deletion technology
Technical Field
The invention relates to the technical field of cloud computing data centers, in particular to a data management method and device based on a data de-duplication technology.
Background
With the rapid development of computer technology and the internet industry, the volume of data grows day by day, and distributed storage systems have been developed to save storage space and share resources. A distributed storage system spreads data across a plurality of independent devices. It adopts a scalable architecture, uses a plurality of storage servers to share the storage load, and uses a location server to locate stored information, which improves the reliability, availability and management efficiency of the system and makes it easy to expand.
However, because many terminals can access the storage servers, a large amount of duplicate data inevitably accumulates there and occupies storage space. Deduplication technology, which optimizes storage capacity, addresses this problem. By eliminating duplicate data in a storage system, it reduces the data actually stored in the system or transmitted over a network, and it is widely applied in backup, long-term archiving and data disaster recovery. In the field of distributed storage, handling duplicate data online is urgently needed in order to reduce the cost per unit of storage capacity.
Therefore, how to implement deduplication technology in the field of distributed storage, that is, how to store, read and delete data in a distributed storage system using deduplication, is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a data management method and device based on deduplication technology, so as to realize the storage, reading and deletion of data based on deduplication in the field of distributed storage.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
A data management method based on deduplication technology comprises the following steps:
S11, calculating a fingerprint value of the target data through a HASH algorithm;
S12, determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping; taking the target data as data to be stored, and executing S13;
S13, judging whether data exist in the storage position corresponding to the data to be stored; if yes, executing S14; if not, executing S15;
S14, adding one to the reference count corresponding to the data to be stored, and executing S16;
S15, storing the data to be stored to the storage position corresponding to the data to be stored, setting the reference count corresponding to the data to be stored to one, and executing S16;
S16, storing first metadata information of the data to be stored, wherein the first metadata information comprises the fingerprint value of the data to be stored.
Before executing S11, the method further includes:
S21, judging whether second metadata information corresponding to the target data exists; if yes, executing S22; if not, executing S11;
S22, acquiring the second metadata information;
S23, judging whether the second metadata information contains a fingerprint value; if yes, executing S24; if not, executing S11;
S24, comparing the length of the target data with a preset data length; if the length of the target data is equal to the preset data length, executing S11; if the length of the target data is smaller than the preset data length, executing S25;
S25, splicing the target data and the data corresponding to the second metadata information to obtain spliced data, calculating a fingerprint value of the spliced data, and executing S26;
S26, determining a storage position corresponding to the fingerprint value of the spliced data through CRUSH mapping; taking the spliced data as the data to be stored, and executing S13.
If the length of the target data is equal to the preset data length, the method further includes:
subtracting one from the reference count of the data corresponding to the second metadata information;
judging whether the reference count of the data corresponding to the second metadata information is zero;
and if so, deleting the data corresponding to the second metadata information.
Splicing the target data and the data corresponding to the second metadata information to obtain spliced data, and calculating the fingerprint value of the spliced data, includes:
acquiring the data content corresponding to the second metadata information;
splicing the target data and the data corresponding to the second metadata information according to the preset data length and the preset data offset to obtain spliced data;
calculating a fingerprint value of the spliced data;
subtracting one from the reference count of the data corresponding to the second metadata information;
judging whether the reference count of the data corresponding to the second metadata information is zero;
and if so, deleting the data corresponding to the second metadata information.
The method further includes:
receiving a deletion request sent by a client;
determining the data to be deleted according to the deletion request, and acquiring third metadata information of the data to be deleted and the fingerprint value of the data to be deleted contained in the third metadata information;
determining a storage position corresponding to the fingerprint value of the data to be deleted through CRUSH mapping, and subtracting one from the reference count corresponding to the data to be deleted;
judging whether the reference count corresponding to the data to be deleted is zero;
and if so, deleting the data to be deleted and the third metadata information.
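The deletion steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the in-memory dictionaries `objects` and `metadata` and the function name `delete` are hypothetical stand-ins for the object storage device and the metadata storage, and a dictionary lookup by fingerprint takes the place of CRUSH mapping.

```python
# Hypothetical in-memory stand-ins for the object store and the metadata store.
objects = {}    # fingerprint -> {"data": bytes, "refcount": int}
metadata = {}   # file name -> fingerprint (plays the role of the third metadata information)


def delete(name):
    """Handle a deletion request; return True if the stored data was physically removed."""
    fingerprint = metadata[name]      # fingerprint value taken from the metadata
    entry = objects[fingerprint]      # location lookup (CRUSH mapping in the source)
    entry["refcount"] -= 1            # subtract one from the reference count
    if entry["refcount"] == 0:
        # Last reference gone: remove both the data and its metadata.
        del objects[fingerprint]
        del metadata[name]
        return True
    # The source only specifies the zero case; this sketch leaves the
    # metadata untouched when other references remain.
    return False
```

With two names referring to the same fingerprint, the first `delete` call only lowers the count, and the second one removes the stored data.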
A data management apparatus based on deduplication technology, comprising:
the first calculation module is used for calculating a fingerprint value of the target data through a HASH algorithm;
the first determining module is used for determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping, and taking the target data as data to be stored;
the first judgment module is used for judging whether data exist in the storage position corresponding to the data to be stored;
the first execution module is used for adding one to the reference count corresponding to the data to be stored when the data is stored in the storage position corresponding to the data to be stored;
the first storage module is used for storing the data to be stored to the storage position corresponding to the data to be stored and setting the reference count corresponding to the data to be stored to be one when the data is not stored in the storage position corresponding to the data to be stored;
the second storage module is used for storing first metadata information of the data to be stored, and the first metadata information comprises: fingerprint value of the data to be stored.
The apparatus further includes:
the second judgment module is used for judging whether second metadata information corresponding to the target data exists or not; if not, triggering the first computing module;
the first acquisition module is used for acquiring second metadata information corresponding to the target data when the second metadata information exists;
the third judging module is used for judging whether the second metadata information has a fingerprint value; if not, triggering the first computing module;
the comparison module is used for comparing the length of the target data with a preset data length when the fingerprint value exists in the second metadata information; if the length of the target data is equal to the preset data length, triggering the first calculation module;
the splicing module is used for splicing the target data and the data corresponding to the second metadata information to obtain spliced data when the length of the target data is smaller than the preset data length, and calculating a fingerprint value of the spliced data;
and the second determining module is used for determining the storage position corresponding to the fingerprint value of the splicing data through CRUSH mapping.
Wherein the comparison module comprises:
the first execution unit is used for subtracting one from the reference count of the data corresponding to the second metadata information when the length of the target data is equal to the preset data length;
a first judging unit, configured to judge whether a reference count of data corresponding to the second metadata information is zero;
and the first deleting unit is used for deleting the data corresponding to the second metadata information when the reference count of the data corresponding to the second metadata information is zero.
Wherein the splicing module comprises:
an acquisition unit configured to acquire data content corresponding to the second metadata information;
the splicing unit is used for splicing the target data and the data corresponding to the second metadata information according to the preset data length and the preset data offset to obtain spliced data;
the calculation unit is used for calculating the fingerprint value of the splicing data;
the second execution unit is used for subtracting one from the reference count of the data corresponding to the second metadata information;
a second judging unit, configured to judge whether a reference count of data corresponding to the second metadata information is zero;
and the second deleting unit is used for deleting the data corresponding to the second metadata information when the reference count of the data corresponding to the second metadata information is zero.
The apparatus further includes:
the receiving module is used for receiving a deleting request sent by a client;
the second obtaining module is used for determining data to be deleted according to the deletion request and obtaining third metadata information of the data to be deleted and fingerprint values of the data to be deleted in the third metadata information;
a third determining module, configured to determine, through CRUSH mapping, a storage location corresponding to a fingerprint value of the data to be deleted, and subtract one from a reference count corresponding to the data to be deleted;
the fourth judging module is used for judging whether the reference count corresponding to the data to be deleted is zero or not;
and the deleting module is used for deleting the data to be deleted and the third metadata information when the reference count corresponding to the data to be deleted is zero.
According to the above scheme, the data management method based on deduplication technology provided by the embodiment of the invention comprises the following steps:
S11, calculating a fingerprint value of the target data through a HASH algorithm;
S12, determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping; taking the target data as data to be stored, and executing S13;
S13, judging whether data exist in the storage position corresponding to the data to be stored; if yes, executing S14; if not, executing S15;
S14, adding one to the reference count corresponding to the data to be stored, and executing S16;
S15, storing the data to be stored to the storage position corresponding to the data to be stored, setting the reference count corresponding to the data to be stored to one, and executing S16;
S16, storing first metadata information of the data to be stored, wherein the first metadata information comprises the fingerprint value of the data to be stored.
Thus, the fingerprint value of the target data is calculated through the HASH algorithm, and the storage position corresponding to that fingerprint value is determined through CRUSH mapping. The fingerprint value identifies the target data uniquely, and therefore also fixes a unique storage position for it. The target data is then taken as the data to be stored, and the method judges whether data already exists at the corresponding storage position. Because the data to be stored has a unique storage position, data found there means the data to be stored has already been stored; it is not stored again, and its reference count is incremented by one. If no data is found there, the data to be stored has not yet been stored; it is written to its storage position and its reference count is set to one. Finally, the first metadata information of the data to be stored, which includes its fingerprint value, is saved. In this way, repeated storage of data is avoided during data storage, working efficiency is improved, and the storage space of the system is saved; at the same time, deduplication-based data management is realized in the field of distributed storage, which saves cost and prolongs the service life of the storage system.
Accordingly, the data management device based on deduplication technology provided by the embodiment of the invention has the same technical effects.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data management method based on a deduplication technology according to an embodiment of the present invention;
FIG. 2 is a flowchart of another data management method based on deduplication technology according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a data deletion method in a data management method based on a deduplication technology according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data management apparatus based on a deduplication technology according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The embodiment of the invention discloses a data management method and device based on deduplication technology, which aim to realize the storage, reading and deletion of data based on deduplication in the field of distributed storage.
Referring to FIG. 1, a data management method based on a deduplication technology provided in an embodiment of the present invention includes:
S11, calculating a fingerprint value of the target data through a HASH algorithm;
Specifically, in this embodiment, the target data is the data to be stored by the current operation, and the target data must first be divided into blocks before its fingerprint value is calculated.
In the field of distributed storage, data to be stored is generally divided into pieces the size of an underlying storage object, so that the data held in the underlying storage is regular. For example, if the underlying storage object size is 4 MB and the target data is 10 MB, the target data is divided into three blocks of 4 MB, 4 MB and 2 MB. That is, the data to be stored is cut into blocks no larger than 4 MB.
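The fixed-size division described above can be sketched in a few lines. The function name `split_into_blocks` and its `block_size` parameter are illustrative assumptions; the embodiment only states that data is cut into blocks no larger than the underlying object size.

```python
def split_into_blocks(data: bytes, block_size: int = 4 * 1024 * 1024) -> list:
    """Cut data into consecutive blocks of at most block_size bytes."""
    # The last block is shorter when len(data) is not a multiple of block_size.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

Applied to 10 MB of input with the default 4 MB block size, the function returns three blocks of 4 MB, 4 MB and 2 MB, matching the example in the text.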
Specifically, the fingerprint value is calculated by the HASH algorithm from the content of each block. Fingerprint values correspond one to one with block contents, that is, with the data to be stored, so each block's content and its fingerprint value form a key-value pair. If the target data is divided into a plurality of blocks, each block has its own fingerprint value, and the subsequent operations are performed on the data corresponding to each fingerprint value; if the target data occupies a single block, it has one fingerprint value, and the subsequent operations are performed on that fingerprint value. In this embodiment, the target data is assumed to occupy a single block with a unique fingerprint value.
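As an illustration of per-block fingerprinting, the sketch below uses SHA-1 from Python's standard `hashlib` module; SHA-1 is the algorithm this description names elsewhere as the fingerprint HASH algorithm. The function names `fingerprint` and `fingerprint_blocks` are hypothetical.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return the SHA-1 digest of a block's content as a hex string."""
    return hashlib.sha1(block).hexdigest()


def fingerprint_blocks(blocks):
    """Build the fingerprint -> content key-value pairs for a list of blocks."""
    # Blocks with identical content produce the same key and collapse into one entry.
    return {fingerprint(block): block for block in blocks}
```

Because the key is derived from content alone, two blocks with the same bytes always map to the same entry, which is the property the deduplication scheme relies on.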
S12, determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping; taking the target data as data to be stored, and executing S13;
Specifically, at the RADOS layer, the fingerprint value calculated by the HASH algorithm is used in place of the target data in the CRUSH mapping process, and the fingerprint value is passed to the object storage device. The object storage device then looks up, within its own storage system, the storage position corresponding to the target data, and the storage position is thereby determined.
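The key idea of this step is that placement depends only on the fingerprint, so identical content always lands in the same place. The sketch below illustrates that idea with a deliberately simplified stand-in: it is not the CRUSH algorithm, which takes the cluster hierarchy, weights and placement rules into account. The function name `locate` and the device names are hypothetical.

```python
import hashlib


def locate(fingerprint: str, devices: list) -> str:
    """Deterministically pick a device for a fingerprint (simplified stand-in for CRUSH)."""
    if not devices:
        raise ValueError("at least one device is required")
    # Hash the fingerprint and reduce it to an index into the device list.
    digest = hashlib.sha256(fingerprint.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(devices)
    return devices[index]
```

Calling `locate` twice with the same fingerprint and device list always returns the same device. Unlike CRUSH, this modulo scheme would move most objects whenever the device list changes, which is one reason a real system would not use it.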
S13, judging whether data exist in the storage position corresponding to the data to be stored; if yes, executing S14; if not, executing S15;
Specifically, the target data is taken as the data to be stored. After the object storage device determines the storage position of the data to be stored, it first judges whether data is stored at that position. If data is stored there, the data to be stored has already been stored; if not, the data to be stored has not yet been stored.
S14, adding one to the reference count corresponding to the data to be stored, and executing S16;
Specifically, if it is determined in step S13 that the data to be stored has already been stored, the data is not stored again; instead, the reference count corresponding to the data to be stored is incremented by one.
S15, storing the data to be stored to the storage position corresponding to the data to be stored, setting the reference count corresponding to the data to be stored to one, and executing S16;
Specifically, if it is determined in step S13 that the data to be stored has not yet been stored, the data to be stored is written to its corresponding storage position, and the reference count corresponding to the data to be stored is set to one.
S16, storing first metadata information of the data to be stored, wherein the first metadata information comprises the fingerprint value of the data to be stored.
Specifically, after the data to be stored has been stored, its reference count is also saved, in a dedicated storage location provided by the object storage device; at the same time, the metadata information of the data to be stored is saved, which includes attributes such as the fingerprint value of the data to be stored.
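Steps S11 to S16 can be sketched as a small in-memory store. This is an illustrative model under stated assumptions, not the embodiment's implementation: the class name `DedupStore` and its attributes are hypothetical, SHA-1 is used for the fingerprint, and a dictionary keyed by fingerprint stands in for the CRUSH-determined storage position.

```python
import hashlib


class DedupStore:
    """Toy single-process model of the S11-S16 write path."""

    def __init__(self):
        self.objects = {}    # fingerprint -> block content
        self.refcounts = {}  # fingerprint -> reference count
        self.metadata = {}   # name -> fingerprint (first metadata information)

    def write(self, name: str, block: bytes) -> str:
        fp = hashlib.sha1(block).hexdigest()  # S11: fingerprint of the content
        # S12: the fingerprint itself is the lookup key (stand-in for CRUSH mapping).
        if fp in self.objects:                # S13: is data already stored there?
            self.refcounts[fp] += 1           # S14: only bump the reference count
        else:
            self.objects[fp] = block          # S15: store the data once
            self.refcounts[fp] = 1
        self.metadata[name] = fp              # S16: record the fingerprint as metadata
        return fp
```

Writing the same content under two names leaves one stored object with a reference count of two, which is the space saving the method aims for.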
Specifically, when fingerprint values are stored, 8 KB of metadata is written first, and the fingerprint values corresponding to the file are written after the metadata. The metadata information of the metadata storage is kept in the cluster environment with 8 KB as the object and the file as the unit. For a 4 MB data block, 4088 KB (4096 KB minus the 8 KB of metadata) is available for fingerprint data. With SHA-1 as the fingerprint HASH algorithm, one fingerprint is 20 bytes, so the block holds 209305 fingerprint values; at 4 MB of data per fingerprint, these correspond to about 817 GB of data.
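The capacity figures in this paragraph can be checked with a short calculation. The variable names are illustrative, binary units (1 KB = 1024 bytes) are assumed, and each fingerprint is assumed to index one full 4 MB block.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

block_size = 4 * MB        # size of the block that holds the fingerprints
header_size = 8 * KB       # metadata written before the fingerprints
fingerprint_size = 20      # SHA-1 digest length in bytes

fingerprint_space = block_size - header_size              # 4088 KB
fingerprint_count = fingerprint_space // fingerprint_size # whole fingerprints that fit
indexed_bytes = fingerprint_count * block_size            # data covered at 4 MB each

print(fingerprint_space // KB, fingerprint_count, indexed_bytes // GB)
```

The three printed values are 4088, 209305 and 817, agreeing with the numbers quoted in the text.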
It can be seen that, in the data management method based on the deduplication technology provided by this embodiment, the fingerprint value of the target data is calculated by the HASH algorithm, and the storage position corresponding to that fingerprint value is determined through CRUSH mapping. The fingerprint value identifies the target data uniquely, and therefore also fixes a unique storage position for it. The target data is then taken as the data to be stored, and the method judges whether data already exists at the corresponding storage position. Because the data to be stored has a unique storage position, data found there means the data to be stored has already been stored; it is not stored again, and its reference count is incremented by one. If no data is found there, the data to be stored has not yet been stored; it is written to its storage position and its reference count is set to one. Finally, the first metadata information of the data to be stored, which includes its fingerprint value, is saved. In this way, repeated storage of data is avoided during data storage, working efficiency is improved, and the storage space of the system is saved; at the same time, deduplication-based data management is realized in the field of distributed storage, which saves cost and prolongs the service life of the storage system.
Referring to FIG. 2, another data management method based on a deduplication technology provided in an embodiment of the present invention includes:
S21, judging whether second metadata information corresponding to the target data exists; if yes, executing S22; if not, executing S11;
Specifically, in this embodiment, before the target data is stored, it is first determined whether second metadata information of the target data exists, that is, whether the target data is being stored for the first time or stored again, so as to determine whether the current operation is a create write or a modify write. If the second metadata information of the target data exists, the target data is not being stored for the first time, the current operation is determined to be a modification, and step S22 is executed; if the second metadata information of the target data does not exist, the target data is being stored for the first time, and step S11 is executed.
S22, acquiring the second metadata information;
Specifically, the second metadata information is acquired as follows: the file system client obtains the index information of the target data and requests the second metadata information from the metadata storage; the metadata storage retrieves the second metadata information according to the index information of the target data. The second metadata information includes the fingerprint value of the target data, stored as a key-value pair.
It should be noted that only the second metadata information is acquired here. To obtain the data content corresponding to a piece of metadata, that is, to read the data, the client obtains the index information of the data to be read and requests the metadata information from the metadata storage; the metadata storage retrieves the metadata information according to the index information, where the metadata information includes the fingerprint values of all objects forming the file, stored as key-value pairs; RADOS then reads the data directly from the object storage device according to the data offset, data length and fingerprint value. This completes the data reading process.
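The read process just described (metadata lookup, then a read by offset, length and fingerprint) can be sketched as follows. The function `read` and the shapes of `metadata` and `objects` are assumptions made for illustration: `metadata` maps a file name to the ordered list of fingerprints of its blocks, and `objects` maps a fingerprint to block content.

```python
def read(name, offset, length, metadata, objects, block_size=4 * 1024 * 1024):
    """Read up to `length` bytes of file `name` starting at `offset`."""
    fingerprints = metadata[name]   # fingerprints of all blocks forming the file
    out = bytearray()
    pos, end = offset, offset + length
    while pos < end:
        index, inner = divmod(pos, block_size)  # which block, and where inside it
        if index >= len(fingerprints):
            break                               # read past the end of the file
        block = objects[fingerprints[index]]
        take = min(end - pos, block_size - inner)
        chunk = block[inner:inner + take]
        if not chunk:
            break                               # short final block is exhausted
        out += chunk
        pos += len(chunk)
    return bytes(out)
```

The loop crosses block boundaries as needed, so a read that starts in one block and ends in the next fetches both objects by their fingerprints.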
S23, judging whether the second metadata information has a fingerprint value; if yes, go to S24; if not, go to S11;
specifically, after the second metadata information of the target data is obtained in step S22, it is necessary to determine whether the second metadata information is complete, that is, determine whether a fingerprint value exists in the second metadata information, and if a fingerprint value exists, continue to execute step S24; if there is no fingerprint value, step S11 is executed.
S24, comparing the length of the target data with a preset data length; if the length of the target data is equal to the preset data length, executing S11; if the length of the target data is smaller than the preset data length, executing S25;
Specifically, after step S23 determines that a fingerprint value exists in the second metadata information, the length of the target data is compared with the preset data length. Before the comparison, the target data is generally divided into blocks; the blocking process is similar to that of the above embodiment and is not described again here.
Specifically, after the target data is divided into blocks, the block length is compared with the preset data length. In this embodiment, the target data is assumed to form a single data block, so the block length equals the length of the target data, and the length of the target data is compared with the preset data length. The preset data length is the system default length, which is 4M. If the length of the target data is equal to the preset data length, S11 is executed; if it is smaller than the preset data length, S25 is executed.
S25, splicing the target data and the data corresponding to the second metadata information to obtain spliced data, calculating a fingerprint value of the spliced data, and executing S26;
Specifically, if the length of the target data is smaller than the preset data length, the target data and the data corresponding to the second metadata information are spliced according to the data offset and the data length. In this embodiment, the preset data length is 4M. For example, suppose the data corresponding to the second metadata information is a data object covering 0-4M, the target data is 1M long, and the range 2-3M within 0-4M is to be modified. First, the whole 0-4M object corresponding to the second metadata information is read and divided into three segments, 0-2M, 2-3M and 3-4M; the original 1M of content at 2-3M is replaced with the 1M of target data; and the three segments 0-2M, new 2-3M and 3-4M are spliced together to form new 4M data, which is the spliced data.
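The 0-4M example can be written out as a short Python sketch. The function name `splice` and the reading of "M" as MiB are assumptions made for illustration; the source gives only the segment boundaries.

```python
MIB = 1024 * 1024  # "M" is read as MiB here; this is an assumption


def splice(old_object: bytes, target: bytes, offset: int) -> bytes:
    """Replace the bytes of `old_object` starting at `offset` with `target`.

    The result has the same length as `old_object`, mirroring the
    0-2M / new 2-3M / 3-4M splicing in the example.
    """
    end = offset + len(target)
    if offset < 0 or end > len(old_object):
        raise ValueError("target data must fit inside the existing object")
    head = old_object[:offset]  # segment before the modified range (0-2M)
    tail = old_object[end:]     # segment after the modified range (3-4M)
    return head + target + tail


# The example from the text: modify 2-3M of a 4M object with 1M of new data.
old = b"A" * (4 * MIB)
spliced = splice(old, b"B" * MIB, 2 * MIB)
```

The spliced result is a full-length object again, so it can be fingerprinted and stored exactly like newly written data.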
S26, determining a storage position corresponding to the fingerprint value of the spliced data through CRUSH mapping; the spliced data is taken as the data to be stored, and S13 is executed.
Specifically, in this embodiment, the process of determining the storage position corresponding to the fingerprint value of the spliced data is similar to that in the above embodiment and is not described again here. After the storage position corresponding to the fingerprint value of the spliced data is determined, the spliced data is taken as the data to be stored, and step S13 is executed.
It can be seen that the data management method based on the deduplication technology provided in this embodiment first judges whether second metadata information corresponding to the target data exists. If it does not exist, S11 is executed; if it exists, the second metadata information is obtained, and it is judged whether a fingerprint value exists in it. If no fingerprint value exists, S11 is executed; if one exists, the length of the target data is compared with the preset data length. If the length of the target data is equal to the preset data length, S11 is executed; if it is smaller, the target data and the data corresponding to the second metadata information are spliced to obtain spliced data, the fingerprint value of the spliced data is calculated, the storage position corresponding to that fingerprint value is determined through CRUSH mapping, and the spliced data is taken as the data to be stored before S13 is executed. In this way, repeated storage of data is avoided during data storage, working efficiency is improved, and system storage space is saved; meanwhile, data management based on the data deduplication technology is realized in the field of distributed storage, which saves cost and prolongs the service life of the storage system.
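The branching in steps S21 to S24 can be summarized as a small decision function. This is only a sketch: the function name `next_step`, the dictionary layout of the metadata, and the reading of the 4M default as 4 MiB are assumptions, and the case of target data longer than the preset length is rejected because the text expects such data to be divided into blocks beforehand.

```python
from typing import Optional

PRESET_LENGTH = 4 * 1024 * 1024  # the "4M" system default, read as 4 MiB (assumption)


def next_step(second_metadata: Optional[dict], target_length: int,
              preset_length: int = PRESET_LENGTH) -> str:
    """Return the step that follows the S21-S24 checks: "S11" or "S25"."""
    if second_metadata is None:                 # S21: no second metadata information
        return "S11"
    if not second_metadata.get("fingerprint"):  # S23: metadata has no fingerprint value
        return "S11"
    if target_length == preset_length:          # S24: whole object is overwritten
        return "S11"
    if target_length < preset_length:           # S24: partial overwrite, splice first
        return "S25"
    raise ValueError("target data must be divided into blocks before this check")
```

For instance, a 1 MiB write to an object whose metadata already carries a fingerprint yields `"S25"`, while a first write with no existing metadata yields `"S11"`.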
Based on any of the above embodiments, it should be noted that, when the length of the target data is equal to the preset data length, the method includes:
if the length of the target data is equal to the preset data length, subtracting one from the reference count of the data corresponding to the second metadata information;
judging whether the reference count of the data corresponding to the second metadata information is zero or not;
and if so, deleting the data corresponding to the second metadata information.
Specifically, during a modify-write, when the length of the target data is equal to the preset data length, the reference count of the data corresponding to the second metadata information is decremented by one. If nothing else references that data, the count is zero after the decrement, and the data corresponding to the second metadata information is deleted.
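The decrement-and-delete rule can be sketched in a few lines of Python. The function name `release` and the two dictionaries are hypothetical stand-ins for wherever the system keeps reference counts and object content.

```python
def release(refcounts: dict, objects: dict, fp: str) -> bool:
    """Drop one reference to the object with fingerprint `fp`.

    Returns True if this was the last reference and the object was deleted,
    False if other references remain and the object is kept.
    """
    refcounts[fp] -= 1
    if refcounts[fp] == 0:
        del refcounts[fp]
        objects.pop(fp, None)  # no other references remain: delete the data
        return True
    return False
```

With a count of two, the first call returns `False` and keeps the object; the second call returns `True` and removes it.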
Based on any of the above embodiments, it should be noted that splicing the target data and the data corresponding to the second metadata information to obtain spliced data, and calculating a fingerprint value of the spliced data, includes:
acquiring data content corresponding to the second metadata information;
splicing the target data and the data corresponding to the second metadata information according to the preset data length and the preset data offset to obtain spliced data;
calculating a fingerprint value of the spliced data;
subtracting one from the reference count of the data corresponding to the second metadata information;
judging whether the reference count of the data corresponding to the second metadata information is zero or not;
and if so, deleting the data corresponding to the second metadata information.
Specifically, during a modify-write, when the length of the target data is smaller than the preset data length, the target data and the data corresponding to the second metadata information are spliced according to the preset data length and the data offset to obtain spliced data, and a fingerprint value of the spliced data is calculated. It is then judged whether the data corresponding to the second metadata information has other references, as follows: the reference count of the data corresponding to the second metadata information is decremented by one; if the count is zero after the decrement, the data has no other references and is deleted; if the count is not zero after the decrement, the data has other references and is kept.
Based on any of the above embodiments, it should be noted that the data management method based on the deduplication technology provided in the embodiments of the present invention further includes a data deletion method. With reference to fig. 3, the specific process includes:
S31, receiving a deletion request sent by the client;
S32, determining data to be deleted according to the deletion request, and acquiring third metadata information of the data to be deleted and a fingerprint value of the data to be deleted in the third metadata information;
S33, determining a storage position corresponding to the fingerprint value of the data to be deleted through CRUSH mapping, and subtracting one from the reference count corresponding to the data to be deleted;
S34, judging whether the reference count corresponding to the data to be deleted is zero or not;
S35, if yes, deleting the data to be deleted and the third metadata information;
S36, if not, not executing the deleting operation.
Specifically, executing the data deletion method involves a read step: the data to be deleted is determined according to the deletion request, and the third metadata information of the data to be deleted, together with the fingerprint value it contains, is obtained. Only the third metadata information and its fingerprint value are read here; the content of the data to be deleted is not read. The storage position corresponding to the fingerprint value of the data to be deleted is then determined through CRUSH mapping, the reference count corresponding to the data to be deleted is decremented by one, the metadata information of the data to be deleted is deleted, and the client is informed that the deletion succeeded. If the reference count is zero after the decrement, the data to be deleted has no other references, and the data itself is deleted.
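The deletion flow can be sketched as follows, following the description in the preceding paragraph (the metadata entry is removed on every delete, the object content only when the count reaches zero). The class `DedupStore`, its methods, SHA-256 as the fingerprint, and the modulo placement in `_locate` are all assumptions for illustration; in particular, `_locate` is a simple stand-in and does not implement CRUSH.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store; every name here is hypothetical."""

    def __init__(self, num_devices: int = 4):
        # One dict per simulated storage device: fingerprint -> [data, refcount]
        self.devices = [dict() for _ in range(num_devices)]
        self.metadata = {}  # name -> fingerprint value

    def _locate(self, fp: str) -> dict:
        # Stand-in for CRUSH mapping: the placement depends only on the fingerprint.
        return self.devices[int(fp, 16) % len(self.devices)]

    def put(self, name: str, data: bytes) -> None:
        fp = hashlib.sha256(data).hexdigest()
        device = self._locate(fp)
        if fp in device:
            device[fp][1] += 1
        else:
            device[fp] = [data, 1]
        self.metadata[name] = fp

    def delete(self, name: str) -> bool:
        """Delete `name`; return True if the object content was also removed."""
        fp = self.metadata.pop(name)  # S32: read only metadata, then drop it
        device = self._locate(fp)     # S33: locate the object by its fingerprint
        entry = device[fp]
        entry[1] -= 1                 # S33: decrement the reference count
        if entry[1] == 0:             # S34
            del device[fp]            # S35: last reference, delete the data
            return True
        return False                  # S36: other references remain

    def object_count(self) -> int:
        return sum(len(d) for d in self.devices)
```

The object content is never read during `delete`; only the fingerprint from the metadata is needed to find the reference count.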
A data management apparatus based on the deduplication technology provided by an embodiment of the present invention is introduced below; the apparatus described below and the data management method based on the deduplication technology described above may be referred to in conjunction with each other.
Referring to fig. 4, an embodiment of the present invention provides a data management apparatus based on a deduplication technology, including:
a first calculating module 401, configured to calculate a fingerprint value of target data through a HASH algorithm;
a first determining module 402, configured to determine, through CRUSH mapping, a storage location corresponding to a fingerprint value of the target data, and use the target data as data to be stored;
a first judging module 403, configured to judge whether data exists in a storage location corresponding to the data to be stored;
a first execution module 404, configured to, when data is stored in a storage location corresponding to the data to be stored, increment a reference count corresponding to the data to be stored by one;
a first storage module 405, configured to, when there is no data stored in the storage location corresponding to the data to be stored, store the data to be stored to the storage location corresponding to the data to be stored, and set a reference count corresponding to the data to be stored to one;
a second storage module 406, configured to store first metadata information of data to be stored, where the first metadata information includes: fingerprint value of the data to be stored.
The apparatus further includes:
a second judging module, configured to judge whether second metadata information corresponding to the target data exists, and to trigger the first calculating module if it does not;
specifically, when the second judging module judges that the second metadata information corresponding to the target data does not exist, the first calculating module is triggered, and the first calculating module calculates the fingerprint value of the target data through the HASH algorithm; determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping by a first determining module, and taking the target data as data to be stored; the first judging module judges whether data exist in the storage position corresponding to the data to be stored or not; when the storage position corresponding to the data to be stored stores data, the first execution module increases the reference count corresponding to the data to be stored by one; when the storage position corresponding to the data to be stored does not store data, the first storage module stores the data to be stored to the storage position corresponding to the data to be stored and sets the reference count corresponding to the data to be stored to be one; and finally, the second storage module stores first metadata information of the data to be stored, wherein the first metadata information comprises: fingerprint value of the data to be stored.
The first acquisition module is used for acquiring second metadata information corresponding to the target data when the second metadata information exists;
a third judging module, configured to judge whether a fingerprint value exists in the second metadata information, and to trigger the first calculating module if it does not;
specifically, when the third judging module judges that no fingerprint value exists in the second metadata information, the first calculating module is triggered; the first calculating module calculates the fingerprint value of the target data through the HASH algorithm; the first determining module determines the storage position corresponding to the fingerprint value of the target data through CRUSH mapping and takes the target data as data to be stored; the first judging module judges whether data exists in the storage position corresponding to the data to be stored; when data is stored in that storage position, the first execution module increments the reference count corresponding to the data to be stored by one; when no data is stored in that storage position, the first storage module stores the data to be stored to the storage position and sets the reference count corresponding to the data to be stored to one; and finally, the second storage module stores first metadata information of the data to be stored, wherein the first metadata information comprises: fingerprint value of the data to be stored.
The comparison module is used for comparing the length of the target data with a preset data length when the fingerprint value exists in the second metadata information; if the length of the target data is equal to the preset data length, triggering the first calculation module;
specifically, when a fingerprint value exists in the second metadata information, comparing the length of the target data with a preset data length; if the length of the target data is equal to the preset data length, triggering the first calculation module; calculating a fingerprint value of the target data through a HASH algorithm by a first calculation module; determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping by a first determining module, and taking the target data as data to be stored; the first judging module judges whether data exist in the storage position corresponding to the data to be stored or not; when the storage position corresponding to the data to be stored stores data, the first execution module increases the reference count corresponding to the data to be stored by one; when the storage position corresponding to the data to be stored does not store data, the first storage module stores the data to be stored to the storage position corresponding to the data to be stored and sets the reference count corresponding to the data to be stored to be one; and finally, the second storage module stores first metadata information of the data to be stored, wherein the first metadata information comprises: fingerprint value of the data to be stored.
The splicing module is used for splicing the target data and the data corresponding to the second metadata information to obtain spliced data when the length of the target data is smaller than the preset data length, and calculating a fingerprint value of the spliced data;
and the second determining module is used for determining the storage position corresponding to the fingerprint value of the splicing data through CRUSH mapping.
Wherein the comparison module comprises:
the first execution unit is used for subtracting one from the reference count of the data corresponding to the second metadata information when the length of the target data is equal to the preset data length;
a first judging unit, configured to judge whether a reference count of data corresponding to the second metadata information is zero;
and the first deleting unit is used for deleting the data corresponding to the second metadata information when the reference count of the data corresponding to the second metadata information is zero.
Wherein the splicing module includes:
an acquisition unit configured to acquire data content corresponding to the second metadata information;
the splicing unit is used for splicing the target data and the data corresponding to the second metadata information according to the preset data length and the preset data offset to obtain spliced data;
the calculation unit is used for calculating the fingerprint value of the splicing data;
the second execution unit is used for subtracting one from the reference count of the data corresponding to the second metadata information;
a second judging unit, configured to judge whether a reference count of data corresponding to the second metadata information is zero;
and the second deleting unit is used for deleting the data corresponding to the second metadata information when the reference count of the data corresponding to the second metadata information is zero.
The apparatus further includes:
the receiving module is used for receiving a deleting request sent by a client;
the second obtaining module is used for determining data to be deleted according to the deletion request and obtaining third metadata information of the data to be deleted and fingerprint values of the data to be deleted in the third metadata information;
a third determining module, configured to determine, through CRUSH mapping, a storage location corresponding to a fingerprint value of the data to be deleted, and subtract one from a reference count corresponding to the data to be deleted;
the fourth judging module is used for judging whether the reference count corresponding to the data to be deleted is zero or not;
and the deleting module is used for deleting the data to be deleted and the third metadata information when the reference count corresponding to the data to be deleted is zero.
It can be seen that, in the data management apparatus based on the deduplication technology provided in this embodiment, first, the first calculation module calculates the fingerprint value of the target data through the HASH algorithm; a first determining module determines a storage position corresponding to the fingerprint value of the target data through CRUSH mapping, and the target data is used as data to be stored; judging whether data exist in a storage position corresponding to the data to be stored or not by a first judging module; when the storage position corresponding to the data to be stored stores data, the first execution module increases the reference count corresponding to the data to be stored by one; when the storage position corresponding to the data to be stored does not store data, the first storage module stores the data to be stored to the storage position corresponding to the data to be stored and sets the reference count corresponding to the data to be stored to be one; and finally, the second storage module stores first metadata information of the data to be stored, wherein the first metadata information comprises: fingerprint value of the data to be stored. Thereby completing the storage of the data and the storage of the metadata information thereof.
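The module chain summarized above (fingerprint, placement, existence check, reference count, metadata) can be sketched as a single Python function. The names `devices`, `metadata`, `locate` and `store` are hypothetical, SHA-256 stands in for the unspecified HASH algorithm, and `locate` is a modulo placement used only as a stand-in for CRUSH mapping.

```python
import hashlib

NUM_DEVICES = 4  # number of simulated storage devices (assumption)
devices = [dict() for _ in range(NUM_DEVICES)]  # fingerprint -> {"data", "refs"}
metadata = {}  # name -> first metadata information


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def locate(fp: str) -> dict:
    # Stand-in for CRUSH mapping: identical fingerprints map to the same device.
    return devices[int(fp, 16) % NUM_DEVICES]


def store(name: str, data: bytes) -> str:
    fp = fingerprint(data)                      # first calculating module
    device = locate(fp)                         # first determining module
    if fp in device:                            # first judging module
        device[fp]["refs"] += 1                 # first execution module
    else:
        device[fp] = {"data": data, "refs": 1}  # first storage module
    metadata[name] = {"fingerprint": fp, "length": len(data)}  # second storage module
    return fp
```

Because placement is derived from the fingerprint, identical content always lands on the same location, which is what allows the existence check to detect duplicates without a global index lookup.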
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A data management method based on data de-duplication technology is characterized by comprising the following steps:
S11, calculating a fingerprint value of the target data through a HASH algorithm;
S12, determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping; taking the target data as data to be stored, and executing S13;
S13, judging whether data exist in the storage position corresponding to the data to be stored; if yes, go to S14; if not, go to S15;
S14, adding one to the reference count corresponding to the data to be stored, and executing S16;
S15, storing the data to be stored to the storage position corresponding to the data to be stored, setting the reference count corresponding to the data to be stored to be one, and executing S16;
S16, storing first metadata information of the data to be stored, wherein the first metadata information comprises: fingerprint value of the data to be stored;
before executing the S11, the method further includes:
S21, judging whether second metadata information corresponding to the target data exists or not; if yes, go to S22; if not, go to S11;
S22, acquiring the second metadata information;
S23, judging whether the second metadata information has a fingerprint value; if yes, go to S24; if not, go to S11;
S24, comparing the length of the target data with a preset data length; if the length of the target data is equal to the preset data length, executing S11; if the length of the target data is smaller than the preset data length, executing S25;
S25, splicing the target data and the data corresponding to the second metadata information to obtain spliced data, calculating a fingerprint value of the spliced data, and executing S26;
S26, determining a storage position corresponding to the fingerprint value of the splicing data through CRUSH mapping; the spliced data is regarded as data to be stored, and S13 is executed.
2. The data management method based on the deduplication technology as claimed in claim 1, wherein, if the length of the target data is equal to the preset data length, the method comprises:
if the length of the target data is equal to the preset data length, subtracting one from the reference count of the data corresponding to the second metadata information;
judging whether the reference count of the data corresponding to the second metadata information is zero or not;
and if so, deleting the data corresponding to the second metadata information.
3. The data management method based on the deduplication technology according to claim 1, wherein the splicing the target data and the data corresponding to the second metadata information to obtain spliced data, and calculating a fingerprint value of the spliced data includes:
acquiring data content corresponding to the second metadata information;
splicing the target data and the data corresponding to the second metadata information according to the preset data length and the preset data offset to obtain spliced data;
calculating a fingerprint value of the spliced data;
subtracting one from the reference count of the data corresponding to the second metadata information;
judging whether the reference count of the data corresponding to the second metadata information is zero or not;
and if so, deleting the data corresponding to the second metadata information.
4. The data management method based on data deduplication technology according to any one of claims 1-3, further comprising:
receiving a deletion request sent by a client;
determining data to be deleted according to the deletion request, and acquiring third metadata information of the data to be deleted and fingerprint values of the data to be deleted in the third metadata information;
determining a storage position corresponding to the fingerprint value of the data to be deleted through CRUSH mapping, and subtracting one from the reference count corresponding to the data to be deleted;
judging whether the reference count corresponding to the data to be deleted is zero or not;
and if so, deleting the data to be deleted and the third metadata information.
5. A data management apparatus based on a data deduplication technology, comprising:
the first calculation module is used for calculating a fingerprint value of the target data through a HASH algorithm;
the first determining module is used for determining a storage position corresponding to the fingerprint value of the target data through CRUSH mapping, and taking the target data as data to be stored;
the first judgment module is used for judging whether data exist in the storage position corresponding to the data to be stored;
the first execution module is used for adding one to the reference count corresponding to the data to be stored when the data is stored in the storage position corresponding to the data to be stored;
the first storage module is used for storing the data to be stored to the storage position corresponding to the data to be stored and setting the reference count corresponding to the data to be stored to be one when the data is not stored in the storage position corresponding to the data to be stored;
the second storage module is used for storing first metadata information of the data to be stored, and the first metadata information comprises: fingerprint value of the data to be stored;
wherein the data management apparatus further comprises:
the second judgment module is used for judging whether second metadata information corresponding to the target data exists or not; if not, triggering the first computing module;
the first acquisition module is used for acquiring second metadata information corresponding to the target data when the second metadata information exists;
the third judging module is used for judging whether the second metadata information has a fingerprint value; if not, triggering the first computing module;
the comparison module is used for comparing the length of the target data with a preset data length when the fingerprint value exists in the second metadata information; if the length of the target data is equal to the preset data length, triggering the first calculation module;
the splicing module is used for splicing the target data and the data corresponding to the second metadata information to obtain spliced data when the length of the target data is smaller than the preset data length, and calculating a fingerprint value of the spliced data;
and the second determining module is used for determining the storage position corresponding to the fingerprint value of the splicing data through CRUSH mapping.
6. The data management apparatus based on data deduplication technology as claimed in claim 5, wherein the comparing module comprises:
the first execution unit is used for subtracting one from the reference count of the data corresponding to the second metadata information when the length of the target data is equal to the preset data length;
a first judging unit, configured to judge whether a reference count of data corresponding to the second metadata information is zero;
and the first deleting unit is used for deleting the data corresponding to the second metadata information when the reference count of the data corresponding to the second metadata information is zero.
7. The data management apparatus based on data deduplication technology of claim 5, wherein the splicing module comprises:
an acquisition unit configured to acquire data content corresponding to the second metadata information;
the splicing unit is used for splicing the target data and the data corresponding to the second metadata information according to the preset data length and the preset data offset to obtain spliced data;
the calculation unit is used for calculating the fingerprint value of the splicing data;
the second execution unit is used for subtracting one from the reference count of the data corresponding to the second metadata information;
a second judging unit, configured to judge whether a reference count of data corresponding to the second metadata information is zero;
and the second deleting unit is used for deleting the data corresponding to the second metadata information when the reference count of the data corresponding to the second metadata information is zero.
8. The data management apparatus based on data deduplication technology according to any one of claims 5-7, further comprising:
the receiving module is used for receiving a deleting request sent by a client;
the second obtaining module is used for determining data to be deleted according to the deletion request and obtaining third metadata information of the data to be deleted and fingerprint values of the data to be deleted in the third metadata information;
a third determining module, configured to determine, through CRUSH mapping, a storage location corresponding to a fingerprint value of the data to be deleted, and subtract one from a reference count corresponding to the data to be deleted;
the fourth judging module is used for judging whether the reference count corresponding to the data to be deleted is zero or not;
and the deleting module is used for deleting the data to be deleted and the third metadata information when the reference count corresponding to the data to be deleted is zero.
Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800218B (en) * 2019-01-04 2024-04-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data deduplication method
CN110399348A (en) * 2019-07-19 2019-11-01 苏州浪潮智能科技有限公司 File deletes method, apparatus, system and computer readable storage medium again
WO2021013335A1 (en) * 2019-07-23 2021-01-28 Huawei Technologies Co., Ltd. Devices, system and methods for deduplication
CN114816251A (en) * 2019-07-26 2022-07-29 华为技术有限公司 Data processing method, device and computer storage readable storage medium
CN111711674B (en) * 2020-06-05 2023-03-14 华南师范大学 Cloud computing method based on Internet of things

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101276366A (en) * 2007-03-27 2008-10-01 株式会社日立制作所 Computer system preventing storage of duplicate files
CN102495894A (en) * 2011-12-12 2012-06-13 成都市华为赛门铁克科技有限公司 Method, device and system for searching repeated data
CN103248711A (en) * 2013-05-23 2013-08-14 华为技术有限公司 File uploading method and server
CN105049213A (en) * 2015-07-27 2015-11-11 小米科技有限责任公司 File signature method and device
CN106649702A (en) * 2016-12-20 2017-05-10 上海斐讯数据通信技术有限公司 File storage method and apparatus of cloud storage system, and cloud storage system

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9122729B2 (en) * 2009-07-31 2015-09-01 Cumulus Data Llc Chain-of-custody for archived data

Also Published As

Publication number Publication date
CN107391761A (en) 2017-11-24

Similar Documents

Publication Publication Date Title
CN107391761B (en) Data management method and device based on repeated data deletion technology
US9305005B2 (en) Merging entries in a deduplication index
US10380073B2 (en) Use of solid state storage devices and the like in data deduplication
US9851917B2 (en) Method for de-duplicating data and apparatus therefor
RU2626334C2 (en) Method and device for processing data object
EP3376393B1 (en) Data storage method and apparatus
CN110347651B (en) Cloud storage-based data synchronization method, device, equipment and storage medium
US8719237B2 (en) Method and apparatus for deleting duplicate data
CA3068345C (en) Witness blocks in blockchain applications
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
WO2018121430A1 (en) File storage and indexing method, apparatus, media, device and method for reading files
US11249987B2 (en) Data storage in blockchain-type ledger
US10198462B2 (en) Cache management
CN110908589A (en) Data file processing method, device and system and storage medium
CN110618974A (en) Data storage method, device, equipment and storage medium
CN103501319A (en) Low-delay distributed storage system for small files
CN115203159B (en) Data storage method, device, computer equipment and storage medium
CN111274245B (en) Method and device for optimizing data storage
CN105493080A (en) Method and apparatus for context aware based data de-duplication
CN107423425B (en) Method for quickly storing and inquiring data in K/V format
US11372570B1 (en) Storage device, computer system, and data transfer program for deduplication
CN115033551A (en) Database migration method and device, electronic equipment and storage medium
CN114185850A (en) Cloud storage duplicate removal method and device based on sliding window block optimization algorithm
CN112988461B (en) Data backup method, edge node, data center and computer storage medium
US11151159B2 (en) System and method for deduplication-aware replication with an unreliable hash

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200110

Address after: 215100 No. 1 Guanpu Road, Guoxiang Street, Wuzhong Economic Development Zone, Suzhou City, Jiangsu Province

Applicant after: Suzhou Wave Intelligent Technology Co., Ltd.

Address before: 450018 Room 1601, 16th Floor, No. 278 Xinyi Road, Zhengdong New District, Zhengzhou City, Henan Province

Applicant before: Zhengzhou Yunhai Information Technology Co., Ltd.

GR01 Patent grant