Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3357223.3362731acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems

Published: 20 November 2019 Publication History

Abstract

I/O deduplication is a key technique for improving storage systems' space and I/O efficiency. Among various deduplication techniques content-defined chunking (CDC) based deduplication is the most desired one for its high deduplication ratio. However, CDC is compute-intensive and time-consuming, and has been recognized as a major performance bottleneck of the CDC-based deduplication system.
In this paper we leverage the existence of a property in the duplicate data, named duplicate locality, that reveals the fact that multiple duplicate chunks are likely to occur together. In other words, one duplicate chunk is likely to be immediately followed by a sequence of contiguous duplicate chunks. The longer the sequence, the stronger the locality is. After a quantitative analysis of duplicate locality in real-world data, we propose a suite of chunking techniques that exploit the locality to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The resulting deduplication method, named RapidCDC, has two salient features. One is that its efficiency is positively correlated to the deduplication ratio. RapidCDC can be as fast as a fixed-size chunking method when applied on data sets with high data redundancy. The other feature is that its high efficiency does not rely on high duplicate locality strength. These attractive features make RapidCDC's effectiveness almost guaranteed for datasets with high deduplication ratio. Our experimental results with synthetic and real-world datasets show that RapidCDC's chunking speedup can be up to 33x higher than regular CDC. Meanwhile, it maintains (nearly) the same deduplication ratio.

References

[1]
Bhavish Aggarwal, Aditya Akella, Ashok Anand, Athula Balachandran, Pushkar Chitnis, Chitra Muthukrishnan, Ramachandran Ramjee, and George Varghese. 2010. EndRE: An End-system Redundancy Elimination Service for Enterprises. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10). USENIX Association, Berkeley, CA, USA, 28--28. http://dl.acm.org/citation.cfm?id=1855711.1855739
[2]
Samer Al-Kiswany, Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, and Matei Ripeanu. 2008. StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems. In Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC '08). ACM, New York, NY, USA, 165--174. https://doi.org/10.1145/1383422.1383443
[3]
Pramod Bhatotia, Rodrigo Rodrigues, and Akshat Verma. 2012. Shredder: GPU-accelerated Incremental Storage and Computation. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA, 14--14. http://dl.acm.org/citation.cfm?id=2208461.2208475
[4]
Nikolaj Bjrner, Andreas Blass, and Yuri Gurevich. 2010. Content-dependent Chunking for Differential Compression, the Local Maximum Approach. J. Comput. Syst. Sci. 76, 3-4 (May 2010), 154--203. https://doi.org/10.1016/j.jcss.2009.06.004
[5]
Apache Cassandra. 2014. Apache cassandra. Available online at http://planetcassandra.org/what-is-apache-cassandra (2014), 13.
[6]
Biplob K Debnath, Sudipta Sengupta, and Jin Li. 2010. ChunkStash: Speeding Up Inline Storage Deduplication Using Flash Memory. In USENIX annual technical conference. 1--16.
[7]
Dell EMC. [n.d.]. Data Domain - Data Backup Appliance, Data Protection. https://www.dellemc.com/en-us/data-protection/data-domain-backup-storage.htm.
[8]
Docker Inc. 2016. Official repositories on Docker Hub. https://hub.docker.com/.
[9]
Kave Eshghi, Mark Lillibridge, Lawrence Wilcock, Guillaume Belrose, and Rycharde Hawkes. 2007. Jumbo Store: Providing Efficient Incremental Upload and Versioning for a Utility Rendering Service. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 22--22. http://dl.acm.org/citation.cfm?id=1267903.1267925
[10]
Abdullah Gharaibeh, Samer Al-Kiswany, Sathish Gopalakrishnan, and Matei Ripeanu. 2010. A GPU Accelerated Storage System. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10). ACM, NewYork, NY, USA, 167--178. https://doi.org/10.1145/1851476.1851497
[11]
Docker Inc. 2018. debian: Docker Official Images. https://hub.docker.com/_/debian/.
[12]
Docker Inc. 2018. Node: Docker Official Images. https://hub.docker.com/_/node/.
[13]
Docker Inc. 2018. wordpress: Docker Official Images. https://hub.docker.com/_/wordpress/.
[14]
Himshai Kambo and Bharati Sinha. 2017. Secure data deduplication mechanism based on Rabin CDC and MD5 in cloud computing environment. In Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2017 2nd IEEE International Conference on. IEEE, 400--404.
[15]
Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble. 2009. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. In Proccedings of the 7th Conference on File and Storage Technologies (FAST '09). USENIX Association, Berkeley, CA, USA, 111--123. http://dl.acm.org/citation.cfm?id=1525908.1525917
[16]
YV Lokeshwari, B Prabavathy, and Chitra Babu. 2011. Optimized cloud storage with high throughput deduplication approach. In Proceedings of the International Conference on Emerging Technology Trends (ICETT). Citeseer.
[17]
Dirk Meister and André Brinkmann. 2009. Multi-level Comparison of Data Deduplication in a Backup Scenario. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference (SYSTOR '09). ACM, New York, NY, USA, Article 8, 12 pages. https://doi.org/10.1145/1534530.1534541
[18]
Athicha Muthitacharoen, Benjie Chen, and David Mazières. 2001. A Low-bandwidth Network File System. SIGOPS Oper. Syst. Rev. 35, 5 (Oct. 2001), 174--187. https://doi.org/10.1145/502059.502052
[19]
Neo Technology. 2018. Neo4j Graph Database Platform. https://neo4j.com/.
[20]
NetApp inc. 2018. ONTAP Data Management Software: ONTAP Data Management Software. https://www.netapp.com/us/products/data-management-software/ontap.aspx.
[21]
Alexander Neumann. 2018. Fast implementation of Content Defined Chunking (CDC) based on a rolling Rabin Checksum in C. https://github.com/fd0/rabin-cdc.
[22]
Fan Ni, Xing Lin, and Song Jiang. 2019. SS-CDC: A Two-stage Parallel Content-defined Chunking for Deduplicating Backup Storage. In Proceedings of the 12th ACM International Conference on Systems and Storage (SYSTOR '19). ACM, New York, NY, USA, 86--96. https://doi.org/10.1145/3319647.3325834
[23]
Michael O Rabin. 1981. Fingerprinting by random polynomials. Technical report (1981).
[24]
Salvatore Sanfilippo and Pieter Noordhuis. 2015. Redis. http://redis.io (2015).
[25]
Philip Shilane, Grant Wallace, Mark Huang, and Windsor Hsu. 2012. Delta Compressed and Deduplicated Storage Using Stream-Informed Locality. In HotStorage.
[26]
Kiran Srinivasan, Tim Bisson, Garth Goodson, and Kaladhar Voruganti. 2012. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA, 24--24. http://dl.acm.org/citation.cfm?id=2208461.2208485
[27]
Niraj Tolia, Michael Kozuch, Mahadev Satyanarayanan, Brad Karp, Thomas C Bressoud, and Adrian Perrig. 2003. Opportunistic Use of Content Addressable Storage for Distributed File Systems. In USENIX Annual Technical Conference, General Track, Vol. 3. 127--140.
[28]
Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale, Stephen Rago, Grzegorz Calkowski, Cezary Dubnicki, and Aniruddha Bohra. 2010. HydraFS: A High-throughput File System for the HYDRAstor Content-addressable Storage System. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST'10). USENIX Association, Berkeley, CA, USA, 17--17. http://dl.acm.org/citation.cfm?id=1855511.1855528
[29]
Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu. 2012. Characteristics of Backup Workloads in Production Systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA, 4--4. http://dl.acm.org/citation.cfm?id=2208461.2208465
[30]
Jiansheng Wei, Hong Jiang, Ke Zhou, and Dan Feng. 2010. MAD2: A Scalable High-throughput Exact Deduplication Approach for Network Backup Services. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (MSST '10). IEEE Computer Society, Washington, DC, USA, 1--14. https://doi.org/10.1109/MSST.2010.5496987
[31]
Youjip Won, Kyeongyeol Lim, and Jaehong Min. 2015. MUCH: Multithreaded Content-Based File Chunking. IEEE Trans. Comput. 64, 5 (2015), 1375--1388.
[32]
Wen Xia, Hong Jiang, Dan Feng, and Yu Hua. 2011. SiLo: A Similarity-locality Based Near-exact Deduplication Scheme with Low RAM Overhead and High Throughput. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (USENIXATC'11). USENIX Association, Berkeley, CA, USA, 26--28. http://dl.acm.org/citation.cfm?id=2002181.2002207
[33]
Wen Xia, Hong Jiang, Dan Feng, and Lei Tian. 2012. Accelerating data deduplication by exploiting pipelining and parallelism with multicore or manycore processors. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FASTâĂŹ12 Poster). 1--2.
[34]
Wen Xia, Hong Jiang, Dan Feng, Lei Tian, Min Fu, and Yukun Zhou. 2014. Ddelta: A deduplication-inspired fast delta compression approach. Performance Evaluation 79 (2014), 258--272.
[35]
Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, and Qing Liu. 2016. FastCDC: A Fast and Efficient Content-defined Chunking Approach for Data Deduplication. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '16). USENIX Association, Berkeley, CA, USA, 101--114. http://dl.acm.org/citation.cfm?id=3026959.3026969
[36]
Chuanshuai Yu, Chengwei Zhang, Yiping Mao, and Fulu Li. 2015. Leap-based content defined chunkingâĂŤtheory and implementation. In Mass Storage Systems and Technologies (MSST), 2015 31st Symposium on. IEEE, 1--12.
[37]
Yucheng Zhang, Dan Feng, Hong Jiang, Wen Xia, Min Fu, Fangting Huang, and Yukun Zhou. 2017. A Fast Asymmetric Extremum Content Defined Chunking Algorithm for Data Deduplication in Backup Storage Systems. IEEE Trans. Comput. 66, 2 (Feb. 2017), 199--211. https://doi.org/10.1109/TC.2016.2595565
[38]
Benjamin Zhu, Kai Li, and Hugo Patterson. 2008. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08).

Cited By

View all
  • (2024)H2C-Dedup: Reducing I/O and GC Amplification for QLC SSDs from the Deduplication Metadata PerspectiveProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698507(704-719)Online publication date: 20-Nov-2024
  • (2024)Design and Implementation of Deduplication on F2FSACM Transactions on Storage10.1145/366273520:4(1-50)Online publication date: 29-Apr-2024
  • (2024)Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmissionIET Communications10.1049/cmu2.12719Online publication date: 4-Jan-2024
  • Show More Cited By

Index Terms

  1. RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
    November 2019
    503 pages
    ISBN:9781450369732
    DOI:10.1145/3357223
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 November 2019

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. CDC
    2. content-defined chunking
    3. deduplication
    4. locality
    5. storage systems

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    Conference

    SoCC '19
    Sponsor:
    SoCC '19: ACM Symposium on Cloud Computing
    November 20 - 23, 2019
    CA, Santa Cruz, USA

    Acceptance Rates

    SoCC '19 Paper Acceptance Rate 39 of 157 submissions, 25%;
    Overall Acceptance Rate 169 of 722 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)247
    • Downloads (Last 6 weeks)44
    Reflects downloads up to 19 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)H2C-Dedup: Reducing I/O and GC Amplification for QLC SSDs from the Deduplication Metadata PerspectiveProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698507(704-719)Online publication date: 20-Nov-2024
    • (2024)Design and Implementation of Deduplication on F2FSACM Transactions on Storage10.1145/366273520:4(1-50)Online publication date: 29-Apr-2024
    • (2024)Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmissionIET Communications10.1049/cmu2.12719Online publication date: 4-Jan-2024
    • (2024)Emerging Research Trends in Data Deduplication: A Bibliometric Analysis from 2010 to 2023Archives of Computational Methods in Engineering10.1007/s11831-024-10074-x31:6(3313-3330)Online publication date: 26-Feb-2024
    • (2023)The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta CompressionACM Transactions on Storage10.1145/358466319:3(1-30)Online publication date: 16-Feb-2023
    • (2023)TurboHash: A Hash Table for Key-value Store on Persistent MemoryProceedings of the 16th ACM International Conference on Systems and Storage10.1145/3579370.3594766(35-48)Online publication date: 5-Jun-2023
    • (2023)Accelerating Content-Defined Chunking for Data Deduplication Based on Speculative JumpIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.329077034:9(2568-2579)Online publication date: 1-Sep-2023
    • (2023)Enabling Balanced Data Deduplication in Mobile Edge ComputingIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.324706134:5(1420-1431)Online publication date: May-2023
    • (2023)EaD: ECC-Assisted Deduplication With High Performance and Low Memory Overhead for Ultra-Low Latency Flash StorageIEEE Transactions on Computers10.1109/TC.2022.315266572:1(208-221)Online publication date: 1-Jan-2023
    • (2023)DedupBench: A Benchmarking Tool for Data Chunking Techniques2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)10.1109/CCECE58730.2023.10288834(469-474)Online publication date: 24-Sep-2023
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media