Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression

Published: 19 June 2023 Publication History

Abstract

Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values.
  In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead).
Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4× and ∼7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20× and 1.41× higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22× higher compression ratio over Finesse.

References

[1]
Hossein Amiri and Asadollah Shahbahrami. 2020. SIMD programming using Intel vector extensions. Journal of Parallel and Distributed Computing 135 (2020), 83–100.
[2]
Lior Aronovich, Ron Asher, Eitan Bachmat, Haim Bitner, Michael Hirsch, and Shmuel T. Klein. 2009. The design of a similarity based deduplication system. In Proceedings of the Israeli Experimental Systems Conference (SYSTOR’09). ACM, Haifa, 1–14.
[3]
Tony Asaro and Heidi Biggar. 2007. Data De-Duplication and Disk-to-Disk Backup Systems: Technical and Business Considerations. Technical Report. The Enterprise Strategy Group. 2–15.
[4]
Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences. IEEE, Salerno, Italy, 21–29.
[5]
Andrei Z. Broder. 1998. Identifying and filtering near-duplicate documents. In Lecture Notes in Computer Science, Vol. 1848, Springer, 1–10.
[6]
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 2000. Min-wise independent permutations. Journal of Computer and System Sciences 60, 3 (2000), 630–659.
[7]
Fred Douglis and Arun Iyengar. 2003. Application-specific delta-encoding via resemblance detection. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’03).USENIX Association, San Antonio, TX, 113–126.
[8]
George Forman, Kave Eshghi, and Stephane Chiocchetti. 2005. Finding similar files in large document repositories. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining.ACM, New York, NY, 394–400.
[9]
Min Fu, Dan Feng, Yu Hua, Xubin He, Zuoning Chen, Wen Xia, Yucheng Zhang, and Yujuan Tan. 2015. Design tradeoffs for data deduplication performance in backup workloads. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (USENIX FAST’15).USENIX Association, Santa Clara, CA, 331–344.
[10]
Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, and Amin Vahdat. 2008. Difference engine: Harnessing memory redundancy in virtual machines. In Proceedings of 2008 USENIX Symposium on Operating Systems Design and Implementations (OSDI’08). ACM, New York, NY, 85–93.
[11]
Arne Holst. 2021. Volume of Data/Information Created, Captured, Copied, and Consumed Worldwide from 2010 to 2025. Statista 2 (2021).
[12]
David A. Huffman. 1952. A method for the construction of minimum redundancy codes. Proc. of the IRE 40, 9 (1952), 1098–1101.
[13]
Paul Jaccard. 1901. Etude de la distribution florale dans une portion des alpes et du jura. Bulletin de la Societe Vaudoise des Sciences Naturelles 37 (1901), 547–579.
[14]
Navendu Jain, Michael Dahlin, and Renu Tewari. 2005. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In Proceedings of 4th USENIX Conference on File and Storage Technologies (FAST’05). USENIX Association, San Francisco, CA, 21–35.
[15]
Purushottam Kulkarni, Fred Douglis, Jason D. LaVoie, and John M. Tracey. 2004. Redundancy elimination within large collections of files. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’04). USENIX Association, Boston, MA, 59–72.
[16]
Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezis, and Peter Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (USENIX FAST’09). USENIX Association, San Francisco, CA, 111–123.
[17]
Xing Lin, Guanlin Lu, Fred Douglis, Philip Shilane, and Grant Wallace. 2014. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proceedings of 12th USENIX Conference on File and Storage Technologies (USENIX FAST’14). USENIX Association, Santa Clara, CA, 256–273.
[18]
Josh MacDonald. 2000. File System Support for Delta Compression. Master’s thesis. Department of Electrical Engineering and Computer Science, University of California at Berkeley.
[19]
Dirk Meister, Jürgen Kaiser, and André Brinkmann. 2013. Block locality caching for data deduplication. In Proceedings of the 6th International Systems and Storage Conference (SYSTOR’13). ACM, New York, NY, 15:1–15:12.
[20]
Athicha Muthitacharoen, Benjie Chen, and David Mazières. 2001. A low-bandwidth network file system. ACM SIGOPS Operating Systems Review 35 (2001), 174–187.
[21]
Fan Ni and Song Jiang. 2019. RapidCDC: Leveraging duplicate locality to accelerate chunking in CDC-based deduplication systems. In Proceedings of the ACM Symposium on Cloud Computing (SOCC’19). ACM, New York, NY, 220–232.
[22]
Fan Ni, Xing Lin, and Song Jiang. 2019. SS-CDC: A two-stage parallel content-defined chunking for deduplicating backup storage. In Proceedings of the 12th ACM International Conference on Systems and Storage. ACM, New York, NY, 86–96.
[23]
Himabindu Pucha, David G. Andersen, and Michael Kaminsky. 2007. Exploiting similarity for multi-source downloads using file handprints. In Proceedings of 4th USENIX Symposium on Network System Design and Implementation (NSDI’07). USENIX Association, Cambridge, MA, 15–28.
[24]
Sean Quinlan and Sean Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (USENIX FAST’02). USENIX Association, Monterey, CA, 89–101.
[25]
Michael O. Rabin. 1981. Fingerprinting by Random Polynomials. Technical Report. Center for Research in Computing Technology, Harvard University.
[26]
David Reinsel, John Gantz, and John Rydning. 2018. The digitization of the world from edge to core. IDC White Paper 16 (2018), 28 pages.
[27]
Phlip Shilane, Mark Huang, Grant Wallace, and Windsor Hsu. 2012. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Transactions on Storage 8, 4 (2012), 1–26.
[28]
Philip Shilane, Grant Wallace, Mark Huang, and Windsor Hsu. 2012. Delta compressed and deduplicated storage using stream-informed locality. In Proceedings of 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’12). USENIX Association, Boston, MA, 16 pages.
[29]
Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu. 2012. Characteristics of backup workloads in production systems. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (USENIX FAST’12). USENIX Association, San Jose, CA, 4.
[30]
Avani Wildani, Ethan L. Miller, and Ohad Rodeh. 2013. HANDS: A heuristically arranged non-backup in-line deduplication system. In Proceedings of 39th IEEE International Conference on Data Engineering (IEEE ICDE’13). IEEE, Brisbane, QLD, Australia, 446–457.
[31]
Wen Xia, Hong Jiang, Dan Feng, Fred Douglis, Philip Shilane, Yu Hua, Min Fu, Yucheng Zhang, and Yukun Zhou. 2016. A comprehensive study of the past, present, and future of data deduplication. Proceedings of the IEEE 104, 9 (2016), 1681–1710.
[32]
Wen Xia, Hong Jiang, Dan Feng, and Yu Hua. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’11). USENIX Association, Portland, OR, 26–30.
[33]
Wen Xia, Hong Jiang, Dan Feng, and Lei Tian. 2016. DARE: A deduplication-aware resemblance detection and elimination scheme for data reduction with low overheads. IEEE Transactions on Computers 65, 6 (2016), 1692–1705.
[34]
Wen Xia, Hong Jiang, Dan Feng, Lei Tian, Min Fu, and Yukun Zhou. 2014. Ddelta: A deduplication-inspired fast delta compression approach. Performance Evaluation 79 (2014), 258–272.
[35]
Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Qing Liu, and Yucheng Zhang. 2016. FastCDC: A fast and efficient content-defined chunking approach for data deduplication. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’16). USENIX Association, Denver, CO, 101–114.
[36]
Lianghong Xu, Andrew Pavlo, Sudipta Sengupta, and Gregory R. Ganger. 2017. Online deduplication for databases. In Proceedings of 2017 ACM Conference on Management of Data (ACM SIGMOD’17). ACM, New York, NY, 1355–1368.
[37]
Lianghong Xu, Andrew Pavlo, Sudipta Sengupta, Jin Li, and Gregory R. Ganger. 2015. Reducing replication bandwidth for distributed document databases. In Proceedings of 2015 ACM Symposium on Cloud Computing (SoCC’15). ACM, New York, NY, 222–235.
[38]
Lawrence L. You, Kristal T. Pollack, and Darrell D. E. Long. 2005. Deep store: An archival storage system architecture. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05). IEEE, Tokyo, Japan, 804–815.
[39]
Yucheng Zhang, Wen Xia, Dan Feng, Hong Jiang, Yu Hua, and Qiang Wang. 2019. Finesse: Fine-grained feature locality based fast resemblance detection for post-deduplication delta compression. In Proceedings of 17th USENIX Conference on File and Storage Technologies (USENIX FAST’19). USENIX Association, Boston, MA, 121–128.
[40]
Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph, and John Kubiatowicz. 2003. Approximate object location and spam filtering on peer-to-peer systems. In Proceedings of 2003 International Middleware Conference (Middleware’03). Springer, Berlin, 1–20.
[41]
Benjamin Zhu, Kai Li, and Hugo Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (USENIX FAST’08). USENIX Association, San Jose, California, 269–282.
[42]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3 (1977), 337–343.
[43]
Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5 (1978), 530–536.
[44]
Daniel Zwillinger and Stephen Kokoska. 1999. CRC Standard Probability and Statistics Tables and Formulae. Crc Press, New York, NY, USA. 31–32.

Cited By

View all
  • (2025)Multiscroll hopfield neural network with extreme multistability and its application in video encryption for IIoTNeural Networks10.1016/j.neunet.2024.106904182(106904)Online publication date: Feb-2025
  • (2025)DaE2: Unmasking malicious URLs by leveraging diverse and efficient ensemble machine learning for online securityComputers & Security10.1016/j.cose.2024.104170148(104170)Online publication date: Jan-2025
  • (2025)An efficient class-dependent learning label approach using feature selection to improve multi-label classification algorithmsCluster Computing10.1007/s10586-024-04756-128:1Online publication date: 1-Feb-2025
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Storage
ACM Transactions on Storage  Volume 19, Issue 3
August 2023
233 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3604654
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2023
Online AM: 16 February 2023
Accepted: 21 January 2023
Revised: 22 November 2022
Received: 27 April 2022
Published in TOS Volume 19, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Post-deduplication delta compression
  2. resemblance detection
  3. SIMD
  4. parallel rolling hash
  5. content-defined sampling

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Guangdong Basic and Applied Basic Research Foundation
  • Shenzhen Science and Technology Innovation Program
  • Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
  • HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)359
  • Downloads (Last 6 weeks)62
Reflects downloads up to 23 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2025)Multiscroll hopfield neural network with extreme multistability and its application in video encryption for IIoTNeural Networks10.1016/j.neunet.2024.106904182(106904)Online publication date: Feb-2025
  • (2025)DaE2: Unmasking malicious URLs by leveraging diverse and efficient ensemble machine learning for online securityComputers & Security10.1016/j.cose.2024.104170148(104170)Online publication date: Jan-2025
  • (2025)An efficient class-dependent learning label approach using feature selection to improve multi-label classification algorithmsCluster Computing10.1007/s10586-024-04756-128:1Online publication date: 1-Feb-2025
  • (2024)Enhanced Data Mining and Visualization of Sensory-Graph-Modeled Datasets through SummarizationSensors10.3390/s2414455424:14(4554)Online publication date: 14-Jul-2024
  • (2024)Quantum Machine Learning: Exploring the Role of Data Encoding Techniques, Challenges, and Future DirectionsMathematics10.3390/math1221331812:21(3318)Online publication date: 23-Oct-2024
  • (2024)Selecting the foremost big data tool to optimize YouTube data in dynamic Fermatean fuzzy knowledgePLOS ONE10.1371/journal.pone.030738119:8(e0307381)Online publication date: 23-Aug-2024
  • (2024)Energy‐Efficient Routing Algorithm for Optimizing Network Performance in Underwater Data Transmission Using Gray Wolf Optimization AlgorithmJournal of Sensors10.1155/2024/22885272024:1Online publication date: 31-Jul-2024
  • (2024)The Design of Fast Delta Encoding for Delta Compression Based Storage SystemsACM Transactions on Storage10.1145/366481720:4(1-30)Online publication date: 14-May-2024
  • (2024)Detection of top-r spreader influential nodes on the Social Internet of Things networks to maximize spreading influenceThe European Physical Journal Plus10.1140/epjp/s13360-024-05538-9139:9Online publication date: 6-Sep-2024
  • (2024)Robust deadline-aware network function parallelization framework under demand uncertaintyKnowledge-Based Systems10.1016/j.knosys.2024.112696305(112696)Online publication date: Dec-2024
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media