Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1534530.1534539acmotherconferencesArticle/Chapter ViewAbstractPublication PagessystorConference Proceedingsconference-collections
research-article

The design of a similarity based deduplication system

Published: 04 May 2009 Publication History

Abstract

We describe some of the design choices that were made during the development of a fast, scalable, inline, deduplication device. The system's design goals and how they were achieved are presented. This is the firs deduplication device that uses similarity matching. The paper provides the following original research contributions: we show how similarity signatures can serve in a deduplication scheme; a novel type of similarity signatures is presented and its advantages in the context of deduplication requirements are explained. It is also shown how to combine similarity matching schemes with byte by byte comparison or hash based identity schemes.

References

[1]
Amir A., Lewenstein M., Porat E., Faster algorithms for string matching with k mismatches, Journal of Algorithms 50(2) (2004) 257--275.
[2]
Burton H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13 (7). 422--426.
[3]
D. R. Bobbarjung, D, Jagannathan and C. Dubnicki, Improving duplicate elimination in storage systems, ACM Trans. on Storage, Vol. 2(4), 424--448, 2006.
[4]
Boyer R. S., Moore J. S., A fast string searching algorithm, Communications of the ACM 20 (1977) 762--772.
[5]
S. Brin, J. Davis, H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents (weblink). 1994, also in Proceedings of ACM SIGMOD, 1995.
[6]
A. Z. Broder. Identifying and Filtering Near-Duplicate Documents. CPM 2000: 1--10
[7]
A. Z. Broder. Some applications of Rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, 143--152. Springer-Verlag, 1993.
[8]
A. Z. Broder. On the resemblance and containment of documents. Proceedings of Compression and Complexity of Sequences 1997, 21--29. IEEE Computer Society, 1997.
[9]
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. Proceedings of the 30th Annual ACM Symposium on Theory of Computing, 327--336, 1998.
[10]
A. Z. Broder and U. Feige. Min-Wise versus Linear Independence. Proc. of the Eleventh Annual ACM-SIAM Symp. on Discrete Algorithms, 147-- 154, 2000.
[11]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Proceedings of the Sixth International World Wide Web Conference, pages 391--404, 1997.
[12]
Fischer M. J., Paterson M. S., String matching and other products, in Complexity of Computation, R. M. Karp (editor), SIAM-AMS Proc. 7 (1974) 113--125.
[13]
B. Garret and C. Bouffard, The Enterprise Strategy Group (ESG) lab validation report for IBM TS7650G ProtecTier, 2008. Available at www.diligent.com.
[14]
N. Heintze. Scalable Document Fingerprinting. Proceedings of the Second USENIX Workshop on Electronic Commerce, pages 191--200, 1996.
[15]
N. Jain, M. Dahlin, and R. Tewari. TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization. Proceedings of USENIX File And Storage Systems conference (FAST), 2005.
[16]
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality, Proceedings of USENIX File And Storage Systems conference (FAST), 2009.
[17]
Karp R., Rabin M., Efficien randomized pattern matching algorithms, IBM Journal of Research and Development 31 (1987) 249--260.
[18]
Knuth D. E., Morris J. H., Pratt V. R., Fast pattern matching in strings, SIAM Journal on Computing 6 (1977) 323--350.
[19]
P. Kulkarni, F. Douglis, J. D. LaVoie, J. M. Tracey: Redundancy Elimination Within Large Collections of Files. Proceedings of USENIX Annual Technical Conference, pages 59--72, 2004.
[20]
Udi Manber. Finding Similar Files in A Large File System. Technical Report TR 93--33, Department of Computer Science, University of Arizona, October 1993, also in Proceedings of the USENIX Winter 1994 Technical Conference, pages 17--21. 1994.
[21]
Landau G. M., Vishkin U., Fast parallel and serial approximate string matching, Journal of Algorithms 10(2) (1989) 157--169.
[22]
Moulton G. H., Whitehill S. B., Hash fil system and method for use in a commonality factoring system, U.S. Pat. No. 6,704,730.
[23]
Navarro G., A Guided Tour to Approximate String Matching, ACM Computing Surveys, 33(1) (2001) 31--88.
[24]
S. Quinlan and S. Dorward, Venti: A New Approach to Archival Storage. In Proc. of the USENIX Conf. on File And Storage Technologies (FAST), 2002.
[25]
N. Shivakumar and H. Garcia-Molina. Building a Scalable and Accurate Copy Detection Mechanism. Proceedings of the 3nd International Conference on Theory and Practice of Digital Libraries, 1996.
[26]
N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In Proc. of Workshop on Web Databases (WebDB'98), March 1998.
[27]
Ukkonen E., On-line construction of suffi trees, Algorithmica 14(3) (1995) 249--260.
[28]
Weiner P., Linear pattern matching algorithm, Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, (1973) 1--11.
[29]
B. Zhu, K. Li and H. Patterson, Avoiding the Disk Bottleneck in the Data Domain Deduplication File System, Proc. of FAST 08, the 6th USENIX Conf. on File and Storage Technologies, 279--292.
[30]
J. Ziv and A. Lempel. A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, vol. IT-23, pp. 337--343, May 1977. 282

Cited By

View all
  • (2024)Palantir: Hierarchical Similarity Detection for Post-Deduplication Delta CompressionProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640353(830-845)Online publication date: 27-Apr-2024
  • (2024)I/O Causality Based In-Line Data Deduplication for Non-Volatile Memory Enabled Storage SystemsIEEE Transactions on Computers10.1109/TC.2024.336596173:5(1327-1340)Online publication date: May-2024
  • (2024)The Design of a Lossless Deduplication Scheme to Eliminate Fine-Grained Redundancy for JPEG Image Storage SystemsIEEE Transactions on Computers10.1109/TC.2024.336345673:5(1385-1399)Online publication date: May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
SYSTOR '09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
May 2009
191 pages
ISBN:9781605586236
DOI:10.1145/1534530
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • Hebrew University of Jerusalem
  • Melanox Technologies
  • IBM: IBM

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 May 2009

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

SYSTOR '09
Sponsor:
  • IBM

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)20
  • Downloads (Last 6 weeks)3
Reflects downloads up to 22 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Palantir: Hierarchical Similarity Detection for Post-Deduplication Delta CompressionProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640353(830-845)Online publication date: 27-Apr-2024
  • (2024)I/O Causality Based In-Line Data Deduplication for Non-Volatile Memory Enabled Storage SystemsIEEE Transactions on Computers10.1109/TC.2024.336596173:5(1327-1340)Online publication date: May-2024
  • (2024)The Design of a Lossless Deduplication Scheme to Eliminate Fine-Grained Redundancy for JPEG Image Storage SystemsIEEE Transactions on Computers10.1109/TC.2024.336345673:5(1385-1399)Online publication date: May-2024
  • (2024)Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmissionIET Communications10.1049/cmu2.12719Online publication date: 4-Jan-2024
  • (2024)A blockchain-based auditable deduplication scheme for multi-cloud storagePeer-to-Peer Networking and Applications10.1007/s12083-024-01734-717:5(2870-2883)Online publication date: 4-Jun-2024
  • (2023)The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta CompressionACM Transactions on Storage10.1145/358466319:3(1-30)Online publication date: 19-Jun-2023
  • (2023)Dataset Similarity Detection for Global Deduplication in the DD File System2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00255(3322-3335)Online publication date: Apr-2023
  • (2023)BDACD: Blockchain-based decentralized auditing supporting ciphertext deduplicationJournal of Systems Architecture10.1016/j.sysarc.2023.103053(103053)Online publication date: Dec-2023
  • (2022)From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s LocalityACM Transactions on Storage10.1145/350792118:3(1-28)Online publication date: 24-Aug-2022
  • (2022)Pushing Collaborative Data Deduplication to the Network Edge: An Optimization Framework and System DesignIEEE Transactions on Network Science and Engineering10.1109/TNSE.2022.31553579:4(2110-2122)Online publication date: 1-Jul-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media