DOI: 10.1145/1900008.1900101

A running time improvement for the two thresholds two divisors algorithm

Published: 15 April 2010

Abstract

Chunking algorithms play an important role in hash-based data de-duplication systems. The Basic Sliding Window (BSW) algorithm is the first prototype of a content-based chunking algorithm that can handle most types of data. The Two Thresholds Two Divisors (TTTD) algorithm was proposed to improve on the BSW algorithm by controlling chunk-size variation. We conducted a series of systematic experiments to evaluate the performance of these two algorithms. We also propose a new improvement to the TTTD algorithm. Our new approach reduces the running time by about 6% and the number of large-sized chunks by 50%, and also brings other significant benefits.
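
The abstract does not spell out the TTTD procedure. As a rough illustration, assuming the commonly described formulation (a minimum and a maximum chunk-size threshold, plus a main divisor and a smaller backup divisor applied to a sliding-window hash), it might be sketched as below. The toy polynomial rolling hash, the window width, the default parameter values, and the function name `tttd_chunks` are all assumptions made for this sketch, and the running-time improvement proposed in the paper is not shown.

```python
from typing import List

WINDOW = 48          # sliding-window width in bytes (assumed for illustration)
BASE = 257           # base of the toy polynomial rolling hash (assumed)
MOD = (1 << 61) - 1  # large prime modulus for the hash (assumed)


def tttd_chunks(data: bytes, t_min: int = 460, t_max: int = 2800,
                d_main: int = 540, d_backup: int = 270) -> List[bytes]:
    """Split data into chunks with a TTTD-style rule (illustrative sketch)."""
    chunks: List[bytes] = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    n = len(data)
    start = 0     # index where the current chunk begins
    backup = -1   # end index of the last backup breakpoint, -1 if none
    h = 0         # rolling hash of the last WINDOW bytes of the current chunk
    i = 0
    while i < n:
        # Slide the window: drop the oldest byte, then append data[i].
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + data[i]) % MOD

        length = i - start + 1
        cut = -1
        if length >= t_min:                    # first threshold: no tiny chunks
            if h % d_backup == d_backup - 1:   # backup divisor matches
                backup = i + 1
            if h % d_main == d_main - 1:       # main divisor matches: cut here
                cut = i + 1
            elif length >= t_max:              # second threshold reached
                cut = backup if backup != -1 else i + 1

        if cut != -1:
            chunks.append(data[start:cut])
            start = cut
            backup = -1
            h = 0
            i = start   # may rescan bytes after a backup breakpoint
        else:
            i += 1

    if start < n:       # trailing bytes form the final (possibly short) chunk
        chunks.append(data[start:])
    return chunks
```

In this sketch every chunk except possibly the last has a length between `t_min` and `t_max`, which is the chunk-size control the abstract attributes to TTTD. Removing the two thresholds and the backup divisor leaves a plain sliding-window scheme in the spirit of BSW.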

Published In

ACMSE '10: Proceedings of the 48th annual ACM Southeast Conference
April 2010
488 pages
ISBN:9781450300643
DOI:10.1145/1900008
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chunk
  2. content-based chunking
  3. de-duplication
  4. duplicate
  5. performance
  6. redundancy
  7. signature
  8. threshold

Qualifiers

  • Research-article

Conference

ACM SE '10: ACM Southeast Regional Conference
April 15 - 17, 2010
Oxford, Mississippi

Acceptance Rates

ACMSE '10 Paper Acceptance Rate 48 of 94 submissions, 51%;
Overall Acceptance Rate 502 of 1,023 submissions, 49%
