A running time improvement for the two thresholds two divisors algorithm

Authors: Teng-Sheng Moh, BingChun Chang

ACMSE '10: Proceedings of the 48th annual ACM Southeast Conference

Article No.: 69, Pages 1 - 6

https://doi.org/10.1145/1900008.1900101

Published: 15 April 2010

Abstract

Chunking algorithms play an important role in hash-based data de-duplication systems. The Basic Sliding Window (BSW) algorithm is the first prototype of a content-based chunking algorithm that can handle most types of data. The Two Thresholds Two Divisors (TTTD) algorithm was proposed to improve on BSW by controlling the variation in chunk size. We conducted a series of systematic experiments to evaluate the performance of these two algorithms, and we propose a new improvement to the TTTD algorithm. Our approach reduced the running time by about 6% and the number of large-sized chunks by 50%, and also brought other significant benefits.
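
For readers unfamiliar with the baseline, the following is a minimal Python sketch of TTTD-style chunking as it is commonly described: a minimum threshold, a maximum threshold, a main divisor, and a backup divisor. It is not the authors' implementation and does not include the improvement proposed in the paper. The function name `tttd_chunks`, the polynomial rolling hash, the window size, and all default parameter values are assumptions chosen for illustration only.

```python
from typing import List

BASE = 257                # assumed base for the illustrative rolling hash
MOD = (1 << 61) - 1       # assumed modulus for the illustrative rolling hash


def tttd_chunks(data: bytes, t_min: int = 64, t_max: int = 512,
                d_main: int = 128, d_backup: int = 64,
                window: int = 16) -> List[bytes]:
    """Split data into chunks using a TTTD-style rule (illustrative sketch)."""
    if not (0 < t_min <= t_max) or d_main <= 0 or d_backup <= 0 or window <= 0:
        raise ValueError("invalid chunking parameters")

    top = pow(BASE, window, MOD)  # weight of the byte that leaves the window
    chunks: List[bytes] = []
    start = 0      # index where the current chunk begins
    backup = 0     # end of the last backup breakpoint, 0 means none
    h = 0          # hash of the last `window` bytes

    for i, byte in enumerate(data):
        # Slide the window: add the new byte, drop the oldest one.
        h = (h * BASE + byte) % MOD
        if i >= window:
            h = (h - data[i - window] * top) % MOD

        size = i - start + 1
        if size < t_min:
            continue                      # below the minimum threshold

        if h % d_backup == d_backup - 1:
            backup = i + 1                # remember a fallback breakpoint

        if h % d_main == d_main - 1:
            cut = i + 1                   # main divisor matched
        elif size >= t_max:
            cut = backup if backup else i + 1   # forced cut at the maximum
        else:
            continue

        chunks.append(data[start:cut])
        start = cut
        backup = 0

    if start < len(data):
        chunks.append(data[start:])       # trailing bytes form the last chunk
    return chunks
```

In this sketch, dropping both thresholds and the backup divisor leaves the single-divisor rule usually associated with BSW. With the thresholds in place, every chunk except possibly the last has a length between `t_min` and `t_max`, which is the chunk-size control the abstract refers to.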

Index Terms

A running time improvement for the two thresholds two divisors algorithm
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Evaluation of retrieval results
  2. Information storage systems
    1. Record storage systems

Published In

ACMSE '10: Proceedings of the 48th annual ACM Southeast Conference

April 2010

488 pages

ISBN: 9781450300643

DOI: 10.1145/1900008

Conference Chair: H. Conrad Cunningham (University of Mississippi)

Program Chairs: Paul Ruth, Nicholas A. Kraft

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ACM SE '10: ACM Southeast Regional Conference

April 15 - 17, 2010

Oxford, Mississippi

Acceptance Rates

ACMSE '10 Paper Acceptance Rate 48 of 94 submissions, 51%;

Overall Acceptance Rate 502 of 1,023 submissions, 49%

Bibliometrics

Total Citations: 8
Total Downloads: 170
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 25 Nov 2024

Cited By

Udayashankar S, Baba A, Al-Kiswany S, Schiavoni V, Edinger J, Cao J, Jin Z (2024). SeqCDC: Hashless Content-Defined Chunking for Data Deduplication. Proceedings of the 25th International Middleware Conference, 292-298. Online publication date: 2-Dec-2024.
https://dl.acm.org/doi/10.1145/3652892.3700766

Hashem Bedr Jehlol, Loay E. George (2022). Big Data Backup Deduplication: A Survey. International Journal of Scientific Research in Science, Engineering and Technology, 174-191. Online publication date: 5-Jul-2022.
https://doi.org/10.32628/IJSRSET229425

Tian W, Li R, Xu Z (2021). TSS: A two‐party secure server‐aid chunking algorithm. Concurrency and Computation: Practice and Experience, 34:12. Online publication date: 21-Sep-2021.
https://doi.org/10.1002/cpe.6577

Ullah A, Hamza K, Azeem M, Farha F (2019). Secure Healthcare Data Aggregation and Deduplication Scheme for FoG-Orineted IoT. 2019 IEEE International Conference on Smart Internet of Things (SmartIoT), 314-319. Online publication date: Aug-2019.
https://doi.org/10.1109/SmartIoT.2019.00054

Kumar N, Antwal S, Jain S (2018). Differential Evolution based bucket indexed data deduplication for big data storage. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 34:1, 491-505. Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.3233/JIFS-17593

Kumar N, Shobha, Jain S (2018). Efficient Data Deduplication for Big Data Storage Systems. Progress in Advanced Computing and Intelligent Engineering, 351-371. Online publication date: 10-Jul-2018.
https://doi.org/10.1007/978-981-13-0224-4_32

Jin X, Agun D, Yang T, Wu Q, Shen Y, Zhao S, Mukhopadhyay S, Zhai C, Bertino E, Crestani F, Mostafa J, Tang J, Si L, Zhou X, Chang Y, Li Y, Sondhi P (2016). Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 377-386. Online publication date: 24-Oct-2016.
https://dl.acm.org/doi/10.1145/2983323.2983733

Ko Y, Kim S, Kim J, Kim E, So J (2015). Energy efficient metadata management for cloud storage system. International Journal of Distributed Sensor Networks, 2015, 21-21. Online publication date: 1-Jan-2015.
https://dl.acm.org/doi/10.1155/2015/626575
