Efficient Data Deduplication for Big Data Storage Systems

Naresh Kumar¹⁹,
Shobha¹⁹ &
S. C. Jain²⁰

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 714))

1080 Accesses
8 Citations

Abstract

For efficient chunking, we propose Differential Evolution (DE) based approach which is optimized Two Thresholds Two Divisors (TTTD-P) Content Defined Chunking (CDC) to reduce the number of computing operations using single dynamic optimal parameter divisor D with optimal threshold value exploiting multi-operations nature of TTTD. To reduce chunk size variance, TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points, however, adding an additional divisor decreases chunking throughput. To this end, Asymmetric Extremum (AE) significantly improves chunking throughput by using local extreme value in a variable-sized asymmetric window to overcome Rabin and TTTD boundaries shift problem, while achieving nearby same deduplication ratio (DR). Therefore, we propose DE-based TTTD-P optimized chunking to maximize chunking throughput with increased DR; and scalable bucket indexing approach reduces hash values judgment time to identify and declare redundant chunks about 16 times than Rabin CDC, 5 times than AE CDC, 1.6 times than FAST CDC on Hadoop Distributed File System (HDFS).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage

Article 14 October 2023

Study of Chunking Algorithm in Data Deduplication

References

Min Chen, Shiwen Mao, Yunhao Liu: Big Data: A Survey Mobile Networks and Applications Journal, Springer, Vol. 19, Issue 2, (2014) 171–209.
Google Scholar
James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers: Big data: The next frontier for innovation, competition, and productivity. McKinsey Company, (2011) 1–156.
Google Scholar
R. Storn: Differential Evolution: A simple and efficient heuristic strategy for global optimization over continuous spaces. Journal of Global Optimization, Vol. 11, (1997) 341–359.
Google Scholar
Jaehong Min, Daeyoung Yoon, and Youjip Won: Efficient Deduplication Techniques for Modern Backup Operation. IEEE Transactions on Computers, Vol. 60, No. 6, (2011) 824–840.
Google Scholar
J. Gantz, and D. Reinsel: The digital universe decade are you ready? IDC White Paper, May (2011).
Google Scholar
H. Biggar: Experiencing data deduplication: Improving efficiency and reducing capacity requirements. White Paper, the Enterprise Strategy Group, Feb (2012).
Google Scholar
Tin-Yu Wu, Jeng-Shyang Pan, and Chia-Fan Lin: Improving Accessing Efficiency of Cloud Storage Using Deduplication and Feedback Schemes. IEEE Systems Journal, Volume 8, No.1, (2014) 208–218.
Google Scholar
Yinjin Fu, Lei Tian, Fang Liu, Hong Jiang, Nong Xiao: AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment. IEEE International Conference on Cluster Computing (CLUSTER), (2013) 112–120.
Google Scholar
Ahmed El-Shimi, Ran Kalach, Ankit Kumar Adi Oltean Jin Li, Sudipta Sengupta: Primary Data Deduplication – Large Scale Study and System Design USENIX federated conference Week. June 12–15, (2012) 285–296.
Google Scholar
Ross Neil Williams: Method for partitioning a block of data into sub-blocks and for storing and communicating such sub-blocks. Patent US5990810 A, 23 Nov (1999).
Google Scholar
Purushottam Kulkarni, Fred Douglis Jason LaVoie, John M. Tracey, Redundancy Elimination Within Large Collections of Files Proceedings of the annual conference on USENIX Annual Technical Conference, General Track, (2004) 59–72.
Google Scholar
Dirk Meister, Jürgen Kaiser, Andre Brinkmann, Michael Kuhn, Julian Martin Kunkel, Toni Cortes: A Study on Data Deduplication in HPC Storage Systems. Conference Proceedings of the ACM/IEEE Conference on High Performance Computing (2012).
Google Scholar
Guanlin Lu: An Efficient Data Deduplication Design with Flash-Memory Based Solid State Drive. A Dissertation Submitted to The Faculty of the Graduate School of the University of Minnesota, (2012) 1–114.
Google Scholar
Nikolaj Bjorner, Andreas Blass, Yuri Gurevich: Content-Dependent Chunking for Differential Compression, The Local Maximum Approach. Journal of Computer and System Sciences, Elsevier, Vol. 79, No. 3, (2010) 154–203.
Google Scholar
Athicha Muthitacharoen, Benjie Chen and David Mazieres: A Low-bandwidth Network File System, proceeding of the 18th ACM Symposium on operating System principle (Sosp ’01), Chateau lake louise, Banff, Canada, (2001) 174–187.
Google Scholar
Janaka Deepakumara, Howard M. Heys and R. Venkatesan: FPGA Implementation of MD5 Hash Algorithm. IEEE Canadian Conference on Electrical and Computer Engineering, (2001) 919–924.
Google Scholar
Dai Zibin Zhou Ning: FPGA Implementation of SHA-1 Algorithm. IEEE 5th international conference on ASIC, (2003) 1321–1324.
Google Scholar
Deepavali Bhagwat, Kave Eshghi, Darrell D.E. Long and Mark Lillibridge: Extreme Binning: Scalable, Parallel De-duplication for Chunk-based File Backup. Proceeding of 17th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication systems (2009) 1–9.
Google Scholar
Guanlin Lu, Yu Jin, and David H.C. Du,: Frequency Based Chunking for Data De-Duplication. 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, (2010) 287–296.
Google Scholar
IderLkhagvasuren, Jung Min So, Jeong Gun Lee, Chuck Yoo and Young WoongKo: Byte-index Chunking Algorithm for Data Deduplication System. International Journal of Security and Its Applications, SERSC, Vol. 7, No. 5, (2013) 415–424.
Google Scholar
IderLkhagvasuren, Jung Min So, Jeong Gun Lee, Jin Kim and Young WoongKo, Multi-level Byte Index Chunking Mechanism for File Synchronization. International Journal of Software Engineering and Applications Vol. 8, No. 3, (2014) 339–350.
Google Scholar
Chuanshuai Yu, Chengwei Zhang, Yiping Mao, Fulu Li: Leap Based Content Defined Chunking- Theory and Implementation, IEEE 31st Symposium on Mass Storage Systems and Technologies (2015) 1–12.
Google Scholar
Erik Kruus, Christian Ungureanu, Cezary Dubnicki: Bimodal Content Defined Chunking for Backup Streams, FAST Proceedings of the 8th USENIX Conference on file and Storage Technologies, USENIX, (2010) 239–252.
Google Scholar
Jiansheng Wei, Junhua Zhu, Yong Li, “Multimodal Content Defined Chunking for Data Deduplication”, https://www.researchgate.net/publication/261286019, Research gate, (2014).
Yucheng Zhang, Hong Jiang, Dan Feng, Wen Xia, Min Fu, Fangting Huang, Yukun Zhou: AE An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication. IEEE Conference on Computer Communications (INFOCOM), (2015) 1337–1345.
Google Scholar
Kave Eshghi, Hsiu Khuern Tang: A framework for analyzing and improving content based chunking Algorithms. Technical Report TR 2005–30, Hewlett-Packard Development Company, http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.html (2005).
Teng-Sheng Moh, Bing Chun Chang: A Running Time Improvement for the Two Thresholds Two Divisors Algorithm. ACMSE ‘10, April 15–17, (2010).
Google Scholar
Xia, W., Jiang, H., Feng, D., Tian, L., Fu, M., and Zhou, Y.: Ddelta: A deduplication-inspired fast delta compression approach Performance Evaluation. 15 Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File Systems, (2015) 258–272.
Google Scholar
Aggarwal, B., Akella, A., Anand, A., et al. EndRE: an end-system redundancy elimination service for enterprises. In Proceedings of the 7th USENIX conference on Networked Systems Design and Implementation (NSDI ’10) (San Jose, CA, USA), USENIX Association, April (2010) 14–28.
Google Scholar
Cui, Y., Lal, Z., Wang, X., Dai, N., and Miao, C., “QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services” In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (Paris, France), ACM, Sept. (2015) 592–603.
Google Scholar
Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu: FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication. USENIX Open access to the Proceedings of USENIX Annual Technical Conference (USENIX ATC ’16), (2016) 101–114.
Google Scholar
Andrei Z. Broder: Some applications of Rabin’s fingerprinting method. Sequences II: Methods in Communications, Security and Computer Science, Springer, (1993)143–152.
Google Scholar
Michael O, Rabin: Fingerprinting by random polynomials. Center for Research in Computing Technology. Aiken Computation Laboratory, Univ., (1981).
Google Scholar
Dubnicki, C., Kruus, E., Lichota, K., and Ungureanu, C.: Methods and systems for data management using multiple selection criteria. US Patent App. 11/566,122, Dec 1, (2006).
Google Scholar
Min, J., Yoon, D., and Won, Y.: Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers Vol. 60, Issue 6, (2011) 824–840.
Google Scholar
Derviskaraboga, Selcukokdem: A simple and Global optimization algorithm for engineering problem: Differential evolution algorithm. Turk Journal Electrical Engineering, Vol. 12, No. 1, (2004).
Google Scholar
An Introduction to Differential Evolution, http://www.maths.uq.edu.au/MASCOS/MultiAgent04/Fleetwood.pdf.
Differential Evolution (DE): http://www.dii.unipd.it/alotto/didattica/corsi/Elettrotecnicalcomputazionale/DE.pdf.
Prometeo Cortes-Antonio, Josue Rangel-Gonzalez, Luis A. Villa-Vargas, Marco Antonio Ramirez-Salinas, Heron Molina Lozano, Idar Batyrshin,: Design and Implementation of Differential Evolution Algorithm on FPGA for Double Precision Floating-Point Representation. Acta Polytechnic Hungarica, Vol. 11, No. 4, (2014).
Google Scholar
Naresh Kumar and Ruchika Kumar, “Improved Join Operations Using ORC in HIVE”, Springer Transactions on Information Communication Technology (ICT), Springer Vol. 4, Issue 2, pp. 209–215, Dec. (2016).
Google Scholar
Jeffrey Dean and Sanjay Ghemawat, “Map-reduce: Simplified Data Processing on Large Clusters”, To appear in OSDI, (2004).
Google Scholar
http://www.icsi.berkeley.edu/storn/code.html.
http://www.python.org/.
http://www.java.org/.
Freedb.org, “http://freedb.org/pub/freedb/”.
Apache hadoop 1.2.1, “http://hadoop.apache.org/”.
http://www.serve.net/buz/hash.adt/java.000.html.
Naresh Kumar, Rahul Rawat and S.C Jain, “Bucket Based Data Deduplication Technique for Big Data Storage System”, IEEE 5th International Conference on Reliability, Infocom Tech. and Optimization, Dec. (2016).
Google Scholar
B. Zhu, K. Li, and H. Patterson, “Avoiding the Disk Bottleneck in the Data Domain Deduplication File System,” in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) San Jose, CA, USA, pp. 269–282, Feb. (2008).
Google Scholar
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar G. Trezise, and P. Campbell, “Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality,” in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) San Jose, CA, USA, pp. 111–123, Feb. (2008).
Google Scholar

Download references

Acknowledgements

For this research, we would like to show gratitude to Prof. Rohitashwa Shringi (and specially his wife Dr. Pramila) Mechanical Engineering department of Rajasthan Technical University, KOTA for their support, valuable advice, motivation, and encouragement. We would also like to thank Prof. S. C Jain Rajasthan Technical University, Kota for guiding this research work with valuable suggestions time to time.

Author information

Authors and Affiliations

Department UIET, Computer Science and Engineering, Kurukshetra University, Kurukshetra, 136119, India
Naresh Kumar & Shobha
Department Computer Science and Engineering, Rajasthan Technical University, Kota, 324010, India
S. C. Jain

Authors

Naresh Kumar
View author publications
You can also search for this author in PubMed Google Scholar
Shobha
View author publications
You can also search for this author in PubMed Google Scholar
S. C. Jain
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Naresh Kumar .

Editor information

Editors and Affiliations

Department of Computer Science, Rama Devi Women’s University, Bhubaneswar, Odisha, India
Chhabi Rani Panigrahi
Department of Computer Science, Central University of Rajasthan, Ajmer, Rajasthan, India
Arun K. Pujari
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
Sudip Misra
Department of Computer Science, Rama Devi Women’s University, Bhubaneswar, Odisha, India
Bibudhendu Pati
Department of Computer Science and Information Engineering (CSIE), Providence University, Taichung, Taiwan
Kuan-Ching Li

Appendix: Experimental Setup

Chunking	Main divisor D	Backup Divisor D′	Fingerprinting fp mod D = R	Speed
BSW	1000	No	Rabin	Slow
BFS	1000	No	Rabin	Slow
TD	1200	600	Rabin	Slow
SCM	540	No	Rabin	Slow
TTTD	540	270	Rabin 16-bit incremental	Slow
TTTD-S	1600	800	Rabin 16-bit incremental	Slow
TTTD-P	Optimal DE (D, T)	No	BUZ 32-bit rolling hash	Very fast

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kumar, N., Shobha, Jain, S.C. (2019). Efficient Data Deduplication for Big Data Storage Systems. In: Panigrahi, C., Pujari, A., Misra, S., Pati, B., Li, KC. (eds) Progress in Advanced Computing and Intelligent Engineering. Advances in Intelligent Systems and Computing, vol 714. Springer, Singapore. https://doi.org/10.1007/978-981-13-0224-4_32

Download citation

DOI: https://doi.org/10.1007/978-981-13-0224-4_32
Published: 10 July 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0223-7
Online ISBN: 978-981-13-0224-4
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics

Efficient Data Deduplication for Big Data Storage Systems

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage

Study of Chunking Algorithm in Data Deduplication

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendix: Experimental Setup

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Efficient Data Deduplication for Big Data Storage Systems

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Content-Based Chunk Placement Scheme for Decentralized Deduplication on Distributed File Systems

A smart hybrid content-defined chunking algorithm for data deduplication in cloud storage

Study of Chunking Algorithm in Data Deduplication

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Appendix: Experimental Setup

Appendix: Experimental Setup

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation