Abstract
For efficient chunking, we propose Differential Evolution (DE) based approach which is optimized Two Thresholds Two Divisors (TTTD-P) Content Defined Chunking (CDC) to reduce the number of computing operations using single dynamic optimal parameter divisor D with optimal threshold value exploiting multi-operations nature of TTTD. To reduce chunk size variance, TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points, however, adding an additional divisor decreases chunking throughput. To this end, Asymmetric Extremum (AE) significantly improves chunking throughput by using local extreme value in a variable-sized asymmetric window to overcome Rabin and TTTD boundaries shift problem, while achieving nearby same deduplication ratio (DR). Therefore, we propose DE-based TTTD-P optimized chunking to maximize chunking throughput with increased DR; and scalable bucket indexing approach reduces hash values judgment time to identify and declare redundant chunks about 16 times than Rabin CDC, 5 times than AE CDC, 1.6 times than FAST CDC on Hadoop Distributed File System (HDFS).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Min Chen, Shiwen Mao, Yunhao Liu: Big Data: A Survey Mobile Networks and Applications Journal, Springer, Vol. 19, Issue 2, (2014) 171–209.
James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers: Big data: The next frontier for innovation, competition, and productivity. McKinsey Company, (2011) 1–156.
R. Storn: Differential Evolution: A simple and efficient heuristic strategy for global optimization over continuous spaces. Journal of Global Optimization, Vol. 11, (1997) 341–359.
Jaehong Min, Daeyoung Yoon, and Youjip Won: Efficient Deduplication Techniques for Modern Backup Operation. IEEE Transactions on Computers, Vol. 60, No. 6, (2011) 824–840.
J. Gantz, and D. Reinsel: The digital universe decade are you ready? IDC White Paper, May (2011).
H. Biggar: Experiencing data deduplication: Improving efficiency and reducing capacity requirements. White Paper, the Enterprise Strategy Group, Feb (2012).
Tin-Yu Wu, Jeng-Shyang Pan, and Chia-Fan Lin: Improving Accessing Efficiency of Cloud Storage Using Deduplication and Feedback Schemes. IEEE Systems Journal, Volume 8, No.1, (2014) 208–218.
Yinjin Fu, Lei Tian, Fang Liu, Hong Jiang, Nong Xiao: AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment. IEEE International Conference on Cluster Computing (CLUSTER), (2013) 112–120.
Ahmed El-Shimi, Ran Kalach, Ankit Kumar Adi Oltean Jin Li, Sudipta Sengupta: Primary Data Deduplication – Large Scale Study and System Design USENIX federated conference Week. June 12–15, (2012) 285–296.
Ross Neil Williams: Method for partitioning a block of data into sub-blocks and for storing and communicating such sub-blocks. Patent US5990810 A, 23 Nov (1999).
Purushottam Kulkarni, Fred Douglis Jason LaVoie, John M. Tracey, Redundancy Elimination Within Large Collections of Files Proceedings of the annual conference on USENIX Annual Technical Conference, General Track, (2004) 59–72.
Dirk Meister, Jürgen Kaiser, Andre Brinkmann, Michael Kuhn, Julian Martin Kunkel, Toni Cortes: A Study on Data Deduplication in HPC Storage Systems. Conference Proceedings of the ACM/IEEE Conference on High Performance Computing (2012).
Guanlin Lu: An Efficient Data Deduplication Design with Flash-Memory Based Solid State Drive. A Dissertation Submitted to The Faculty of the Graduate School of the University of Minnesota, (2012) 1–114.
Nikolaj Bjorner, Andreas Blass, Yuri Gurevich: Content-Dependent Chunking for Differential Compression, The Local Maximum Approach. Journal of Computer and System Sciences, Elsevier, Vol. 79, No. 3, (2010) 154–203.
Athicha Muthitacharoen, Benjie Chen and David Mazieres: A Low-bandwidth Network File System, proceeding of the 18th ACM Symposium on operating System principle (Sosp ’01), Chateau lake louise, Banff, Canada, (2001) 174–187.
Janaka Deepakumara, Howard M. Heys and R. Venkatesan: FPGA Implementation of MD5 Hash Algorithm. IEEE Canadian Conference on Electrical and Computer Engineering, (2001) 919–924.
Dai Zibin Zhou Ning: FPGA Implementation of SHA-1 Algorithm. IEEE 5th international conference on ASIC, (2003) 1321–1324.
Deepavali Bhagwat, Kave Eshghi, Darrell D.E. Long and Mark Lillibridge: Extreme Binning: Scalable, Parallel De-duplication for Chunk-based File Backup. Proceeding of 17th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication systems (2009) 1–9.
Guanlin Lu, Yu Jin, and David H.C. Du,: Frequency Based Chunking for Data De-Duplication. 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, (2010) 287–296.
IderLkhagvasuren, Jung Min So, Jeong Gun Lee, Chuck Yoo and Young WoongKo: Byte-index Chunking Algorithm for Data Deduplication System. International Journal of Security and Its Applications, SERSC, Vol. 7, No. 5, (2013) 415–424.
IderLkhagvasuren, Jung Min So, Jeong Gun Lee, Jin Kim and Young WoongKo, Multi-level Byte Index Chunking Mechanism for File Synchronization. International Journal of Software Engineering and Applications Vol. 8, No. 3, (2014) 339–350.
Chuanshuai Yu, Chengwei Zhang, Yiping Mao, Fulu Li: Leap Based Content Defined Chunking- Theory and Implementation, IEEE 31st Symposium on Mass Storage Systems and Technologies (2015) 1–12.
Erik Kruus, Christian Ungureanu, Cezary Dubnicki: Bimodal Content Defined Chunking for Backup Streams, FAST Proceedings of the 8th USENIX Conference on file and Storage Technologies, USENIX, (2010) 239–252.
Jiansheng Wei, Junhua Zhu, Yong Li, “Multimodal Content Defined Chunking for Data Deduplication”, https://www.researchgate.net/publication/261286019, Research gate, (2014).
Yucheng Zhang, Hong Jiang, Dan Feng, Wen Xia, Min Fu, Fangting Huang, Yukun Zhou: AE An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication. IEEE Conference on Computer Communications (INFOCOM), (2015) 1337–1345.
Kave Eshghi, Hsiu Khuern Tang: A framework for analyzing and improving content based chunking Algorithms. Technical Report TR 2005–30, Hewlett-Packard Development Company, http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.html (2005).
Teng-Sheng Moh, Bing Chun Chang: A Running Time Improvement for the Two Thresholds Two Divisors Algorithm. ACMSE ‘10, April 15–17, (2010).
Xia, W., Jiang, H., Feng, D., Tian, L., Fu, M., and Zhou, Y.: Ddelta: A deduplication-inspired fast delta compression approach Performance Evaluation. 15 Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File Systems, (2015) 258–272.
Aggarwal, B., Akella, A., Anand, A., et al. EndRE: an end-system redundancy elimination service for enterprises. In Proceedings of the 7th USENIX conference on Networked Systems Design and Implementation (NSDI ’10) (San Jose, CA, USA), USENIX Association, April (2010) 14–28.
Cui, Y., Lal, Z., Wang, X., Dai, N., and Miao, C., “QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services” In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (Paris, France), ACM, Sept. (2015) 592–603.
Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu: FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication. USENIX Open access to the Proceedings of USENIX Annual Technical Conference (USENIX ATC ’16), (2016) 101–114.
Andrei Z. Broder: Some applications of Rabin’s fingerprinting method. Sequences II: Methods in Communications, Security and Computer Science, Springer, (1993)143–152.
Michael O, Rabin: Fingerprinting by random polynomials. Center for Research in Computing Technology. Aiken Computation Laboratory, Univ., (1981).
Dubnicki, C., Kruus, E., Lichota, K., and Ungureanu, C.: Methods and systems for data management using multiple selection criteria. US Patent App. 11/566,122, Dec 1, (2006).
Min, J., Yoon, D., and Won, Y.: Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers Vol. 60, Issue 6, (2011) 824–840.
Derviskaraboga, Selcukokdem: A simple and Global optimization algorithm for engineering problem: Differential evolution algorithm. Turk Journal Electrical Engineering, Vol. 12, No. 1, (2004).
An Introduction to Differential Evolution, http://www.maths.uq.edu.au/MASCOS/MultiAgent04/Fleetwood.pdf.
Differential Evolution (DE): http://www.dii.unipd.it/alotto/didattica/corsi/Elettrotecnicalcomputazionale/DE.pdf.
Prometeo Cortes-Antonio, Josue Rangel-Gonzalez, Luis A. Villa-Vargas, Marco Antonio Ramirez-Salinas, Heron Molina Lozano, Idar Batyrshin,: Design and Implementation of Differential Evolution Algorithm on FPGA for Double Precision Floating-Point Representation. Acta Polytechnic Hungarica, Vol. 11, No. 4, (2014).
Naresh Kumar and Ruchika Kumar, “Improved Join Operations Using ORC in HIVE”, Springer Transactions on Information Communication Technology (ICT), Springer Vol. 4, Issue 2, pp. 209–215, Dec. (2016).
Jeffrey Dean and Sanjay Ghemawat, “Map-reduce: Simplified Data Processing on Large Clusters”, To appear in OSDI, (2004).
Freedb.org, “http://freedb.org/pub/freedb/”.
Apache hadoop 1.2.1, “http://hadoop.apache.org/”.
Naresh Kumar, Rahul Rawat and S.C Jain, “Bucket Based Data Deduplication Technique for Big Data Storage System”, IEEE 5th International Conference on Reliability, Infocom Tech. and Optimization, Dec. (2016).
B. Zhu, K. Li, and H. Patterson, “Avoiding the Disk Bottleneck in the Data Domain Deduplication File System,” in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) San Jose, CA, USA, pp. 269–282, Feb. (2008).
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar G. Trezise, and P. Campbell, “Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality,” in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) San Jose, CA, USA, pp. 111–123, Feb. (2008).
Acknowledgements
For this research, we would like to show gratitude to Prof. Rohitashwa Shringi (and specially his wife Dr. Pramila) Mechanical Engineering department of Rajasthan Technical University, KOTA for their support, valuable advice, motivation, and encouragement. We would also like to thank Prof. S. C Jain Rajasthan Technical University, Kota for guiding this research work with valuable suggestions time to time.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendix: Experimental Setup
Appendix: Experimental Setup
Chunking | Main divisor D | Backup Divisor D′ | Fingerprinting fp mod D = R | Speed |
---|---|---|---|---|
BSW | 1000 | No | Rabin | Slow |
BFS | 1000 | No | Rabin | Slow |
TD | 1200 | 600 | Rabin | Slow |
SCM | 540 | No | Rabin | Slow |
TTTD | 540 | 270 | Rabin 16-bit incremental | Slow |
TTTD-S | 1600 | 800 | Rabin 16-bit incremental | Slow |
TTTD-P | Optimal DE (D, T) | No | BUZ 32-bit rolling hash | Very fast |
Rights and permissions
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kumar, N., Shobha, Jain, S.C. (2019). Efficient Data Deduplication for Big Data Storage Systems. In: Panigrahi, C., Pujari, A., Misra, S., Pati, B., Li, KC. (eds) Progress in Advanced Computing and Intelligent Engineering. Advances in Intelligent Systems and Computing, vol 714. Springer, Singapore. https://doi.org/10.1007/978-981-13-0224-4_32
Download citation
DOI: https://doi.org/10.1007/978-981-13-0224-4_32
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0223-7
Online ISBN: 978-981-13-0224-4
eBook Packages: EngineeringEngineering (R0)