Nothing Special   »   [go: up one dir, main page]

Skip to main content

Efficient Data Deduplication for Big Data Storage Systems

  • Conference paper
  • First Online:
Progress in Advanced Computing and Intelligent Engineering

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 714))

Abstract

For efficient chunking, we propose Differential Evolution (DE) based approach which is optimized Two Thresholds Two Divisors (TTTD-P) Content Defined Chunking (CDC) to reduce the number of computing operations using single dynamic optimal parameter divisor D with optimal threshold value exploiting multi-operations nature of TTTD. To reduce chunk size variance, TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points, however, adding an additional divisor decreases chunking throughput. To this end, Asymmetric Extremum (AE) significantly improves chunking throughput by using local extreme value in a variable-sized asymmetric window to overcome Rabin and TTTD boundaries shift problem, while achieving nearby same deduplication ratio (DR). Therefore, we propose DE-based TTTD-P optimized chunking to maximize chunking throughput with increased DR; and scalable bucket indexing approach reduces hash values judgment time to identify and declare redundant chunks about 16 times than Rabin CDC, 5 times than AE CDC, 1.6 times than FAST CDC on Hadoop Distributed File System (HDFS).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Min Chen, Shiwen Mao, Yunhao Liu: Big Data: A Survey Mobile Networks and Applications Journal, Springer, Vol. 19, Issue 2, (2014) 171–209.

    Google Scholar 

  2. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers: Big data: The next frontier for innovation, competition, and productivity. McKinsey Company, (2011) 1–156.

    Google Scholar 

  3. R. Storn: Differential Evolution: A simple and efficient heuristic strategy for global optimization over continuous spaces. Journal of Global Optimization, Vol. 11, (1997) 341–359.

    Google Scholar 

  4. Jaehong Min, Daeyoung Yoon, and Youjip Won: Efficient Deduplication Techniques for Modern Backup Operation. IEEE Transactions on Computers, Vol. 60, No. 6, (2011) 824–840.

    Google Scholar 

  5. J. Gantz, and D. Reinsel: The digital universe decade are you ready? IDC White Paper, May (2011).

    Google Scholar 

  6. H. Biggar: Experiencing data deduplication: Improving efficiency and reducing capacity requirements. White Paper, the Enterprise Strategy Group, Feb (2012).

    Google Scholar 

  7. Tin-Yu Wu, Jeng-Shyang Pan, and Chia-Fan Lin: Improving Accessing Efficiency of Cloud Storage Using Deduplication and Feedback Schemes. IEEE Systems Journal, Volume 8, No.1, (2014) 208–218.

    Google Scholar 

  8. Yinjin Fu, Lei Tian, Fang Liu, Hong Jiang, Nong Xiao: AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment. IEEE International Conference on Cluster Computing (CLUSTER), (2013) 112–120.

    Google Scholar 

  9. Ahmed El-Shimi, Ran Kalach, Ankit Kumar Adi Oltean Jin Li, Sudipta Sengupta: Primary Data Deduplication – Large Scale Study and System Design USENIX federated conference Week. June 12–15, (2012) 285–296.

    Google Scholar 

  10. Ross Neil Williams: Method for partitioning a block of data into sub-blocks and for storing and communicating such sub-blocks. Patent US5990810 A, 23 Nov (1999).

    Google Scholar 

  11. Purushottam Kulkarni, Fred Douglis Jason LaVoie, John M. Tracey, Redundancy Elimination Within Large Collections of Files Proceedings of the annual conference on USENIX Annual Technical Conference, General Track, (2004) 59–72.

    Google Scholar 

  12. Dirk Meister, Jürgen Kaiser, Andre Brinkmann, Michael Kuhn, Julian Martin Kunkel, Toni Cortes: A Study on Data Deduplication in HPC Storage Systems. Conference Proceedings of the ACM/IEEE Conference on High Performance Computing (2012).

    Google Scholar 

  13. Guanlin Lu: An Efficient Data Deduplication Design with Flash-Memory Based Solid State Drive. A Dissertation Submitted to The Faculty of the Graduate School of the University of Minnesota, (2012) 1–114.

    Google Scholar 

  14. Nikolaj Bjorner, Andreas Blass, Yuri Gurevich: Content-Dependent Chunking for Differential Compression, The Local Maximum Approach. Journal of Computer and System Sciences, Elsevier, Vol. 79, No. 3, (2010) 154–203.

    Google Scholar 

  15. Athicha Muthitacharoen, Benjie Chen and David Mazieres: A Low-bandwidth Network File System, proceeding of the 18th ACM Symposium on operating System principle (Sosp ’01), Chateau lake louise, Banff, Canada, (2001) 174–187.

    Google Scholar 

  16. Janaka Deepakumara, Howard M. Heys and R. Venkatesan: FPGA Implementation of MD5 Hash Algorithm. IEEE Canadian Conference on Electrical and Computer Engineering, (2001) 919–924.

    Google Scholar 

  17. Dai Zibin Zhou Ning: FPGA Implementation of SHA-1 Algorithm. IEEE 5th international conference on ASIC, (2003) 1321–1324.

    Google Scholar 

  18. Deepavali Bhagwat, Kave Eshghi, Darrell D.E. Long and Mark Lillibridge: Extreme Binning: Scalable, Parallel De-duplication for Chunk-based File Backup. Proceeding of 17th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication systems (2009) 1–9.

    Google Scholar 

  19. Guanlin Lu, Yu Jin, and David H.C. Du,: Frequency Based Chunking for Data De-Duplication. 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, (2010) 287–296.

    Google Scholar 

  20. IderLkhagvasuren, Jung Min So, Jeong Gun Lee, Chuck Yoo and Young WoongKo: Byte-index Chunking Algorithm for Data Deduplication System. International Journal of Security and Its Applications, SERSC, Vol. 7, No. 5, (2013) 415–424.

    Google Scholar 

  21. IderLkhagvasuren, Jung Min So, Jeong Gun Lee, Jin Kim and Young WoongKo, Multi-level Byte Index Chunking Mechanism for File Synchronization. International Journal of Software Engineering and Applications Vol. 8, No. 3, (2014) 339–350.

    Google Scholar 

  22. Chuanshuai Yu, Chengwei Zhang, Yiping Mao, Fulu Li: Leap Based Content Defined Chunking- Theory and Implementation, IEEE 31st Symposium on Mass Storage Systems and Technologies (2015) 1–12.

    Google Scholar 

  23. Erik Kruus, Christian Ungureanu, Cezary Dubnicki: Bimodal Content Defined Chunking for Backup Streams, FAST Proceedings of the 8th USENIX Conference on file and Storage Technologies, USENIX, (2010) 239–252.

    Google Scholar 

  24. Jiansheng Wei, Junhua Zhu, Yong Li, “Multimodal Content Defined Chunking for Data Deduplication”, https://www.researchgate.net/publication/261286019, Research gate, (2014).

  25. Yucheng Zhang, Hong Jiang, Dan Feng, Wen Xia, Min Fu, Fangting Huang, Yukun Zhou: AE An Asymmetric Extremum Content Defined Chunking Algorithm for Fast and Bandwidth-Efficient Data Deduplication. IEEE Conference on Computer Communications (INFOCOM), (2015) 1337–1345.

    Google Scholar 

  26. Kave Eshghi, Hsiu Khuern Tang: A framework for analyzing and improving content based chunking Algorithms. Technical Report TR 2005–30, Hewlett-Packard Development Company, http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.html (2005).

  27. Teng-Sheng Moh, Bing Chun Chang: A Running Time Improvement for the Two Thresholds Two Divisors Algorithm. ACMSE ‘10, April 15–17, (2010).

    Google Scholar 

  28. Xia, W., Jiang, H., Feng, D., Tian, L., Fu, M., and Zhou, Y.: Ddelta: A deduplication-inspired fast delta compression approach Performance Evaluation. 15 Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File Systems, (2015) 258–272.

    Google Scholar 

  29. Aggarwal, B., Akella, A., Anand, A., et al. EndRE: an end-system redundancy elimination service for enterprises. In Proceedings of the 7th USENIX conference on Networked Systems Design and Implementation (NSDI ’10) (San Jose, CA, USA), USENIX Association, April (2010) 14–28.

    Google Scholar 

  30. Cui, Y., Lal, Z., Wang, X., Dai, N., and Miao, C., “QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services” In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (Paris, France), ACM, Sept. (2015) 592–603.

    Google Scholar 

  31. Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu: FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication. USENIX Open access to the Proceedings of USENIX Annual Technical Conference (USENIX ATC ’16), (2016) 101–114.

    Google Scholar 

  32. Andrei Z. Broder: Some applications of Rabin’s fingerprinting method. Sequences II: Methods in Communications, Security and Computer Science, Springer, (1993)143–152.

    Google Scholar 

  33. Michael O, Rabin: Fingerprinting by random polynomials. Center for Research in Computing Technology. Aiken Computation Laboratory, Univ., (1981).

    Google Scholar 

  34. Dubnicki, C., Kruus, E., Lichota, K., and Ungureanu, C.: Methods and systems for data management using multiple selection criteria. US Patent App. 11/566,122, Dec 1, (2006).

    Google Scholar 

  35. Min, J., Yoon, D., and Won, Y.: Efficient deduplication techniques for modern backup operation. IEEE Transactions on Computers Vol. 60, Issue 6, (2011) 824–840.

    Google Scholar 

  36. Derviskaraboga, Selcukokdem: A simple and Global optimization algorithm for engineering problem: Differential evolution algorithm. Turk Journal Electrical Engineering, Vol. 12, No. 1, (2004).

    Google Scholar 

  37. An Introduction to Differential Evolution, http://www.maths.uq.edu.au/MASCOS/MultiAgent04/Fleetwood.pdf.

  38. Differential Evolution (DE): http://www.dii.unipd.it/alotto/didattica/corsi/Elettrotecnicalcomputazionale/DE.pdf.

  39. Prometeo Cortes-Antonio, Josue Rangel-Gonzalez, Luis A. Villa-Vargas, Marco Antonio Ramirez-Salinas, Heron Molina Lozano, Idar Batyrshin,: Design and Implementation of Differential Evolution Algorithm on FPGA for Double Precision Floating-Point Representation. Acta Polytechnic Hungarica, Vol. 11, No. 4, (2014).

    Google Scholar 

  40. Naresh Kumar and Ruchika Kumar, “Improved Join Operations Using ORC in HIVE”, Springer Transactions on Information Communication Technology (ICT), Springer Vol. 4, Issue 2, pp. 209–215, Dec. (2016).

    Google Scholar 

  41. Jeffrey Dean and Sanjay Ghemawat, “Map-reduce: Simplified Data Processing on Large Clusters”, To appear in OSDI, (2004).

    Google Scholar 

  42. http://www.icsi.berkeley.edu/storn/code.html.

  43. http://www.python.org/.

  44. http://www.java.org/.

  45. Freedb.org, “http://freedb.org/pub/freedb/”.

  46. Apache hadoop 1.2.1, “http://hadoop.apache.org/”.

  47. http://www.serve.net/buz/hash.adt/java.000.html.

  48. Naresh Kumar, Rahul Rawat and S.C Jain, “Bucket Based Data Deduplication Technique for Big Data Storage System”, IEEE 5th International Conference on Reliability, Infocom Tech. and Optimization, Dec. (2016).

    Google Scholar 

  49. B. Zhu, K. Li, and H. Patterson, “Avoiding the Disk Bottleneck in the Data Domain Deduplication File System,” in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) San Jose, CA, USA, pp. 269–282, Feb. (2008).

    Google Scholar 

  50. M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar G. Trezise, and P. Campbell, “Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality,” in Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) San Jose, CA, USA, pp. 111–123, Feb. (2008).

    Google Scholar 

Download references

Acknowledgements

For this research, we would like to show gratitude to Prof. Rohitashwa Shringi (and specially his wife Dr. Pramila) Mechanical Engineering department of Rajasthan Technical University, KOTA for their support, valuable advice, motivation, and encouragement. We would also like to thank Prof. S. C Jain Rajasthan Technical University, Kota for guiding this research work with valuable suggestions time to time.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Naresh Kumar .

Editor information

Editors and Affiliations

Appendix: Experimental Setup

Appendix: Experimental Setup

Chunking

Main divisor D

Backup Divisor D′

Fingerprinting

fp mod D = R

Speed

BSW

1000

No

Rabin

Slow

BFS

1000

No

Rabin

Slow

TD

1200

600

Rabin

Slow

SCM

540

No

Rabin

Slow

TTTD

540

270

Rabin 16-bit incremental

Slow

TTTD-S

1600

800

Rabin 16-bit incremental

Slow

TTTD-P

Optimal DE (D, T)

No

BUZ 32-bit rolling hash

Very fast

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Kumar, N., Shobha, Jain, S.C. (2019). Efficient Data Deduplication for Big Data Storage Systems. In: Panigrahi, C., Pujari, A., Misra, S., Pati, B., Li, KC. (eds) Progress in Advanced Computing and Intelligent Engineering. Advances in Intelligent Systems and Computing, vol 714. Springer, Singapore. https://doi.org/10.1007/978-981-13-0224-4_32

Download citation

  • DOI: https://doi.org/10.1007/978-981-13-0224-4_32

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-0223-7

  • Online ISBN: 978-981-13-0224-4

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics