The focus of this dissertation is to provide scalable solutions for problems unique to chunk-based deduplication. Chunk-based deduplication is used in backup and archival storage systems to reduce storage space requirements. We show how to conduct similarity-based searches over large repositories, and how to scale out these searches as the repository grows; how to deduplicate low-locality file-based workloads, and how to scale out deduplication via parallelization and careful data and index organization; how to build a unified deduplication solution that can adapt to both tape-based and file-based workloads; and how to introduce strategic redundancies in deduplicated data to improve the overall robustness of the system.
Our scalable similarity-based search solution finds, for a given object, highly similar objects within a large store by examining only a small subset of the object's features. We show how to partition the feature index to scale out the search, and how to select a small subset of the partitions (less than 3%), independent of object size, based on the content of the query object alone, to conduct distributed similarity-based searches.
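The partition-selection idea can be illustrated with a minimal sketch. The feature extraction below (hashing fixed-size chunks and keeping the k smallest hashes, a min-hash-style sample) and the partition count are illustrative assumptions, not the dissertation's exact scheme; the point is that the partitions to query are derived from the query object's own content, and their number is independent of object size:

```python
import hashlib

NUM_PARTITIONS = 128  # assumed partition count for illustration

def features(data: bytes, chunk_size: int = 4) -> list:
    """Toy feature extraction: hash fixed-size chunks.
    Real systems use content-defined chunking."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def select_partitions(data: bytes, k: int = 3) -> set:
    """Pick at most k partitions from the query's own content:
    take the k smallest feature hashes (a min-hash-style sample)
    and map each to a partition. k does not grow with object size."""
    sample = sorted(features(data))[:k]
    return {int(f, 16) % NUM_PARTITIONS for f in sample}
```

Because two similar objects share most of their features, their smallest feature hashes largely coincide, so similar objects are routed to overlapping partitions and can find each other without querying the whole index.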
We show how to deduplicate low-locality file-based workloads using Extreme Binning. Extreme Binning uses file similarity to find duplicates accurately and makes only one disk access for chunk lookup per file, yielding reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the data size. Each backup node is autonomous, with no dependencies between nodes, which makes housekeeping tasks robust and low overhead.
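The one-disk-access-per-file property can be sketched as follows. An in-RAM primary index maps each file's representative chunk ID (here, its minimum chunk hash) to a bin holding the full chunk index for similar files; only that one bin is loaded from disk per file. The fixed-size chunking and in-memory bins are simplifying assumptions for illustration:

```python
import hashlib

def chunk_ids(data: bytes, chunk_size: int = 4) -> list:
    """Toy fixed-size chunking; real systems use content-defined chunking."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

class ExtremeBinningStore:
    """Minimal sketch of Extreme Binning: a RAM-resident primary index
    maps a file's representative chunk ID to a bin; the bin (disk-resident
    in a real system) is loaded once per file for chunk lookup."""

    def __init__(self):
        self.primary_index = {}   # representative chunk ID -> bin (set of IDs)
        self.disk_accesses = 0    # counts bin loads

    def backup_file(self, data: bytes) -> int:
        ids = chunk_ids(data)
        rep = min(ids)            # representative chunk ID (min-hash heuristic)
        bin_ = self.primary_index.setdefault(rep, set())
        self.disk_accesses += 1   # one bin load per file, regardless of size
        new = [cid for cid in ids if cid not in bin_]
        bin_.update(new)
        return len(new)           # chunks actually stored (non-duplicates)
```

A duplicate or near-duplicate file selects the same representative chunk ID, lands in the same bin, and finds its chunks there, so deduplication stays accurate while the RAM index holds only one entry per file rather than one per chunk.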
We build a 'unified deduplication' solution that adapts to and deduplicates a variety of workloads. These range from large byte streams with high locality to collections of files of varying sizes with no locality between them. Dedicated deduplication solutions exist for each kind of workload, but so far no unified solution works well for all. Our unified deduplication solution simplifies administration, since organizations do not have to deploy a dedicated solution for each kind of workload, and it yields better storage space savings than dedicated solutions because it deduplicates across workloads.
Deduplication reduces storage space requirements by allowing common chunks to be shared between similar objects. This reduces the reliability of the storage system because the loss of a few shared chunks can lead to the loss of many objects. We show how to eliminate this problem by choosing for each chunk a replication level that is a function of the amount of data that would be lost if that chunk were lost. Experiments show that this technique can achieve significantly higher robustness than a conventional approach combining data mirroring and Lempel-Ziv compression while requiring about half the storage space.
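A per-chunk replication policy of this shape can be sketched briefly. The logarithmic rule below is a hypothetical instance (the dissertation's exact function may differ): replicas grow with the amount of data that would be lost if the chunk were lost, so heavily shared chunks get extra copies while chunks unique to one object get the minimum:

```python
import math

def replication_level(bytes_referencing: int,
                      base: int = 1, unit: int = 1 << 20) -> int:
    """Hypothetical policy: replica count grows logarithmically with the
    total bytes of objects that reference this chunk (the data at risk),
    starting from `base` replicas for chunks referenced by <= `unit` bytes."""
    if bytes_referencing <= unit:
        return base
    return base + int(math.log2(bytes_referencing / unit))
```

Under such a policy, a chunk shared by a gigabyte of objects receives several extra replicas while the bulk of rarely shared chunks stay at one copy, which is how the technique can beat mirroring plus Lempel-Ziv compression at roughly half the space.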