Article

SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

Authors:

Yu Hua

USENIXATC'11: Proceedings of the 2011 USENIX conference on USENIX annual technical conference

Pages 26 - 28

Published: 15 June 2011

Abstract

Data deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based and, according to our analysis, do not work adequately in many situations. The former produce poor deduplication throughput when there is little or no locality in datasets, while the latter can fail to identify, and thus remove, significant amounts of redundant data when there is a lack of similarity among files. In this paper, we present SiLo, a near-exact deduplication system that effectively and complementarily exploits similarity and locality to achieve high duplicate elimination and throughput at extremely low RAM overheads. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By judiciously enhancing similarity through the exploitation of locality and vice versa, the SiLo approach is able to significantly reduce RAM usage for index lookup and maintain a very high deduplication throughput. Our experimental evaluation of SiLo based on real-world datasets shows that the SiLo system consistently and significantly outperforms two existing state-of-the-art systems, one based on similarity and the other based on locality, under various workload conditions.
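The interplay of similarity and locality described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyDedup`, the helper `make_segments`, the chunk-count sizes, the use of SHA-1, and the choice of the minimum fingerprint as a segment's representative are all assumptions made for illustration. Chunks are packed into segments (so small files share a segment and large files span several), contiguous segments are grouped into blocks, and a small in-RAM table maps each segment's representative fingerprint to its block. A similarity hit loads the whole block, which lets neighbouring segments find their duplicates even when their own representative has changed.

```python
import hashlib

SEGMENT_SIZE = 4  # chunks per segment (toy value, an assumption)
BLOCK_SIZE = 2    # segments per block (toy value, an assumption)


def fingerprint(chunk: bytes) -> str:
    """Chunk fingerprint; SHA-1 is an illustrative choice."""
    return hashlib.sha1(chunk).hexdigest()


def make_segments(files, segment_size=SEGMENT_SIZE):
    """Pack the chunk fingerprints of consecutive files into segments.

    Small files end up sharing a segment; large files span several.
    `files` is a list of files, each given as a list of byte chunks.
    """
    segments, current = [], []
    for chunks in files:
        for chunk in chunks:
            current.append(fingerprint(chunk))
            if len(current) == segment_size:
                segments.append(current)
                current = []
    if current:
        segments.append(current)
    return segments


class ToyDedup:
    """Hypothetical similarity-plus-locality index, for illustration only."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.similarity_index = {}  # representative fingerprint -> block id (RAM)
        self.blocks = []            # block id -> set of fingerprints ("on disk")
        self.cache = set()          # fingerprints of blocks loaded so far

    def backup(self, files):
        """Deduplicate one backup stream; return (stored, duplicate) counts."""
        segments = make_segments(files)
        stored = duplicate = 0
        for start in range(0, len(segments), self.block_size):
            block_segments = segments[start:start + self.block_size]
            block_fps = set()
            for seg in block_segments:
                # Similarity: probe the RAM index with one representative only.
                hit = self.similarity_index.get(min(seg))
                if hit is not None:
                    # Locality: load the whole block, neighbours included.
                    self.cache |= self.blocks[hit]
                for fp in seg:
                    if fp in self.cache or fp in block_fps:
                        duplicate += 1
                    else:
                        stored += 1
                    block_fps.add(fp)
            block_id = len(self.blocks)
            self.blocks.append(block_fps)
            for seg in block_segments:
                self.similarity_index.setdefault(min(seg), block_id)
        return stored, duplicate


if __name__ == "__main__":
    files = [[b"a", b"b"], [b"c", b"d", b"e", b"f", b"g", b"h", b"i"]]
    dedup = ToyDedup()
    print(dedup.backup(files))  # first backup: every chunk is new
    print(dedup.backup(files))  # second backup: every chunk is a duplicate
```

In this sketch only one fingerprint per segment lives in RAM, which is where the memory saving would come from. The price is that a duplicate can be missed when neither its own segment nor a neighbouring segment in the same block produces a similarity hit, which is why such a scheme is called near-exact rather than exact.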

SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput
1. General and reference
  1. Cross-computing tools and techniques
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems

Published In

USENIXATC'11: Proceedings of the 2011 USENIX conference on USENIX annual technical conference

June 2011

36 pages

Program Chairs: Jason Nieh (Columbia University), Carl Waldspurger

Publisher

USENIX Association

United States
