Article (DOI: 10.5555/2002181.2002207)

SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

Published: 15 June 2011

Abstract

Data deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication throughput when there is little or no locality in datasets, the latter can fail to identify and thus remove significant amounts of redundant data when there is a lack of similarity among files. In this paper, we present SiLo, a near-exact deduplication system that effectively and complementarily exploits similarity and locality to achieve high duplicate elimination and throughput at extremely low RAM overheads. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By judiciously enhancing similarity through the exploitation of locality and vice versa, the SiLo approach is able to significantly reduce RAM usage for index lookup and maintain a very high deduplication throughput. Our experimental evaluation of SiLo based on real-world datasets shows that the SiLo system consistently and significantly outperforms two existing state-of-the-art systems, one based on similarity and the other based on locality, under various workload conditions.
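
The segment-and-block idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes fixed-size chunking, SHA-1 chunk fingerprints, the minimum fingerprint as each segment's representative, and in-memory structures standing in for on-disk blocks. The class name `SiLoSketch`, the helper names, and the size constants are all hypothetical and chosen only for illustration.

```python
import hashlib

# Illustrative sizes only (assumptions); a real system would use far larger values.
CHUNK_SIZE = 8        # bytes per chunk
SEGMENT_CHUNKS = 4    # chunk fingerprints per segment
BLOCK_SEGMENTS = 2    # contiguous segments per block


def fingerprints(data: bytes) -> list[str]:
    """Split data into fixed-size chunks and fingerprint each chunk with SHA-1."""
    return [
        hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def make_segments(files: list[bytes]) -> list[list[str]]:
    """Pack the fingerprints of a backup stream into equal-sized segments.

    Consecutive small files end up sharing a segment; a large file spans several.
    """
    stream = [fp for f in files for fp in fingerprints(f)]
    return [stream[i:i + SEGMENT_CHUNKS] for i in range(0, len(stream), SEGMENT_CHUNKS)]


class SiLoSketch:
    """Toy similarity-plus-locality deduplicator (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.similarity_index: dict[str, int] = {}  # representative fp -> block id (small RAM index)
        self.blocks: list[set[str]] = []            # block id -> chunk fps (stands in for disk)
        self.cache: set[str] = set()                # fps of blocks loaded after similarity hits
        self.stored_chunks = 0
        self.duplicate_chunks = 0

    def backup(self, files: list[bytes]) -> None:
        """Deduplicate one backup stream, one block of contiguous segments at a time."""
        segments = make_segments(files)
        for start in range(0, len(segments), BLOCK_SEGMENTS):
            self._dedupe_block(segments[start:start + BLOCK_SEGMENTS])

    def _dedupe_block(self, block: list[list[str]]) -> None:
        new_block_fps: set[str] = set()
        for segment in block:
            # Similarity: probe the index with one representative fingerprint per segment.
            hit = self.similarity_index.get(min(segment))
            if hit is not None:
                # Locality: load the whole matching block, not just the similar segment.
                self.cache |= self.blocks[hit]
            for fp in segment:
                if fp in self.cache or fp in new_block_fps:
                    self.duplicate_chunks += 1
                else:
                    self.stored_chunks += 1
                new_block_fps.add(fp)
        block_id = len(self.blocks)
        self.blocks.append(new_block_fps)
        for segment in block:
            # Keep the first block seen for each representative fingerprint.
            self.similarity_index.setdefault(min(segment), block_id)
```

In this sketch only one fingerprint per segment is indexed, which is why the index stays small. A duplicate chunk is missed only when no representative fingerprint leads to a block containing it, which is the sense in which such a scheme is near-exact rather than exact. For example, backing up two files and then backing them up again with one chunk changed still detects the unchanged chunks in the modified segment, because a hit on a neighbouring segment has already loaded the whole block. The cache here is unbounded for simplicity; a practical design would bound and evict it.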




Published In

USENIXATC'11: Proceedings of the 2011 USENIX conference on USENIX annual technical conference
June 2011
36 pages

Publisher

USENIX Association

United States

Cited By

  • (2021) Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling. ACM Transactions on Storage 17:4 (1-23). DOI: 10.1145/3459626. Online publication date: 15-Oct-2021
  • (2021) Boosting the Restoring Performance of Deduplication Data by Classifying Backup Metadata. ACM/IMS Transactions on Data Science 2:2 (1-16). DOI: 10.1145/3437261. Online publication date: 21-Apr-2021
  • (2019) Finesse. Proceedings of the 17th USENIX Conference on File and Storage Technologies (121-128). DOI: 10.5555/3323298.3323310. Online publication date: 25-Feb-2019
  • (2018) EAD. Cluster Computing 21:3 (1561-1579). DOI: 10.5555/3287988.3288005. Online publication date: 1-Sep-2018
  • (2018) Towards web-based delta synchronization for cloud storage services. Proceedings of the 16th USENIX Conference on File and Storage Technologies (155-168). DOI: 10.5555/3189759.3189774. Online publication date: 12-Feb-2018
  • (2018) A Global Survey on Data Deduplication. International Journal of Grid and High Performance Computing 10:4 (43-66). DOI: 10.4018/IJGHPC.2018100103. Online publication date: 1-Oct-2018
  • (2018) Elasecutor. Proceedings of the ACM Symposium on Cloud Computing (107-120). DOI: 10.1145/3267809.3267818. Online publication date: 11-Oct-2018
  • (2018) Cluster and Single-Node Analysis of Long-Term Deduplication Patterns. ACM Transactions on Storage 14:2 (1-27). DOI: 10.1145/3183890. Online publication date: 11-May-2018
  • (2018) HPDV. Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (472-481). DOI: 10.1109/CCGRID.2018.00074. Online publication date: 1-May-2018
  • (2018) DCDedupe. Journal of Grid Computing 16:2 (195-209). DOI: 10.1007/s10723-018-9429-3. Online publication date: 1-Jun-2018
