DOI: 10.5555/3489146.3489199
Research article

DupHunter: flexible high-performance deduplication for docker registries

Published: 15 July 2020

Abstract

The rise of containers has led to a broad proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of container registries that store and serve images. Exploiting the high file redundancy in real-world container images is a promising approach to drastically reduce the demanding storage requirements of the growing registries. However, existing deduplication techniques significantly degrade the performance of registries because of the high layer restore overhead.
We propose DupHunter, a new Docker registry architecture, which not only natively deduplicates layers for space savings but also reduces layer restore overhead. DupHunter supports several configurable deduplication modes, which provide different levels of storage efficiency, durability, and performance, to support a range of uses. To mitigate the negative impact of deduplication on image download times, DupHunter introduces a two-tier storage hierarchy with a novel layer prefetch/preconstruct cache algorithm based on user access patterns. Under real workloads, in the highest data reduction mode, DupHunter reduces storage space by up to 6.9× compared to the current implementations. In the highest performance mode, DupHunter can reduce GET layer latency by up to 2.8× compared to the state of the art.
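
To make the trade-off in the abstract concrete, the following is a minimal sketch of file-level layer deduplication and the restore step that a GET request would then have to pay for. It is not DupHunter's implementation: the function names (make_layer, dedup_layer, restore_layer), the in-memory dict used as the store, and the use of uncompressed tar archives as stand-ins for layers are all assumptions made for illustration.

```python
import hashlib
import io
import tarfile


def make_layer(files):
    """Build an uncompressed tar archive (a stand-in for a layer) from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def dedup_layer(layer_bytes, store):
    """Split a layer into files, keep one copy per unique content, return a recipe."""
    recipe = []
    with tarfile.open(fileobj=io.BytesIO(layer_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # this sketch ignores directories, links, and metadata
            data = tar.extractfile(member).read()
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)  # store the content only if unseen
            recipe.append((member.name, digest))
    return recipe


def restore_layer(recipe, store):
    """Rebuild a tar archive from a recipe; this is the restore overhead on GET."""
    return make_layer({name: store[digest] for name, digest in recipe})
```

In this sketch, deduplicating two layers that both contain the same file stores that file's content once, and every later download of either layer must reassemble the archive from the store. That reassembly is the restore overhead the abstract refers to, and it is the cost a layer cache in front of the deduplicated store would aim to hide.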

Published In

USENIX ATC'20: Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference
July 2020
957 pages
ISBN: 978-1-939133-14-4

Sponsors

• VMware
• Facebook
• Microsoft
• ORACLE
• Google Inc.

Publisher

USENIX Association
United States
