Hybrid Block Storage for Efficient Cloud Volume Service

Published: 03 October 2023

Abstract

The migration of traditional desktop and server applications to the cloud brings the challenges of high performance, high reliability, and low cost to the underlying cloud storage. To satisfy these requirements, this article proposes a hybrid cloud-scale block storage system called Ursa. Trace analysis shows that the I/O patterns served by block storage have only limited locality to exploit. Therefore, instead of using solid state drives (SSDs) as a cache layer, Ursa proposes a hybrid storage structure that stores primary replicas directly on SSDs and backup replicas on hard disk drives (HDDs). At the core of Ursa’s hybrid storage design is an adaptive journal that bridges the performance gap between primary SSDs and backup HDDs for random writes by transforming small backup writes into journal appends, which are then asynchronously replayed and merged to backup HDDs. To efficiently index the journal, we design a novel range-optimized merge-tree structure that combines a continuous range of keys into a single composite key {offset,length}. Ursa integrates the hybrid structure with designs for high reliability, scalability, and availability. Experiments show that Ursa in its hybrid mode achieves almost the same performance as in its SSD-only mode (storing all replicas on SSDs), and outperforms other block stores (Ceph and Sheepdog) even in their SSD-only modes while achieving much higher CPU efficiency (IOPS and throughput per core).
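The composite {offset,length} key can be illustrated with a minimal sketch. This is not Ursa's range-optimized merge-tree, whose layout the abstract does not describe; it is a flat sorted list that only shows how one key can cover a contiguous range of blocks and how a newer journal append supersedes the overlapping part of an older one. The class name `RangeIndex`, its methods, and the block-granular `journal_pos` are all assumptions made for illustration.

```python
class RangeIndex:
    """Hypothetical index from block ranges to journal positions.

    Each entry is (offset, length, journal_pos): blocks
    [offset, offset + length) are stored in the journal starting at
    journal_pos. Entries are kept sorted and non-overlapping.
    """

    def __init__(self):
        self.entries = []

    def insert(self, offset, length, journal_pos):
        """Record a new journal append; newer data wins on overlap."""
        end = offset + length
        kept = []
        for o, l, j in self.entries:
            e = o + l
            if e <= offset or o >= end:
                kept.append((o, l, j))  # no overlap, keep unchanged
                continue
            if o < offset:
                # Left part of the old range is still valid.
                kept.append((o, offset - o, j))
            if e > end:
                # Right part is still valid; shift its journal position.
                kept.append((end, e - end, j + (end - o)))
        kept.append((offset, length, journal_pos))
        kept.sort()
        self.entries = kept

    def query(self, offset, length):
        """Return the journal pieces that cover any part of the range."""
        end = offset + length
        pieces = []
        for o, l, j in self.entries:
            e = o + l
            if e <= offset or o >= end:
                continue
            lo, hi = max(o, offset), min(e, end)
            pieces.append((lo, hi - lo, j + (lo - o)))
        return pieces


idx = RangeIndex()
idx.insert(0, 8, 100)  # blocks 0..7 appended at journal position 100
idx.insert(4, 2, 200)  # blocks 4..5 overwritten at journal position 200
print(idx.entries)       # [(0, 4, 100), (4, 2, 200), (6, 2, 106)]
print(idx.query(3, 4))   # [(3, 1, 103), (4, 2, 200), (6, 1, 106)]
```

The linear scan keeps the sketch short; an index intended for a large journal would need a tree-based structure to avoid touching every entry on each insert or lookup.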

Published In

ACM Transactions on Storage, Volume 19, Issue 4
November 2023
238 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3626486

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 October 2023
Online AM: 08 May 2023
Accepted: 02 April 2023
Revised: 12 February 2023
Received: 05 April 2022
Published in TOS Volume 19, Issue 4

Author Tags

  1. SSD-HDD hybrid
  2. block storage
  3. cloud volume service

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • OS Innovation Lab Project of Xiamen University and Huawei
