DOI: 10.1145/3694715.3695945
SWARM: Replicating Shared Disaggregated-Memory Data in No Time

Published: 15 November 2024

Abstract

Memory disaggregation is an emerging data center architecture that improves resource utilization and scalability. Replication is key to ensuring the fault tolerance of applications, but replicating shared data in disaggregated memory is hard. We propose SWARM (Swift WAit-free Replication in disaggregated Memory), the first replication scheme for shared objects in disaggregated memory that provides (1) single-roundtrip reads and writes in the common case, (2) strong consistency (linearizability), and (3) strong liveness (wait-freedom). SWARM makes two independent contributions. The first is Safe-Guess, a novel wait-free replication protocol with single-roundtrip operations. The second is In-n-Out, a novel technique that provides conditional atomic update and atomic retrieval of large buffers in disaggregated memory in one roundtrip. Using SWARM, we build SWARM-KV, a low-latency, strongly consistent, and highly available disaggregated key-value store. We evaluate SWARM-KV and find that it has marginal latency overhead compared to an unreplicated key-value store, and that it offers much lower latency and better availability than FUSEE, a state-of-the-art replicated disaggregated key-value store.
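
For context on why single-roundtrip operations matter, the sketch below simulates a classic majority-quorum replicated register in which every read and every write costs two roundtrips. This is not SWARM, Safe-Guess, or In-n-Out; it is a textbook-style baseline offered only as a point of contrast. All names (`Replica`, `QuorumRegister`, `roundtrips`) are hypothetical, and the replicas are plain in-memory objects rather than remote memory nodes.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

# Assumption: timestamps are (counter, writer_id) pairs compared lexicographically.
Timestamp = Tuple[int, int]


@dataclass
class Replica:
    """One simulated memory node holding a timestamped copy of the value."""
    ts: Timestamp = (0, 0)
    value: Any = None
    alive: bool = True

    def load(self) -> Tuple[Timestamp, Any]:
        return self.ts, self.value

    def store_if_newer(self, ts: Timestamp, value: Any) -> None:
        # Keep only the copy with the highest timestamp.
        if ts > self.ts:
            self.ts, self.value = ts, value


class QuorumRegister:
    """Client-side logic of a two-phase, majority-quorum register (baseline only)."""

    def __init__(self, replicas: List[Replica], client_id: int) -> None:
        self.replicas = replicas
        self.client_id = client_id
        self.roundtrips = 0  # number of phases this client has executed

    def _live_majority(self) -> List[Replica]:
        # Simplification: contact every live replica instead of the first majority to reply.
        live = [r for r in self.replicas if r.alive]
        if len(live) < len(self.replicas) // 2 + 1:
            raise RuntimeError("no majority of replicas reachable")
        return live

    def _query(self) -> Tuple[Timestamp, Any]:
        """Phase 1: fetch (timestamp, value) pairs and keep the newest."""
        self.roundtrips += 1
        replies = [r.load() for r in self._live_majority()]
        return max(replies, key=lambda reply: reply[0])

    def _update(self, ts: Timestamp, value: Any) -> None:
        """Phase 2: push a timestamped value to the replicas."""
        self.roundtrips += 1
        for r in self._live_majority():
            r.store_if_newer(ts, value)

    def write(self, value: Any) -> None:
        (counter, _), _ = self._query()                      # roundtrip 1: learn newest timestamp
        self._update((counter + 1, self.client_id), value)   # roundtrip 2: install the value

    def read(self) -> Any:
        ts, value = self._query()   # roundtrip 1: find the newest value
        self._update(ts, value)     # roundtrip 2: write it back so later reads cannot go backwards
        return value


replicas = [Replica() for _ in range(3)]
alice = QuorumRegister(replicas, client_id=1)
bob = QuorumRegister(replicas, client_id=2)

alice.write("v1")
replicas[0].alive = False   # one of three replicas fails; a majority remains
result = bob.read()
```

In this baseline both `alice.write` and `bob.read` take two phases each. The abstract's claim is that SWARM completes reads and writes in one roundtrip in the common case while remaining linearizable and wait-free; how it does so is not shown here.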



      Published In

      SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles
      November 2024
      765 pages
      ISBN: 9798400712517
      DOI: 10.1145/3694715
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Sponsors

      In-Cooperation

      • USENIX

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Conference

      SOSP '24

      Acceptance Rates

      SOSP '24 Paper Acceptance Rate 43 of 245 submissions, 18%;
      Overall Acceptance Rate 174 of 961 submissions, 18%

