DOI: 10.1145/3544216.3544238
research-article

From luna to solar: the evolutions of the compute-to-storage networks in Alibaba cloud

Published: 22 August 2022

Abstract

This paper presents the two generations of storage network stacks that reduced the average I/O latency of Alibaba Cloud's EBS service by 72% in the last five years: Luna, a user-space TCP stack that matches network latency to the speed of SSDs; and Solar, a storage-oriented UDP stack that enables both storage and network hardware acceleration.
Luna is our first step towards a high-speed compute-to-storage network in the "storage disaggregation" architecture. Beyond its substantial performance gains and CPU savings over the legacy kernel TCP stack, it taught us, more importantly, the necessity of offloading both network and storage processing into hardware and the importance of recovering instantaneously from network failures.
Solar provides a highly reliable and performant storage network running on hardware. To stay within the hardware's resource limits and offload storage's entire data path, Solar eliminates the superfluous complexity and excessive state of the traditional storage network architecture. The core design of Solar unifies the concepts of network packet and storage data block: each network packet is a self-contained storage data block. This has three notable advantages. First, it merges the packet processing and storage virtualization pipelines to bypass the CPU and PCIe. Second, since the storage processes data blocks independently, the packets in Solar become independent; therefore, the storage (in hardware) does not need to maintain receive buffers for assembling packets into blocks or for handling packet reordering. Finally, due to its low resource requirements and its resilience to packet reordering, Solar inherently supports large-scale multi-path transport for fast failure recovery. Looking ahead, Solar demonstrates that we can formalize the storage virtualization procedure into a P4-compatible packet processing pipeline. Hence, Solar's design applies directly to commodity DPUs (data processing units).
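
The "one packet, one data block" idea can be illustrated with a minimal Python sketch. It is not Solar's wire format or implementation: the header layout (volume id, block address, CRC32), the 4096-byte block size, and all function names are assumptions made only for illustration. The sketch shows why a receiver that treats every packet as a self-contained block needs no reassembly buffer and is unaffected by reordering.

```python
import random
import struct
import zlib

BLOCK_SIZE = 4096  # assumed block size; the abstract does not state one
# Hypothetical header: volume id (4 B), block address (8 B), CRC32 of payload (4 B).
HEADER = struct.Struct("!IQI")


def make_packet(volume_id: int, block_addr: int, payload: bytes) -> bytes:
    """Pack one data block, plus everything needed to store it, into one packet."""
    assert len(payload) == BLOCK_SIZE
    return HEADER.pack(volume_id, block_addr, zlib.crc32(payload)) + payload


def handle_packet(disk: dict, packet: bytes) -> bool:
    """Store the block carried by one packet, using no state from other packets."""
    volume_id, block_addr, crc = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:]
    if len(payload) != BLOCK_SIZE or zlib.crc32(payload) != crc:
        return False  # damaged packet: reject it; a sender would retransmit
    disk[(volume_id, block_addr)] = payload
    return True


def write_blocks(volume_id: int, first_addr: int, data: bytes) -> list:
    """Split a multi-block write into independent single-block packets."""
    assert len(data) % BLOCK_SIZE == 0
    return [
        make_packet(volume_id, first_addr + i,
                    data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        for i in range(len(data) // BLOCK_SIZE)
    ]


def demo() -> bool:
    """Deliver four distinct blocks out of order and check the stored result."""
    data = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
    packets = write_blocks(volume_id=7, first_addr=100, data=data)
    random.Random(0).shuffle(packets)  # stand-in for reordering across paths
    disk = {}  # toy "disk": (volume id, block address) -> block contents
    for packet in packets:
        assert handle_packet(disk, packet)
    restored = b"".join(disk[(7, 100 + i)] for i in range(4))
    return restored == data
```

Calling `demo()` delivers the packets in shuffled order and still reconstructs the original data, because `handle_packet` keeps no per-connection or per-request state. In this toy model that statelessness is what would let packets of one write travel over different paths.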

Supplementary Material

PDF File (p753-miao-supp.pdf)
Supplemental material.

      Information

      Published In

      SIGCOMM '22: Proceedings of the ACM SIGCOMM 2022 Conference
      August 2022
      858 pages
      ISBN:9781450394208
      DOI:10.1145/3544216
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data processing unit
      2. in-network acceleration
      3. storage network

      Qualifiers

      • Research-article

      Conference

      SIGCOMM '22: ACM SIGCOMM 2022 Conference
      August 22 - 26, 2022
      Amsterdam, Netherlands

      Acceptance Rates

      Overall Acceptance Rate 462 of 3,389 submissions, 14%

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 467
      • Downloads (Last 6 weeks): 47
      Reflects downloads up to 17 Nov 2024

      Cited By

      • (2024) TianMen: a DPU-based storage network offloading structure for disaggregated datacenters. Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 689-703. DOI: 10.1145/3698038.3698528. Online publication date: 20-Nov-2024
      • (2024) Fast Core Scheduling with Userspace Process Abstraction. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 280-295. DOI: 10.1145/3694715.3695976. Online publication date: 4-Nov-2024
      • (2024) Performance Characterization of SmartNIC NVMe-over-Fabrics Target Offloading. Proceedings of the 17th ACM International Systems and Storage Conference, pp. 14-24. DOI: 10.1145/3688351.3689154. Online publication date: 16-Sep-2024
      • (2024) A Survey of RDMA Distributed Storage. Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things, pp. 534-539. DOI: 10.1145/3670105.3670199. Online publication date: 24-May-2024
      • (2024) Improving Virtualized I/O Performance by Expanding the Polled I/O Path of Linux. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, pp. 31-37. DOI: 10.1145/3655038.3665944. Online publication date: 8-Jul-2024
      • (2024) RDMA over Ethernet for Distributed Training at Meta Scale. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 57-70. DOI: 10.1145/3651890.3672233. Online publication date: 4-Aug-2024
      • (2024) RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 71-85. DOI: 10.1145/3651890.3672231. Online publication date: 4-Aug-2024
      • (2024) Triton: A Flexible Hardware Offloading Architecture for Accelerating Apsara vSwitch in Alibaba Cloud. Proceedings of the ACM SIGCOMM 2024 Conference, pp. 750-763. DOI: 10.1145/3651890.3672224. Online publication date: 4-Aug-2024
      • (2024) LoWAR: Enhancing RDMA over Lossy WANs with Transparent Error Correction. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS), pp. 1-10. DOI: 10.1109/IWQoS61813.2024.10682853. Online publication date: 19-Jun-2024
      • (2024) Flagger: Cooperative Acceleration for Large-Scale Cross-Silo Federated Learning Aggregation. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 915-930. DOI: 10.1109/ISCA59077.2024.00071. Online publication date: 29-Jun-2024
