DOI: 10.1145/3542637.3542643
research-article
Open access

LinkGuardian: Mitigating the impact of packet corruption loss with link-local retransmission

Published: 07 November 2023

Abstract

Packet corruption loss is a serious problem in datacenter networks. A large-scale study by Microsoft reported that the number of packets lost due to corruption is comparable to those lost due to congestion. Previous attempts to mitigate the impact of packet corruption loss seek to avoid the faulty links by routing around them, at the cost of reduced link capacities and disruption to the rest of the network.
In this paper, we investigate the feasibility and tradeoffs of the classical loss recovery strategy of link-local retransmissions in the context of datacenter networks. We present the design and implementation of LinkGuardian, a dataplane-based protocol that detects the packets lost due to corruption and simply retransmits them out-of-order. Our preliminary results show that a naïve out-of-order retransmission strategy is effective in mitigating the impact of packet corruption loss for both throughput-sensitive and latency-sensitive flows. Our long-term goal is to extend LinkGuardian so that the end hosts can be made completely oblivious to packet corruption losses in the network.
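
As a rough illustration of the idea summarized above, the following Python sketch simulates link-local retransmission on a single hop. It is not the authors' implementation, which runs in a switch dataplane; it only assumes a scheme in which the sender tags each frame with a per-link sequence number and keeps a copy, and the receiver infers loss from a gap in sequence numbers and asks for the missing frames. All names here (`LinkSender`, `LinkReceiver`, `run_link`, `buffer_size`) are hypothetical, and the model further assumes that corrupted frames are silently dropped and that retransmitted frames arrive intact.

```python
from collections import OrderedDict


class LinkSender:
    """Upstream end of one link: tags frames and keeps copies for retransmission."""

    def __init__(self, buffer_size=64):
        self.next_seq = 0
        self.buffer = OrderedDict()  # seq -> payload, oldest first
        self.buffer_size = buffer_size  # hypothetical bound on buffered copies

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer[seq] = payload
        if len(self.buffer) > self.buffer_size:
            self.buffer.popitem(last=False)  # evict the oldest copy
        return (seq, payload)

    def retransmit(self, seqs):
        # Resend only the frames that are still buffered.
        return [(s, self.buffer[s]) for s in seqs if s in self.buffer]


class LinkReceiver:
    """Downstream end of one link: detects gaps and forwards frames as they arrive."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, frame):
        seq, payload = frame
        # A jump past the expected sequence number means earlier frames were lost.
        missing = list(range(self.expected, seq))
        # Forward immediately; there is no reordering buffer in this sketch.
        self.delivered.append(payload)
        if seq >= self.expected:
            self.expected = seq + 1
        return missing


def run_link(payloads, corrupted):
    """Send payloads over one link; frames whose seq is in `corrupted` are lost."""
    sender, receiver = LinkSender(), LinkReceiver()
    for p in payloads:
        frame = sender.send(p)
        if frame[0] in corrupted:
            continue  # frame corrupted on the wire and dropped before delivery
        missing = receiver.receive(frame)
        for rtx in sender.retransmit(missing):
            receiver.receive(rtx)  # recovered frames arrive out of order
    return receiver.delivered


print(run_link(["a", "b", "c", "d"], corrupted={1}))
```

With the second frame corrupted, the sketch delivers `a`, `c`, `b`, `d`: the lost frame is recovered on the same link but arrives after its successor. A gap-based detector like this one cannot notice the loss of the final frame in a burst, because no later frame reveals the gap; a real design would need an additional mechanism for that case.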


Cited By

  • (2023) A Fine-Grained Packet Loss Management System Based on Programmable Network. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 791-798. DOI: 10.1109/ICPADS60453.2023.00119. Online publication date: 17-Dec-2023

Published In

APNet '22: Proceedings of the 6th Asia-Pacific Workshop on Networking
July 2022
110 pages
ISBN: 9781450397483
DOI: 10.1145/3542637
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. link failure
2. packet corruption
3. programmable switches
4. retransmission

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

APNet 2022

Acceptance Rates

Overall Acceptance Rate 50 of 118 submissions, 42%
