DOI: 10.1145/3373360.3380835

Grasp the Root Causes in the Data Plane: Diagnosing Latency Problems with SpiderMon

Published: 04 March 2020

Abstract

Unexplained performance degradation is one of the most severe problems in data center networks. The increasing scale of the network makes it even harder to maintain good performance for all users with a low-cost solution. Our system, SpiderMon, monitors network performance and debugs performance failures inside the network with little overhead. SpiderMon provides a two-phase solution that runs in the data plane. In the monitoring phase, it keeps track of the performance of every flow in the network; upon detecting a performance problem, it triggers a debugging phase that uses a causality analyzer to find the root cause of the degradation. To implement these two phases, SpiderMon exploits the capabilities of high-speed programmable switches (e.g., per-packet monitoring, stateful memory). We prototype SpiderMon using the BMv2 model of P4, and our preliminary evaluation shows that SpiderMon is able to quickly find the root cause of performance degradation problems with minimal overhead. SpiderMon achieves nearly zero overhead during the monitoring phase and efficiently collects relevant data from switches during the debugging phase.
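
The two-phase structure described in the abstract can be illustrated with a small Python sketch. This is not SpiderMon's implementation or its causality analyzer: the `ToySwitch` class, the fixed latency threshold, and the packet-count ranking in `diagnose` are all hypothetical stand-ins, chosen only to show monitoring that stays local until a trigger fires, followed by on-demand collection.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToySwitch:
    """Hypothetical switch model: keeps per-flow state and reports only on demand."""
    name: str
    latency_threshold_us: float  # assumed trigger threshold, not taken from the paper
    # flow id -> queuing delays observed at this switch (microseconds)
    delays: dict = field(default_factory=lambda: defaultdict(list))

    def on_packet(self, flow_id: str, queuing_delay_us: float) -> bool:
        """Monitoring phase: record locally; return True if a problem is detected."""
        self.delays[flow_id].append(queuing_delay_us)
        return queuing_delay_us > self.latency_threshold_us

    def collect(self) -> dict:
        """Debugging phase: export per-flow packet counts from switch state."""
        return {flow: len(samples) for flow, samples in self.delays.items()}


def diagnose(switches_on_path):
    """Toy stand-in for a causality analyzer (an assumption of this sketch):
    rank flows by the number of packets they contributed on the victim's path."""
    totals = defaultdict(int)
    for switch in switches_on_path:
        for flow, count in switch.collect().items():
            totals[flow] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Usage: a bulk flow fills the queue, then a victim packet sees high delay.
s1 = ToySwitch("s1", latency_threshold_us=100.0)
for _ in range(50):
    s1.on_packet("bulk", 20.0)           # below threshold, nothing is exported
triggered = s1.on_packet("victim", 250.0)  # exceeds threshold, triggers debugging
ranking = diagnose([s1]) if triggered else []
```

In this sketch no data leaves the switch during monitoring; state is exported only after a trigger, which mirrors the low-overhead monitoring and targeted collection that the abstract claims.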

        Information & Contributors

        Information

        Published In

        SOSR '20: Proceedings of the Symposium on SDN Research
        March 2020
        151 pages
        ISBN: 9781450371018
        DOI: 10.1145/3373360
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. P4
        2. Performance diagnosis
        3. in-network telemetry
        4. network provenance

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        SOSR '20: Symposium on SDN Research
        March 3, 2020
        San Jose, CA, USA

        Acceptance Rates

        Overall Acceptance Rate 7 of 43 submissions, 16%

        Cited By

        • (2023) In-Network Probabilistic Monitoring Primitives under the Influence of Adversarial Network Inputs. Proceedings of the 7th Asia-Pacific Workshop on Networking, pages 116-122. DOI: 10.1145/3600061.3600086. Online publication date: 29-Jun-2023
        • (2023) Advancing SDN from OpenFlow to P4: A Survey. ACM Computing Surveys, 55:9, pages 1-37. DOI: 10.1145/3556973. Online publication date: 16-Jan-2023
        • (2023) MALT: Fine-Grained Microservice Profiling for Request Latency Anomaly Localization. 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), pages 114-121. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00025. Online publication date: 17-Dec-2023
        • (2023) A Survey on Rerouting Techniques with P4 Programmable Data Plane Switches. Computer Networks: The International Journal of Computer and Telecommunications Networking, 230:C. DOI: 10.1016/j.comnet.2023.109795. Online publication date: 1-Jul-2023
        • (2023) OpenFlow, P4, Opflex and I2RS. Cloud and Edge Networking, pages 113-126. DOI: 10.1002/9781394257461.ch7. Online publication date: 6-Dec-2023
        • (2022) A survey on TCP enhancements using P4-programmable devices. Computer Networks: The International Journal of Computer and Telecommunications Networking, 212:C. DOI: 10.1016/j.comnet.2022.109030. Online publication date: 27-Jun-2022
        • (2021) An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends. IEEE Access, 9, pages 87094-87155. DOI: 10.1109/ACCESS.2021.3086704. Online publication date: 2021
