DOI: 10.1145/3472883.3487000
research-article
Public Access

Automating instrumentation choices for performance problems in distributed applications with VAIF

Published: 01 November 2021

Abstract

Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests' traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.
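
The abstract's central idea, directing instrumentation toward the critical-path portions of traces that contribute the most response-time variance, can be illustrated with a minimal Python sketch. This is not VAIF's implementation. The trace representation (a dict mapping a critical-path segment name to its latency in milliseconds), the segment names, and the function `rank_segments_by_variance` are hypothetical, and the sketch ignores covariance between segments.

```python
from collections import defaultdict
from statistics import pvariance


def rank_segments_by_variance(traces):
    """Rank critical-path segments by latency variance, highest first.

    `traces` is a hypothetical representation: one dict per request,
    mapping a segment name (the span between two adjacent tracepoints
    on the critical path) to its observed latency in milliseconds.
    """
    samples = defaultdict(list)
    for trace in traces:
        for segment, latency_ms in trace.items():
            samples[segment].append(latency_ms)

    # Population variance of each segment's latency across requests.
    # Segments seen fewer than twice carry no variance information.
    variances = {
        segment: pvariance(values)
        for segment, values in samples.items()
        if len(values) >= 2
    }
    return sorted(variances.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative data only: three requests of the same type.
traces = [
    {"api->scheduler": 10.0, "scheduler->compute": 50.0},
    {"api->scheduler": 11.0, "scheduler->compute": 150.0},
    {"api->scheduler": 9.0, "scheduler->compute": 100.0},
]

ranked = rank_segments_by_variance(traces)
for segment, variance in ranked:
    print(f"{segment}: variance = {variance:.2f} ms^2")
```

Under these assumptions, the highest-variance segment (here the hypothetical `scheduler->compute`) would be the first region in which to enable additional tracepoints, while low-variance segments could stay uninstrumented.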

Supplementary Material

MP4 File (Day1_Session2_Order_1_VAIF.mp4)
Presentation video


Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
November 2021
685 pages
ISBN: 9781450386388
DOI: 10.1145/3472883
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed systems
  2. distributed tracing
  3. logging
  4. performance

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '21: ACM Symposium on Cloud Computing
November 1-4, 2021
Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 257
  • Downloads (Last 6 weeks): 32
Reflects downloads up to 16 Feb 2025

Citations

Cited By

  • (2024) The Tale of Errors in Microservices. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:3 (1-36). DOI: 10.1145/3700436. Online publication date: 10-Dec-2024
  • (2024) Unleashing Performance Insights with Online Probabilistic Tracing. 2024 IEEE International Conference on Cloud Engineering (IC2E) (72-82). DOI: 10.1109/IC2E61754.2024.00015. Online publication date: 24-Sep-2024
  • (2023) Optimizing Logging and Monitoring in Heterogeneous Cloud Environments for IoT and Edge Applications. IEEE Internet of Things Journal 10:24 (22611-22622). DOI: 10.1109/JIOT.2023.3304373. Online publication date: 15-Dec-2023
  • (2023) Poster Paper: Efficient Navigation of Cloud Performance with 'nuffTrace. 2023 IEEE International Conference on Cloud Engineering (IC2E) (226-227). DOI: 10.1109/IC2E59103.2023.00035. Online publication date: 25-Sep-2023
