Automating instrumentation choices for performance problems in distributed applications with VAIF

Authors:

Samantha Puterman,

Ayse K. Coskun,

Raja R. Sambasivan

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

Pages 61 - 75

https://doi.org/10.1145/3472883.3487000

Published: 01 November 2021

Abstract

Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests' traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.
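
The idea of decomposing response-time variance along critical paths can be illustrated with a small sketch. This is not VAIF's implementation: the trace representation (an ordered list of `(tracepoint_name, timestamp)` pairs per request) and the function names `edge_latencies` and `highest_variance_edge` are assumptions made for illustration only. The sketch measures the latency between consecutive trace points on each request's critical path and reports the pair whose latency varies most across requests, which is one plausible place to enable additional instrumentation.

```python
from collections import defaultdict
from statistics import pvariance


def edge_latencies(trace):
    """Map each pair of consecutive trace points to the time elapsed between them.

    `trace` is a hypothetical critical path: an ordered list of
    (tracepoint_name, timestamp_seconds) tuples. The sketch assumes each
    trace point appears at most once per path.
    """
    return {(a[0], b[0]): b[1] - a[1] for a, b in zip(trace, trace[1:])}


def highest_variance_edge(traces):
    """Return (edge, variances), where edge has the largest latency variance.

    Returns (None, {}) when no edge has been observed at least twice.
    """
    samples = defaultdict(list)
    for trace in traces:
        for edge, latency in edge_latencies(trace).items():
            samples[edge].append(latency)
    # Population variance of the latency observed on each edge.
    variances = {e: pvariance(v) for e, v in samples.items() if len(v) > 1}
    if not variances:
        return None, {}
    return max(variances, key=variances.get), variances
```

For example, given three traces in which the `start` to `db_query` latency is always 1 second while the `db_query` to `end` latency is 1, 5, and 2 seconds, the sketch selects `("db_query", "end")` as the region to instrument further.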

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Information & Contributors

Information

Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

November 2021

685 pages

ISBN:9781450386388

DOI:10.1145/3472883

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSF (National Science Foundation)
Red Hat, Inc. (Red Hat Collaboratory at Boston University)

Conference

SoCC '21

SoCC '21: ACM Symposium on Cloud Computing

November 1 - 4, 2021

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 783
Downloads (Last 12 months): 257
Downloads (Last 6 weeks): 32

Reflects downloads up to 16 Feb 2025

Citations

Cited By

Lee I, Zhang Z, Parwal A, Chabbi M (2024). The Tale of Errors in Microservices. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:3 (1-36). Online publication date: 10-Dec-2024. https://dl.acm.org/doi/10.1145/3700436

Toslali M, Qasim S, Parthasarathy S, Oliveira F, Huang H, Stringhini G, Liu Z, Coskun A (2024). Unleashing Performance Insights with Online Probabilistic Tracing. 2024 IEEE International Conference on Cloud Engineering (IC2E) (72-82). Online publication date: 24-Sep-2024. https://doi.org/10.1109/IC2E61754.2024.00015

Kim C, Kim S (2023). Optimizing Logging and Monitoring in Heterogeneous Cloud Environments for IoT and Edge Applications. IEEE Internet of Things Journal 10:24 (22611-22622). Online publication date: 15-Dec-2023. https://doi.org/10.1109/JIOT.2023.3304373

Qasim S, Toslali M, Clark Q, Parthasarathy S, Oliveira F, Liu A, Stringhini G, Coskun A (2023). Poster Paper: Efficient Navigation of Cloud Performance with ’nuffTrace. 2023 IEEE International Conference on Cloud Engineering (IC2E) (226-227). Online publication date: 25-Sep-2023. https://doi.org/10.1109/IC2E59103.2023.00035
