Article

lprof: a non-intrusive request flow profiler for distributed systems

Authors:

Muhammad Faizan Ullah,

Michael Stumm

OSDI'14: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation

Pages 629 - 644

Published: 06 October 2014

Abstract

Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. In order to analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.

lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application's binary code to infer how logs can be parsed so that the dispersed and intertwined log entries can be stitched together and associated with specific individual requests.
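To make the idea of stitching interleaved log entries into per-request flows concrete, here is a minimal Python sketch. It is not lprof's implementation: the log format, the `blk_` identifiers, the node names, and the hand-written regular expression are all hypothetical, standing in for the parsing rules that lprof is described as inferring through static analysis. The sketch only groups entries that share an identifier and orders them by timestamp.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log lines gathered from two nodes. The format and the
# blk_ identifiers are invented for illustration only.
LOG_LINES = [
    "2014-10-06 12:00:00,001 node1 Receiving block blk_1001",
    "2014-10-06 12:00:00,004 node1 Receiving block blk_1002",
    "2014-10-06 12:00:00,010 node2 Received block blk_1001 of size 512",
    "2014-10-06 12:00:00,015 node1 Closing block blk_1001",
    "2014-10-06 12:00:00,020 node2 Received block blk_1002 of size 128",
]

# Assumption: a hand-written pattern that plays the role of a parsing
# rule derived from the application's log statements.
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<node>\S+) (?P<msg>.*?(?P<req_id>blk_\d+).*)$"
)

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def stitch(lines):
    """Group log entries by shared identifier and order each group by time."""
    flows = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # entry carries no recognizable identifier
        timestamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        flows[match.group("req_id")].append(
            (timestamp, match.group("node"), match.group("msg"))
        )
    for events in flows.values():
        events.sort(key=lambda event: event[0])
    return dict(flows)


def latency_ms(events):
    """Whole milliseconds between the first and last entry of one flow."""
    return (events[-1][0] - events[0][0]) // timedelta(milliseconds=1)


if __name__ == "__main__":
    for req_id, events in sorted(stitch(LOG_LINES).items()):
        nodes = [node for _, node, _ in events]
        print(req_id, nodes, f"{latency_ms(events)} ms")
```

On the sample data, the two interleaved requests separate into one flow per identifier, with latencies of 14 ms and 16 ms. A real system would also need to handle entries that carry no identifier and clocks that differ across nodes, which this sketch ignores.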

We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof's precision in request extraction is 88%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.

Index Terms

Software and its engineering > Software creation and management > Software verification and validation > Software defect analysis

Published In

OSDI'14: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation

October 2014

676 pages

ISBN:9781931971164

Program Chairs: Jason Flinn (University of Michigan), Hank Levy (University of Washington)

Sponsors

USENIX Association

In-Cooperation

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

USENIX Association

United States

Cited By

Neves F, Vilaça R, Pereira J (2021). Detailed black-box monitoring of distributed systems. ACM SIGAPP Applied Computing Review 21:1, pages 24-36. Online publication date: 20-Jul-2021.
https://dl.acm.org/doi/10.1145/3477133.3477135
Zhai E, Chen A, Piskac R, Balakrishnan M, Tian B, Song B, Zhang H, Bhagwan R, Porter G (2020). Check before you change. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pages 575-590. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388285
Glasbergen B, Abebe M, Daudjee K, Levi A (2020). Sentinel. Proceedings of the VLDB Endowment 13:12, pages 2720-2733. Online publication date: 14-Sep-2020.
https://dl.acm.org/doi/10.14778/3407790.3407856
Christophe L, De Roover C, Boix E, De Meuter W (2020). Orchestrating dynamic analyses of distributed processes for full-stack JavaScript programs. ACM SIGPLAN Notices 53:9, pages 107-118. Online publication date: 7-Apr-2020.
https://dl.acm.org/doi/10.1145/3393934.3278135
Wu Y, Chen A, Phan L, Lorch J, Yu M (2019). Zeno. Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, pages 395-420. Online publication date: 26-Feb-2019.
https://dl.acm.org/doi/10.5555/3323234.3323268
Las-Casas P, Papakerashvili G, Anand V, Mace J (2019). Sifter. Proceedings of the ACM Symposium on Cloud Computing, pages 312-324. Online publication date: 20-Nov-2019.
https://dl.acm.org/doi/10.1145/3357223.3362736
Winter J, Aniche M, Cito J, Deursen A, Dumas M, Pfahl D, Apel S, Russo A (2019). Monitoring-aware IDEs. Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 420-431. Online publication date: 12-Aug-2019.
https://dl.acm.org/doi/10.1145/3338906.3338926
Pi A, Chen W, Wang S, Zhou X, Weissman J, Butt A, Smirni E (2019). Semantic-aware Workflow Construction and Analysis for Distributed Data Analytics Systems. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pages 255-266. Online publication date: 17-Jun-2019.
https://dl.acm.org/doi/10.1145/3307681.3325404
Schipper D, Aniche M, van Deursen A, Storey M, Adams B, Haiduc S (2019). Tracing back log data to its log statement. Proceedings of the 16th International Conference on Mining Software Repositories, pages 545-549. Online publication date: 26-May-2019.
https://dl.acm.org/doi/10.1109/MSR.2019.00081
Bhagwan R, Kumar R, Maddila C, Philip A, Arpaci-Dusseau A, Voelker G (2018). Orca. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pages 493-509. Online publication date: 8-Oct-2018.
https://dl.acm.org/doi/10.5555/3291168.3291205
