DOI: 10.5555/2685048.2685099

lprof: a non-intrusive request flow profiler for distributed systems

Published: 06 October 2014 Publication History

Abstract

Applications implementing cloud services, such as HDFS, Hadoop YARN, Cassandra, and HBase, are mostly built as distributed systems designed to scale. To analyze and debug the performance of these systems effectively and efficiently, it is essential to understand the performance behavior of service requests, both in aggregate and individually.
lprof is a profiling tool that automatically reconstructs the execution flow of each request in a distributed application. In contrast to existing approaches that require instrumentation, lprof infers the request flow entirely from runtime logs and thus does not require any modifications to source code. lprof first statically analyzes an application's binary code to infer how logs can be parsed, so that the dispersed and intertwined log entries can be stitched together and associated with specific individual requests.
We validate lprof using the four widely used distributed services mentioned above. Our evaluation shows lprof's precision in request extraction is 88%, and lprof is helpful in diagnosing 65% of the sampled real-world performance anomalies.
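
The stitching idea in the abstract can be illustrated with a minimal Python sketch. This is not lprof's algorithm: lprof infers how to parse and associate log entries from static analysis of the binary, whereas the sketch below simply assumes a known log layout (`<ISO timestamp> <node> <message>`) and a hand-written identifier pattern (`blk_<digits>`). The sample log lines, the pattern, and the function names `stitch` and `latency_ms` are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed log layout (hypothetical): "<ISO timestamp> <node> <message>".
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<node>\S+) (?P<msg>.*)$")
# Assumed request identifier pattern (hypothetical). lprof would derive
# such information from static analysis instead of a hand-written regex.
ID_RE = re.compile(r"\bblk_\d+\b")


def stitch(lines):
    """Group interleaved log entries from several nodes by request identifier."""
    flows = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # line does not follow the assumed layout
        ident = ID_RE.search(m["msg"])
        if ident is None:
            continue  # entry carries no request identifier
        ts = datetime.fromisoformat(m["ts"])
        flows[ident.group()].append((ts, m["node"], m["msg"]))
    # Order each request's entries by timestamp to recover its flow.
    for entries in flows.values():
        entries.sort(key=lambda e: e[0])
    return dict(flows)


def latency_ms(entries):
    """Elapsed time between the first and last entry of one request."""
    return (entries[-1][0] - entries[0][0]).total_seconds() * 1000.0


# Made-up, interleaved entries from two nodes.
logs = [
    "2014-10-06T12:00:00.000 node1 Receiving block blk_1001",
    "2014-10-06T12:00:00.010 node2 Receiving block blk_2002",
    "2014-10-06T12:00:00.040 node2 Received block blk_1001",
    "2014-10-06T12:00:00.025 node1 Received block blk_2002",
    "2014-10-06T12:00:00.030 node1 heartbeat sent",
]

flows = stitch(logs)
for request_id, entries in sorted(flows.items()):
    nodes = [node for _, node, _ in entries]
    print(request_id, nodes, latency_ms(entries))
```

Grouping by a shared identifier and sorting by timestamp is the simplest possible association rule; it assumes synchronized clocks and a single identifier per request, neither of which the abstract claims.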

Published In

OSDI'14: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation
October 2014
676 pages
ISBN:9781931971164

Sponsors

  • USENIX Assoc: USENIX Assoc

Publisher

USENIX Association

United States
