Log summarization and anomaly detection for troubleshooting distributed systems

Published: 19 September 2007

Abstract

Today’s system monitoring tools are capable of detecting system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, for example, reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding retry policy decisions, etc. We present an infrastructure for troubleshooting complex middleware, a general-purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze the effectiveness of several algorithms at accurately detecting a variety of performance anomalies.

Published In

GRID '07: Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
September 2007
339 pages
ISBN: 9781424415595

Publisher

IEEE Computer Society

United States
