
DOI: 10.5555/3433701.3433787
Research article, SC '20 Conference Proceedings

Live forensics for HPC systems: a case study on distributed storage systems

Published: 09 November 2020

Abstract

Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource-overload failures (e.g., congestion collapse), that impact both systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing them. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
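The abstract describes localizing failing components from system-wide telemetry. The sketch below is a toy illustration of one common ingredient of such frameworks, probe-based failure localization, and is not Kaleidoscope's actual model: each end-to-end probe (e.g., an I/O request) traverses a set of components, and failed probes implicate the components they traversed; all component and probe names here are hypothetical.

```python
# Toy probe-based failure localization (illustrative only; NOT the
# paper's hierarchical ML model). Each probe records the components it
# traversed and whether it succeeded; a component's score is the
# fraction of probes through it that failed.
from collections import defaultdict

def localize(probes):
    """probes: list of (components_traversed, passed) tuples.
    Returns components ranked by estimated failure likelihood."""
    fails = defaultdict(int)
    total = defaultdict(int)
    for components, passed in probes:
        for c in components:
            total[c] += 1
            if not passed:
                fails[c] += 1
    scores = {c: fails[c] / total[c] for c in total}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical probe outcomes over clients, LNET routers, and object
# storage servers: every probe through "oss2" fails, so it ranks first.
probes = [
    (["client1", "lnet1", "oss2"], False),
    (["client2", "lnet1", "oss1"], True),
    (["client1", "lnet2", "oss2"], False),
    (["client1", "lnet2", "oss1"], True),
]
print(localize(probes))
```

This voting-style scoring is the simplest member of a family that, in production systems, is typically replaced by probabilistic inference so that shared-path ambiguity and noisy probes can be handled gracefully.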

Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020, 1454 pages. ISBN: 9781728199986.
Publisher: IEEE Press, in cooperation with IEEE CS.

    Publication History

    Published: 09 November 2020

    Check for updates

    Qualifiers

    • Research-article

    Conference

    SC '20
    Sponsor:

Acceptance Rates

Overall Acceptance Rate: 1,516 of 6,373 submissions (24%)

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 201
      Total Downloads
    • Downloads (Last 12 months)17
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 20 Nov 2024

    Other Metrics

    Citations

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media