research-article

Probabilistic diagnosis of performance faults in large-scale parallel applications

Authors:

Ignacio Laguna,

Bronis R. de Supinski,

Saurabh Bagchi,

Todd Gamblin

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Pages 213 - 222

https://doi.org/10.1145/2370816.2370848

Published: 19 September 2012

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
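
The idea of inferring the least progressed task from a Markov model of execution history can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the state labels, and the rule used to compare two states (one state is "behind" another when the model makes it more likely to reach the other state than the reverse) are assumptions chosen for illustration, and the sketch omits the program slicing step entirely.

```python
from collections import defaultdict

def build_markov_model(traces):
    """Estimate transition probabilities from per-task state sequences.

    Each trace is a list of hypothetical state labels (for example, the
    names of the MPI calls a task passed through).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model

def reach_probability(model, start, goal, max_steps=50):
    """Probability of visiting `goal` within `max_steps` steps from `start`."""
    if start == goal:
        return 1.0
    dist = {start: 1.0}   # probability mass that has not yet reached goal
    reached = 0.0
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for state, p in dist.items():
            for dst, q in model.get(state, {}).items():
                if dst == goal:
                    reached += p * q   # goal is treated as absorbing
                else:
                    nxt[dst] += p * q
        dist = nxt
        if not dist:
            break
    return reached

def least_progressed(model, current_states):
    """Return the tasks whose current state is not ahead of any other state.

    Assumed rule: state s is ahead of state o when o is more likely to
    lead to s than s is to lead to o.
    """
    states = set(current_states.values())
    behind = set()
    for s in states:
        ahead_of_something = any(
            reach_probability(model, o, s) > reach_probability(model, s, o)
            for o in states if o != s
        )
        if not ahead_of_something:
            behind.add(s)
    return sorted(t for t, s in current_states.items() if s in behind)

# Hypothetical history: four tasks that followed the same call sequence.
history = [["init", "compute", "send", "recv", "finalize"]] * 4
model = build_markov_model(history)

# Hypothetical snapshot taken when the application stops making progress.
snapshot = {0: "recv", 1: "compute", 2: "recv", 3: "send"}
stalled = least_progressed(model, snapshot)
```

In this toy snapshot, task 1 is still in `compute` while the others wait in later states, so it is the only task reported as least progressed. With loops in the model, several states can be mutually reachable and the comparison may return more than one task.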

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Information & Contributors

Information

Published In

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

September 2012

512 pages

ISBN:9781450311823

DOI:10.1145/2370816

General Chairs:
Pen-Chung Yew (University of Minnesota)
Sangyeun Cho (University of Pittsburgh)

Program Chairs:
Luiz DeRose (Cray, Inc.)
David J. Lilja (University of Minnesota)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP WG 10.3
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '12

PACT '12: International Conference on Parallel Architectures and Compilation Techniques

September 19 - 23, 2012

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
