research-article

Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference

Authors:

Ignacio Laguna,

Bronis R. de Supinski,

Saurabh Bagchi,

Todd Gamblin

IEEE Transactions on Parallel and Distributed Systems, Volume 26, Issue 5

Pages 1280 - 1289

https://doi.org/10.1109/TPDS.2014.2314100

Published: 01 May 2015

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about the root causes of failures. Moreover, most debuggers significantly slow program execution and run sluggishly on massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents the tasks' control-flow behavior. Compared with previous work, our algorithm infers the least-progressed task more precisely. We combine this technique with static backward slicing analysis, further isolating the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation, a bug that manifests only at 7,996 or more tasks. We extensively evaluate the fault coverage of our technique via fault injections in 10 HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
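To make the idea of progress dependence more concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes each task's control flow can be summarized as a sequence of named states, builds a Markov model from those sequences, and treats a task as least-progressed when its current state is at least as likely to lead to every other task's state as the reverse. The function names (`build_markov_model`, `reach_probability`, `least_progressed`), the state labels, and the comparison rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def build_markov_model(traces):
    """Estimate transition probabilities from observed state sequences.

    Assumption: each trace is a list of control-flow state labels for one task.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


def reach_probability(model, source, target, sweeps=200):
    """Approximate the probability of eventually reaching target from source.

    Uses a simple fixed-point iteration on the hitting-probability equations.
    """
    if source == target:
        return 1.0
    states = set(model) | {d for dsts in model.values() for d in dsts}
    hit = {s: 0.0 for s in states}
    hit[target] = 1.0
    for _ in range(sweeps):
        for s in states:
            if s != target:
                hit[s] = sum(p * hit[d] for d, p in model.get(s, {}).items())
    return hit.get(source, 0.0)


def least_progressed(model, current_state):
    """Return tasks whose state tends to precede every other task's state.

    Hypothetical rule: task t (in state s) is least-progressed if, for every
    other observed state o, reaching o from s is at least as likely as
    reaching s from o.
    """
    result = []
    for task, state in current_state.items():
        precedes_all = all(
            reach_probability(model, other, state)
            <= reach_probability(model, state, other)
            for other in current_state.values()
            if other != state
        )
        if precedes_all:
            result.append(task)
    return sorted(result)


# Illustrative traces and task states (made up for this sketch).
traces = [
    ["init", "compute", "send", "recv", "finalize"],
    ["init", "compute", "recv", "finalize"],
]
model = build_markov_model(traces)
stuck_at = {0: "recv", 1: "compute", 2: "recv"}
print(least_progressed(model, stuck_at))
```

In this made-up scenario, tasks 0 and 2 are waiting in `recv` while task 1 is still in `compute`, so the sketch singles out task 1 as the one the others are likely waiting on. A real model with loops would need additional care, since states inside a loop can reach each other in both directions.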



Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 26, Issue 5

May 2015

291 pages

ISSN: 1045-9219

Copyright © 2014.

Publisher

IEEE Press
