Research article, DOI: 10.1145/2370816.2370848

Probabilistic diagnosis of performance faults in large-scale parallel applications

Published: 19 September 2012

Abstract

Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
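The abstract does not spell out the inference procedure, so the following is only a minimal sketch of the general idea: build a Markov model from an observed sequence of execution states, then use reachability probabilities in that model to guess which task is furthest behind. The state names, the function names (`build_markov_model`, `reach_probability`, `least_progressed`), the step bound, and the scoring rule are all assumptions made for illustration and should not be read as the tool's actual algorithm.

```python
from collections import defaultdict


def build_markov_model(trace):
    """Estimate transition probabilities from one observed sequence of states."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(trace, trace[1:]):
        counts[src][dst] += 1
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


def reach_probability(model, start, target, max_steps=50):
    """Probability of visiting `target` within `max_steps` transitions from `start`."""
    if start == target:
        return 1.0
    dist = {start: 1.0}  # probability mass that has not yet hit the target
    reached = 0.0
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for state, p in dist.items():
            for dst, q in model.get(state, {}).items():
                if dst == target:
                    reached += p * q  # absorb mass that arrives at the target
                else:
                    nxt[dst] += p * q
        dist = nxt
        if not dist:
            break
    return reached


def least_progressed(model, current_states):
    """Pick the task whose current state most likely precedes the others' states.

    Hypothetical scoring rule: for every other task, add the probability of
    moving from my state to theirs and subtract the probability of the reverse.
    """
    def score(task):
        mine = current_states[task]
        return sum(
            reach_probability(model, mine, other)
            - reach_probability(model, other, mine)
            for t, other in current_states.items()
            if t != task
        )
    return max(current_states, key=score)


if __name__ == "__main__":
    # Hypothetical state trace; a real tool would record these at run time.
    trace = ["init", "compute", "send", "recv", "reduce", "finalize"]
    model = build_markov_model(trace)
    # Tasks 0 and 2 wait in "reduce" while task 1 is still in "compute".
    print(least_progressed(model, {0: "reduce", 1: "compute", 2: "reduce"}))
```

In this toy run, task 1 is reported because its state can reach the others' state with high probability while the reverse is impossible in the model. A production tool would also need to handle loops, ties, and states unseen during model construction.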

Published In

PACT '12: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques
September 2012
512 pages
ISBN: 9781450311823
DOI: 10.1145/2370816
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. diagnosis
  2. distributed debugging
  3. fault detection
  4. Markov models
  5. MPI
  6. slicing

Qualifiers

  • Research-article

Conference

PACT '12
Sponsor:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Cited By

  • (2020) Detecting and reproducing error-code propagation bugs in MPI implementations. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 187-201. DOI: 10.1145/3332466.3374515. Online publication date: 19-Feb-2020.
  • (2018) Non-intrusively Avoiding Scaling Problems in and out of MPI Collectives. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 415-424. DOI: 10.1109/IPDPSW.2018.00076. Online publication date: May-2018.
  • (2017) Parastack. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. DOI: 10.1145/3126908.3126938. Online publication date: 12-Nov-2017.
  • (2017) ScrubJay. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. DOI: 10.1145/3126908.3126935. Online publication date: 12-Nov-2017.
  • (2016) Race Condition and Deadlock Detection for Large-Scale Applications. 2016 15th International Symposium on Parallel and Distributed Computing (ISPDC), pages 319-326. DOI: 10.1109/ISPDC.2016.53. Online publication date: 2016.
  • (2015) Debugging high-performance computing applications at massive scales. Communications of the ACM, 58(9):72-81. DOI: 10.1145/2667219. Online publication date: 24-Aug-2015.
  • (2015) Diagnosis of Performance Faults in Large-Scale MPI Applications via Probabilistic Progress-Dependence Inference. IEEE Transactions on Parallel and Distributed Systems, 26(5):1280-1289. DOI: 10.1109/TPDS.2014.2314100. Online publication date: 1-May-2015.
  • (2015) Insights into the Diagnosis of System Failures from Cluster Message Logs. Proceedings of the 2015 11th European Dependable Computing Conference (EDCC), pages 225-232. DOI: 10.1109/EDCC.2015.19. Online publication date: 7-Sep-2015.
  • (2015) ProvErr. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pages 525-534. DOI: 10.1109/CCGrid.2015.86. Online publication date: 4-May-2015.
  • (2015) Message Leak Detection in Debugging Large-Scale Parallel Applications. Proceedings of the 2015 International Conference on Advanced Computing and Applications (ACOMP), pages 82-89. DOI: 10.1109/ACOMP.2015.17. Online publication date: 23-Nov-2015.
