Extending a traditional debugger to debug massively parallel applications

Published: 01 May 2004

Abstract

Beowulf systems, and other proprietary approaches, are placing systems with four or more CPUs in the hands of many researchers and commercial users. In the near future, systems with hundreds of CPUs will become commonly available, with some programmers dealing with tens of thousands of CPUs. The debugging methods used on these systems are a combination of the traditional methods used for debugging single processes and ad hoc methods that help the user cope with the multitude of processes. Programmers are usually familiar with a single-process debugger and would like to use it (with minimal user-visible extensions) to debug their distributed program.

We present a set of modifications to a traditional debugger that make it capable of debugging applications running on thousands of processes. Our parallel debugger is composed of individual, fully functional debuggers connected by an n-ary aggregating network. This permits us to present to users the results from every debugger at the same time in an aggregated fashion. Users get a global view of the application and can easily see whether a given parameter has a value that differs from what they expect or from its value in the other processes. Users can then focus on the process sets of interest and investigate the problem.

One challenge when debugging thousands of processes is dealing with the amount of output coming from all the debuggers. We present methods to aggregate the overwhelming amount of output from the debuggers into a more manageable subset, which is presented to the user without losing information.

Experiments show that the debugger scales to thousands of processors. Both the startup mechanism and the response time to user commands scale well. The conclusions presented regarding the architecture and the new parallel debugger's scalability are not specific to the serial debugger used in our example implementation.
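The aggregation idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the fan-out of 4, and the `[ranks] output` display format are all assumptions made for illustration. Each leaf stands in for one per-process debugger; interior aggregators merge identical output lines and record which process ranks produced them, so distinct values stay visible and no output is discarded.

```python
from collections import defaultdict


def compress_ranks(ranks):
    """Render a non-empty set of ranks as compact ranges, e.g. {0,1,2,5} -> '0-2,5'."""
    ordered = sorted(ranks)
    parts = []
    start = prev = ordered[0]
    for r in ordered[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = r
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


def merge(*partials):
    """Combine partial aggregates, each a dict mapping output text -> set of ranks."""
    merged = defaultdict(set)
    for partial in partials:
        for text, ranks in partial.items():
            merged[text] |= ranks
    return dict(merged)


def aggregate_tree(leaf_outputs, fanout=4):
    """Reduce per-process outputs level by level through an n-ary tree.

    leaf_outputs maps a process rank to the text its debugger printed.
    The fan-out value is a hypothetical choice, not taken from the paper.
    """
    level = [{text: {rank}} for rank, text in sorted(leaf_outputs.items())]
    while len(level) > 1:
        level = [merge(*level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else {}


def render(aggregate):
    """Format one line per distinct output, ordered by the lowest rank that produced it."""
    items = sorted(aggregate.items(), key=lambda item: min(item[1]))
    return [f"[{compress_ranks(ranks)}] {text}" for text, ranks in items]


# Eight processes print a variable; rank 5 disagrees with the rest.
outputs = {rank: "x = 0" for rank in range(8)}
outputs[5] = "x = 7"
summary = render(aggregate_tree(outputs))
for line in summary:
    print(line)
```

In this sketch the eight lines of raw output collapse to two, and the outlying process is immediately identifiable by its rank. Because each aggregator only merges the partial results of its children, the work per node depends on the fan-out and the number of distinct outputs rather than on the total process count.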

Published In

Journal of Parallel and Distributed Computing, Volume 64, Issue 5
May 2004
115 pages

Publisher

Academic Press, Inc.

United States

Author Tags

  1. Ladebug™
  2. distributed breakpoints
  3. massively parallel debugging
  4. parallel debugger
  5. parallel debugging
  6. parallel programming

Qualifiers

  • Article

Cited By

  • (2015) Diagnosis of Performance Faults in LargeScale MPI Applications via Probabilistic Progress-Dependence Inference. IEEE Transactions on Parallel and Distributed Systems 26:5 (1280-1289). DOI: 10.1109/TPDS.2014.2314100. Online publication date: 1-May-2015
  • (2014) PGDB. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment (1-7). DOI: 10.1145/2616498.2616535. Online publication date: 13-Jul-2014
  • (2012) Probabilistic diagnosis of performance faults in large-scale parallel applications. Proceedings of the 21st international conference on Parallel architectures and compilation techniques (213-222). DOI: 10.1145/2370816.2370848. Online publication date: 19-Sep-2012
  • (2012) Debugging component-based embedded applications. Proceedings of the 15th International Workshop on Software and Compilers for Embedded Systems (42-51). DOI: 10.1145/2236576.2236581. Online publication date: 15-May-2012
  • (2011) GRace. ACM SIGPLAN Notices 46:8 (135-146). DOI: 10.1145/2038037.1941574. Online publication date: 12-Feb-2011
  • (2011) GRace. Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (135-146). DOI: 10.1145/1941553.1941574. Online publication date: 12-Feb-2011
  • (2010) FlowChecker. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (1-11). DOI: 10.1109/SC.2010.27. Online publication date: 13-Nov-2010
  • (2008) Lessons learned at 208K. Proceedings of the 2008 ACM/IEEE conference on Supercomputing (1-9). DOI: 10.5555/1413370.1413397. Online publication date: 15-Nov-2008
  • (2007) DMTracker. Proceedings of the 2007 ACM/IEEE conference on Supercomputing (1-12). DOI: 10.1145/1362622.1362643. Online publication date: 16-Nov-2007
  • (2007) A debugger for flow graph based parallel applications. Proceedings of the 2007 ACM workshop on Parallel and distributed systems: testing and debugging (14-20). DOI: 10.1145/1273647.1273651. Online publication date: 9-Jul-2007
