Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/2663370.2663381acmconferencesArticle/Chapter ViewAbstractPublication PagesicseConference Proceedingsconference-collections
research-article

Practical formal correctness checking of million-core problem solving environments for HPC

Published: 18 May 2013 Publication History

Abstract

While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from light-weight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be "crashed early" (at smaller scales of deployment) and "crashed often" (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must "explain well" (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize off-site debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.

References

[1]
M. Berzins, J. Schmidt, Q. Meng, and A. Humphrey, "Past, present, and future scalability of the Uintah software," in Proceedings of the Blue Waters Workshop, 2013, to appear.
[2]
D. L. Brown and P. Messina, "Scientific grand challenges, crosscutting technologies for computing at the exascale," 2010, http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/crosscutting_grand_challenges.pdf.
[3]
J. D. d. S. Germain, J. McCorquodale, S. G. Parker, and C. R. Johnson, "Uintah: A massively parallel problem solving environment," in Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC), 2000, pp. 33--41.
[4]
M. Berzins, J. Luitjens, Q. Meng, T. Harman, C. A. Wight, and J. R. Peterson, "Uintah: A scalable framework for hazard analysis," in Proceedings of the TeraGrid Conference, 2010, pp. 3:1--3:8.
[5]
J. Luitjens and M. Berzins, "Improving the performance of Uintah: A large-scale adaptive meshing computational framework," in Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010, pp. 1--10.
[6]
Q. Meng, J. Luitjens, and M. Berzins, "Dynamic task scheduling for the Uintah framework," in Proceedings of the IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), 2010, pp. 1--10.
[7]
A. Humphrey, Q. Meng, M. Berzins, and T. Harman, "Radiation modeling using the Uintah heterogeneous CPU/GPU runtime system," in Proceedings of the Conference of the Extreme Science and Engineering Discovery Environment (XSEDE): Bridging from the eXtreme to the campus and beyond, 2012, pp. 4:1--4:8.
[8]
G. Gopalakrishnan, R. M. Kirby, S. Siegel, R. Thakur, W. Gropp, E. Lusk, B. R. de Supinski, M. Schulz, and G. Bronevetsky, "Formal analysis of MPI-based parallel programs," Communications of ACM, vol. 54, no. 12, pp. 82--91, Dec. 2011.
[9]
M. Emmi, S. Qadeer, and Z. Rakamarić, "Delay-bounded scheduling," in Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2011, pp. 411--422.
[10]
S. F. Siegel and G. Gopalakrishnan, "Formal analysis of message passing," in International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI), 2007, invited tutorial.
[11]
A. Vo, G. Gopalakrishnan, R. M. Kirby, B. R. de Supinski, M. Schulz, and G. Bronevetsky, "Large scale verification of MPI programs using Lamport clocks with lazy update," in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011, pp. 330--339.
[12]
M. Müller, B. de Supinski, G. Gopalakrishnan, T. Hilbrich, and D. Lecomber, "Dealing with MPI bugs at scale: Best practices, automatic detection, debugging, and formal verification," http://www.cs.utah.edu/fv/publications/sc11_with_handson.pptx, Nov. 2011, full-day tutorial at Supercomputing.
[13]
C.-S. Park, K. Sen, P. Hargrove, and C. Iancu, "Efficient data race detection for distributed memory parallel programs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011, pp. 51:1--51:12.
[14]
S. F. Siegel, "Model checking nonblocking MPI programs," in Proceedings of the International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI), 2007, pp. 44--58.
[15]
S. F. Siegel, A. Mironova, G. S. Avrunin, and L. A. Clarke, "Combining symbolic execution with model checking to verify parallel numerical programs," ACM Transactions on Software Engineering and Methodology, vol. 17, no. 2, pp. 1--34, 2008.
[16]
L. V. Kale and S. Krishnan, "Charm++: Parallel Programming with Message-Driven Objects," in Parallel Programming using C++. MIT Press, 1996, pp. 175--213.
[17]
G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra, "DAGuE: A generic distributed DAG engine for high performance computing," Parallel Computing, vol. 38, no. 1--2, pp. 37--51, 2012.
[18]
D. C. Arnold, D. H. Ahn, B. R. de Supinski, G. L. Lee, B. P. Miller, and M. Schulz, "Stack trace analysis for large scale debugging," in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2007, pp. 1--10.
[19]
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685--701, 2010.
[20]
G. Bronevetsky, I. Laguna, S. Bagchi, B. de Supinski, D. Ahn, and M. Schulz, "AutomaDeD: Automata-based debugging for dissimilar parallel tasks," in Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2010, pp. 231--240.
[21]
C. Cole and R. Williams, "Photoshop scalability: Keeping it simple," in ACM Queue, Sep. 2010, http://queue.acm.org/detail.cfm?id=1858330.
[22]
J. E. Guilkey, T. B. Harman, and B. Banerjee, "An Eulerian-Lagrangian approach for simulating explosions of energetic devices," Comput. Struct., vol. 85, no. 11--14, pp. 660--674, 2007.
[23]
I. Ionescu, J. E. Guilkey, M. Berzins, R. M. Kirby, and J. A. Weiss, "Simulation of soft tissue failure using the material point method," Journal of Biomechanical Engineering, vol. 128, pp. 917--924, 2006.
[24]
J. E. Guilkey, J. B. Hoying, and J. A. Weiss, "Computational modeling of multicellular constructs with the material point method," Journal of Biomechanics, vol. 39, no. 11, pp. 2074--2086, 2006.
[25]
G. Krishnamoorthy, S. Borodai, R. Rawat, J. Spinti, and P. J. Smith, "Numerical modeling of radiative heat transfer in pool fire simulations," in Proceedings of the ASME International Mechanical Engineering Congress and Exposition (IMECE), 2005, pp. 327--337.
[26]
A. Brydon, S. Bardenhagen, E. Miller, and G. Seidler, "Simulation of the densification of real open-celled foam microstructures," Journal of the Mechanics and Physics of Solids, vol. 53, no. 12, pp. 2638--2660, 2005.
[27]
L. V. Kale, E. Bohm, C. L. Mendes, T. Wilmarth, and G. Zheng, "Programming petascale applications with Charm++ and AMPI," in Petascale Computing: Algorithms and Applications, 2008, pp. 421--441.
[28]
Q. Meng, A. Humphrey, and M. Berzins, "The Uintah framework: A unified heterogeneous task scheduling and runtime system," in Proceedings of the International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), 2012.
[29]
M. Schulz and B. R. de Supinski, "PNMPI tools: A whole lot greater than the sum of their parts," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC), 2007, pp. 30:1--30:10.
[30]
"MPI: Performance visualization," http://www.mcs.anl.gov/research/projects/perfvis/software/MPE/index.htm.
[31]
P. Joshi, M. Naik, C.-S. Park, and K. Sen, "CalFuzzer: An extensible active testing framework for concurrent programs," in Proceedings of the International Conference on Computer Aided Verification (CAV), 2009, pp. 675--681.
[32]
M. Musuvathi and S. Qadeer, "Iterative context bounding for systematic testing of multithreaded programs," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2007, pp. 446--455.
[33]
E. M. Clarke, O. Grumberg, and D. A. Peled, Model Checking. MIT Press, 2000.
[34]
M. Sutton, A. Greene, and P. Amini, Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional, 2007.
[35]
P. Godefroid, A. Kiezun, and M. Y. Levin, "Grammar-based whitebox fuzzing," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2008, pp. 206--215.
[36]
V. Ganesh, T. Leek, and M. Rinard, "Taint-based directed whitebox fuzzing," in Proceedings of the International Conference on Software Engineering (ICSE), 2009, pp. 474--484.
[37]
Q. Meng, J. Luitjens, and M. Berzins, "A comparison of load balancing algorithms for AMR in Uintah," Scientific Computing and Imaging Institute, University of Utah, Tech. Rep. UUSCI-2008-006, October 2008.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SE-CSE '13: Proceedings of the 5th International Workshop on Software Engineering for Computational Science and Engineering
May 2013
104 pages
ISBN:9781467362610

Sponsors

Publisher

IEEE Press

Publication History

Published: 18 May 2013

Check for updates

Qualifiers

  • Research-article

Conference

ICSE '13
Sponsor:

Upcoming Conference

ICSE 2025

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 30
    Total Downloads
  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media