Research article | Open access

ScalaExtrap: Trace-based communication extrapolation for SPMD programs

Published: 04 May 2012

Abstract

Performance modeling for scientific applications is important for assessing potential application performance and for systems procurement in high-performance computing (HPC). Recent progress on communication tracing opens up novel opportunities for communication modeling due to its lossless yet scalable trace collection. Estimating the impact of scaling on communication efficiency nonetheless remains nontrivial due to execution-time variations and exposure to hardware and software artifacts.
This work contributes a fundamentally novel modeling scheme. We synthetically generate the application trace for large numbers of nodes via extrapolation from a set of smaller traces. We devise an innovative approach for topology extrapolation of single program, multiple data (SPMD) codes with stencil or mesh communication. Experimental results show that the extrapolated traces precisely reflect the communication behavior and the performance characteristics at the target scale for both strong and weak scaling applications. The extrapolated trace can subsequently be (a) replayed to assess communication requirements before porting an application, (b) transformed to autogenerate communication benchmarks for various target platforms, and (c) analyzed to detect communication inefficiencies and scalability limitations.
To the best of our knowledge, rapidly obtaining the communication behavior of parallel applications at arbitrary scale, with timed replay available yet without actually executing the application at that scale, is without precedent and has the potential to enable otherwise infeasible system simulation at the exascale level.
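The core idea, deriving a trace for a large node count from several traces collected at small scales, can be illustrated with a sketch. The following Python example is hypothetical and not the authors' implementation: it assumes ranks laid out on a sqrt(P) x sqrt(P) grid with 5-point stencil communication, infers the per-rank neighbor offsets that hold at every observed small scale, and regenerates the neighbor topology at an unobserved larger scale. All function names and the grid layout are illustrative assumptions.

```python
import math

def stencil_neighbors(rank, p):
    """Neighbors of `rank` in a 5-point stencil on a sqrt(p) x sqrt(p) grid.
    Hypothetical ground truth standing in for a collected communication trace."""
    n = int(math.isqrt(p))
    r, c = divmod(rank, n)
    out = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n and 0 <= cc < n:
            out.append(rr * n + cc)
    return out

def infer_offsets(traces):
    """From per-rank neighbor lists at several small scales, recover the
    grid-relative offsets (d_row, d_col) that are observed at every scale.
    Intersecting across scales discards scale-specific coincidences."""
    candidate = None
    for p, neigh in traces.items():
        n = int(math.isqrt(p))
        offs = set()
        for rank, nbrs in neigh.items():
            r, c = divmod(rank, n)
            for nb in nbrs:
                nr, nc = divmod(nb, n)
                offs.add((nr - r, nc - c))
        candidate = offs if candidate is None else candidate & offs
    return candidate

def extrapolate(offsets, p):
    """Synthesize the neighbor map at a larger, unobserved scale p by
    re-instantiating the inferred offsets on the bigger grid."""
    n = int(math.isqrt(p))
    neigh = {}
    for rank in range(p):
        r, c = divmod(rank, n)
        neigh[rank] = [(r + dr) * n + (c + dc)
                       for dr, dc in sorted(offsets)
                       if 0 <= r + dr < n and 0 <= c + dc < n]
    return neigh

# Observe 16- and 64-rank runs, then extrapolate to 1024 ranks.
traces = {p: {r: stencil_neighbors(r, p) for r in range(p)} for p in (16, 64)}
offsets = infer_offsets(traces)
predicted = extrapolate(offsets, 1024)
```

Replaying such an extrapolated topology (for instance, one send/receive pair per neighbor) then approximates the communication structure at the target scale; ScalaExtrap additionally extrapolates message volumes, collectives, and timing, which this sketch omits.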





Published In

ACM Transactions on Programming Languages and Systems  Volume 34, Issue 1
April 2012
225 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/2160910
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 May 2012
Accepted: 01 February 2012
Revised: 01 November 2011
Received: 01 June 2011
Published in TOPLAS Volume 34, Issue 1


Author Tags

  1. Communication
  2. compression
  3. trace extrapolation
  4. tracing

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2021) Extracting clean performance models from tainted programs. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 403--417. DOI: 10.1145/3437801.3441613. Online publication date: 17-Feb-2021.
  • (2021) Lossy Compression of Communication Traces Using Recurrent Neural Networks. IEEE Transactions on Parallel and Distributed Systems, 1--1. DOI: 10.1109/TPDS.2021.3132417. Online publication date: 2021.
  • (2020) ExtraPeak: Advanced Automatic Performance Modeling for HPC Applications. Software for Exascale Computing - SPPEXA 2016-2019, 453--482. DOI: 10.1007/978-3-030-47956-5_15. Online publication date: 31-Jul-2020.
  • (2019) Automatic Instrumentation Refinement for Empirical Performance Modeling. 2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), 40--47. DOI: 10.1109/ProTools49597.2019.00011. Online publication date: Nov-2019.
  • (2018) Chameleon: Online Clustering of MPI Program Traces. 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1102--1112. DOI: 10.1109/IPDPS.2018.00119. Online publication date: May-2018.
  • (2018) Predicting cloud performance for HPC applications before deployment. Future Generation Computer Systems 87:C, 618--628. DOI: 10.1016/j.future.2017.10.048. Online publication date: 1-Oct-2018.
  • (2017) Predicting Cloud Performance for HPC Applications. Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 524--533. DOI: 10.1109/CCGRID.2017.11. Online publication date: 14-May-2017.
  • (2017) Classification of thread profiles for scaling application behavior. Parallel Computing 66, 1--21. DOI: 10.1016/j.parco.2017.04.006. Online publication date: Aug-2017.
  • (2016) Efficient clustering for ultra-scale application tracing. Journal of Parallel and Distributed Computing 98, 25--39. DOI: 10.1016/j.jpdc.2016.08.001. Online publication date: Dec-2016.
  • (2015) Toward More Scalable Off-Line Simulations of MPI Applications. Parallel Processing Letters 25:03, 1541002. DOI: 10.1142/S0129626415410029. Online publication date: Sep-2015.
