Abstract
We present a modeling framework to accurately predict the time required to run dense linear algebra calculations. We report the framework's accuracy in a number of varied computational environments, such as shared-memory multicore systems, clusters, and large supercomputing installations with tens of thousands of cores. We also test its accuracy for various algorithms, each of which has different scaling properties and a different tolerance for low-bandwidth/high-latency interconnects. The predictive accuracy is very good, on the order of the measurement accuracy, which makes the method suitable for both dedicated and non-dedicated environments. We also present a practical application of our model: reducing the time required to tune and optimize large parallel runs whose time is dominated by linear algebra computations. We show practical examples of how to apply the methodology to avoid common pitfalls and to reduce the influence of measurement errors and inherent performance variability.
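The core idea of combining partial execution with performance modeling can be illustrated with a minimal sketch: time a few short runs at small problem sizes, fit a low-order polynomial cost model by least squares, and extrapolate to the large run one wishes to tune. This is only an illustrative toy, not the paper's actual model; the problem sizes, timings, and cubic model form below are assumptions chosen for demonstration.

```python
import numpy as np

# Hypothetical "partial execution" timings: a few short runs at small
# matrix orders. The numbers are synthetic, chosen only for illustration.
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0])  # matrix orders timed
times = np.array([0.05, 0.35, 2.60, 20.1])          # measured seconds (made up)

# Assume a simple cubic cost model for a dense factorization,
# t(n) ~ a*n^3 + b*n^2 + c*n, and fit its coefficients by least squares.
A = np.vstack([sizes**3, sizes**2, sizes]).T
coeffs, *_ = np.linalg.lstsq(A, times, rcond=None)

def predict(n):
    """Predicted run time in seconds for matrix order n."""
    return coeffs @ np.array([n**3, n**2, n])

# Extrapolate to a problem size too expensive to time directly while tuning.
print(predict(16000.0))
```

In practice one would average repeated timings at each size to damp measurement noise and performance variability, which is one of the pitfalls the methodology addresses.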
This research was supported by DARPA through ORNL subcontract 4000075916, as well as by NSF through award number 1038814. We would also like to thank Patrick Worley from ORNL for facilitating the large-scale runs on Jaguar's Cray XT4 partition.
© 2012 Springer-Verlag Berlin Heidelberg
Luszczek, P., Dongarra, J. (2012). Reducing the Time to Tune Parallel Dense Linear Algebra Routines with Partial Execution and Performance Modeling. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2011. Lecture Notes in Computer Science, vol 7203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31464-3_74
DOI: https://doi.org/10.1007/978-3-642-31464-3_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31463-6
Online ISBN: 978-3-642-31464-3