Abstract
The Cray X1 Fortran and C/C++ compilers provide a number of loop transformations, notably vectorization and multistreaming, to exploit the hardware resources of the multistreaming processor (MSP) and its high memory bandwidth. A Cray X1 node is composed of four MSPs, each of which is composed of four single-streaming processors (SSPs). Each SSP contains a superscalar processing unit and two vector processing units. Compiler vectorization provides loop-level parallelization and uses the vector processing hardware. Multistreaming code generation by the compiler permits a block of code to execute across the SSPs of an MSP. In this paper, we analyze the overall impact of loop-level compiler optimization on a scientific application, the Parallel Ocean Program (POP). POP has been extensively optimized for the X1 by instrumenting the code with X1 compiler directives. We compare and contrast the automatic and manual optimization schemes available on the X1 and analyze their impact on code performance and scalability. Our results show that adding compiler directives increases the average vector length, thereby significantly improving single-node performance. However, this optimized code scales at a slower rate as the local workload volume decreases and the communication costs increase.
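The two loop-level transformations named in the abstract can be illustrated with a small loop nest. The following is a minimal sketch in portable C, not code from POP and not the output of the Cray compiler. The function names `scale_add_row` and `scale_add_grid` and the grid sizes `NY` and `NX` are hypothetical. The block partitioning of the outer loop only emulates, in serial code, how independent outer iterations might be divided among the four SSPs of an MSP. The inner unit-stride loop is the kind of loop a vectorizing compiler targets.

```c
#include <assert.h>

#define SSPS_PER_MSP 4  /* from the abstract: one MSP contains four SSPs */
#define NY 8            /* hypothetical number of grid rows */
#define NX 64           /* hypothetical number of grid columns */

/* Inner loop: unit stride and no loop-carried dependence, so each
 * iteration is independent. This is a candidate for vectorization. */
static void scale_add_row(double *restrict out, const double *restrict a,
                          const double *restrict b, double alpha, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        out[i] = alpha * a[i] + b[i];
}

/* Outer loop: rows are independent, so they are a candidate for
 * multistreaming. Here the distribution is emulated serially by giving
 * each (hypothetical) SSP a contiguous block of rows. */
static void scale_add_grid(double out[NY][NX], double a[NY][NX],
                           double b[NY][NX], double alpha)
{
    int rows_per_ssp = (NY + SSPS_PER_MSP - 1) / SSPS_PER_MSP;

    for (int ssp = 0; ssp < SSPS_PER_MSP; ++ssp) {
        int begin = ssp * rows_per_ssp;
        int end = begin + rows_per_ssp;
        if (end > NY)
            end = NY;

        for (int j = begin; j < end; ++j)
            scale_add_row(out[j], a[j], b[j], alpha, NX);
    }
}
```

In this sketch a longer inner loop gives the vector hardware more elements per vector operation, which is the sense in which the abstract relates average vector length to single-node performance. Conversely, shrinking `NY` and `NX` per processor leaves less work in each loop, which matches the abstract's observation about scaling as the local workload decreases.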
Keywords
- Message Passing Interface
- Single Instruction Multiple Data
- Loop Transformation
- Loop Fusion
- Vector Instruction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alam, S., Vetter, J. (2005). Performance and Scalability Analysis of Cray X1 Vectorization and Multistreaming Optimization. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J.J. (eds) Computational Science – ICCS 2005. ICCS 2005. Lecture Notes in Computer Science, vol 3514. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428831_38
DOI: https://doi.org/10.1007/11428831_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26032-5
Online ISBN: 978-3-540-32111-8
eBook Packages: Computer Science, Computer Science (R0)