Abstract
This paper presents an approach to adapting double-precision matrix multiplication to the architecture of Cell processors. On a single SPE, the algorithm is built around the operation C = C + A*B performed on matrices of size 64 × 64; these matrices are further divided into smaller submatrices, each of which corresponds to one micro-kernel operation. Our approach is based on a performance model constructed as a function of the submatrix size. The model accounts for factors such as the size of local storage, the number of registers, the properties of double-precision operations, and the balance between pipelines. This allows us to take into consideration the properties of both the first generation of Cell processors and its successor, the PowerXCell 8i.
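The decomposition described above can be illustrated with a minimal sketch in plain Python. It is not the authors' SPE implementation: the block sizes `mb`, `nb`, and `kb`, and the function names `micro_kernel` and `tile_multiply`, are assumptions chosen for illustration, since the abstract does not state the submatrix size that the performance model selects. The sketch only shows how a C = C + A*B update on a 64 × 64 tile breaks down into micro-kernel calls on smaller submatrices.

```python
TILE = 64  # tile size stated in the abstract


def micro_kernel(C, A, B, i0, j0, k0, mb, nb, kb):
    """Update the mb x nb block of C at (i0, j0) in place.

    Adds the product of the mb x kb block of A at (i0, k0) and the
    kb x nb block of B at (k0, j0). On an SPE this is the part that
    would be written with SIMD instructions and kept in registers.
    """
    for i in range(i0, i0 + mb):
        for j in range(j0, j0 + nb):
            acc = C[i][j]
            for k in range(k0, k0 + kb):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def tile_multiply(C, A, B, mb=4, nb=4, kb=TILE):
    """Compute C = C + A*B on square tiles by iterating over submatrices.

    mb, nb, kb are hypothetical block sizes; they must divide the tile size.
    """
    n = len(C)
    if n % mb or n % nb or n % kb:
        raise ValueError("block sizes must divide the tile size")
    for i0 in range(0, n, mb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, kb):
                micro_kernel(C, A, B, i0, j0, k0, mb, nb, kb)
    return C
```

Changing `mb` and `nb` changes how many elements of C a micro-kernel holds at once, which is the kind of trade-off a model parameterized by submatrix size would weigh against the number of available registers.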
The adaptation is followed by an optimization phase that includes loop transformations, a kernel implementation using SIMD instructions, and other transformations needed to balance the even and odd pipelines. Finally, we present hand tuning performed with the IBM Assembly Visualizer tool. Together, the proposed adaptation and optimizations achieve about 96% of peak performance.
References
Buttari, A., Dongarra, J., Kurzak, J.: Limitations of the PlayStation3 for High Performance Cluster Computing, http://www.netlib.org/lapack/lawnspdf/lawn185.pdf
Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell Broadband Engine Architecture and its first implementation - A performance view. IBM Journal of Research and Development 51(5), 559–572 (2007)
Dolfen, A., Gutheil, I., Homberg, W., Koch, E.: Applications on Juelich’s Cell-based Cluster JUICE, http://www.fz-juelich.de/jsc/datapool/cell/Para08_apps_on_juice.pdf
Kistler, M., Gunnels, J., Brokenshire, D., Benton, B.: Programming the Linpack Benchmark for the IBM PowerXCell 8i Processor. Scientific Programming 17(1-2), 43–57 (2009)
Kurzak, J., Alvaro, W., Dongarra, J.: Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor. Parallel Computing 35(3), 138–150 (2009)
Wang, H., Takizawa, H., Kobayashi, H.: A Performance Study of Secure Data Mining on the Cell Processor. Int. Journal of Grid and High Performance Computing 1(2), 30–44 (2009)
Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The potential of the Cell Processor for Scientific Computing. In: Proc. 3rd Conf. on Computing Frontiers, Ischia, Italy, pp. 9–20 (2006)
Woodward, P.R., Jayaraj, J., Lin, P., Yew, P.: Moving Scientific Codes to Multicore Microprocessor CPUs. Computing in Science and Engineering 10(6), 16–25 (2008)
IBM Assembly Visualizer for Cell Broadband Engine, http://www.alphaworks.ibm.com/tech/asmvis
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rojek, K., Szustak, Ł. (2010). Adaptation of Double-Precision Matrix Multiplication to the Cell Broadband Engine Architecture. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2009. Lecture Notes in Computer Science, vol 6067. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14390-8_56
DOI: https://doi.org/10.1007/978-3-642-14390-8_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14389-2
Online ISBN: 978-3-642-14390-8
eBook Packages: Computer Science (R0)