Adaptation of Double-Precision Matrix Multiplication to the Cell Broadband Engine Architecture

Krzysztof Rojek & Łukasz Szustak

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6067)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

This paper presents an approach to adapting double-precision matrix multiplication to the architecture of Cell processors. On a single SPE, the algorithm is built on the operation C = C + A*B for matrices of size 64 × 64; these matrices are further divided into smaller submatrices, each of which corresponds to one micro-kernel operation. The approach relies on a performance model constructed as a function of the submatrix size. The model accounts for factors such as the size of the local storage, the number of registers, the properties of double-precision operations, and the balance between pipelines. This allows us to take into consideration the properties of both the first generation of Cell processors and its successor, the PowerXCell 8i.

The adaptation is followed by an optimization phase that includes loop transformations, a kernel implementation with SIMD instructions, and other transformations needed to balance the even and odd pipelines. Finally, we present hand tuning performed with the IBM Assembly Visualizer tool. The proposed adaptation and optimizations achieve about 96% of the peak performance.
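
The decomposition described in the abstract, a 64 × 64 update C = C + A*B split into submatrix-sized micro-kernel calls, can be illustrated with a minimal scalar sketch in portable C. This is not the authors' SPE implementation: the submatrix dimensions `MR` and `NR`, the row-major layout, and the function names `micro_kernel` and `tile_gemm` are assumptions chosen only for illustration, and the sketch contains no SIMD instructions or pipeline balancing.

```c
#include <stddef.h>

/* TILE matches the 64 x 64 size named in the abstract.
 * MR and NR are hypothetical submatrix dimensions; the paper derives
 * the actual size from its performance model. */
enum { TILE = 64, MR = 4, NR = 4 };

/* Hypothetical micro-kernel: updates one MR x NR block of C with the
 * product of an MR x TILE strip of A and a TILE x NR strip of B.
 * All matrices are row-major with leading dimension TILE. */
static void micro_kernel(const double *a, const double *b, double *c)
{
    double acc[MR][NR];

    /* Load the block of C into local accumulators
     * (a stand-in for keeping it in registers). */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            acc[i][j] = c[i * TILE + j];

    /* Accumulate one rank-1 update per step of the shared dimension. */
    for (int k = 0; k < TILE; ++k)
        for (int i = 0; i < MR; ++i) {
            double aik = a[i * TILE + k];
            for (int j = 0; j < NR; ++j)
                acc[i][j] += aik * b[k * TILE + j];
        }

    /* Store the updated block back to C. */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * TILE + j] = acc[i][j];
}

/* C = C + A*B for TILE x TILE row-major matrices,
 * expressed as a grid of micro-kernel calls. */
void tile_gemm(const double *a, const double *b, double *c)
{
    for (int i = 0; i < TILE; i += MR)
        for (int j = 0; j < TILE; j += NR)
            micro_kernel(a + (size_t)i * TILE, b + j,
                         c + (size_t)i * TILE + j);
}
```

Calling `tile_gemm(A, B, C)` on three arrays of `TILE * TILE` doubles updates `C` in place. In this sketch, changing `MR` and `NR` changes how many accumulators each micro-kernel call needs, which is the kind of trade-off against register count that a model parameterized by submatrix size would capture.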

Author information

Authors and Affiliations

Institute of Computer and Information Sciences, Czestochowa University of Technology, Dabrowskiego 73, 42-200, Czestochowa, Poland
Krzysztof Rojek & Łukasz Szustak

Editor information

Editors and Affiliations

Institute of Computational and Information Sciences, Czestochowa University of Technology,
Roman Wyrzykowski
Department of Electrical Engineering and Computer Science, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra
Institute of Computer and Information Science, Czestochowa University of Technology, Dabrowskiego 73, PL-42-200, Czestochowa, Poland
Konrad Karczewski
Department of Informatics and Mathematical Modeling, Technical University of Denmark, Richard Petersens Plads, Building 321, 2800, Kongens Lyngby, Denmark
Jerzy Wasniewski

About this paper

Cite this paper

Rojek, K., Szustak, Ł. (2010). Adaptation of Double-Precision Matrix Multiplication to the Cell Broadband Engine Architecture. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2009. Lecture Notes in Computer Science, vol 6067. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14390-8_56

DOI: https://doi.org/10.1007/978-3-642-14390-8_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14389-2
Online ISBN: 978-3-642-14390-8
eBook Packages: Computer Science (R0)