Programming the Linpack benchmark for the IBM PowerXCell 8i processor

Authors: Michael Kistler, Daniel Brokenshire, Brad Benton

Scientific Programming, Volume 17, Issue 1-2

Pages 43 - 57

https://doi.org/10.1155/2009/401691

Published: 01 January 2009

Abstract

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs gives each processor a peak double-precision floating-point capability of 108.8 GFLOPS. We explain how we modified the standard open-source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM. Our implementation of Linpack also supports clusters of QS22s and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
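The figures quoted in the abstract can be related to one another with a short calculation. The sketch below is illustrative only: it assumes the 108.8 GFLOPS peak applies to a single PowerXCell 8i processor, so that a two-processor QS22 blade peaks at twice that value, and the `efficiency` helper and constant names are hypothetical, not taken from the paper.

```python
# Figures quoted in the abstract.
PEAK_GFLOPS_PER_PROCESSOR = 108.8   # assumed to be a per-processor peak
PROCESSORS_PER_BLADE = 2            # a QS22 incorporates two PowerXCell 8i processors
SINGLE_BLADE_GFLOPS = 170.7         # Linpack result on one QS22
CLUSTER_TFLOPS = 11.1               # Linpack result on the cluster
CLUSTER_BLADES = 84                 # number of QS22 blades in the cluster


def efficiency(achieved_gflops: float, blades: int) -> float:
    """Return achieved performance as a fraction of theoretical peak."""
    peak_gflops = blades * PROCESSORS_PER_BLADE * PEAK_GFLOPS_PER_PROCESSOR
    return achieved_gflops / peak_gflops


single_blade_eff = efficiency(SINGLE_BLADE_GFLOPS, blades=1)
cluster_eff = efficiency(CLUSTER_TFLOPS * 1000.0, blades=CLUSTER_BLADES)

print(f"single blade: {single_blade_eff:.1%} of peak")
print(f"84-blade cluster: {cluster_eff:.1%} of peak")
```

Under these assumptions the single-blade result corresponds to roughly 78% of peak and the 84-blade result to roughly 61% of peak.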

Published In

Scientific Programming, Volume 17, Issue 1-2: High Performance Computing with the Cell Broadband Engine

January 2009, 206 pages

Publisher: IOS Press, Netherlands