Programming the Linpack benchmark for the IBM PowerXCell 8i processor

Published: 01 January 2009

Abstract

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double-precision floating-point capability of 108.8 GFLOPS per processor. We explain how we modified the standard open-source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels, and we explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
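The figures quoted in the abstract can be related to one another with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that the 108.8 GFLOPS peak applies to a single PowerXCell 8i processor, so that a two-processor QS22 has twice that peak, and it treats efficiency simply as achieved rate divided by peak rate. All variable and function names are made up for this example.

```python
# Figures quoted in the abstract.
PEAK_PER_PROCESSOR_GFLOPS = 108.8  # peak double precision; assumed to be per processor
PROCESSORS_PER_BLADE = 2           # a BladeCenter QS22 incorporates two processors
SINGLE_BLADE_GFLOPS = 170.7        # Linpack result on one QS22
CLUSTER_TFLOPS = 11.1              # Linpack result on the QS22 cluster
CLUSTER_BLADES = 84                # number of blades in that cluster


def efficiency(achieved_gflops, peak_gflops):
    """Fraction of the theoretical peak that was actually achieved."""
    return achieved_gflops / peak_gflops


# Theoretical peaks derived under the per-processor assumption.
blade_peak = PEAK_PER_PROCESSOR_GFLOPS * PROCESSORS_PER_BLADE
cluster_peak = blade_peak * CLUSTER_BLADES

# Convert the cluster result to GFLOPS so the units match.
cluster_gflops = CLUSTER_TFLOPS * 1000.0

blade_eff = efficiency(SINGLE_BLADE_GFLOPS, blade_peak)
cluster_eff = efficiency(cluster_gflops, cluster_peak)
per_blade_in_cluster = cluster_gflops / CLUSTER_BLADES

print(f"QS22 peak:              {blade_peak:.1f} GFLOPS")
print(f"Single-blade efficiency: {blade_eff:.1%}")
print(f"Cluster peak:           {cluster_peak / 1000.0:.2f} TFLOPS")
print(f"Cluster efficiency:      {cluster_eff:.1%}")
print(f"Per-blade rate in cluster: {per_blade_in_cluster:.1f} GFLOPS")
```

Under these assumptions, the single-blade result is roughly 78% of peak and the 84-blade result roughly 61% of peak, which works out to about 132 GFLOPS per blade in the cluster run.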

Cited By

  • (2018) An (almost) direct deployment of the Fast Multipole Method on the Cell processor. The Journal of Supercomputing 65:3 (1205-1222). DOI: 10.1007/s11227-013-0877-z. Online publication date: 31-Dec-2018
  • (2009) Adaptation of double-precision matrix multiplication to the cell broadband engine architecture. Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I (535-546). DOI: 10.5555/1882792.1882856. Online publication date: 13-Sep-2009

Published In

Scientific Programming, Volume 17, Issue 1-2
High Performance Computing with the Cell Broadband Engine
January 2009
206 pages

Publisher

IOS Press

Netherlands

Author Tags

  1. Accelerators
  2. hybrid programming

Qualifiers

  • Article
