State-of-the-art in heterogeneous computing

Authors:

Andre R. Brodtkorb,

Christopher Dyken,

Trond R. Hagen,

Jon M. Hjelmervik,

Olaf O. Storaasli

Scientific Programming, Volume 18, Issue 1

Pages 1 - 33

https://doi.org/10.1155/2010/540159

Published: 01 January 2010

Abstract

Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state of the art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware and available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

References

[1]

Accelereyes, Jacket user guide, February 2009.

[2]

Y. Allusse, P. Horain, A. Agarwal and C. Saipriyadarshan, GpuCV: A GPU-accelerated framework for image processing and computer vision, in: Intl. Symp. on Advances in Visual Computing, Springer-Verlag, Berlin, 2008, pp. 430-439.

Digital Library

[3]

Altera, FFT megacore function user guide, March 2009.

[4]

Altera, Logicore ip fast Fourier transform v7.0 user guide, June 2009.

[5]

W. Alvaro, J. Kurzak and J. Dongarra, Fast and small short vector SIMD matrix multiplication kernels for the synergistic processing element of the cell processor, in: Intl. Conf. on Computational Science, Springer-Verlag, Berlin, 2008, pp. 935-944.

Digital Library

[6]

AMD, R700-family instruction set architecture, March 2009.

[7]

AMD, ATI Radeon HD 5870 GPU feature summary, available at: http://www.amd.com/us/products/desktop/graphics/ ati-radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870- specifications.aspx (visited 2009-10-28).

[8]

AMD, ATI stream software development kit, available at: http://developer.amd.com/gpu/ATIStreamSDK/ (visited 2009-04-28).

[9]

AMD, Stream KernelAnalyzer, available at: http://developer. amd.com/gpu/ska/.

[10]

AMD, AMD core math library for graphic processors, March 2009, available at: http://developer.amd.com/gpu/acmlgpu/ (visited 2009-04-20).

[11]

T. Aoki, Real-time tsunami simulation on a multinode GPU cluster, in: SuperComputing, Portland, OR, 2009, poster.

[12]

M. Araya-Polo, F. Rubio, R. de la Cruz, M. Hanzich, J. Cela and D. Scarpazza, 3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multicore processors, Sci. Prog. 17(1,2) (2009), 185-198.

[13]

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams and K. Yelick, The landscape of parallel computing research: A view from Berkeley, Technical report, EECS Department, University of California, Berkeley, December 2006.

[14]

K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, E. Lee, N. Morgan, G. Necula, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel and K. Yelick, The parallel computing laboratory at U.C. Berkeley: A research agenda based on the Berkeley view, Technical report, EECS Department, University of California, Berkeley, December 2008.

[15]

W. Aspray, The Intel 4004 microprocessor: What constituted invention?, Hist. Comput. 19(3) (1997), 4-15.

Digital Library

[16]

P. Babenko and M. Shah, MinGPU: a minimum GPU library for computer vision, Real-Time Image Process. 3(4) (2008), 255-268.

[17]

J. Backus, Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs, Commun. ACM 21(8) (1978), 613-641.

Digital Library

[18]

D. Bader and V. Argwal, FFTC: Fastest Fourier transform for the IBM cell broadband engine, in: Intl. Conf. on High Performance Computing, Goa, India, 2007, pp. 172-184.

[19]

D. Bader, V. Agarwal and K. Madduri, On the design and analysis of irregular algorithms on the cell processor: A case study of list ranking, in: Intl. Parallel and Distributed Processing Symp., Long Beach, CA, USA, 2007, pp. 1-10.

[20]

Z. Baker, M. Gokhale and J. Tripp, Matched filter computation on FPGA, cell and GPU, in: Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society, Washington, DC, USA, 2007, pp. 207-218.

Digital Library

[21]

K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin and J. Sancho, Entering the petaflop era: The architecture and performance of Roadrunner, in: Supercomputing, November 2008, IEEE Press, Piscataway, NJ, USA, 2008, pp. 1-11.

[22]

J. Beeckler and W. Gross, Particle graphics on reconfigurable hardware, Reconfigurable Technology and Systems 1(3) (2008), 1-27.

[23]

N. Bell and M. Garland, Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report NVR-2008- 004, NVIDIA Corporation, December 2008.

[24]

P. Bellens, J. Perez, R. Badia and J. Labarta, CellSs: a programming model for the cell BE architecture, in: Supercomputing, ACM, New York, NY, USA, 2006, p. 86.

[25]

S. Benkner, E. Laure and H. Zima, HPF+: An extension of HPF for advanced applications, Technical report, The HPF+ Consortium, 1999.

[26]

G. Blelloch, Prefix sums and their applications, Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990.

[27]

G. Blelloch, M. Heroux and M. Zagha, Segmented operations for sparse matrix computation on vector multiprocessors, Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, 1993.

[28]

G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly, Cambridge, MA, USA, 2008.

[29]

F. Bodin, An evolutionary path for high-performance heterogeneous multicore programming, 2008.

[30]

G. Boone, Computing systems CPU, United States Patent 3, 757, 306, August 1971.

[31]

M. Boyer, D. Tarjan, S. Acton and K. Skadron, Accelerating leukocyte tracking using cuda: A case study in leveraging manycore coprocessors, in: Int. Parallel and Distributed Processing Symp., Rome, Italy, 2009, pp. 1-12.

Digital Library

[32]

A. Brodtkorb, The graphics processor as a mathematical coprocessor in MATLAB, in: Intl. Conf. on Complex, Intelligent and Software Intensive Systems, Barcelona, Spain, IEEE Computer Society, 2008, pp. 822-827.

Digital Library

[33]

I. Buck, T. Foley, D. Horn, J. Sugerman, M. Houston and P. Hanrahan, Brook for GPUs: Stream computing on graphics hardware, SIGGRAPH, Los Angeles, CA, 2004.

Digital Library

[34]

A. Buttari, J. Langou, J. Kurzak and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Comput. 35(1) (2009), 38-53.

[35]

J. Canny, A computational approach to edge detection, Pattern Anal. Machine Intelligence 8(6) (1986), 679-698.

Digital Library

[36]

Celoxica website, http://www.celoxica.com/ (visited 2009- 04-28).

[37]

R. Chamberlain, M. Franklin, E. Tyson, J. Buhler, S. Gayen, P. Crowley and J. Buckley, Application development on hybrid systems, in: SC'07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, ACM, New York, NY, USA, 2007, pp. 1-10.

Digital Library

[38]

Chapel language specification 0.782, Technical report, Cray Inc., 2009.

[39]

B. Chapman, G. Jost and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press, Cambridge, MA, USA, 2007.

Digital Library

[40]

S. Che, J. Li, J. Sheaffer, K. Skadron and J. Lach, Accelerating compute-intensive applications with GPUs and FPGAs, in: Symposium on Application Specific Processors, 2008 (SASP'2008), Anaheim, CA, June 2008, pp. 101-107.

Digital Library

[41]

T. Chen, R. Raghavan, J. Dale and E. Iwata, Cell broadband engine architecture and its first implementation: a performance view, IBM J. Res. Dev. 51(5) (2007), 559-572.

Digital Library

[42]

J. Chhugani, A. Nguyen, V. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar and P. Dubey, Efficient implementation of sorting on multi-core SIMD CPU architecture, Proc. VLDB Endowment 1(2) (2008), 1313-1324.

Digital Library

[43]

M. Christen, O. Schenk, P. Messmer, E. Neufeld and H. Burkhart, Accelerating stencil-based computations by increased temporal locality on modern multi- and many-core architectures, in: Intl. Workshop on New Frontiers in High-Performance and Hardware-Aware Computing, KIT Scientific Publishing, Karlsruhe, Germany, 2008, pp. 47-54.

[44]

Pico Computing, Accelerating bioinformatics searching and dot plotting using a scalable FPGA cluster, November 2009 (visited 2009-11-14).

[45]

K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf and K. Yelick, Optimization and performance modeling of stencil computations on modern microprocessors, SIAM Rev. 51(1) (2009), 129-159.

[46]

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf and K. Yelick, Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, in: Supercomputing, IEEE Press, Piscataway, NJ, USA, 2008, pp. 1-12.

[47]

J. Dongarra, Basic linear algebra subprograms technical forum standard, High Perform. Appl. Supercomput. 16(2002), 1-111.

Digital Library

[48]

U. Drepper, What every programmer should know about memory, November 2007, available at: http://people. redhat.com/drepper/cpumemory.pdf (visited 2009-03-20).

[49]

Dsplogic website, http://www.dsplogic.com/ (visited 2009- 04-28).

[50]

C. Dyken, G. Ziegler, C. Theobalt and H.-P. Seidel, Highspeed marching cubes using histogram pyramids, Computer Graphics Forum 27(8) (2008), 2028-2039.

[51]

EDA IndustryWorking Groups, http://www.vhdl.org/ (visited 2009-04-28).

[52]

A. Eichenberger, J. O'Brien, K. O'Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao andM. Gschwind, Optimizing compiler for the cell processor, in: Intl. Conf. on Parallel Architectures and Compilation Techniques, IEEE Computer Society, Washington, DC, USA, 2005, pp. 161-172.

[53]

A. Eichenberger, J. O'Brien, K. O'Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao,M. Gschwind, R. Archambault, Y. Gao and R. Koo, Using advanced compiler technology to exploit the performance of the cell broadband engine architecture, IBM Syst. J. 45(1) (2006), 59-84.

[54]

E. Elsen, P. LeGresley and E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, Comput. Phys. 227(24) (2008), 10148-10161.

Digital Library

[55]

K. Fatahalian, D. Horn, T. Knight, L. Leem, M. Houston, J. Park, M. Erez, M. Ren, A. Aiken, W. Dally and P. Hanrahan, Sequoia: programming the memory hierarchy, in: Supercomputing, ACM, New York, NY, USA, 2006, p. 83.

[56]

Fftw website, http://www.fftw.org (visited 2009-04-28).

[57]

Fixtars, OpenCV on the cell, available at: http://cell.fixstars. com/opencv/ (visited 2009-03-20).

[58]

M. Flynn, Some computer organizations and their effectiveness, Trans. Comput. C-21(9) (1972), 948-960.

[59]

M. Flynn, R. Dimond, O. Mencer and O. Pell, Finding speedup in parallel processors, in: Intl. Symp. on Parallel and Distributed Computing, Miami, FL, USA, July 2008, pp. 3-7.

Digital Library

[60]

M. Frigo and V. Strumpen, The memory behavior of cache oblivious stencil computations, in: Supercomputing, Vol. 39, Kluwer Academic Publishers, Hingham, MA, USA, 2007, pp. 93-112.

[61]

I. Foster and K. Chandy, Fortran M: A language for modular parallel programming, Parallel Distrib. Comput. 26(1992).

[62]

D. Göddeke, R. Strzodka and S. Turek, Performance and accuracy of hardware-oriented native-, emulated- and mixed precision solvers in FEM simulations, Parallel Emergent Distrib. Syst. 22(4) (2007), 221-256.

[63]

D. Göddeke and R. Strzodka, Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations (part 2: Double precision GPUs), Technical report, Technical University Dortmund, 2008.

[64]

D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. Buijssen, M. Grajewski and S. Turek, Exploring weak scalability for FEM calculations on a GPU-enhanced cluster, Parallel Comput. 33(10, 11) (2007), 685-699.

[65]

N. Govindaraju, J. Gray, R. Kumar and D. Manocha, GPUTeraSort: high performance graphics co-processor sorting for large database management, in: Intl. Conf. on Management of Data, ACM, New York, NY, USA, 2006, pp. 325-336.

Digital Library

[66]

N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith and J. Manferdelli, High performance discrete Fourier transforms on graphics processors, in: Supercomputing, IEEE Press, Piscataway, NJ, USA, 2008, pp. 1-12.

[67]

A. Greß, M. Guthe and R. Klein, GPU-based collision detection for deformable parameterized surfaces, Computer Graphics Forum 25(3) (2006), 497-506.

[68]

W. Gropp, E. Lusk and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd edn, MIT Press, Cambridge, MA, USA, 1999.

[69]

J. Gustafson, Reconstruction of the Atanasoff-Berry computer, in: The First Computers: History and Architectures, R. Rojas and U. Hashagen, eds, MIT Press, Cambridge, MA, USA, 2000, pp. 91-106, Chapter 6.

[70]

D. Hackenberg, Fast matrix multiplication on cell (SMP) systems, July 2007, available at: http://www.tudresden. de/zih/cell/matmul/ (visited 2009-02-24).

[71]

T. Hagen, J. Hjelmervik, K.-A. Lie, J. Natvig and M. Henriksen, Visual simulation of shallow-water waves, Simul. Model. Pract. Theory 13(8) (2005), 716-726.

[72]

T. Hagen, K.-A. Lie and J. Natvig, Solving the euler equations on graphics processing units, in: Intl. Conf. on Computational Science, V.N. Alexandrov, G.D. van Albada, P.M. Sloot and J. Dongarra, eds, LNCS, Vol. 3994, Springer, 2006, pp. 220- 227.

Digital Library

[73]

M. Harris, Parallel computing with CUDA, SIGGRAPH Asia 2008 presentation, avaiklable at: http://sa08.idav.ucdavis. edu/NVIDIA.CUDA.Harris.pdf (visited 2009-04-28).

[74]

M. Harris, J. Owens, S. Sengupta, Y. Zhang and A. Davidson, CUDPP: CUDA data parallel primitives library, available at: http://www.gpgpu.org/developer/cudpp/ (visited 2009-03- 20).

[75]

M. Harris, S. Sengupta and J. Owens, Parallel prefix sum (scan) with CUDA, in: GPU Gems 3, H. Nguyen, ed., Addison-Wesley, Boston, MA, USA, 2007, pp. 851-876.

[76]

K. Hemmert and K. Underwood, An analysis of the double-precision floating-point FFT on FPGAs, in: Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society, Washington, DC, USA, 2005, pp. 171-180.

Digital Library

[77]

J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th edn, Morgan Kaufmann, San Francisco, CA, USA, 2007.

[78]

N. Higham, The accuracy of floating point summation, Sci. Comp. 14(4) (1993), 783-799.

Digital Library

[79]

P. Hilfinger, D. Bonachea, K. Datta, D. Gay, S. Graham, B. Liblit, G. Pike, J. Su and K. Yelick, Titanium language reference manual, Technical report, UC Berkeley, 2005.

[80]

M. Hill and M. Marty, Amdahl's law in the multicore era, IEEE Computer 41(7) (2008), 33-38.

Digital Library

[81]

W. Hillis and G. Steele Jr., Data parallel algorithms, Commun. ACM 29(12) (1986), 1170-1183.

Digital Library

[82]

J. Hjelmervik, Heterogeneous computing with focus on mechanical engineering, PhD dissertation, University of Oslo and Grenoble Institute of Technology, 2009. (Thesis accepted. Defence 2009-05-06.)

[83]

R. Holt, LSI technology state of the art in 1968, September 1998.

[84]

D. Horn, Stream reduction operations for GPGPU applications, in: GPU Gems 2, M. Pharr and R. Fernando, eds, Addison-Wesley, Boston, MA, USA, 2005, pp. 573-589.

[85]

L. Howes and D. Thomas, Efficient random number generation and application using CUDA, in: GPU Gems 3, H. Nguyen, ed., Addison-Wesley, Boston, MA, USA, 2007, pp. 805-830.

[86]

HPCWire, FPGA cluster accelerates bioinformatics application by 5000×, November 2009, available at: http://www.hpcwire.com/offthewire/FPGA-Cluster-Accelerates-Bioinformatics-Application-by-5000X- 69612762.html (visited 2009-11-14).

[87]

IBM, PowerPC microprocessor family: Vector/SIMD multimedia extension technology programming environments manual, 2005.

[88]

IBM, IBM BladeCenter QS22, available at: http://www-03. ibm.com/systems/bladecenter/hardware/servers/qs22/ (visited 2009-04-20).

[89]

IBM, Software development kit for multicore acceleration version 3.1: Programmers guide, August 2008.

[90]

IBM, Fast Fourier transform library: Programmer's guide and API reference, August 2008.

[91]

IBM, 3d fast Fourier transform library: Programmer's guide and API reference, August 2008.

[92]

Impulse accelerated technologies, http://impulseaccelerated. com/ (visited 2009-04-28).

[93]

H. Inoue, T. Moriyama, H. Komatsu and T. Nakatani, AA-sort: A new parallel sorting algorithm for multicore SIMD processors, in: Intl. Conf. on Parallel Architecture and Compilation Techniques, IEEE Computer Society, Washington, DC, USA, 2007, pp. 189-198.

[94]

ISO/IEC, 9899:TC3, International Organization for Standardization, September 2007.

[95]

K. Iverson, A Programming Language,Wiley, New York, NY, USA, 1962.

[96]

S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf and K. Yelick, Implicit and explicit optimizations for stencil computations, in: Workshop on Memory System Performance and Correctness, ACM, New York, NY, USA, 2006, pp. 51-60.

Digital Library

[97]

K. Kennedy, C. Koelbel and H. Zima, The rise and fall of high performance Fortran: an historical object lesson, in: Conf. on History of Programming Languages, 2007, San Diego, CA, USA, 7-1-7-22.

Digital Library

[98]

Khronos OpenCL Working Group, The OpenCL specification 1.0, 2008, available at: http://www.khronos.org/ registry/cl/ (visited 2009-03-20).

[99]

M. Kistler, J. Gunnels, D. Brokenshire and B. Benton, Petascale computing with accelerators, in: Symp. on Principles and Practice of Parallel Programming, ACM, NewYork, NY, USA, 2008, pp. 241-250.

Digital Library

[100]

T. Knight, J. Park, M. Ren, M. Houston, M. Erez, K. Fatahalian, A. Aiken, W. Dally and P. Hanrahan, Compilation for explicitly managed memory hierarchies, in: Symp. on Principles and Practice of Parallel Programming, ACM, NewYork, NY, USA, 2007, pp. 226-236.

Digital Library

[101]

C. Koelbel, U. Kremer, C.-W. Tseng, M.-Y. Wu, G. Fox, S. Hiranandani and K. Kennedy, Fortran D language specification, Technical report, 1991.

[102]

P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. Williams and K. Yelick, Exascale computing study: Technology challenges in achieving exascale systems, Technical report, DARPA IPTO, 2008.

[103]

M. Krishna, A. Kumar, N. Jayam, G. Senthilkumar, P. Baruah, R. Sharma, S. Kapoor and A. Srinivasan, A synchronous mode MPI implementation on the cell BE architecture, in: Intl. Symp. on Parallel and Distributed Processing with Applications, Niagara Falls, ON, Canada, 2007, pp. 982-991.

[104]

A. Kumar, G. Senthilkumar,M. Krishna, N. Jayam, P. Baruah, R. Sharma, A. Srinivasan and S. Kapoor, A buffered-mode MPI implementation for the cell BE processor, in: Intl. Conf. on Computational Science, Springer-Verlag, Berlin, 2007, pp. 603-610.

[105]

E. Larsen and D. McAllister, Fast matrix multiplies using graphics hardware, in: Supercomputing, ACM, New York, NY, USA, 2001, p. 55.

Digital Library

[106]

P. L'Ecuyer, Maximally equidistributed combined Tausworthe generators, Math. Comput. 65(213) (1996), 203-213.

Digital Library

[107]

B. Lloyd, C. Boyd and N. Govindaraju, Fast computation of general Fourier transforms on GPUs, in: Intl. Conf. on Multimedia & Expo, Hannover, Germany, 2008, pp. 5-8.

[108]

E. Loh and G. Walster, Rump's example revisited, Reliab. Comput. 8(3) (2002), 245-248.

[109]

Y. Luo and R. Duraiswami, Canny edge detection on NVIDIA CUDA, in: Computer Vision and Pattern Recognition Workshops, June 2008, IEEE Computer Society Press, Washington, DC, USA, 2008, pp. 1-8.

[110]

M. Matsumoto and T. Nishimura, Mersenne twister: a 623- dimensionally equidistributed uniform pseudo-random number generator, Model. Comput. Simul. 8(1) (1998), 3-30.

Digital Library

[111]

M. Matsumoto and T. Nishimura, Dynamic creation of pseudorandom number generators, in: Monte Carlo and Quasi-Monte Carlo Methods 1998, Springer-Verlag, Heidelberg, Germany, 2000, pp. 56-69.

[112]

T. Mattson, R. Van der Wijngaart and M. Frumkin, Programming the Intel 80-core network-on-a-chip terascale processor, in: Supercomputing, IEEE Press, Piscataway, NJ, USA, 2008, pp. 1-11.

[113]

M. McCool, Data-parallel programming on the cell BE and the GPU using the Rapidmind development platform, in: gSPx Multicore Applications Conference, November 2006.

[114]

M. McCool, S. Du Toit, T. Popa, B. Chan and K. Moule, Shader algebra, in: SIGGRAPH, ACM, New York, NY, USA, 2004, pp. 787-795.

[115]

S. McKeown, R. Woods and J. McAllister, Algorithmic factorisation for low power FPGA implementations through increased data locality, in: Int. Symp. on VLSI Design, Automation and Test, April 2008, pp. 271-274.

[116]

J. McZalpin and D.Wonnacott, Time skewing: A value-based approach to optimizing for memory locality, Technical Report dcs-tr-379, Rutgers School of Arts and Sciences, 1999.

[117]

P. Michel, J. Chestnut, S. Kagami, K. Nishiwaki, J. Kuffner and T. Kanade, GPU-accelerated real-time 3D tracking for humanoid locomotion and stair climbing, in: Intl. Conf. on Intelligent Robots and Systems, San Diego, CA, USA, 29 October-2 November 2007, pp. 463-469.

[118]

P. Micikevicius, 3D finite difference computation on GPUs using CUDA, in: Workshop on General Purpose Processing on Graphics Processing Units, ACM, New York, NY, USA, 2009, pp. 79-84.

Digital Library

[119]

Microsoft, DirectX: Advanced graphics on windows, available at: http://msdn.microsoft.com/directx (visited 2009-03- 31).

[120]

Mitronics website, http://www.mitrion.com/ (visited 2009- 04-28).

[121]

E. Mollick, Establishing Moore's law, Hist. Comput. 28(3) (2006), 62-75.

Digital Library

[122]

G. Moore, Cramming more components onto integrated circuits, Electronics 38(8) (1965), 114-117.

[123]

H. Neoh and A. Hazanchuk, Adaptive edge detection for real-time video processing using FPGAs, March 2005, available at: http://www.altera.com/literature/cp/gspx/edgedetection. pdf (visited 2009-03-20).

[124]

A. Nukada and S. Matsuoka, Auto-tuning 3-D FFT library for CUDA GPUs, in: Supercomputing, 2009.

[125]

R. Numrich and J. Reid, Co-Array Fortran for parallel programming, Technical report, Fortran Forum, 1998.

[126]

NVIDIA, CUDA CUBLAS library version 2.0, March 2008.

[127]

NVIDIA, CUDA CUFFT library version 2.1, March 2008.

[128]

NVIDIA, CUDA SDK version 2.0, 2008.

[129]

NVIDIA, CUDA Zone, available at: http://www.nvidia.com/ cuda (visited 2009-03-20).

[130]

NVIDIA, Developer Zone, available at: http://developer. nvidia.com/ (visited 2009-09-07).

[131]

NVIDIA, NVIDIA's next generation CUDA compute architecture: Fermi, October 2009.

[132]

NVIDIA, NVIDIA GeForce GTX 200 GPU architectural overview, May 2008.

[133]

NVIDIA, NVIDIA CUDA reference manual 2.0, June 2008.

[134]

K. O'Brien, K. O'Brien, Z. Sura, T. Chen and T. Zhang, Supporting OpenMP on cell, Parallel Prog. 36(3) (2008), 289- 311.

Digital Library

[135]

M. Ohara, H. Inoue, Y. Sohda, H. Komatsu and T. Nakatani, MPI microtask for programming the cell broadband engine processor, IBM Syst. J. 45(1) (2006), 85-102.

[136]

OpenFPGA website, http://www.openfpga.org/ (visited 2009-04-28).

[137]

OpenGL Architecture Review Board, D. Shreiner, M. Woo, J. Neider and T. Davis, OpenGL Programming Guide: The Official Guide to Learning OpenGL, 6th edn, Addison-Wesley, Boston, MA, USA, 2007.

[138]

Open systemc initiative, http://www.systemc.org/ (visited 2009-04-28).

[139]

J. Owens, M. Houston, D. Luebke, S. Green, J. Stone and J. Phillips, GPU computing, Proc. IEEE 96(5) (2008), 879- 899.

[140]

S. Pakin, Receiver-initiated message passing over RDMA networks, in: Intl. Parallel and Distributed Processing Symp., Miami, FL, USA, April 2008.

[141]

Parallel linear algebra for scalable multi-core architectures (PLASMA) project, available at: http://icl.cs.utk.edu/plasma/ (visited 2009-04-20).

[142]

D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas and K. Yelick, A case for intelligent RAM, IEEE Micro 17(2) (1997), 34-44.

Digital Library

[143]

J. Perez, P. Bellens, R. Badia and J. Labarta, CellSs: making it easier to program the cell broadband engine processor, IBM J. Res. Dev. 51(5) (2007), 593-604.

Digital Library

[144]

J. Peterson, P. Bohrer, L. Chen, E. Elnozahy, A. Gheith, R. Jewell, M. Kistler, T. Maeurer, S. Malone, D. Murrell, N. Needel, K. Rajamani, M. Rinaldi, R. Simpson, K. Sudeep and L. Zhang, Application of full-system simulation in exploratory system design and development, IBM J. Res. Dev. 50(2,3) (2006), 321-332.

[145]

T. Pock,M. Unger, D. Cremers and H. Bischof, Fast and exact solution of total variation models on the GPU, in: Computer Vision and Pattern Recognition Workshops, June 2008, IEEE Computer Society Press, Washington, DC, USA, 2008, pp. 1-8.

[146]

Portland Group, PGI accelerator compilers, available at: http://www.pgroup.com/resources/accel.htm (visited 2009- 08-05).

[147]

T. Purcell, C. Donner, M. Cammarano, H. Jensen and P. Hanrahan, Photon mapping on programmable graphics hardware, in: EUROGRAPHICS, Eurographics Association, 2003, pp. 41-50.

[148]

RapidMind, Cell BE porting and tuning with RapidMind: A case study, 2006, available at: http://www.rapidmind. net/case-cell.php (visited 2009-03-20).

[149]

M. Ren, J. Park, M. Houston, A. Aiken and W. Dally, A tuning framework for software-managed memory hierarchies, in: Intl. Conf. on Parallel Architectures and Compilation Techniques, ACM, New York, NY, USA, 2008, pp. 280-291.

Digital Library

[150]

Report on the experimental language X10, draft 0.41, Technical report, IBM, 2006.

[151]

H. Richardson, High performance Fortran: history, overview and current developments, Technical Report 1.4 TMC-261, Thinking Machines Corporation, 1996.

[152]

V. Sachdeva, M. Kistler, E. Speight and T.-H. Tzeng, Exploring the viability of the cell broadband engine for bioinformatics applications, Parallel Comput. 34(11) (2008), 616-626.

Digital Library

[153]

M. Saito and M. Matsumoto, SIMD-oriented fast Mersenne Twister: a 128-bit pseudorandom number generator, in: Monte Carlo and Quasi-Monte Carlo Methods, Springer-Verlag, Heidelberg, Germany, 2008.

[154]

N. Satish, M. Harris and M. Garland, Designing efficient sorting algorithms for manycore GPUs, NVIDIA, NVIDIA Technical Report NVR-2008-001, September 2008.

[155]

L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan and P. Hanrahan, Larrabee: a many-core ×86 architecture for visual computing, Trans. Graph. 27(3) (2008), 1-15.

Digital Library

[156]

S. Sengupta, M. Harris, Y. Zhang and J. Owens, Scan primitives for GPU computing, in: EUROGRAPHICS, Pergamon Press Inc., Elmsford, NY, USA, 2007, pp. 97-106.

[157]

S. Sengupta, A. Lefohn and J. Owens, A work-efficient step-efficient prefix sum algorithm, in: Workshop on Edge Computing Using New Commodity Architectures, May 2006, pp. 26-27.

[158]

B. Stackhouse, B. Cherkauer, M. Gowan, P. Gronowski and C. Lyles, A 65nm 2-billion-transistor quad-core Itanium processor, in: Intl. Solid-State Circuits Conf., Lille, France, February 2008, pp. 92-598.

[159]

O. Storaasli and D. Strenski, Beyond 100× speedup with FPGAs: Cray XD1 I/O analysis, in: Cray Users Group, Cray User Group Inc., Corvallis, OR, USA, 2009.

[160]

O. Storaasli, W. Yu, D. Strenski and J. Maltby, Performance evaluation of FPGA-based biological applications, in: Cray User Group, Cray User Group Inc., Corvallis, OR, USA, 2007.

[161]

D. Strenski, 2009, personal communication.

[162]

D. Strenski, J. Simkins, R. Walke and R. Wittig, Evaluating fpgas for floating-point performance, in: Intl. Workshop on High-Performance Reconfigurable Computing Technology and Applications, November 2008, IEEE Computer Society Press, Washington, DC, USA, pp. 1-6.

[163]

H. Sugano and R.Miyamoto, Parallel implementation of morphological processing on cell/BE with OpenCV interface, in: Intl. Symp. on Communications, Control and Signal Processing, St. Julians, Malta, March 2008, pp. 578-583.

[164]

J. Sun, G. Peterson and O. Storaasli, High-performance mixed-precision linear solver for FPGAs, IEEE Trans. Comput. 57(12) (2008), 1614-1623.

Digital Library

[165]

M. Sussman, W. Crutchfield and M. Papakipos, Pseudorandom number generation on the GPU, in: Graphics Hardware, ACM, New York, NY, USA, 2006, pp. 87-94.

[166]

Starbridge systems website, http://www.starbridgesystems. com/ (visited 2009-12-04).

[167]

S. Swaminarayan, K. Kadau, T. Germann and G. Fossum, 369 tflop/s molecular dynamics simulations on the Roadrunner general-purpose heterogeneous supercomputer, in: Supercomputing, IEEE Press, Piscataway, NJ, USA, 2008, pp. 1-10.

[168]

D. Thomas, L. Howes and W. Luk, A comparison of CPUs, GPUs, FPGAs and masssively parallel processor arrays for random number generation, in: FPGA, ACM, New York, NY, USA, 2009.

[169]

D. Thomas and W. Luk, High quality uniform random number generation using LUT optimised state-transition matrices, VLSI Signal Process. Syst. 47(1) (2007), 77-92.

[170]

Tokyo Tech, November 2008, booth #3208 at Supercomputing' 08, available at: http://www.voltaire.com/assets/files/ Case%20studies/titech_case_study_final_for_SC08.pdf (visited 2009-04-28).

[171]

Top 500 supercomputer sites, June 2009, available at: http://www.top500.org/.

[172]

Top green500 list, November 2008, available at: http://www.green500.org/.

[173]

UPC language specification v1.2, Technical report, UPC Consortium, 2005.

[174]

A. van Amesfoort, A. Varbanescu, H. Sips and R. van Nieuwpoort, Evaluating multi-core platforms for HPC dataintensive kernels, in: Conf. on Computing Frontiers, ACM, New York, NY, USA, 2009, pp. 207-216.

[175]

W. van der Laan, Cubin utilities, 2007, available at: http:// www.cs.rug.nl/~wladimir/decuda/ (visited 2009-03-20).

[176]

S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar and S. Borkar, An 80-tile sub-100-w teraflops processor in 65-nm CMOS, Solid-State Circuits 43(1) (2008), 29-41.

[177]

A. Varbanescu, H. Sips, K. Ross, Q. Liu, L.-K. Liu, A. Natsev and J. Smith, An effective strategy for porting C++ applications on cell, in: Intl. Conf. on Parallel Processing, IEEE Computer Society, Washington, DC, USA, 2007, p. 59.

[178]

Verilog website, http://www.verilog.com/ (visited 2009-04- 28).

[179]

D. Vianney, G. Haber, A. Heilper and M. Zalmanovici, Performance analysis and visualization tools for cell/B.E. multicore environment, in: Intl. Forum on Next-Generation Multicore/ Manycore Technologies, ACM, New York, NY, USA, 2008, pp. 1-12.

[180]

V. Volkov and J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Supercomputing, IEEE Press, Piscataway, NJ, USA, 2008, pp. 1-11.

[181]

V. Volkov and B. Kazian, Fitting FFT onto the G80 architecture, available at: http://www.cs.berkeley.edu/~kubitron/ courses/cs258-S08/projects/reports/project6_report.pdf (visited 2009-08-10).

[182]

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel, Optimization of sparse matrix-vector multiplication on emerging multicore platforms, in: Supercomputing, ACM, New York, NY, USA, 2007, pp. 1-12.

[183]

Xilinx website, http://www.xilinx.com/ (visited 2009-04-28).

[184]

M. Xu, P. Thulasiraman and R. Thulasiram, Exploiting data locality in FFT using indirect swap network on cell/BE, in: Intl. Symp. on High Performance Computing Systems and Applications, IEEE Computer Society, Washington, DC, USA, 2008, pp. 88-94.

[185]

L. Zhuo and V.K. Prasanna, High-performance designs for linear algebra operations on reconfigurable hardware, IEEE Trans. Comput. 57(8) (2008), 1057-1071.

[186]

G. Ziegler, A. Tevs, C. Theobalt and H.-P. Seidel, GPU point list generation through histogram pyramids, Technical Report MPI-I-2006-4-002, Max-Planck-Institut für Informatik, 2006.

[187]

H. Zima, P. Brezany, B. Chapman, P. Mehrotra and A. Schwald, Vienna Fortran - a language specification version 1.1, Technical Report 3, Austrian Center for Parallel Computation, 1992.

Published In

Scientific Programming, Volume 18, Issue 1, January 2010, 75 pages

Publisher

IOS Press, Netherlands

Publication History

Published: 01 January 2010
