State-of-the-art in heterogeneous computing

Published: 01 January 2010

Abstract

Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

Published In

Scientific Programming, Volume 18, Issue 1
January 2010
75 pages

Publisher

IOS Press

Netherlands

Author Tags

  1. Power-efficient architectures
  2. energy and power consumption
  3. microprocessor performance
  4. parallel computer architecture
  5. stream or vector architectures

Qualifiers

  • Article