Gpu IEEE Paper
Gpu IEEE Paper
Gpu IEEE Paper
...................................................................................................................................................................................................................
......
As we enter the era of GPU computing, demanding applications with substantial parallelism increasingly use the massively parallel computing capabilities of GPUs to achieve superior performance and efficiency. Today GPU computing enables applications that we previously thought infeasible because of long execution times. With the GPUs rapid evolution from a configurable graphics processor to a programmable parallel processor, the ubiquitous GPU in every PC, laptop, desktop, and workstation is a many-core multithreaded multiprocessor that excels at both graphics and computing applications. Todays GPUs use hundreds of parallel processor cores executing tens of thousands of parallel threads to rapidly solve large problems having substantial inherent parallelism. Theyre now the most pervasive massively parallel processing platform ever available, as well as the most costeffective. Using NVIDIA GPUs as examples, this article describes the evolution of GPU computing and its parallel computing model, the enabling architecture and software developments, how computing applications use CPUGPU coprocessing, example application performance speedups, and trends in GPU computing.
........................................................
56
development of increasingly parallel GPUs. Table 1 lists significant milestones in NVIDIA GPU technology development that drove the evolution of unified graphics and computing GPUs. GPU transistor counts increased exponentially, doubling roughly every 18 months with increasing semiconductor density. Since their 2006 introduction, CUDA parallel computing cores per GPU also doubled nearly every 18 months. In the early 1990s, there were no GPUs. Video graphics array (VGA) controllers generated 2D graphics displays for PCs to accelerate graphical user interfaces. In 1997, NVIDIA released the RIVA 128 3D singlechip graphics accelerator for games and 3D visualization applications, programmed with Microsoft Direct3D and OpenGL. Evolving to modern GPUs involved adding programmability incrementallyfrom fixed function pipelines to microcoded processors, configurable processors, programmable processors, and scalable parallel processors.
Early GPUs
The first GPU was the GeForce 256, a single-chip 3D real-time graphics processor introduced in 1999 that included nearly
every feature of high-end workstation 3D graphics pipelines of that era. It contained a configurable 32-bit floating-point vertex transform and lighting processor, and a configurable integer pixel-fragment pipeline, programmed with OpenGL and Microsoft DirectX 7 (DX7) APIs. GPUs first used floating-point arithmetic to calculate 3D geometry and vertices, then applied it to pixel lighting and color values to handle high-dynamic-range scenes and to simplify programming. They implemented accurate floating-point rounding to eliminate frame-varying artifacts on moving polygon edges that would otherwise sparkle at real-time frame rates. As programmable shaders emerged, GPUs became more flexible and programmable. In 2001, the GeForce 3 introduced the first programmable vertex processor that executed vertex shader programs, along with a configurable 32-bit floating-point pixel-fragment pipeline, programmed with OpenGL and DX8. The ATI Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL. The GeForce FX and GeForce 68005 featured
..............................................................
MARCH/APRIL 2010
57
...............................................................................................................................................................................................
HOT CHIPS
programmable 32-bit floating-point pixelfragment processors and vertex processors, programmed with Cg programs, DX9, and OpenGL. These processors were highly multithreaded, creating a thread and executing a thread program for each vertex and pixel fragment. The GeForce 6800 scalable processor core architecture facilitated multiple GPU implementations with different numbers of processor cores. Developing the Cg language6 for programming GPUs provided a scalable parallel programming model for the programmable floating-point vertex and pixel-fragment processors of GeForce FX, GeForce 6800, and subsequent GPUs. A Cg program resembles a C program for a single thread that draws a single vertex or single pixel. The multithreaded GPU created independent threads that executed a shader program to draw every vertex and pixel fragment. In addition to rendering real-time graphics, programmers also used Cg to compute physical simulations and other generalpurpose GPU (GPGPU) computations. Early GPGPU computing programs achieved high performance, but were difficult to write because programmers had to express nongraphics computations with a graphics API such as OpenGL.
operations. It added instructions to support C and other general-purpose languages, including integer arithmetic, IEEE 754 floating-point arithmetic, and load/store memory access instructions with byte addressing. It provided hardware and instructions to support parallel computation, communication, and synchronizationincluding thread arrays, shared memory, and fast barrier synchronization.
.............................................................
58
IEEE MICRO
GPUs. Researchers and developers have enthusiastically adopted CUDA and GPU computing for a diverse range of applications,10 publishing hundreds of technical papers, writing parallel programming textbooks,11 and teaching CUDA programming at more than 300 universities. The CUDA Zone (see http://www.nvidia.com/object/ cuda_home_new.html) lists more than 1,000 links to GPU computing applications, programs, and technical papers. The 2009 GPU Technology Conference (see http:// www.nvidia.com/object/research_summit_ posters.html) published 91 research posters. Library and tools developers are making GPU development more productive. GPU computing languages include CUDA C, CUDA C, Portland Group (PGI) CUDA Fortran, DirectCompute, and OpenCL. GPU mathematics packages include MathWorks Matlab, Wolfram Mathematica, National Instruments Labview, SciComp SciFinance, and PyCUDA. NVIDIA developed the parallel Nsight GPU development environment, debugger, and analyzer integrated with Microsoft Visual Studio. GPU libraries include C productivity libraries, dense linear algebra, sparse linear algebra, FFTs, video and image processing, and data-parallel primitives. Computer system manufacturers are developing integrated CPUGPU coprocessing systems in rackmount server and cluster configurations.
Thread
Thread block
Grid 0
Grid 1
Figure 1. The CUDA hierarchy of threads, thread blocks, and grids of blocks, with corresponding memory spaces: per-thread private local, per-block shared, and per-application global memory spaces.
on any size GPU, automatically using more parallelism on GPUs with more processor cores and threads. A CUDA program is organized into a host program, consisting of one or more sequential threads running on a host CPU, and one or more parallel kernels suitable for execution on a parallel computing GPU. A kernel executes a sequential program on a set of lightweight parallel threads. As Figure 1 shows, the programmer or compiler organizes these threads into a grid of thread blocks. The threads comprising a thread block can synchronize with each other via barriers and communicate via a high-speed, perblock shared memory. Threads from different blocks in the same grid can coordinate via atomic operations in a global memory space shared by all threads. Sequentially dependent kernel grids can synchronize via global barriers and coordinate via global shared memory. CUDA requires that thread blocks be independent, which
..............................................................
MARCH/APRIL 2010
59
...............................................................................................................................................................................................
HOT CHIPS
void saxpy(uint n, float a, float *x, float *y) { uint i; for(i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; } void serial_sample() { // Call serial SAXPY function saxpy(n, 2.0, x, y); }
__global__ void saxpy(uint n, float a, float *x, float *y) { uint i = blockIdx.x*blockDim.x + threadIdx.x; if(i < n) y[i] = a*x[i] + y[i]; } void parallel_sample() { // Launch parallel SAXPY kernel // using n/256 blocks of 256 threads each saxpy<<<ceil(n/256),256>>>(n, 2.0, x, y); }
(a)
(b)
Figure 2. Serial (a) and parallel CUDA (b) SAXPY kernels computing y a x y.
provides scalability to GPUs with different numbers of processor cores and threads. Thread blocks implement coarse-grained scalable data parallelism, while the lightweight threads comprising each thread block provide fine-grained data parallelism. Thread blocks executing different kernels implement coarse-grained task parallelism. Threads executing different paths implement fine-grained thread-level parallelism. Details of the CUDA programming model are available in the programming guide.2 Figure 2 shows some basic features of parallel programming with CUDA. It contains sequential and parallel implementations of the SAXPY routine defined by the basic linear algebra subroutines (BLAS) library. Given scalar a and vectors x and y containing n floating-point numbers, it performs the update y a x y. The serial implementation is a simple loop that computes one element of y per iteration. The parallel kernel executes each of these independent iterations in parallel, assigning a separate thread to compute each element of y. The __global__ modifier indicates that the procedure is a kernel entry point, and the extended functioncall syntax saxpy<<<B, T>>>(. . .) launches the kernel saxpy() in parallel across B blocks of T threads each. Each thread determines which element it should process from its integer thread block index blockIdx.x, its thread index within its block threadIdx.x, and the total number of threads per block blockDim.x. This example demonstrates a common parallelization pattern, where we can
transform a serial loop with independent iterations to execute in parallel across many threads. In the CUDA paradigm, the programmer writes a scalar programthe parallel saxpy() kernelthat specifies the behavior of a single thread of the kernel. This lets CUDA leverage standard C language with only a few small additions, such as built-in thread and block index variables. The SAXPY kernel is also a simple example of data parallelism, where parallel threads each produce assigned result data elements.
.............................................................
60
IEEE MICRO
L2 cache GigaThread
Tex Tex Tex Tex L1 Tex Tex Tex Tex L1 Tex Tex Tex Tex L1 Tex Tex Tex Tex L1 Tex Tex Tex Tex L1 Tex Tex Tex Tex L1 Tex Tex Tex Tex L1 Tex Tex Tex Tex L1
DRAM interface
SM
SM
SM
SM
SM
SM
SM
SM
Figure 3. Fermi GPU computing architecture with 512 CUDA processor cores organized as 16 streaming multiprocessors (SMs) sharing a common second-level (L2) cache, six 64-bit DRAM interfaces, and a host interface with the host CPU, system memory, and I/O devices. Each streaming multiprocessor has 32 CUDA cores.
fine-grained data and thread-level parallelism to execute hundreds of fine-grained threads in parallel. Application programs using the CUDA model thus scale transparently to small and large GPUs with different numbers of streaming multiprocessors and processor cores.
help cover long latency loads from DRAM memory. As each thread block completes executing its kernel program and releases its streaming multiprocessor resources, the work scheduler assigns a new thread block to that streaming multiprocessor. The PCIe host interface connects the GPU and its DRAM memory with the host CPU and system memory. The CPUGPU coprocessing and data transfers use the bidirectional PCIe interface. The streaming multiprocessor threads access system memory via the PCIe interface, and CPU threads access GPU DRAM memory via PCIe. The GPU architecture balances its parallel computing power with parallel DRAM memory controllers designed for high memory bandwidth. The Fermi GPU in Figure 3 has six high-speed GDDR5 DRAM interfaces, each 64 bits wide. Its 40-bit addresses handle up to 1 Tbyte of address space for GPU DRAM and CPU system memory for large-scale computing.
..............................................................
Host interface
DRAM interface
SM
SM
SM
SM
SM
SM
SM
SM
MARCH/APRIL 2010
61
...............................................................................................................................................................................................
HOT CHIPS
Efficient multithreading
The streaming multiprocessor implements zero-overhead multithreading and thread scheduling for up to 1,536 concurrent threads. To efficiently manage and execute this many individual threads, the multiprocessor employs the single-instruction multiplethread (SIMT) architecture introduced in the first unified computing GPU.7,8 The SIMT instruction logic creates, manages, schedules, and executes concurrent threads in groups of 32 parallel threads called warps. A CUDA thread block comprises one or more warps. Each Fermi streaming multiprocessor has two warp schedulers and two dispatch units that each select a warp and issue an instruction from the warp to 16 CUDA cores, 16 load/store units, or four SFUs. Because warps execute independently, the streaming multiprocessor can issue two warp instructions to appropriate sets of CUDA cores, load/store units, and SFUs. To support C, C, and standard single-thread programming languages, each streaming multiprocessor thread is independent, having its own private registers, condition codes and predicates, private per-thread memory and stack frame, instruction address, and thread execution state. The SIMT instructions control the execution of an individual thread, including arithmetic, memory access, and branching and control flow instructions. For efficiency, the SIMT multiprocessor issues an instruction to a warp of 32 independent parallel threads. The streaming multiprocessor realizes full efficiency and performance when all threads of a warp take the same execution path. If threads of a warp diverge at a data-dependent conditional branch, execution serializes for each branch path taken, and when all paths complete, the threads converge to the same execution path. The Fermi streaming multiprocessor extends the flexibility of the SIMT independent thread control flow with indirect branch and function-call instructions, and trap handling for exceptions and debuggers.
ECC memory
Fermi introduces ECC memory protection to enhance data integrity in large-scale GPU computing systems. Fermi ECC corrects single-bit errors and detects double-bit errors in the DRAM memory, GPU L2 cache, L1 caches, and streaming multiprocessor registers. The ECC lets us integrate thousands of GPUs in a system while maintaining a high mean time between failures (MTBF) for high-performance computing and supercomputing systems.
Streaming multiprocessor
The Fermi streaming multiprocessor introduces several architectural features that deliver higher performance, improve its programmability, and broaden its applicability. As Figure 4 shows, the streaming multiprocessor execution units include 32 CUDA processor cores, 16 load/store units, and four special function units (SFUs). It has a 64-Kbyte configurable shared memory/L1 cache, 128-Kbyte register file, instruction cache, and two multithreaded warp schedulers and instruction dispatch units.
Thread instructions
Parallel thread execution (PTX) instructions describe the execution of a single thread in a parallel CUDA program. The PTX
.............................................................
62
IEEE MICRO
Instruction cache Warp scheduler Dispatch unit Warp scheduler Dispatch unit
Core
Core
Core
Core
LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST SFU SFU SFU SFU
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Result.queue
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Interconnect network
FP = Floating point INT = Integer arithmetic logic LD/ST = Load/store SFU = Special function unit
Figure 4. The Fermi streaming multiprocessor has 32 CUDA processor cores, 16 load/store units, four special function units, a 64-Kbyte configurable shared memory/L1 cache, 128-Kbyte register file, instruction cache, and two multithreaded warp schedulers and instruction dispatch units.
instructions focus on scalar (rather than vector) operations to match standard scalar programming languages. Fermi implements the PTX 2.0 instruction set architecture (ISA), which targets C, C, Fortran, OpenCL, and DirectCompute programs. Instructions include
32-bit and 64-bit integer, addressing,
dicated instructions, branching, function calls, and indirect function calls for C virtual functions; and parallel barrier synchronization.
CUDA cores
Each pipelined CUDA core executes a scalar floating point or integer instruction per clock for a thread. With 32 cores, the streaming multiprocessor can execute up to 32 arithmetic thread instructions per clock.
and floating-point arithmetic; load, store, and atomic memory access; texture and multidimensional surface access;
..............................................................
MARCH/APRIL 2010
63
...............................................................................................................................................................................................
HOT CHIPS
The integer unit implements 32-bit precision for scalar integer operations, including 32-bit multiply and multiply-add operations, and efficiently supports 64-bit integer operations. The Fermi integer unit adds bit-field insert and extract, bit reverse, and population count.
memory spaces. A unified load/store instruction can access any of the three memory spaces, steering the access to the correct memory, which enables general C and C pointer access anywhere. Fermi provides a terabyte 40-bit unified byte address space, and the load/store ISA supports 64-bit byte addressing for future growth. The ISA also provides 32-bit addressing instructions when the program can limit its accesses to the lower 4 Gbytes of address space.
CPU+GPU coprocessing
Heterogeneous CPUGPU coprocessing systems evolved because the CPU and GPU have complementary attributes that allow applications to perform best using both types of processors. CUDA programs are coprocessing programsserial portions execute on the CPU, while parallel portions execute on the GPU. Coprocessing optimizes total application performance. With coprocessing, we use the right core for the right job. We use a CPU core (optimized for low latency on a single thread) for a codes serial portions, and we use GPU cores (optimized for aggregate throughput on a codes parallel portions) for parallel portions of code. This approach gives more performance per unit area or power than either CPU or GPU cores alone. The comparison in Table 2 illustrates the advantage of CPUGPU coprocessing using
.............................................................
64
IEEE MICRO
Table 2. CPU+GPU coprocessing execution time, assuming that a CPU core is 53 faster and 503 the area of a GPU core.
Processing time for 1 CPU Configuration Program type Parallel-intensive program Area 0.5% serial code 99.5% parallelizable code Total Mostly sequential program 75% serial code 25% parallelizable code Total 200.0 750.1 155.0 150.11 200.0 150.0 50.0 5.4 750.0 0.1 20.9 150.0 5.0 1.44 150.0 0.11 core 50 1.0 199.0 Processing time for 500 GPU cores 500 5.0 0.4 Processing time for 10 CPU cores 500 1.0 19.9 Processing time for 1 CPU core + 450 GPU cores 500 1.00 0.44
single CPU core and 450 GPU cores. Table 2 assumes that a CPU core is 5 faster and 50 the area of a GPU core numbers consistent with contemporary CPUs and GPUs. The coprocessing system devotes 10 percent of its area to the single CPU core and 90 percent of its area to the 450 GPU cores. Table 2 compares the four configurations on both a parallel-intensive program (with a serial fraction of only 0.5 percent) and on a mostly sequential program (with a serial fraction of 75 percent). The coprocessing architecture is the fastest on both programs. On the parallel-intensive program, the coprocessing architecture is slightly slower on the parallel portion than the pure GPU configuration (0.44 seconds versus 0.40 seconds) but more than makes up for this by running the tiny serial portion 5 faster (1 second versus 5 seconds). The heterogeneous architecture has an advantage over the pure throughputoptimized configuration here because serial performance is important even for mostly parallel codes.
On the mostly sequential program, the coprocessing configuration matches the performance of the multi-CPU configuration on the codes serial portion but runs the parallel portion 45 as fast, giving a slightly faster overall performance. Even on mostly sequential codes, its more efficient to run the codes parallel portion on a throughputoptimized architecture. The coprocessing architecture provides the best performance across a wide range of the serial fraction because it uses the right core for each task. By using a latencyoptimized CPU to run the codes serial fraction, it gives the best possible performance on the serial fractionwhich is important even for mostly parallel codes. By using throughput-optimized cores to run the codes parallel portion, it gives nearoptimal performance on the parallel fraction as wellwhich becomes increasingly important as codes become more parallel. Its wasteful to use large, inefficient latency-optimized cores to run parallel code segments.
Application performance
Many applications consist of a mixture of fundamentally serial control logic and inherently parallel computations. Furthermore, these parallel computations are frequently data-parallel in nature. This directly matches the CUDA coprocessing programming model, namely a sequential control thread
..............................................................
MARCH/APRIL 2010
65
...............................................................................................................................................................................................
HOT CHIPS
Field Quantum chemistry Molecular dynamics 3D computational fluid dynamics (CFD) 3D CFD Quantum physics Finite element analysis
* Elemental Technologies, Badaboom media converter, 2009; http://badaboomit.com. ** Accelereyes, Jacket GPU engine for Matlab, 2009; http://www.accelereyes.com.
capable of launching a series of parallel kernels. The use of parallel kernels launched from a sequential program also makes it relatively easy to parallelize an applications individual components rather than rewrite the entire application. Many applicationsboth academic research and industrial productshave been accelerated using CUDA to achieve significant parallel speedups.10 Such applications fall into a variety of problem domains, including quantum chemistry, molecular dynamics, computational fluid dynamics, quantum physics, finite element analysis, astrophysics, bioinformatics, computer vision, video transcoding, speech recognition, computed tomography, and computational modeling. Table 3 lists some representative applications along with the runtime speedups obtained for the whole application using CPUGPU coprocessing over CPU alone, as measured by application developers.12-22 The speedups using GeForce 8800, Tesla T8, GeForce GTX 280, Tesla T10, and GeForce GTX 285 range from 9 to more than 130, with the higher speedups reflecting applications where more of the work ran in parallel on the GPU. The lower speedups while still quite attractiverepresent
applications that are limited by the codes CPU portion, coprocessing overhead, or by divergence in the codes GPU fraction. The speedups achieved on this diverse set of applications validate the programmability of GPUsin addition to their performance. The breadth of applications that have been ported to CUDA (more than 1,000 on the CUDA Zone) demonstrates the range of the programming model and architecture. Applications with dense matrices, sparse matrices, and arbitrary pointer structures have all been successfully implemented in CUDA with impressive speedups. Similarly, applications with diverse control structures and significant data-dependent control, such as ray tracing, have achieved good performance in CUDA. Many real-world applications (such as interactive ray tracing) are composed of many different algorithms, each with varying degrees of parallelism. OptiX, our interactive ray-tracing software developers kit built in the CUDA architecture, provides a mechanism to control and schedule a wide variety of tasks on both the CPU and GPU. Some tasks are primarily serial and execute on the CPU, such as compilation, data structure management, and coordination with the operating system and user interaction.
.............................................................
66
IEEE MICRO
Other tasks, such as building an acceleration structure or updating animations, may run either on the CPU or the GPU depending on the choice of algorithms and the performance required. However, the real power of OptiX is the manner in which it manages parallel operations that are either regular or highly irregular in a fine-grained fashion, and this is executed entirely on the GPU. To accomplish this, OptiX decomposes the algorithm into a series of states. Some of these states consist of uniform operations such as generating rays according to a camera model, or applying a tone-mapping operator. Other states consist of highly irregular computation, such as traversing an acceleration structure to locate candidate geometry, or intersecting a ray with that candidate geometry. These operations are highly irregular because they require an indefinite number of operations and can vary in cost by an order of magnitude or more between rays. A user-provided program that executes within an abstract ray-tracing machine controls the details of each operation. To execute this state machine, each warp of parallel threads on the GPU selects a handful of rays that have matching state. Each warp executes in SIMT fashion, causing each ray to transition to a new state (which might not be the same for all rays). This process repeats until all rays complete their operations. OptiX can control both the prioritization of the states and the scope of the search for rays to target trade-offs in different GPU generations. Even if rays temporarily diverge to different states, theyll return to a small number of common states usually traversal and intersection. This statemachine model gives the GPU an opportunity to reunite diverged threads with other threads executing the same code, which wouldnt occur in a straightforward SIMT execution model. The net result is that CPUGPU coprocessing enables fast, interactive ray tracing of complex scenes while you watch, which is an application that researchers previously considered too irregular for a GPU.
Thus, we must use parallelism for the increased performance required to deliver more value to users. A GPU thats optimized for throughput delivers parallel performance much more efficiently than a CPU thats optimized for latency. A heterogeneous coprocessing architecture that combines a single latency-optimized core (a CPU) with many throughputoptimized cores (a GPU) performs better than either alternative alone. This is because it uses the right processor for the right job the CPU for serial sections and critical paths and the GPU for the parallel sections. Because they efficiently execute programs across a range of parallelism, heterogeneous CPUGPU architectures are becoming pervasive. In high-performance computing, technical computing, and consumer media processing, CPUGPU coprocessing has become the architecture of choice. We expect the rate of adoption of GPU computing in these beachhead areas to accelerate, and for GPU computing to spread to broader application areas. In addition to accelerating existing applications, we expect GPU computing to enable new classes of applications, such as building compelling virtual worlds and performing universal speech translation. As the GPU computing era progresses, GPU performance will continue to scale at Moores law ratesabout 50 percent per yeargiving an order of magnitude increase in performance in five to six years. At the same time, GPU architecture will evolve to further increase the span of applications that it can efficiently address. GPU cores will not become CPUsthey will continue to be optimized for throughput, rather than latency. However, they will evolve to become more agile and better able to handle arbitrary control and data access patterns. GPUs and their programming systems will also evolve to make GPUs even easier to programfurther accelerating their rate of MICRO adoption. Acknowledgments We thank Jen-Hsun Huang of NVIDIA for his Hot Chips 21 keynote23 that inspired this article, and the entire NVIDIA team that brings GPU computing to market.
PU computing is at the tipping point. Single-threaded processor performance is no longer scaling at historic rates.
..............................................................
MARCH/APRIL 2010
67
...............................................................................................................................................................................................
HOT CHIPS
....................................................................
References
1. J. Nickolls et al., Scalable Parallel Programming with CUDA, ACM Queue, vol. 6, no. 2, 2008, pp. 40-53. 2. NVIDIA, NVIDIA CUDA Programming Guide, 2009; http://developer.download.nvidi a. com /com pute/ cuda/2_3/toolkit/docs/ NVIDIA_CUDA_Programming_Guide_2.3.pdf. 3. C. Boyd, DirectCompute: Capturing the Teraflop, Microsoft Personal Developers Conf., 2009; http://ecn.channel9.msdn. com/o9/pdc09/ppt/CL03.pptx. 4. Khronos, The OpenCL Specification, 2009; http://www.khronos.org/OpenCL. 5. J. Montrym and H. Moreton, The GeForce 6800, IEEE Micro, vol. 25, no. 2, 2005, pp. 41-51. 6. W.R. Mark et al., Cg: A System for Programming Graphics Hardware in a C-like Language, Proc. Special Interest Group on Computer Graphics (Siggraph), ACM Press, 2003, pp. 896-907. 7. E. Lindholm et al., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, vol. 28, no. 2, 2008, pp. 39-55. 8. J. Nickolls and D. Kirk, Graphics and Computing GPUs, Computer Organization and Design: The Hardware/Software and J.L. Interface, D.A. Patterson CFD, Intl J. Computational Fluid Dynamics, vol. 22, no. 7, 2008, pp. 443-456. 15. T. Brandvik and G. Pullan, Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware, Proc. 46th Am. Inst. of Aeronautics and Astronautics (AIAA) Aerospace Sciences Meeting, AIAA, 2008; http:// www.aiaa.org/agenda.cfm?lumeetingid= 1065&dateget=08-Jan-08#session8907. 16. M.A. Clark et al., Solving Lattice QCD Systems of Equations Using Mixed Precision Solvers on GPUs, Computer Physics Comm., 2009; http://arxiv.org/abs/0911. 3191v2. 17. D. Go ddeke and R. Strzodka, Performance and Accuracy of Hardware-Oriented Native-, Emulated-, and Mixed-Precision Solvers in FEM Simulations (Part 2: Double Precision GPUs), Ergebnisberichte des Instituts fu r Angewandte Mathematik [Reports on Findings of the Inst. for Applied Mathematics], Dortmund Univ. of Technology, no. 370, 2008; http://www. mathematik.uni-dortmund.de/~goeddeke/ pubs/GTX280_mixedprecision.pdf. 18. R.G. Belleman, J. Bedorf, and S.P. Zwart, High Performance Direct Gravitational N-body Simulations on Graphics Processing Units II: An Implementation in CUDA, New Astronomy, vol. 13, no. 2, 2008, pp. 103-112. 19. Y. Liu, B. Schmidt, and D.L. Maskell, MSACUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA, Proc. 20th IEEE Intl Conf. ApplicationSpecific Systems, Architectures and Processors, IEEE CS Press, 2009, pp. 121-128. 20. B. Catanzaro et al., Efficient, High-Quality Image Contour Detection, Proc. IEEE Intl Conf. Computer Vision, IEEE CS Press, 2009; http://www.cs.berkeley.edu/~catanzar/ Damascene/iccv2009.pdf. 21. J. Chong et al., Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors, tech. report UCB/EECS-2008-69, Univ. of California at Berkeley, 2008; http://www.eecs.berkeley. edu/Pubs/TechRpts/2008/EECS-2008-69.pdf. 22. Y. Pan et al., Feasibility of GPU-Assisted Iterative Image Reconstruction for Mobile C-Arm CT, Proc. Intl Soc. for Photonics and Optonics (SPIE), vol. 7258, SPIE 2009; http://www.sci.utah.edu/~ypan/Pan_ SPIE2009.pdf.
Hennessy, 4th ed., Morgan Kaufmann, 2009, pp. A2-A77. 9. NVIDIA, Fermi: NVIDIAs Next Generation CUDA Compute Architecture, 2009; http:// www.nvidia.com/content/PDF/fermi_white_ papers/NVIDIA_Fermi_Compute_Architecture_ Whitepaper.pdf. 10. M. Garland et al., Parallel Computing Experiences with CUDA, IEEE Micro, vol. 28, no. 4, 2008, pp. 13-27. 11. D.B. Kirk and W.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010. 12. I.S. Ufimtsev and T.J. Martinez, Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation, J. Chemical Theory and Computation, vol. 4, no. 2, 2008, pp. 222-231. 13. M.S. Friedrichs et al., Accelerating Molecular Dynamic Simulation on Graphics Processing Units, J. Computational Chemistry, vol. 30, no. 6, 2009, pp. 864-872. 14. J. To lke and M. Krafczyk, TeraFLOP Computing on a Desktop PC with GPUs for 3D
.............................................................
68
IEEE MICRO
23. J.H. Huang, 2009: The GPU Computing Tipping Point, Proc. IEEE Hot Chips 21, 2009; http://www.hotchips.org/archives/hc21.
John Nickolls is the director of architecture at NVIDIA for GPU computing. His technical interests include developing parallel processing products, languages, and architectures. Nickolls has a PhD in electrical engineering from Stanford University. William J. Dally is the chief scientist and senior vice president of research at NVIDIA and the Willard R. and Inez Kerr Bell Professor of Engineering at Stanford University. His technical interests include parallel computer architecture, parallel
programming systems, and interconnection networks. Hes a member of the National Academy of Engineering and a fellow of IEEE, ACM, and the American Academy of Arts and Sciences. Dally has a PhD in computer science from the California Institute of Technology. Direct questions and comments about this article to John Nickolls, NVIDIA, 2701 San Tomas Expressway, Santa Clara, CA 95050; jnickolls@nvidia.com.
..............................................................
MARCH/APRIL 2010
69