A LBM solver for 3D fluid simulation on GPU
Article history: Received 24 October 2011; received in revised form 26 February 2012; accepted 13 March 2012; available online 9 April 2012.

Keywords: GPGPU; 3D Lattice-Boltzmann Methods; CUDA

Abstract
A three-dimensional Lattice-Boltzmann fluid model with nineteen discrete velocities was implemented using the NVIDIA Graphics Processing Unit (GPU) programming language Compute Unified Device Architecture (CUDA). Previous LBM GPU implementations required two steps to maximize memory bandwidth, due to the memory access restrictions of earlier versions of the CUDA toolkit and hardware capabilities. In this work, a new approach based on a single-step algorithm with a reversed collision-propagation scheme is developed to maximize GPU memory bandwidth, taking advantage of newer versions of the CUDA programming model and newer NVIDIA graphics cards. The code was tested on the numerical calculation of lid-driven cubic cavity flow at Reynolds numbers 100 and 1000, showing great precision and stability. Simulations running on low-cost GPU cards can perform 400 million cell updates per second at more than 65% of the hardware bandwidth.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
The Lattice-Boltzmann Method (LBM) is a class of cellular automata (CA) that approximates the Navier-Stokes equations to second order with an explicit collision-advection scheme [4]. Derived from the Lattice Gas Automata [2], its discrete velocities act as arcs between lattice cells whose populations are the state variables. As in any explicit CA scheme, the same simple code is run over the entire grid in every time step, making the LBM especially suitable for parallel implementations [5,15,18].
An interesting, economical alternative for parallelizing the LBM are Graphics Processing Units (GPUs) [6,16,20]. A GPU is the chip used by graphics cards to render pixels on the screen. Modern GPUs are optimized for single-instruction multiple-thread (SIMT) execution, that is, running a simple kernel over each element of a large set simultaneously, operating as a co-processor of the host CPU [12]. Several researchers have shown that the combination of GPUs and parallel CA is a valid tool to simulate fluids [6,8,16,20]. Although there have been interesting advances in this type of implementation, inter-processor data communication continues to limit calculation performance, so memory access optimization is always necessary to take good advantage of GPU processing power. In particular, in the Compute Unified Device Architecture (CUDA) language, parallel global memory calls should be coalesced to reduce latency [13].
In this article, a novel, efficient CUDA implementation of the LBM based on a single-loop pull scheme is presented. The new implementation makes use of a newer version of the CUDA language, with fewer memory alignment restrictions than those used in previous works [8]. The implementation is tested in 3D flow simulations on an NVIDIA GeForce GPU card, achieving high performance and near 65% of the theoretical hardware bandwidth.
Corresponding author. Address: Instituto PLADEMA, UNCPBA, Pinto 399, Tandil 7000, Argentina. Tel./fax: +54 249 4439690.
E-mail address: prinaldipablo@gmail.com (P.R. Rinaldi).
1569-190X/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.simpat.2012.03.004
P.R. Rinaldi et al. / Simulation Modelling Practice and Theory 25 (2012) 163-171
f_i(\vec{x} + \vec{e}_i\,\delta t,\; t + \delta t) = f_i(\vec{x}, t) - \frac{1}{\tau}\left[ f_i(\vec{x}, t) - f_i^{eq}(\vec{x}, t) \right], \qquad i = 0, 1, \ldots, 18
where the equilibrium function f_i^{eq}(\vec{x}, t) defines the macroscopic equations that the automaton simulates. To simulate the Navier-Stokes equations, BGK uses the following function [4]:
f_i^{eq} = w_i\,\rho \left[ 1 + 3\,\vec{e}_i \cdot \vec{u} + \frac{9}{2}\,(\vec{e}_i \cdot \vec{u})^2 - \frac{3}{2}\,\vec{u} \cdot \vec{u} \right]
where w_i are the weights corresponding to each lattice velocity \vec{e}_i [7]. For a 3D grid with 19 discrete velocities, d3Q19 (Fig. 1):
w_0 = \frac{1}{3}, \qquad w_i = \frac{1}{18}, \quad i = 1{:}6, \qquad w_i = \frac{1}{36}, \quad i = 7{:}18
and

\rho = \sum_i f_i, \qquad \rho\,\vec{u} = \sum_i f_i\,\vec{e}_i, \qquad \nu = \frac{2\tau - 1}{6}\,e^2\,\delta t
where e is the lattice velocity, \Delta x / \Delta t. In the present study, on-grid bounce-back boundary conditions were implemented, originally developed by Maier et al. [10], extended to second order in [21], and generalized to d3Q19 by Hecht and Harting [7]. This condition allows arbitrary boundary conditions of pressure and/or velocity.
Table 1
GPU hardware specifications [12].

                                      GeForce GTX 260
Number of stream processors           192
Global memory size (MB)               896
Memory bus width (bits)               448
Memory bandwidth (GB/s)               111.9
Estimated peak performance (Gflops)   805
Core clock (MHz)                      576
Stream processors clock (MHz)         1242
Memory clock (MHz)                    999
CUDA compute capability               1.3
Fig. 1. Space of discrete velocities e_i in the LBM d3Q19 lattice, corresponding to the population functions f_i.
4. Implementation
The optimization of data transfer is the most relevant performance issue in GPU implementations of the LBM. This is especially critical when working in three dimensions. Some basic strategies, such as using a single one-dimensional array with macro functions for data access, were used successfully in distributed CPU programming [17] and are also suitable for GPUs. However, memory access patterns, execution configurations and memory layout must be rethought due to the particularities of NVIDIA GPUs.
4.1. Algorithm
The following code shows the CUDA macro function used for main data structure access: it receives the lattice position (x, y, z), the velocity index q, and the parameter t for time-step switching. The grid and model dimensions are also parameterized. This macro defines a memory layout where the populations of consecutive x-dimension cells are stored at consecutive global memory addresses.

#define dirQ(x,y,z,q,t,dimx,dimy,dimz,Q) \
    ((x) + (y)*(dimx) + (z)*(dimx)*(dimy) \
         + (q)*(dimx)*(dimy)*(dimz) \
         + (t)*(Q)*(dimx)*(dimy)*(dimz))
To minimize the data transferred between iterations, the collision and propagation rules are executed in a single step. Propagation is performed first in each iteration loop, resulting in a pull scheme of the update process [17]. Using two data copies to hold the data of successive time steps keeps the implementation simple and avoids data coupling between steps, thus allowing parallelization.
4.2. Coalesced memory access
The effective memory bandwidth depends substantially on the access pattern. Since the global memory is slow and has no cache, accesses to it should be minimized by copying data to registers or shared memory for later processing. Provided that threads of the same block access the GPU global memory simultaneously at aligned addresses, groups of 16 threads (a half-warp) can share a single access. This mechanism is called coalescing and can substantially increase the effective memory bandwidth, subject to some restrictions. In this work, as consecutive x cells are assigned to each thread block, each thread within a block accesses consecutive global memory elements. During the advection step, thread x reads address base + x, thread x + 1 reads address base + x + 1, and so on.
The restrictions under which the accesses of 16 consecutive threads can be coalesced into a single global access depend on the CUDA version and the hardware compute capability. In earlier versions, for example, each thread had to read or write single values and the initial address of the memory block had to be a multiple of the data size [13]. Under this restriction, only the advection of particle distributions located in the same x-plane (i = 0, 2, 4, 5, 6, 12, 14, 16 and 18 in Fig. 1) can be coalesced. In turn, devices with compute capability 1.2 or above have fewer requirements for coalesced memory access when running code compiled with CUDA toolkit 3.0 or later. In these cases, coalescing is achieved for any access pattern fitting into a segment of 32 bytes for 8-bit words, 64 bytes for 16-bit words, or 128 bytes for 32- and 64-bit words [13]. All
simulations presented in this work were made using the CUDA 3.2 toolkit combined with a GeForce GTX 260 graphics board (compute capability 1.3).
Figs. 2 and 3 help explain the coalescing of 32-bit words (floats) by a half-warp in the LBM advection step for shifted x-plane particle distributions (i.e. i = 1, 3, 7, 8, 9, 10, 11, 13, 15, 17). Global memory is shown as 64-byte rows (16 floats). Two rows of the same tone represent a 128-byte aligned segment. A half-warp of threads accessing the global memory is indicated at the bottom of the figure. Fig. 2 shows an example where 16 accesses are grouped into a single one. Fig. 3 shows a case where 16 consecutive memory accesses are issued as two.
In the present LBM implementation, each half-warp memory access was issued as either a single access or two. In this way, forced alignments through extra sweeps are avoided [8,16]. An additional benefit of performing the advection step first is that misaligned memory reads are less time consuming than misaligned memory writes.
int x, y, z;
x = threadIdx.x;
y = blockIdx.x;
z = blockIdx.y;
However, the domain cannot grow beyond 512 cells in the x direction, due to the upper bound on threads per block. In practice, for cubic domains the global memory size limit is reached before that (<180 cells per dimension). Therefore, for simulations in non-cubic domains it is recommended to take x as the smallest dimension.
Fig. 2. Sixteen consecutive threads accessing a 64-byte aligned memory segment of global memory: one memory access.
Fig. 3. Sixteen consecutive threads spanning two memory segments of global memory: two memory accesses.
Fig. 4. Domain division in thread blocks.
west = 1;
east = 1;
...
if (x == 0)          // left border
    west = 0;
if (x == DIMX-1)     // right border
    east = 0;
...
f[base]      = F[dirQD(x, y, z, 0, k, DIMX, DIMY, DIMZ, Q)];
f[base + 1]  = F[dirQD(x - west, y, z, 1, k, DIMX, DIMY, DIMZ, Q)];
...
f[base + 18] = F[dirQD(x, y + north, z - back, 18, k, DIMX, DIMY, DIMZ, Q)];
Fig. 5. Lid-driven cubic cavity of side L.

Fig. 6. Profiles of the velocity components ux and uy in the cavity midplane z = 0.5 for Re = 100 (a) and 1000 (b). The curves are the LBM simulation and the dots are the calculations reported by Yang et al. [19].
velocity, Umax, in the x direction is imposed (Fig. 5). Different flow regimes are encountered depending on the Reynolds number, defined as:

Re = \frac{L\,U_{max}}{\nu}

The simulations were run until the velocity field satisfied the convergence criterion:

\frac{\sum_i \left\| \vec{U}(\vec{x}_i, t + \delta) - \vec{U}(\vec{x}_i, t) \right\|}{\sum_i \left\| \vec{U}(\vec{x}_i, t + \delta) \right\|} \leq 10^{-6}

Fig. 6 shows the velocity profiles along the cavity midplane (z = 0.5): uy along the x direction and ux along the y direction. The LBM results are compared with the solution obtained by Yang et al. [19]. Fig. 7 shows the streamlines passing through the central line in the z direction at x = 0.5 and y = 0.5, for different Reynolds numbers. It can be seen that a main vortex rotating around an axis in the z direction is formed.
6. Performance
To assess the performance of the GPU implementation presented in this work, a comparison was made against an implementation of the same LBM on a CPU using the C programming language. The most popular performance metric for LBM codes is the number of lattice updates per second, or LUPS [9]. The LBM-CPU code was implemented as a single thread
Table 2
Comparison of the performance of the LBM implementations on GPU and CPU, obtained over 1000 iterations. Bold indicates peak performance and speedup, respectively.

Domain size (cells)   CPU MLUPS   GPU blocks (threads)   GPU MLUPS   Speedup
32^3  (32768)         2.1         32 x 32 (32)           198          92
64^3  (262144)        2.1         64 x 64 (64)           245         116
96^3  (884736)        2.06        96 x 96 (96)           259         125
128^3 (2097152)       1.92        128 x 128 (128)        217         113
144^3 (2985984)       1.88        144 x 144 (144)        233         123
160^3 (4096000)       1.89        160 x 160 (160)        248         131
with one nested loop for each direction x, y and z. The grid access pattern and the memory layout were optimized for cache-based architectures following [17]. Table 2 compares the performance of each implementation, scaling the problem over domain grid sizes varying from 16^3 to 160^3 cells. The LBM-GPU implementation resulted between 92 and 130 times faster than LBM-CPU, depending on the domain size and execution configuration. Both implementations use single-precision floating-point representations.
To estimate the memory bandwidth, the number of bytes per cell transferred from/to the device global memory is calculated. In each iteration, 38 floating-point values for the distribution functions are transferred (19 reads + 19 writes), plus one integer read for the cell type, that is, 38 x 4 bytes + 1 x 2 bytes = 154 bytes; at 259 MLUPS the memory bandwidth is about 39.9 GB/s.
Further optimization strategies can be applied by sacrificing some flexibility and accuracy. For example, simulation parameters can be replaced by constants, first-order boundary conditions can be applied, and output calculation between iterations can be minimized. With these modifications a performance of 400 MLUPS was achieved, with a maximum memory bandwidth of 61.6 GB/s (>55% of the theoretical bandwidth and >65% of the real memory bandwidth measured with the CUDA test program). Standard compilation options were used in all cases. By comparison, Kuznik et al. [8] reached memory bandwidths of 69.1 GB/s, corresponding to 49% of the theoretical 147.1 GB/s of the GTX 280.
7. Conclusions
A novel CUDA implementation on GPU of the Lattice-Boltzmann Method to solve the Navier-Stokes equations in 3D was presented. The implementation makes use of a reversed advection-collision scheme, applying a simple single-loop algorithm in combination with shared memory and recent CUDA SDK versions. Numerical validation was done with lid-driven cavity flow simulations for different Reynolds numbers and grid sizes, showing good agreement with previous works. The fully parameterized code with second-order boundary conditions runs at 259 MLUPS on a GeForce GTX 260 graphics board. The tuned implementation updates 4 x 10^8 cells per second on the same hardware, at nearly 70% of the peak memory bandwidth. The speedup over a comparable CPU implementation is about two orders of magnitude.
It was shown that using the shared memory for local caching can improve performance by reducing the number of accesses to global memory. NVIDIA recommends exploiting every opportunity to replace global memory accesses with shared memory accesses [12]. However, this comes at the cost of fewer threads per multiprocessor, as well as the requirement to synchronize the threads of a block after each memory load. Nevertheless, the limitation on the number of threads per multiprocessor is compensated by the reduced number of global memory accesses. Once a multiprocessor occupancy above 50% has been reached, it generally does not pay to pursue higher occupancy ratios [13].
A particularly important issue from the point of view of applications is the implementation of boundary conditions, particularly the capability of representing irregular or dynamic frontiers. There are several ways to treat irregular boundaries with Lattice-Boltzmann, each requiring a different GPU implementation. In the present case a rather rigid treatment was chosen, consisting of classifying every boundary cell according to the boundary condition it represents and associating the corresponding update rule. Boundaries at an angle with respect to the grid directions can be treated in this way using generalized interpolation rules similar to bounce-back or slip [22]. However, the most promising way to treat irregular boundaries in a GPU implementation is the immersed boundary method [23], which represents boundaries by means of a force term in the streaming equation. This issue will be studied in future works.
Further studies can also address other optimization opportunities, such as allowing each thread to process more than one element, or using pitched memory arrays to enable non-hardware-dependent domain sizes.
References
[1] P.L. Bhatnagar, E.P. Gross, M. Krook, A model for collisional processes in gases. I: Small amplitude processes in charged and neutral one-component system, Phys. Rev. 94 (3) (1954) 511-525.
[2] J.P. Boon, J.P. Rivet, Lattice Gas Hydrodynamics, Cambridge University Press, Cambridge, 2001.
[3] S. Chen, H. Chen, D.O. Martinez, W.H. Matthaeus, Lattice Boltzmann model for simulation of magnetohydrodynamics, Phys. Rev. Lett. 67 (1991) 3776-3779.
[4] S. Chen, G.D. Doolen, Lattice Boltzmann methods for fluid flows, Annu. Rev. Fluid Mech. 30 (1998) 329-364.
[5] G.M. Crisci, R. Rongo, S. Di Gregorio, W. Spataro, The simulation model SCIARA: the 1991 and 2001 lava flows at Mount Etna, J. Volcanol. Geotherm. Res. 132 (2-3) (2004) 253-267.
[6] N. Goodnight, CUDA/OpenGL Fluid Simulation, 2011. <http://new.math.uiuc.edu/MA198-2008/schaber2/fluidsGL.pdf>.
[7] M. Hecht, J. Harting, Implementation of on-site velocity boundary conditions for D3Q19 lattice Boltzmann simulations, J. Stat. Mech. (2010) 1018-1021.
[8] F. Kuznik, C. Obrecht, G. Rusaouën, J.J. Roux, LBM based flow simulation using GPU computing processor, Comput. Math. Appl. 59 (7) (2009) 2380-2392.
[9] P. Lammers, U. Küster, Recent performance results of the lattice Boltzmann method, in: M. Resch, T. Bönisch, S. Tiyyagura, T. Furui, Y. Seo, W. Bez (Eds.), High Performance Computing on Vector Systems 2006, Part 2, Springer, Stuttgart, 2007, pp. 51-59.
[10] R.S. Maier, R.S. Bernard, D.W. Grunau, Boundary conditions for the lattice Boltzmann method, Phys. Fluids 8 (7) (1996) 1788-1802.
[11] NVIDIA CUDA Home Page, 2011. <http://developer.nvidia.com/category/zone/cuda-zone>.
[12] NVIDIA CUDA C Programming Guide Version 4.0, 2010. <http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf>.
[13] NVIDIA CUDA C Best Practices Guide Version 4.0, 2010. <http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Best_Practices_Guide.pdf>.
[14] Y.H. Qian, D. d'Humières, P. Lallemand, Lattice BGK models for Navier-Stokes equation, Europhys. Lett. 17 (1992) 479-484.
[15] P. Rinaldi, D. Dalponte, M. Vénere, A. Clausse, Cellular automata algorithm for simulation of surface flows in large plains, Simul. Model. Pract. Theory 15 (2007) 315-327.
[16] J. Tölke, Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA, Comput. Visual. Sci. 13 (1) (2010) 29-39.
[17] G. Wellein, T. Zeiser, G. Hager, S. Donath, On the single processor performance of simple lattice Boltzmann kernels, Comput. Fluids 35 (8-9) (2006) 910-919.
[18] S. Wolfram, Cellular automaton fluids. 1: Basic theory, J. Stat. Phys. 45 (3) (1986) 475-526.
[19] J.Y. Yang, S.C. Yang, Y.N. Chen, C.A. Hsu, Implicit weighted ENO schemes for the three-dimensional incompressible Navier-Stokes equations, J. Comput. Phys. 146 (1) (1998) 464-487.
[20] Y. Zhao, Lattice Boltzmann based PDE solver on the GPU, Vis. Comput. 24 (5) (2007) 323-333.
[21] Q. Zou, X. He, On pressure and velocity boundary conditions for the lattice Boltzmann BGK model, Phys. Fluids 9 (6) (1997) 1591-1598.
[22] Z. Guo, C. Zheng, An extrapolation method for boundary conditions in lattice Boltzmann method, Phys. Fluids 14 (6) (2002) 2007-2011.
[23] Y. Cheng, H. Zhang, Immersed boundary method and lattice Boltzmann method coupled FSI simulation of mitral leaflet flow, Comput. Fluids 39 (5) (2010) 871-881.