
FastNeRF: High-Fidelity Neural Rendering at 200FPS

Stephan J. Garbin* Marek Kowalski* Matthew Johnson


Jamie Shotton Julien Valentin
Microsoft

Figure 1. FastNeRF renders high-resolution photorealistic novel views of objects at hundreds of frames per second (panels, left to right: NeRF at 0.06FPS, NeRF at 31FPS, FastNeRF at 200FPS). Comparable existing methods, such as NeRF, are orders of magnitude slower and can only render very low resolution images at interactive rates.

* Denotes equal contribution
Project website: https://microsoft.github.io/FastNeRF

Abstract

Recent work on Neural Radiance Fields (NeRF) showed how neural networks can be used to encode complex 3D environments that can be rendered photorealistically from novel viewpoints. Rendering these images is very computationally demanding and recent improvements are still a long way from enabling interactive rates, even on high-end hardware. Motivated by scenarios on mobile and mixed reality devices, we propose FastNeRF, the first NeRF-based system capable of rendering high fidelity photorealistic images at 200Hz on a high-end consumer GPU. The core of our method is a graphics-inspired factorization that allows for (i) compactly caching a deep radiance map at each position in space, (ii) efficiently querying that map using ray directions to estimate the pixel values in the rendered image. Extensive experiments show that the proposed method is 3000 times faster than the original NeRF algorithm and at least an order of magnitude faster than existing work on accelerating NeRF, while maintaining visual quality and extensibility.

1. Introduction

Rendering scenes in real-time at photorealistic quality has long been a goal of computer graphics. Traditional approaches such as rasterization and ray-tracing often require significant manual effort in designing or pre-processing the scene in order to achieve both quality and speed. Recently, neural rendering [10, 27, 17, 25, 19] has offered a disruptive alternative: involve a neural network in the rendering pipeline to output either images directly [24, 18, 27] or to model implicit functions that represent a scene appropriately [5, 23, 29, 25, 40]. Beyond rendering, some of these approaches implicitly reconstruct a scene from static or moving cameras [25, 36, 1], thereby greatly simplifying the traditional reconstruction pipelines used in computer vision.

One of the most prominent recent advances in neural rendering is Neural Radiance Fields (NeRF) [25] which, given a handful of images of a static scene, learns an implicit volumetric representation of the scene that can be rendered from novel viewpoints. The rendered images are of high quality and correctly retain thin structures, view-dependent effects, and partially-transparent surfaces. NeRF has inspired significant follow-up work that has addressed some of its limitations, notably extensions to dynamic scenes [31, 7, 46], relighting [2, 3, 37], and incorporation of uncertainty [22].

One common challenge to all of the NeRF-based approaches is their high computational requirements for rendering images. The core of this challenge resides in NeRF's volumetric scene representation. More than 100 neural network calls are required to render a single image pixel, which translates into several seconds being required to render low-resolution images on high-end GPUs. Recent explorations [35, 15, 28, 16] conducted with the aim of improving NeRF's computational requirements reduced the render time by up to 50×.

While impressive, these advances are still a long way from enabling real-time rendering on consumer-grade hardware. Our work bridges this gap while maintaining quality, thereby opening up a wide range of new applications for neural rendering. Furthermore, our method could form the fundamental building block for neural rendering at high resolutions.

To achieve this goal, we use caching to trade memory for computational efficiency. As NeRF is fundamentally a function of positions p ∈ R^3 and ray directions d ∈ R^2 to color c ∈ R^3 (RGB) and a scalar density σ, a naïve approach would be to build a cached representation of this function in the space of its input parameters. Since σ only depends on p, it can be cached using existing methodologies. The color c, however, is a function of both ray direction d and position p. If this 5 dimensional space were to be discretized with 512 bins per dimension, a cache of around 176 terabytes of memory would be required – dramatically more than is available on current consumer hardware.
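For scale, a rough count of such a cache (the bytes-per-entry figure below is our assumption; the text does not state the storage format):

    512^5 ≈ 3.5 × 10^13 cached (p, d) combinations; at ~5 bytes per combination this is ≈ 1.8 × 10^14 bytes ≈ 176 TB.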
Ideally, we would treat directions and positions separately and thus avoid the polynomial explosion in required cache space. Fortunately this problem is not unique to NeRF; the rendering equation (modulo the wavelength of light and time) is also a function of R^5 and solving it efficiently is the primary challenge of real-time computer graphics. As such there is a large body of research which investigates ways of approximating this function as well as efficient means of evaluating the integral. One of the fastest approximations of the rendering equation involves the use of spherical harmonics such that the integral results in a dot product of the harmonic coefficients of the material model and lighting model. Inspired by this efficient approximation, we factorize the problem into two independent functions whose outputs are combined using their inner product to produce RGB values. The first function takes the form of a Multi-Layer Perceptron (MLP) that is conditioned on position in space and returns a compact deep radiance map parameterized by D components. The second function is an MLP conditioned on ray direction that produces D weights for the D deep radiance map components.

This factorized architecture, which we call FastNeRF, allows for independently caching the position-dependent and ray direction-dependent outputs. Assuming that k and l denote the number of bins for positions and ray directions respectively, caching NeRF would have a memory complexity of O(k^3 l^2). In contrast, caching FastNeRF would have a complexity of O(k^3 · (1 + 3D) + l^2 · D). As a result of this reduced memory complexity, FastNeRF can be cached in the memory of a high-end consumer GPU, thus enabling very fast function lookup times that in turn lead to a dramatic increase in test-time performance.

While caching does consume a significant amount of memory, it is worth noting that current implementations of NeRF also have large memory requirements. A single forward pass of NeRF requires performing hundreds of forward passes through an eight layer 256 hidden unit MLP per pixel. If pixels are processed in parallel for efficiency this consumes large amounts of memory, even at moderate resolutions. Since many natural scenes (e.g. a living room, a garden) are sparse, we are able to store our cache sparsely. In some cases this can make our method actually more memory efficient than NeRF.

In summary, our main contributions are:

• The first NeRF-based system capable of rendering photorealistic novel views at 200FPS, thousands of times faster than NeRF.

• A graphics-inspired factorization that can be compactly cached and subsequently queried to compute the pixel values in the rendered image.

• A blueprint detailing how the proposed factorization can efficiently run on the GPU.

2. Related work

FastNeRF belongs to the family of Neural Radiance Fields methods [25] and is trained to learn an implicit, compressed deep radiance map parameterized by position and view direction that provides color and density estimates. Our method differs from [25] in the structure of the implicit model, changing it in such a way that positional and directional components can be discretized and stored in sparse 3D grids.

This also differentiates our method from models that use a discretized grid at training time, such as Neural Volumes [18] or Deep Reflectance Volumes [3]. Due to the memory requirements associated with generating and holding 3D volumes at training time, the output image resolution of [18, 3] is limited by the maximum volume size of 128^3. In contrast, our method uses volumes as large as 1024^3.

Subsequent to [18, 3], the problem of parameter estimation for an MLP with low dimensional input coordinates was addressed using Fourier feature encodings. After use in [41], this was popularized in [25, 49] and explored in greater detail by [40].

Neural Radiance Fields (NeRF) [25] first showed convincing compression of a light field utilizing Fourier features. Using an entirely implicit model, NeRF is not bound to any voxelized grid, but only to a specific domain. Despite impressive results, one disadvantage of this method is that a complex MLP has to be called for every sample along each ray, meaning hundreds of MLP invocations per image pixel.

One way to speed up NeRF's inference is to scale to multiple processors. This works particularly well because every pixel in NeRF can be computed independently. For an 800 × 800 pixel image, JaxNeRF [6] achieves an inference speed of 20.77 seconds on an Nvidia Tesla V100 GPU, 2.65 seconds on 8 V100 GPUs, and 0.35 seconds on 128 second-generation Tensor Processing Units.

Similarly, the volumetric integration domain can be split and separate models used for each part. This approach is taken in the Decomposed Radiance Fields method [35], where an integration domain is divided into Voronoi cells. This yields better test scores, and a speedup of up to 3×.

A different way to increase efficiency is to realize that natural scenes tend to be volumetrically sparse. Thus, efficiency can be gained by skipping empty regions of space. This amounts to importance sampling of the occupancy distribution of the integration domain. One approach is to use a voxel grid in combination with the implicit function learned by the MLP, as proposed in Neural Sparse Voxel Fields (NSVFs) [16] and Neural Geometric Level of Detail [38] (for SDFs only), where a dynamically constructed sparse octree is used to represent scene occupancy. As a network still has to be queried inside occupied voxels, however, NSVFs takes between 1 and 4 seconds to render an 800^2 image, with decreases to PSNR at the lower end of those timings. Our method is orders of magnitude faster in a similar scenario.

Another way to represent the importance distribution is via depth prediction for each pixel. This approach is taken in [28], which is concurrent to ours and achieves roughly 15FPS for 800^2 images at reduced quality or roughly half that for quality comparable to NeRF.

Orthogonal to this, AutoInt [15] showed that a neural network can be used to approximate the integrals along each ray with far fewer samples. While significantly faster than NeRF, this still does not provide interactive frame rates.

What differentiates our method from those described above is that FastNeRF's proposed decomposition, and subsequent caching, lets us avoid calls to an MLP at inference time entirely. This makes our method faster in absolute terms even on a single machine.

It is worth noting that our method does not address training speed, as for example [39]. In that case, the authors propose improving the training speed of NeRF models by finding initialisation through meta learning.

Finally, there are orthogonal neural rendering strategies capable of fast inference, such as Neural Point-Based Graphics [11], Mixture of Volumetric Primitives [19] or Pulsar [12], which use forms of geometric primitives and rasterization. In this work, we only deal with NeRF-like implicit functional representations paired with ray tracing.

3. Method

In this section we describe FastNeRF, a method that is 3000 times faster than the original Neural Radiance Fields (NeRF) system [25] (Section 3.1). This breakthrough allows for rendering high-resolution photorealistic images at over 200Hz on high-end consumer hardware. The core insight of our approach (Section 3.2) consists of factorizing NeRF into two neural networks: a position-dependent network that produces a deep radiance map and a direction-dependent network that produces weights. The inner product of the weights and the deep radiance map estimates the color in the scene at the specified position and as seen from the specified direction. This architecture, which we call FastNeRF, can be efficiently cached (Section 3.3), significantly improving test time efficiency whilst preserving the visual quality of NeRF. See Figure 2 for a comparison of the NeRF and FastNeRF network architectures.

3.1. Neural Radiance Fields

A Neural Radiance Field (NeRF) [25] captures a volumetric 3D representation of a scene within the weights of a neural network. NeRF's neural network F_NeRF : (p, d) ↦ (c, σ) maps a 3D position p ∈ R^3 and a ray direction d ∈ R^2 to a color value c and transparency σ. See the left side of Figure 2 for a diagram of F_NeRF.

In order to render a single image pixel, a ray is cast from the camera center, passing through that pixel and into the scene. We denote the direction of this ray as d. A number of 3D positions (p_1, · · · , p_N) are then sampled along the ray between its near and far bounds defined by the camera parameters. The neural network F_NeRF is evaluated at each position p_i and ray direction d to produce color c_i and transparency σ_i. These intermediate outputs are then integrated as follows to produce the final pixel color ĉ:

    ĉ = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i,    (1)

where T_i = exp(−Σ_{j=1}^{i−1} σ_j δ_j) is the transmittance and δ_i = (p_{i+1} − p_i) is the distance between the samples. Since F_NeRF depends on ray directions, NeRF has the ability to model viewpoint-dependent effects such as specular reflections, which is one key dimension in which NeRF improves upon traditional 3D reconstruction methods.
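For illustration, Equation (1) amounts to a few lines of NumPy; this is a sketch with our own names, not the paper's implementation:

    import numpy as np

    def integrate_ray(sigmas, colors, deltas):
        """Composite N samples along one ray following Eq. (1).
        sigmas: (N,) densities; colors: (N, 3) RGB; deltas: (N,) distances between samples."""
        alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity contributed by each segment
        # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j): fraction of light
        # that reaches sample i without being absorbed earlier along the ray.
        trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas[:-1])]))
        weights = trans * alphas
        return (weights[:, None] * colors).sum(axis=0)

    # Toy usage: a single dense red sample in the middle of the ray.
    sigmas = np.array([0.0, 50.0, 0.0])
    colors = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    deltas = np.full(3, 0.1)
    print(integrate_ray(sigmas, colors, deltas))  # ≈ [0.99, 0.00, 0.00]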
Training a NeRF network requires a set of images of a scene as well as the extrinsic and intrinsic parameters of the cameras that captured the images. In each training iteration, a subset of pixels from the training images are chosen at random and for each pixel a 3D ray is generated. Then, a set of samples is selected along each ray and the pixel color ĉ is computed using Equation (1). The training loss is the mean squared difference between ĉ and the ground truth pixel value. For further details please refer to [25].

While NeRF renders photorealistic images, it requires calling F_NeRF a large number of times to produce a single image. With the default number of samples per pixel N = 192 proposed in NeRF [25], nearly 400 million calls to F_NeRF are required to compute a single high definition (1080p) image.

Figure 2. Left: NeRF neural network architecture. (x, y, z) denotes the input sample position, (θ, φ) denotes the ray direction and (r, g, b, σ) are the output color and transparency values. Right: our FastNeRF architecture splits the same task into two neural networks that are amenable to caching. The position-dependent network F_pos outputs a deep radiance map (u, v, w) consisting of D components, while F_dir outputs the weights for those components (β_1, . . . , β_D) given a ray direction as input.

Moreover, the intermediate outputs of this process would take hundreds of gigabytes of memory. Even on high-end consumer GPUs, this constrains the original method to be executed over several batches even for medium resolution (800×800) images, leading to additional computational overheads.

3.2. Factorized Neural Radiance Fields

Taking a step away from neural rendering for a moment, we recall that in traditional computer graphics, the rendering equation [9] is an integral of the form

    L_o(p, d) = ∫_Ω f_r(p, d, ω_i) L_i(p, ω_i) (ω_i · n) dω_i,    (2)

where L_o(p, d) is the radiance leaving the point p in direction d, f_r(p, d, ω_i) is the reflectance function capturing the material properties at position p, L_i(p, ω_i) describes the amount of light reaching p from direction ω_i, and n corresponds to the direction of the surface normal at p. Given its practical importance, evaluating this integral efficiently has been a subject of active research for over three decades [9, 42, 8, 33]. One efficient way to evaluate the rendering equation is to approximate f_r(p, d, ω_i) and L_i(p, ω_i) using spherical harmonics [4, 45]. In this case, evaluating the integral boils down to a dot product between the coefficients of both spherical harmonics approximations.
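To spell out why the integral collapses to a dot product (a standard identity, written here in our own notation and treating Ω as the full sphere so that orthonormality applies exactly): if f_r(ω) ≈ Σ_m a_m Y_m(ω) and L_i(ω)(ω · n) ≈ Σ_m b_m Y_m(ω) for an orthonormal spherical harmonic basis Y_m, then

    ∫_Ω f_r(ω) L_i(ω)(ω · n) dω ≈ Σ_m Σ_{m'} a_m b_{m'} ∫_Ω Y_m(ω) Y_{m'}(ω) dω = Σ_m a_m b_m,

i.e. the dot product of the two coefficient vectors.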

With this insight in mind, we can return to neural rendering. In the case of NeRF, F_NeRF(p, d) can also be interpreted as returning the radiance leaving point p in direction d. This key insight leads us to propose a new network architecture for NeRF, which we call FastNeRF. This novel architecture consists in splitting NeRF's neural network F_NeRF into two networks: one that depends only on positions p and one that depends only on the ray directions d. Similarly to evaluating the rendering equation using spherical harmonics, our position-dependent F_pos and direction-dependent F_dir functions produce outputs that are combined using the dot product to obtain the color value at position p observed from direction d. Crucially, this factorization splits a single function that takes inputs in R^5 into two functions that take inputs in R^3 and R^2. As explained in the following section, this makes caching the outputs of the network possible and allows accelerating NeRF by 3 orders of magnitude on consumer hardware. See Figure 3 for a visual representation of the achieved speedup.

The position-dependent and direction-dependent functions of FastNeRF are defined as follows:

    F_pos : p ↦ {σ, (u, v, w)},    (3)
    F_dir : d ↦ β,    (4)

where u, v, w are D-dimensional vectors that form a deep radiance map describing the view-dependent radiance at position p. The output of F_dir, β, is a D-dimensional vector of weights for the D components of the deep radiance map. The inner product of the weights and the deep radiance map

    c = (r, g, b) = Σ_{i=1}^{D} β_i (u_i, v_i, w_i) = β^T · (u, v, w)    (5)

results in the estimated color c = (r, g, b) at position p observed from direction d.

When the two functions are combined using Equation (5), the resulting function F_FastNeRF(p, d) ↦ (c, σ) has the same signature as the function F_NeRF used in NeRF. This effectively allows for using the FastNeRF architecture as a drop-in replacement for NeRF's architecture at runtime.
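For illustration, the factorization can be sketched in a few lines of PyTorch. Layer sizes follow Section 4; the positional-encoding frequency counts and all names are our assumptions, not the authors' released code:

    import math
    import torch
    import torch.nn as nn

    def posenc(x, n_freqs):
        # NeRF-style positional encoding: sin/cos at exponentially growing frequencies.
        feats = [x]
        for i in range(n_freqs):
            feats += [torch.sin(2.0 ** i * math.pi * x), torch.cos(2.0 ** i * math.pi * x)]
        return torch.cat(feats, dim=-1)

    def mlp(d_in, d_hidden, d_out, n_layers):
        layers, d = [], d_in
        for _ in range(n_layers - 1):
            layers += [nn.Linear(d, d_hidden), nn.ReLU()]
            d = d_hidden
        return nn.Sequential(*layers, nn.Linear(d, d_out))

    class FastNeRFSketch(nn.Module):
        def __init__(self, D=8, pos_freqs=10, dir_freqs=4):
            super().__init__()
            self.D, self.pos_freqs, self.dir_freqs = D, pos_freqs, dir_freqs
            # F_pos: 8-layer, 384-unit MLP -> density sigma plus D (u, v, w) triplets.
            self.f_pos = mlp(3 + 2 * 3 * pos_freqs, 384, 1 + 3 * D, 8)
            # F_dir: 4-layer, 128-unit MLP -> D weights beta.
            self.f_dir = mlp(3 + 2 * 3 * dir_freqs, 128, D, 4)

        def forward(self, p, d):
            out = self.f_pos(posenc(p, self.pos_freqs))
            sigma = out[..., :1]
            uvw = out[..., 1:].reshape(*p.shape[:-1], self.D, 3)
            beta = self.f_dir(posenc(d, self.dir_freqs))          # (..., D)
            rgb = (beta.unsqueeze(-1) * uvw).sum(dim=-2)          # Eq. (5): inner product
            return rgb, sigma

    model = FastNeRFSketch()
    p = torch.rand(4, 3)                                          # sample positions
    d = torch.nn.functional.normalize(torch.rand(4, 3), dim=-1)   # unit view directions
    rgb, sigma = model(p, d)
    print(rgb.shape, sigma.shape)  # torch.Size([4, 3]) torch.Size([4, 1])

Because β depends only on d and the (u, v, w) components only on p, the two sub-networks can be evaluated (and cached) independently and recombined per sample with a single inner product.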

Figure 3. Speed evaluation of our method and prior work [25, 35, 16, 28] on the Lego scene from the Realistic 360 Synthetic [25] dataset, rendered at 800×800 pixels. For previous methods, when numbers for the Lego scene were not available, we used an optimistic approximation. (Measured frame rates, left to right: NeRF 0.06FPS, DeRF 0.18FPS, NSVF 1.1FPS, DONeRF 15.0FPS, ours (1k cache) 172.4FPS, ours (512 cache) 238.1FPS.)

3.3. Caching

Given the large number of samples that need to be evaluated to render a single pixel, the cost of computing F dominates the total cost of NeRF rendering. Thus, to accelerate NeRF, one could attempt to reduce the test-time cost of F by caching its outputs for a set of inputs that cover the space of the scene. The cache can then be evaluated at a fraction of the time it takes to compute F.

For a trained NeRF model, we can define a bounding box V that covers the entire scene captured by NeRF. We can then uniformly sample k values for each of the 3 world-space coordinates (x, y, z) = p within the bounds of V. Similarly, we uniformly sample l values for each of the ray direction coordinates (θ, φ) = d with θ ∈ ⟨0, π⟩ and φ ∈ ⟨0, 2π⟩. The cache is then generated by computing F for each combination of sampled p and d.

The size of such a cache for a standard NeRF model with k = l = 1024 and densely stored 16-bit floating point values is approximately 5600 Terabytes. Even for highly sparse volumes, where one would only need to keep 1% of the outputs, the size of the cache would still severely exceed the memory capacity of consumer-grade hardware. This huge memory requirement is caused by the usage of both d and p as input to F_NeRF. A separate output needs to be saved for each combination of d and p, resulting in a memory complexity in the order of O(k^3 l^2).

Our FastNeRF architecture makes caching feasible. For k = l = 1024, D = 8 the size of two dense caches holding {σ, (u, v, w)} and β would be approximately 54 GB. For moderately sparse volumes, where 30% of space is occupied, the memory requirement is low enough to fit into either the CPU or GPU memory of consumer-grade machines. In practice, the choice of k and l depends on the scene size and the expected image resolution. For many scenarios a smaller cache of k = 512, l = 256 is sufficient, lowering the memory requirements further. Please see the supplementary materials for formulas used to calculate cache sizes for both network architectures.
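These figures can be sanity-checked with a few lines of arithmetic. This is our own sketch: storage is the 16-bit value mentioned above, but the exact set of quantities stored per entry is an assumption, so the naive-cache number lands in the same petabyte range as, rather than exactly at, the ≈5600 TB quoted:

    BYTES = 2            # 16-bit values
    k = l = 1024         # position / direction bins
    D = 8                # deep radiance map components

    # Naive NeRF cache: one (r, g, b, sigma) entry per (position, direction) combination.
    nerf_entries = k ** 3 * l ** 2 * 4
    print(f"NeRF cache:     {nerf_entries * BYTES / 1e12:.0f} TB")   # ~9000 TB, petabyte scale

    # FastNeRF caches: sigma plus D (u, v, w) triplets per position, and D betas per direction.
    fast_entries = k ** 3 * (1 + 3 * D) + l ** 2 * D
    print(f"FastNeRF cache: {fast_entries * BYTES / 1e9:.0f} GB")    # ~54 GB, matching the text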
4. Implementation

Aside from using the proposed FastNeRF architecture described in Section 3.2, training FastNeRF is identical to training NeRF [25]. We model F_pos and F_dir of FastNeRF using 8 and 4 layer MLPs respectively, with positional encoding applied to inputs [25]. The 8-layer MLP has 384 hidden units, and we use 128 for the view-predicting 4-layer network. The only exception to this is the ficus scene, where we found a 256 unit MLP gave better results than a 384 unit MLP. While this makes the network larger than the baseline, cache lookup speed is not influenced by MLP complexity. Similarly to NeRF, we parameterize the view direction as a 3-dimensional unit vector. Other training-time features and settings including the coarse/fine networks and sample count N follow the original NeRF work [25] (please see supplementary materials).

At test time, both our method and NeRF take a set of camera parameters as input which are used to generate a ray for each pixel in the output. A number of samples are then produced along each ray and integrated following Section 3.1. While FastNeRF can be executed using its neural network representation, its performance is greatly improved when cached. Once the cache is generated, we convert it to a sparse data structure using OpenVDB [26], which improves performance and reduces the memory footprint. To further improve performance, we use hardware accelerated ray tracing [32] to skip empty space, starting to integrate points along the ray only after the first hit with a collision mesh derived from the density volume, and visiting every voxel thereafter until a ray's transmittance is saturated. The collision mesh can be computed from a signed distance function derived from the density volume using Marching Cubes [20]. We use the same meshing and integration parameters for all scenes and datasets; please see the supplementary for details.
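The test-time loop can be sketched as follows. This is illustrative only: it uses a dense NumPy array with nearest-neighbour lookups in place of the paper's sparse OpenVDB grids and hardware-accelerated ray tracing, and it omits the collision-mesh entry point:

    import numpy as np

    def render_pixel_cached(origin, direction, pos_cache, dir_cache, bbox_min,
                            voxel_size, step, n_steps, min_transmittance=1e-3):
        """March one ray through the cached grids and composite colors.
        pos_cache: (k, k, k, 1 + 3*D) array of sigma and (u, v, w) triplets.
        dir_cache: callable mapping a unit direction to the D beta weights
                   (a small 2D grid lookup in practice)."""
        beta = dir_cache(direction)                        # (D,) weights for this ray
        color, transmittance = np.zeros(3), 1.0
        for i in range(n_steps):
            p = origin + i * step * direction
            idx = np.floor((p - bbox_min) / voxel_size).astype(int)
            if np.any(idx < 0) or np.any(idx >= pos_cache.shape[0]):
                continue                                   # outside the cached bounding box
            entry = pos_cache[tuple(idx)]
            sigma, uvw = entry[0], entry[1:].reshape(-1, 3)
            alpha = 1.0 - np.exp(-sigma * step)
            color += transmittance * alpha * (beta @ uvw)  # Eq. (5) inside the Eq. (1) sum
            transmittance *= 1.0 - alpha
            if transmittance < min_transmittance:          # ray saturated: stop early
                break
        return color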
the baseline, cache lookup speed is not influenced by MLP To assess the rendering quality of FastNeRF we com-
complexity. Similarly to NeRF, we parameterize the view pare its outputs to GT using Peak Signal to Noise Ratio
direction as a 3-dimensional unit vector. Other training- (PSNR), Structural Similarity (SSIM) [43] and perceptual
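For reference, the PSNR values reported below follow the standard definition over the mean squared error; a small sketch with our own names, assuming images scaled to [0, 1]:

    import numpy as np

    def psnr(rendered, ground_truth, max_val=1.0):
        # PSNR = 10 * log10(MAX^2 / MSE), in decibels; higher is better.
        mse = np.mean((rendered - ground_truth) ** 2)
        return 10.0 * np.log10(max_val ** 2 / mse)

    print(psnr(np.full((8, 8, 3), 0.50), np.full((8, 8, 3), 0.53)))  # ≈ 30.5 dB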

Figure 4. Qualitative comparison of our method vs NeRF on the dataset of [25] at 800^2 pixels using 8 components (rows: Lego, Ship; columns: GT, NeRF, Ours no cache, Ours large cache, Ours small cache). Small cache refers to our method cached at 256^3, and large cache at 768^3. Varying the cache size allows for trading compute and memory for image quality, resembling levels of detail (LOD) in traditional computer graphics.

Table 1. We compare NeRF to our method when not cached to a grid, and when cached at a high resolution, in terms of PSNR, SSIM and LPIPS, and also provide the average speed in ms of our method when cached. LLFF denotes the dataset of [24] at 504 × 378 pixels; the other dataset is that of [25] at 800^2 pixels. We use 8 components and a 1024^3 cache for the NeRF Synthetic 360 scenes, and 6 components at 768^3 for LLFF. Please see the supplementary material for more detailed results.

                      NeRF                        Ours - No Cache             Ours - Cache                Speed
    Scene             PSNR↑    SSIM↑   LPIPS↓     PSNR↑     SSIM↑   LPIPS↓    PSNR↑     SSIM↑   LPIPS↓
    NeRF Synthetic    29.54dB  0.94    0.05       29.155dB  0.936   0.053     29.97dB   0.941   0.053     4.2ms
    LLFF              27.72dB  0.88    0.07       27.958dB  0.888   0.063     26.035dB  0.856   0.085     1.4ms

We use the same metrics in our ablation study, where we evaluate the effects of changing the number of components D and the size of the cache. All speed comparisons are run on one machine that is equipped with an Nvidia RTX 3090 GPU.

We keep the size of the view-dependent cache F_view^cache the same for all experiments at a base resolution of 384^3. It is the base resolution of the RGBD cache F_pos^cache that represents the main trade-off in complexity of runtime and memory vs image quality. Across all results, grid resolution refers to the number of divisions of the longest side of the bounding volume surrounding the object of interest.

Rendering quality: Because we adopt the inputs, outputs, and training of NeRF, we retain the compatibility with ray-tracing [44, 9] and the ability to only specify loose volume bounds. At the same time, our factorization and caching do not affect the character of the rendered images to a large degree. Please see Figure 4 and Figure 5 for qualitative comparisons, and Table 1 for a quantitative evaluation. Note that some aliasing artefacts can appear since our model is cached to a grid after training. This explains the decrease across all metrics as a function of the grid resolution as seen in Table 1. However, both NeRF and FastNeRF mitigate multi-view inconsistencies, ghosting and other such artifacts from multi-view reconstruction. Table 1 demonstrates that at high enough cache resolutions, our method is capable of the same visual quality as NeRF.

Figure 5. Qualitative comparison of our method vs NeRF on the dataset of [24] at 504 × 378 pixels using 6 factors (rows: Leaves, Horns; columns: GT, NeRF, Ours no cache, Ours large cache, Ours small cache). Small cache refers to our method cached at 256^3, and large cache at 768^3.

Table 2. Speed comparison of our method vs NeRF in ms. The Chair and Lego scenes are rendered at 800^2 resolution, the Horns and Leaves scenes at 504 × 378. Our method never drops below 100FPS when cached, and is often significantly faster. Please note that our method is slower when not cached due to using larger MLPs; it performs identically to NeRF when using 256 hidden units. We do not compute the highest resolution cache for LLFF scenes because these are less sparse due to the use of normalized device coordinates.

    Scene      NeRF     Ours - No Cache    256^3   384^3   512^3   768^3   1024^3   Speedup over NeRF
    Chair      17.5K    28.2K              0.8     1.1     1.4     2.0     2.7      6468× - 21828×
    Lego       17.5K    28.2K              1.5     2.1     2.8     4.2     5.6      3118× - 11639×
    Horns*     3.8K     6.2K               0.5     0.7     0.9     1.2     -        3183× - 7640×
    Leaves*    3.9K     6.3K               0.6     0.8     1.0     1.5     -        2626× - 6566×

Figure 4 and Figure 5 further show that smaller caches appear 'pixelated' but retain the overall visual characteristics of the scene. This is similar to how different levels of detail are used in computer graphics [21], and an important innovation on the path towards neurally rendered worlds.

Challenging scenarios: We theorize that highly reflective surfaces could be a challenge for our method, as β might have to change with a high frequency depending on the view direction. In Section 1.1 of the supplementary materials we discuss situations where the view-dependent network overfits to the training data and describe a simple solution.

Cache Resolution: As shown in Table 1, our method matches or outperforms NeRF on the dataset of synthetic objects at 1024^3 cache resolution. Because our method is fast enough to visit every voxel (as opposed to the fixed sample count used in NeRF), it can sometimes achieve better scores by not missing any detail. We observe that a cache of 512^3 is a good trade-off between perceptual quality, memory and rendering speed for the synthetic dataset. For the LLFF dataset, we found a 768^3 cache to work best. For the Ship scene, an ablation over the grid size, memory, and PSNR is shown in Table 3. For the view-filling LLFF scenes at 504 × 378 pixels, our method sees a slight decrease in metrics when cached at 768^3, but still produces qualitatively compelling results as shown in Figure 5, where intricate detail is clearly preserved.

Table 3. Influence of the number of components and grid resolution on PSNR and memory required for caching the Ship scene. Note how more factors increase grid sparsity. We find that 8 or 6 components are a reasonable compromise in practice.

               No Cache    256^3                384^3                512^3                768^3
    Factors    PSNR↑       PSNR↑     Memory     PSNR↑     Memory     PSNR↑     Memory     PSNR↑     Memory
    4          27.11dB     24.81dB   0.34GB     26.29dB   0.61GB     26.94dB   1.09GB     27.54dB   2.51GB
    6          27.12dB     24.82dB   0.5GB      26.34dB   0.93GB     27.0dB    1.67GB     27.58dB   4.1GB
    8          27.24dB     24.89dB   0.71GB     26.42dB   1.41GB     27.1dB    2.7GB      27.72dB   7.15GB
    16         27.68dB     25.07dB   1.2GB      26.77dB   2.08GB     27.55dB   3.72GB     28.3dB    9.16GB

Rendering Speed: When using a grid resolution of 768^3, FastNeRF is on average more than 3000× faster than NeRF at largely the same perceptual quality. See Table 2 for a breakdown of run-times across several resolutions of the cache and Figure 3 for a comparison to other NeRF extensions in terms of speed. Note that we log the time it takes our method to fill an RGBA buffer of the image size with the final pixel values, whereas the baseline implementation needs to perform various steps to reshape samples back into images. Our CUDA kernels are not highly optimized, favoring flexibility over maximum performance. It is reasonable to assume that advanced code optimization and further compression of the grid values could lead to further reductions in compute time.

Number of Components: While we can see a theoretical improvement of roughly 0.5dB going from 8 to 16 components, we find that 6 or 8 components are sufficient for most scenes. As shown in Table 3, our cache can bring out fine details that are lost when only a fixed sample count is used, compensating for the difference. Table 3 also shows that more components tend to be increasingly sparse when cached, which compensates somewhat for the additional memory usage. Please see the supplementary material for more detailed results on other scenes.

6. Application

We demonstrate a proof-of-concept application of FastNeRF for a telepresence scenario by first gathering a dataset of a person performing facial expressions for about 20s in a multi-camera rig consisting of 32 calibrated cameras. We then fit a 3D face model similar to FLAME [13] to obtain expression parameters for each frame. Since the captured scene is not static, we train a FastNeRF model of the scene jointly with a deformation model [30] conditioned on the expression data. The deformation model takes the samples p as input and outputs their updated positions that live in a canonical frame of reference modeled by FastNeRF.

Figure 6. Face images rendered using FastNeRF combined with a deformation field network [30]. Thanks to the use of FastNeRF, expression-conditioned images can be rendered at 30FPS.

We show example outputs produced using this approach in Figure 6. This method allows us to render 300×300 pixel images of the face at 30 fps on a single Nvidia Tesla V100 GPU – around 50 times faster than a setup with a NeRF model. At the resolution we achieve, the face expression is clearly visible and the high frame rate allows for real-time rendering, opening the door to telepresence scenarios. The main limitation in terms of speed and image resolution is the deformation model, which needs to be executed for a large number of samples. We employ a simple pruning method, detailed in the supplementary material, to reduce the number of samples. Finally, note that while we train this proof-of-concept method on a dataset captured using multiple cameras, the approach can be extended to use only a single camera, similarly to [7].
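As a rough sketch of how such a deformation model composes with the cached FastNeRF (illustrative only: the layer sizes, the 50-dimensional expression vector and all names below are our assumptions, not the paper's implementation):

    import torch
    import torch.nn as nn

    class DeformationField(nn.Module):
        """Maps a sample position plus per-frame expression parameters to its
        position in the canonical frame modeled by the (cached) FastNeRF."""
        def __init__(self, expr_dim=50, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + expr_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3))

        def forward(self, p, expr):
            # Predict a per-sample offset and move the sample into the canonical frame.
            return p + self.net(torch.cat([p, expr.expand(p.shape[0], -1)], dim=-1))

    deform = DeformationField()
    samples = torch.rand(1024, 3)             # ray samples in the deformed (live) frame
    expression = torch.zeros(1, 50)           # expression parameters for the current frame
    canonical = deform(samples, expression)   # positions at which to query the F_pos cache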
Since FastNeRF uses the same set of inputs and outputs as those used in NeRF, it is applicable to many of NeRF's extensions. This allows for accelerating existing approaches for: reconstruction of dynamic scenes [30, 34, 14], single-image reconstruction [39], quality improvement [35, 47] and others [28]. With minor modifications, FastNeRF can be applied to even more methods allowing for control over illumination [2] and incorporation of uncertainty [22].

7. Conclusion

In this paper we presented FastNeRF, a novel extension to NeRF that enables the rendering of photorealistic images at 200Hz and more on consumer hardware. We achieve significant speedups over NeRF and competing methods by factorizing its function approximator, enabling a caching step that makes rendering memory-bound instead of compute-bound. This allows for the application of NeRF to real-time scenarios.
References

[1] Mojtaba Bemana, Karol Myszkowski, Hans-Peter Seidel, and Tobias Ritschel. X-fields: Implicit neural view-, light- and time-image interpolation. ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2020), 39(6), 2020.
[2] Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition, 2020.
[3] Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images, 2020.
[4] Brian Cabral, Nelson Max, and Rebecca Springmeyer. Bidirectional reflection functions from surface bump maps. SIGGRAPH Comput. Graph., 21(4):273–281, Aug. 1987.
[5] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. CoRR, abs/1812.02822, 2018.
[6] Boyang Deng, Jonathan T. Barron, and Pratul P. Srinivasan. JaxNeRF: an efficient JAX implementation of NeRF, 2020.
[7] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. arXiv preprint arXiv:2012.03065, 2020.
[8] Iliyan Georgiev, Jaroslav Křivánek, Tomáš Davidovič, and Philipp Slusallek. Light transport simulation with vertex connection and merging. ACM Trans. Graph., 31(6):192:1–192:10, Nov. 2012.
[9] James T. Kajiya. The rendering equation. SIGGRAPH Comput. Graph., 20(4):143–150, Aug. 1986.
[10] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, 2019.
[11] Maria Kolos, Artem Sevastopolsky, and Victor Lempitsky. Transpr: Transparency ray-accumulating neural 3d scene point renderer, 2020.
[12] Christoph Lassner and Michael Zollhöfer. Pulsar: Efficient sphere-based neural rendering, 2020.
[13] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.
[14] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv preprint arXiv:2011.13084, 2020.
[15] David B Lindell, Julien NP Martel, and Gordon Wetzstein. Autoint: Automatic integration for fast neural volume rendering. arXiv preprint arXiv:2012.01714, 2020.
[16] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields, 2021.
[17] Stephen Lombardi, Jason M. Saragih, Tomas Simon, and Yaser Sheikh. Deep appearance models for face rendering. CoRR, abs/1808.00362, 2018.
[18] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph., 38(4):65:1–65:14, July 2019.
[19] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering, 2021.
[20] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. SIGGRAPH Comput. Graph., 21(4):163–169, Aug. 1987.
[21] David Luebke, Martin Reddy, Jonathan D. Cohen, Amitabh Varshney, Benjamin Watson, and Robert Huebner. Level of Detail for 3D Graphics. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[22] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections, 2021.
[23] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. CoRR, abs/1812.03828, 2018.
[24] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph., 38(4), July 2019.
[25] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934, 2020.
[26] Ken Museth. Vdb: High-resolution sparse volumes with dynamic topology. ACM Trans. Graph., 32(3), July 2013.
[27] Oliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, H-P Seidel, and Tobias Ritschel. Deep shading: convolutional neural networks for screen space shading. In Computer Graphics Forum, volume 36, pages 65–78. Wiley Online Library, 2017.
[28] Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Chakravarty R. Alla Chaitanya, Anton Kaplanyan, and Markus Steinberger. Donerf: Towards real-time rendering of neural radiance fields using depth oracle networks, 2021.
[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard A. Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. CoRR, abs/1901.05103, 2019.
[30] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. arXiv preprint arXiv:2011.12948, 2020.
[31] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. arXiv preprint arXiv:2011.12948, 2020.
[32] Steven G Parker, James Bigler, Andreas Dietrich, Heiko Friedrich, Jared Hoberock, David Luebke, David McAllister, Morgan McGuire, Keith Morley, Austin Robison, et al. Optix: a general purpose ray tracing engine. ACM Transactions on Graphics (TOG), 29(4):1–13, 2010.
[33] M. Pharr, W. Jakob, and G. Humphreys. Physically Based Rendering: From Theory to Implementation. Elsevier Science, 2016.
[34] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. arXiv preprint arXiv:2011.13961, 2020.
[35] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. arXiv preprint arXiv:2011.12490, 2020.
[36] Gernot Riegler and Vladlen Koltun. Stable view synthesis, 2020.
[37] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. Nerv: Neural reflectance and visibility fields for relighting and view synthesis, 2020.
[38] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3d shapes, 2021.
[39] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations, 2020.
[40] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains, 2020.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[42] Eric Veach and Leonidas J. Guibas. Optimally combining sampling techniques for monte carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '95, pages 419–428, New York, NY, USA, 1995. Association for Computing Machinery.
[43] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[44] Turner Whitted. An improved illumination model for shaded display. In ACM SIGGRAPH 2005 Courses, pages 4–es. 2005.
[45] Lifan Wu, Guangyan Cai, Shuang Zhao, and Ravi Ramamoorthi. Analytic spherical harmonic gradients for real-time rendering with many polygonal area lights. ACM Trans. Graph., 39(4), July 2020.
[46] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. arXiv preprint arXiv:2011.12950, 2020.
[47] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
[48] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CoRR, abs/1801.03924, 2018.
[49] Ellen D. Zhong, Tristan Bepler, Joseph H. Davis, and Bonnie Berger. Reconstructing continuous distributions of 3d protein structure from cryo-em images, 2020.
