1 Introduction

The light field (LF) is a promising representation of 3D visual scenes that captures light rays traveling in every direction through every point in free space. An LF image contains rich visual information about the scene and has drawn great attention from both academia and industry. For many applications, densely sampled LFs with high angular resolution are essential for avoiding ghosting effects.

Acquiring a densely sampled LF with sufficient angular resolution is challenging in many practical cases. Conventional LF acquisition methods use camera arrays [1] or computer-controlled gantries [2] to sample the LF at different viewpoints with single or multiple exposures. However, these methods are either bulky and expensive, or time-consuming and limited to static scenes. Recently, commercial LF cameras have made LF image acquisition more convenient [3]. Unfortunately, the limited sensor resolution forces a trade-off between the spatial and angular resolution of the captured LF images.

Many studies have focused on synthesizing the required number of views from a given sparse set of images with high spatial resolution, which is called LF angular superresolution. With the recent advances in convolutional neural networks (CNNs) for visual modeling, several learning-based methods [4–11] have been proposed. A major technical challenge for LF angular superresolution is large disparity, which makes pixel correspondences difficult to find. To address this challenge, several methods use depth estimation to construct a correspondence map between input views and subsequently blend them via various methods to synthesize novel views [12, 13]. However, these methods have limitations when dealing with factors such as texture, occlusion and specular surfaces. Therefore, several methods using specific priors, such as sparsity in the transform domain, have been presented for depth-free angular superresolution [5–8, 14]. However, they often produce aliasing or blurring effects when the input LF has a large disparity. Moreover, most existing depth-based and non-depth-based methods need to be retrained for different interpolation rates, which is inconvenient in practice.

In fact, 4D LF data is highly correlated in ray space and structurally records abundant information about the scene. Therefore, a key insight for view synthesis in LF imaging is to fully exploit the inherent correlations within the input views [6, 7, 14]. This approach is particularly useful when the input views are very sparse. Another insight is to utilize the parallax structure of the LF, which manifests as distinctive patterns in epipolar plane images (EPIs) [5, 15], as shown in Fig. 1. However, for LFs with very sparse input views, each EPI can access only limited spatial and angular information, which becomes even more challenging for scenes with occlusions or non-Lambertian surfaces. Moreover, to reconstruct LFs with relatively large disparities, the receptive field of CNNs should be large enough to avoid aliasing effects, but this may result in over-smoothed edges and blur in the synthesized views. To address these issues, we propose a learning-based LF angular superresolution framework that leverages both the global spatial-angular consistency and the local parallax structure of EPIs. Specifically, we first pre-aggregate the global spatial-angular information to gather useful information from all the input views; then we extract local structural features from each EPI and interpolate its feature map to the desired angular resolution; finally, a post-aggregation is applied to ensure the spatial-angular consistency of all the synthesized views. This combination forms the “global-local-global” network (GLGNet), which ensures the full utilization of both the global and local features of the 4D LF data. We also mitigate the blur caused by the large receptive field by using a bilateral upsampling module that better preserves the LF parallax structure, and we dynamically predict its weights to enable arbitrary interpolation rates with a single model.

Figure 1

Visualization of the 4D light field \(\boldsymbol{L} (u, v, s, t)\). The bottom epipolar plane image (EPI) is a 2D \((u, s)\) slice \(\boldsymbol{L} (u, v^{*}, s, t^{*})\) obtained by setting \(v = v^{*}\) (highlighted in red) and \(t = t^{*}\), and the right EPI is a 2D \((v, t)\) slice obtained by setting \(u = u^{*}\) (highlighted in green) and \(s = s^{*}\)

The main contributions of this paper are summarized as follows:

1) We exploit both the global spatial-angular consistency and the local parallax structure of EPIs, which is beneficial for LF reconstruction from very sparse input views.

2) We propose a bilateral upsampling module that consists of dynamically predicted global spatial weights and local range weights, which can achieve arbitrary upsampling rates while better preserving the LF parallax structure.

3) Unlike other methods that train different models for different upsampling rates, our method achieves state-of-the-art performance on both real-world and synthetic scenes with a single model, which makes it more flexible in practical applications.

2 Related work

LF angular superresolution is a long-standing problem that has been studied for decades. Existing methods can be divided into two categories: those that use depth estimation, and those that do not. In this paper, we mainly focus on the methods based on deep learning, since they can usually achieve better results than the other methods.

2.1 Depth image-based view synthesis

These approaches typically first estimate the scene depth at the novel view or the input view [16–19], and then warp the input views to the novel view based on the estimated depth.

Flynn et al. [12] proposed a deep learning method to synthesize novel views from a stereo pair or a sequence of images with wide baselines. Srinivasan et al. [20] proposed synthesizing a 4D LF image from a 2D RGB image based on the estimated 4D ray depth. Kalantari et al. [4] proposed synthesizing novel views with two sequential networks that performed depth estimation and color prediction successively. These methods synthesize novel views independently and neglect their inter-view correlations. Wu et al. [8] proposed fusing a set of sheared EPIs for LF reconstruction, based on the observation that an EPI shows a clear structure when sheared by its disparity value. However, the spatial and angular information available to EPI-based models is severely limited, since each EPI is only a 2D slice of the 4D LF. Zhou et al. [21] and Mildenhall et al. [22] trained a network that inferred alpha and multiplane images, and synthesized novel views using homography and alpha composition. Jin et al. proposed a geometry-aware network that could reconstruct large-disparity LFs with a regular sampling pattern [9] or a flexible sampling pattern [23].

Typically, depth-based approaches depend heavily on the estimated depth, which is prone to errors for challenging scene features such as object boundaries, lighting reflections and thin structures. Moreover, small deviations in the estimated depth map can cause visually annoying artifacts in the synthesized views. In addition, these approaches often focus on the quality of depth estimation rather than on the quality of the synthesized views.

2.2 LF reconstruction without depth

Densely sampled LF reconstruction can be regarded as reconstructing (interpolating) the underlying plenoptic function from incomplete measurements (input views). For a sparsely sampled LF, the sampling rate is insufficient; thus, direct interpolation results in ghosting effects in the rendered views. Various priors for LF images have been used to reduce this effect [24–27]. Moreover, several methods exploit compressive LF photography [28]. However, these methods require specific patterns for sampling the input views, which makes the acquisition more difficult.

Recently, several learning-based approaches for depth-free reconstruction have also been proposed. Yoon et al. [14] super-resolved the LF image in both the spatial and angular domains sequentially using a network. However, their approach could only handle 2× angular superresolution and could not adapt to very sparsely sampled LF input. Following the idea of single image superresolution, Wu et al. [5] proposed a “blur-restoration-deblur” framework for learning-based angular detail restoration on 2D EPIs. Wang et al. [6, 29] proposed processing 3D LF volumes of stacked EPIs with 3D CNNs. Yeung et al. [7] applied a coarse-to-fine model using an encoder-like view refinement network for a larger receptive field. Fang et al. [30] combined the advantages of classical model-based methods and data-driven deep learning. Wu et al. [31] proposed a spatial-angular attention network to capture non-local correspondences in the LF. Wang et al. [10] proposed a coarse-to-fine model that used a meta-learning based upsampling module to achieve arbitrary upsampling rates. These methods can achieve good performance when the sampling baseline is relatively small, but their results deteriorate on LFs with a wider disparity range. Wang et al. [11] developed a class of domain-specific convolutions that disentangle 4D LF data along its different dimensions and designed task-specific modules for spatial and angular superresolution and disparity estimation. However, this method requires training different models for different upsampling rates, which is not flexible for practical applications.

3 The proposed method

3.1 LF representation and problem statement

In the two-plane parameterization, each light ray intersects the image plane at coordinates \((u, v)\), the spatial dimensions, and then intersects the camera plane at coordinates \((s, t)\), the angular dimensions. A densely sampled LF \(\boldsymbol{L}(u, v, s, t)\) consists of \(N \times M\) views of spatial size \(W \times H\), sampled on the angular plane with a regular 2D grid of size \(N \times M\), as shown in Fig. 1.

An EPI is a slice of the LF with a fixed spatial coordinate v and a fixed angular coordinate t (or u and s), denoted by \(\boldsymbol{E}_{v^{*}, t^{*}} (u, s)\) (or \(\boldsymbol{E}_{u^{*}, s^{*}}(v, t)\)), as illustrated in Fig. 1. It captures the relative motion between the camera and the object points. A visible scene point appears as a line in one of the EPIs, whose slope reflects its depth and whose intensity reflects its emitted light. In other words, the line pattern in the EPI represents the local parallax of the LF. Under the Lambertian reflectance model, a surface point forms a line of constant intensity in the EPI.
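For concreteness, the following minimal NumPy sketch shows how row and column EPIs are sliced from a 4D LF array; the array layout and variable names are illustrative assumptions rather than our actual implementation.

```python
import numpy as np

# Illustrative 4D LF L(u, v, s, t): spatial size W x H, angular grid N x M.
W, H, N, M = 64, 48, 7, 7
L = np.random.rand(W, H, N, M)

def row_epi(L, v_star, t_star):
    """Row EPI E_{v*,t*}(u, s): fix the spatial coordinate v and the angular coordinate t."""
    return L[:, v_star, :, t_star]          # shape (W, N)

def col_epi(L, u_star, s_star):
    """Column EPI E_{u*,s*}(v, t): fix the spatial coordinate u and the angular coordinate s."""
    return L[u_star, :, s_star, :]          # shape (H, M)

print(row_epi(L, v_star=10, t_star=3).shape)    # (64, 7)
print(col_epi(L, u_star=20, s_star=3).shape)    # (48, 7)
```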

Since the EPI is a 2D slice of the 4D LF with a clearer structure, LF angular superresolution can be seen as an image superresolution problem along the angular dimension of the EPI. Novel views can be synthesized by interpolating their pixels along the corresponding lines in all EPIs. However, this EPI angular superresolution problem is highly ill-posed due to the limited angular resolution. This makes it difficult to recover some lines in scenes with occlusions or non-Lambertian surfaces, as they are either discontinuous or non-linear. On the one hand, for a small set of input views, the available spatial and angular information of the 2D EPI is very limited. On the other hand, large disparity, occlusion and non-Lambertian effects make it difficult to find correspondences between views. As illustrated in Fig. 2, in a task with only 2 × 2 input views, the sparsely sampled EPI can only access views in one row/column, and the presence of occlusion causes the endpoints of some lines to have different values, as displayed in Fig. 2(a) (enclosed in red line), resulting in incorrect interpolation. In addition, lines in non-Lambertian regions are not exactly straight or do not have constant values, as shown in Fig. 2(b) (enclosed by the yellow dashed line), making them difficult to recover from only two endpoints.

Figure 2

Illustration of the angular superresolution problem on the epipolar plane image (EPI) and examples of the Lambertian regions with small disparity (enclosed in green line) and large disparity (enclosed in cyan line), non-Lambertian region (enclosed in yellow dashed line) and the partially occluded region (enclosed in red line): (a) sparsely sampled EPI (2 input views); (b) densely sampled EPI

3.2 Overview of the proposed framework

We propose GLGNet, a network that leverages the global spatial-angular consistency and the local parallax structure of EPIs for high-angular-resolution LF reconstruction. To aggregate both the angular and spatial information across views, we use a pseudo 4D CNN to reduce model complexity. Instead of using two 2D filters that alternately convolve on the spatial and angular dimensions of the LF, we use 3D CNNs to filter row and column EPI stacks alternately. The row/column EPIs aggregated with global information are then super-resolved along the corresponding axis. This design preserves the local parallax structure of the EPI while using all the available information from the input views. Figure 3 shows our proposed framework, which consists of two aggregation modules, one angular conversion module and one EPI angular superresolution module.
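The following PyTorch sketch illustrates this pseudo 4D filtering: a 3D convolution is applied to row EPI stacks (one \((v, u, s)\) volume per t) and then to column EPI stacks (one \((u, v, t)\) volume per s). The tensor layout, channel counts and layer counts here are illustrative assumptions and do not reproduce the exact modules in Fig. 4.

```python
import torch
import torch.nn as nn

x = torch.rand(2, 1, 32, 32, 3, 3)                       # (B, C, u, v, s, t): 3x3 input views
conv_row = nn.Conv3d(1, 16, kernel_size=3, padding=1)    # filters a (v, u, s) row-EPI volume per t
conv_col = nn.Conv3d(16, 16, kernel_size=3, padding=1)   # filters a (u, v, t) column-EPI volume per s

B, C, U, V, S, T = x.shape
# Row EPI stack: for each fixed t, convolve the 3D (v, u, s) volume.
row = x.permute(0, 5, 1, 3, 2, 4).reshape(B * T, C, V, U, S)
row = torch.relu(conv_row(row))
C2 = row.shape[1]
row = row.reshape(B, T, C2, V, U, S).permute(0, 2, 4, 3, 5, 1)   # back to (B, C', u, v, s, t)

# Column EPI stack: for each fixed s, convolve the 3D (u, v, t) volume.
col = row.permute(0, 4, 1, 2, 3, 5).reshape(B * S, C2, U, V, T)
col = torch.relu(conv_col(col))
C3 = col.shape[1]
col = col.reshape(B, S, C3, U, V, T).permute(0, 2, 3, 4, 1, 5)   # (B, C'', u, v, s, t)
print(col.shape)                                                 # torch.Size([2, 16, 32, 32, 3, 3])
```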

Figure 3

The framework of the proposed “global-local-global” network (GLGNet) for light field angular superresolution. Four modules are involved in our method. The pre-aggregation module \(A_{\mathrm{pre}}\) and the post-aggregation module \(A_{\mathrm{post}}\) combine both spatial and angular information to exploit the global spatial-angular consistency, the epipolar plane image (EPI) angular superresolution module S upsamples each angular dimension by exploiting the local parallax structure in the EPI, and the angular conversion module C converts the feature maps for further processing of the column EPIs

Given a sparsely sampled LF \(\boldsymbol{L}_{\mathrm{ss}} (u, v, s, t)\) with size \(W \times H \times n \times m\), we first perform pre-aggregation and obtain a feature map as

$$ \boldsymbol{L}_{\mathrm{feat}} (u, v, s, t) = A_{\mathrm{pre}}\bigl( \boldsymbol{L}_{\mathrm{ss}} (u, v, s, t)\bigr), $$
(1)

where \(A_{\mathrm{pre}}(\cdot )\) is the pre-aggregation module. It alternates between three 3D CNN layers on the row EPI stacks and two 3D CNN layers on the column EPI stacks to reconstruct the residual map, as illustrated in Fig. 4(a).

Figure 4

The proposed GLGNet uses three 3D convolutional neural network modules

Then, we super-resolve the angular dimension s. For each row \(t^{*}\), \(t^{*} = 0, 1, \ldots , m - 1\), by fixing one spatial dimension \(v = v^{*}\), \(v^{*} = 0, 1, \ldots , H-1\), we extract the 2D row EPI

$$ \boldsymbol{E}_{v^{*}, t^{*}} (u, s) = \boldsymbol{L}_{\mathrm{feat}} \bigl(u, v^{*}, s, t^{*}\bigr), $$
(2)

with a size of \(W \times n\). The EPI angular superresolution module \(S(\cdot )\) interpolates each row EPI to the desired angular resolution and assembles them into the partially upsampled feature map

$$ \boldsymbol{L}_{\mathrm{feat}} \bigl(u, v^{*}, s, t^{*}\bigr) \uparrow s = S\bigl( \boldsymbol{E}_{v^{*}, t^{*}} (u, s), f_{\mathrm{row}} \bigr), $$
(3)

with a size of \(W \times H \times N \times m\), where ↑ indicates upsampling along the specified dimension, and \(f_{\mathrm{row}} = (N - 1) / (n - 1)\) is an arbitrary interpolation rate.

Next, we use the angular conversion module \(C(\cdot )\) to refine \(\boldsymbol{L}_{\mathrm{feat}}(u, v, s, t) \uparrow s\), transfer the angular dimension from s to t, and further extract the feature map

$$ \boldsymbol{L}'_{\mathrm{feat}} (u, v, s, t) = C\bigl( \boldsymbol{L}_{ \mathrm{feat}} (u, v, s, t) \uparrow s\bigr), $$
(4)

with a size of \(W \times H \times N \times m\). We use residual blocks with three 3D CNN layers for both row EPI stack refinement and column EPI stack feature extraction, as demonstrated in Fig. 4(b).

Similarly, we extract the 2D column EPI by fixing one angular dimension \(s = s^{*}\), \(s^{*} = 0, 1, \ldots , N-1\) and one spatial dimension \(u = u^{*}\), \(u^{*} = 0, 1, \ldots , W-1\) as

$$ \boldsymbol{E}_{u^{*}, s^{*}} (v, t) = \boldsymbol{L}'_{\mathrm{feat}} \bigl(u^{*}, v, s^{*}, t\bigr), $$
(5)

with a size of \(H \times m\), which is further interpolated to the desired angular resolution by the EPI angular superresolution module \(S(\cdot )\), and assembled into the fully upsampled feature map

$$ \boldsymbol{L}'_{\mathrm{feat}} \bigl(u^{*}, v, s^{*}, t\bigr) \uparrow t = S\bigl( \boldsymbol{E}_{u^{*}, s^{*}} (v, t), f_{\mathrm{col}}\bigr), $$
(6)

with a size of \(W \times H \times N \times M\), where \(f_{\mathrm{col}} = (M - 1) / (m - 1)\) is an arbitrary interpolation rate that can differ from \(f_{\mathrm{row}}\). At this point, the input LF has been upsampled to the desired angular resolution.
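For concreteness, the interpolation rates and intermediate feature-map shapes for the \(2 \times 2\) to \(7 \times 7\) task work out as follows (the \(512 \times 512\) spatial size is only illustrative):

```python
# Shape bookkeeping for Eqs. (1)-(6) on the 2x2 -> 7x7 task.
W, H, n, m, N, M = 512, 512, 2, 2, 7, 7
f_row = (N - 1) / (n - 1)        # 6.0, row-EPI interpolation rate
f_col = (M - 1) / (m - 1)        # 6.0, column-EPI interpolation rate
print((W, H, n, m))              # input L_ss:                (512, 512, 2, 2)
print((W, H, N, m))              # after the row-EPI pass:    (512, 512, 7, 2)
print((W, H, N, M))              # after the column-EPI pass: (512, 512, 7, 7)
```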

Finally, after applying the post-aggregation module \(A_{\mathrm{post}} (\cdot )\) to \(\boldsymbol{L}'_{\mathrm{feat}} (u, v, s, t) \uparrow t\), we obtain the super-resolved LF as

$$ \boldsymbol{L}_{\mathrm{sr}} (u, v, s, t) = A_{\mathrm{post}}\bigl( \boldsymbol{L}'_{\mathrm{feat}} (u, v, s, t) \uparrow t\bigr). $$
(7)

The post-aggregation module also employs a residual block that alternates between three 3D CNN layers on column EPI stacks and two 3D CNN layers on row EPI stacks, as demonstrated in Fig. 4(c).

As shown in Fig. 4, except for the one before the skip connection, all the 3D CNNs mentioned above use \(3 \times 3 \times 3\) kernels with a PReLU activation layer. We also adopt a learnable architecture for the EPI angular superresolution module, which enables end-to-end training of the proposed framework.

3.3 EPI superresolution with bilateral upsampling

To super-resolve EPIs, we treat them as image superresolution problems along the angular dimension. Our EPI superresolution module first applies a residual dense network (RDN) [32] to extract features from the input EPI, and then upsamples the feature map with an arbitrary interpolation rate. In this section, we describe the bilateral upsampling module in detail.

Given a sparsely sampled EPI \(\boldsymbol{E}_{\mathrm{ss}}\), which can be considered a downsampled version of its densely sampled counterpart \(\boldsymbol{E}_{\mathrm{ds}}\), the task of the EPI angular superresolution module is to generate an angular super-resolved EPI \(\boldsymbol{E}_{\mathrm{sr}}\) whose ground truth is \(\boldsymbol{E}_{\mathrm{ds}}\). Let \(\boldsymbol{E}_{\mathrm{feat}}\) denote the feature extracted from \(\boldsymbol{E}_{\mathrm{ss}}\). Consider a pixel at \((i,j)\) in \(\boldsymbol{E}_{\mathrm{sr}}\); it corresponds to a filtering output at the subpixel location \((i, \mathrm{round} (j/f) )\) in \(\boldsymbol{E}_{\mathrm{feat}}\), where j is the angular dimension, f is the interpolation rate, and \(\mathrm{round}(\cdot )\) denotes the rounding operation. Therefore, each feature patch centered at \((i', j')\) in \(\boldsymbol{E}_{\mathrm{feat}}\), except for the first and last ones, determines f pixels in \(\boldsymbol{E}_{\mathrm{sr}}\), located at \((i', f \cdot j' + k )\), \(k \in [ - \lfloor (f-1)/2 \rfloor , \lceil (f-1)/2 \rceil ]\) with \(k \in \mathbb{Z}\). We employ subpixel convolution [33] to implement this operation as

$$ \boldsymbol{E}_{\mathrm{sr}} = \mathcal{PS} (\boldsymbol{W} * \boldsymbol{E}_{\mathrm{feat}}), $$
(8)

where \(\mathcal{PS}(\cdot )\) is a pixel shuffling operator that upsamples only the angular dimension, and \(\boldsymbol{W}\) is the filter weight.
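A minimal PyTorch sketch of Eq. (8) is given below; it uses a fixed integer rate f and illustrative channel counts, and demonstrates only the angular pixel shuffle, not our full bilateral upsampling module.

```python
import torch
import torch.nn as nn

f, inC, W_sp, n = 3, 64, 32, 3                 # rate, feature channels, spatial size, input views
E_feat = torch.rand(1, inC, W_sp, n)           # (B, C, u, s) row-EPI features

conv = nn.Conv2d(inC, f, kernel_size=3, padding=1)   # W * E_feat: f output maps, one per subpixel

def angular_pixel_shuffle(x, f):
    """Rearrange the f channel maps into f consecutive positions along the angular axis."""
    B, F, U, S = x.shape
    assert F == f
    return x.permute(0, 2, 3, 1).reshape(B, 1, U, S * f)   # (B, 1, u, s*f)

E_sr = angular_pixel_shuffle(conv(E_feat), f)
print(E_sr.shape)   # torch.Size([1, 1, 32, 9])
```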

For different interpolation rates, both the number of filters and the weights of the filters are different. To perform multiple interpolation rates with a single model, we need to dynamically predict the filters for each interpolation rate. The weights can be predicted based on the subpixel location and the interpolation rate f [34]. Unlike natural images, EPIs have unique line patterns that may be over-smoothed by CNNs with large receptive fields, especially when sampled with large baselines, as shown in Fig. 1. To address this problem, we introduce edge-preserving bilateral filtering in our bilateral upsampling module. As illustrated in Fig. 5, the weights of this module depend on both the subpixel coordinates and the range differences. The module consists of two main steps: weight prediction and subpixel filtering. The weight prediction step predicts the weights of the local filters for each feature patch in \(\boldsymbol{E}_{\mathrm{feat}}\). The subpixel filtering step applies the predicted weights to filter the feature in \(\boldsymbol{E}_{\mathrm{feat}}\) and calculate the values of the subpixels, i.e., the pixels in \(\boldsymbol{E}_{\mathrm{sr}}\). We will explain these steps in detail below.

Figure 5

The flowchart of the proposed bilateral upsampling module, which predicts the weights of local filters based on the subpixel offset, interpolation rate and feature similarity. By performing subpixel filtering, the densely sampled epipolar plane image (EPI) is reconstructed. FC refers to fully connected layer

Weight prediction

Instead of using a typical upsampling module that predefines the number of filters for each interpolation rate and learns their weights from the training dataset, we train a network that dynamically predicts the filter weights from both the subpixel offset and the range difference. For the feature \(\boldsymbol{E}_{\mathrm{feat}} (i', j')\), the filter weights \(\boldsymbol{W}_{i', j'}\) consist of two parts: the global spatial weights \(\boldsymbol{W}^{\mathrm{s}}\) and the local range weights \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\). The global spatial weights depend only on the relative spatial offsets of the subpixels, so they can be predicted one by one via meta-learning, whereas the local range weights are computed dynamically from feature similarity. The bilateral upsampling module can therefore generate subpixel filters for any interpolation rate, which enables arbitrary-rate upsampling.

The global spatial weights \(\boldsymbol{W}^{\mathrm{s}}\) are predicted as

$$ \boldsymbol{W}^{\mathrm{s}} = \varphi (\boldsymbol{v}_{f}), $$
(9)

where \(\varphi (\cdot )\) is the spatial weight prediction network, and \(\boldsymbol{v}_{f}\) is a vector containing the relative offset of the subpixels and the interpolation rate f:

$$ \boldsymbol{v}_{f} = \biggl( \frac{k}{f}, \frac{1}{f} \biggr), $$
(10)

where \(k \in [ - \lfloor (f-1)/2 \rfloor , \lceil (f-1)/2 \rceil ]\), and \(k \in \mathbb{Z}\).
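The following sketch shows one way such a prediction network can be realized: a small MLP (two fully connected layers with 256 hidden neurons, following the architecture described below) maps each offset vector \(\boldsymbol{v}_{f}\) to one \(\mathit{inC} \times ks \times ks\) filter, so the number of predicted filters equals f. Everything beyond these layer sizes is an illustrative assumption.

```python
import torch
import torch.nn as nn

inC, ks = 64, 3
phi = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, inC * ks * ks))

def spatial_weights(f):
    """Predict one inC x ks x ks filter per subpixel offset k for an integer rate f."""
    lo, hi = -((f - 1) // 2), f // 2                 # k in [-floor((f-1)/2), ceil((f-1)/2)]
    v_f = torch.tensor([[k / f, 1.0 / f] for k in range(lo, hi + 1)], dtype=torch.float32)
    return phi(v_f).reshape(-1, inC, ks, ks)         # (outC = f, inC, ks, ks)

print(spatial_weights(3).shape)   # torch.Size([3, 64, 3, 3])
print(spatial_weights(6).shape)   # torch.Size([6, 64, 3, 3])
```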

Moreover, the local range weights \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\) are predicted via a strategy similar to self-attention [35], which is formulated as

$$ \boldsymbol{W}_{i', j'}^{\mathrm{r}} (\boldsymbol{x}) = \operatorname{softmax} \bigl( \theta \bigl(\boldsymbol{E}_{\mathrm{feat}} \bigl(i', j'\bigr)\bigr) \cdot \phi \bigl( \boldsymbol{E}_{\mathrm{feat}} (\boldsymbol{x})\bigr) \bigr), $$
(11)

where \(\boldsymbol{x}\) is the coordinate of a feature within a window \(\Omega _{i', j'}\) centered at \((i', j')\), softmax denotes the softmax operation, and \(\theta (\cdot )\) and \(\phi (\cdot )\) are embedding functions.
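A sketch of this computation is given below: \(1 \times 1\) convolutions embed the features, and the dot products between the embedding of each position and those inside a \(ks \times ks\) window are normalized by a softmax. The channel counts follow the architecture described below; the rest is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

inC, embC, ks = 64, 16, 3
theta = nn.Conv2d(inC, embC, kernel_size=1)    # embedding theta(.)
phi_e = nn.Conv2d(inC, embC, kernel_size=1)    # embedding phi(.)

def range_weights(E_feat):
    B, _, U, S = E_feat.shape
    q = theta(E_feat)                                        # (B, embC, U, S) query per window center
    k = F.unfold(phi_e(E_feat), ks, padding=ks // 2)         # (B, embC*ks*ks, U*S) keys per window
    k = k.reshape(B, embC, ks * ks, U, S)
    sim = (q.unsqueeze(2) * k).sum(dim=1)                    # dot products, (B, ks*ks, U, S)
    return torch.softmax(sim, dim=1).reshape(B, ks, ks, U, S)

E_feat = torch.rand(1, inC, 32, 3)                           # row-EPI features (B, C, u, s)
print(range_weights(E_feat).shape)                           # torch.Size([1, 3, 3, 32, 3])
```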

Next, we compute the final weights \(\boldsymbol{W}_{i', j'}\) for each feature patch centered at \((i', j')\) by taking the elementwise product of the predicted global spatial weights \(\boldsymbol{W}^{\mathrm{s}}\) and the local range weights \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\), as shown in Eq. (12). Both \(\boldsymbol{W}^{\mathrm{s}}\) and \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\) are broadcast to the same size before multiplication.

$$ \boldsymbol{W}_{i', j'} = \boldsymbol{W}^{\mathrm{s}} \circ \boldsymbol{W}_{i', j'}^{\mathrm{r}}, $$
(12)

where “∘” denotes the elementwise (Hadamard) product.

Subpixel filtering

We filter the features \(\boldsymbol{E}_{\mathrm{feat}}\) with the predicted weights \(\boldsymbol{W}\) to obtain the pixel values of the densely sampled EPI \(\boldsymbol{E}_{\mathrm{sr}}\). Motivated by the fact that the residuals of natural images and videos are more compressible, we use a skip connection to add \(\boldsymbol{E}_{\mathrm{feat}}\) to the output of the filtering operation, as shown in Eq. (13). We also use a \(1 \times 1\) convolution layer \(g(\cdot )\) to adjust the number of feature maps of \(\boldsymbol{E}_{\mathrm{feat}}\) and tile them according to their center pixel before adding them to the output. This ensures that both sides of the addition have the same size. Finally, we remove the first \(\lfloor (f-1)/2 \rfloor \) and the last \(\lceil (f-1)/2 \rceil \) pixels in the angular dimension so that \(N = (n - 1) f + 1\) holds (taking the row EPI \(\boldsymbol{E}_{v^{*}, t^{*}} (u, s)\) as an example; this applies to the following text as well).

$$ \boldsymbol{E}_{\mathrm{sr}} = g(\boldsymbol{E}_{\mathrm{feat}}) + \mathcal{PS} (\boldsymbol{W} * \boldsymbol{E}_{\mathrm{feat}}). $$
(13)
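The following sketch combines Eqs. (12) and (13) for a single row EPI: the two weight tensors are broadcast and multiplied, each feature patch is filtered with its own kernel, the f subpixels are shuffled into the angular axis, and the border subpixels are trimmed so that \((n-1) f + 1\) angular samples remain. The skip term \(g(\boldsymbol{E}_{\mathrm{feat}})\) is omitted, and all shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bilateral_upsample(E_feat, W_s, W_r, f):
    """E_feat: (B, inC, U, n) row-EPI features; W_s: (f, inC, ks, ks); W_r: (B, ks, ks, U, n)."""
    B, inC, U, n = E_feat.shape
    ks = W_s.shape[-1]
    patches = F.unfold(E_feat, ks, padding=ks // 2).reshape(B, inC, ks, ks, U, n)
    # Eq. (12): W = W_s o W_r, broadcast to (B, outC, inC, ks, ks, U, n).
    W = W_s[None, :, :, :, :, None, None] * W_r[:, None, None, :, :, :, :]
    out = (W * patches[:, None]).sum(dim=(2, 3, 4))           # (B, f, U, n): f subpixels per position
    out = out.permute(0, 2, 3, 1).reshape(B, 1, U, n * f)     # angular pixel shuffle
    lo, hi = (f - 1) // 2, f // 2                             # border subpixels to remove
    # The skip term g(E_feat) of Eq. (13) is omitted in this sketch.
    return out[..., lo: out.shape[-1] - hi]                   # (B, 1, U, (n - 1) * f + 1)

E_feat = torch.rand(1, 64, 32, 3)                             # 3 input views per row
W_s = torch.rand(3, 64, 3, 3)                                 # from the spatial weight branch, f = 3
W_r = torch.rand(1, 3, 3, 32, 3)                              # from the range weight branch
print(bilateral_upsample(E_feat, W_s, W_r, f=3).shape)        # torch.Size([1, 1, 32, 7])
```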

Network architecture

For feature extraction, we use an RDN with 8 residual dense blocks (RDBs), each containing 8 2D CNN layers. The growth rate is 64.

For global spatial weight prediction, we use the multilayer perceptron from Ref. [34], which has 2 fully connected layers and 256 hidden neurons. The output is a tensor of size \(\mathit{outC} \times \mathit{inC} \times ks \times ks\), where \(\mathit{inC} = 64\) is the number of feature maps of \(\boldsymbol{E}_{\mathrm{feat}}\), \(\mathit{outC} = f\) is the number of subpixels for the luminance Y channel in the YCbCr color space, and \(ks = 3\) is the filter size. To predict the local range weights, we use \(1 \times 1\) convolutions as the embedding functions \(\theta (\cdot )\) and \(\phi (\cdot )\) with 16 output channels. The output is a tensor of size \(W \times n \times ks \times ks\). Therefore, the final weight has a size of \(W \times n \times \mathit{outC} \times \mathit{inC} \times ks \times ks\).

3.4 Training details

Since we use a deep network with a large receptive field to handle scenes with a large disparity range, simply minimizing the \(L_{2}\) distance between the synthesized views and the reference views can cause blur in the synthesized views. Instead, we use the feature similarity [36] as part of our loss function. We use the features from the conv1_2, conv2_2 and conv3_3 layers of a pre-trained VGG-16 network [37], and apply the per-view weighting method from Ref. [6] to improve the visual quality of the synthesized views. We also use the EPI structure preserving loss from Ref. [29] to ensure cross-view consistency.

We define the total loss function \(\mathcal{L}\) in this paper as a combination of the weighted feature similarity term \(\mathcal{L}_{\mathrm{vgg}}\) and the EPI structure preserving term \(\mathcal{L}_{\mathrm{esp}}\):

$$ \mathcal{L} = \mathcal{L}_{\mathrm{vgg}} + \lambda \cdot \mathcal{L}_{ \mathrm{esp}}, $$
(14)

where λ is a hyperparameter that controls the relative weight and is empirically set to 100.
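A minimal sketch of Eq. (14) is given below. The VGG-16 taps are the ReLU outputs of conv1_2/conv2_2/conv3_3 in torchvision's layer indexing (an assumption about the exact tap points), the per-view weighting of Ref. [6] and the ImageNet input normalization are omitted, and the EPI structure preserving term of Ref. [29] is replaced by a simple angular-gradient stand-in.

```python
import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad = False
taps = {3: 'conv1_2', 8: 'conv2_2', 15: 'conv3_3'}   # ReLU indices in vgg16.features

def vgg_features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in taps:
            feats.append(x)
        if i >= max(taps):
            break
    return feats

def total_loss(L_sr, L_gt, lam=100.0):
    """L_sr, L_gt: (B, N, M, H, W) luminance view volumes (assumed layout)."""
    sr = L_sr.reshape(-1, 1, *L_sr.shape[-2:]).repeat(1, 3, 1, 1)   # views -> batch, Y -> 3 channels
    gt = L_gt.reshape(-1, 1, *L_gt.shape[-2:]).repeat(1, 3, 1, 1)
    l_vgg = sum(nn.functional.l1_loss(a, b) for a, b in zip(vgg_features(sr), vgg_features(gt)))
    # Stand-in EPI structure term: match first-order differences along both angular axes.
    l_esp = nn.functional.l1_loss(L_sr.diff(dim=1), L_gt.diff(dim=1)) + \
            nn.functional.l1_loss(L_sr.diff(dim=2), L_gt.diff(dim=2))
    return l_vgg + lam * l_esp
```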

In the training procedure, we use 20 synthetic LF images from the 4D light field benchmark [38] and 100 real-world LF images provided by the Stanford Lytro LF archive [39] and Kalantari et al. [4]. We remove the border views and crop the LF images to \(7 \times 7\) views as the ground truth, and randomly downsample them to \(2 \times 2\) or \(3 \times 3\) views as the input. Similar to other methods, we process only the luminance Y channel in the YCbCr color space. We train the network end-to-end using the adaptive moment estimation (ADAM) optimizer with \(\beta _{1} = 0.9\), \(\beta _{2} = 0.999\) and a batch size of 8 (via gradient accumulation). During training, small patches at the same position in each view are randomly extracted. To speed up training, we first train with \(32 \times 32\) patches until convergence, and then with \(64 \times 64\) patches. The learning rates for both patch sizes are initially set to \(1.0 \times 10^{-4}\) and then decreased by a factor of 0.5 every \(1.0 \times 10^{3}\) epochs until convergence.
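The optimization setup can be sketched as follows; the model, the random patches and the placeholder loss are stand-ins, and only the ADAM hyperparameters, learning-rate schedule and gradient accumulation reflect the settings described above.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)          # placeholder for GLGNet
optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)  # halve LR every 1e3 epochs

accum_steps = 8                                      # effective batch size of 8 via accumulation
for epoch in range(2):                               # shortened loop for illustration
    optimizer.zero_grad()
    for _ in range(accum_steps):
        patch = torch.rand(1, 1, 32, 32)             # random 32x32 patch stands in for training data
        loss = (model(patch) - patch).abs().mean()   # placeholder loss
        (loss / accum_steps).backward()              # accumulate gradients over 8 patches
    optimizer.step()
    scheduler.step()
```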

4 Evaluation

4.1 Performance evaluation

We compare our method with 8 state-of-the-art learning-based LF angular superresolution methods, including Kalantari et al. [4], Wu et al. [5], Wang et al. [6], Yeung et al. [7], Wu et al. [8], Jin et al. [9], Wang et al. [10] and Wang et al. [11]. For evaluation, we use 48 synthetic LF images, including 4 from the HCI [38] dataset, 5 from the old HCI [40] dataset and 39 from the Inria DLFD [41] dataset, as well as 70 real-world LF images captured with a Lytro Illum camera, including 30 from the 30 Scenes [4] dataset, 15 from the Reflective [39] dataset and 25 from the Occlusions [39] dataset. These datasets cover several important factors in evaluating LF angular superresolution methods, including high-frequency texture, natural illumination, practical camera distortion, large disparity, occlusions and reflective surfaces. We use the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) averaged over all synthesized views to evaluate all the methods.
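For reference, a minimal sketch of the view-averaged PSNR computation on the luminance channel is given below; the peak value and array layout are assumptions.

```python
import numpy as np

def psnr(y_ref, y_syn, peak=1.0):
    """PSNR between two luminance views with values in [0, peak]."""
    mse = np.mean((y_ref.astype(np.float64) - y_syn.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mean_view_psnr(lf_ref, lf_syn):
    """Average PSNR over the (N, M, H, W) synthesized views of a luminance LF."""
    N, M = lf_ref.shape[:2]
    return np.mean([psnr(lf_ref[s, t], lf_syn[s, t]) for s in range(N) for t in range(M)])

lf_gt = np.random.rand(7, 7, 64, 64)
lf_sr = lf_gt + 0.01 * np.random.randn(7, 7, 64, 64)
print(round(mean_view_psnr(lf_gt, lf_sr), 2))      # roughly 40 dB for noise of std 0.01
```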

To the best of our knowledge, our method is one of the few that can handle different interpolation rates with a single model, whereas most other methods need to train different models for different interpolation rates. Therefore, for a fair comparison, we also fine-tune our model for the different tasks on the training dataset. The fine-tuned model is denoted as \(\mathrm{Ours}_{\mathrm{ft}}\).

Real-world scenes

For real-world scenes, we reconstruct \(7 \times 7\) LFs from both \(3 \times 3\) and \(2 \times 2\) sparse views, with the sampling patterns illustrated in Fig. 6. We first reconstruct \(7 \times 7\) LFs from \(3 \times 3\) sparse views and compare with Wu et al. [5], Wang et al. [6], Yeung et al. [7] and Wang et al. [10]. For the \(2 \times 2\) to \(7 \times 7\) task, we compare with the methods of Kalantari et al. [4], Yeung et al. [7], Wu et al. [8], Jin et al. [9], Wang et al. [10] and Wang et al. [11].

Figure 6

Illustration of sampling patterns: (a) \(3 \times 3\); (b) \(2 \times 2\)

Table 1 lists the results on the real-world datasets. The performance of most methods decreases as the input becomes sparser. Among them, the EPI-based methods of Wu et al. [5] and Wu et al. [8] perform worse than the others, possibly due to the limited spatial and angular information of the 2D EPI. Our proposed model outperforms the comparison methods on all datasets under the \(3 \times 3\) to \(7 \times 7\) task, and under the \(2 \times 2\) to \(7 \times 7\) task it is second only to Wang et al. [11]. Like our method, Wang et al. [11] also focus on exploiting the spatial and angular relationships between the input views, which demonstrates the effectiveness of this strategy. The non-depth-based method of Yeung et al. [7] achieves the third-best results for the same reason.

Table 1 Quantitative comparisons (PSNR/SSIM) of different methods on real-world light field datasets for \(3 \times 3\) to \(7 \times 7\) and \(2 \times 2\) to \(7 \times 7\) tasks. The best results are highlighted in bold, and the second best results are highlighted in blue bold

Synthetic scenes

The disparity range of real-world datasets captured by commercial LF cameras is relatively small, usually less than 1 pixel. We therefore use synthetic datasets with much larger disparity ranges to evaluate the performance on scenes with large disparity. Note that we have replaced the origami scene with the dishes scene to increase the disparity range of the test images in the HCI dataset. Table 2 shows the \(2 \times 2\) to \(7 \times 7\) results compared with those of 6 state-of-the-art methods that have been shown to handle large disparity well, including Kalantari et al. [4], Yeung et al. [7], Wu et al. [8], Jin et al. [9], Wang et al. [10] and Wang et al. [11]. We also present the disparity range of each dataset to examine its effect on the reconstruction quality. Our model without fine-tuning (denoted as “Ours”) achieves the second-best SSIM and the third-best PSNR among all methods, only slightly lower than those of Jin et al. [9]. A possible reason is that we use a perceptual loss instead of a pixel loss; the latter usually leads to a better PSNR but does not necessarily improve the visual quality. To demonstrate this, we visually compare the reconstruction results of the different methods in Fig. 7, and additional comparisons with Jin et al. [9] are given in Fig. 8. Our approach produces accurate results that are closer to the ground truth.

Figure 7

Visual comparisons of different methods on the synthesized center view under the \(2 \times 2\) to \(7 \times 7\) task

Figure 8

Additional visual comparisons of different methods on the synthesized center view under the \(2 \times 2\) to \(7 \times 7\) task

Table 2 Quantitative comparisons (PSNR/SSIM) of different methods on synthetic light field datasets for the \(2 \times 2\) to \(7 \times 7\) task. The best results are highlighted in bold, and the second best results are highlighted in blue bold

Without using depth, our method achieves performance comparable to that of the state-of-the-art depth-based method of Jin et al. [9], even though the latter uses a model specifically trained for this task. In contrast, most non-depth-based methods perform worse as the disparity range increases, which shows the effectiveness of our EPI superresolution module in exploiting the local parallax structure of the EPI. Moreover, our method uses the same model for all the experiments, whereas the other methods have to train different models for different tasks. The fine-tuned model \(\mathrm{Ours}_{\mathrm{ft}}\) performs comparably to Wang et al. [11] and outperforms all the other methods on all metrics.

Table 3 lists the running times of the different methods, all tested on an Nvidia GeForce RTX A6000 GPU. Our method takes 9.13 s to reconstruct a \(7 \times 7\) LF with a spatial resolution of \(512 \times 512\) from \(2 \times 2\) input views. It is significantly faster than the other methods except Yeung et al. [7] and Jin et al. [9]; the additional running time relative to these two is the cost of better performance and practical flexibility.

Table 3 Comparison of the running times of different methods on the HCI datasets for the \(2 \times 2\) to \(7 \times 7\) task. The best results are highlighted in bold

4.2 Ablation study

To better illustrate the advantages of our GLGNet, we conduct ablation experiments on the HCI dataset with variants of our framework. Table 4 displays the results. The aggregation modules, the angular conversion module and the range weights significantly boost the quality of the synthesized views, and the proposed combination of loss functions further improves the performance of our model. In addition, the visual comparison given in Fig. 8 shows the effectiveness of our bilateral upsampling module.

Table 4 Quantitative comparisons on the HCI dataset with variants of the proposed framework for the \(2 \times 2\) to \(7 \times 7\) task. The best results are highlighted in bold

5 Limitation and discussion

To handle LFs with a large disparity range, we use a deep RDN with a large receptive field to extract features from EPIs. Moreover, our bilateral upsampling module predicts local weights to preserve the EPI structure. However, these operations are memory-intensive and become impractical for higher resolutions. A possible solution is to use multiresolution CNN architectures and to segment the processing along the depth dimension.

6 Conclusion

We present GLGNet, a learning-based method for depth-free LF angular superresolution that can handle large disparities. Our method uses pseudo 4D CNNs to capture the global spatial-angular consistency between input views, and 2D EPI superresolution to achieve high angular resolution by exploiting the local parallax structure of the EPI. We also propose a bilateral upsampling module that enables superresolution with different interpolation rates using a single model while better preserving the parallax structure of the EPI. We evaluate our method on real-world and synthetic LF scenes and show that it outperforms various state-of-the-art methods in addressing the challenges of large disparity.