1 Introduction

The light field (LF) is a promising representation of 3D visual scenes that captures light rays traveling in every direction through every point in free space. An LF image contains rich visual information about the scene and has drawn great attention from both academia and industry. For many applications, densely sampled LFs with high angular resolution are essential for avoiding ghosting effects.

Acquiring a densely sampled LF with sufficient angular resolution is challenging in many practical cases. Conventional LF acquisition methods use camera arrays [1] or computer-controlled gantries [2] to sample the LF at different viewpoints with single or multiple exposures. However, these methods are either bulky and expensive, or time-consuming and limited to static scenes. Recently, commercial LF cameras have made LF image acquisition more convenient [3]. Unfortunately, the limited sensor resolution forces a trade-off between the spatial and angular resolution of the captured LF images.

Many studies have focused on synthesizing the required number of views from a given sparse set of images with high spatial resolution, which is called LF angular superresolution. With the recent advances in convolutional neural networks (CNNs) for visual modeling, several learning-based methods [4–11] have been proposed. A major technical challenge for LF angular superresolution is large disparity, which makes pixel correspondences difficult to find. To address this challenge, several methods use depth estimation to construct a correspondence map between input views and subsequently blend them via various methods to synthesize novel views [12, 13]. However, these methods have limitations when dealing with factors such as texture, occlusion and specular surfaces. Therefore, several methods using specific priors, such as sparsity in the transform domain, have been presented for depth-free angular superresolution [5–8, 14]. However, they often produce aliasing or blurring effects when the input LF has a large disparity. Moreover, most existing depth-based and non-depth-based methods need to be retrained for different interpolation rates, which is inconvenient in practice.

In fact, 4D LF data is highly correlated in ray space and structurally records abundant information about the scene. Therefore, a key insight for view synthesis in LF imaging is to fully exploit the inherent correlations within the input views [6, 7, 14]. This approach is particularly useful when the input views are very sparse. Another insight is to utilize the parallax structure of the LF, which manifests as distinctive patterns in epipolar plane images (EPIs) [5, 15], as shown in Fig. 1. However, for LFs with very sparse input views, each EPI can access only limited spatial and angular information, which becomes even more challenging for scenes with occlusions or non-Lambertian surfaces. Moreover, to reconstruct LFs with relatively large disparities, the receptive field of CNNs should be large enough to avoid aliasing effects, but this may result in over-smoothed edges and blur in the synthesized views. To address these issues, we propose a learning-based LF angular superresolution framework that leverages both the global spatial-angular consistency and the local parallax structure of EPIs. Specifically, we first pre-aggregate the global spatial-angular information to gather useful information from all the input views; then we extract local structural features from each EPI and interpolate its feature map to the desired angular resolution; finally, a post-aggregation is applied to ensure the spatial-angular consistency of all the synthesized views. This combination forms the “global-local-global” network (GLGNet), which ensures the full utilization of both the global and local features of the 4D LF data. We also mitigate the blur caused by the large receptive field by using a bilateral upsampling module that better preserves the LF parallax structure, and we dynamically predict its weights to enable arbitrary interpolation rates with a single model.

Figure 1

Visualization of the 4D light field \(\boldsymbol{L} (u, v, s, t)\). The bottom epipolar plane image (EPI) is a 2D \((u, s)\) slice \(\boldsymbol{L} (u, v^{*}, s, t^{*})\) obtained by setting \(v = v^{*}\) (highlighted in red) and \(t = t^{*}\), and the right EPI is a 2D \((v, t)\) slice obtained by setting \(u = u^{*}\) (highlighted in green) and \(s = s^{*}\)

The main contributions of this paper are summarized as follows:

1) We exploit both the global spatial-angular consistency and the local parallax structure of EPIs, which is beneficial for LF reconstruction from very sparse input views.

2) We propose a bilateral upsampling module that consists of dynamically predicted global spatial weights and local range weights, which can achieve arbitrary upsampling rates while better preserving the LF parallax structure.

3) Unlike other methods that train different models for different upsampling rates, our method achieves state-of-the-art performance on both real-world and synthetic scenes with a single model, which makes it more flexible in practical applications.

2 Related work

LF angular superresolution is a long-standing problem that has been studied for decades. Existing methods can be divided into two categories: those that use depth estimation, and those that do not. In this paper, we mainly focus on the methods based on deep learning, since they can usually achieve better results than the other methods.

2.1 Depth image-based view synthesis

These approaches typically first estimate the scene depth at the novel view or the input view [16–19], and then warp the input views to the novel view based on the estimated depth.

Flynn et al. [12] proposed a deep learning method to synthesize novel views from a stereo pair or a sequence of images with wide baselines. Srinivasan et al. [20] proposed synthesizing a 4D LF image from a 2D RGB image based on the estimated 4D ray depth. Kalantari et al. [4] proposed synthesizing novel views with two sequential networks that performed depth estimation and color prediction successively. These methods synthesize novel views independently and neglect their inter-view correlations. Wu et al. [8] proposed fusing a set of sheared EPIs for LF reconstruction, based on the observation that an EPI shows a clear structure when sheared by its disparity value. However, the spatial and angular information available to EPI-based models is severely limited, since each EPI is only a 2D slice of the 4D LF. Zhou et al. [21] and Mildenhall et al. [22] trained a network that inferred alpha and multiplane images, and synthesized novel views using homography and alpha composition. Jin et al. proposed a geometry-aware network that could reconstruct large-disparity LFs with a regular sampling pattern [9] or a flexible sampling pattern [23].

Typically, depth-based approaches depend heavily on the estimated depth, which is prone to errors for challenging scene features such as object boundaries, lighting reflections and thin structures. Moreover, small deviations in the estimated depth map can cause visually annoying artifacts in the synthesized views. In addition, these approaches often focus on the quality of depth estimation rather than on the quality of the synthesized views.

2.2 LF reconstruction without depth

Densely sampled LF reconstruction can be regarded as reconstructing (interpolating) the underlying plenoptic function from incomplete measurements (input views). For a sparsely sampled LF, the sampling rate is insufficient; thus, direct interpolation results in ghosting effects in the rendered views. Various priors for LF images have been used to reduce this effect [24–27]. Moreover, several methods exploit compressive LF photography [28]. However, these methods require specific patterns for sampling the input views, which makes the acquisition more difficult.

Recently, several learning-based approaches for depth-free reconstruction have also been proposed. Yoon et al. [14] super-resolved the LF image in both the spatial and angular domains sequentially using a network. However, their approach could only handle 2× angular superresolution and could not adapt to very sparsely sampled LF input. Following the idea of single image superresolution, Wu et al. [5] proposed a “blur-restoration-deblur” framework for learning-based angular detail restoration on 2D EPIs. Wang et al. [6, 29] proposed processing 3D LF volumes of stacked EPIs with 3D CNNs. Yeung et al. [7] applied a coarse-to-fine model using an encoder-like view refinement network for a larger receptive field. Fang et al. [30] combined the advantages of classical model-based methods and data-driven deep learning. Wu et al. [31] proposed a spatial-angular attention network to capture non-local correspondences in the LF. Wang et al. [10] proposed a coarse-to-fine model that used a meta-learning based upsampling module to achieve arbitrary upsampling rates. These methods can achieve good performance when the sampling baseline is relatively small, but their results deteriorate on LFs with a wider disparity range. Wang et al. [11] developed a class of domain-specific convolutions that disentangle 4D LF data along its different dimensions and designed task-specific modules for spatial and angular superresolution and disparity estimation. However, this method requires training different models for different upsampling rates, which is not flexible for practical applications.

3 The proposed method

3.1 LF representation and problem statement

In the two-plane parameterization, each light ray intersects the image plane at coordinates \((u, v)\), the spatial dimensions, and then intersects the camera plane at coordinates \((s, t)\), the angular dimensions. A densely sampled LF \(\boldsymbol{L}(u, v, s, t)\) consists of \(N \times M\) views of spatial size \(W \times H\), sampled on the angular plane with a regular 2D grid of size \(N \times M\), as shown in Fig. 1.

An EPI is a slice of the LF with a fixed spatial coordinate v and a fixed angular coordinate t (or u and s), denoted by \(\boldsymbol{E}_{v^{*}, t^{*}} (u, s)\) (or \(\boldsymbol{E}_{u^{*}, s^{*}}(v, t)\)), as illustrated in Fig. 1. It captures the relative motion between the camera and the object points. A visible scene point appears as a line in one of the EPIs, whose slope reflects its depth and whose intensity reflects its emitted light. In other words, the line pattern in the EPI represents the local parallax of the LF. Under the Lambertian reflectance model, a surface point forms a line of constant intensity in the EPI.
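For concreteness, the following minimal NumPy sketch shows how row and column EPIs are sliced from a 4D LF array; the array layout and variable names are illustrative assumptions rather than our actual implementation.

```python
import numpy as np

# Illustrative 4D LF L(u, v, s, t): spatial size W x H, angular grid N x M.
W, H, N, M = 64, 48, 7, 7
L = np.random.rand(W, H, N, M)

def row_epi(L, v_star, t_star):
    """Row EPI E_{v*,t*}(u, s): fix the spatial coordinate v and the angular coordinate t."""
    return L[:, v_star, :, t_star]          # shape (W, N)

def col_epi(L, u_star, s_star):
    """Column EPI E_{u*,s*}(v, t): fix the spatial coordinate u and the angular coordinate s."""
    return L[u_star, :, s_star, :]          # shape (H, M)

print(row_epi(L, v_star=10, t_star=3).shape)    # (64, 7)
print(col_epi(L, u_star=20, s_star=3).shape)    # (48, 7)
```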

Since the EPI is a 2D slice of the 4D LF with a clearer structure, LF angular superresolution can be seen as an image superresolution problem along the angular dimension of the EPI. Novel views can be synthesized by interpolating their pixels along the corresponding lines in all EPIs. However, this EPI angular superresolution problem is highly ill-posed due to the limited angular resolution. This makes it difficult to recover some lines in scenes with occlusions or non-Lambertian surfaces, as they are either discontinuous or non-linear. On the one hand, for a small set of input views, the available spatial and angular information of the 2D EPI is very limited. On the other hand, large disparity, occlusion and non-Lambertian effects make it difficult to find correspondences between views. As illustrated in Fig. 2, in a task with only 2 × 2 input views, the sparsely sampled EPI can only access views in one row/column, and the presence of occlusion causes the endpoints of some lines to have different values, as displayed in Fig. 2(a) (enclosed in red line), resulting in incorrect interpolation. In addition, lines in non-Lambertian regions are not exactly straight or do not have constant values, as shown in Fig. 2(b) (enclosed by the yellow dashed line), making them difficult to recover from only two endpoints.

Figure 2

Illustration of the angular superresolution problem on the epipolar plane image (EPI) and examples of the Lambertian regions with small disparity (enclosed in green line) and large disparity (enclosed in cyan line), non-Lambertian region (enclosed in yellow dashed line) and the partially occluded region (enclosed in red line): (a) sparsely sampled EPI (2 input views); (b) densely sampled EPI

3.2 Overview of the proposed framework

We propose GLGNet, a network that leverages the global spatial-angular consistency and the local parallax structure of EPIs for high-angular-resolution LF reconstruction. To aggregate both the angular and spatial information across views, we use a pseudo 4D CNN to reduce model complexity. Instead of using two 2D filters that alternately convolve on the spatial and angular dimensions of the LF, we use 3D CNNs to filter row and column EPI stacks alternately. The row/column EPIs aggregated with global information are then super-resolved along the corresponding axis. This design preserves the local parallax structure of the EPI while using all the available information from the input views. Figure 3 shows our proposed framework, which consists of two aggregation modules, one angular conversion module and one EPI angular superresolution module.
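The following PyTorch sketch illustrates this pseudo 4D filtering: a 3D convolution is applied to row EPI stacks (one \((v, u, s)\) volume per t) and then to column EPI stacks (one \((u, v, t)\) volume per s). The tensor layout, channel counts and layer counts here are illustrative assumptions and do not reproduce the exact modules in Fig. 4.

```python
import torch
import torch.nn as nn

x = torch.rand(2, 1, 32, 32, 3, 3)                       # (B, C, u, v, s, t): 3x3 input views
conv_row = nn.Conv3d(1, 16, kernel_size=3, padding=1)    # filters a (v, u, s) row-EPI volume per t
conv_col = nn.Conv3d(16, 16, kernel_size=3, padding=1)   # filters a (u, v, t) column-EPI volume per s

B, C, U, V, S, T = x.shape
# Row EPI stack: for each fixed t, convolve the 3D (v, u, s) volume.
row = x.permute(0, 5, 1, 3, 2, 4).reshape(B * T, C, V, U, S)
row = torch.relu(conv_row(row))
C2 = row.shape[1]
row = row.reshape(B, T, C2, V, U, S).permute(0, 2, 4, 3, 5, 1)   # back to (B, C', u, v, s, t)

# Column EPI stack: for each fixed s, convolve the 3D (u, v, t) volume.
col = row.permute(0, 4, 1, 2, 3, 5).reshape(B * S, C2, U, V, T)
col = torch.relu(conv_col(col))
C3 = col.shape[1]
col = col.reshape(B, S, C3, U, V, T).permute(0, 2, 3, 4, 1, 5)   # (B, C'', u, v, s, t)
print(col.shape)                                                 # torch.Size([2, 16, 32, 32, 3, 3])
```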

Figure 3

The framework of the proposed “global-local-global” network (GLGNet) for light field angular superresolution. Four modules are involved in our method. The pre-aggregation module \(A_{\mathrm{pre}}\) and the post-aggregation module \(A_{\mathrm{post}}\) combine both spatial and angular information to exploit the global spatial-angular consistency, the epipolar plane image (EPI) angular superresolution module S upsamples each angular dimension by exploiting the local parallax structure in the EPI, and the angular conversion module C converts the feature maps for further processing of the column EPIs

Given a sparsely sampled LF \(\boldsymbol{L}_{\mathrm{ss}} (u, v, s, t)\) with size \(W \times H \times n \times m\), we first perform pre-aggregation and obtain a feature map as

$$ \boldsymbol{L}_{\mathrm{feat}} (u, v, s, t) = A_{\mathrm{pre}}\bigl( \boldsymbol{L}_{\mathrm{ss}} (u, v, s, t)\bigr), $$
(1)

where \(A_{\mathrm{pre}}(\cdot )\) is the pre-aggregation module. It alternates between three 3D CNN layers on the row EPI stacks and two 3D CNN layers on the column EPI stacks to reconstruct the residual map, as illustrated in Fig. 4(a).

Figure 4

The proposed GLGNet uses three 3D convolutional neural network modules

Then, we super-resolve the angular dimension s. For each row \(t^{*}\), \(t^{*} = 0, 1, \ldots , m - 1\), by fixing one spatial dimension \(v = v^{*}\), \(v^{*} = 0, 1, \ldots , H-1\), we extract the 2D row EPI

$$ \boldsymbol{E}_{v^{*}, t^{*}} (u, s) = \boldsymbol{L}_{\mathrm{feat}} \bigl(u, v^{*}, s, t^{*}\bigr), $$
(2)

with a size of \(W \times n\). The EPI angular superresolution module \(S(\cdot )\) interpolates each row EPI to the desired angular resolution and assembles them into the partially upsampled feature map

$$ \boldsymbol{L}_{\mathrm{feat}} \bigl(u, v^{*}, s, t^{*}\bigr) \uparrow s = S\bigl( \boldsymbol{E}_{v^{*}, t^{*}} (u, s), f_{\mathrm{row}} \bigr), $$
(3)

with a size of \(W \times H \times N \times m\), where ↑ indicates upsampling along the specified dimension, and \(f_{\mathrm{row}} = (N - 1) / (n - 1)\) is an arbitrary interpolation rate.

Next, we use the angular conversion module \(C(\cdot )\) to refine \(\boldsymbol{L}_{\mathrm{feat}}(u, v, s, t) \uparrow s\), transfer the angular dimension from s to t, and further extract the feature map

$$ \boldsymbol{L}'_{\mathrm{feat}} (u, v, s, t) = C\bigl( \boldsymbol{L}_{ \mathrm{feat}} (u, v, s, t) \uparrow s\bigr), $$
(4)

with a size of \(W \times H \times N \times m\). We use residual blocks with three 3D CNN layers for both row EPI stack refinement and column EPI stack feature extraction, as demonstrated in Fig. 4(b).

Similarly, we extract the 2D column EPI by fixing one angular dimension \(s = s^{*}\), \(s^{*} = 0, 1, \ldots , N-1\) and one spatial dimension \(u = u^{*}\), \(u^{*} = 0, 1, \ldots , W-1\) as

$$ \boldsymbol{E}_{u^{*}, s^{*}} (v, t) = \boldsymbol{L}'_{\mathrm{feat}} \bigl(u^{*}, v, s^{*}, t\bigr), $$
(5)

with a size of \(H \times m\), which is further interpolated to the desired angular resolution by the EPI angular superresolution module \(S(\cdot )\), and assembled into the fully upsampled feature map

$$ \boldsymbol{L}'_{\mathrm{feat}} \bigl(u^{*}, v, s^{*}, t\bigr) \uparrow t = S\bigl( \boldsymbol{E}_{u^{*}, s^{*}} (v, t), f_{\mathrm{col}}\bigr), $$
(6)

with a size of \(W \times H \times N \times M\), where \(f_{\mathrm{col}} = (M - 1) / (m - 1)\) is an arbitrary interpolation rate that can differ from \(f_{\mathrm{row}}\). At this point, the input LF has been upsampled to the desired angular resolution.
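For concreteness, the interpolation rates and intermediate feature-map shapes for the \(2 \times 2\) to \(7 \times 7\) task work out as follows (the \(512 \times 512\) spatial size is only illustrative):

```python
# Shape bookkeeping for Eqs. (1)-(6) on the 2x2 -> 7x7 task.
W, H, n, m, N, M = 512, 512, 2, 2, 7, 7
f_row = (N - 1) / (n - 1)        # 6.0, row-EPI interpolation rate
f_col = (M - 1) / (m - 1)        # 6.0, column-EPI interpolation rate
print((W, H, n, m))              # input L_ss:                (512, 512, 2, 2)
print((W, H, N, m))              # after the row-EPI pass:    (512, 512, 7, 2)
print((W, H, N, M))              # after the column-EPI pass: (512, 512, 7, 7)
```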

Finally, after applying the post-aggregation module \(A_{\mathrm{post}} (\cdot )\) to \(\boldsymbol{L}'_{\mathrm{feat}} (u, v, s, t) \uparrow t\), we obtain the super-resolved LF as

$$ \boldsymbol{L}_{\mathrm{sr}} (u, v, s, t) = A_{\mathrm{post}}\bigl( \boldsymbol{L}'_{\mathrm{feat}} (u, v, s, t) \uparrow t\bigr). $$
(7)

The post-aggregation module also employs a residual block that alternates between three 3D CNN layers on column EPI stacks and two 3D CNN layers on row EPI stacks, as demonstrated in Fig. 4(c).

As shown in Fig. 4, except for the one before the skip connection, all the 3D CNNs mentioned above use \(3 \times 3 \times 3\) kernels with a PReLU activation layer. We also adopt a learnable architecture for the EPI angular superresolution module, which enables end-to-end training of the proposed framework.

3.3 EPI superresolution with bilateral upsampling

To super-resolve EPIs, we treat them as image superresolution problems along the angular dimension. Our EPI superresolution module first applies a residual dense network (RDN) [32] to extract features from the input EPI, and then upsamples the feature map with an arbitrary interpolation rate. In this section, we describe the bilateral upsampling module in detail.

Given a sparsely sampled EPI \(\boldsymbol{E}_{\mathrm{ss}}\), which can be considered a downsampled version of its densely sampled counterpart \(\boldsymbol{E}_{\mathrm{ds}}\), the task of the EPI angular superresolution module is to generate an angular super-resolved EPI \(\boldsymbol{E}_{\mathrm{sr}}\) whose ground truth is \(\boldsymbol{E}_{\mathrm{ds}}\). Let \(\boldsymbol{E}_{\mathrm{feat}}\) denote the feature extracted from \(\boldsymbol{E}_{\mathrm{ss}}\). Consider a pixel at \((i,j)\) in \(\boldsymbol{E}_{\mathrm{sr}}\); it corresponds to a filtering output at the subpixel location \((i, \mathrm{round} (j/f) )\) in \(\boldsymbol{E}_{\mathrm{feat}}\), where j is the angular dimension, f is the interpolation rate, and \(\mathrm{round}(\cdot )\) denotes the rounding operation. Therefore, each feature patch centered at \((i', j')\) in \(\boldsymbol{E}_{\mathrm{feat}}\), except for the first and last ones, determines f pixels in \(\boldsymbol{E}_{\mathrm{sr}}\), located at \((i', f \cdot j' + k )\), \(k \in [ - \lfloor (f-1)/2 \rfloor , \lceil (f-1)/2 \rceil ]\) with \(k \in \mathbb{Z}\). We employ subpixel convolution [33] to implement this operation as

$$ \boldsymbol{E}_{\mathrm{sr}} = \mathcal{PS} (\boldsymbol{W} * \boldsymbol{E}_{\mathrm{feat}}), $$
(8)

where \(\mathcal{PS}(\cdot )\) is a pixel shuffling operator that upsamples only the angular dimension, and \(\boldsymbol{W}\) is the filter weight.
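A minimal PyTorch sketch of Eq. (8) is given below; it uses a fixed integer rate f and illustrative channel counts, and demonstrates only the angular pixel shuffle, not our full bilateral upsampling module.

```python
import torch
import torch.nn as nn

f, inC, W_sp, n = 3, 64, 32, 3                 # rate, feature channels, spatial size, input views
E_feat = torch.rand(1, inC, W_sp, n)           # (B, C, u, s) row-EPI features

conv = nn.Conv2d(inC, f, kernel_size=3, padding=1)   # W * E_feat: f output maps, one per subpixel

def angular_pixel_shuffle(x, f):
    """Rearrange the f channel maps into f consecutive positions along the angular axis."""
    B, F, U, S = x.shape
    assert F == f
    return x.permute(0, 2, 3, 1).reshape(B, 1, U, S * f)   # (B, 1, u, s*f)

E_sr = angular_pixel_shuffle(conv(E_feat), f)
print(E_sr.shape)   # torch.Size([1, 1, 32, 9])
```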

For different interpolation rates, both the number of filters and the weights of the filters are different. To perform multiple interpolation rates with a single model, we need to dynamically predict the filters for each interpolation rate. The weights can be predicted based on the subpixel location and the interpolation rate f [34]. Unlike natural images, EPIs have unique line patterns that may be over-smoothed by CNNs with large receptive fields, especially when sampled with large baselines, as shown in Fig. 1. To address this problem, we introduce edge-preserving bilateral filtering in our bilateral upsampling module. As illustrated in Fig. 5, the weights of this module depend on both the subpixel coordinates and the range differences. The module consists of two main steps: weight prediction and subpixel filtering. The weight prediction step predicts the weights of the local filters for each feature patch in \(\boldsymbol{E}_{\mathrm{feat}}\). The subpixel filtering step applies the predicted weights to filter the feature in \(\boldsymbol{E}_{\mathrm{feat}}\) and calculate the values of the subpixels, i.e., the pixels in \(\boldsymbol{E}_{\mathrm{sr}}\). We will explain these steps in detail below.

Figure 5

The flowchart of the proposed bilateral upsampling module, which predicts the weights of local filters based on the subpixel offset, interpolation rate and feature similarity. By performing subpixel filtering, the densely sampled epipolar plane image (EPI) is reconstructed. FC refers to fully connected layer

Weight prediction

Instead of using a typical upsampling module that predefines the number of filters for each interpolation rate and learns their weights from the training dataset, we train a network that dynamically predicts the filter weights from both the subpixel offset and the range difference. For the feature \(\boldsymbol{E}_{\mathrm{feat}} (i', j')\), the filter weights \(\boldsymbol{W}_{i', j'}\) consist of two parts: the global spatial weights \(\boldsymbol{W}^{\mathrm{s}}\) and the local range weights \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\). The global spatial weights depend only on the relative spatial offsets of the subpixels, so they can be predicted one by one via meta-learning, whereas the local range weights are computed dynamically from feature similarity. The bilateral upsampling module can therefore generate subpixel filters for any interpolation rate, which enables arbitrary-rate upsampling.

The global spatial weights \(\boldsymbol{W}^{\mathrm{s}}\) are predicted as

$$ \boldsymbol{W}^{\mathrm{s}} = \varphi (\boldsymbol{v}_{f}), $$
(9)

where \(\varphi (\cdot )\) is the spatial weight prediction network, and \(\boldsymbol{v}_{f}\) is a vector containing the relative offset of the subpixels and the interpolation rate f:

$$ \boldsymbol{v}_{f} = \biggl( \frac{k}{f}, \frac{1}{f} \biggr), $$
(10)

where \(k \in [ - \lfloor (f-1)/2 \rfloor , \lceil (f-1)/2 \rceil ]\), and \(k \in \mathbb{Z}\).
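The following sketch shows one way such a prediction network can be realized: a small MLP (two fully connected layers with 256 hidden neurons, following the architecture described below) maps each offset vector \(\boldsymbol{v}_{f}\) to one \(\mathit{inC} \times ks \times ks\) filter, so the number of predicted filters equals f. Everything beyond these layer sizes is an illustrative assumption.

```python
import torch
import torch.nn as nn

inC, ks = 64, 3
phi = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, inC * ks * ks))

def spatial_weights(f):
    """Predict one inC x ks x ks filter per subpixel offset k for an integer rate f."""
    lo, hi = -((f - 1) // 2), f // 2                 # k in [-floor((f-1)/2), ceil((f-1)/2)]
    v_f = torch.tensor([[k / f, 1.0 / f] for k in range(lo, hi + 1)], dtype=torch.float32)
    return phi(v_f).reshape(-1, inC, ks, ks)         # (outC = f, inC, ks, ks)

print(spatial_weights(3).shape)   # torch.Size([3, 64, 3, 3])
print(spatial_weights(6).shape)   # torch.Size([6, 64, 3, 3])
```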

Moreover, the local range weights \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\) are predicted via a strategy similar to self-attention [35], which is formulated as

$$ \boldsymbol{W}_{i', j'}^{\mathrm{r}} (\boldsymbol{x}) = \operatorname{softmax} \bigl( \theta \bigl(\boldsymbol{E}_{\mathrm{feat}} \bigl(i', j'\bigr)\bigr) \cdot \phi \bigl( \boldsymbol{E}_{\mathrm{feat}} (\boldsymbol{x})\bigr) \bigr), $$
(11)

where \(\boldsymbol{x}\) is the coordinate of a feature within a window \(\Omega _{i', j'}\) centered at \((i', j')\), softmax denotes the softmax operation, and \(\theta (\cdot )\) and \(\phi (\cdot )\) are embedding functions.
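A sketch of this computation is given below: \(1 \times 1\) convolutions embed the features, and the dot products between the embedding of each position and those inside a \(ks \times ks\) window are normalized by a softmax. The channel counts follow the architecture described below; the rest is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

inC, embC, ks = 64, 16, 3
theta = nn.Conv2d(inC, embC, kernel_size=1)    # embedding theta(.)
phi_e = nn.Conv2d(inC, embC, kernel_size=1)    # embedding phi(.)

def range_weights(E_feat):
    B, _, U, S = E_feat.shape
    q = theta(E_feat)                                        # (B, embC, U, S) query per window center
    k = F.unfold(phi_e(E_feat), ks, padding=ks // 2)         # (B, embC*ks*ks, U*S) keys per window
    k = k.reshape(B, embC, ks * ks, U, S)
    sim = (q.unsqueeze(2) * k).sum(dim=1)                    # dot products, (B, ks*ks, U, S)
    return torch.softmax(sim, dim=1).reshape(B, ks, ks, U, S)

E_feat = torch.rand(1, inC, 32, 3)                           # row-EPI features (B, C, u, s)
print(range_weights(E_feat).shape)                           # torch.Size([1, 3, 3, 32, 3])
```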

Next, we compute the final weights \(\boldsymbol{W}_{i', j'}\) for each feature patch centered at \((i', j')\) by taking the elementwise product of the predicted global spatial weights \(\boldsymbol{W}^{\mathrm{s}}\) and the local range weights \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\), as shown in Eq. (12). Both \(\boldsymbol{W}^{\mathrm{s}}\) and \(\boldsymbol{W}_{i', j'}^{\mathrm{r}}\) are broadcast to the same size before multiplication.

$$ \boldsymbol{W}_{i', j'} = \boldsymbol{W}^{\mathrm{s}} \circ \boldsymbol{W}_{i', j'}^{\mathrm{r}}, $$
(12)

where “∘” denotes the elementwise (Hadamard) product.

Subpixel filtering

We filter the features \(\boldsymbol{E}_{\mathrm{feat}}\) with the predicted weights \(\boldsymbol{W}\) to obtain the pixel values of the densely sampled EPI \(\boldsymbol{E}_{\mathrm{sr}}\). Motivated by the fact that the residuals of natural images and videos are more compressible, we use a skip connection to add \(\boldsymbol{E}_{\mathrm{feat}}\) to the output of the filtering operation, as shown in Eq. (13). We also use a \(1 \times 1\) convolution layer \(g(\cdot )\) to adjust the number of feature maps of \(\boldsymbol{E}_{\mathrm{feat}}\) and tile them according to their center pixel before adding them to the output. This ensures that both sides of the addition have the same size. Finally, we remove the first \(\lfloor (f-1)/2 \rfloor \) and the last \(\lceil (f-1)/2 \rceil \) pixels in the angular dimension so that \(N = (n - 1) f + 1\) holds (taking the row EPI \(\boldsymbol{E}_{v^{*}, t^{*}} (u, s)\) as an example; this applies to the following text as well).

$$ \boldsymbol{E}_{\mathrm{sr}} = g(\boldsymbol{E}_{\mathrm{feat}}) + \mathcal{PS} (\boldsymbol{W} * \boldsymbol{E}_{\mathrm{feat}}). $$
(13)
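The following sketch combines Eqs. (12) and (13) for a single row EPI: the two weight tensors are broadcast and multiplied, each feature patch is filtered with its own kernel, the f subpixels are shuffled into the angular axis, and the border subpixels are trimmed so that \((n-1) f + 1\) angular samples remain. The skip term \(g(\boldsymbol{E}_{\mathrm{feat}})\) is omitted, and all shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bilateral_upsample(E_feat, W_s, W_r, f):
    """E_feat: (B, inC, U, n) row-EPI features; W_s: (f, inC, ks, ks); W_r: (B, ks, ks, U, n)."""
    B, inC, U, n = E_feat.shape
    ks = W_s.shape[-1]
    patches = F.unfold(E_feat, ks, padding=ks // 2).reshape(B, inC, ks, ks, U, n)
    # Eq. (12): W = W_s o W_r, broadcast to (B, outC, inC, ks, ks, U, n).
    W = W_s[None, :, :, :, :, None, None] * W_r[:, None, None, :, :, :, :]
    out = (W * patches[:, None]).sum(dim=(2, 3, 4))           # (B, f, U, n): f subpixels per position
    out = out.permute(0, 2, 3, 1).reshape(B, 1, U, n * f)     # angular pixel shuffle
    lo, hi = (f - 1) // 2, f // 2                             # border subpixels to remove
    # The skip term g(E_feat) of Eq. (13) is omitted in this sketch.
    return out[..., lo: out.shape[-1] - hi]                   # (B, 1, U, (n - 1) * f + 1)

E_feat = torch.rand(1, 64, 32, 3)                             # 3 input views per row
W_s = torch.rand(3, 64, 3, 3)                                 # from the spatial weight branch, f = 3
W_r = torch.rand(1, 3, 3, 32, 3)                              # from the range weight branch
print(bilateral_upsample(E_feat, W_s, W_r, f=3).shape)        # torch.Size([1, 1, 32, 7])
```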

Network architecture

For feature extraction, we use an RDN with 8 residual dense blocks (RDBs), each containing 8 2D CNN layers. The growth rate is 64.

For global spatial weight prediction, we use the multilayer perceptron from Ref. [34], which has 2 fully connected layers and 256 hidden neurons. The output is a tensor of size \(\mathit{outC} \times \mathit{inC} \times ks \times ks\), where \(\mathit{inC} = 64\) is the number of feature maps of \(\boldsymbol{E}_{\mathrm{feat}}\), \(\mathit{outC} = f\) is the number of subpixels for the luminance Y channel in the YCbCr color space, and \(ks = 3\) is the filter size. To predict the local range weights, we use \(1 \times 1\) convolutions as the embedding functions \(\theta (\cdot )\) and \(\phi (\cdot )\) with 16 output channels. The output is a tensor of size \(W \times n \times ks \times ks\). Therefore, the final weight has a size of \(W \times n \times \mathit{outC} \times \mathit{inC} \times ks \times ks\).

3.4 Training details

Since we use a deep network with a large receptive field to handle scenes with a large disparity range, simply minimizing the \(L_{2}\) distance between the synthesized views and the reference views can cause blur in the synthesized views. Instead, we use the feature similarity [36] as part of our loss function. We use the features from the conv1_2, conv2_2 and conv3_3 layers of a pre-trained VGG-16 network [37], and apply the per-view weighting method from Ref. [6] to improve the visual quality of the synthesized views. We also use the EPI structure preserving loss from Ref. [29] to ensure cross-view consistency.

We define the total loss function \(\mathcal{L}\) in this paper as a combination of the weighted feature similarity term \(\mathcal{L}_{\mathrm{vgg}}\) and the EPI structure preserving term \(\mathcal{L}_{\mathrm{esp}}\):

$$ \mathcal{L} = \mathcal{L}_{\mathrm{vgg}} + \lambda \cdot \mathcal{L}_{ \mathrm{esp}}, $$
(14)

where λ is a hyperparameter that controls the relative weight and is empirically set to 100.
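A minimal sketch of Eq. (14) is given below. The VGG-16 taps are the ReLU outputs of conv1_2/conv2_2/conv3_3 in torchvision's layer indexing (an assumption about the exact tap points), the per-view weighting of Ref. [6] and the ImageNet input normalization are omitted, and the EPI structure preserving term of Ref. [29] is replaced by a simple angular-gradient stand-in.

```python
import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad = False
taps = {3: 'conv1_2', 8: 'conv2_2', 15: 'conv3_3'}   # ReLU indices in vgg16.features

def vgg_features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in taps:
            feats.append(x)
        if i >= max(taps):
            break
    return feats

def total_loss(L_sr, L_gt, lam=100.0):
    """L_sr, L_gt: (B, N, M, H, W) luminance view volumes (assumed layout)."""
    sr = L_sr.reshape(-1, 1, *L_sr.shape[-2:]).repeat(1, 3, 1, 1)   # views -> batch, Y -> 3 channels
    gt = L_gt.reshape(-1, 1, *L_gt.shape[-2:]).repeat(1, 3, 1, 1)
    l_vgg = sum(nn.functional.l1_loss(a, b) for a, b in zip(vgg_features(sr), vgg_features(gt)))
    # Stand-in EPI structure term: match first-order differences along both angular axes.
    l_esp = nn.functional.l1_loss(L_sr.diff(dim=1), L_gt.diff(dim=1)) + \
            nn.functional.l1_loss(L_sr.diff(dim=2), L_gt.diff(dim=2))
    return l_vgg + lam * l_esp
```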

In the training procedure, we use 20 synthetic LF images from the 4D light field benchmark [38] and 100 real-world LF images provided by the Stanford Lytro LF archive [39] and Kalantari et al. [4]. We remove the border views and crop the LF images to \(7 \times 7\) views as the ground truth, and randomly downsample them to \(2 \times 2\) or \(3 \times 3\) views as the input. Similar to other methods, we process only the luminance Y channel in the YCbCr color space. We train the network end-to-end using the adaptive moment estimation (ADAM) optimizer with \(\beta _{1} = 0.9\), \(\beta _{2} = 0.999\) and a batch size of 8 (via gradient accumulation). During training, small patches at the same position in each view are randomly extracted. To speed up training, we first train with \(32 \times 32\) patches until convergence, and then with \(64 \times 64\) patches. The learning rates for both patch sizes are initially set to \(1.0 \times 10^{-4}\) and then decreased by a factor of 0.5 every \(1.0 \times 10^{3}\) epochs until convergence.
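The optimization setup can be sketched as follows; the model, the random patches and the placeholder loss are stand-ins, and only the ADAM hyperparameters, learning-rate schedule and gradient accumulation reflect the settings described above.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)          # placeholder for GLGNet
optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)  # halve LR every 1e3 epochs

accum_steps = 8                                      # effective batch size of 8 via accumulation
for epoch in range(2):                               # shortened loop for illustration
    optimizer.zero_grad()
    for _ in range(accum_steps):
        patch = torch.rand(1, 1, 32, 32)             # random 32x32 patch stands in for training data
        loss = (model(patch) - patch).abs().mean()   # placeholder loss
        (loss / accum_steps).backward()              # accumulate gradients over 8 patches
    optimizer.step()
    scheduler.step()
```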

4 Evaluation

4.1 Performance evaluation

We compare our method with 8 state-of-the-art learning-based LF angular superresolution methods, including Kalantari et al. [4], Wu et al. [5], Wang et al. [6], Yeung et al. [7], Wu et al. [8], Jin et al. [9], Wang et al. [10] and Wang et al. [11]. For evaluation, we use 48 synthetic LF images, including 4 from the HCI [38] dataset, 5 from the old HCI [40] dataset and 39 from the Inria DLFD [41] dataset, as well as 70 real-world LF images captured with a Lytro Illum camera, including 30 from the 30 Scenes [4] dataset, 15 from the Reflective [39] dataset and 25 from the Occlusions [39] dataset. These datasets cover several important factors in evaluating LF angular superresolution methods, including high-frequency texture, natural illumination, practical camera distortion, large disparity, occlusions and reflective surfaces. We use the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) averaged over all synthesized views to evaluate all the methods.
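For reference, a minimal sketch of the view-averaged PSNR computation on the luminance channel is given below; the peak value and array layout are assumptions.

```python
import numpy as np

def psnr(y_ref, y_syn, peak=1.0):
    """PSNR between two luminance views with values in [0, peak]."""
    mse = np.mean((y_ref.astype(np.float64) - y_syn.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mean_view_psnr(lf_ref, lf_syn):
    """Average PSNR over the (N, M, H, W) synthesized views of a luminance LF."""
    N, M = lf_ref.shape[:2]
    return np.mean([psnr(lf_ref[s, t], lf_syn[s, t]) for s in range(N) for t in range(M)])

lf_gt = np.random.rand(7, 7, 64, 64)
lf_sr = lf_gt + 0.01 * np.random.randn(7, 7, 64, 64)
print(round(mean_view_psnr(lf_gt, lf_sr), 2))      # roughly 40 dB for noise of std 0.01
```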

To the best of our knowledge, our method is one of the few that can handle different interpolation rates with a single model, whereas most other methods need to train different models for different interpolation rates. Therefore, for a fair comparison, we also fine-tune our model for the different tasks on the training dataset. The fine-tuned model is denoted as \(\mathrm{Ours}_{\mathrm{ft}}\).

Real-world scenes

For real-world scenes, we reconstruct \(7 \times 7\) LFs from both \(3 \times 3\) and \(2 \times 2\) sparse views, with the sampling patterns illustrated in Fig. 6. We first reconstruct \(7 \times 7\) LFs from \(3 \times 3\) sparse views and compare with Wu et al. [5], Wang et al. [6], Yeung et al. [7] and Wang et al. [10]. For the \(2 \times 2\) to \(7 \times 7\) task, we compare with the methods of Kalantari et al. [4], Yeung et al. [7], Wu et al. [8], Jin et al. [9], Wang et al. [10] and Wang et al. [11].

Figure 6

Illustration of sampling patterns: (a) \(3 \times 3\); (b) \(2 \times 2\)

Table 1 lists the results on the real-world datasets. The performance of most methods decreases as the input becomes sparser. Among them, the EPI-based methods of Wu et al. [5] and Wu et al. [8] perform worse than the others, possibly due to the limited spatial and angular information of the 2D EPI. Our proposed model outperforms the comparison methods on all datasets under the \(3 \times 3\) to \(7 \times 7\) task, and under the \(2 \times 2\) to \(7 \times 7\) task it is second only to Wang et al. [11]. Like our method, Wang et al. [11] also focus on exploiting the spatial and angular relationships between the input views, which demonstrates the effectiveness of this strategy. The non-depth-based method of Yeung et al. [7] achieves the third-best results for the same reason.

Table 1 Quantitative comparisons (PSNR/SSIM) of different methods on real-world light field datasets for \(3 \times 3\) to \(7 \times 7\) and \(2 \times 2\) to \(7 \times 7\) tasks. The best results are highlighted in bold, and the second best results are highlighted in blue bold

Synthetic scenes

The disparity range of real-world datasets captured by commercial LF cameras is relatively small, usually less than 1 pixel. We therefore use synthetic datasets with much larger disparity ranges to evaluate the performance on scenes with large disparity. Note that we have replaced the origami scene with the dishes scene to increase the disparity range of the test images in the HCI dataset. Table 2 shows the \(2 \times 2\) to \(7 \times 7\) results compared with those of 6 state-of-the-art methods that have been shown to handle large disparity well, including Kalantari et al. [4], Yeung et al. [7], Wu et al. [8], Jin et al. [9], Wang et al. [10] and Wang et al. [11]. We also present the disparity range of each dataset to examine its effect on the reconstruction quality. Our model without fine-tuning (denoted as “Ours”) achieves the second-best SSIM and the third-best PSNR among all methods, only slightly lower than those of Jin et al. [9]. A possible reason is that we use a perceptual loss instead of a pixel loss; the latter usually leads to a better PSNR but does not necessarily improve the visual quality. To demonstrate this, we visually compare the reconstruction results of the different methods in Fig. 7, and additional comparisons with Jin et al. [9] are given in Fig. 8. Our approach produces accurate results that are closer to the ground truth.

Figure 7

Visual comparisons of different methods on the synthesized center view under the \(2 \times 2\) to \(7 \times 7\) task

Figure 8

Additional visual comparisons of different methods on the synthesized center view under the \(2 \times 2\) to \(7 \times 7\) task

Table 2 Quantitative comparisons (PSNR/SSIM) of different methods on synthetic light field datasets for the \(2 \times 2\) to \(7 \times 7\) task. The best results are highlighted in bold, and the second best results are highlighted in blue bold

Without using depth, our method achieves performance comparable to that of the state-of-the-art depth-based method of Jin et al. [9], even though the latter uses a model specifically trained for this task. In contrast, most non-depth-based methods perform worse as the disparity range increases, which shows the effectiveness of our EPI superresolution module in exploiting the local parallax structure of the EPI. Moreover, our method uses the same model for all the experiments, whereas the other methods have to train different models for different tasks. The fine-tuned model \(\mathrm{Ours}_{\mathrm{ft}}\) performs comparably to Wang et al. [11] and outperforms all the other methods on all metrics.

Table 3 lists the running times of the different methods, all tested on an Nvidia GeForce RTX A6000 GPU. Our method takes 9.13 s to reconstruct a \(7 \times 7\) LF with a spatial resolution of \(512 \times 512\) from \(2 \times 2\) input views. It is significantly faster than the other methods except Yeung et al. [7] and Jin et al. [9]; the additional running time relative to these two is the cost of better performance and practical flexibility.

Table 3 Comparison of the running times of different methods on the HCI datasets for the \(2 \times 2\) to \(7 \times 7\) task. The best results are highlighted in bold

4.2 Ablation study

To better illustrate the advantages of our GLGNet, we conduct ablation experiments on the HCI dataset with variants of our framework. Table 4 displays the results. The aggregation modules, the angular conversion module and the range weights significantly boost the quality of the synthesized views, and the proposed combination of loss functions further improves the performance of our model. In addition, the visual comparison given in Fig. 8 shows the effectiveness of our bilateral upsampling module.

Table 4 Quantitative comparisons on the HCI dataset with variants of the proposed framework for the \(2 \times 2\) to \(7 \times 7\) task. The best results are highlighted in bold

5 Limitation and discussion

To handle LFs with a large disparity range, we use a deep RDN with a large receptive field to extract features from EPIs. Moreover, our bilateral upsampling module predicts local weights to preserve the EPI structure. However, these operations are memory-intensive and become impractical for higher resolutions. A possible solution is to use multiresolution CNN architectures and to segment the processing along the depth dimension.

6 Conclusion

We present GLGNet, a learning-based method for depth-free LF angular superresolution that can handle large disparities. Our method uses pseudo 4D CNNs to capture the global spatial-angular consistency between input views, and 2D EPI superresolution to achieve high angular resolution by exploiting the local parallax structure of the EPI. We also propose a bilateral upsampling module that enables superresolution with different interpolation rates using a single model while better preserving the parallax structure of the EPI. We evaluate our method on real-world and synthetic LF scenes and show that it outperforms various state-of-the-art methods in addressing the challenges of large disparity.