We evaluated our strategy for training foveated image reconstruction using objective image metrics (Section
5.2) and a subjective experiment (Section
5.3). In our evaluation, we aimed to show that the benefits of our method are not limited to a particular selection of the training loss. To this end, we evaluated the generator network
G of our method with six loss functions that were combinations of LPIPS (
\(\mathcal {L}_{LPIPS}\)), L2 (
\(\mathcal {L}_{L2}\)), and Laplacian pyramid loss (
\(\mathcal {L}_{Lapl}\)) terms (see Table
1). For the discriminator network,
D, we benchmarked the performance of networks trained using our new patch dataset (
\(\mathcal {L}_{adv}^*\)) as well as the original dataset (
\(\mathcal {L}_{adv}\)).
5.1 Visual Inspection
Figure
7 presents the reconstruction results obtained by differently trained architectures on four images. For reference, we include the original high-resolution image and the standard foveated reconstruction obtained by interpolation with Gaussian weights. For the results of training using the Laplacian loss, we introduced a notation consisting of two letters,
\(\mathcal {L}_G^{Lapl\,XY}\), where
X,
Y \(\in \lbrace {{H}}, {{M}} \rbrace\) encode the position of the Gaussian peak at the far and near periphery, respectively. The letter
H represents the position of the peak located at the first level of the pyramid (characterized by an emphasis on high spatial frequencies), whereas the letter
M represents the position of the peak located at the fourth level of the pyramid (medium spatial frequencies). For example, the method denoted by
\(\mathcal {L}_G^{Lapl\,HM}\) refers to a reconstruction obtained by using a network trained with Laplacian pyramid based loss. In this case, high spatial frequencies were assigned larger weights for the far periphery, and medium frequencies were given higher importance for the near periphery.
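To make this notation concrete, the following sketch (in PyTorch; our illustration rather than the exact training code) shows one way a level-weighted Laplacian pyramid loss with a Gaussian peak can be implemented. It simplifies the pyramid construction (average pooling instead of Gaussian filtering), uses an L1 distance per band, and applies a single peak level, whereas the loss described above uses different peaks for the near- and far-periphery regions.

```python
import math
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=6):
    # Simplified pyramid: average-pool downsampling instead of Gaussian filtering.
    bands, current = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear", align_corners=False)
        bands.append(current - up)   # band-pass residual at this level
        current = down
    bands.append(current)            # low-pass remainder
    return bands

def gaussian_level_weights(levels, peak_level, sigma=1.0):
    # Per-level weights with a Gaussian peak at `peak_level`
    # (0 = finest band, i.e., the first pyramid level for "H"; 3 = the fourth level for "M").
    w = torch.tensor([math.exp(-(l - peak_level) ** 2 / (2 * sigma ** 2)) for l in range(levels)])
    return w / w.sum()

def laplacian_loss(pred, target, peak_level=0, levels=6):
    # Weighted L1 distance between corresponding pyramid bands.
    weights = gaussian_level_weights(levels, peak_level)
    pred_bands = laplacian_pyramid(pred, levels)
    target_bands = laplacian_pyramid(target, levels)
    return sum(w * F.l1_loss(p, t) for w, p, t in zip(weights, pred_bands, target_bands))
```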
The first observation is that all eight GAN-based reconstructions contain clearly hallucinated content, and very fine details are not reproduced exactly. Although this is visible under direct visual inspection, such deviations are less noticeable when the images are shown in the periphery. Furthermore, all reconstructions introduce high spatial frequencies and strong edges, but training with
\(\mathcal {L}_G^{L2}\) loss makes them sparser and more exaggerated. A visual comparison (see Figure
7) between the discriminators trained with and without our synthesized dataset (i.e.,
\(\mathcal {L}_G^{L2}\) vs.
\(\mathcal {L}_{G^*}^{L2}\),
\(\mathcal {L}_G^{LPIPS}\) vs.
\(\mathcal {L}_{G^*}^{LPIPS}\),
\(\mathcal {L}_G^{Lapl\,MM}\) vs.
\(\mathcal {L}_{G^*}^{Lapl\,MM}\),
\(\mathcal {L}_G^{Lapl\,HH}\) vs.
\(\mathcal {L}_{G^*}^{Lapl\,HH}\)) shows that our results include higher spatial frequencies. We argue that this is due to the flexibility of the discriminator, which penalizes hallucinations of high spatial frequencies less harshly. This is the desired effect because although the HVS is sensitive to the removal of some high spatial frequencies in the periphery, it is less sensitive to changes in their positions (Section
2).
To further investigate the spatial frequency distribution of our reconstructions, we visualized the frequency-band decomposition produced by the Laplacian pyramid and computed the differences for the two layers at the bottom of the pyramid, which encode the highest frequency band and the band one octave below it (Figure
8). We observe that our reconstructions provide additional hallucinated high-frequency details that do not exist in the traditional foveated image reconstruction. Please refer to the supplementary material for an interactive demo with more results.
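As a rough illustration of this analysis (reusing `laplacian_pyramid` from the sketch above; the visualization details are our own simplification), the two finest bands can be extracted and compared against the Gaussian-interpolation baseline:

```python
# `ours` and `baseline` are 1x3xHxW tensors holding our reconstruction and the
# standard Gaussian-interpolation reconstruction of the same image.
ours_bands = laplacian_pyramid(ours, levels=6)
base_bands = laplacian_pyramid(baseline, levels=6)

# Finest band (highest frequencies) and the band one octave below it.
for level in (0, 1):
    extra_detail = ours_bands[level] - base_bands[level]
    print(f"level {level}: mean |difference| = {extra_detail.abs().mean().item():.4f}")
```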
5.2 Objective Image Metrics
We assessed the perceptual quality of our foveated image reconstruction using the recently introduced FovVideoVDP metric [
33]. FovVideoVDP is a full-reference quality metric that can be used on images and videos. It takes into account the peripheral acuity of the HVS and the retinal eccentricity of the stimuli while computing quality scores. FovVideoVDP quality scores are in
Just-Objectionable Difference (JOD) units (
\(\text{JOD} \in [0, 10]\)), where
\(\text{JOD} = 10\) represents the highest quality and lower values represent higher perceived distortion with respect to the reference. We computed FovVideoVDP quality scores of the images generated by our method (
\(\mathcal {L}_{adv}^*\)) and those reconstructed by networks trained on a standard dataset (
\(\mathcal {L}_{adv}\)). We provided the original image to the metric as a reference image. We report the FovVideoVDP quality scores in Table
2 for different peripheral regions (near and far) and generator loss functions, comparing our training method (
\(\mathcal {L}_{adv}^*\)) with the standard training approach (
\(\mathcal {L}_{adv}\)). Our method achieved higher quality scores than the standard approach to training the GAN: the generator reconstructed the images better when our method included perceptually non-objectionable distortions in the training set of the discriminator.
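For reference, FovVideoVDP scores can be computed with the publicly available pyfvvdp package. The snippet below is a minimal sketch rather than our exact evaluation script; the display model name and image paths are assumptions, and the foveated mode and fixation handling should be configured according to the package documentation.

```python
import pyfvvdp                       # pip install pyfvvdp
from imageio.v3 import imread

reference = imread("original.png")   # H x W x 3, uint8 reference image
test = imread("reconstruction.png")  # reconstruction produced by the generator

# The display model used here is an assumption for this sketch; the foveated mode and
# gaze position should be set to match the experimental setup.
metric = pyfvvdp.fvvdp(display_name="standard_4k", heatmap=None)

jod, stats = metric.predict(test, reference, dim_order="HWC")
print("FovVideoVDP quality (JOD, 10 = no visible difference):", float(jod))
```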
We also evaluated our method by using other objective quality metrics. Although many objective quality metrics are available for non-foveated quality measurement, objective quality assessment for foveated images is still an open research problem. In the absence of quality metrics for specific types of image distortions, past work has shown that the task-specific calibration of currently available objective quality metrics may be a promising solution [
27,
69]. Motivated by this, we used our perceptual data to calibrate existing metrics—L2, SSIM [
65], MS-SSIM [
66], and LPIPS [
70]—separately for different eccentricities. The calibration was performed by fitting the following logistic function [
41],
to reflect the non-linear relation between the magnitude of distortion in the image and the probability of detecting it, with
\(a,b,c,k,q,v\) being free parameters. Inspired by LPIPS [
70], we also considered reweighting the contributions of each convolution and pooling layer of VGG-19 for each eccentricity separately. We refer to this metric, based on the calibrated VGG network, as Cal. VGG.
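The six free parameters match those of a generalized logistic function; a form consistent with this parameterization (the exact variant used in [41] is assumed here) is

\[
y(t) = a + \frac{k - a}{\bigl(c + q\,e^{-b\,t}\bigr)^{1/v}},
\]

where \(t\) denotes the raw metric value and \(y(t)\) the predicted probability of detection.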
For all metrics, the free parameters (i.e., the parameters of the logistic functions as well as the weights and bias of VGG-19 layers) were obtained by minimizing the MSE in predicting the probability of detection:
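Writing \(\theta\) for the set of free parameters and \(y_{\theta}\) for the fitted logistic mapping (our notation for this sketch), the objective can be expressed as

\[
\min_{\theta}\; \sum_{r} \sum_{(I,\tilde{I}) \in S_r} \Bigl( y_{\theta}\bigl(M(I,\tilde{I})\bigr) - P(I,\tilde{I},r) \Bigr)^{2},
\]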
where
M is one of the original metrics;
\(S_r\) is the set of distorted and undistorted pairs of images for eccentricity
\(r \in \lbrace 8^\circ, 14^\circ, 20^\circ \rbrace\); and
P is the probability of detecting the difference. Minimization was performed by using non-linear curve fitting through the
trust-region-reflective and the
Levenberg-Marquardt optimizations [
4,
30] with multiple random initializations. Furthermore, we constrained the VGG weights to be non-negative to maintain the positive correlation between image dissimilarity and the magnitude of differences in VGG features, as motivated by the work of Zhang et al. [
70]. To make our dataset more comprehensive, we added stimuli from an additional experiment that analyzed the visibility of the blur. For this purpose, we followed the procedure described in Section
3.2.
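A minimal sketch of such a fit for a single eccentricity, using SciPy's implementations of both optimizers, could look as follows. The data arrays are illustrative placeholders, and the use of `scipy.optimize.curve_fit` is an assumption about tooling rather than our exact fitting code; the non-negative bounds shown here merely illustrate constrained fitting, whereas in our calibration the non-negativity constraint applies to the VGG-19 layer weights.

```python
import numpy as np
from scipy.optimize import curve_fit

def generalized_logistic(t, a, b, c, k, q, v):
    # Maps a raw metric value t to a detection probability (see the equation above).
    return a + (k - a) / (c + q * np.exp(-b * t)) ** (1.0 / v)

# Illustrative data for one eccentricity: raw metric values for each distorted/undistorted
# pair and the measured probability of detecting the difference.
metric_values = np.array([0.01, 0.05, 0.10, 0.20, 0.40, 0.80])
detection_probability = np.array([0.02, 0.10, 0.30, 0.55, 0.85, 0.98])

best_err, best_params = np.inf, None
for _ in range(20):                                   # multiple random initializations
    p0 = np.random.uniform(0.1, 2.0, size=6)
    try:
        # 'trf' (trust-region-reflective) supports bound constraints;
        # 'lm' (Levenberg-Marquardt) can be used for unconstrained fits.
        params, _ = curve_fit(generalized_logistic, metric_values,
                              detection_probability, p0=p0,
                              bounds=(1e-3, np.inf), method="trf")
    except RuntimeError:
        continue                                      # fit did not converge for this start
    err = np.mean((generalized_logistic(metric_values, *params)
                   - detection_probability) ** 2)
    if err < best_err:
        best_err, best_params = err, params
```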
To validate our calibration, we performed fivefold cross validation and computed Pearson’s correlations between the ground-truth probability of detecting distortions and metric predictions. Figure
9 presents correlation coefficients for all trained metrics and eccentricities computed as an average across all the folds. Each bar shows the measured correlation for the uncalibrated (bright part) and calibrated (dark part) metrics by using the data from our initial experiment (Section
3.2). For uncalibrated metrics, we used the standard sigmoid logistic function:
\(y(t) = 1/(1+e^{-t})\). We also provide aggregated results, where the correlation was analyzed across all eccentricities. The individual scores show that the performance of the original, uncalibrated metrics declines as the eccentricity increases. The additional calibration significantly improves the prediction performance of all metrics. An interesting observation is that LPIPS performs very well for the smallest eccentricity (\(8^\circ\)). For larger ones (\(14^\circ\) and \(20^\circ\)), however, its performance drops considerably even with the optimized logistic function. We attribute this to the fact that LPIPS is not trained for peripheral vision. When the weights of the deep layers of VGG-19 are optimized per eccentricity (Cal. VGG), the performance improves significantly. This suggests that such metrics are promising, but the contributions of the individual layers to the overall prediction must change with eccentricity. Since Cal. VGG delivered the best performance in these tests, we selected it to benchmark the foveated image reconstruction techniques listed in Table
1. The results of this test for other metrics that we did not use for evaluation are also reported in the supplementary material as a reference.
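Because Cal. VGG is used for the remaining benchmarks, the sketch below illustrates the underlying idea of an eccentricity-specific, non-negatively weighted combination of VGG-19 layer differences. The inputs are assumed to be ImageNet-normalized patches; the use of a ReLU to enforce non-negativity, the omission of the per-layer bias terms, and the squared-difference pooling are our simplifications, not the exact published definition.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class CalVGG(nn.Module):
    """Eccentricity-specific, layer-reweighted VGG-19 distance (simplified sketch)."""

    def __init__(self, num_layers=21):   # 16 convolution + 5 pooling layers in VGG-19
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad_(False)
        self.features = features
        # One non-negative weight per convolution/pooling layer, fitted separately
        # for every calibrated eccentricity (8, 14, and 20 degrees).
        self.raw_weights = nn.ParameterDict({
            str(ecc): nn.Parameter(torch.ones(num_layers)) for ecc in (8, 14, 20)
        })

    def forward(self, ref, test, eccentricity):
        w = torch.relu(self.raw_weights[str(eccentricity)])   # enforce non-negativity
        distance, layer_idx, x, y = 0.0, 0, ref, test
        for module in self.features:
            x, y = module(x), module(y)
            if isinstance(module, (nn.Conv2d, nn.MaxPool2d)):
                distance = distance + w[layer_idx] * (x - y).pow(2).mean()
                layer_idx += 1
        return distance   # mapped to a detection probability by the fitted logistic
```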
After calibration, Cal. VGG is still limited to processing image patches as input. To run Cal. VGG on full images covering a larger field of view, the metric must account for the eccentricity at which each pixel is seen. To support arbitrary eccentricities, we linearly interpolated the metric predictions between the calibrated values at 8° and 20° (see the sketch below). Moreover, in contrast to the calibration step, we switched to a single logistic function whose parameters were estimated using the experimental data from all eccentricities. With these extensions, we ran Cal. VGG locally on non-overlapping patches of the full input image and, as a global pooling step, averaged the values across all patches to obtain a single score per image. To benchmark different reconstruction methods, we randomly selected 10 publicly available images at
\(3840 \times 2160\) resolution that contained architectural and natural features. Before applying the different reconstruction techniques, we split the images into three regions: fovea, near periphery, and far periphery. We then drew sparse samples as visualized in Figure 1. To test the reconstruction quality provided by different sampling rates in the near- and far-peripheral regions, we analyzed the Cal. VGG predictions for various blending strategies by changing the eccentricity threshold at which the transition from the near- to the far-peripheral region occurs. We computed the predicted detection rates from Cal. VGG for threshold points between 9° and 22° for all images. Figure
10 presents the results. Lower detection rates indicate lower probabilities of detecting reconstruction artifacts by human observers and therefore a higher reconstruction quality. Training the reconstruction using
\(\mathcal {L}_{G^*}^{LPIPS}\) and
\(\mathcal {L}_G^{LPIPS}\) yields reconstructions that are the least likely to be distinguished from the original images. The results generated by using
\(\mathcal {L}_{G^*}^{L2}\) delivered a lower detection rate than those generated by using
\(\mathcal {L}_{G}^{L2}\). The detection rate for our method is significantly lower when the far-periphery threshold is selected in the eccentricity range of \(12^\circ\) to \(22^\circ\) (
\(p \lt 0.05\)). For
\(\mathcal {L}_{G^*}^{LPIPS}\), this difference in detection rate is significant compared with that for
\(\mathcal {L}_{G}^{LPIPS}\) for thresholds between 9° and 16° (
\(p \lt 0.05\)). We did not note a significant difference between the methods (
\(p \gt 0.10\) for all cases considered) when the network was trained by using Laplacian loss. All
p-values were computed by using a
t-test.
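The patch-wise pooling and eccentricity interpolation described above can be sketched as follows. Here `detection_probability(ref_patch, test_patch, ecc)` is a placeholder for the calibrated Cal. VGG predictor with the single logistic applied, and the pixels-per-degree conversion and gaze handling are assumptions for this illustration.

```python
import numpy as np

PATCH = 256            # patch size used during calibration
PIX_PER_DEG = 60.0     # display-dependent conversion; an assumption for this sketch

def full_image_score(ref, test, gaze_xy, detection_probability):
    """Average detection probability over non-overlapping patches of a full image."""
    h, w = ref.shape[:2]
    probs = []
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            # Eccentricity of the patch center relative to the gaze position.
            cx, cy = x + PATCH / 2, y + PATCH / 2
            ecc = np.hypot(cx - gaze_xy[0], cy - gaze_xy[1]) / PIX_PER_DEG
            ref_p = ref[y:y + PATCH, x:x + PATCH]
            test_p = test[y:y + PATCH, x:x + PATCH]
            # Linear interpolation of the per-eccentricity predictions between 8 and 20 deg.
            t = (np.clip(ecc, 8.0, 20.0) - 8.0) / 12.0
            p = (1 - t) * detection_probability(ref_p, test_p, 8) \
                + t * detection_probability(ref_p, test_p, 20)
            probs.append(p)
    return float(np.mean(probs))   # global pooling: mean over all patches
```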
We separated the images into two groups according to their prominent visual features: nature and architecture. Nature images were considered to form a class containing fewer geometrical structures and more texture-like areas, such as leaves, trees, and waves. They usually contain a large variety of shapes without well-defined patterns and exhibit high variance in color and structure. Visual distortions in nature images are therefore less likely to result in perceivable changes because this variance may mask the distortions. In contrast, architecture images mostly contain human-made structures, such as buildings, and larger uniform areas with clear visual boundaries. They usually have many edges and corners, which makes a perceptually plausible and faithful reconstruction from sparse image samples more challenging. Distorting such images is more likely to mix visual information from different areas, which is easy to detect even in peripheral vision. Owing to these distinct properties, we evaluated the results on the two types of images separately. The results show that, compared to the overall trend, the difference in detection rate between our method and the standard training is more pronounced for nature images and less pronounced for architecture images. The results are available in Figure 6 of the supplementary material.
5.3 Subjective Experiments
The psychovisual experiment used to derive the data for training our reconstruction methods was performed with five participants. Although using a small number of participants is common for such complex experiments, since they are meant to capture general properties of the HVS, such experiments do not investigate potential differences within the population. In addition, it is not clear whether the method derived from the perceptual data is effective. Therefore, to further validate our claims regarding the new training strategy and to verify the importance of the improvements observed with the calibrated metrics, we conducted an additional subjective user experiment in which naive participants were asked to directly compare different reconstruction methods.
Stimuli and Task. We used the 10 images that were used in our evaluation on objective image metrics (Section
5.2). They were subsampled and reconstructed by using
\(\mathcal {L}_{G^*}^{L2}\),
\(\mathcal {L}_{G}^{L2}\),
\(\mathcal {L}_{G^*}^{LPIPS}\), and
\(\mathcal {L}_G^{LPIPS}\) as shown in Figure
1. In each trial, the participants were shown the original image on the left and one of two reconstructions on the right half of the display. The two halves were separated by a 96-pixel-wide gray stripe. The participants could freely switch between reconstructions by using a keyboard. They were asked to select the reconstruction that was more similar to the reference on the left by pressing a key. During the experiment, the images followed the eye movements of the participant, as shown in Figure
11. In contrast to the calibration experiment performed in Section
3.2, in this experiment we showed full images to the participants, each covering half of the screen. Fixation was enforced as in the calibration experiment described in Section
3.2. Each trial took 15 seconds on average. The total duration of the experiment was around 15 minutes.
Participants. Fifteen participants with normal or corrected-to-normal vision took part in the experiment. All were naive to the purpose of the study and were given instructions at the beginning. Each participant was asked to compare all pairs of techniques for each image (60 comparisons per participant).
Results. To analyze the results, for each pair of techniques, we computed a preference rate of method A over method B. The rate expresses the percentage of trials in which method A was chosen as visually more similar to the original image. Table
3 shows the preference rates obtained by using networks trained using our procedure (
\(\mathcal {L}_{G^*}^{LPIPS}\),
\(\mathcal {L}_{G^*}^{L2}\)) in comparison with the standard procedure (
\(\mathcal {L}_G^{LPIPS}\),
\(\mathcal {L}_G^{L2}\)). We report the results for all images (last column) and do so separately for nature and architecture images. We used the binomial test to compute the
p-values. The reconstructions obtained by using a network trained with our strategy are preferred in 57% of the cases (
\(p = 0.013\)). The difference is significant for
\(\mathcal {L}_{G^*}^{LPIPS}\) with a 59% preference (
\(p = 0.04\)). In the context of different image classes, our method performed well on nature images, both when we consider
\(\mathcal {L}_{G^*}^{L2}\) and
\(\mathcal {L}_{G^*}^{LPIPS}\) separately and when we consider them jointly (for each case preferred in 75% of the cases,
\(p \lt 0.001\)). On architecture images, we observe the preference for
\(\mathcal {L}_G^{L2}\) (63%,
\(p = 0.037\)). Considering all techniques on architecture images, our method is preferred in 40% of the cases (
\(p = 0.018\)). This is consistent with the results of Cal. VGG (Section
5.2), where our method had a lower probability of detection on nature images and a similar probability of detection on architecture images. We hypothesize several reasons for its poorer performance on architecture images. First, architecture images contain objects with simple shapes, uniform areas, edges, and corners. Such features might not have been represented well in our calibration, where we used
\(256 \times 256\) patches, whose size was limited to avoid testing visibility across a wide range of eccentricities. Furthermore, we believe that distortions in the visual features of simple objects are much easier to perceive than those in natural textures, which are more random. This problem might have been aggravated because our calibration considered both groups together and did not distinguish between them when modeling the perception of artifacts. The issue might be mitigated by using different numbers of guiding samples for different classes of images when generating the dataset for training the GAN-based reconstruction. However, this would require more careful data collection in the initial experiment and a more complex model that predicts the number of guiding samples based on the image content. Once these challenges have been addressed, the proposed approach could yield a more accurate dataset and be used to train a single architecture that handles different types of images.
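The preference rates and their significance can be checked with a standard binomial test. The counts below are illustrative placeholders rather than the exact trial counts from our experiment.

```python
from scipy.stats import binomtest

# Illustrative counts: number of trials in which our method was preferred over the
# standard training, out of the total number of comparisons between the two.
preferred_ours, total = 257, 450       # placeholder values
result = binomtest(preferred_ours, total, p=0.5, alternative="two-sided")
print(f"preference rate = {preferred_ours / total:.0%}, p = {result.pvalue:.3f}")
```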
Figure
12 shows the preferences for the individual methods compared with those for all other training strategies, including different loss functions—that is,
\(\mathcal {L}_{G^*}^{LPIPS}\),
\(\mathcal {L}_{G}^{LPIPS}\),
\(\mathcal {L}_{G}^{L2}\), and
\(\mathcal {L}_{G^*}^{L2}\). In the experiment with experts (left),
\(\mathcal {L}_{G^*}^{LPIPS}\) attained the highest preference of 38% (
\(p \lt 0.001\)), whereas
\(\mathcal {L}_G^{L2}\) recorded the lowest preference (24%,
\(p \lt 0.001\)). When divided into classes,
\(\mathcal {L}_{G^*}^{LPIPS}\) and
\(\mathcal {L}_{G^*}^{L2}\) were the most preferred methods for natural images, with values of 41% (
\(p \lt 0.001\)) and 37% (
\(p = 0.003\)), respectively. The other methods had lower preference values: 22% for
\(\mathcal {L}_G^{LPIPS}\) (
\(p \lt 0.001\)) and 20% for
\(\mathcal {L}_G^{L2}\) (
\(p \lt 0.001\)).
\(\mathcal {L}_G^{LPIPS}\) was the most preferred on architecture images (37%,
\(p = 0.005\)), followed by
\(\mathcal {L}_{G^*}^{LPIPS}\) (35%,
\(p = 0.032\)).
\(\mathcal {L}_{G^*}^{L2}\) was selected the fewest times (21%,
\(p \lt 0.001\)). All
p-values were computed by using the binomial test, and the remaining results were not statistically significant. When the experiment was repeated with naive participants (right), it yielded different thresholds for the number of guiding samples needed for an appropriate foveated reconstruction. In particular, the values related to \(8^\circ\) and \(14^\circ\) changed from 9.09 to 7.93 and from 6.89 to 4.57, respectively. This means that for a standard observer, the number of samples needed to generate an image of fixed quality is higher than for an expert observer. Since texture synthesis is the initial step of our pipeline, we trained all our networks again and repeated the validation experiment with the new reconstructions. The results are presented in Table
4 and Figure
12 (right). The new experiments showed that although our technique maintained a slight advantage with
\(\mathcal {L}_{G^*}^{LPIPS}\) over the standard method on nature images, our reconstructions delivered the worst performance on architecture images.