1 Introduction

Computer vision algorithms generally rely on the assumption that the value of each pixel is a function of the radiance of a single area in the scene. Semi-reflectors, such as typical windows or glass doors, break this assumption by creating a superposition of the radiance of two different objects: the one behind the surface and the one that is reflected. It is virtually impossible to avoid semi-reflectors in man-made environments, as can be seen in Fig. 2(a), which shows a typical downtown area. Any multi-view stereo or SLAM algorithm would be hard-pressed to produce accurate reconstructions on images of this type.

Fig. 1.

Glass surfaces are virtually unavoidable in real-world pictures. Our approach to separating the reflection and transmission layers works even for general, curved surfaces, which break the assumptions of state-of-the-art methods. In this example, only our method can correctly estimate both the reflection \(\widehat{R}\) (the tree branches) and the transmission \(\widehat{T}\) (the car’s interior).

Fig. 2.

Depending on the ratio between transmitted and reflected radiance, a semi-reflector may produce no reflections, pure reflections, or a mix of the two, which can vary smoothly or abruptly. The local curvature of the surface can also affect the appearance of the reflection. The last two cases are far from uncommon, as shown in (b).

Several methods exist that attempt to separate the reflection and transmission layers. At a semi-reflective surface, the observed image can be modeled as a linear combination of the reflection and the transmission images: \(I_o = \alpha _r I_r + \alpha _t I_t\). The inverse problem is ill-posed as it requires estimating multiple unknowns from a single observation. A solution, therefore, requires additional priors or data. Indeed, previous works rely on assumptions about the appearance of the reflection (e.g., it is blurry), about the shape and orientation of the surface (e.g., it is perfectly flat and exactly perpendicular to the principal axis of the camera), among others. Images taken in the wild, however, regularly break even the most basic of these assumptions, see Fig. 2(b), causing the results of state-of-the-art methods [2,3,4] to deteriorate even on seemingly simple cases, as shown in Fig. 1, which depicts a fairly typical real-world scene.

One particularly powerful tool is polarization: images captured through a polarizer oriented at different angles offer additional observations. Perhaps surprisingly, however, our analysis of the state-of-the-art methods indicates that the quality of the results degrades significantly when moving from synthetic to real data, even when using polarization. This is due to the simplifying assumptions that are commonly made, but also to an inherent issue that is all too often neglected: a polarizer’s ability to attenuate reflections greatly depends on the viewing angle [5]. The attenuation is maximal at an angle called the Brewster angle, \(\theta _{B}\). However, even when part of a semi-reflector is imaged at \(\theta _{B}\), the angle of incidence in other areas is sufficiently different from \(\theta _{B}\) to essentially void the effect of the polarizer, as clearly shown in Fig. 3. Put differently, because of the limited signal-to-noise ratio, for certain regions in the scene, the additional observations may not be independent.

We present a deep-learning method capable of separating the reflection and transmission components of images captured in the wild. The success of the method stems from our two main contributions. First, rather than requiring a network to learn the reflected and transmitted images directly from the observations, we leverage the properties of light polarization and use a residual representation, in which the input images are projected onto the canonical polarization angles (Sects. 3.1 and 3.2). Second, we design an image-based data generator that faithfully reproduces the image formation model (Sect. 3.3).

We show that our method can successfully separate the reflection and transmission layers even in challenging cases, on which previous works fail. To further validate our findings, we capture the Urban Reflections Dataset, a polarization-based dataset of reflections in urban environments that can be used to test reflection removal algorithms on realistic images. Moreover, to perform a thorough evaluation against state-of-the-art methods whose implementation is not publicly available, we re-implemented several representative methods. As part of our contribution, we release those implementations for others to be able to compare against their own methods [1].

2 Related Work

There is a rich literature of methods dealing with semi-reflective surfaces, which can be organized in three main categories based on the assumptions they make.

Single-image methods can leverage gradient information to solve the problem. Levin and Weiss, for instance, require manual input to separate gradients of the reflection and the transmission [6]. Fully automated methods can distinguish the gradients of the reflected and transmitted images by leveraging defocus blur [7]: reflections can be blurry because the subject behind the semi-reflector is much closer than the reflected image [4], or because the camera is focused at infinity and the reflected objects are close to the surface [8]. Moreover, for the case of double-pane or thick windows, the reflection can appear “doubled” [9], and this can be used to separate it from the transmitted image [10]. While these methods show impressive results, their assumptions are stringent and do not generalize well, causing them to fail on common real-world cases.

Multiple images captured from different viewpoints can also be used to remove reflections. Several methods propose different ways to estimate the relative motion of the reflected and transmitted image, which can be used to separate them [11,12,13,14,15]. It is important to note that these methods assume static scenes: the motion is the apparent motion of the reflected layer relative to the transmitted layer, not scene motion. Other than that, these methods make assumptions that are less stringent than those made by single-image methods. Nonetheless, these algorithms work well only when the reflected and transmitted scenes are shallow in terms of depth, so that their apparent velocity can be assumed uniform. For the case of spatially and temporally varying mixes, Kaftory and Zeevi propose to use sparse component analysis instead [16].

Multiple images captured under different polarization angles offer a third avenue to tackle this problem. Assuming that images taken at different polarization angles offer independent measurements of the same scene, reflection and transmission can be separated using independent component analysis [17,18,19]. An additional prior that can be leveraged is given by double reflections, when the semi-reflective surface generates them [9]. Under ideal conditions, and leveraging polarization information, a solution can also be found in closed form [2, 3]. In our experiments, we found that most of the pictures captured in unconstrained settings break even the well-founded assumptions used by these papers, as shown in Fig. 2.

3 Method

We address the problem of layer decomposition by leveraging the ability of a semi-reflector to polarize the reflected and transmitted layers differently. Capturing multiple polarization images of the same scene, then, offers partially independent observations of the two layers. To use this information, we take a deep learning approach. Since the ground truth for this problem is virtually impossible to capture, we synthesize it. As for any data-driven approach, the realism of the training data is paramount to the quality of the results. In this section, after reviewing the image formation model, we give an overview of our approach, discuss the limitations of the assumptions that are commonly made and how we address them in our data generation pipeline, and finally describe the details of our implementation.

3.1 Polarization, Reflections, and Transmissions

Consider two points, \(P_R\) and \(P_T\), such that \(P^{'}_R\), the reflection of \(P_R\), lies on the line of sight of \(P_T\), and assume that both emit unpolarized light, see Fig. 3. After being reflected or transmitted, unpolarized light becomes polarized by an amount that depends on \(\theta \), the angle of incidence (AOI).

At point \(P_S\), the intersection of the line of sight and the surface, the total radiance L is a combination of the reflected radiance \(L_R\), and the transmitted radiance \(L_T\). Assume we place a linear polarizer with polarization angle \(\phi \) in front of the camera. If we integrate over the exposure time, the intensity at each pixel x is

$$\begin{aligned} I_{\phi }(x) = \alpha (\theta , \phi _\perp , \phi ) \cdot \frac{I_R(x)}{2} + \left( 1-\alpha (\theta , \phi _\perp , \phi )\right) \cdot \frac{I_T(x)}{2}, \end{aligned}$$
(1)

where the mixing coefficient \(\alpha (\cdot ) \in [0,1]\), the angle of incidence \(\theta (x) \in [0, \nicefrac {\pi }{2}]\), the p–polarization direction [2] \(\phi _{\perp }(x)\in [-\nicefrac {\pi }{4}, \nicefrac {\pi }{4}]\), and the reflected and transmitted images at the semi-reflector, \(I_R(x)\) and \(I_T(x)\), are all unknown.

At the Brewster angle, \(\theta _{B}\), the reflected light is completely polarized along \(\phi _{\perp }\), i.e. in the direction perpendicular to the plane of incidence, and the transmitted light along \(\phi _{\parallel }\), the direction parallel to the plane of incidence. The angles \(\phi _{\perp }\) and \(\phi _{\parallel }\) are called the canonical polarization angles. In the unique condition in which \(\theta (x) = \theta _{B}\), two images captured with the polarizer at the canonical polarization angles offer independent observations that are sufficient to disambiguate between \(I_R\) and \(I_T\). Unless the camera or the semi-reflector is at infinity, however, \(\theta (x) = \theta _{B}\) only holds for a few points in the scene, if any, as shown in Fig. 3. To complicate things, for curved surfaces, \(\theta (x)\) varies non-linearly with \(x\). Finally, even for arbitrarily many acquisitions at different polarization angles, \(\phi _{j}\), the problem remains ill-posed as each observation \(I_{\phi _{j}}\) adds new pixel-wise unknowns \(\alpha (\theta ,\phi _{\perp },\phi _j)\).
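For a concrete sense of scale, consider a typical air–glass interface with refractive index \(n \approx 1.5\) (an illustrative value, not taken from the paper). The Brewster angle follows directly from \(\tan \theta _{B} = \nicefrac {n_2}{n_1}\):

$$\begin{aligned} \theta _{B} = \arctan \left( \nicefrac {1.5}{1.0}\right) \approx 56.3^\circ . \end{aligned}$$

As the experiment in Fig. 3 shows, a deviation of only a few degrees from this angle already degrades the polarizer’s ability to suppress the reflection, and such deviations are unavoidable for nearby or curved surfaces.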

Fig. 3.

A polarizer attenuates reflections when they are viewed at the Brewster angle \(\theta =\theta _B\). For the scene shown on the left, we manually selected the two polarization directions that maximize and minimize reflections respectively. Indeed, the reflection of the plant is almost completely removed. However, only a few degrees away from the Brewster angle, the polarizer has little to no effect, as is the case for the reflection of the book on the right.

3.2 Recovering R and T

When viewed through a polarizer oriented along direction \(\phi \), \(I_R\) and \(I_T\), which are the reflected and transmitted images at the semi-reflector, produce image \(I_\phi \) at the sensor. Due to differences in dynamic range, as well as noise, in some regions the reflection may dominate \(I_\phi \), or vice versa, see Sect. 3.3. Without hallucinating content, one can only aim at separating R and T, which we define to be the observable reflected and transmitted components. For instance, T may be zero in regions where R dominates, even though \(I_T\) may be greater than zero in those regions. To differentiate them from the ground truth, we refer to our estimates as \(\widehat{R}\) and \(\widehat{T}\).

Fig. 4.

Our encoder-decoder network architecture with ResNet blocks includes a Canonical Projection Layer, which projects the input images onto the canonical polarization directions, and uses a residual parametrization for \(\widehat{T}\) and \(\widehat{R}\).

To recover \(\widehat{R}\) and \(\widehat{T}\), we use an encoder-decoder architecture, which has been shown to be particularly effective for a number of tasks, such as image-to-image translation [20], denoising [21], or deblurring [22]. Learning to estimate \(\widehat{R}\) and \(\widehat{T}\) directly from images taken at arbitrary polarization angles does not produce satisfactory results. One main reason is that parts of the image may be pure reflections, thus yielding no information about the transmission, and vice versa.

To address this issue, we turn to the polarization properties of reflected and transmitted images. Recall that R and T are maximally attenuated, though generally not completely removed, at \(\phi _{\Vert }\) and \(\phi _{\perp }\) respectively. The canonical polarization angles depend on the geometry of the scene, and are thus hard to capture directly. However, we note that an image \(I_{\phi }(x)\) can be expressed as [3]:

$$\begin{aligned} I_{\phi }(x) = I_{\perp }(x)\cos ^2(\phi - \phi _{\perp }(x)) + I_{\Vert }(x) \sin ^2(\phi - \phi _{\perp }(x)). \end{aligned}$$
(2)

Since Eq. 2 has three unknowns, \(I_{\perp }\), \(\phi _{\perp }\), and \(I_{\Vert }\), we can use three different observations of the same scene, \(\left\{ I_{\phi _i}(x)\right\} _{i=\{0,1,2\}}\), to obtain a linear system that allows us to compute \(I_{\perp }(x)\) and \(I_{\Vert }(x)\). To further simplify the math, we capture images such that \(\phi _i = \phi _0+i\cdot \nicefrac {\pi }{4}\).
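For illustration, with \(\phi _i = \phi _0+i\cdot \nicefrac {\pi }{4}\) the per-pixel solve has a simple closed form. The NumPy sketch below is an illustration of Eq. 2 only, not the TensorFlow layer described next; it recovers the two canonical images up to the \(\perp /\Vert \) assignment, which the intensities alone do not determine:

```python
import numpy as np

def canonical_projection(i_0, i_1, i_2):
    """Recover the two canonical images from observations taken with the polarizer
    at phi_0, phi_0 + pi/4, and phi_0 + pi/2 (Eq. 2).

    Writing I_phi = A + D*cos(2*(phi - phi_perp)) with A = (I_perp + I_par)/2 and
    D = (I_perp - I_par)/2, the three samples determine A and the two quadrature
    components of D in closed form.
    """
    a = 0.5 * (i_0 + i_2)                  # mean term A
    d_cos = 0.5 * (i_0 - i_2)              # D * cos(2*(phi_0 - phi_perp))
    d_sin = a - i_1                        # D * sin(2*(phi_0 - phi_perp))
    d = np.sqrt(d_cos ** 2 + d_sin ** 2)   # |D|
    return a + d, a - d                    # the two canonical images (unordered)
```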

For efficiency, we implement the projection onto the canonical views as a network layer in TensorFlow. The canonical views and the actual observations are then stacked in a 15-channel tensor and used as input to our network. Then, instead of training the network to learn to predict \(\widehat{R}\) and \(\widehat{T}\), we train it to learn the residual reflection and transmission layers. More specifically, we train the network to learn an 8-channel output, which comprises the residual images \(\widetilde{T}(x)\), \(\widetilde{R}(x)\), and the two single-channel weights \(\xi _\Vert (x)\) and \(\xi _\perp (x)\). Dropping the dependency on pixel x for clarity, we can then compute:

$$\begin{aligned} \widehat{R} = \xi _\perp \widetilde{R} + (1-\xi _\perp )I_\perp \qquad \text {and}\qquad \widehat{T} = \xi _\Vert \widetilde{T} + (1-\xi _\Vert )I_\Vert . \end{aligned}$$
(3)

While \(\xi _\perp \) and \(\xi _\Vert \) introduce two additional unknowns per pixel, they significantly simplify the prediction task in regions where the canonical projections are already good predictors of \(\widehat{R}\) and \(\widehat{T}\). We use an encoder-decoder with skip connections [23] that consists of three down-sampling stages, each with two ResNet blocks [24]. The corresponding decoder mirrors the encoding layers using a transposed convolution with two ResNet blocks. We use an \(\ell _2\) loss on \(\widehat{R}\) and \(\widehat{T}\). We also tested \(\ell _1\) and a combination of \(\ell _1\) and \(\ell _2\), which did not yield significant improvements.
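A minimal TensorFlow sketch of the recombination in Eq. 3 is given below. The split of the 8-channel output into two 3-channel residuals and two single-channel weights follows the text above, while the sigmoid used to keep \(\xi \) in \([0,1]\) is an assumption of this example, not necessarily the exact implementation:

```python
import tensorflow as tf

def residual_head(net_out, i_perp, i_par):
    """Combine the network's 8-channel output with the canonical projections (Eq. 3).

    net_out: [B, H, W, 8] tensor holding the residual images (3 + 3 channels)
             and the per-pixel weights (2 channels).
    i_perp, i_par: [B, H, W, 3] canonical projections I_perp and I_par.
    """
    r_res, t_res, logits = tf.split(net_out, [3, 3, 2], axis=-1)
    xi = tf.sigmoid(logits)                      # keep the weights in [0, 1] (assumption)
    xi_perp, xi_par = tf.split(xi, 2, axis=-1)
    r_hat = xi_perp * r_res + (1.0 - xi_perp) * i_perp
    t_hat = xi_par * t_res + (1.0 - xi_par) * i_par
    return r_hat, t_hat
```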

The use of the canonical projection layer, as well as the parametrization of residual images, is key for the success of our method. We show this in the Supplementary, where we compare the output of our network with the output of the exact same architecture trained to predict \(\widehat{R}\) and \(\widehat{T}\) directly from the three polarization images \(I_{\phi _i}(x)\).

3.3 Image-Based Data Generation

The ground truth data to estimate \(\widehat{R}\) and \(\widehat{T}\) is virtually impossible to capture in the wild. Recently, Wan et al. released a dataset for single-image reflection removal [25], but it does not offer polarization information. In principle, Eq. 1 could be used directly to generate, from any two images, the data we need. The term \(\alpha \) in the equation, however, hides several subtleties and nonidealities. For instance, previous polarization-based works use it to synthesize data by assuming a uniform AOI, perfectly flat surfaces, or comparable power for the reflected and transmitted irradiance, among other simplifications. This generally translates to poor results on images captured in the wild: Figs. 1 and 2 show common scenes that violate all of these assumptions.

Fig. 5.

Our image-based data generation procedure. We apply several steps to images \(I_R\) and \(I_T\) to simulate reflections as they appear in most real-world scenarios (Sect. 3.3).

We propose a more accurate synthetic data generation pipeline, see Fig. 5. Our pipeline starts from two randomly picked images from the PLACE2 dataset [26], \(I_R\) and \(I_T\), which we treat as the images of the reflected and transmitted scenes at the surface. From those, we model the behaviors observed in real-world data, which we describe as we “follow” the path of the photons from the scene to the camera.

Dynamic Range Manipulation at the Surface. To simulate realistic reflections, the dynamic range (DR) of the transmitted and reflected images at the surface must be significantly different. This is because real-world scenes are generally high-dynamic-range (HDR). Additionally, the light intensity at the surface drops with the distance from the emitting object, further expanding the combined DR. However, our inputs are low-dynamic-range images because a large dataset of HDR images is not available. We propose to artificially manipulate the DR of the inputs so as to match the appearance of the reflections we observe in real-world scenes.

Going back to Fig. 3 (right), we note that for regions where \(L_T \approx L_R\), a picture taken without a polarizer will capture a smoothly varying superposition of the images of \(P_R\) and \(P_T\) (Fig. 2). For areas of the surface where \(L_R \gg L_T\), however, the total radiance is \(L \approx L_R\), and the semi-reflector essentially acts as a mirror (Fig. 2). The opposite situation is also common (Fig. 2). To allow for these distinct behaviors, we manipulate the dynamic range of the input images with a random factor \(\beta \sim \mathcal {U}[1, K]\):

$$\begin{aligned} \tilde{I}_R = \beta I_R^{1/\gamma }\qquad \text {and} \qquad \tilde{I}_T = \frac{1}{\beta } I_T^{1/\gamma }, \end{aligned}$$
(4)

where \(1/\gamma \) linearizes the gamma-compressed inputs. We impose that \(K > 1\) to compensate for the fact that a typical glass surface transmits a much larger portion of the incident light than it reflects.

Images \(\tilde{I}_R\) and \(\tilde{I}_T\) can reproduce the types of reflections described above, but are limited to those cases for which \(L_R - L_T\) changes smoothly with \(P_S\). However, as shown in Fig. 2, the reflection can drop abruptly following the boundaries of an object. This may happen when an object is much closer than the rest of the scene, or when its radiance is larger than that of the surrounding objects. To properly model this behavior, we treat it as a type of reflection on its own, which we apply to a random subset of the image whose range we have already expanded. Specifically, we set to zero the regions of the reflection or transmission layer whose intensity is below \(T = \text {mean}(\tilde{I}_R+\tilde{I}_T)\), similarly to the method proposed by Fan et al. [4].
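A minimal NumPy sketch of this dynamic-range manipulation (Eq. 4) and of the thresholding step could look as follows; the parameter values and the choice of thresholding the reflection layer are illustrative assumptions:

```python
import numpy as np

def expand_dynamic_range(i_r, i_t, k=2.8, inv_gamma=2.2):
    """Dynamic-range manipulation of Eq. 4, followed by the abrupt-reflection step.

    i_r, i_t: gamma-compressed reflection / transmission images in [0, 1].
    inv_gamma: the 1/gamma exponent (~2.2) that linearizes the inputs.
    """
    beta = np.random.uniform(1.0, k)
    i_r_tilde = beta * i_r ** inv_gamma
    i_t_tilde = (1.0 / beta) * i_t ** inv_gamma

    # Abrupt reflections: zero out the parts of one layer that fall below the
    # mean combined intensity (which layer is thresholded is a choice of this example).
    thresh = np.mean(i_r_tilde + i_t_tilde)
    i_r_tilde = np.where(i_r_tilde < thresh, 0.0, i_r_tilde)
    return i_r_tilde, i_t_tilde
```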

Dealing with Dynamic Scenes. Our approach requires images captured under three different polarization angles. While cameras that can simultaneously capture multiple polarization images exist [27,28,29], they are not widespread. To date, the standard way to capture different polarization images is sequential; this causes complications for non-static scenes. As mentioned in Sect. 2, if multiple pictures are captured from different locations, the relative motion between the transmitted and reflected layers can help disambiguate them. Here, however, “non-static” refers to the scene itself, such as is the case when a tree branch moves between the shots. Several approaches were proposed that can deal with dynamic scenes in the context of stack-based photography [30]. Rather than requiring some pre-processing to fix artifacts due to small scene changes at inference time, however, we propose to synthesize training data that simulates them, such as local, non-rigid deformations, as sketched below. We first define a regular grid over a patch, and then we perturb each one of the grid’s anchors by \((dx, dy)\), both sampled from a Gaussian with variance \(\sigma _{\text {NR}}^2\), which is also drawn randomly for each patch. We then interpolate the position of the rest of the pixels in the patch. For each input patch, we generate three different images, one per polarization angle. We only apply this processing to a subset of the synthesized images—the scene is not always dynamic. Figure 6(a) and (b) show an example of an original and a distorted patch, respectively.
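A possible implementation of this deformation, sketched below with SciPy, perturbs a coarse grid of anchors and upsamples the offsets to a dense warp field; the grid size and the range of \(\sigma _{\text {NR}}\) are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def non_rigid_deform(img, anchors=8, sigma_max=2.0):
    """Warp an H x W x C patch with a smooth random displacement field.

    A coarse anchors x anchors grid of (dx, dy) offsets is drawn from a Gaussian
    whose standard deviation is itself sampled per patch, bilinearly upsampled to
    full resolution, and used to resample the image.
    """
    h, w = img.shape[:2]
    sigma = np.random.uniform(0.0, sigma_max)
    dx = zoom(np.random.normal(0.0, sigma, (anchors, anchors)),
              (h / anchors, w / anchors), order=1)
    dy = zoom(np.random.normal(0.0, sigma, (anchors, anchors)),
              (h / anchors, w / anchors), order=1)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    warped = [map_coordinates(img[..., c], coords, order=1, mode="reflect")
              for c in range(img.shape[-1])]
    return np.stack(warped, axis=-1)
```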

Geometry of the Semi-Reflective Surface. The images synthesized up to this point can be thought of as the irradiance of the unpolarized light at the semi-reflector. After bouncing off of, or going through, the surface, light becomes polarized as described in Sect. 3.1. The effect of a linear polarizer placed in front of the camera and oriented at a given polarization angle depends on the angle of incidence (AOI) of the specific light ray. Some previous works assume this angle to be uniform over the image, which is only true if the camera is at infinity, or if the surface is flat.

We observe that real-world surfaces are hardly ever perfectly flat. Many common glass surfaces are in fact designed to be curved, as is the case of car windows, see Fig. 1. Even when the surfaces are meant to be flat, the imperfections of the glass manufacturing process introduce local curvatures, see Fig. 2.

At training time, we could generate unconstrained surface curvatures to account for this observation. However, it would be difficult to sample realistic surfaces, and the computation of the AOI from an arbitrary surface curvature may be non-trivial. As a compromise, we propose to model the local surface with a parabola. When the patches are synthesized, we just sample four parameters: the camera position C, a point on the surface \(P_S\), a segment length \(\ell \), and the convexity as \(\pm 1\), Fig. 6(c). Since the segment is always mapped to the same output size, this parametrization allows us to generate a number of different, realistic curvatures. Additionally, because we use a parabola, we can quickly compute the AOI in closed form from the sampled parameters, see Supplementary.
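The sketch below illustrates one way to turn the sampled parameters into a per-column AOI for a 2D cross-section of the surface; the parabola coefficient and the sampling density are assumptions of this example, and the closed-form expression actually used is given in the Supplementary:

```python
import numpy as np

def parabola_aoi(cam, x_s, length, convexity, a=0.05, n_cols=128):
    """Angle of incidence along a parabolic cross-section y = convexity * a * (x - x_s)^2.

    cam: (x, y) camera position; x_s: x-coordinate of the sampled surface point P_S;
    length: extent of the imaged segment; convexity: +1 or -1.
    Returns the AOI (radians) for each of the n_cols columns of the patch.
    """
    xs = np.linspace(x_s - length / 2, x_s + length / 2, n_cols)
    ys = convexity * a * (xs - x_s) ** 2
    slope = 2.0 * convexity * a * (xs - x_s)
    normals = np.stack([-slope, np.ones_like(xs)], axis=-1)       # normal of y = f(x)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    views = np.stack([xs - cam[0], ys - cam[1]], axis=-1)          # camera-to-surface rays
    views /= np.linalg.norm(views, axis=-1, keepdims=True)
    cos_aoi = np.abs(np.sum(views * normals, axis=-1))             # |cos| handles normal orientation
    return np.arccos(np.clip(cos_aoi, 0.0, 1.0))
```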

Fig. 6.

Examples of our non-rigid motion deformation (a, b) and of our curved-surface generator, given the camera position C, a surface point \(P_S\), a length \(\ell \), and the convexity \(\pm 1\) (c). Randomly sampled training data (d) with synthesized observations \(I_{{\phi }_0}\), \(I_{{\phi }_1}\), \(I_{{\phi }_2}\) from the ground truth data T and R, and estimates \(\widehat{T}, \widehat{R}\).

3.4 Implementation Details

From the output of the pipeline described so far, the simulated AOI, and a random polarization angle \(\phi _0\), the polarization engine generates three observations with polarization angles separated by \(\nicefrac {\pi }{4}\), see Fig. 5. In practice, the polarizer angles \(\phi _i\) will be inaccurate for real data due to the manual adjustment of the polarizer rotation. We account for this by adding noise within \(\pm 4^\circ \) to each polarizer angle \(\phi _i\). Additionally, we set \(\beta \sim \mathcal {U}[1, 2.8]\). The input to our neural network lies in \(\mathbb {R}^{B\times 128\times 128\times 9}\) when trained on \(128\times 128\) patches, where \(B=32\) is the batch size. We train the model from scratch with a learning rate of \(5\cdot 10^{-3}\) using ADAM. See the Supplementary for more details about the architecture. The colors of the network predictions might be slightly desaturated [4, 31, 32]; we use a parameter-free color-histogram matching against one of the observations to obtain the final results.
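For illustration, a strongly simplified polarization engine can be written with the standard Fresnel equations for a single air–glass interface (refractive index \(n\approx 1.5\)); the exact form of \(\alpha \) used in our pipeline, as well as second-interface and multi-bounce effects, are not modeled here:

```python
import numpy as np

def fresnel_reflectance(theta, n=1.5):
    """Fresnel power reflectances (R_s, R_p) at a single air-glass interface."""
    sin_t = np.sin(theta) / n                      # Snell's law
    cos_t = np.sqrt(1.0 - sin_t ** 2)
    cos_i = np.cos(theta)
    r_s = ((cos_i - n * cos_t) / (cos_i + n * cos_t)) ** 2
    r_p = ((n * cos_i - cos_t) / (n * cos_i + cos_t)) ** 2
    return r_s, r_p

def polarized_observation(i_r, i_t, theta, phi_perp, phi, n=1.5):
    """One observation I_phi with the structure of Eq. 1.

    theta and phi_perp may be scalars or arrays broadcastable against the images.
    """
    r_s, r_p = fresnel_reflectance(theta, n)
    t_s, t_p = 1.0 - r_s, 1.0 - r_p
    c2 = np.cos(phi - phi_perp) ** 2
    s2 = 1.0 - c2
    return 0.5 * i_r * (r_s * c2 + r_p * s2) + 0.5 * i_t * (t_s * c2 + t_p * s2)

# Three polarizer angles pi/4 apart, each perturbed by up to +/- 4 degrees:
phi_0 = np.random.uniform(0.0, np.pi)
phis = phi_0 + np.arange(3) * np.pi / 4 + np.deg2rad(np.random.uniform(-4.0, 4.0, 3))
# observations = [polarized_observation(i_r, i_t, theta, phi_perp, p) for p in phis]
```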

4 Experiments

In this section we evaluate our method and data modeling pipeline on both synthetic and real data. For the latter, we introduce the Urban Reflections Dataset (URD), a new dataset of images containing semi-reflectors captured with polarization information. A fair evaluation can only be done against other polarization-based methods, which use multiple images. However, we also compare against single-image methods for completeness.

The Urban Reflections Dataset (URD). For practical relevance, we compile a dataset of 28 high-resolution RAW images (24MP) taken in urban environments using two different consumer cameras (Alpha 6000 and Canon EOS 7D, both with APS-C sensors), which we make publicly available. The Supplementary shows all the pictures in the dataset. This dataset includes examples taken with a wide aperture while focusing on the plane of the semi-reflector, thus meeting the assumptions of Fan et al. [4].

4.1 Numerical Performance Evaluation

Due to the need for ground truth, a large-scale numerical evaluation can only be performed on synthetic data. For this task we use two datasets: VOC2012 [33] and PLACE2 [26]. A comparison with state-of-the-art methods shows that our method outperforms the second best method by a significant margin in terms of PSNR: \(\sim 2\) dB, see Table 1. For a numerical evaluation on real data, we set up a scene with a glass surface and objects causing reflections. After capturing polarization images of the scene, we removed the glass and captured the ground truth transmission, \(T_{\text {gt}}\). Figure 7 shows the transmission images estimated by different methods. Our method achieves the highest PSNR and the fewest artifacts.

Table 1. Cross-validation on synthetic data. Best results in bold.
Fig. 7.

By removing the semi-reflector, we can capture the ground truth transmission, \(T_{\text {gt}}\), optically.

4.2 Effect of Data Modeling

We also thoroughly validate our data-generation pipeline. Using both synthetic and real data, we show that the proposed non-rigid deformation (NRD) procedure and the local curvature generation (LCG) are effective and necessary. To do this, we train our network until convergence on three types of data: data generated only with the proposed dynamic range manipulation, DR for short, data generated with DR \(+\) NRD, and data generated with DR \(+\) NRD \(+\) LCG.

We evaluate these three models on a hold-out synthetic validation set that features all the transformations from Fig. 5. The table in Fig. 8 shows that the PSNR drops significantly when only part of our pipeline is used to train the network. Such a numerical evaluation, however, is only possible when the ground truth is available. Figure 8 therefore also shows the output of the three models on the real image from Fig. 1. The benefits of using the full pipeline are apparent.

A visual inspection of Fig. 1 shows that, thanks to our ability to deal with curved surfaces and dynamic scenes, we achieve better performance than the state-of-the-art methods.

Fig. 8.

Our reflection estimation (left) on a real-world curved surface and synthetic data (right table) using the same network architecture trained on different components of our data pipeline. Only when using the full pipeline (DR+NRD+LCG) is the reflection layer estimated correctly. Note how faint the reflection is in the inputs (bottom row).

4.3 Evaluation on Real-World Examples

We extensively evaluate our method against previous work on the proposed URD. For fairness towards competing methods, which make stronger assumptions or expect different input data, we slightly adapt them, or run them multiple times with different parameters and retain only the best result. Due to space constraints, Fig. 10 only shows seven of the results. We refer the reader to the Supplementary for the rest of the results and for a detailed explanation of how we adapted previous methods. One important remark is in order. Although the images we use include opaque objects, i.e. the semi-reflector does not cover the whole picture, the methods against which we compare are local: applying the different algorithms to the whole image and cropping a region is equivalent to applying the same algorithms to the cropped region directly, see Fig. 9.

Fig. 9.

Applying the different algorithms to the whole image and cropping a region (‘full’) is equivalent to applying the same algorithms to the cropped region directly (‘crop’).

Fig. 10.

Results on typical real-world scenes. Top pane: comparison with state-of-the-art methods; bottom pane: additional results. More results are given in the Supplementary.

Figure 10, Curved Window shows a challenging case in which the AOI is significantly different from \(\theta _B\) across the whole image, thus limiting the effect of the polarizer in all of the inputs. Moreover, the glass surface is slanted and locally curved, which breaks several of the assumptions of previous works. As a result, other methods completely fail at estimating the reflection layer, the transmission layer, or both. On the contrary, our method separates \(\widehat{T}\) and \(\widehat{R}\) correctly, with only a slight halo of the reflection in \(\widehat{T}\). In particular, notice the contrast of the white painting with the stars, as compared with other methods. While challenging, this scene is far from uncommon.

Figure 10, Bar shows another result on which our method performs significantly better than most related works. On this example, the method by Schechner et al. [2] produces results comparable to ours. However, recall that, to be fair towards their method, we exhaustively search the parameter space and hand-pick the best result. Another thing to note is that our method may introduce artifacts in regions for which there is little or no information about the reflected or transmitted layer in any of the inputs, as is the case in the region marked with the red square on our \(\widehat{T}\).

We also show an additional comparison showing the superiority of our method (Fig. 10, Paintings) and a few more challenging cases. We note that in a few examples, our method may fail at removing part of the “transmitted” objects from \(\widehat{R}\), as is the case in Fig. 10, Chairs.

User Study. Since we do not have the ground truth for real data, we evaluate our method against previous results by means of a thorough user study. We asked 43 individuals who were not involved with the project to rank our results against the state-of-the-art [2,3,4, 7, 17]. In our study, we evaluate \(\widehat{R}\) and \(\widehat{T}\) as two separate tasks, because different methods may perform better on one or the other. For each task, the subjects were shown the three input polarization images and the results of each method on the same screen, in randomized order. They were asked to rank the results from 1 to 6, which took, on average, 35 min per subject. We measure the recall rate in ranking, R@k, i.e. the fraction of times a method ranks among the top-k results. Table 2 reports the recall rates. Two conclusions emerge from analyzing the table. First, and perhaps expected, polarization-based methods outperform the other methods. Second, our method ranks higher than related works by a significant margin.
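For clarity, R@k is simply the fraction of rankings in which a method appears within the top k; a minimal sketch:

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of rankings in which a method appears within the top-k.

    ranks: 1-based ranks a method received across all subjects and scenes.
    """
    return float(np.mean(np.asarray(ranks) <= k))

# e.g. recall_at_k([1, 3, 2, 5, 1], k=2) returns 0.6
```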

Table 2. Result from the user study. We report the average recall-rate for each method.

5 Conclusion

Separating the reflection and transmission layers from images captured in the wild is still an open problem, as state-of-the-art methods fail on many real-world images. Rather than learning to estimate the reflection and the transmission directly from the observations, we propose a deep learning solution that leverages the properties of polarized light: it uses a Canonical Projection Layer, and it learns the residuals of the reflection and transmission relative to the canonical images. Another key ingredient to the success of our method is an image-synthesis pipeline that accurately reproduces the typical nonidealities observed in everyday pictures. We also note that the non-rigid deformation procedure that we propose can be used for other stack-based methods where non-static scenes may be an issue. To evaluate our method, we also propose the Urban Reflections Dataset, which we will make available upon publication. Using this dataset, we extensively compare our method against a number of related works, both visually and by means of a user study, which confirms that our approach is superior to the state-of-the-art methods. Finally, since the code for most of the existing methods that separate reflection and transmission is not available, we re-implemented representative, state-of-the-art works to perform an accurate comparison, and we make our implementations of those algorithms available to the community to enable further comparisons.