research-article

Open access

Asymmetric Dual-Decoder U-Net for Joint Rain and Haze Removal

Authors:

Shengyong ChenAuthors Info & Claims

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 3

Article No.: 87, Pages 1 - 23

https://doi.org/10.1145/3628451

Published: 09 December 2023 Publication History

PDF eReader

Abstract

This work studies the multi-weather restoration problem. In real-life scenarios, rain and haze, two often co-occurring common weather phenomena, can greatly degrade the clarity and quality of the scene images, leading to a performance drop in the visual applications, such as autonomous driving. However, jointly removing the rain and haze in scene images is ill-posed and challenging, where the existence of haze and rain and the change of atmosphere light, can both degrade the scene information. Current methods focus on the contamination removal part, thus ignoring the restoration of the scene information affected by the change of atmospheric light. We propose a novel deep neural network, named Asymmetric Dual-decoder U-Net (ADU-Net), to address the aforementioned challenge. The ADU-Net produces both the contamination residual and the scene residual to efficiently remove the contamination while preserving the fidelity of the scene information. Extensive experiments show our work outperforms the existing state-of-the-art methods by a considerable margin in both synthetic data and real-world data benchmarks, including RainCityscapes, BID Rain, and SPA-Data. For instance, we improve the state-of-the-art PSNR value by 2.26/4.57 on the RainCityscapes/SPA-Data, respectively. Codes will be made available freely to the research community.

1 Introduction

When photographing in bad weather, the quality of outdoor scene images can be greatly degraded by the contamination, i.e., rain, haze, snow, and so forth, distributed in the air. Such contamination absorbs or disperses the scene light, thereby reducing the contrast and color fidelity of the scene image. Hence, the existence of contamination significantly affects many real-world vision systems, such as scene recognition, object tracking, semantic segmentation, and so forth, and all of these vision systems are essential for autonomous driving [7, 13, 60]. In other words, such outdoor vision systems, which work efficiently in ideal weather conditions, will suffer a plummet due to complex real-world weather conditions. Therefore, it is essential to develop algorithms to restore images contaminated by different contaminants as a pre-processor for such outdoor vision systems.

In this work, we focus on a real yet less-investigated scenario, the co-occurrence of the rain and haze in the scenes. Both image rain removal and haze removal are challenging low-level computer vision tasks. Many efforts have been made to solve the individual rain removal and haze removal tasks [48, 52, 56]. However, only a few works consider removing the rain and haze jointly in scene images [18, 21, 47]. In the real-world scenario, it is a very common situation that rain and haze co-occur in a rainfall environment (see Figure 1(a)) [17]. Along with rain streaks and raindrops, the uneven haze will also obscure the image, interfering with the perception of the environment. Such a scenario brings challenges to the outdoor vision systems that are required to jointly remove the rain and haze in images.

Fig. 1.

The existing methods for single-image rain and haze removal can be roughly categorized into two categories: priority knowledge-oriented approaches and data-driven approaches. The prior knowledge-based image rain removal [24, 31, 36] and haze removal methods [15, 19, 63] are mostly based on the physical imaging models. However, such solutions suffer from the robustness issue when deployed into real-world scenarios [32, 62]. Recent advances in deep learning demonstrate dramatic success in haze removal [9, 27, 43] and rain removal [40, 49, 59]. Learning-based methods in both fields have achieved cutting-edge performance on synthetic datasets. However, methods designed for certain contamination cannot handle the complex real-world scenario with the co-occurrence of rain and haze in the natural scenes. Recent studies also pointed out the necessity of joint-removal, such as Han et al. [18] decompose rain and haze by a Blind Image Decomposition Network, and Kim et al. [25] remove rain and haze by a frequency-based model. A new dataset for the purpose of benchmarking joint rain and haze removal, named RainCityScapes, is also proposed to facilitate research on this important task [21]. Thus, such a joint-removal task becomes an open problem in the community and calls for further study.

Recent advances in low-level computer vision have made remarkable progress, where a well-trained deep neural network can almost perfectly remove the contamination in outdoor scene images. However, no existing work considers paying attention to the scene difference in the restoration process. We observe that the true residual, obtained by \((\mathrm{Input} - \mathrm{Ground}~\mathrm{Truth})\) (see Figure 1(b)), contains the scene information. That is, a neural network designed to focus on contamination may suffer from a gap in recovering the scenes. Such a gap motivates us to develop a unified method to remove the contamination and compensate for the scene information in one go.

In real-world scenarios, the weather condition is complex, that is, different components, such as rain streaks and haze, may co-occur in the scenes. The occurrence of some components, i.e., heavy haze, impacts the atmospheric light. As a consequence, the scene information at the photometric level can be degraded. Physically speaking, along with removing contamination in the image, it is also necessary to restore scene information affected by the change of atmospheric light. To address this issue, we propose a novel dual-branch architecture, called Asymmetric Dual-decoder U-Net (ADU-Net). The ADU-Net consists of a single-branch encoder and an asymmetric dual-branch decoder. In the asymmetric dual-branch architecture, one branch, the contamination residual branch, is designed to remove the contamination (see Figure 1(c)). Another branch, the scene residual branch, is required to perform the recovery of scene information (see Figure 1(d)). The contamination residual branch, equipped with a novel channel feature fusion (CFF) module and window multi-head self-attention (W-MSA), produces the contamination residual. The special design allows the branch to focus more on the local foreground information in the image, thus extracting the contamination residual. The scene branch, powered by a novel global channel feature fusion (GCFF) module and shift-window multi-head self-attention (SW-MSA) mechanism, aims to compensate for the scene information. Unlike the contamination residual branch, the scene residual branch is designed to focus more on the global contextual information in the image, thus extracting the scene residual. The joint efforts of contamination residual and scene residual separate the rain and haze from the input scene image, while preserving the scene of the image (see Figure 1(e)). The proposed ADU-Net can effectively remove the different contamination in the images and compensate for the scene information on multiple benchmark datasets, including RainCityscapes [21], BID rain [18], and SPA-Data [49].

Our contribution can be summarized as follows:

—

We propose a novel yet efficient neural architecture, ADU-Net, to jointly remove rain and haze in scene images.

—

We present an asymmetric dual-decoder, which removes the contamination while compensating for the scene information of the image. To the best of our knowledge, this is the first work to consider the recovery of scene information in deraining and dehazing tasks.

—

Extensive experiments, including quantitative studies and qualitative studies, are conducted to evaluate the effectiveness of the ADU-Net. Empirical evaluation shows our method outperforms the current state-of-the-art methods by a considerable margin.

2 Related Work

2.1 Single-image Rain Removal

The very first single-image rain removal methods were based on a priori knowledge. Morphological component analysis (MCA) [24] employs bilateral filters to extract high-frequency components from rain images, where the high-frequency components are further decomposed into “rain components” and “non-rain components” through dictionary learning and sparse coding. Luo et al. [36] proposed a single-image rain removal algorithm based on mutual exclusion dictionary learning. Gaussian mixture model prior knowledge [31] was utilized to accommodate multiple orientations and scales of rain streaks. In [62], Zhu et al. detected the approximate region, where the rain streaks were located, to guide the separation of the rain layer from the background layer. However, early models based on prior knowledge often suffer from a lack of stability in real scenarios [24, 31, 36]. Since 2017, deep learning approaches have been developed for rain removal tasks. Deep detail networks [16] narrowed the mapping from input to output and combined prior knowledge to capture high-frequency details, making the model stay focused on rain streaks information. By adding an iterative information feedback network, JORDER [53] used a binary mapping to locate rain streaks. A non-locally enhanced encoder-decoder structure [28] was proposed to capture long-range dependencies and leverage the hierarchical features of the convolutional layer. In [30], Li et al. proposed a deep recurrent convolutional neural network to remove rain marks located at different depths progressively. A density-aware multi-stream connectivity network was introduced for rain removal in [58]. By adding constraints to the cGAN [23], Zhang et al. [59] generated more photo-realistic results. A progressive contextual aggregation network [40] was proposed as a baseline for rain removal. A real-world rain dataset was constructed by Wang et al. [49]; they also incorporated spatial perception mechanisms into deraining networks. Recently, Zhu et al. [61] proposed a gated non-local depth residual network for image rain removal. Yu et al. [55] conducted a comprehensive analysis of various aspects of existing rain removal models and their robustness against adversarial attacks. Based on these analyses, they proposed a more robust approach to address this issue.

While significant progress has been made in the research on image rain removal, the existing studies lack consideration for real-world rainy scenarios, limiting their effectiveness in practical applications. In contrast, our methods take a more realistic approach by not only addressing rain streak occlusions commonly encountered in rainy weather but also considering the impact of haze, which is prevalent in the atmosphere, on atmospheric light. By incorporating these factors, our methods offer a more comprehensive and practical solution that better aligns with real-world conditions.

2.2 Single-image Haze Removal

Similar to image rain removal methods, early work on image dehaze tended to employ statistical methods to acquire prior information by capturing patterns in haze-free images. Representative methods include Dark channel prior [19], color-line prior [15], color attenuation prior [63], and so forth. However, prior-based methods tend to distort colors and thus produce undesirable artifacts [15, 19, 63]. In the deep learning era, methods started to not rely on prior knowledge, but to estimate atmospheric light and the transmission map directly. For example, Cai et al. [5] proposed an end-to-end dehazing model named DehazeNet, where haze-free images are produced by learning the transmission rate. Similarly, Ren et al. [41] employed multi-scale deep neural networks to learn the mapping relationship between foggy images and their corresponding transmission maps, aiming to reduce the error in estimating the transmission maps. AODNet [27] reconstructed the atmospheric scattering model by leveraging an improved convolutional neural network to learn the mapping relationship between foggy and clean pairs. In [57], a single network was proposed to simultaneously learn the intrinsic relationship between transmission maps, atmospheric light, and clean images. Ren et al. [42] built an encoder-decoder neural network to enhance the dehazing process. A network with an enhancer and two generators was proposed by Qu et al. [39]. Chen et al. [9] proposed a patch map-based PMS-Net to effectively suppress the distorted color issue. Dong et al. [12] proposed MSBDN (Multi-Scale Boosted Dehazing Network) based on the U-Net architecture, incorporating boosting and error feedback as guiding principles. Although the method achieves good results, it suffers from a large number of parameters. Yeh et al. [54] decomposed hazy images into base components and detail components and proposed MSRL-DehazeNet, which is based on residual learning and U-Net architecture. Sun et al. [46] proposed SADNet based on the attention mechanism using a semi-supervised approach for solving practical problems. Song et al. [45] introduced Swin Transformer into image haze removal and proposed DehazeFormer, which achieved significant improvements on multiple datasets. Unlike image rain removal, image dehazing often considers the impact of haze on atmospheric light intensity, which can compensate for the limitations in rain removal methods. Our methods combine these approaches with the research on rain removal, resulting in a more realistic approach that better aligns with real-world scenarios.

2.3 Other Related Works

Unlike previous single-task models, some researchers have also explored the simultaneous enhancement of both rain removal and haze removal in images. Hu et al. [21] built an imaging model for rain streaks and haze based on the visual effect of rain and the scene depth map to synthesize a realistic dataset named RainCityscapes. Han et al. [18] constructed a superimposed image dataset and proposed a simple yet general Blind Image Decomposition Network to decompose rain streaks, raindrops, and haze in a blind image decomposition setting. Kim et al. [25] proposed a frequency-based model for removing rain and haze, where the frequency-based model divided the input image into high-frequency and low-frequency parts with a guided filter and then employed a symmetric encoder-decoder network to remove rain and haze separately. Kulkarni and Murala [26] proposed a lightweight network that combines convolutions at different scales with spatial attention and channel attention mechanisms, employing a dual restoration mechanism to handle images affected by various weather conditions. Recently, Li et al. [29] used a neural structure search based approach to handle multiple weather situations; however, it has a large number of parameters as it uses multiple encoders for each weather removal task. Chen et al. [10] proposed a training approach based on knowledge distillation, considering the perspective of training strategies. They introduced a multi-teacher model and a single-student model, enabling a single model to handle various weather conditions without increasing the parameter size. Valanarasu et al. [47] proposed a single transformer-based encoder-decoder network while restored image with a learnable weather type query in the decoder to learn the type of weather degradation. Wang et al. [50] enhanced the U-Net architecture by adding a small decoder and a dilated convolution attention module. This enhancement enabled the network to capture both global information and finer details in high-resolution remote sensing images. After an in-depth study of related works, we have identified two primary factors that contribute to image quality degradation: contamination and scene information affected by atmospheric light. To effectively address these factors, we introduce an innovative asymmetric dual-branch structure, allowing independent processing of each category. By separately optimizing contamination removal and scene information recovery, our method achieves enhanced overall performance and improved image quality.

3 Method

This section details the proposed method in a top-down fashion: starting from the problem formulation of our application, followed by the architecture of the proposed ADU-Net and its building block, namely, asymmetric dual-decoder block (ADB).

Notations. Throughout the article, we use bold capital letters to denote matrices or tensors (e.g., \({\boldsymbol {X}}\)), and bold lowercase letters to denote vectors (e.g., \({\boldsymbol {x}}\)).

3.1 Problem Formulation

Let a third-order tensor, \({{\boldsymbol {I}}} \in \mathbb {R}^{C \times H \times W}\), denote an input image, where C, H, and W present the channel, height, and width of the image, respectively. In our application, both rain and haze are synthesized into the origin scene images as input images. Each input image \({\boldsymbol {I}}\) is labeled with its ground truth image \({\boldsymbol {I}}^{\mathrm{gt}}\) without rain and haze in the scene. Our ADU-Net \(f_{\theta }\), consisting of a single branch encoder \(f_{\mathrm{E}}\), and an asymmetric dual-decoder \(f_{\mathrm{AD}}\), can remove the rain and haze in the input image, such that the output of the ADU-Net, \({\boldsymbol {Y}} = f_{\theta }({\boldsymbol {I}})\), can restore its ground truth scene \({\boldsymbol {I}}^{\mathrm{gt}}\). The ADU-Net is trained to learn a set of parameters, \(\theta ^*\), with minimum empirical objective value \(\mathcal {L}({\boldsymbol {I}}^{\mathrm{gt}}, {\boldsymbol {Y}})\).

3.2 Network Overview

We first give a sketch of the proposed ADU-Net. In rain and haze removal applications, one ideal option is to employ the deep neural network to understand the scene of the input image and separate the rain and haze from the input image. In our work, we develop the ADU-Net to remove the rain and haze jointly. As shown in Figure 2, the ADU-Net is stacked by a single branch encoder and an asymmetric dual-decoder. In the encoder \(f_{\mathrm{E}}\), we have five convolutional blocks, with each denoted by \(\mathrm{Conv}_{i},~0 \le i \le 4\). The output of each convolutional block is denoted by \({\boldsymbol {F}}_i = \mathrm{Conv}_i({\boldsymbol {F}}_{i-1})\) and \({\boldsymbol {F}}_{-1} = {\boldsymbol {I}}\).

Fig. 2.

Then a following asymmetric dual-decoder \(f_{\mathrm{AD}}\) aims to recover the scene image without rain and haze (see Figure 2). The proposed asymmetric dual-decoder is stacked of a set of ADBs, which produce two streams of latent features, denoted by \({\boldsymbol {Z}}^{\mathrm{c}}_j\) and \({\boldsymbol {Z}}^{\mathrm{s}}_j\) in the j-th ADB. Specifically, the processing can be formulated as

\begin{equation} {\boldsymbol {Z}}^{\mathrm{c}}_0, {\boldsymbol {Z}}^{\mathrm{s}}_0 = \mathrm{ADB}_0({\boldsymbol {F}}_3, {\boldsymbol {F}}_4), \end{equation}

(1)

\begin{equation} {\boldsymbol {Z}}^{\mathrm{c}}_j, {\boldsymbol {Z}}^{\mathrm{s}}_j = \mathrm{ADB}_j({\boldsymbol {Z}}^{\mathrm{c}}_{j-1}, {\boldsymbol {Z}}^{\mathrm{s}}_{j-1}, {\boldsymbol {F}}_{3-j}),~j\gt 0. \end{equation}

(2)

After the last ADB, each stream of latent features \({\boldsymbol {Z}}^{\mathrm{c}}_3\) or \({\boldsymbol {Z}}^{\mathrm{s}}_3\) is encoded by a convolutional block to recover the channel dimensions into the image space (e.g., \(C=3\)), as \({\boldsymbol {Y}}^{\mathrm{c}} = \mathrm{Conv}_5({\boldsymbol {Z}}^{\mathrm{c}}_3)\) and \({\boldsymbol {Y}}^{\mathrm{s}} = \mathrm{Conv}_5({\boldsymbol {Z}}^{\mathrm{s}}_3)\). We denote the \({\boldsymbol {Y}}^{\mathrm{c}}\) as the contamination residual, and \({\boldsymbol {Y}}^{\mathrm{s}}\) as the scene residual. Having the \({\boldsymbol {Y}}^{\mathrm{c}}\) and \({\boldsymbol {Y}}^{\mathrm{s}}\) at hand, one can obtain the restored scene image \({\boldsymbol {Y}}\) as

\begin{equation} {\boldsymbol {Y}} = {\boldsymbol {I}} - {\boldsymbol {Y}}^{\mathrm{c}} - {\boldsymbol {Y}}^{\mathrm{s}}. \end{equation}

(3)

The network is optimized by the negative SSIM loss [51] as \(\mathcal {L}_{\mathrm{SSIM}} = -\mathrm{SSIM} ({\boldsymbol {I}}^{\mathrm{gt}}, {\boldsymbol {Y}})\). Note that the common practice uses both the negative SSIM loss and MSE loss as the objective. Empirically, we observed that a negative SSIM loss works better in the proposed ADU-Net, which will be justified in Section 4.4.

3.3 Asymmetric Dual-decoder Block

In this part, we will describe the asymmetric dual-decoder \(f_{\mathrm{AD}}\) in ADU-Net. As shown in Figure 2, \(f_{\mathrm{AD}}\) consists of four ADBs and a convolutional block, while the ADBs are stacked by two different instantiations (e.g., \(\mathrm{ADB}_0\) vs. \(\mathrm{ADB}_j,~j = 1,2,3\)). In this following, we will first describe \(\mathrm{ADB}_0\), a simple form of the block. Then with minor modifications, we can realize the \(\mathrm{ADB}_j, \ j = 1,2,3\) on top of the \(\mathrm{ADB}_0\).

The \(\mathrm{ADB}_0\) is a two-branch architecture (see Figure 2), which receives the \({\boldsymbol {F}}_3\) and \({\boldsymbol {F}}_4\) as input, and produces two latent features \({\boldsymbol {Z}}^{\mathrm{c}}_0\) and \({\boldsymbol {Z}}^{\mathrm{s}}_0\). In \(\mathrm{ADB}_0\), the two latent features are respectively encoded by two branches of network, namely, contamination residual net (denoted by \(g^{\mathrm{c}}\)), and scene residual net (denoted by \(g^{\mathrm{s}}\)), given by

\begin{equation} {\boldsymbol {Z}}^{\mathrm{c}}_0 = g^{\mathrm{c}}({\boldsymbol {F}}_3, {\boldsymbol {F}}_4) \end{equation}

(4)

and

\begin{equation} {\boldsymbol {Z}}^{\mathrm{s}}_0 = g^{\mathrm{s}}({\boldsymbol {F}}_3, {\boldsymbol {F}}_4). \end{equation}

(5)

Contamination Residual Net. In the contamination residual net (\(g^{\mathrm{c}}\)), \({\boldsymbol {F}}_3\) and \({\boldsymbol {F}}_4\) are fed to a cCFF module to localize the rain and haze areas in the scene image, as

\begin{equation} {\boldsymbol {G}}^{\mathrm{c}}_0 = \mathrm{CFF}({\boldsymbol {F}}_3, {\boldsymbol {F}}_4). \end{equation}

(6)

The details of CFF are illustrated in Figure 3(a). Given two feature maps \({\boldsymbol {F}}_3\) and \({\boldsymbol {F}}_4\) as input, it first fuses the two inputs by using element-wise addition and then feeds the fused feature maps to two-layer convolutional blocks to obtain the attention weights, formulated by

\begin{equation} {\boldsymbol {W}}_0^{\mathrm{c}} = \sigma (\mathrm{BN}(\mathrm{Conv}(\mathrm{ReLU(\mathrm{BN}(\mathrm{Conv}({\boldsymbol {F}}_3 \oplus {\boldsymbol {F}}_4)))}))), \end{equation}

(7)

where \(\sigma\), \(\mathrm{BN}\), and \(\mathrm{ReLU}\) are sigmoid function, batch normalization, and rectified linear unit activation, respectively. Here, the kernel size of \(\mathrm{Conv}\) is \(1\times 1\), which can be understood as applying a fully connected layer to the channel features.

Fig. 3.

Then, we can apply the attention weights to the input feature maps and obtain the fused output, as

\begin{equation} {\boldsymbol {G}}^{\mathrm{c}}_0 = \left({\boldsymbol {W}}_0^{\mathrm{c}} \otimes {\boldsymbol {F}}_3 \right) \oplus \left(\left({\boldsymbol {I}} - {\boldsymbol {W}}_0^{\mathrm{c}}\right) \otimes {\boldsymbol {F}}_4 \right)\!. \end{equation}

(8)

The CFF module fuses the input feature maps, and the fusion weights are produced via the channel patterns. We further employ a self-attention mechanism to build the spatially long-range dependencies of the fused feature maps \({\boldsymbol {G}}^{\mathrm{s}}_0\), given by

\begin{equation} {\boldsymbol {H}}^{\mathrm{c}}_0 = \mathrm{W\text{-}MSA}({\boldsymbol {G}}^{\mathrm{c}}_0), \end{equation}

(9)

where W-MSA is the window multi-head self-attention from the Swin Transformer [35].

By fusing the input feature maps and processing them by the attention mechanism, we can obtain the contamination residual feature maps as

\begin{equation} {\boldsymbol {Z}}^{\mathrm{c}}_0 = \mathrm{Conv}({\boldsymbol {H}}^{\mathrm{c}}_0). \end{equation}

(10)

The contamination residual net (\(g^{\mathrm{c}}\)) aims to attend to the rainy and hazy regions, thereby highlighting the rain and haze components in the contamination residual feature maps.

Scene Residual Net. Since we can observe from the contamination residual (\({\boldsymbol {Y}}^{\mathrm{c}}\)) that it contains the scene information along with the rain and haze, we develop a scene residual net (\(g^{\mathrm{s}}\)), that can compensate for the removed scene information in the image. In doing so, the GCFF module is proposed to capture valuable global scene information of the image, and fuse features, as

\begin{equation} {\boldsymbol {G}}^{\mathrm{s}}_0 = \mathrm{GCFF}({\boldsymbol {F}}_3, {\boldsymbol {F}}_4). \end{equation}

(11)

As shown in Figure 3(b), \({\boldsymbol {F}}_3\) and \({\boldsymbol {F}}_4\) are first fused, and summarized to its global feature, as

\begin{equation} {\boldsymbol {m}}^{\mathrm{s}}_0 = \mathrm{GAP}({\boldsymbol {F}}_3 \oplus {\boldsymbol {F}}_4), \end{equation}

(12)

where \(\mathrm{GAP}\) indicates the global average pooling and \({\boldsymbol {m}}^{\mathrm{s}}_0\) indicates the resultant vector. Then, a two-layer convolutional block is used to modulate per element of the global feature \({\boldsymbol {m}}^{\mathrm{s}}_0\), written by

\begin{equation} {\boldsymbol {w}}^{\mathrm{s}}_0 = \sigma \left(\mathrm{BN}\left(\mathrm{Conv}\left(\mathrm{ReLU\left(\mathrm{BN}\left(\mathrm{Conv}\left({\boldsymbol {m}}^{\mathrm{s}}_0\right)\right)\right)}\right)\right)\right)\!. \end{equation}

(13)

We can thereby fuse the input feature maps as

\begin{equation} {\boldsymbol {G}}^{\mathrm{s}}_0 = \left({\boldsymbol {w}}^{\mathrm{s}}_0 \otimes {\boldsymbol {F}}_3 \right) \oplus \left(\left({\boldsymbol {i}} - {\boldsymbol {w}}_0^{\mathrm{s}}\right) \otimes {\boldsymbol {F}}_4 \right)\!, \end{equation}

(14)

where \({\boldsymbol {i}}\) indicates a unit vector with the same size as \({\boldsymbol {w}}^{\mathrm{s}}_0\).

After GCFF, we employ the SW-MSA to enhance the spatial interaction of the feature maps and obtain the scene residual features, described by

\begin{equation} {\boldsymbol {H}}^{\mathrm{s}}_0 = \mathrm{SW\text{-}MSA}({\boldsymbol {G}}^{\mathrm{s}}_0) \end{equation}

(15)

and

\begin{equation} {\boldsymbol {Z}}^{\mathrm{s}}_0 = \mathrm{Conv}({\boldsymbol {H}}^{\mathrm{s}}_0). \end{equation}

(16)

Instantiation of \({\boldsymbol {ADB}}_j\). The difference between \(\mathrm{ADB}_{j}, \ j \ne 0\) and \(\mathrm{ADB}_0\) is that \(\mathrm{ADB}_0\) receives two feature maps as input, while \(\mathrm{ADB}_j, \ j\ne 0\) includes three feature maps as input. To adapt the architecture of \(\mathrm{ADB}_0\) to \(\mathrm{ADB}_j, \ j \ne 0\), we make minor modifications (see Figure 2). Specifically, for any block, \(\mathrm{ADB}_j\), its input includes the output from the \(\left(j-1\right)\)-th ADB blcok, e.g., \({\boldsymbol {Z}}^{\mathrm{c}}_{j-1}, {\boldsymbol {Z}}^{\mathrm{s}}_{j-1} \in \mathbb {R}^{d \times h\times w}\), and from the \(\left(3-j\right)\)-th convolutional encoder, e.g., \({\boldsymbol {F}}_{3-j}\). We first concatenate the \({\boldsymbol {Z}}^{\mathrm{c}}_{j-1}, {\boldsymbol {Z}}^{\mathrm{s}}_{j-1}\), and reduce its dimension from \(2d \times h \times w\) to \(d \times h \times w\), as

\begin{equation} \bar{{\boldsymbol {Z}}}_{j-1} = \mathrm{Concat}({\boldsymbol {Z}}^{\mathrm{c}}_{j-1}, {\boldsymbol {Z}}^{\mathrm{s}}_{j-1}) \end{equation}

(17)

and

\begin{equation} \tilde{{\boldsymbol {Z}}}^{\mathrm{c}}_{j-1} = \mathrm{Conv_{in}}(\bar{{\boldsymbol {Z}}}_{j-1}),~\tilde{{\boldsymbol {Z}}}^{\mathrm{s}}_{j-1} = \mathrm{Conv_{in}}(\bar{{\boldsymbol {Z}}}_{j-1}),{} \end{equation}

(18)

where \(\mathrm{Conv_{in}}\) indicates a two-layer convolution block with batch normalization and rectified linear unit activation. Here, the kernel size of convolution layers is \(3 \times 3\).

With \({\boldsymbol {F}}_{3-j}\), the output of \(\mathrm{ADB}_j\) can be obtained as

\begin{equation} \begin{split}{\boldsymbol {Z}}^{\mathrm{c}}_j &= g^{\mathrm{c}}\left(\tilde{{\boldsymbol {Z}}}^{\mathrm{c}}_{j-1} ,{\boldsymbol {F}}_{3-j}\right) \\ &= \mathrm{Conv_{out}}\left(\mathrm{W\text{-}MSA}\left(\mathrm{CFF}\left(\tilde{{\boldsymbol {Z}}}^{\mathrm{c}}_{j-1} ,{\boldsymbol {F}}_{3-j}\right)\right)\right) \end{split} \end{equation}

(19)

and

\begin{equation} \begin{split}{\boldsymbol {Z}}^{\mathrm{s}}_j &= g^{\mathrm{s}}(\tilde{{\boldsymbol {Z}}}^{\mathrm{s}}_{j-1} ,{\boldsymbol {F}}_{3-j}) \\ &= \mathrm{Conv_{out}}\left(\mathrm{SW\text{-}MSA}\left(\mathrm{GCFF}\left(\tilde{{\boldsymbol {Z}}}^{\mathrm{s}}_{j-1} ,{\boldsymbol {F}}_{3-j}\right)\right)\right)\!. \end{split} \end{equation}

(20)

Here, \(\mathrm{Conv_{out}}\) indicates a convolution layer with the kernel size of \(3 \times 3\) followed by a leaky rectified linear unit and another convolution layer with the kernel size of \(1 \times 1\) also followed by a leaky rectified linear unit.

In this work, we propose a novel architecture for the rain and haze removal task. Considering the network capacity and hardware overhead, we propose two sizes of networks. One is the lite network, called ADU-Net, and the other is the large network, called ADU-Net-plus. In Section 4, we present the details of two architectures. The network performance is also evaluated in Section 4.

Remark 1.

The residual U-Net architecture has been used extensively for the rain or haze removal tasks [6], as shown in Figure 4(a). Having the observation that the contamination residual, produced by the decoder, contains the scene information, we aim to develop a dual-decoder U-Net, with one decoder producing the contamination residual, and another one producing the scene residual as a scene compensator. Its initial design is shown in Figure 4(b). Considering the physical property of the contamination and scene information in the input image, we propose a novel network architecture, ADU-Net, where we integrate two decoders with non-identical architectures (see Figure 4(c)). We justify our design in Section 4.4.

Fig. 4.

4 Experiments

In this section, we first give the implementation details of the proposed ADU-Net and ADU-Net-plus. Then, the benchmark datasets and evaluation protocol are also introduced. We further compare our network to the state-of-the-art methods and conduct ablation studies to evaluate the superiority of the proposed network and each component. In the final part, we demonstrate substantial qualitative results to analyze the superior performance of our network.

4.1 Implementation Details

Network Architecture. The overall neural architecture of the proposed network is shown in Figure 2. Table 1 lists the kernel size of the convolutional layers. In the encoder block, the feature maps are processed by the Batch Normalization [22] and ReLU [1] after the convolutional layer, i.e., \(\mathrm{Conv_0}\), \(\mathrm{Conv_1}\), \(\mathrm{Conv_2}\), \(\mathrm{Conv_3}\), and \(\mathrm{Conv_4}\). Then the max-pooling layer is employed to down-sample the feature maps in each layer. In the decoder block, we also list the kernel size in the convolutional layers (see Table 1), and employ the Leaky ReLU as the activation function. Having computational efficiency in mind, we develop two neural networks of different scales. The light one is denoted as ADU-Net, while the large one is denoted as ADU-Net-plus. As shown in Table 1, the difference between the two networks is merely the modification to the channel dimensions. The superiority of our network will be evaluated in Section 4.3.

Table 1.

Layer name		Output size	ADU-Net	ADU-Net-plus
\(\mathrm{Conv_0}\)		\(H \times W\)	\(\left[\begin{array}{l}3 \times 3,32 \\ 3 \times 3,32\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,64 \\ 3 \times 3,64\end{array}\right]\)
\(\mathrm{Conv_1}\)		\(\frac{H}{2} \times \frac{W}{2}\)	\(\left[\begin{array}{l}3 \times 3,64 \\ 3 \times 3,64\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,128 \\ 3 \times 3,128\end{array}\right]\)
\(\mathrm{Conv_2}\)		\(\frac{H}{4} \times \frac{W}{4}\)	\(\left[\begin{array}{l}3 \times 3,128 \\ 3 \times 3,128\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,256 \\ 3 \times 3,256\end{array}\right]\)
\(\mathrm{Conv_3}\)		\(\frac{H}{8} \times \frac{W}{8}\)	\(\left[\begin{array}{l}3 \times 3,256 \\ 3 \times 3,256\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,512 \\ 3 \times 3,512\end{array}\right]\)
\(\mathrm{Conv_4}\)		\(\frac{H}{16} \times \frac{W}{16}\)	\(\left[\begin{array}{l}3 \times 3,256 \\ 3 \times 3,256\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,512 \\ 3 \times 3,512\end{array}\right]\)
\(\mathrm{ADB_0}\)		\(\frac{H}{8} \times \frac{W}{8}\)	\(\left[\begin{array}{l}3 \times 3,128 \\ 3 \times 3,128\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,256 \\ 3 \times 3,256\end{array}\right]\)
\(\mathrm{ADB_1}\)	\(\mathrm{Conv_{in}}\)	\(\frac{H}{4} \times \frac{W}{4}\)	\(\left[\begin{array}{l}3 \times 3,128 \\ 3 \times 3,128\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,256 \\ 3 \times 3,256\end{array}\right]\)
\(\mathrm{ADB_1}\)	\(\mathrm{Conv_{out}}\)	\(\frac{H}{4} \times \frac{W}{4}\)	\(\left[\begin{array}{l}3 \times 3,64 \\ 3 \times 3,64\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,128 \\ 3 \times 3,128\end{array}\right]\)
\(\mathrm{ADB_2}\)	\(\mathrm{Conv_{in}}\)	\(\frac{H}{2} \times \frac{W}{2}\)	\(\left[\begin{array}{l}3 \times 3,64 \\ 3 \times 3,64\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,128 \\ 3 \times 3,128\end{array}\right]\)
\(\mathrm{ADB_2}\)	\(\mathrm{Conv_{out}}\)	\(\frac{H}{2} \times \frac{W}{2}\)	\(\left[\begin{array}{l}3 \times 3,32 \\ 3 \times 3,32\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,64 \\ 3 \times 3,64\end{array}\right]\)
\(\mathrm{ADB_3}\)	\(\mathrm{Conv_{in}}\)	\(H \times W\)	\(\left[\begin{array}{l}3 \times 3,32 \\ 3 \times 3,32\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,64 \\ 3 \times 3,64\end{array}\right]\)
\(\mathrm{ADB_3}\)	\(\mathrm{Conv_{out}}\)	\(H \times W\)	\(\left[\begin{array}{l}3 \times 3,16 \\ 3 \times 3,16\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,32 \\ 3 \times 3,32\end{array}\right]\)
\(\mathrm{Conv_5}\)		\(H \times W\)	\(\left[\begin{array}{l}3 \times 3,3 \\ 3 \times 3,3\end{array}\right]\)	\(\left[\begin{array}{l}3 \times 3,3 \\ 3 \times 3,3\end{array}\right]\)
Parameter size			\(6.63\times 10^6\)	\(26.45 \times 10^6\)

Table 1. Details of the Kernel Size in Convolution Layers

\(H\) and \(W\) denote the height and width of the input image, respectively.

Network Training. We implement our method using the PyTorch deep learning package [37]. All experiments are evaluated on NVIDIA GTX 2080ti GPUs. In the experiments for RainCityscapes [21], BID Rain datasets [18], and NH-HAZE [2], the input images are resized to \(512 \times 256\). For the SPA-Data, we follow the practice in [49], which uses original images with size of \(256 \times 256\). The Adam optimization scheme with an initial learning rate of 0.001 is used to optimize the network. We train the network for 100 epochs for RainCityscapes and BID Rain datasets, and 20 epochs for SPA-Data. The learning rate adjustment strategy is employed to realize the learning rate decay, where the learning rate is decayed by a factor of 0.1 when the accuracy of the network does not improve in five epochs.

4.2 Datasets and Evaluation Protocol

We evaluate the proposed methods on two synthetic datasets, i.e., RainCityscapes [21], BID Rain [18], and two real-world datasets, i.e., SPA-Data [49], and NH-HAZE [2]. In the following, we will introduce these datasets and the statistics of each dataset are illustrated in Table 2.

Table 2.

Dataset	Train set	Test set	Property		Contamination
Dataset	Train set	Test set	Synthetic	Real world	Rain streaks	Haze	Snow	Raindrops
RainCityscapes	9,432	1,188	\(\checkmark\)		\(\checkmark\)	\(\checkmark\)
BID Rain	2,975	500*6	\(\checkmark\)		\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)
SPA-Data	638,492	1,000		\(\checkmark\)	\(\checkmark\)
NH-HAZE	40	15		\(\checkmark\)		\(\checkmark\)

Table 2. The Statistics of Datasets

RainCityscapes. The RainCityscapes dataset is synthesized from the Cityscapes dataset [11]. It takes 9,432 images synthesized from 262 Cityscapes images as the training set and 1,188 images synthesized from 33 Cityscapes images as the test set. All the selected images of Cityscapes are overcast, without obvious shadow. Rain streaks and haze are synthesized by different intensity maps. By adjusting the intensity of the rain streaks and haze, each original image can produce 36 different synthesized images. The results of different methods are reported in Table 3.

Table 3.

Method		PSNR	SSIM
Input		15.55	0.7722
Haze removal	EPDN‡ [39]	26.08	0.9306
	DCPDN‡ [57]	28.52	0.9277
	AECRNet† [52]	28.77	0.9350
Rain removal	RESCAN‡ [30]	24.49	0.8852
	PReNet† [40]	27.34	0.9497
	DuRN‡ [33]	29.43	0.9487
	RCDNet† [48]	30.56	0.8873
	SPANet‡ [49]	31.48	0.9656
	MPRNet† [56]	32.33	0.9767
Rain and haze removal	DAF-Net† [20]	30.16	0.9531
	DGNL-Net† [21]	32.38	0.9743
	TransWeather† [47]	29.28	0.9216
	GTRain† [3]	30.19	0.9597
	WiperNet† [26]	30.21	0.9584
	ADU-Net	33.83	0.9784
	ADU-Net-plus	34.64	0.9805

Table 3. Comparison with the State-of-the-Arts Methods of Rain Removal and Haze Removal on RainCityscapes Dataset

†indicates the network was trained on the RainCityscapes dataset. ‡ indicates the results of the algorithms as reported in [21]. \(1^{\mathrm{st}}/2^{\mathrm{nd}}\) best in red/blue.

BID Rain. The BID Rain dataset is also synthesized from the Cityscapes dataset. It samples 2,975 images from the validation set of the Cityscapes dataset as a training set, and 500 images from the test set of the Cityscapes dataset as its test set. This is a complicated dataset as the images contain rain streaks, haze, snow, and raindrops. The rain streaks masks are sampled from Rain100L and Rain100H [53], and the snow masks are sampled from Snow 100K [34]. The haze masks include three different intensities originating from FoggyCityScape [44]. The raindrops are produced from the metaball model [4]. Those weather components are mixed with the images in the Cityscapes dataset using the physical imaging models [4, 19, 34, 44, 53]. In the training set, every image can be mixed with each weather component with random probabilities, and we evaluate our model in six different cases; the combinations of the weather components in each case are as follows (1) rain streaks; (2) rain streaks and snow; (3) rain streaks and light haze; (4) rain streaks and heavy haze; (5) rain streaks, moderate haze, and raindrops; and (6) rain streaks, snow, moderate haze, and raindrops. Refer to [18] for more details of the six settings. The results of different cases are shown in Table 4.

Table 4.

Case		Input	PReNet	RCDNet	BIDNet	TransWeather	GTrain	ADUNet	ADUNet-plus
(1)	PSNR	25.51	32.69	28.05	31.17	31.88	31.91	34.62	39.05
(1)	SSIM	0.8144	0.9803	0.9527	0.9438	0.9307	0.9596	0.9827	0.9877
(2)	PSNR	18.69	30.52	29.84	29.47	29.37	30.03	32.47	36.48
(2)	SSIM	0.5979	0.9504	0.9351	0.9089	0.8844	0.9178	0.9560	0.9742
(3)	PSNR	17.48	29.65	30.17	28.90	29.46	30.14	31.48	33.75
(3)	SSIM	0.7427	0.9568	0.9536	0.9325	0.9176	0.9470	0.9669	0.9777
(4)	PSNR	11.55	25.80	26.74	26.82	27.51	27.33	26.52	29.30
(4)	SSIM	0.6017	0.9233	0.9210	0.9125	0.8949	0.9222	0.9360	0.9565
(5)	PSNR	14.02	27.36	28.30	27.31	26.94	27.74	28.54	30.32
(5)	SSIM	0.6455	0.9302	0.9285	0.9116	0.8833	0.9191	0.9443	0.9594
(6)	PSNR	12.38	26.56	27.26	26.54	26.22	26.85	27.63	29.66
(6)	SSIM	0.4916	0.9046	0.9005	0.8675	0.8504	0.8857	0.9222	0.9418

Table 4. Comparison with the State-of-the-Arts Methods on BID Rain Dataset

†indicates the network was trained on the BID Rain dataset. \(1^{\mathrm{st}}/2^{\mathrm{nd}}\) best in red/blue.

SPA-Data. The SPA-Data is a real-world dataset, which is cropped from 170 real rain videos, of which 86 videos are collected from StoryBlocks or YouTube, and 84 videos are captured by iPhone X or iPhone 6SP. Those videos cover outdoor fields, suburb scenes, and common urban scenes. This dataset contains 638,492 image pairs for training and 1,000 for testing. The results of SPA-Data are shown in Table 5.

Table 5.

Method	PSNR	SSIM
Input	34.15	0.9269
RESCA \(\mathrm{N}\)‡ [30]	38.19	0.9707
PReNe \(\mathrm{t}\)‡ [40]	40.16	0.9816
SPANe \(\mathrm{t}\)‡ [49]	40.24	0.9811
RCDNe \(\mathrm{t}\)‡[48]	41.47	0.9834
TransWeather† [47]	38.31	0.9757
WiperNet† [26]	41.73	0.9905
ADU-Net	44.19	0.9885
ADU-Net-plus	46.04	0.9924

Table 5. Comparison with the State-of-the-Arts Methods on SPA-Data Dataset

‡indicates the results of the algorithms as reported in [48]. †indicates the network was trained on the SPA-data. \(1^{\mathrm{st}}/2^{\mathrm{nd}}\) best in red/blue.

NH-HAZE The NH-HAZE [2] is a valuable dataset for non-homogeneous haze research, as it offers ground truth images for evaluation. The dataset comprises 55 pairs of real-world outdoor scenes, where each pair consists of a hazy image and its corresponding haze-free counterpart. The non-homogeneous haze present in these images has been meticulously generated using a professional haze generator, ensuring an accurate representation of real-life haze conditions. The results on the NH-HAZE dataset are presented in Table 6.

Table 6.

Method	PSNR	SSIM
Input	11.48	0.4023
DehazeNet‡ [5]	16.62	0.524
FFA-Net‡ [38]	19.87	0.692
MSBDN‡ [12]	19.23	0.706
AECRNet‡ [52]	19.88	0.717
DehazeFormer-S‡[45]	20.47	0.731
ADU-Net	28.37	0.887
ADU-Net-plus	29.46	0.890

Table 6. Comparison with the State-of-the-Arts Methods on NH-HAZE Dataset

‡indicates the results of the algorithms as reported in [45]. †indicates the network was trained on the SPA-data. \(1^{\mathrm{st}}/2^{\mathrm{nd}}\) best in red/blue.

Evaluation Protocol. In our experiments, the network performance is quantitatively evaluated by the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. A higher value of PSNR and SSIM indicates a better image recovery performance of the network.

4.3 Comparison to the State-of-the-Arts

To verify the advance of our method, we compare the performance of our method with current state-of-the-art methods across three datasets.

RainCityscapes. In the RainCityscapes dataset, we compare our methods to the state-of-the-art rain removal methods including RESCAN [28], PReNet [40], DuRN [33], RCDNet [48], SPANet [49], and MPRNet [56]. We also compare our methods with approaches that jointly remove the rain and haze, i.e., DAF-Net [20], DGNL-Net [21], WiperNet [26], TransWeather [47], and GTRain [3]. The comparison with haze removal methods, like EPDN [39], DCPDN [57], and AECR-Net [52], is also conducted. The results are reported in Table 3. We can find that our vanilla solution, i.e., ADU-Net, outperforms the existing state-of-the-art methods. In particular, it improves the PSNR/SSIM values of the DGNL-Net by 1.45/0.0017, indicating the superior design of our method. The plus version of our method, i.e., ADU-Net-plus, again brings performance gain over the AUD-Net, where the ADU-Net-plus improves the PSNR/SSIM values by 0.81/0.0021.

BID Rain. Since the scene in the RainCityscapes dataset only contains rain and haze information, we further evaluate our methods on the challenging dataset, BID Rain, to verify its generalization of working in complicated weather conditions. Table 4 illustrates the comparison of the model performance in each weather condition. We can observe that the proposed ADU-Net can outperform the BIDeN [18] in each of the cases. Especially in cases (2) and (3), the ADU-Net brings the maximum performance gain. One possible explanation is that the proposed ADU-Net is designed with a dual-branch decoder, which is tailored for the images in case (2) including the rain streaks and snow, or that in (3) including rain streaks and a light haze. However, the improvement in the other cases reveals the generalization of our proposal. Along with the ADU-Net, its plus version can significantly improve both PSNR/SSIM values, showing the superiority of our network architecture. In case (4), the performance of ADU-Net is lower than that of RCDNet [48], BIDeN [18], TransWeather [47], and GTrain [3]. One possible explanation is that the “heavy haze” covers the scenes, which makes it difficult for our network to produce the scene residual. Nevertheless, this issue is addressed by increasing the parameter size, supported by the performance in ADU-Net-plus.

SPA-Data. We also evaluate our methods in the large-scale dataset, SPA-Data. We compare our methods to the existing state-of-the-art methods in Table 5, including RESCAN [30], PReNet [40], SPANet [49], RCDNet [48], and WiperNet [26]. As shown in Table 5, the proposed methods outperform the existing methods by a large margin. For example, the improvements read of 2.72/0.0051 (PSNR/SSIM) from ADU-Net and 4.57/0.0090 from ADU-Net-plus, as compared to RCDNet, showing the strong performance of our network architecture. Indeed, although ADU-Net exhibits a slightly lower SSIM compared to WiperNet by 0.0020, its superiority in PSNR by 2.46 highlights its excellent performance. Furthermore, ADU-Net-plus outperforms WiperNet, leading in both PSNR and SSIM by 4.31 and 0.0019, respectively. These findings affirm the robustness and efficacy of our proposed methods for image rain removal in real-world scenarios.

NH-HAZE. To showcase the effectiveness of our approach, we conducted experiments using the real-world NH-HAZE dataset [2]. In Table 6, we compare our methods with state-of-the-art techniques, including MSBDN [12], FFA-Net [38], AECRNet [52], and DehazeFormer [45]. The results, as presented in Table 6, demonstrate significant performance improvements with our proposed methods surpassing existing approaches by a wide margin. For instance, when compared to DehazeFormer-S, ADU-Net and ADU-Net-plus exhibit remarkable improvements of 7.9/0.156 (PSNR/SSIM) and 8.99/0.159, respectively. These outcomes highlight the strong performance of our network architecture.

Comparison of Model Complexity and Time Cost. In addition to analyzing PSNR and SSIM, we also compare the complexity of our methods with the existing state-of-the-art methods in Table 7, including PReNet [40], RCDNet [48], AECRNet [52], and DGNL-Net [21]. We employ GFLOPs (Giga Floating Point Operations) to quantify the complexity of the model. Additionally, we assess the runtime complexity by averaging the training time per epoch (s/epoch). It is evident from the table that ADU-Net stands out with the smallest GFLOPs and the second shortest s/epoch. Although our model’s runtime is slightly higher compared to DGNL-Net [21], we have demonstrated superior performance on both PSNR and SSIM. In our proposed methods, the convolutional layers play a significant role in the overall computational load. We are proud to share that ADU-Net achieves an impressive 50% reduction in kernel size compared to ADU-Net-plus while maintaining an exceptional performance level of 95%. We believe this significant reduction in complexity makes ADU-Net a more efficient and practical choice for various applications. However, in cases where computational resources are abundant and time is not a constraint, choosing ADU-Net-plus to achieve better performance is also a viable option.

Table 7.

Model	GFLOPs	s/epoch	PSNR	SSIM
PReNet [40]	132.88	2145.45	27.34	0.9497
RCDNet [48]	48.46	3509.27	30.56	0.8873
AECRNet [52]	86.09	1113.68	28.77	0.9350
DGNL-Net [21]	39.53	837.28	32.38	0.9743
ADU-Net	31.65	1017.21	33.83	0.9784
ADU-Net-plus	125.71	1126.49	34.64	0.9805

Table 7. Model Complexity and Runtime Cost

The PSNR/SSIM are the results of the RainCityscapes dataset. \(1^{\mathrm{st}}/2^{\mathrm{nd}}\) best in red/blue.

4.4 Ablation Study

In this section, we conduct thorough ablation studies to verify the effectiveness per component in the proposed network. All studies in this section are conducted using ADU-Net on the RainCityscapes dataset.

Loss Function. In our implementation, the network is optimized by the negative SSIM loss, i.e., \(\mathcal {L}_{\mathrm{SSIM}}\). While in many practices of the low-level computer vision tasks, the MSE loss, i.e., \(\mathcal {L}_{\mathrm{MSE}}\), is employed [14]. In this study, we evaluate the effectiveness of each loss function. As shown in Table 8, we can find that each of the loss functions works better for our rain and haze removal task, and the network performance training from the two-loss functions is similar. However, the multi-task training, which optimizes the loss functions jointly, will degrade the network performance, indicating that the network may be saturated using one loss function, and the joint training will harm the network.

Table 8.

Loss Function	\(\mathcal {L}_{\mathrm{MSE}}\)	\(\mathcal {L}_{\mathrm{SSIM}}\)	\(\mathcal {L}_{\mathrm{MSE}} + \mathcal {L}_{\mathrm{SSIM}}\)
PSNR	33.17	33.83	33.74
SSIM	0.9720	0.9784	0.9774

Table 8. Comparison of the Effectiveness of Loss Functions