1 Introduction

Person re-identification (re-id) refers to matching pedestrians across non-overlapping cameras, which is very challenging because of the visual variations across views. With the ubiquity of surveillance systems, re-id has attracted increasing research attention, and recent studies have achieved remarkable progress. Existing approaches focus on addressing illumination variations, viewpoint changes, and occlusions. However, the low-resolution (LR) problem, which is common in surveillance systems because of poor imaging quality and long-distance monitoring, is generally ignored in the current literature, giving rise to the LR-REID problem. The performance of traditional re-id methods degrades severely on LR images because these methods fail to capture discriminative information from LR pedestrians. Therefore, learning a resolution-robust model is essential for overcoming the drawbacks of existing methods and enhancing realistic surveillance systems.

Some studies have made pioneering efforts to tackle LR-REID. These methods first reduce the cross-resolution discrepancy by reconstructing HR pedestrians using super-resolution techniques or generative adversarial networks (GANs). A classification model is then learned to distinguish the identities of the reconstructed outputs. Although recent studies have considerably improved the robustness of re-id models against resolution changes, the performance of LR-REID models still falls far below that of traditional re-id techniques. An important reason is that the reconstruction methods fail to capture accurate and discriminative details. On the one hand, super-resolution methods tend to generate blurred images that are unfavorable for recognition. On the other hand, although GAN-based networks can produce sharp details, the generated pedestrians contain annoying artifacts, which may mislead the recognition model and result in wrong decisions. To address this problem, we aim to design a model that recovers sharp and accurate details. Furthermore, the adversarial loss, which is designed to determine whether an image is real or fake, is not always appropriate for reconstructing HR images because both LR and HR images are real images. A more suitable loss determines whether the reconstructed output contains sharper details than the input image.

Based on the above discussion, we propose an adaptive dual-branch network (ADBNet) to reconstruct HR images. The ADBNet consists of two branches, namely the SR-branch and the GAN-branch. The SR-branch adopts the reconstruction loss to recover accurate outputs, whereas the GAN-branch imposes the adversarial loss to obtain sharp details. The ADBNet adaptively learns the weight of each branch and fuses the SR-branch and GAN-branch by weighted summation to obtain the final HR output. Besides, we propose a resolution-aware adversarial loss, named the relative loss, to determine whether the generated images have sharper edges than the LR inputs. Finally, a classification module is attached after the dual-branch structure to learn discriminative features in an end-to-end manner. Experimental results demonstrate the effectiveness of the proposed approach.

2 Related Works

2.1 Traditional Re-id Methods

Re-id has attracted widespread research attention in the last decade. Existing works mainly focus on feature extraction and metric learning. Deep learning [1, 3, 4, 7, 15, 19, 23] plays an increasingly significant role in re-id because of its strength in discriminative feature learning. With the development of deep learning techniques, deep networks have achieved remarkable progress in re-id and significantly outperform hand-crafted features. Furthermore, researchers apply metric learning [2, 5, 10, 13] to the deep features to learn robust representations. Although recent studies report high recognition accuracy, existing methods focus on handling cross-view variations such as illumination, pose, and occlusion, where the re-id models are trained and tested with HR images. The performance of traditional re-id methods is dramatically affected by LR images, giving rise to the LR-REID problem, which remains largely under-studied.

2.2 Low-Resolution Re-id Methods

Recently, several studies have paid attention to the LR problem and proposed pioneering approaches [11, 14, 22] for the LR-REID task. The basic idea is to first synthesize HR images and then extract blur-robust representations. Jiao et al. [11] proposed the SING model, which first applies super-resolution with SRCNN and then learns a classification network for person re-identification. Wang et al. [22] proposed a cascaded SR-GAN model to conduct scale-adaptive image super-resolution and used a re-id network to learn discriminative features. Although recent studies have achieved remarkable progress under resolution variations, the performance of LR-REID models is still far from satisfactory. In this paper, we follow the framework of first synthesizing HR images and then extracting robust feature representations from the reconstructed images. Notably, we propose a novel adaptive dual-branch network to improve the image enhancement process.

3 Proposed Methodology

3.1 Overview

Figure 1 illustrates the structure of the proposed approach. The proposed method jointly optimizes two sub-networks, namely the adaptive dual-branch sub-network and re-id sub-network, to conduct image enhancement and recognition. The ADBNet consists of two branches, including the super-resolving branch (SR-branch) and the cycle-generative branch (GAN-branch). The weights \(W_{SR}\) and \(W_{GAN}\) are adaptively learned to combine the SR-branch and GAN-branch to obtain the desired HR reconstruction results. Finally, the re-id sub-network is integrated into the ADBNet to learn discriminative resolution-robust representations.

Fig. 1. The proposed LR-REID framework. A novel adaptive dual-branch network is proposed to recover HR images, and a re-id network is then attached to the reconstructed images to learn discriminative feature representations.

The notation of the proposed approach is as follows. Given the training data \((\varvec{x}^l_i,\varvec{x}^h_i,y_i),i=1,2,...,N\), where \(\varvec{x}^l_i\) and \(\varvec{x}^h_i\) refer to the LR and HR images, \(y_i\) is the corresponding identity label, and N denotes the number of training samples, we use the ADBNet to learn an enhancing function \(F_{ADB}(.)\) that compensates the high-frequency components of the LR input. A feature extraction function \(F_{FE}(.)\) is then adopted to learn discriminative features for LR-REID.

3.2 Adaptive Dual-Branch Network

As described above, the ADBNet combines the SR-branch with the GAN-branch to generate HR pedestrians. The SR-branch adopts the reconstruction loss to produce accurate details, whereas the GAN-branch imposes the generative adversarial loss to generate sharp edges. Finally, the ADBNet learns adaptive weights to combine the two branches and benefit from the complementary information in their outputs.

Super-Resolving Branch. The SR-branch recovers the high-frequency details in a traditional image super-resolution manner. We adopt the resnet-based structure of Johnson et al. [12] as the generator of the SR-branch. In particular, we use the variant with 6 resnet blocks to extract deep features from the LR images. A reconstruction loss is then imposed to guarantee that the output is as close to the ground-truth HR image as possible. Denoting \(F_{SR}(.)\) as the function of the SR-branch, the reconstruction loss can be formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{SR}=\frac{1}{N}\sum _{i=1}^{N}\parallel F_{SR}(\varvec{x}_i^l)-\varvec{x}_i^h\parallel _1. \end{aligned} \end{aligned}$$
(1)
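As a concrete illustration, a minimal PyTorch sketch of Eq. (1) is given below. The names `f_sr`, `x_lr`, and `x_hr` are illustrative assumptions, not identifiers from our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sr_reconstruction_loss(f_sr: nn.Module,
                           x_lr: torch.Tensor,
                           x_hr: torch.Tensor) -> torch.Tensor:
    # L1 distance between the super-resolved output and the HR ground truth;
    # F.l1_loss averages over all elements by default, which differs from
    # Eq. (1) only by a constant scaling factor.
    return F.l1_loss(f_sr(x_lr), x_hr)
```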

Studies on super-resolution have verified that the reconstruction loss tends to produce blurry outputs, which are detrimental for recognition. Therefore, we employ a generative network to obtain sharp outputs.

Fig. 2. The structure of the cycle-generative branch.

Cycle-Generative Branch. We adopt generative networks to extract sharp details, which are complementary to the SR-branch. Inspired by [25], we use a cycle structure to generate HR images. The proposed cycle-generative branch is shown in Fig. 2. Our model consists of two generators and three discriminators, namely Generator L2H, Generator H2L, Discriminator H, Discriminator L, and Discriminator R. For LR images, we generate HR outputs using Generator L2H, and the HR outputs are then mapped back to the LR domain using Generator H2L. A similar procedure is applied to HR images. The cycle-generative branch involves three losses: the adversarial loss, the cycle loss, and the relative loss.

We apply the adversarial loss [8] to both Generator L2H and Generator H2L to improve the quality of the generated outputs. For Generator L2H and Discriminator H, the adversarial loss can be expressed as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{ADV}^{L2H}&=E_{\varvec{x}_i^h\sim p_{data}(\varvec{x}_i^h)}[log(D_H(\varvec{x}_i^h))]\\&+E_{\varvec{x}_i^l\sim p_{data}(\varvec{x}_i^l)}[log(1-D_H(G_{L2H}(\varvec{x}_i^l)))], \end{aligned} \end{aligned}$$
(2)

where \(G_{L2H}(.)\) refers to the function of Generator L2H and \(D_H(.)\) refers to the function of Discriminator H. In this manner, Generator L2H tries to generate more realistic images to fool Discriminator H, while Discriminator H aims to distinguish real and fake samples. We also introduce a similar adversarial loss for Generator H2L and Discriminator L, which can be formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{ADV}^{H2L}&=E_{\varvec{x}_i^l\sim p_{data}(\varvec{x}_i^l)}[log(D_L(\varvec{x}_i^l))]\\&+E_{\varvec{x}_i^h\sim p_{data}(\varvec{x}_i^h)}[log(1-D_L(G_{H2L}(\varvec{x}_i^h)))]. \end{aligned} \end{aligned}$$
(3)
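For illustration, the following sketch expresses Eq. (2) in the common BCE-with-logits form; Eq. (3) is symmetric, with `g_h2l` and `d_l` in place of `g_l2h` and `d_h`. All module and tensor names here are assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def adversarial_loss_l2h(g_l2h: nn.Module, d_h: nn.Module,
                         x_lr: torch.Tensor, x_hr: torch.Tensor):
    fake_hr = g_l2h(x_lr)
    # Discriminator side: score real HR images as 1 and generated ones as 0.
    real_logits = d_h(x_hr)
    fake_logits = d_h(fake_hr.detach())  # detach so G is not updated here
    d_loss = (bce(real_logits, torch.ones_like(real_logits))
              + bce(fake_logits, torch.zeros_like(fake_logits)))
    # Generator side: try to make Discriminator H label the fake HR as real.
    g_logits = d_h(fake_hr)
    g_loss = bce(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```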

The adversarial objective alone admits many possible solutions: the generative network can produce an arbitrary HR image for a given LR input, which is harmful to re-id because the identity information may be changed by the generator. Therefore, we impose the cycle loss on the cycle-generative branch to ensure that the mapping functions \(G_{L2H}\) and \(G_{H2L}\) are cycle-consistent. The cycle loss reduces the space of possible mapping functions and forces the generated output to maintain the intrinsic information of the input image. The cycle loss for Generator L2H and Generator H2L can be formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{CYC}&=E_{\varvec{x}_i^l\sim p_{data}(\varvec{x}_i^l)}[\parallel G_{H2L}(G_{L2H}(\varvec{x}_i^l))-\varvec{x}_i^l\parallel _1]\\&+E_{\varvec{x}_i^h\sim p_{data}(\varvec{x}_i^h)}[\parallel G_{L2H}(G_{H2L}(\varvec{x}_i^h))-\varvec{x}_i^h\parallel _1]. \end{aligned} \end{aligned}$$
(4)
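A minimal sketch of Eq. (4) follows; `g_l2h` and `g_h2l` denote Generator L2H and Generator H2L, and the tensor names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cycle_loss(g_l2h: nn.Module, g_h2l: nn.Module,
               x_lr: torch.Tensor, x_hr: torch.Tensor) -> torch.Tensor:
    lr_cycle = F.l1_loss(g_h2l(g_l2h(x_lr)), x_lr)  # LR -> HR -> LR
    hr_cycle = F.l1_loss(g_l2h(g_h2l(x_hr)), x_hr)  # HR -> LR -> HR
    return lr_cycle + hr_cycle
```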

The adversarial loss focuses on determining whether the input images are realistic samples or generated ones, which is insufficient for LR-REID because both LR and HR images are realistic. To improve the recognition performance of LR-REID, the generator not only needs to produce realistic outputs but also needs to generate images whose details are sharper than those of the input. Based on the above discussion, we propose a relative loss to guarantee that the generator produces images of higher resolution. The relative loss is designed to determine the relative relation of two images in terms of sharpness, i.e., whether the output image is clearer than the input image. The relative loss is formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{REL}&=E[log(D_R(\varvec{x}_i^l,\varvec{x}_i^h))]+E[log(1-D_R(\varvec{x}_i^h,\varvec{x}_i^l))]\\&+E[log(D_R(\varvec{x}_i^l,G_{L2H}(\varvec{x}_i^l)))]+E[log(1-D_R(G_{L2H}(\varvec{x}_i^l),\varvec{x}_i^l))]\\&+E[log(D_R(G_{H2L}(\varvec{x}_i^h),\varvec{x}_i^h))]+E[log(1-D_R(\varvec{x}_i^h,G_{H2L}(\varvec{x}_i^h)))], \end{aligned} \end{aligned}$$
(5)

where \(D_R\) represents the function of Discriminator R. Discriminator R first learns the relative relationship in terms of sharpness between the LR and HR images. The discriminator is then utilized to determine the relative relationship between the input images and the outputs of the generators. A sketch of Eq. (5) is given below; two design points in it are our assumptions rather than fixed by the formulation: all images are resized to a shared spatial size before being scored, and Discriminator R consumes a channel-concatenated image pair, outputting a high logit when the second image is sharper than the first.
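```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def d_r_pair(d_r: nn.Module, first: torch.Tensor, second: torch.Tensor):
    # Score a channel-concatenated pair; a high logit means the second
    # image is sharper than the first (our assumed pair encoding).
    return d_r(torch.cat([first, second], dim=1))

def relative_loss(d_r, g_l2h, g_h2l, x_lr, x_hr):
    def sharper(logits):      # target: the second image is sharper
        return bce(logits, torch.ones_like(logits))
    def not_sharper(logits):  # target: the second image is not sharper
        return bce(logits, torch.zeros_like(logits))
    fake_hr = g_l2h(x_lr)  # generated HR from LR
    fake_lr = g_h2l(x_hr)  # generated LR from HR
    # The six terms below correspond one-to-one to the six terms of Eq. (5).
    return (sharper(d_r_pair(d_r, x_lr, x_hr))
            + not_sharper(d_r_pair(d_r, x_hr, x_lr))
            + sharper(d_r_pair(d_r, x_lr, fake_hr))
            + not_sharper(d_r_pair(d_r, fake_hr, x_lr))
            + sharper(d_r_pair(d_r, fake_lr, x_hr))
            + not_sharper(d_r_pair(d_r, x_hr, fake_lr)))
```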

Notably, we follow [12] and adopt the structure of 6 resnet blocks to extract features for the Generator L2H and Generator H2L. For the discriminators, we follow [25] and use Patch-GANs.
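For reference, a minimal sketch of one such resnet block is shown below; the reflection padding and instance normalization follow common practice for this generator family and are our assumptions, and the generators stack 6 such blocks.

```python
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(dim, dim, kernel_size=3),
            nn.InstanceNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(dim, dim, kernel_size=3),
            nn.InstanceNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection
```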

Adaptive Weights Learning. The outputs of the SR-branch and the GAN-branch are complementary, since the SR-branch produces accurate but blurry images whereas the GAN-branch generates sharp images with artifacts. In this paper, we integrate the two branches in a unified framework and learn an adaptive weight for each branch to combine the reconstruction results. In particular, we attach a convolution layer to each branch, denoted as \(W_{SR}\) and \(W_{GAN}\) respectively, to learn suitable weights. The final output of the ADBNet can then be formulated as:

$$\begin{aligned} \begin{aligned} \varvec{x}_{RE}=W_{SR}F_{SR}(\varvec{x}_i^l)+W_{GAN}G_{L2H}(\varvec{x}_i^l). \end{aligned} \end{aligned}$$
(6)

Finally, we employ a reconstruction loss to guarantee that the output \(\varvec{x}_{RE}\) is close to the ground-truth HR image, which can be formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{RE}=\frac{1}{N}\sum _{i=1}^{N}\parallel \varvec{x}_{RE}-\varvec{x}_i^h\parallel _1. \end{aligned} \end{aligned}$$
(7)
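The following sketch covers Eqs. (6)-(7): one convolution per branch realizes the learnable weights \(W_{SR}\) and \(W_{GAN}\), and an L1 loss ties the fused output to the HR ground truth. The 1x1 kernel size is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.w_sr = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_gan = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, sr_out: torch.Tensor, gan_out: torch.Tensor):
        # Weighted summation of the two branch outputs (Eq. (6)).
        return self.w_sr(sr_out) + self.w_gan(gan_out)

def fused_reconstruction_loss(fusion: AdaptiveFusion, sr_out, gan_out, x_hr):
    return F.l1_loss(fusion(sr_out, gan_out), x_hr)  # Eq. (7)
```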

3.3 Feature Extraction Network

In this paper, we propose an end-to-end structure to conduct image enhancement and classification. A classification network is attached after the ADBNet to learn discriminative features for both the HR images and the reconstructed outputs. Denoting \(L_S(.)\) as the classification function, the objective of the feature extraction network can be formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{Re-id}=\frac{1}{N}\sum _{i=1}^{N}(L_S(F_{FE}(\varvec{x}_i^h),y_i)+L_S(F_{FE}(F_{ADB}(\varvec{x}_i^l)),y_i)). \end{aligned} \end{aligned}$$
(8)

Here we adopt ResNet50 [9] as the feature extraction network and impose Softmax loss [7] to train the classifier.
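A sketch of this component is given below: a ResNet50 backbone with a linear softmax (cross-entropy) classifier applied to both the HR images and the reconstructed outputs, as in Eq. (8). The ImageNet pre-training and the `num_ids` argument are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ReidNet(nn.Module):
    def __init__(self, num_ids: int):
        super().__init__()
        backbone = models.resnet50(pretrained=True)  # pre-training assumed
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.classifier = nn.Linear(2048, num_ids)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

def reid_loss(net: ReidNet, x_hr, x_rec, labels):
    # Softmax loss on HR images and on the reconstructed outputs (Eq. (8)).
    return F.cross_entropy(net(x_hr), labels) + F.cross_entropy(net(x_rec), labels)
```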

3.4 Overall Objective Function

Consequently, the objective function of the proposed approach can be formulated as:

$$\begin{aligned} \begin{aligned} \mathcal {L}=\mathcal {L}_{Re-id}+\lambda _1\mathcal {L}_{SR}+\lambda _2\mathcal {L}_{CYC}+\lambda _3\mathcal {L}_{RE}+\lambda _4(\mathcal {L}_{ADV}^{L2H}+\mathcal {L}_{ADV}^{H2L}+\mathcal {L}_{REL}). \end{aligned} \end{aligned}$$
(9)
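A direct sketch of Eq. (9) follows, using the weights reported in Sect. 4.1 (\(\lambda _1=\lambda _2=\lambda _3=10\), \(\lambda _4=1\)); the argument names are illustrative.

```python
def total_loss(l_reid, l_sr, l_cyc, l_re, l_adv_l2h, l_adv_h2l, l_rel,
               lam1=10.0, lam2=10.0, lam3=10.0, lam4=1.0):
    # Weighted combination of all sub-losses as in Eq. (9).
    return (l_reid + lam1 * l_sr + lam2 * l_cyc + lam3 * l_re
            + lam4 * (l_adv_l2h + l_adv_h2l + l_rel))
```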

4 Experiment

4.1 Experimental Settings

Datasets. We conduct experiments on two simulated large-scale LR datasets, namely LR-Market-1501 and LR-CUHK03, to evaluate the effectiveness of the ADBNet. We obtain the LR images by down-scaling the HR images with a ratio of 0.25 for both datasets. LR-Market-1501 is built from the Market-1501 dataset [24], which contains 32,668 annotated bounding boxes of 1,501 identities from six cameras in an open system; the images are automatically detected by a deformable parts model detector. LR-CUHK03 is based on the CUHK03 dataset [15], which contains more than 14,000 images of 1,467 people captured by six cameras, where the images of each person come from two disjoint cameras.
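As a sketch of how such simulated LR images can be produced, a 0.25 down-scaling could be implemented as follows; bicubic interpolation is our assumption, as the resampling kernel is not specified above.

```python
import torch
import torch.nn.functional as F

def make_lr(x_hr: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    # Expects a 4D NCHW batch of HR images; returns the down-scaled LR batch.
    return F.interpolate(x_hr, scale_factor=ratio,
                         mode='bicubic', align_corners=False)
```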

Evaluation Protocols. We follow the standard evaluation protocol to ensure fair comparisons between the ADBNet and the other approaches. For LR-Market-1501 benchmark, we train the models with the standard training set (750 identities) and match 3,368 query images with the standard testing set (751 identities). Notably, we adopt the single-query mode for testing. We conduct matching using the LR query images and the HR testing gallery images. Rank-1, Rank-5, Rank-10 accuracies and mean average precision (mAP) are computed to evaluate the performance of all the methods. For LR-CUHK03 benchmark, we follow the protocol used by [11]: randomly split the samples into 100 people for testing and the remainder for training. We randomly select one image from the gallery for each identity to form the gallery set and construct the probe set with all LR images. The cumulative matching characteristic (CMC) is used to evaluate the performance of the compared methods.

Implementation Details. We adopt ResNet50 as the baseline feature extraction model, use 6 resnet blocks [12] to form the generators, and employ Patch-GANs [25] as the discriminators. The proposed method is implemented in the PyTorch [17] framework. We set the learning rate to \(10^{-2}\) for the re-id sub-network and \(2\times 10^{-4}\) for the ADBNet. The hyper-parameters are tuned for the best performance, and we set \(\lambda _1=10\), \(\lambda _2=10\), \(\lambda _3=10\), and \(\lambda _4=1\) for all the datasets. Training ends after 100 epochs.
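A sketch of this optimization setup is shown below. The optimizer choices (SGD with momentum for the re-id sub-network, Adam for the ADBNet) are our assumptions; only the learning rates and the 100-epoch budget are stated above.

```python
import torch
import torch.nn as nn

def build_optimizers(reid_net: nn.Module, adbnet: nn.Module):
    # Learning rates follow the paper; momentum and betas are assumed defaults.
    reid_opt = torch.optim.SGD(reid_net.parameters(), lr=1e-2, momentum=0.9)
    adb_opt = torch.optim.Adam(adbnet.parameters(), lr=2e-4, betas=(0.5, 0.999))
    return reid_opt, adb_opt
```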

4.2 Comparisons with Other Methods

In this section, we compare the ADBNet with the state-of-the-art methods to validate its effectiveness. The compared methods include traditional re-id methods, such as DGD [23], MGN [20], PCB [18], and ResNet50, and LR-REID methods, such as JUDEA [16], SDF [21], and SING [11].

Table 1. Comparisons on LR-Market-1501

LR-Market-1501. Since LR-REID has not yet been studied on the Market-1501 dataset, we compare the ADBNet with several popular re-id methods. Furthermore, we evaluate reconstruction-based methods that first recover HR images and then conduct recognition on the generated outputs. Table 1 shows the comparison results, where the subscript HR means training merely on HR images. Obviously, the performance of the popular re-id methods (PCB and MGN) degrades significantly when dealing with LR images. Table 1 also verifies that pre-processing is valuable for addressing the LR-REID task: the SRCNN approach significantly improves the Rank-1/mAP of ResNet50 from 15.1%/11.1% to 51.2%/33.7%. Notably, recovering HR images with the ADBNet dramatically outperforms SRCNN, by a margin of 9.3%/6.5% in terms of Rank-1/mAP. Overall, the ADBNet remarkably outperforms the compared methods, achieving the highest Rank-1/mAP of 72.1%/48.6%, which is close to traditional re-id performance. Compared with the baseline ResNet50, the ADBNet achieves an improvement of 12.1% in Rank-1 accuracy and 10.1% in mAP.

Table 2. Comparisons on LR-CUHK03

LR-CUHK03. Table 2 presents the comparison between the ADBNet and state-of-the-art LR-REID methods. The ADBNet is competitive against the other methods and achieves the highest Rank-1 accuracy on LR-CUHK03. Note that the Rank-5 and Rank-10 accuracies of the ADBNet are lower than those of SING. However, SING is tested with images down-scaled by factors of 2, 3, and 4, so the ADBNet deals with images of much lower resolution on average, and its performance can be further improved when dealing with smaller down-scaling factors.

Table 3. Ablation experiments on LR-Market-1501

4.3 In-Depth Analysis

Ablation Analysis. In this section, we conduct ablation experiments on LR-Market-1501 to evaluate the effect of each component of the ADBNet. Table 3 shows the results. Without the SR-branch, although we obtain sharp details by merely using the GAN-branch, the Rank-1 accuracy/mAP is only 60.7%/37.1% because of the annoying artifacts. On the other hand, with only the blurry HR images generated by the SR-branch, we achieve a Rank-1 accuracy/mAP of just 66.0%/44.1%. Combining the SR-branch and the GAN-branch yields a better Rank-1 accuracy/mAP of 69.2%/45.3%. Finally, the proposed relative loss further improves the performance and yields the best Rank-1 accuracy/mAP of 72.1%/48.6%.

Fig. 3. The reconstructed HR images on LR-Market-1501 and LR-CUHK03.

Reconstruction Results. Figure 3 illustrates the reconstruction results of each component of the ADBNet. Obviously, the reconstructed images of the SR-branch are much blurrier than the HR outputs of the GAN-branch, particularly around the contours of the pedestrians. Although the GAN-branch produces high-quality results on the LR-CUHK03 dataset, its outputs on LR-Market-1501 contain many artifacts, which is detrimental for recognition. Notably, the ADBNet achieves a better balance between improving image quality and suppressing inappropriate artifacts, generating accurate and sharp details for the pedestrians.

5 Conclusion

In this paper, we propose a novel ADBNet to deal with the LR-REID problem, which is largely under-studied in the current literature. The proposed ADBNet consists of an SR-branch and a GAN-branch and adaptively learns the combination weights to obtain sharp and accurate reconstruction outputs. Furthermore, we design a relative loss to generate HR outputs with sharper details. Finally, a re-id sub-network is integrated into the proposed approach to learn discriminative features. Experimental results on two large-scale simulated LR benchmarks validate the effectiveness of the proposed approach.