4.1 Dataset
We apply the proposed semi-supervised approach to the public SpaceNet3 [
56] and DeepGlobe Road [
17] datasets and compare it with alternative semi-supervised methods from the recent literature.
SpaceNet3. This dataset contains 2780 images. The size of each image is
\(1300 \times 1300\) pixels, with a ground resolution of 30 cm/pixel. We remove the images without roads and obtain a subset of 2549 images. Road annotations are provided as line strings. Following [6], we first apply the Euclidean distance transform along the line strings to obtain Gaussian maps and then threshold these maps at a constant of 0.76 to obtain binary masks, which correspond to roads 6–7 meters wide. We split the dataset into 2018 images for training, 100 for validation, and 431 for testing, following the approach described in [6].
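As a rough sketch of this preprocessing step (not the exact script used in our pipeline), the rasterized line strings can be converted to binary masks as follows; the Gaussian width `sigma` is an assumed parameter, since only the 0.76 threshold is specified above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def linestrings_to_mask(centerline_raster, sigma=10.0, thr=0.76):
    """Turn a rasterized road centerline (1 on the line strings, 0 elsewhere)
    into a binary road mask via a Gaussian of the Euclidean distance transform."""
    # Distance (in pixels) from every pixel to the nearest centerline pixel.
    dist = distance_transform_edt(centerline_raster == 0)
    # Gaussian map: 1.0 on the centerline, decaying with distance.
    gaussian = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    # Thresholding at 0.76 keeps a band around each centerline; at the 30 cm/pixel
    # resolution of SpaceNet3, the paper's setting corresponds to a 6-7 m wide road.
    return (gaussian > thr).astype(np.uint8)
```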
DeepGlobe Road. This dataset includes a total of 6226 satellite images, which are captured over Thailand, Indonesia, and India, spanning 1632
\({\rm km^2}\) in ground area. Road areas are annotated with pixel-level binary masks. Each image has a size of
\(1024 \times 1024\) pixels and a ground resolution of 50 cm/pixel. Following [
6,
51], we divide the dataset into 4496 images for training, 200 for validation, and 1530 for testing.
4.3 Implementations
We use U-Net [
49] as the backbone network. For the supervised branch, we crop multiple image regions of
\(256\times 256\) from each labeled image and use these cropped images to enlarge the training set. For the unlabeled branch, we first select a region of \(256\times 256\) and draw two larger surrounding regions of \(384\times 384\). These two larger regions overlap with each other, and the corresponding \(256\times 256\) areas within their overlap form the positive pairs. To collect negative pairs, we crop a region of \(512\times 512\) that has no overlap with the positive pairs and select negative samples from this region that are perceptually dissimilar to the positive samples. We extract a HOG descriptor from the network output over each region and measure the distances between descriptors to prune regions that are too similar to the positive samples. We also augment the cropped images by applying different transformations, including flipping, rotation, and color jitter.
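For illustration, the region geometry described above can be sketched as follows; this is coordinate bookkeeping only, and both the function and the rejection sampling of the negative region are hypothetical simplifications of our actual cropping script.

```python
import random

def boxes_overlap(a, b):
    """Axis-aligned overlap test for (top, left, height, width) boxes."""
    return not (a[0] + a[2] <= b[0] or b[0] + b[2] <= a[0] or
                a[1] + a[3] <= b[1] or b[1] + b[3] <= a[1])

def sample_regions(img_h, img_w, pos=256, ctx=384, neg=512, max_tries=100):
    """Two 384x384 context crops whose 256x256 overlap forms the positive pair,
    plus a 512x512 negative region disjoint from the positive area."""
    shift = ctx - pos                          # offset between the two context crops
    y = random.randint(shift, img_h - ctx)     # anchor of the shared positive region
    x = random.randint(shift, img_w - ctx)
    crop_a = (y - shift, x - shift, ctx, ctx)
    crop_b = (y, x, ctx, ctx)
    positive = (y, x, pos, pos)                # overlap of crop_a and crop_b
    for _ in range(max_tries):                 # reject negatives touching the positive area
        ny, nx = random.randint(0, img_h - neg), random.randint(0, img_w - neg)
        if not boxes_overlap((ny, nx, neg, neg), positive):
            return crop_a, crop_b, positive, (ny, nx, neg, neg)
    return crop_a, crop_b, positive, None      # no valid negative found for this anchor
```

In the actual pipeline, the negatives drawn from the \(512\times 512\) region are additionally pruned by the HOG-descriptor distance described above.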
We train all models on a single GTX 1080 Ti GPU with a batch size of 6 for labeled data and 2 for unlabeled data. We use the Adam optimizer with a weight decay of 0.0001. The supervised and unsupervised loss weights are set to 1 and 0.1, respectively. We train for a total of 100 epochs on SpaceNet3 and 60 on DeepGlobe Road. The learning rate starts at 0.0005 and is decayed by a factor of 0.1 with a step scheduler at epochs \(\lbrace 60, 75\rbrace\) for SpaceNet3 and \(\lbrace 30, 40\rbrace\) for DeepGlobe. For stabilization, the first four epochs (SpaceNet3) and the first two epochs (DeepGlobe) are trained with the supervised loss only. The temperature hyper-parameter \(\tau\) is 0.07. HOG descriptors are generated with a patch size of \(8\times 8\) and 12 orientation bins. The percentile threshold \(thr_1\) and contrastive loss threshold \(thr_2\) are set to \(80\%\) and 4.7, respectively, to select pseudo labels during the iterative labeling process. The supervised baseline model is U-Net trained without TA or TTA. We evaluate our model using \(1\%\) and \(5\%\) of the labeled data of the SpaceNet3 and DeepGlobe Road datasets.
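A minimal training-loop sketch matching these SpaceNet3 hyper-parameters is given below; `model`, `loader`, `sup_loss`, and `unsup_loss` are placeholders for components described elsewhere in the paper.

```python
import torch

# model, loader, sup_loss, and unsup_loss are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 75], gamma=0.1)

for epoch in range(100):                                # 100 epochs for SpaceNet3
    for labeled_batch, unlabeled_batch in loader:       # batch sizes 6 (labeled) / 2 (unlabeled)
        loss = 1.0 * sup_loss(model, labeled_batch)     # supervised loss weight 1
        if epoch >= 4:                                  # first 4 epochs: supervised loss only
            loss = loss + 0.1 * unsup_loss(model, unlabeled_batch)  # unsupervised weight 0.1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```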
4.4 Comparison with State-of-the-art Methods
We apply both the proposed method and other recently published semi-supervised segmentation methods, including
\(C^3\)-SemiSeg [
75],
\({\rm PC^2Seg}\) [
73], Cross Pseudo Supervision (CPS) [
13], and Cross-Consistency Training (CCT) [
47] over the same datasets and compare their performance. For fairness, all models are compared using the same data partition protocol.
Results on SpaceNet3. Table
1 reports the comparison results of the proposed method and the other four methods on the SpaceNet3 dataset while using
\(1\%\) or
\(5\%\) of the labeled samples. We include the results of the baseline method for comparison. Besides the proposed method with iterative labeling, we also implement a variant of our method that does not use iterative labeling. The comparison in Table
1 suggests that the proposed semi-supervised method with iterative labeling (IL) achieves the highest performance in both settings and clearly outperforms the other four state-of-the-art semi-supervised segmentation methods [
13,
47,
73,
75]. Moreover, the proposed method still outperforms the other methods even without iterative labeling. Note that there are only 20 training images when using \(1\%\) labeled data. The proposed method still works well in this setting and even outperforms the baseline method trained on \(5\%\) labeled data.
Figure
6 further plots the F1 and IoU of both the supervised baseline and the proposed semi-supervised methods using
\(1\%\),
\(5\%\), and
\(100\%\) labeled data. When we apply our method in the fully supervised setting, in which all data are used for both supervised and unsupervised training, our method (F1: 0.7223; IoU: 0.5712) still beats the baseline (F1: 0.7066; IoU: 0.5578), with improvements of \(1.57\%\) and \(1.34\%\) in F1 and road IoU, respectively.
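For reference, a plain pixel-wise version of the two metrics reported throughout this section can be computed as follows; the exact implementation used in our evaluation may differ in details such as smoothing.

```python
import numpy as np

def f1_and_road_iou(pred, gt, eps=1e-8):
    """Pixel-wise F1 and road IoU for binary masks, with roads as the positive class."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # true positives
    fp = np.logical_and(pred, ~gt).sum()   # false positives
    fn = np.logical_and(~pred, gt).sum()   # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return f1, iou
```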
Figure
7 visualizes the results of our method and five other methods, including the supervised baseline and four state-of-the-art methods. For the images in the first seven rows, our method produces much higher-quality results than the others. The last three rows show three challenging images for which our method did not work well. These images include complicated highways and partially visible roadways, patterns that did not appear in the training set. In fact, none of the six learning-based methods works properly on these three images. One potential solution is the so-called sample-specific data augmentation method [36]. In our current experiments, we apply the same set of augmentation operations, including cropping, flipping, rotation, and color jitter, to every image. A more reasonable solution is to assess how difficult each image is for the current model and produce more augmented samples for difficult images than for easy ones. We will explore this direction in future work.
Results on DeepGlobe Road. Table
2 reports the results of different methods over the DeepGlobe Road dataset. Some examples of results are shown in Figure
8. The recently proposed method
\(C^3\)-SemiSeg [75] achieves an F1 of 0.6383 while using \(1\%\) labeled data, clearly outperforming the other methods, including the baseline network. Our method with iterative labeling further improves on this result by a significant margin of 3.46%. The proposed method obtains similar improvements over the other methods while using
\(5\%\) labeled data. When we apply our method in the fully supervised setting, in which all data are used for both supervised and unsupervised training, our method (F1: 0.7100; IoU: 0.5712) still beats the baseline (F1: 0.6992; IoU: 0.5577).
4.5 Ablation Study
We perform ablation studies to validate the effectiveness of the proposed components on the SpaceNet3 dataset with
\(1\%\) (i.e., 20) labeled and
\(99\%\) (i.e., 1998) unlabeled training data. Table
3 reports the
\(F1\) score and road IoU of these methods. Model I is the supervised baseline model trained without training data augmentation (TA), test-time augmentation (TTA), pixel-wise contrastive loss (PCL), histogram of oriented gradients (HOG), negative sampling (NS), or self-training (ST).
Effectiveness of Histogram of Oriented Gradients (HOG). Misalignment is a challenge in pixel-wise tasks when computing between-sample similarities: in our setting, a roadway may appear anywhere in an image and at any scale. We employ a histogram-based feature (HOG) to make the contrastive loss invariant to geometric transformations. In Table 3, the comparison between Models VIII and IX suggests that the semi-supervised model with HOG descriptors is superior to the one without HOG, improving
F1 from 0.5958 to 0.6028 and road
IoU from 0.4342 to 0.4405.
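For illustration, such a region descriptor can be computed from the predicted road-probability map with an off-the-shelf HOG implementation (scikit-image); the cell size (8×8) and bin count (12) follow Sec. 4.3, while the block-normalization setting and the Euclidean distance are assumptions.

```python
import numpy as np
from skimage.feature import hog

def region_descriptor(prob_map):
    """HOG descriptor of a predicted road-probability map over one cropped region
    (8x8 cells, 12 orientation bins); lets regions be compared irrespective of
    where exactly the road falls inside the crop."""
    return hog(prob_map, orientations=12, pixels_per_cell=(8, 8),
               cells_per_block=(1, 1), feature_vector=True)

def descriptor_distance(a, b):
    """Euclidean distance between two region descriptors (distance metric assumed)."""
    return float(np.linalg.norm(a - b))
```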
Effectiveness of Negative Sampling. The proposed negative sampling method aims to prune false-negative pairs of samples during training. We evaluate and report the results of the semi-supervised model with and without negative sampling (NS). The results of Models IX and XI in Table
3 show that integrating NS can lead to improved
F1 and road
IoU.
Effectiveness of Contextual Information. To verify the effect of contextual information on the predictions of overlap regions, we compare models with and without contextual information. For the model without contextual information, we crop the same region with different data augmentations to generate positive pairs for the contrastive loss; the two cropped patches of the unlabeled image overlap completely and thus share the same contextual information. In Table
3, the comparison between Models X and XI shows that considering contextual information in the model can significantly improve F1 by 2.01% and IoU by 2.86%.
Effectiveness of Pixel-wise Contrastive Loss. Our proposed pixel-wise contrastive loss (PCL) operates at the pixel level and takes contextual information into account. It encourages the network to produce similar outputs for positive pairs and dissimilar outputs for negative pairs. In Table 3, the comparison between Model VIII and Model IV shows that utilizing PCL on unlabeled images leads to a gain of 1.67% in F1 and 2.08% in road IoU when augmentation is used for both training and testing. We observe an even larger improvement from the comparison of Model V and Model I: using PCL and negative sampling leads to a gain of 7.18% in F1 and 6.46% in road IoU. These comparisons indicate that PCL is effective for road extraction, as it helps the model learn more semantic representations and alleviates over-fitting.
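For intuition, an InfoNCE-style form of a pixel-wise contrastive loss with the temperature \(\tau = 0.07\) from Sec. 4.3 is sketched below; this is a generic formulation over pixel features, not our exact loss, which operates on HOG descriptors of the network output.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(anchor, positive, negatives, tau=0.07):
    """Generic InfoNCE over pixel features.
    anchor, positive: (N, C) features of corresponding pixels in the two
    overlapping crops; negatives: (M, C) features from the negative region."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)
    pos_sim = (anchor * positive).sum(dim=1, keepdim=True) / tau   # (N, 1)
    neg_sim = anchor @ negatives.t() / tau                         # (N, M)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    # The positive similarity sits at index 0 of every row of logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```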
Effectiveness of Iterative Labeling. Table
4 reports how the proposed method works over iterations. The method called
plain ST w/o IL is a naïve implementation of the proposed method in which the pseudo labels from the previous iteration are directly used as labeled samples in the current iteration. When applying this naïve implementation to our semi-supervised model, we observe that the
F1 and road
IoU decrease in the second iteration. In contrast, the method called
ST with IL, which uses the proposed iterative labeling strategy, can progressively improve the
F1 and road
IoU of our semi-supervised model over the first four iterations and converge at the fifth iteration. The best performance of our semi-supervised iterative approach on the
\(1\%\) labeled SpaceNet3 dataset achieves 0.6551
F1 and 0.4900 road
IoU, which are much higher than those of the supervised baseline model employing plain ST w/o IL. This set of comparisons shows that the proposed iterative labeling process can effectively prune the noisy data.
In this study, we employ the contrastive loss during the iterative labeling process to filter out low-quality pseudo labels. Our offline experiments show that directly ranking all pseudo labels based on their contrastive loss and classification loss can slightly enhance system robustness. In the previous literature, various measures have been introduced to identify and select high-quality pseudo labels when learning from raw data, including confidence scores, the diversity of selected samples, and other learning-based metrics. However, finding an optimal combination of these measures is a non-trivial problem. In this study, we focus exclusively on the proposed pixel-wise contrastive loss and empirically demonstrate its effectiveness as a measure for selecting pseudo labels. We remain open to exploring different combinations of measures for selecting high-quality pseudo labels in future research.
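A high-level sketch of ST with IL is given below. The three callables are hypothetical stand-ins for components described above, and only the contrastive-loss filter (\(thr_2 = 4.7\)) is shown; the percentile criterion (\(thr_1 = 80\%\)) and its exact combination with \(thr_2\) are not reproduced here.

```python
def self_training_with_il(train_semi, pseudo_label, pcl_score,
                          model, labeled, unlabeled, rounds=5, thr2=4.7):
    """Self-training with iterative labeling (IL). train_semi, pseudo_label, and
    pcl_score are hypothetical callables for semi-supervised training, pseudo-label
    prediction, and the pixel-wise contrastive loss of a pseudo label."""
    for _ in range(rounds):
        model = train_semi(model, labeled, unlabeled)
        kept = [(img, pseudo_label(model, img)) for img in unlabeled]
        # Prune noisy pseudo labels by their contrastive loss (thr2 = 4.7);
        # plain ST w/o IL would skip this pruning step.
        kept = [(img, m) for img, m in kept if pcl_score(model, img, m) <= thr2]
        labeled = list(labeled) + kept
    return model
```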
Effectiveness of Training Data and Test-Time Augmentation. As shown in Table
3, the supervised and semi-supervised models with TA (Model II and Model VII) have better
F1 and road
IoU than the corresponding models without TA (Model I and Model V). Model III and Model VI with TTA also perform better than Model I and Model V, respectively. When both TA and TTA are incorporated in the supervised Model IV and the semi-supervised Model XI, we obtain better results than with TA or TTA alone. These comparisons suggest that integrating TA, TTA, or both can significantly improve road extraction performance.
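A minimal TTA sketch for a binary road-segmentation model is shown below; the exact set of transforms averaged in our experiments is assumed to be flips and 90° rotations.

```python
import torch

@torch.no_grad()
def tta_predict(model, image):
    """Average road probabilities over horizontal flips and 90-degree rotations.
    image is a (C, H, W) tensor; model is assumed to return per-pixel logits."""
    preds = []
    for k in range(4):                                   # 0/90/180/270 degree rotations
        for flip in (False, True):
            x = torch.rot90(image, k, dims=(-2, -1))
            if flip:
                x = torch.flip(x, dims=(-1,))
            p = torch.sigmoid(model(x.unsqueeze(0)))[0]  # forward pass on one augmented view
            if flip:                                     # undo the augmentation on the output
                p = torch.flip(p, dims=(-1,))
            preds.append(torch.rot90(p, -k, dims=(-2, -1)))
    return torch.stack(preds).mean(dim=0)
```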
Analysis on the 1% Labeled DeepGlobe Road Dataset. We further validate the effectiveness of the proposed components on the DeepGlobe Road dataset with
\(1\%\) (i.e., 45) labeled and
\(99\%\) (i.e., 4451) unlabeled data. Table
5 reports the results of multiple variants of our method. We observe consistent improvements when using TA
\(\&\) TTA, PCL
\(\&\) HOG
\(\&\) NS, ST with IL, or their combinations. The model incorporating ST with IL achieves the best performance, with an F1 that is 9.66% higher and a road IoU that is 10.04% higher than those of the baseline method.
Analysis on the 5% Labeled SpaceNet3 Dataset. Table
6 reports the performance of our model on the SpaceNet3 dataset while using
\(5\%\) (i.e., 101) labeled and
\(95\%\) (i.e., 1917) unlabeled data. TA
\(\&\) TTA, PCL
\(\&\) HOG
\(\&\) NS, and ST with IL effectively boost model performance, which is consistent with our previous experimental results obtained from
\(1\%\) labeled SpaceNet3 data. The F1 score of our proposed semi-supervised iterative method achieves 0.6774, which outperforms the result of the supervised model trained with
\(10\%\) labeled data (0.6295) and is only 2.9% lower than that of the supervised model trained on the full dataset (0.7066).