
1 Introduction

Supervised deep learning algorithms typically require large numbers of labeled training examples to reach satisfactory performance. In the domain of medical imaging, however, labeled data is often very scarce; this is especially true for dense labels such as segmentations, which are particularly costly to produce. On the other hand, for many medical image analysis tasks, unlabeled data from the same or a similar distribution as the labeled data is available in abundance. This motivates the development of semi-supervised algorithms.

Many ideas have been proposed for improving the performance of deep learning algorithms for medical image analysis by utilizing unlabeled data [2]. Self-supervised approaches learn to perform an auxiliary task related to the target prediction task on unlabeled data; examples include image colorization [12] and predicting one image modality from another [5]. Auxiliary manifold embedding [1] learns to place unlabeled examples close to or far from each other and from labeled examples in the feature space, according to prior adjacency information. Self-ensembling approaches learn from synthetic labels on unlabeled data, which are constructed using past iterations of the same network [7, 9].

In this paper, we propose an approach in which we learn consistency under transformations on both labeled and unlabeled data, in addition to supervised learning from labeled data. We implement this consistency learning through a Siamese architecture trained end-to-end. The network has two identical branches, each of which receives a differently transformed version of the same image as input and is supervised to output a segmentation consistent with that of the other branch, in addition to learning from labeled images in a supervised fashion. Supervised and unsupervised consistency learning are combined through a composite objective consisting of two corresponding terms. This idea has been explored before for the prediction of image-level labels: image classification [10, 13] and landmark coordinate regression [6]. We extend this method so that it can be applied efficiently to learning to predict pixel-level labels that are consistent under spatial transformations of images. This entails including a special differentiable layer in the Siamese architecture that transforms the pixel-wise predictions of one branch so as to align them with the predictions of the second branch, allowing their pixel-wise comparison in the consistency loss term. Self-ensembling approaches [7, 9] are similar to the proposed approach in that they use a transformation consistency prior: they construct their synthetic labels on unlabeled data using predictions on differently transformed inputs. Unlike these methods, the proposed Siamese approach does not train the network to fit specific targets on unlabeled images (i.e. the synthetic labels), which are unknown and cannot be reliably estimated; it only encourages the outputs to have the desired transformation consistency property.

Fig. 1. The proposed network. The inputs are mixed batches of labeled and unlabeled images. Every image x is transformed by two random mappings \(t^{in}_1\) and \(t^{in}_2\). The label y, if available, is transformed by \(t^{out}_1\) and \(t^{out}_2\). \(t^{in}_1(x)\) and \(t^{in}_2(x)\) are fed to the two identical branches of the network. The output of one of the branches is transformed by a differentiable layer for comparison with the output of the second branch in the consistency loss \(\mathcal {C}\). The network is trained end-to-end using a combination of \(\mathcal {C}\) and a supervised loss \(\mathcal {S}\) (defined only on labeled images), as specified by Eq. 1.

We evaluate our method on the JSRT chest X-ray dataset [15, 16]. In this paper, we focus on learning equivariance to elastic deformations, although our method can be readily applied to a broader class of transformations. Through our experiments, we evaluate: (1) the contribution of learning this equivariance on labeled data (i.e. as a regularization in supervised-only learning) to the segmentation performance; (2) the contribution of adding different amounts of unlabeled data into the equivariance learning; (3) how these contributions vary with the size of the labeled portion of the training set. We compare the proposed method trained in the small data (20 labeled images) and full supervision regimes with state-of-the-art methods [3, 4, 8, 14] and the inter-observer agreement [16].

2 Method

Let \(\mathcal {X}_l\) be a set of training examples with corresponding ground truth labels \(\mathcal {Y}\) and \(\mathcal {X}_u\) be a set of unlabeled examples. Let \(\mathcal {T}\) be a distribution of tuples of mappings \((t^{in}, t^{out})\) such that applying the transformation \(t^{in}\) to any image x results in the corresponding label y being transformed into \(t^{out}(y)\), with \(t^{out}\) invertible. Let \(\mathcal {X}^\mathcal {T}_l\) be the set of all images from \(\mathcal {X}_l\) with their corresponding labels, augmented with examples \((t^{in}(x), t^{out}(y))\) for \((t^{in}, t^{out}) \sim \mathcal {T}\). We would like to find parameters \(\theta \) of a network f that optimize the following objective:

$$\begin{aligned} \min _\theta \big ( \mathcal {L}_{sup}^\mathcal {T}(\theta ) + \lambda \mathcal {L}_{cons}^\mathcal {T}(\theta ) \big ) \end{aligned}$$

with \(\mathcal {L}_{sup}^\mathcal {T}(\theta ) = \frac{1}{|\mathcal {X}^{\mathcal {T}}_l|} \sum _{(x, y) \in \mathcal {X}^{\mathcal {T}}_l}{\mathcal {S}(y, f(x; \theta ))}\) being a regular supervised loss (using \(\mathcal {T}\) as a data augmentation strategy and \(\mathcal {S}\) as an image-wise loss) and \(\mathcal {L}_{cons}^\mathcal {T}(\theta )\) being an unsupervised consistency loss defined as:

$$\begin{aligned} \mathcal {L}_{cons}^\mathcal {T}(\theta ) = \frac{1}{|\mathcal {X}|} \sum _{x \in \mathcal {X}} \mathop {\mathbb {E}}_{(t_1, t_2) \sim \mathcal {T}^2} \Big [ \mathcal {C}\Big (t^{out}_2\big ((t^{out}_1)^{-1}\big (f(t^{in}_1(x); \theta )\big )\big ),\, f(t^{in}_2(x); \theta )\Big ) \Big ], \quad \mathcal {X} = \mathcal {X}_l \cup \mathcal {X}_u. \end{aligned}$$

\(\mathcal {L}_{cons}^\mathcal {T}(\theta )\) encourages the selection of \(\theta \) that maximizes the consistency of network predictions under transformations from \(\mathcal {T}\) on \(\mathcal {X}\), as measured by the image-wise loss \(\mathcal {C}\).

We approximate the minimization of this objective by a mini-batch training scheme in which we sample a set of labeled examples \(\mathcal {B}_l\), a set of unlabeled examples \(\mathcal {B}_u\), and two transforms \(t_1, t_2 \sim \mathcal {T}\) for every example. A member of the combined batch \(\mathcal {B} = \mathcal {B}_l \cup \mathcal {B}_u\) is thus a tuple \((x, t_1, t_2)\), with or without a ground truth label y. The mini-batch objective is:

$$\begin{aligned} \frac{1}{|\mathcal {B}_l|} \sum _{(x, y, t_1, t_2) \in \mathcal {B}_l} \big ( \mathcal {S}(t^{out}_1(y), \hat{y}_1) + \mathcal {S}(t^{out}_2(y), \hat{y}_2) \big ) + \frac{\lambda }{|\mathcal {B}|} \sum _{(x, t_1, t_2) \in \mathcal {B}} \mathcal {C}\big (t^{out}_2\big ((t^{out}_1)^{-1}(\hat{y}_1)\big ), \hat{y}_2\big ) \end{aligned}$$
(1)

where \(\hat{y}_i = f(t^{in}_i(x); \theta )\) denotes the prediction for the input transformed by \(t^{in}_i\).

The first sum approximates \(\mathcal {L}_{sup}^\mathcal {T}(\theta )\) and the second approximates \(\mathcal {L}_{cons}^\mathcal {T}(\theta )\). In the rest of the paper we will abuse notation and use \(\mathcal {L}_{sup}^\mathcal {T}\) and \(\mathcal {L}_{cons}^\mathcal {T}\) to refer to the first and the second sums in Eq. 1, respectively.
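For concreteness, the following is a minimal Python sketch of how the mini-batch objective in Eq. 1 could be computed; the names model, S, C and the dictionary layout of the sampled transforms are our own illustrative assumptions, not the authors' implementation.

```python
def minibatch_objective(model, batch, lam=1.0):
    """Illustrative sketch of Eq. 1 for one mini-batch (not the authors' code).

    `batch` is a list of tuples (x, y, t1, t2), where y is None for unlabeled
    examples and each t is a dict of callables: the input transform 't_in',
    the label transform 't_out', and its inverse 't_out_inv'.
    S and C are the image-wise supervised and consistency losses from the text.
    """
    sup_terms, cons_terms = [], []
    for x, y, t1, t2 in batch:
        y1 = model(t1["t_in"](x))  # prediction of branch 1
        y2 = model(t2["t_in"](x))  # prediction of branch 2
        # Align the branch-1 prediction with branch 2 via the differentiable layer.
        y1_aligned = t2["t_out"](t1["t_out_inv"](y1))
        cons_terms.append(C(y1_aligned, y2))        # consistency on every example
        if y is not None:                           # supervised loss on labeled ones
            sup_terms.append(S(t1["t_out"](y), y1) + S(t2["t_out"](y), y2))
    sup = sum(sup_terms) / max(len(sup_terms), 1)
    cons = sum(cons_terms) / len(cons_terms)
    return sup + lam * cons
```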

An overview of the network implementing this scheme is shown in Fig. 1. The architecture has two branches with shared weights \(\theta \). For every example \((x, t_1, t_2) \in \mathcal {B}\), we feed the two differently deformed versions of x, \(t^{in}_1(x)\) and \(t^{in}_2(x)\), to the two branches of the network. The output of the first branch, \(\hat{y}_1\), is transformed into \(\tilde{y}_1 = t^{out}_2\big ((t^{out}_1)^{-1}(\hat{y}_1)\big )\) by a custom differentiable layer so as to align it with the prediction of the second branch, \(\hat{y}_2\), for pixel-wise comparison by the consistency term \(\mathcal {C}\). In addition to \(\mathcal {C}\), if x happens to be labeled, the supervised loss \(\mathcal {S}\) is applied to both \(\hat{y}_1\) and \(\hat{y}_2\). The network is thus trained end-to-end using Eq. 1 as a composite loss.

Note that since our transformation layer is differentiable, the gradient can flow through both branches. If the layer did not let the gradient through, training the network would be equivalent to applying the network to \(t^{in}_1(x)\) and using \(\tilde{y}_1\) as a target for \(t^{in}_2(x)\) (such an approach was adopted in [7]). In this case, the network is forced to update the prediction for \(t^{in}_2(x)\) to be more similar to \(\tilde{y}_1\), even if \(\tilde{y}_1\) is incorrect. With a differentiable layer, the network has the freedom to update its predictions in any way that optimizes \(\mathcal {C}\): changing the prediction for \(t^{in}_1(x)\) to be more consistent with that for \(t^{in}_2(x)\), the other way around, or changing both of them in the same direction. In other words, the proposed methodology encourages predictions to have the desired property of transformation consistency without encouraging any specific predictions for unlabeled images, which might otherwise introduce a bias when the targets for these images cannot be reliably inferred.
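In TensorFlow terms, the difference amounts to whether the aligned prediction stays inside the differentiable graph or is frozen into a fixed target; the names y1, y2 (branch outputs), warp (the aligning layer), C, and the field arguments below are hypothetical, carried over from the sketch above.

```python
import tensorflow as tf

# Proposed: gradients flow through both branches, so either prediction may move
# to improve consistency.
cons = C(warp(y1, forward_field, backward_field), y2)

# Self-ensembling-style alternative (cf. [7]): the aligned branch-1 prediction
# becomes a fixed target; only branch 2 receives gradients from the consistency term.
cons_fixed_target = C(tf.stop_gradient(warp(y1, forward_field, backward_field)), y2)
```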

In this work, we use elastic deformations as the event space for \(\mathcal {T}\). Our transformation layer takes as its inputs, in addition to the predicted segmentation, deformation fields specifying the forward and backward (inverse) transformations. The latter is necessary for backpropagating the gradients through the layer.

In principle, any transformations \(t^{in}\) and \(t^{out}\) could be implemented, as long as the inverse of \(t^{out}\) can be computed.
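As a sketch of such a layer (under our reading of the description here and of the implementation details in Sect. 3; for simplicity it takes precomputed nearest-neighbor source indices rather than the deformation fields themselves), a TensorFlow implementation could look as follows:

```python
import tensorflow as tf

@tf.custom_gradient
def warp_nearest(seg, src_idx):
    """Sketch of a differentiable nearest-neighbor warping layer (our illustration,
    not the authors' implementation).

    seg:     predicted segmentation, flattened spatially, shape (H*W, C).
    src_idx: for every output pixel, the index of the input pixel whose value it
             copies under the deformation (nearest-neighbor), shape (H*W,), int32.
    """
    out = tf.gather(seg, src_idx)  # forward pass: copy pixel values

    def grad(dy):
        # Copy each output-pixel gradient back to the input pixel it was copied
        # from; input pixels never copied in the forward pass get zero gradient.
        dseg = tf.scatter_nd(tf.expand_dims(src_idx, 1), dy, tf.shape(seg))
        return dseg, None          # no gradient with respect to the index field
    return out, grad
```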

3 Experiments

Dataset and Validation. We used the Japanese Society of Radiological Technology (JSRT) dataset [15], which contains 247 posterior-anterior chest radiographs with a resolution of \(2048 \times 2048\) pixels, a pixel size of 0.175 mm, and 12-bit depth. Segmentations of the left and right lung fields, the left and right clavicles, and the heart were made available by [16]. We created five splits of the dataset by choosing images with either even or odd IDs as the test set (\(|\mathcal {X}^{\text {test}}| = 123\) or 124), as prescribed by [16], and randomly splitting the rest into training (\(|\mathcal {X}^{\text {train}}| = 100\)) and validation portions. For every split, we sampled subsets \(\mathcal {X}^{\text {train}}_l\) of 5, 10, 25 and 50 examples (with larger subsets containing the smaller ones) from \(\mathcal {X}^{\text {train}}\) to be used as labeled portions of the training set. The remaining images from \(\mathcal {X}^{\text {train}}\) were assigned to the unlabeled portion of the training set, \(\mathcal {X}^{\text {train}}_u\), to be used by the proposed semi-supervised algorithm.

Implementation Details. We used a U-Net-like architecture [11] as the basis for the Siamese network. For every cross-validation (CV) split and labeled-unlabeled training set split, we first trained a network using the \(\mathcal {L}_{sup}^{\mathcal {T}}\) loss (i.e. in a purely supervised way). We used this network as a basis for fine-tuning four different networks: two supervised and two semi-supervised. The two supervised networks were fine-tuned using \(\mathcal {L}_{sup}^{\mathcal {T}}\) or \(\mathcal {L}_{sup}^{\mathcal {T}}+\lambda \mathcal {L}_{cons}^{\mathcal {T}}\); we refer to them as the baseline and the supervised transformation-consistent network (SupTC), respectively. The semi-supervised networks were fine-tuned using \(\mathcal {L}_{sup}^{\mathcal {T}}+\lambda \mathcal {L}_{cons}^{\mathcal {T}}\), with batches containing equal numbers of labeled and unlabeled examples (the total batch size was the same as in the supervised cases). One of the semi-supervised networks used only unlabeled images from the training set \(\mathcal {X}^{\text {train}}_u\) (dubbed SemiTC), while the other additionally used images from the corresponding validation and test sets as unlabeled data (SemiTC+). We used intersection over union (IOU) averaged over six classes (the five structures and the background) as both the supervised and the unsupervised loss term: \(\mathcal {S}(y, \hat{y}) = \mathcal {C}(y, \hat{y}) = \frac{1}{6} \sum _c \frac{\sum _i y_c^{(i)}\hat{y}_c^{(i)}}{\sum _i \big ( y_c^{(i)} + (1 - y_c^{(i)})\hat{y}_c^{(i)} \big )}\). The weight \(\lambda \) of the consistency term was arbitrarily set to 1, giving the supervised and consistency terms equal importance. The Adadelta optimizer was used for both training and fine-tuning. The images and segmentation maps were subsampled to a resolution of \(512 \times 512\) for training. The deformation fields for elastic deformations were created by randomly sampling two-dimensional displacement maps from a uniform distribution \(\mathcal {U}(-1000, 1000)\) and smoothing them with a Gaussian filter with a standard deviation of 100 pixels. Spline interpolation was applied to images and nearest-neighbor interpolation to labels and predictions. To reduce computation time, the distribution \(\mathcal {T}\) was specified as drawing, with equal probability, either the identity transform \((\text {id}_\mathcal {X}, \text {id}_{\mathcal {Y} \cup \mathcal {\hat{Y}}})\) or a random elastic deformation \((t^{in}, t^{out})\) specified by a deformation field sampled as described above. The transformation layer was implemented in TensorFlow, which allows implementing operations with custom gradients. Gradient backpropagation through the layer was implemented by copying the gradients with respect to the layer's output pixels to the positions those pixels' values were copied from in the forward pass (the forward pass being an elastic deformation with nearest-neighbor interpolation); input pixels whose values are not copied in the forward pass receive no gradient.
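For illustration, a minimal NumPy/SciPy sketch of the deformation-field sampling and application described above; the function names, boundary handling, and exact API choices are our own, not the authors'.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def sample_elastic_field(shape=(512, 512), amplitude=1000.0, sigma=100.0, rng=None):
    """Per-pixel displacements ~ U(-amplitude, amplitude), smoothed with a Gaussian
    filter of standard deviation `sigma` pixels (a sketch of the procedure above)."""
    rng = rng or np.random.default_rng()
    dr = gaussian_filter(rng.uniform(-amplitude, amplitude, shape), sigma)
    dc = gaussian_filter(rng.uniform(-amplitude, amplitude, shape), sigma)
    return dr, dc

def apply_field(image, field, order):
    """Warp an image with the field: order=3 (spline) for images,
    order=0 (nearest neighbor) for labels and predictions."""
    dr, dc = field
    rows, cols = np.meshgrid(np.arange(image.shape[0]),
                             np.arange(image.shape[1]), indexing="ij")
    return map_coordinates(image, [rows + dr, cols + dc], order=order, mode="nearest")
```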

4 Results and Discussion

Table 1 compares the four versions of the proposed network. The metric used is IOU averaged over the five anatomical structures (mIOU).

The consistency term \(\mathcal {L}_{cons}^\mathcal {T}\) improved the performance even when all training images were labeled. Although this improvement was modest, it was very reliable: \(\mathcal {L}_{cons}^\mathcal {T}\) improved mIOU in 24 experiments out of 25 (5 CV splits \(\times \) 5 training set sizes). Interestingly, SupTC networks most of the time reached a similar or higher supervised training loss \(\mathcal {L}_{sup}^{\mathcal {T}}\), while, as expected, having a lower consistency loss \(\mathcal {L}_{cons}^{\mathcal {T}}\) and a lower total loss \(\mathcal {L}_{sup}^{\mathcal {T}}+\mathcal {L}_{cons}^{\mathcal {T}}\) than their non-consistency-regularized counterparts. This rules out the hypothesis that the consistency term merely helps the network converge to a lower \(\mathcal {L}_{sup}^\mathcal {T}\) (e.g. by effectively increasing the learning rate). We believe this performance gain can be explained by the fact that image-to-segmentation mappings that are more consistent under elastic deformations are more likely to be correct, even if the resulting segmentations of the training images fit the ground truth (which might be wrong or ill-defined) less closely.

Table 1. Means and standard deviations of mIOU over the five test sets for different versions of the method (rows) and labeled training set sizes (columns). These versions are: the proposed architecture trained with \(\mathcal {L}_{sup}^{\mathcal {T}}\) only (the baseline), its consistency-regularized version (SupTC), and the proposed semi-supervised method using either only unlabeled examples from the training set (SemiTC) or additionally the validation and test sets as unlabeled examples (SemiTC+). Note that for the largest training set size SupTC is equivalent to SemiTC, since all labels are available.

The proposed semi-supervised approach SemiTC outperformed the supervised SupTC substantially when the size of the labeled training set was small (5 or 10 images). This improvement was also very consistent: SemiTC was better than SupTC in all five CV splits. For larger training set sizes, the improvement was more modest but still consistent (at least 4 out of 5 CV splits). With SemiTC+, which added validation and test images to the pool of unlabeled images for training, we achieved an additional small but consistent improvement (in 20 experiments out of 25). The comparison of the performance gains achieved by adding \(\mathcal {L}_{cons}^{\mathcal {T}}\) to the loss (i.e. the improvement of SupTC over the baseline) and introducing unlabeled images to the training (i.e. the improvement of SemiTC and SemiTC+ over SupTC) suggests that the latter is mainly responsible for the superior performance of the proposed method compared to the baseline.

Table 2. The comparison of the inter-observer agreement, state-of-the-art techniques, our baseline and SemiTC+ trained on 20 or 124(123) labeled images. The metrics reported are means and standard deviations of per-structure IOU and mean absolute contour distance (MACD) averaged over CV splits. Dai et al. [3] and Novikov et al. [8] did not report MACD.

The proposed method substantially outperformed MS-Net [14], the only weakly supervised method evaluated on JSRT that is known to us. MS-Net achieved 67% and 81% mIOU (extracted from Fig. 4 in [14]) when trained in 20% and 100% strong supervision (124(123) labeled training images) modes, respectively. (In the former case, bounding boxes and landmarks were used as labels for the remaining 80% of the images.) SemiTC reached \(87 \pm 1.5\)% mIOU in <20% supervision mode (10 labeled images for training and 10 for validation).

Table 2 compares our baseline network and the proposed SemiTC+ with the inter-observer agreement [16] and state-of-the-art chest X-ray segmentation methods [3, 4, 8]. All these methods are based on fully convolutional networks and are trained in a supervised way using at least 124(123) labeled images from the JSRT dataset. For these comparisons, we post-processed all predicted segmentations as described in [4] (small objects removal, hole filling).

Both our baseline and SemiTC+ trained using 124(123) labeled images outperformed Dai et al. [3] and Novikov et al. [8] in segmentation of all structures (even without post-processing). Both methods performed similarly to the method of Frid-Adar et al. [4], with slightly worse heart segmentation and slightly better clavicle segmentation. (Note that the network of Frid-Adar et al. [4] benefited from pre-training on ImageNet.) We reached human-level performance in lung and heart segmentation and approached it closely in clavicle segmentation, unlike all other methods, which showed a larger gap to the observers for clavicle segmentation.

SemiTC+ trained on only 20 images (10 for training and 10 for validation) reached human-level performance in lung segmentation and was only slightly worse than the observers in heart segmentation (2.6% lower IOU). Its clavicle segmentation performance was substantially worse than the observers', but only slightly worse than the automatic methods [4, 8] trained on the fully labeled dataset (2.6% and 2.2% lower IOU, respectively). This could not be achieved by purely supervised training with the small labeled set, which was substantially worse in segmentation of all structures.

5 Conclusion

We proposed a novel semi-supervised segmentation method that learns consistency under transformations. The evaluation on a public chest X-ray dataset showed that the proposed consistency regularization improved segmentation performance both when all training data was labeled and when additional unlabeled data was used for training. We achieved performance comparable to the state of the art while using more than five times fewer labeled images.