1 Introduction

While deep convolutional neural networks (CNNs) have achieved encouraging results across a number of tasks in the medical imaging domain, they frequently suffer from generalization issues caused by divergence between the source and target domains. Examples of such divergence include distribution shifts between images collected with distinct protocols, from different institutions, or from different patient groups. This can be alleviated by supervised domain adaptation (SDA) [2, 13], which adapts certain layers of a model trained with large amounts of well-labeled source data using moderate amounts of additional labeled target data. However, obtaining abundant labels in each new, unseen domain is a non-trivial and laborious process that relies heavily on skilled clinicians in the majority of clinical applications. Alternatively, unsupervised domain adaptation (UDA) [10] aims to mitigate the harmful effects of domain divergence when transferring knowledge from a supervised (labeled) source domain to an unsupervised (unlabeled) target domain. Because of its potential benefits for medical image processing, UDA of deep learning models has attracted much research attention [1, 7, 12].

Adversarial adaptation methods [5, 10] have become increasingly popular with the recent success of generative adversarial networks (GANs) [3] and their variants [14]. In medical imaging, most previous work on adversarial adaptation focuses on lesion or organ segmentation [1, 7, 12, 13]. For instance, Kamnitsas et al. [7] derive domain-invariant features with an adversarial network for brain lesion segmentation of MR images from two different datasets. GAN-based image-to-image (I2I) translation methods [14] are also widely used to generate medical images across modalities to help adaptation. For example, Zhang et al. [12] segment multiple organs in unlabeled X-ray images with labeled digitally reconstructed radiographs rendered from 3D CT volumes, using I2I translation. Zhang et al. [13] improve Cycle-GAN [14] by introducing shape-consistency for CT and MRI cardiovascular 3D image translation to help organ segmentation. Though the CT and MR images are not necessarily paired, the shape-consistency loss requires supervision in the form of pixel-wise annotations from both domains. Chen et al. [1] preserve semantic structural information of the lungs in chest radiographs (X-rays) for cross-dataset lung segmentation.

All the previous methods deal with limited domain shift, or with large organs that appear at approximately fixed positions with clear boundaries, or both. Moreover, they do not necessarily preserve class-specific semantic information about lesions or abnormalities during distribution alignment. For example, when translating an adult X-ray into a pediatric X-ray, there is no guarantee that the fine-grained disease content of the original image will be explicitly transferred. The capability of preserving class-specific semantic context across domains is crucial in medical image analysis for certain clinically relevant tasks, such as disease classification or detection [8, 9]. However, to the best of our knowledge, solutions to this problem of adversarial adaptation for medical imaging are limited.

In this paper, we present a novel framework to tackle the target task of disease recognition in cross-domain chest X-rays. Specifically, we propose a task-oriented unsupervised adversarial network (TUNA-Net) for recognizing pneumonia (whose findings on X-rays include airspace opacity, lobar consolidation, and interstitial opacity) in cross-domain X-rays. Two visually discrepant but intrinsically related domains are involved: adult and pediatric chest X-rays. The TUNA-Net consists of a cyclic I2I translation framework with class-aware semantic constraint modules. In the absence of labels from one domain, the proposed model is able to (1) synthesize “radio-realistic” (i.e., anatomically realistic synthesized radiographs) images with sufficient low-level details across the two domains, (2) preserve high-level class-specific semantic contextual information during translation, (3) regularize the learned mid-level features of real and synthetic target domains to be similar, and (4) optimize these objectives simultaneously to generalize to the unlabeled domain. We demonstrate the effectiveness of our approach on two public chest X-ray datasets with sufficient domain shift for pneumonia recognition.

2 Method

2.1 Problem Formulation

In this work, we focus on the problem of unsupervised domain adaptation, where we are given a source domain A with both images \(X_A\) (e.g., adult X-rays) and labels \(Y_A\) (e.g., normal or pneumonia), and a target domain P with only images \(X_P\) (e.g., pediatric X-rays) but no labels. The goal is to learn a classification model \(\mathcal {F}\) from images of both domains, but with only source labels, and to predict the labels in the target domain. Note that \(X_A\) are naturally unpaired with \(X_P\), as these images come from two different patient populations (adults and children).

A naive baseline method is to learn \(\mathcal {F}\) solely from source images and labels, and then apply it directly to the target domain. While \(\mathcal {F}\) performs well on data whose distribution is similar to the source data, it typically suffers degraded performance on the target data because of domain divergence. To alleviate this effect, we follow previous methods [12,13,14] and map images between the two domains (\(X_A\leftrightarrows {X_P}\)) using multi-domain I2I translation with unpaired training data. During translation, we add constraints at different levels to preserve both holistic and fine-grained class-specific image content. Consequently, the model \(\mathcal {F}\) learned on the source domain generalizes well to the target domain. The flowchart of the proposed framework for UDA is shown in Fig. 1.
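For concreteness, a minimal PyTorch sketch of the source-only baseline (the NoAdapt setting evaluated in Sect. 3) is shown below. The network choice, the data loader name, and the training schedule are illustrative assumptions, not the exact experimental configuration.

```python
import torch
import torch.nn.functional as F
import torchvision

def train_source_only(source_loader, epochs=10):
    """Source-only baseline: fit a classifier on labeled adult X-rays,
    then apply it unchanged to pediatric X-rays at test time.
    `source_loader` yields (x_a, y_a) batches and is hypothetical."""
    model = torchvision.models.resnet18(pretrained=True)  # ImageNet init
    model.fc = torch.nn.Linear(model.fc.in_features, 2)   # normal vs. pneumonia
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(epochs):
        for x_a, y_a in source_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_a), y_a)
            loss.backward()
            optimizer.step()
    return model  # evaluated as-is on the target (pediatric) domain
```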

Fig. 1. The framework of TUNA-Net. The question we investigate is whether class-specific semantics can be preserved in an I2I translation framework (e.g., Cycle-GAN [14]) to help domain adaptation when disease labels are available only in the source domain (e.g., translating an adult chest X-ray into a pediatric chest X-ray while preserving the disease semantics, i.e., normal or pneumonia). In the test phase, the model \(\mathcal {F}_P\) is applied to target pediatric images to make predictions. In this figure, for the inputs from both domains, the top two examples are normal and the bottom two have pneumonia.

2.2 Pixel-Level Image-to-Image Translation with Unpaired Images

GANs [3] have been widely used for image-to-image translation. Given unpaired images from two domains, we adopt Cycle-GAN [14] to first learn two mappings \(G_{A\rightarrow P}: X_A \rightarrow X_P\) and \(G_{P\rightarrow A}: X_P \rightarrow X_A\) with two generators \(G_{A\rightarrow P}\) and \(G_{P\rightarrow A}\), so that the discriminators \(D_P\) and \(D_A\) cannot distinguish between real images and synthetic images produced by the generators. For \(G_{A\rightarrow P}\) and its discriminator \(D_P\), the objective is expressed as the adversarial learning loss:

$$\begin{aligned} \mathcal {L}_{\text {adv}}(G_{A\rightarrow P}, D_P) = \mathbb {E}_{x_p \sim X_P}\left[ \log D_P(x_p)\right] + \mathbb {E}_{x_a \sim X_A}\left[ \log \left( 1 - D_P(G_{A\rightarrow P}(x_a))\right) \right] . \end{aligned}$$
(1)

A similar adversarial loss can be designed for the mapping \(G_{P\rightarrow A}\) and its discriminator \(D_A\) as well: i.e., \(\mathcal {L}_{\text {adv}}(G_{P\rightarrow A}, D_A)\).
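As a concrete reference, the following PyTorch sketch implements Eq. 1 in its logistic (cross-entropy) form. The module names `G_A2P` and `D_P` are illustrative assumptions, and in practice Cycle-GAN [14] substitutes a least-squares variant of this loss.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(G_A2P, D_P, x_a, x_p):
    """Adversarial loss of Eq. 1 for the A -> P mapping (a sketch).

    D_P returns real/fake logits; G_A2P synthesizes pediatric-domain
    images from adult inputs. Both modules are assumed defined elsewhere.
    """
    fake_p = G_A2P(x_a)                    # synthetic target-domain image
    # Discriminator terms: real pediatric -> 1, synthetic pediatric -> 0.
    real_logit = D_P(x_p)
    fake_logit = D_P(fake_p.detach())      # detach: do not update G here
    d_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) \
           + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    # Generator term: fool D_P into labeling the synthetic image as real.
    gen_logit = D_P(fake_p)
    g_loss = F.binary_cross_entropy_with_logits(gen_logit, torch.ones_like(gen_logit))
    return d_loss, g_loss
```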

To preserve sufficient low-level content information for domain adaptation, we then use the cycle consistency loss [14] to force the reconstructed synthetic images \(x_a'\) and \(x_p'\) to resemble their inputs \(x_a\) and \(x_p\):

$$\begin{aligned} \mathcal {L}_{\text {cyc}}(G_{A\rightarrow P}, G_{P\rightarrow A}) = \mathbb {E}_{x_a \sim X_A}\left[ ||x_a' - x_a||_1\right] + \mathbb {E}_{x_p \sim X_P}\left[ ||x_p' - x_p||_1\right] , \end{aligned}$$
(2)

where \(x_a' = G_{P\rightarrow A}(G_{A\rightarrow P}(x_a))\) and \(x_p' = G_{A\rightarrow P}(G_{P\rightarrow A}(x_p))\), and \(||\cdot ||_1\) is the \(l_1\) norm.
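Equation 2 reduces to two \(l_1\) reconstruction terms; a short sketch with the same assumed generator names as above:

```python
def cycle_consistency_loss(G_A2P, G_P2A, x_a, x_p):
    """l1 cycle-consistency loss of Eq. 2 (a sketch)."""
    x_a_rec = G_P2A(G_A2P(x_a))   # A -> P -> A reconstruction x_a'
    x_p_rec = G_A2P(G_P2A(x_p))   # P -> A -> P reconstruction x_p'
    return (x_a_rec - x_a).abs().mean() + (x_p_rec - x_p).abs().mean()
```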

The generative adversarial training with cycle-consistency enables synthesizing realistic-looking radiographs across domains. However, there is no guarantee that high-level semantics are preserved during translation. For example, when translating an adult X-ray with lung opacities, it might be converted into a normal pediatric X-ray without opacities, since the disease semantics are not explicitly modelled in the learning process.

2.3 High-Level Class-Specific Semantics Modelling

To preserve the high-level class-specific semantic information that indicates abnormalities in an image before and after translation, we propose to explicitly model disease labels in the translation framework by incorporating auxiliary classification models trained with source labels.

A source classification model \(\mathcal {F}_A\) is first learned on the labeled source data \(A=\{X_A, Y_A\}\) using a cross-entropy loss to classify C categories:

$$\begin{aligned} \mathcal {L}_{\text {cls}}(\mathcal {F}_A, A) = -\mathbb {E}_{a \sim A}\sum _{c=1}^C \mathbbm {1}_{c} \log \left( \sigma (\mathcal {F}_A^{(c)}(x_a)) \right) , \end{aligned}$$
(3)

where \(\sigma \) is the softmax function, and \(\mathbbm {1}_c=1\) if the input image \(x_a\) belongs to class \(c\in C\), otherwise \(\mathbbm {1}_c=0\). We then enforce the learned \(\mathcal {F}_A\) to perform similarly on the reconstructed source data \(A'\) by minimizing \(\mathcal {L}_{\text {cls}}(\mathcal {F}_A, A')\). In this way, the high-level class-specific content is preserved within the source \(\rightarrow \) target \(\rightarrow \) source cycle.
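In PyTorch, Eq. 3 reduces to the standard softmax cross-entropy; a minimal sketch, with `classifier` standing in for \(\mathcal {F}_A\):

```python
import torch.nn.functional as F

def classification_loss(classifier, images, labels):
    """Softmax cross-entropy of Eq. 3; the classifier outputs C logits."""
    logits = classifier(images)             # shape (batch, C)
    return F.cross_entropy(logits, labels)  # combines log-softmax and NLL

# The same criterion is reused on the reconstructed source images x_a'
# with their original labels y_a, i.e., L_cls(F_A, A').
```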

To retain similar semantics within the target \(\rightarrow \) source \(\rightarrow \) target cycle in the absence of target labels \(Y_P\), we learn a target classification model \(\mathcal {F}_P\) (fine-tuned from \(\mathcal {F}_A\)) on the synthetic target images \(\hat{P} = \{G_{A\rightarrow P}(X_A), Y_A\}\) to minimize \(\mathcal {L}_{\text {cls}}(\mathcal {F}_P, \hat{P})\), in the meantime minimizing \(\mathcal {L}_{\text {cls}}(\mathcal {F}_P, P')\) on the reconstructed target images \(P'\), so that the classifiers in both domains produce consistent predictions to keep semantic consistency. The total semantic classification loss is:

$$\begin{aligned} \mathcal {L}_{\text {sem}} = \mathcal {L}_{\text {cls}}(\mathcal {F}_A, A) + \mathcal {L}_{\text {cls}}(\mathcal {F}_A, A') + \mathcal {L}_{\text {cls}}(\mathcal {F}_P, \hat{P}) + \mathcal {L}_{\text {cls}}(\mathcal {F}_P, P'). \end{aligned}$$
(4)

By modelling disease labels in the translation network, the synthesized images maintain meaningful semantics that favor the clinically relevant target task. For instance, \(\mathcal {F}_P\) can act as a disease classifier on the target domain.
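A hedged sketch of these semantic terms follows. The synthetic pediatric images inherit the adult labels, so \(\mathcal {F}_P\) can be trained with the same cross-entropy; the prediction-consistency term on the unlabeled target cycle is one plausible instantiation of the consistency constraint described above, not necessarily the exact formulation.

```python
import torch.nn.functional as F

def semantic_loss(F_A, F_P, G_A2P, G_P2A, x_a, y_a, x_p):
    """Sketch of the semantic classification loss of Eq. 4 (names assumed)."""
    fake_p = G_A2P(x_a)        # synthetic pediatric image; label y_a carries over
    rec_a = G_P2A(fake_p)      # reconstructed adult image x_a', still labeled y_a
    loss = F.cross_entropy(F_A(x_a), y_a)            # L_cls(F_A, A)
    loss = loss + F.cross_entropy(F_A(rec_a), y_a)   # L_cls(F_A, A')
    loss = loss + F.cross_entropy(F_P(fake_p), y_a)  # L_cls(F_P, P_hat)
    # Unlabeled target cycle: encourage F_P to predict consistently on a real
    # pediatric image and its reconstruction x_p' (assumed consistency term).
    rec_p = G_A2P(G_P2A(x_p))
    loss = loss + F.kl_div(F.log_softmax(F_P(rec_p), dim=1),
                           F.softmax(F_P(x_p), dim=1), reduction='batchmean')
    return loss
```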

2.4 Mid-Level Feature Regularization

Inspired by the perceptual loss [6], which encourages images before and after translation to be perceptually similar, we impose a feature reconstruction loss to encourage the real target images \(X_P\) and the synthetic target images \(\hat{X}_P = G_{A\rightarrow P}(X_A)\) to be similar in the feature space. In our experiments, applying this feature regularization to middle layers of the CNN during training also tends to generate images that are visually indistinguishable from the target domain. The feature reconstruction loss is the normalized Euclidean distance between feature representations:

$$\begin{aligned} \mathcal {L}_{\text {feat}}(\mathcal {F}_P) = \sum _i \frac{\Vert f_i - \hat{f}_i\Vert _2^2}{H_iW_iC_i}, \end{aligned}$$
(5)

where i indexes the convolutional blocks of the target model \(\mathcal {F}_P\), and \(f_i\) and \(\hat{f}_i\) are the feature maps of size \(H_i \times W_i \times C_i\) output by the \(i^{th}\) convolutional block for a real and a synthetic target image, respectively.
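A sketch of Eq. 5; the feature maps could be collected, for example, with forward hooks on the chosen blocks of \(\mathcal {F}_P\) (the hook mechanics are assumed, not shown):

```python
def feature_reconstruction_loss(feats_real, feats_fake):
    """Normalized squared l2 distance of Eq. 5, summed over conv blocks.

    feats_real / feats_fake: lists of feature maps of shape (N, C_i, H_i, W_i)
    from F_P on real and synthetic target images, e.g., conv_3 and conv_4.
    """
    loss = 0.0
    for f, f_hat in zip(feats_real, feats_fake):
        n, c, h, w = f.shape
        loss = loss + ((f - f_hat) ** 2).sum() / (c * h * w) / n  # per-image norm
    return loss
```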

2.5 Final Objective and Implementation Details

The final objective of the TUNA-Net is the sum of the adversarial learning losses, the cycle-consistency loss, the semantic classification loss and the feature reconstruction loss:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\text {adv}}(G_{A\rightarrow P}, D_P) + \mathcal {L}_{\text {adv}}(G_{P\rightarrow A}, D_A) + \lambda \mathcal {L}_{\text {cyc}}(G_{A\rightarrow P}, G_{P\rightarrow A}) + \mathcal {L}_{\text {sem}} + \mathcal {L}_{\text {feat}}(\mathcal {F}_P). \end{aligned}$$
(6)

Driven by the target task of disease recognition, optimizing this objective yields the adapted target model \(\mathcal {F}_P\).

We adopt Cycle-GAN [14] for training the I2I translation framework. We use 9 residual blocks [4] in the generator networks for input X-ray images of size 512 \(\times \) 512. For the source classification network \(\mathcal {F}_A\), we use an ImageNet pre-trained ResNet with 18 layers [4] as a trade-off between performance and GPU memory usage. The target classification model \(\mathcal {F}_P\) is fine-tuned from the source model \(\mathcal {F}_A\) and hence has the same network structure as \(\mathcal {F}_A\). Feature maps from conv_3 (56 \(\times \) 56 \(\times \) 128) and conv_4 (28 \(\times \) 28 \(\times \) 256) are extracted from \(\mathcal {F}_P\) as mid-level feature representations to calculate the reconstruction loss. \(\lambda \) in Eq. 6 is set to 10 as in [14]. All other networks are trained from scratch with a batch size of 1 and an initial learning rate of 0.0002 for the first 100 epochs, which then linearly decays to 0 over the next 100 epochs. All network components are optimized using the Adam solver. The TUNA-Net is implemented in the PyTorch framework. All experiments are run on a 32 GB NVIDIA Tesla V100 GPU.
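The stated optimization schedule (Adam, batch size 1, learning rate 0.0002 held for 100 epochs and then linearly decayed to 0 over 100 more) can be sketched as follows. The momentum terms `betas=(0.5, 0.999)` follow the common Cycle-GAN convention and are an assumption, as are the stub modules.

```python
import itertools
import torch

# Stub modules so the sketch is self-contained; the real networks are the
# Cycle-GAN generators/discriminators and the ResNet-18 classifiers.
G_A2P, G_P2A = torch.nn.Conv2d(1, 1, 3, padding=1), torch.nn.Conv2d(1, 1, 3, padding=1)
D_A, D_P = torch.nn.Conv2d(1, 1, 3, padding=1), torch.nn.Conv2d(1, 1, 3, padding=1)

opt_G = torch.optim.Adam(itertools.chain(G_A2P.parameters(), G_P2A.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(itertools.chain(D_A.parameters(), D_P.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))

def linear_decay(epoch, total=200, decay_start=100):
    """LR multiplier: 1.0 for the first 100 epochs, then linear decay to 0."""
    return 1.0 - max(0, epoch - decay_start) / float(total - decay_start)

sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=linear_decay)
sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lr_lambda=linear_decay)
# After each epoch: sched_G.step(); sched_D.step()
```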

3 Experiments

Material and Settings: We extensively evaluate the proposed TUNA-Net for unsupervised domain adaptation on two public chest X-ray datasets containing normal and pneumonia frontal-view X-rays: an adult chest X-ray dataset used in the RSNA Pneumonia Detection Challenge (a subset of the NIH Chest X-ray 14 dataset [11]) and a pediatric chest X-ray dataset from the Guangzhou Women and Children’s Medical Center in China. We set the adult dataset as the source domain and the pediatric dataset as the target domain. For the adult dataset, we use 6993 normal X-rays and 4659 X-rays with pneumonia. For the pediatric dataset, we use 5232 X-rays (either normal (n = 1349) or abnormal with pneumonia (n = 3883), with the labels removed in our setting) for training and validation. The combined datasets are used to train the adult \(\leftrightarrows \) pediatric translation framework, and 5-fold cross-validation is performed. Classification performance of the proposed adaptation method is evaluated on a hold-out test set of 624 pediatric X-rays (normal: 234, pneumonia: 390) from the target domain.

Reference Methods: Although unsupervised adversarial domain adaptation methods exist in the medical imaging field, they are mainly designed for segmentation. Here we compare the performance of our proposed TUNA-Net with the following five relevant reference models:

1. NoAdapt: A ResNet-50 [4] CNN trained on adult X-rays is applied directly to the pediatric X-rays for pneumonia prediction. This serves as a lower bound.

2. Cycle-GAN [14]: I2I translation using [14] without considering the labels indicating diseases in the X-rays. A model trained on labeled real adult X-rays is applied to synthetic adult X-rays generated from pediatric X-rays.

3. ADDA [10]: First, we train an adult classification network with labeled X-rays. Then we adversarially learn a target encoder CNN such that a domain discriminator cannot differentiate between the source and target domains. During testing, pediatric images are mapped with the target encoder into the shared feature space of the source adult domain and classified by the adult disease classifier.

4. CyCADA [5]: It improves upon ADDA by incorporating cycle consistency at both the pixel and feature levels.

5. Supervised: We assume that disease labels for the target domain are accessible, so a supervised model can be trained and tested on the labeled target domain. This serves as an upper bound.

Table 1. Comparison of normal versus pneumonia classification results on the test set of the pediatric X-ray dataset.

Quantitative Results and Ablation Studies: We calculate the Area Under the Receiver Operating Characteristic Curve (AUC), accuracy (Acc.), sensitivity (Sen.), specificity (Spec.) and F1 score to evaluate the classification performance of our model. The validation set is used only to optimize the threshold via Youden’s index (i.e., \(max(\text {Sen.}+\text {Spec.}-1)\)) for normal versus pneumonia classification. The classification results of our TUNA-Net and the reference methods are shown in Table 1. The baseline method without adaptation (NoAdapt) performs poorly on the target task of pediatric pneumonia recognition, even though the source classifier excels at pneumonia recognition on adult chest X-rays (AUC = 98.0%). This demonstrates that the gap between the source and target domains is fairly large, although they share the same disease labels. Cycle-GAN does not consider disease labels during I2I translation: it generates X-rays without preserving high-level semantics, so many normal adult X-rays are converted into pediatric X-rays with opacities in the lungs, and adults with lung opacities are converted into normal pediatric X-rays. This substantially degrades the adaptation performance on the classification task, where correct labels are crucial. Our full TUNA-Net, which models high-level class-specific semantics, achieves an AUC of 96.3% with both sensitivity and specificity above 91%. It outperforms both ADDA and CyCADA under similar settings. It is also worth noting that the performance of TUNA-Net is very close to that of the supervised model, for which labeled training images on the target dataset are available.

We ablate different modules of the TUNA-Net to measure their influence on the final model: (a) we exclude the feature reconstruction loss in the target classification model; (b) we do not use reconstructed images to retrain the source classification model; (c) we exclude the target classification model \(\mathcal {F}_P\) from training and instead use the synthetic images to train it offline. As shown in Table 1, each component contributes to the final TUNA-Net. The online end-to-end learning of \(\mathcal {F}_P\) together with the other components is crucial and contributes most to the performance improvement.
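The threshold selection via Youden’s index can be sketched with scikit-learn’s ROC utilities; the variable names are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Operating point maximizing Sen. + Spec. - 1 = TPR - FPR on validation."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[int(np.argmax(tpr - fpr))]

# Hypothetical usage: tune on validation scores, then binarize test scores.
# thr = youden_threshold(y_val, p_val)
# y_pred = (p_test >= thr).astype(int)
```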

Fig. 2. Qualitative comparison of image-to-image translation. Cycle-GAN is trained without using the labels indicating normal or pneumonia, while CyCADA and our TUNA-Net use the source-domain labels during training. The left part shows adult \(\rightarrow \) pediatric translation, the right part pediatric \(\rightarrow \) adult. The first row shows two normal X-rays as input. The appearances of pneumonia are indicated by arrows. Please refer to the supplementary material for higher resolution images.

Qualitative Results: We show some qualitative image-to-image translation examples in Fig. 2. Cycle-GAN fails to preserve important semantic information during transfer. CyCADA is able to preserve certain high-level semantics, but not as robustly as the proposed TUNA-Net. TUNA-Net retains image content at various levels: from low-level content and mid-level features to high-level semantics. For example, for the bottom-left adult input, Cycle-GAN removes the pathology while our TUNA-Net faithfully preserves it. The X-rays synthesized by TUNA-Net are closest to the input source images semantically and to the target domain anatomically.

Discussion: We specifically focused on normal versus pneumonia classification in a cross-domain setting. We showed that the I2I translation framework can be constrained by semantic classification components to preserve class-specific disease content during medical image synthesis. We used two public chest X-ray datasets with sufficient domain shift to demonstrate the ability of our unsupervised domain adaptation method. The domain adaptation from adult to pediatric chest X-rays is natural and intuitive. For example, medical students and radiology residents learn in a similar way: they first learn to read adult chest X-rays, and then they transfer the learned knowledge to pediatric X-rays.

4 Conclusion

In this paper, we investigated how knowledge about class-specific labels can be transferred from a source domain to an unlabeled target domain for unsupervised domain adaptation. Using adversarially learned cross-domain image-to-image translation networks, we found clear evidence that semantic labels can be translated across medical image domains. The proposed TUNA-Net is general and has the potential to be extended to more disease classes (e.g., pneumothorax), other imaging modalities (such as CT and MRI) and more clinically relevant tasks.