1 Introduction

Preoperative malignancy differentiation of hepatocellular carcinoma (HCC) is important for establishing treatment strategies and predicting treatment outcomes and prognosis [1]. Computer-aided methods combined with machine learning techniques have shown great potential for providing radiologists with a second opinion on the visual diagnosis of malignancies [2]. In particular, deep features obtained by data-driven learning from samples show an excellent ability to describe tumor characteristics [3]. Although deep features based on deep learning techniques have been shown to be promising for lesion characterization, training powerful deep learning systems for this task remains challenging, because samples of each malignancy type are often limited and, in addition, images vary across the multiple scanners used in clinical practice [4].

The generative adversarial network (GAN) has been widely applied for data augmentation to improve the performance of deep learning [5]. Salimans et al. proposed a semi-supervised learning model that improves classification performance by simply adding GAN-generated samples to the real data [6]. Subsequently, Shams et al. used similar semi-supervised learning to combine a CNN and a GAN for breast cancer screening and diagnosis [7]. Mahapatra et al. combined a GAN and a CNN with a content loss that encourages the generated samples to have a different appearance from the real data, where the content loss is measured by the normalized mutual information (NMI), the L2 distance, and the intensity mean squared error between the generated and real images [8]. However, these similarity measures are closely tied to the nature of the image, may not be suitable for processing diverse images, and are also sensitive to image artifacts and noise. In addition, because NMI is expensive to compute, evaluating the content loss is very time-consuming. Therefore, it is desirable to develop new methods that accurately and efficiently increase the discrepancy of generated samples.

Furthermore, transfer learning is expected to mitigate the problem of image variability across multiple scanners in the radiology community [9]. Previous studies have shown that CNN models pre-trained on a typical computer-vision database (ImageNet) can improve the classification performance of medical image analysis [10]. Since the features of ImageNet differ significantly from those of specific medical images, a CNN model pre-trained on data from different MR scanners may be superior to one pre-trained on ImageNet for improving lesion characterization. In addition, a recent study shows that fine-tuning only the final fully connected layer can improve classification performance [11]. Meanwhile, another recent study in computer vision proposes an adaptive fine-tuning strategy based on a policy network, which makes the optimal decision on which layers are frozen and which are fine-tuned [12]. Inspired by the concept of the policy network, an adaptive fine-tuning strategy may be effective in accommodating the image variability of multiple scanners, thereby mitigating inter-scanner variation. Therefore, we hypothesize that pre-training and adaptive fine-tuning on data from different MR scanners may improve the performance of lesion characterization in clinical practice.

In this work, we propose a similarity steered generative adversarial network (SSGAN) coupled with pre-training and adaptive fine-tuning on data from multiple scanners for lesion characterization. Specifically, we introduce a similarity discriminative network that enables the GAN to efficiently generate more discrepant samples, and we adopt an adaptive fine-tuning strategy that optimally decides, for each layer, whether to use the pre-trained (frozen) weights or the fine-tuned weights. Finally, we devise hybrid loss functions to embed the similarity discriminative network, the GAN, the policy network, and the CNN into the proposed end-to-end framework for malignancy characterization of HCC.

2 Method

The architecture of our proposed method is shown in Fig. 1. The network comprises four components: a generator, a discriminator, a policy network, and a similarity discriminative network; the generator, the similarity discriminative network, and the discriminator jointly constitute the end-to-end similarity steered GAN based classification network (SSGAN). The generator is trained to generate images that cannot be distinguished by the discriminator. The discriminator is trained adversarially to determine whether the input images are real or fake. The policy network is trained jointly with the discriminator to decide which layers are frozen and which are fine-tuned. The similarity discriminative network is trained to measure the similarity between the generated and real images so as to increase the discrepancy of the generated images. We detail each component of the proposed framework in the following sections.

Fig. 1. The architecture of our proposed method.

2.1 Similarity Steered Generative Adversarial Network

Inspired by GAN [5], similarity discrimination between the generated patches and the real patches is performed by a fully convolutional network (FCN). The detailed architecture of the proposed similarity discriminative network is shown in the bottom left of Fig. 1; it consists of four down-sampling convolutional blocks, and the last convolutional layer is activated by a sigmoid function. Let \(\{X_i,y_i\}_{i=1}^N\) represent a set of N input patches, where \(X_i\) and \(y_i\) denote the patch and its label, respectively. The generator takes random noise as input and outputs a generated patch. Let G denote the generator and \(\theta _g\) its parameters; given a random noise z, the generated patch is denoted \(G(z;\theta _g)\). Whereas a standard classifier assigns an input patch to one of k possible classes, we increase the number of neurons in the last fully connected layer from k to \(k+1\). The outputs of the first k neurons are activated by the Softmax function, and the output of the \((k+1)^{th}\) neuron is activated by the sigmoid function.
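A minimal Keras sketch of such a similarity discriminative network is given below, using the \(5 \times 5\), stride-2 convolutions stated in Sect. 2.3; the patch size, filter counts, intermediate activations, and the global average pooling that reduces the sigmoid map to a scalar score are our assumptions, not the authors' exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_similarity_network(patch_size=64, channels=1):
    """FCN with four down-sampling blocks (5x5 conv, stride 2);
    the last convolutional layer is sigmoid-activated."""
    inp = tf.keras.Input(shape=(patch_size, patch_size, channels))
    x = inp
    for filters in (32, 64, 128):                   # blocks 1-3 (assumed widths)
        x = layers.Conv2D(filters, 5, strides=2, padding="same",
                          activation="relu")(x)
    x = layers.Conv2D(1, 5, strides=2, padding="same",
                      activation="sigmoid")(x)      # block 4: sigmoid map
    rp = layers.GlobalAveragePooling2D()(x)         # scalar similarity score r'
    return tf.keras.Model(inp, rp)
```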

As the discriminator both discriminates real patches from generated patches and performs lesion classification, it has two loss functions. The loss for discriminating real patches from generated ones is:

$$\begin{aligned} L_d(z,X)=-\log P(r=1|X)-\log P(r=0|G(z;\theta _g)) \end{aligned}$$
(1)

where r is the output value of the \((k+1)^{th}\) neuron, which takes the value 1 if the input is a real patch and 0 otherwise. For lesion classification, we adopt the cross-entropy loss:

$$\begin{aligned} L_c(X,y)=-\sum _{i}y^{(i)}\log P(y=y^{(i)}|X) \end{aligned}$$
(2)
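In a TensorFlow implementation, these two losses reduce to standard cross-entropies. A minimal sketch follows; the function names and the use of Keras loss objects are our own, not the authors' code:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_adv_loss(r_real, r_fake):
    """Eq. (1): r_real / r_fake are the sigmoid outputs of the
    (k+1)-th neuron for real and generated patches; the loss pushes
    r toward 1 on real patches and 0 on generated ones."""
    return (bce(tf.ones_like(r_real), r_real) +
            bce(tf.zeros_like(r_fake), r_fake))

def classification_loss(labels, class_probs):
    """Eq. (2): cross-entropy over the softmax outputs of the first
    k neurons, evaluated on real labeled patches."""
    return tf.keras.losses.SparseCategoricalCrossentropy()(labels, class_probs)
```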

Finally, the loss function of the similarity discriminative network takes the same form as that of the discriminator:

$$\begin{aligned} L_s(z,X)=-\log P(r'=1|X)-\log P(r'=0|G(z;\theta _g)) \end{aligned}$$
(3)

where \(r'\) is the output value of the last convolutional layer, which takes the value 1 if the patch is completely similar to a real patch and 0 otherwise.

As the purpose of the generator is to generate patches that look realistic while remaining different from the real ones, its loss function contains a real-looking part and a similarity discriminative part:

$$\begin{aligned} L_g(z)=-\log P(r=1|G(z;\theta _g))-\log P(r'=0|G(z;\theta _g)) \end{aligned}$$
(4)
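Eqs. (3) and (4) can be sketched in the same fashion; note how the generator's second term opposes the similarity network's second term, which is what steers generation toward realistic yet discrepant patches (again a sketch with assumed names, not the authors' code):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def similarity_loss(rp_real, rp_fake):
    """Eq. (3): the similarity network's score r' should be 1 on
    real patches and 0 on generated ones."""
    return (bce(tf.ones_like(rp_real), rp_real) +
            bce(tf.zeros_like(rp_fake), rp_fake))

def generator_loss(r_fake, rp_fake):
    """Eq. (4): the generator is rewarded for fooling the
    discriminator (r -> 1) while being judged dissimilar to the
    real data by the similarity network (r' -> 0)."""
    return (bce(tf.ones_like(r_fake), r_fake) +
            bce(tf.zeros_like(rp_fake), rp_fake))
```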

2.2 GAN-Based Adaptive Fine-Tuning

We first pre-train the discriminator of the SSGAN with the pre-training data, and then transfer its parameters to both the frozen part and the fine-tuned part shown in Fig. 1; the two parts therefore start from identical parameters. The mechanism of adaptive fine-tuning for the CNN model used in the present work is inspired by the recently proposed SpotTune approach for residual networks [12]. Let \(F_l (x_{l-1})\) and \({F'}_l (x_{l-1})\) denote the outputs of the \(l^{th}\) layer of the frozen part and of the fine-tuned part, respectively, where \(x_{l-1}\) denotes the output of the \((l-1)^{th}\) layer of the discriminator. The output of the \(l^{th}\) layer of the discriminator is:

$$\begin{aligned} x_l= I_l(x)F'_l(x_{l-1})+(1-I_l(x))F_l(x_{l-1}) \end{aligned}$$
(5)

where \(I_l(x)\) is a binary random variable that indicates, conditioned on the input image, whether the parameters of the \(l^{th}\) layer should be frozen or fine-tuned. \(I_l(x)\) is sampled from a discrete distribution with two categories (freeze or fine-tune), parameterized by the output of the policy network depicted in Fig. 1. However, \(I_l(x)\) is discrete, which makes the optimization of the network non-differentiable. The Gumbel-Max trick [13] can draw samples from a categorical distribution parameterized by \(\alpha _1,\alpha _2,...,\alpha _z\), where the \(\alpha _i\) are scalars not confined to the simplex and z is the number of categories; in the present work we consider only two categories (freeze or fine-tune), so \(z=2\). A random variable G has a standard Gumbel distribution if \(G=-\log (-\log (U))\) with U sampled from a uniform distribution, i.e. \(U \sim Unif[0,1]\). Based on the Gumbel-Max trick, a sample can be drawn from the discrete distribution parameterized by the \(\alpha _i\) as follows: we first draw \(G_1,...,G_z\) i.i.d. from Gumbel(0, 1) and then generate the discrete sample:

$$\begin{aligned} A= \arg \max _i \,[\log \alpha _i +G_i] \end{aligned}$$
(6)

The argmax operation is non-differentiable. However, the Gumbel-Softmax distribution [12, 14] can be used, which adopts Softmax as a continuous relaxation of argmax. Representing the sample A as a one-hot vector whose non-zero entry is at index A, the one-hot encoding is relaxed to a z-dimensional real-valued vector Y using Softmax:

$$\begin{aligned} Y_i=\frac{\exp ((\log \alpha _i + G_i)/\tau )}{\sum _{j=1}^{z} \exp ((\log \alpha _j +G_j )/\tau )} \qquad \text {for} \quad i=1,2,\cdots ,z \end{aligned}$$
(7)

where \(\tau \) is a temperature parameter that controls the discreteness of the output vector Y. During the forward pass, the fine-tuning policy \(I_l(x)\) is sampled using Eq. (6) and applied in Eq. (5) for the \(l^{th}\) layer. During the backward pass, the gradient of the discrete samples is approximated by the gradient of the continuous Softmax relaxation in Eq. (7). A minimal sketch of this sampling and routing is given below.
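In TensorFlow this can be written as follows; the temperature value, tensor shapes, and the straight-through estimator are our assumptions rather than the authors' exact implementation:

```python
import tensorflow as tf

def gumbel_softmax(log_alpha, tau=1.0):
    # Relaxed sampling of the freeze/fine-tune decision (Eqs. 6-7).
    # log_alpha: [batch, z] policy-network outputs (z = 2 here);
    # tau: temperature controlling how close Y is to one-hot.
    u = tf.random.uniform(tf.shape(log_alpha), minval=1e-8, maxval=1.0)
    g = -tf.math.log(-tf.math.log(u))           # standard Gumbel(0, 1) noise
    y = tf.nn.softmax((log_alpha + g) / tau)    # Eq. (7), differentiable
    # Hard argmax sample (Eq. 6) with a straight-through gradient:
    y_hard = tf.one_hot(tf.argmax(y, axis=-1), log_alpha.shape[-1])
    return y + tf.stop_gradient(y_hard - y)

def adaptive_layer(x_prev, frozen_layer, finetune_layer, log_alpha):
    # Eq. (5): route the activation through the frozen or fine-tuned
    # copy of layer l. For simplicity x_prev is assumed to be a
    # [batch, features] tensor; 4-D feature maps need a reshape.
    i_l = gumbel_softmax(log_alpha)[:, 1:2]     # I_l(x): 1 = fine-tune
    return i_l * finetune_layer(x_prev) + (1.0 - i_l) * frozen_layer(x_prev)
```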

2.3 Implementation and Training

The proposed framework was implemented with the open-source deep learning framework TensorFlow. Data augmentation based on image resampling is first adopted to generate more 2D real patches within each tumor (e.g., 125 patches per tumor) for training the deep learning framework. We first train the discriminator, the policy network, and the similarity discriminative network, and then train the generator; the policy network is trained jointly with the discriminator. Each loss is optimized by stochastic gradient descent (SGD). Training and testing were conducted on a GeForce GTX 1080 (8 GB). Each convolutional layer uses \(5 \times 5\) filters with stride 2, and each max-pooling layer uses \(2 \times 2\) kernels with stride 2. The dropout rate was 0.5. The learning rate is initialized to \(1e-4\) with a decay factor of 0.98.
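As a rough illustration of this training order, the sketch below wires tiny stand-in networks together with the losses of Eqs. (1)-(4) and SGD; the architectures, decay steps, and variable names are placeholders rather than the authors' code, and the policy network and adaptive routing of Sect. 2.2 are omitted for brevity:

```python
import tensorflow as tf

# Tiny stand-in networks keep the sketch self-contained; the real
# architectures are those of Fig. 1.
G = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="tanh"),
                         tf.keras.layers.Reshape((16, 16, 1))])
D = tf.keras.Sequential([tf.keras.layers.Flatten(),
                         tf.keras.layers.Dense(3)])  # [r, class logits], k = 2
S = tf.keras.Sequential([tf.keras.layers.Flatten(),
                         tf.keras.layers.Dense(1)])  # similarity logit r'

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
lr = tf.keras.optimizers.schedules.ExponentialDecay(1e-4, 1000, 0.98)  # steps assumed
d_opt, s_opt, g_opt = (tf.keras.optimizers.SGD(lr) for _ in range(3))

def train_step(patches, labels, noise):
    with tf.GradientTape(persistent=True) as tape:
        fake = G(noise)
        d_real, d_fake = D(patches), D(fake)
        s_real, s_fake = S(patches), S(fake)
        d_loss = (bce(tf.ones_like(d_real[:, :1]), d_real[:, :1]) +   # Eq. (1)
                  bce(tf.zeros_like(d_fake[:, :1]), d_fake[:, :1]) +
                  cce(labels, d_real[:, 1:]))                          # Eq. (2)
        s_loss = (bce(tf.ones_like(s_real), s_real) +                  # Eq. (3)
                  bce(tf.zeros_like(s_fake), s_fake))
        g_loss = (bce(tf.ones_like(d_fake[:, :1]), d_fake[:, :1]) +   # Eq. (4)
                  bce(tf.zeros_like(s_fake), s_fake))
    # Discriminator and similarity network first, then the generator.
    for loss, net, opt in ((d_loss, D, d_opt), (s_loss, S, s_opt),
                           (g_loss, G, g_opt)):
        grads = tape.gradient(loss, net.trainable_variables)
        opt.apply_gradients(zip(grads, net.trainable_variables))
```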

3 Results

3.1 Clinical Data

The performance of the proposed end-to-end framework was assessed on 154 clinical HCCs with T1-weighted MR images: 77 HCCs were acquired on a GE scanner, 37 on a Philips scanner, and 40 on a Siemens scanner. The MR images from the Philips and Siemens scanners were used separately for pre-training, and the images from the GE scanner were used for fine-tuning and the independent test. Of the 77 GE HCCs, 37 were used for fine-tuning and the remaining 40 for the independent test. Accuracy, sensitivity, and specificity were calculated to assess the performance of differentiating high-grade from low-grade HCCs. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also used to assess characterization performance. The output probabilities of the deep learning model for differentiating low-grade and high-grade HCCs in the test set were further compared using Student's t-test; \(P<0.05\) was considered statistically significant. Training and testing were repeated five times to assess the stability of the deep learning framework and reduce measurement error.
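For illustration, these metrics and the t-test can be computed as follows, assuming probs holds the model's output probabilities of high-grade HCC and y the ground-truth labels on the 40 test HCCs (a sketch, not the authors' evaluation code):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y, probs, threshold=0.5):
    """y: 0/1 labels (1 = high-grade); probs: predicted probabilities."""
    pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auc = roc_auc_score(y, probs)
    # Student's t-test on the output probabilities of the two grades
    _, p_value = stats.ttest_ind(probs[y == 1], probs[y == 0])
    return accuracy, sensitivity, specificity, auc, p_value
```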

3.2 Performance Comparison of CNN, GAN+CNN, NMI+L2+MSE, and Similarity Steered GAN

Table 1 shows the characterization performance of the proposed method and competing methods, using the 37 GE HCCs for training and the remaining 40 GE HCCs for the independent test. The GAN+CNN method [6] is superior to the traditional CNN owing to the increase in effective training data; this finding is consistent with the previous study [6], indicating that combining real data with GAN-generated data can further improve lesion characterization. The recently proposed NMI+MSE+L2 method [8] yields slightly better results than GAN+CNN because it considers the discrepancy of the generated data. The proposed similarity steered GAN clearly achieves the best results, owing to the superiority of the proposed similarity discriminative measure. Furthermore, for the same task of increasing the discrepancy of generated data for malignancy characterization of HCC, our similarity steered method requires far less time (20 min) than the recently proposed NMI+L2+MSE method (4 h).

Table 1. Characterization performance of different methods assessed on the independent test set \((\%)\).

3.3 Performance Comparison of Pre-training with ImageNet, Philips, and Siemens, and of Fine-Tuning All Layers, the Last Layer, and Adaptive Fine-Tuning Coupled with the Proposed SSGAN

Table 2 shows the characterization performance of the proposed method with different pre-training and fine-tuning strategies on the training and test sets. Among the recently proposed fine-tuning strategies, fine-tuning only the last layer [11] yields better performance than fine-tuning all layers [10], which is consistent with the previous finding [11], and the adaptive fine-tuning coupled with the proposed SSGAN clearly yields the best performance. Furthermore, the results of the different pre-training strategies demonstrate that pre-training with data from multiple scanners improves lesion characterization on data from another scanner, slightly outperforming pre-training with ImageNet [10].

Table 2. Characterization performance assessed by pre-training with ImageNet/Philips/Siemens and fine-tuning with all layers, the last layer, and adaptive fine-tuning based on our proposed SSGAN \((\%)\).

4 Conclusion

We proposed a similarity steered generative adversarial network (SSGAN) coupled with pre-training and adaptive fine-tuning on data from multiple scanners for malignancy characterization of HCC. Our experimental results showed that the proposed similarity steered GAN outperformed the conventional GAN and a recently proposed method. Furthermore, the proposed SSGAN coupled with adaptive fine-tuning yielded significantly improved performance. Finally, our study suggests that pre-training with data from multiple scanners contributes considerably to performance, outperforming pre-training with ImageNet.