
1 Introduction

Ultrasonography has become the preferred method for the clinical diagnosis of thyroid nodules [1]. The Thyroid Imaging Reporting and Data System (TI-RADS) [2] provides standardised terminology for describing thyroid nodule features in ultrasound B-mode images, including boundary, calcification, echo pattern, etc. Ultrasound elastography reflects the hardness of lesion tissue through the colour distribution ratio [3]. In clinical diagnosis, radiologists generally consider these two kinds of information together before giving a diagnostic result. However, manual diagnosis not only imposes a heavy workload but also suffers from inter-observer variability. Therefore, reducing the workload and misdiagnosis rate of radiologists through computer-aided diagnosis is important.

Recently, deep learning has achieved great success in the diagnosis of ultrasound thyroid nodules. Among these approaches, semi-supervised learning has received increasing attention because it can greatly reduce the burden of manual annotation by learning from unlabelled data [4, 5]. Wang et al. [4] proposed a semi-supervised learning model based on weakly labelled data for the diagnosis of thyroid nodules. Ding et al. [5] used multiple-instance learning (MIL) to extract pre-designed features from ultrasound B-mode and elasticity data and applied an SVM for classification. Although these methods produce commendable results, the inherently small sample size of medical image datasets may make the semi-supervised learning process unstable, leading to mixed results and even degraded model performance. Therefore, ensuring stable and advanced diagnostic performance when performing semi-supervised learning on a small medical dataset with insufficient labels remains a challenging issue.

To address these challenges, we propose the DScGANS model, which combines DScGAN (dual-path semi-supervised conditional generative adversarial network) with S3VM (semi-supervised support vector machine) and is built from semi-supervised generative adversarial networks (sGAN) [6], conditional generative adversarial networks (cGAN) [7] and S3VM [8]. The uniqueness of our model includes the following: (1) We quantify the thyroid nodule features described by TI-RADS in the B-mode ultrasound image and the tissue hardness information in the elasticity image and use them as domain knowledge (DK) to promote semi-supervised learning. (2) DK acts as a conditional constraint on the generator to produce higher-quality images. When DK is connected with the deep features of labelled data in the discriminator network, the correlation between DK, deep features and class labels is learned; this correlation then acts as a conditional constraint on unlabelled data when DK is connected with their deep features. Thus, the ambiguity of unlabelled data is significantly reduced, enabling DScGAN to use unlabelled data for stable, high-performing feature extraction. (3) DK is used as a condition to constrain S3VM for thyroid nodule classification. Because S3VM combines powerful regularization-based SVMs with the cluster assumption [9], when DK and the deep features learned by DScGAN are concatenated as the input of S3VM, the conditional constraint makes unlabelled data of different classes more inclined toward their own clusters, which greatly improves the classification results. The experimental results show that our proposed model provides a stable 90.5% accuracy and 91.4% AUC using only 35% of the labelled data and can potentially be integrated into thyroid ultrasound systems.

2 Datasets

We collected 3090 fully anonymised thyroid nodule lesion images from 1030 patients of different ages (all older than 18 years) at a co-operating hospital and ensured that nodules of different sizes were included; the average nodule size is 2.4 cm. Each lesion is represented by two types of images, B-mode and elasticity, corresponding to the same lesion area. The 3090 pairs of images comprise 1601 pairs of benign nodules and 1489 pairs of malignant nodules. Each pair is labelled as benign or malignant by two senior radiologists based on pathological examination results. In total, 6180 images are obtained, with each image type accounting for 50%.

3 Methods

Our proposed DScGANS model is shown in Fig. 1. The model has three steps: (1) image pre-processing and augmentation; (2) acquiring DK and then training the DScGAN components (DScGAN-A/B) with the B-mode and elasticity data as input under the constraint of DK; (3) saving the discriminator network as a feature extraction network and concatenating the output of the penultimate fully connected layer (which includes both deep features and DK) as input to S3VM for thyroid nodule classification.

Fig. 1.

Framework of our proposed DScGANS model, where fc denotes a fully connected layer, fdr denotes a feature dimension reduction layer, and jln denotes a joint learning network.

3.1 Image Preprocessing and Augmentation

From each B-mode image, we obtain a square region of interest (ROI) using the method in [10] and define it as the OB patch. We also obtain an elasticity ROI according to the B-mode ROI position and define it as the OE patch. To characterise the shape of the nodule, we use the live-wire algorithm [11] to semi-automatically segment the nodule area in the OB patch with the help of an experienced radiologist; we then set the pixel values of the nodule region to 255 and those of the non-nodule region to 0 and define the result as the OS patch. The OB, OE and OS patches are resized to 128 × 128. We apply data augmentation (random rotation and horizontal and vertical flipping) to enlarge each patch set, forming three datasets. Note that the OS patch is used only to automatically extract DK, while the OB and OE patches are the inputs of DScGAN-A and DScGAN-B.
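The augmentation step can be sketched as follows; this is a minimal NumPy illustration that assumes rotations restricted to multiples of 90°, since the paper does not specify the rotation angles used:

```python
import numpy as np

def augment_patch(patch, rng):
    """One random augmentation of a patch: a rotation by a multiple of
    90 degrees plus optional horizontal and vertical flips."""
    out = np.rot90(patch, rng.integers(0, 4))  # 0, 90, 180 or 270 degrees
    if rng.integers(0, 2):
        out = np.fliplr(out)                   # horizontal flip
    if rng.integers(0, 2):
        out = np.flipud(out)                   # vertical flip
    return out

rng = np.random.default_rng(0)
patch = np.arange(128 * 128).reshape(128, 128)  # stand-in for a resized OB patch
aug = augment_patch(patch, rng)
print(aug.shape)  # (128, 128): the patch size is preserved
```

Applied repeatedly to each OB, OE and OS patch, such transforms enlarge the three datasets without altering the nodule content.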

3.2 Automatic Acquisition of Domain Knowledge

According to TI-RADS [2] and the elastography score [3], and after discussion with experienced radiologists, we calculate the following DK.

  1. Dimension. First, the OS patch is fitted with an ellipse [12]. Of the two axes of the fitted ellipse, the axis closer to the horizontal is defined as the horizontal axis, with its length denoted \( L_{h} \); the other axis is defined as the vertical axis, with its length denoted \( L_{v} \). \( L_{h} \) and \( L_{v} \) are used as the dimension values.

  2. Orientation. The growth direction is important for reflecting the shape of the nodule. We use the rotation angle of the fitted ellipse as the orientation value.

  3. Aspect ratio. The aspect ratio of malignant nodules is likely to be greater than 1 [2]. The ratio of \( L_{v} \) to \( L_{h} \) is used as the aspect ratio.

  4. Boundary feature. The boundary can be described as clear or unclear. Based on the nodule area (pixel value 255) and surrounding tissue (pixel value 0) in the OS patch, we use acutance [13] to calculate the nodule boundary feature in the OB patch.

  5. Echo pattern. The pattern can be described as hypoechoic, hyperechoic, or isoechoic. The nodule area is located with the OS patch; the ratio of the grey value of the nodule area to that of the surrounding tissue area in the OB patch is used as the echo pattern value.

  6. Cystic or solid. A cystic nodule is more likely to be benign. The nodule area is located with the OS patch, and the average grey value of the nodule area in the OB patch is used to determine whether the nodule is cystic or solid.

  7. Calcification. Calcification can be described as coarse calcification or micro-calcification. The nodule area is located with the OS patch, and the calcification information of the nodule area in the OB patch is calculated using the method in [13].

  8. Elastography colour distribution. The colour distribution is quantised by calculating the RGB colour histogram inside the image [14]. For each OE patch, a 1 × 64-dimensional vector is obtained as the elastography colour distribution value.

Finally, the DK dimensions of the B-mode and elasticity patches are 1 × 8 and 1 × 64, respectively.
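As a concrete illustration of the elastography colour distribution (item 8), the 1 × 64 vector can be computed as a joint RGB histogram with 4 levels per channel (4 × 4 × 4 = 64 bins); the uniform quantisation below is an assumption, as the exact binning of [14] is not reproduced here:

```python
import numpy as np

def rgb_histogram64(patch):
    """1 x 64 colour histogram of an OE patch: each RGB channel is
    quantised into 4 levels, giving 4 * 4 * 4 = 64 joint bins."""
    q = patch // 64                          # map 0..255 -> 0..3 per channel
    idx = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]  # joint bin index 0..63
    hist = np.bincount(idx.ravel(), minlength=64).astype(float)
    return hist / hist.sum()                 # normalise to a distribution

rng = np.random.default_rng(1)
oe_patch = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
h = rgb_histogram64(oe_patch)
print(h.shape)  # (64,)
```

Normalising the histogram makes the DK vector invariant to patch size, which is convenient when the ROI dimensions vary before resizing.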

3.3 DScGAN Component

The DScGAN component is built by combining sGAN [6] and cGAN [7].

Generator:

We use \( G \) to denote the generator, which consists of a fully connected layer and five deconvolution layers, with parameters \( \varvec{\theta}_{g} \). Each input random noise vector \( \varvec{z} \) has a corresponding DK condition \( \varvec{Dk} \), obtained from the real image corresponding to the fake image. \( G \) uses both to generate an image patch \( G\left( {\left( {\varvec{z},\varvec{Dk}} \right);\varvec{\theta}_{g} } \right) \).

Discriminator:

We use \( D \) to denote the discriminator, which consists of six convolutional layers, a feature dimension reduction layer (fdr), a joint learning network (jln) and a fully connected layer, with parameters \( \varvec{\theta}_{d} \). The fdr layer is a fully connected layer with 16 neurons in DScGAN-A (128 in DScGAN-B) that reduces the dimension of the deep features before they are concatenated with DK; our experiments show that high-dimensional deep features overwhelm the low-dimensional DK if the two are connected directly. After the DK and deep features are concatenated, they are fed into the jln, which consists of two fully connected layers and learns the correlation between DK, deep features and the class labels of labelled data. Assuming that real image patches have \( k \) classes, we define the image patches generated by \( G \) as the \( \left( {k + 1} \right) \)-th class. Thus, the last fully connected layer has \( \left( {k + 1} \right) \) neurons, of which the first \( k \) are softmax neurons and the \( \left( {k + 1} \right) \)-th is a sigmoid neuron. An input image patch \( I \) may come from a real labelled image \( data\left( {x,y} \right) \), a real unlabelled image \( data\left( x \right) \), or a generated fake image \( G\left( {\left( {{\mathbf{z}},\varvec{Dk}} \right);\varvec{\theta}_{g} } \right) \). We use the first \( k \) neurons to calculate the probability that the input \( I \sim data\left( {x,y} \right) \) belongs to class \( j \):

$$ P_{D} \left( {y = j |\left( {I,\varvec{Dk}} \right)} \right) = D^{\left( j \right)} \left( {\left( {I,\varvec{Dk}} \right);\varvec{\theta}_{d} } \right) $$
(1)

where \( y \) is the class label of input \( I \). We use the \( \left( {k + 1} \right) \)-th neuron to calculate the probability that the input \( I \) comes from a real image \( data\left( x \right) \) or a fake image \( G\left( {\left( {{\mathbf{z}},\varvec{Dk}} \right);\varvec{\theta}_{g} } \right) \). We define the output of the \( \left( {k + 1} \right) \)-th neuron as \( D^{{\left( {k + 1} \right)}} \left( {\left( {I,\varvec{Dk}} \right);\varvec{\theta}_{d} } \right) \):

$$ P_{D} \left( {l = 0 |\left( {I,\varvec{Dk}} \right)} \right) = D^{{\left( {k + 1} \right)}} \left( {\left( {I,\varvec{Dk}} \right);\varvec{\theta}_{d} } \right) $$
(2)

If \( I \) is from a fake image, the value of variable \( l \) is 0; otherwise the value is 1.
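The \( \left( {k + 1} \right) \)-neuron output head described above can be sketched in NumPy as follows, assuming raw logits as input; softmax over the first \( k \) logits yields the class probabilities of Eq. (1), and a sigmoid on the last logit yields Eq. (2):

```python
import numpy as np

def discriminator_head(logits):
    """Output head of D: softmax over the first k logits gives the
    class probabilities of Eq. (1); a sigmoid on the (k+1)-th logit
    gives P_D(l = 0 | (I, Dk)) as in Eq. (2)."""
    class_logits, fake_logit = logits[:-1], logits[-1]
    e = np.exp(class_logits - class_logits.max())  # stable softmax
    p_class = e / e.sum()                          # first k softmax neurons
    p_fake = 1.0 / (1.0 + np.exp(-fake_logit))     # (k+1)-th sigmoid neuron
    return p_class, p_fake

# k = 2 (benign/malignant) plus one fake-detection logit
p_class, p_fake = discriminator_head(np.array([2.0, 0.5, -1.0]))
print(p_class.sum())  # class probabilities sum to 1
```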

3.4 Training

To train the DScGAN component, we use the following loss functions. For the input random noise vector \( \varvec{z} \), the generator loss function is:

$$ L_{G} = - E_{{I \sim G\left( {\left( {{\mathbf{z}},\varvec{Dk}} \right),\varvec{\theta}_{g} } \right)}} { \log }\left\{ {P_{D} \left( {l = 1 |\left( {I,\varvec{Dk}} \right)} \right)} \right\} $$
(3)

The loss function of the discriminator consists of two parts: the supervised loss \( L_{D\_su} \) and the unsupervised loss \( L_{D\_un} \). \( L_{D\_un} \) is the loss for distinguishing whether input \( I \) comes from a real or fake image, and \( L_{D\_su} \) is the loss for classifying input \( I \) into class \( j \). Given an input \( I \), the loss function of the discriminator is:

$$ L_{D} = L_{D\_su} + L_{D\_un} $$
(4)
$$ \text{where,}\quad L_{D\_su} = - E_{{I \sim data\left( {x,y} \right)}} { \log }\left\{ {P_{D} \left( {y = j |\left( {I,\varvec{Dk}} \right)} \right)} \right\} $$
(5)
$$ L_{D\_un} = - E_{I \sim data\left( x \right)} \log \left\{ {P_{D} \left( {l = 1 |\left( {I,\varvec{Dk}} \right)} \right)} \right\} - E_{{I \sim G\left( {\left( {{\mathbf{z}},\varvec{Dk}} \right),\varvec{\theta}_{g} } \right)}} { \log }\left\{ {P_{D} \left( {l = 0 |\left( {I,\varvec{Dk}} \right)} \right)} \right\} $$
(6)
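Given the discriminator outputs, the loss terms of Eqs. (3), (5) and (6) follow directly, using \( P_{D} \left( {l = 1| \cdot } \right) = 1 - P_{D} \left( {l = 0| \cdot } \right) \); a NumPy sketch for a single batch:

```python
import numpy as np

def gan_losses(p_fake_on_real, p_fake_on_fake, p_class_true):
    """Batch values of the loss terms in Eqs. (3), (5) and (6), given
    p_fake_* = P_D(l = 0 | .) and, for labelled inputs,
    p_class_true = P_D(y = true class | .)."""
    eps = 1e-12  # numerical safety for log(0)
    l_g = -np.mean(np.log(1.0 - p_fake_on_fake + eps))        # Eq. (3)
    l_d_su = -np.mean(np.log(p_class_true + eps))             # Eq. (5)
    l_d_un = (-np.mean(np.log(1.0 - p_fake_on_real + eps))    # real: l = 1
              - np.mean(np.log(p_fake_on_fake + eps)))        # fake: l = 0, Eq. (6)
    return l_g, l_d_su, l_d_un

l_g, l_su, l_un = gan_losses(
    p_fake_on_real=np.array([0.1, 0.2]),
    p_fake_on_fake=np.array([0.8, 0.7]),
    p_class_true=np.array([0.9, 0.6]),
)
print(l_g > 0 and l_su > 0 and l_un > 0)  # True
```

The generator is rewarded when its patches are scored as real, while the discriminator is rewarded both for correct class predictions on labelled data and for separating real from generated patches.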

S3VM Classifier:

S3VM [8] is used as the classifier, given its good performance in binary classification tasks. We denote by \( \varvec{F}_{\alpha } \) and \( \varvec{F}_{\beta } \) the outputs of the penultimate fully connected layers of DScGAN-A and DScGAN-B, which contain their respective deep features and DK. We concatenate \( \varvec{F}_{\alpha } \) with \( \varvec{F}_{\beta } \) and define the result as the feature vector \( \varvec{F}_{\alpha *\beta } \); thus, DK acts as a conditional constraint on unlabelled data during S3VM classification. Let \( \varvec{F}_{\alpha *\beta }^{\left( i \right)} \) denote the \( i \)-th vector in \( \varvec{F}_{\alpha *\beta } \), \( L = \left\{ {\left( {\varvec{F}_{\alpha *\beta }^{\left( 1 \right)} ,y_{1} } \right), \ldots ,\left( {\varvec{F}_{\alpha *\beta }^{\left( m \right)} ,y_{m} } \right)} \right\} \) the labelled dataset, where each \( y_{i} \) is a benign or malignant label, \( U = \left\{ {\varvec{F}_{\alpha *\beta }^{{\left( {m + 1} \right)}} , \ldots ,\varvec{F}_{\alpha *\beta }^{{\left( {m + n} \right)}} } \right\} \) the unlabelled dataset, and \( f\left( {\varvec{F}_{\alpha *\beta }^{\left( i \right)} } \right) = \left\langle {\omega ,\varphi \left( {\varvec{F}_{\alpha *\beta }^{\left( i \right)} } \right)} \right\rangle + b \) the discriminating hyperplane, where \( \omega \) is the hyperplane normal vector, \( b \) is the offset term, and \( \varphi \) maps the input space to a high-dimensional reproducing kernel Hilbert space. The S3VM optimisation problem is as follows:

$$ \mathop {min}\nolimits_{\omega ,b} \frac{1}{2}\left\langle {\omega ,\omega } \right\rangle + C\sum\nolimits_{i = 1}^{m} {H\left( {y_{i} f\left( {\varvec{F}_{\alpha *\beta }^{\left( i \right)} } \right)} \right) + C^{*} \sum\nolimits_{i = m + 1}^{m + n} {\tilde{H}\left( {\left| {f\left( {\varvec{F}_{\alpha *\beta }^{\left( i \right)} } \right)} \right|} \right)} } $$
(7)

where \( H\left( t \right) \) is the hinge loss on labelled data, \( \tilde{H}\left( t \right) \) is the symmetric hinge loss on unlabelled data, \( C \) is the regularization parameter, and \( C^{*} \) is the penalty parameter for unlabelled data. When \( C^{*} = 0 \), Eq. (7) degenerates into the standard SVM.
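For illustration, the objective value of Eq. (7) can be evaluated directly in the linear-kernel case (the final model uses a Gaussian kernel; this sketch computes only the objective, not its optimiser):

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C, C_star):
    """Objective value of Eq. (7) for a linear kernel: hinge loss
    H(t) = max(0, 1 - t) on labelled data and the symmetric hinge
    H~(|t|) = max(0, 1 - |t|) on unlabelled data."""
    f_lab = X_lab @ w + b
    f_unl = X_unl @ w + b
    hinge = np.maximum(0.0, 1.0 - y_lab * f_lab)
    sym_hinge = np.maximum(0.0, 1.0 - np.abs(f_unl))
    return 0.5 * (w @ w) + C * hinge.sum() + C_star * sym_hinge.sum()

w, b = np.array([1.0, -0.5]), 0.0
X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.1, 0.0]])  # near the boundary, so it is penalised
obj = s3vm_objective(w, b, X_lab, y_lab, X_unl, C=10.0, C_star=2.0)
# with C* = 0, the unlabelled term vanishes and Eq. (7) is a standard SVM
svm_obj = s3vm_objective(w, b, X_lab, y_lab, X_unl, C=10.0, C_star=0.0)
print(obj >= svm_obj)  # True
```

The symmetric hinge pushes the hyperplane away from unlabelled points, which is how the cluster assumption enters the optimisation.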

4 Experiments and Results

In our experiments, we randomly divide the dataset in a mutually exclusive way: 80% for training and 20% as an independent test set. Because labelled data are usually too few for effective cross-validation, 5-fold cross-validation is used at each stage to increase the number of labelled samples per fold, and evaluation is performed on the independent test set. Since more than one ultrasound image can be collected from each patient, we ensure that image patches from the same patient are assigned to the same fold.
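The patient-level grouping can be sketched as follows; the round-robin assignment and the patient IDs are illustrative assumptions, as the paper does not specify the grouping procedure:

```python
from collections import defaultdict

def patient_folds(patient_ids, n_folds=5):
    """Assign sample indices to folds so that all image patches from
    the same patient land in the same fold (round-robin over patients)."""
    patients = sorted(set(patient_ids))
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = defaultdict(list)
    for idx, p in enumerate(patient_ids):
        folds[fold_of_patient[p]].append(idx)
    return dict(folds)

# hypothetical patient IDs for nine image patches
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p6"]
folds = patient_folds(ids)
sets = [{ids[i] for i in members} for members in folds.values()]
print(all(a.isdisjoint(b) for a in sets for b in sets if a is not b))  # True
```

In practice, scikit-learn's `GroupKFold` offers the same guarantee that no patient's patches are split across folds.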

Table 1 shows the mean and standard deviation (std) of the predictive capability of the DScGANS model on the test set. Here, 25% means that 25% of the data in the training set are labelled, obtained by random selection, and FSL (fully supervised learning) indicates that the training set is composed entirely of labelled data. We ran ten independent experiments on the training set (ensuring that every sample was used at least once as labelled and as unlabelled data), each with 5-fold cross-validation, and averaged the test results on the test set. The results indicate that, regardless of the proportion of labelled data used, the standard deviation of the obtained performance is smaller with DK, which clearly shows that DK substantially improves the stability of semi-supervised learning. Furthermore, with the help of DK, our proposed model achieves stable and advanced diagnostic performance almost on par with FSL using only approximately 30%–35% of the labelled data; increasing the amount of labelled data further yields no significant performance gains.

Table 1. The mean ± std of the predictive capability of the proposed DScGANS model using different amounts of labelled data with or without DK on the test set

5 Discussion

Comparison of Deep Features and DK Combinations in Different Dimensions:

Figure 2 shows the ROC curves and AUC values obtained when using the original deep features (without dimensionality reduction) and reduced deep features of 3, 2 or 1 times the DK dimension, with 35% of the labelled data (the setting in which nearly optimal performance is obtained with minimal labelled data). When the dimension of the deep features was twice that of DK, the proposed model obtained the best AUC value. Plausibly, the deep learning model can capture more valuable implicit information than human experts can, but when the dimension of the deep features is too high, DK is overwhelmed, which degrades the performance of the DScGANS model.

Fig. 2.

ROC curve and AUC value of the proposed DScGANS model when using different dimensional deep features and DK combinations.

S3VM Kernel Function and Parameter C and C* Setting:

In Table 2, we compare the test results of the linear kernel and the Gaussian kernel (based on libsvm in sklearn) on different proportions of labelled data; as before, 25% means that 25% of the data in the training set are labelled. To mitigate the possible negative effect of unlabelled data on the results, we let \( C^{*} \) begin at a small value and reach the same value as \( C \) when the model converges, defining \( C^{*} = C_{p} \times C \), where \( C_{p} \) is the proportionality factor of \( C \). Note that \( C^{*} \) is 0 in FSL. A grid search strategy was adopted to find the optimal parameters. Finally, using the Gaussian kernel with \( C = 10 \), \( C_{p} = 0.2 \) and \( \gamma = 2 \) and 35% of the labelled data, we obtain stable and advanced diagnostic performance almost on par with that of FSL.
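The \( C^{*} \) schedule can be sketched as follows, interpreting \( C_{p} \times C \) as the starting value and ramping up to \( C \); the linear ramp is an assumption, as the paper states only that \( C^{*} \) starts small and reaches \( C \) at convergence:

```python
def c_star_schedule(C, C_p, epoch, total_epochs):
    """Anneal C* from its base value C_p * C up to C over training.
    The linear-in-epoch ramp is an illustrative assumption."""
    base = C_p * C
    return min(C, base + (C - base) * epoch / max(1, total_epochs - 1))

# C = 10, C_p = 0.2: C* ramps from 2.0 to 10.0 over five epochs
vals = [c_star_schedule(10.0, 0.2, e, 5) for e in range(5)]
print(vals[0], vals[-1])  # 2.0 10.0
```

Keeping \( C^{*} \) small early on limits the influence of unlabelled points before the decision boundary has stabilised.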

Table 2. Two kernel functions and optimal parameters obtained using the grid search strategy

6 Conclusion

We propose the DScGANS model for the diagnosis of ultrasound thyroid nodules based on multimodal ultrasound data. The experimental results suggest that, with the help of DK, our proposed model can effectively avoid the mixed results that may occur when learning from a small medical dataset with insufficient labels and can provide stable and advanced diagnostic performance; the model can potentially be integrated into thyroid ultrasound systems. In future work, we aim to achieve automatic segmentation of the nodules, integrate more types of DK to train the proposed model, and extend our model to real clinical environments.