Abstract
We propose a novel convolutional neural network for lesion detection from weak labels. Only a single, global label per image - the lesion count - is needed for training. We train a regression network with a fully convolutional architecture combined with a global pooling layer to aggregate the 3D output into a scalar indicating the lesion count. When testing on unseen images, we first run the network to estimate the number of lesions. Then we remove the global pooling layer to compute localization maps of the size of the input image. We evaluate the proposed network on the detection of enlarged perivascular spaces in the basal ganglia in MRI. Our method achieves a sensitivity of \(62\%\) with on average 1.5 false positives per image. Compared with four other approaches based on intensity thresholding, saliency and class maps, our method has a \(20\%\) higher sensitivity.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
1 Introduction
This paper addresses the problem of the detection of small structures in 3D images. We aim at developing a machine learning method requiring the least possible annotations for training. Several deep learning techniques [1,2,3] have recently been proposed for 3D segmentation. These methods use fully convolutional networks (FCN) [4] with a downsampling and upsampling path and can therefore detect small regions and fine details. Although efforts have been made to reduce the amount of annotations required with e.g., sparse annotations [1], these techniques still need pixel-wise annotations for training. Acquiring those is very time-consuming and often not feasible for large datasets.
In [5] the authors propose 2D localization networks requiring only global image labels for their training. They aggregate the last features maps of their network into scalars thanks to a global pooling layer and can therefore compute a classification. Heatmaps can be obtained as a weighted sum of these feature maps. These heatmaps indicate which regions of the image contributed the most to the classification results. Because of the downsampling introduced by the pooling operations in the network, the heatmaps are several times smaller than the original input image, which makes it impossible to detect very small structures. In [6] a similar technique is applied to 3D CT lung data. This network splits into two branches, one with fully connected layers for classification, and the other with a global pooling layer for localization. The loss is computed as a weighted sum of these two terms. However, as in [5], this network is not suitable for the detection of small structures.
In this paper we propose a method to learn fine pixelwise detection from image-level labels. By combining global pooling with a fully convolutional architecture including upsampling steps as in the popular U-Net [7], we compute a heatmap of the size of the input. This heatmap reveals the presence of the targeted structures. During the training, unlike [5] or [6] where the authors use a classification network, the weights of the network are optimized to solve a regression task, where the objective is to predict the number of lesions.
We evaluate our method on the detection of enlarged perivascular spaces (EPVS) in the basal ganglia in MRI. EPVS is an emerging biomarker for cerebral small vessel disease. The detection of EPVS is a challenging task - even for a human observer - because of the very small size and variable shape. In [8,9,10] the authors propose different EPVS segmentation methods. However detection of individual EPVS has not been addressed much in the literature. Only [9] proposes an automatic method for this, but this method requires pixel-wise annotated scans for training.
2 Methods
Our method takes as input a full 3D MRI brain scan, computes a smaller smooth region of interest (ROI) and inputs it to a FCN which computes a heatmap revealing the presence of the targeted brain lesions. The FCN is trained with weak labels. Its architecture changes between training and testing but the optimized weights stay the same.
2.1 Preprocessing
Scans are first registered to MNI space. A binary mask segmenting a region of interest (ROI) in the brain is computed with standard algorithms [11]. This mask is dilated and its borders are smoothed with a Gaussian kernel. After applying the mask, we crop the image around the ROI, trim its highest values and rescale its intensities such that the input to the network is \(S \in [0,1]^{h \times w \times d}\).
2.2 3D Regression Fully Convolutional Network
Figure 1 shows the architecture of our network. It is similar to the one of 3D U-Net [1] but is less deep and has, during training, a global pooling layer [12] before the last layer. The use of this layer is detailed in the next section. Our network has \(3 \times 3 \times 3\) convolutional layers. In the encoding path, these layers are alternated with \(2 \times 2 \times 2\) pooling layers to downsample the image and increase the size of the receptive field. In the deconding path we use \(2 \times 2 \times 2\) upsampling layers. Skip connections connect the encoding and decoding paths. We do not use padding. After each convolutional layer - except for the last one - we compute a rectified linear unit activation. The depth and number of feature maps of the network are set to fit the available GPU memory. See Fig. 1 for the architecture details.
We change the last layers of our network between training and testing. The testing step is performed with a standard fully convolutional architecture. The output is a heatmap of the size of the input. During training, we use only global image labels. To compute the loss function we need therefore to collapse the image output into a scalar. For this purpose, during training only, we introduce a global pooling layer before the last layer. In others word, we train a regression network and transform it to a segmentation network for testing (See Fig. 1). We detail this below.
Training - Regression Network. After the last convolutional layer, instead of stacking the voxels of each feature map and feeding them to a fully connected layer, we use a global pooling layer, which computes one pooling operation for each feature map. The resulting output of such a layer is a vector of scalars \(x \in \mathbb {R}^{n}\), with \(n \in \mathbb {N}\) the number of features maps of the preceding layer.
Let \(f_{i} \in \mathbb {R}^{h \times w \times d}\), for \(i \in \{1,2,..,n\}\), be the i-th feature map received by the global pooling layer. We can write the global pooling operation as \(g(f_{i}) = x_{i}\), with \(x_{i} \in \mathbb {R}\). Let G be the underlying mapping of the global pooling layer and \(f \in \mathbb {R}^{n \times h \times w \times d}\) the vector containing the n feature maps \(f_{i}\). We have \(G(f) \in \mathbb {R}^{n}\) and in the case of global max pooling we can write \((G(f))_{i} = \max \limits _{x,y,z} f_{i,x,y,z}\).
A convolution layer with a single output feature map \(\hat{y}\) and a filter size of \(1 \times 1 \times 1\) can be written as
with \(w_{i} \in \mathbb {R}\), the weights, b the bias and \(h_{i}\) the i-th feature map of the preceding layer. Considering the layer following the global pooling layer, we can write Eq. (1) as
with . denoting the scalar product, \(\hat{y} \in \mathbb {R}\), \(b \in \mathbb {R}\) and \(w \in \mathbb {R}^{n}\). The output of the network \(\hat{y}\) is the estimated lesion count. There is no activation function applied after this last convolutional layer. To optimize the weights, we compute a root mean square loss \(l = \sqrt{\frac{1}{m}\sum _{t=1}^{m}(\hat{y}_{t} - y_{t})^2}\), where \(\hat{y}_{t} \in \mathbb {R}\) is the predicted count, \(y_{t} \in \mathbb {N}\) the ground truth for volume \(S_{t}\) and m the number of samples.
Testing - Detection Network. During testing we remove the global pooling layer, but we do not change the weights of the network. The \(h_{i}\) of Eq. (1) are the feature maps \(f_{i}\) defined earlier and the bias \(b^{l}\) can be omitted. We can thus rewrite Eq. (2) as
with \(M \in \mathbb {R}^{h \times w \times d}\) the new output of the network. M is a heatmap indicating the location of the targeted lesions. Note that computation of M is mathematically equivalent to the class map computation proposed in [5]. However M has the same size as the input S and the weights \(w_{i}\) are optimized for a regression problem instead of classification.
After computing the heatmap, we need to select an appropriate threshold before we can compute a segmentation and retrieve the location of the targeted lesions. This efficiently be done by using the estimate of the lesion count \(\hat{y}\) provided by the neural network: we only need to keep the global pooling layer while testing. The threshold is then selected so that the number of connected components in the thresholded heatmap is equal to the lesion count.
3 Experiments
We evaluate our method on a testing set of 30 MRI scans and compute the sensitivity, false discovery rate and average number of false positive per image (Fig. 2).
Data. Our scans were acquired with a 1.5 T scanner. Our dataset is a subset of the Rotterdam Scan Study [13] and contains 1642 3D PD-weighted MRI scans. Note that in our dataset PD-weighted images have a signal intensity similar to T2-weighted images, the modality commonly used to detect EPVS. The voxel resolution is \(0.49 \times 0.49 \times 0.8\,\text {mm}^{3}\). Each of the scans of dataset is annotated with a single, global image label: a trained observer counted the number of EPVS in the slice of the basal ganglia showing the anterior commissure. In a random subset of 30 scans, they marked the center of all EPVS visible in this same slice. These 30 scans are kept for testing set. The remaining dataset is randomly split into the following subsets: 1289 scans for training and 323 for validation.
Experimental Settings. The initial ROI segmentation of the BG is computed with the subcortical segmentation of FreeSurfer [11]. For registration to MNI space we use the rigid registration implemented in Elastix [14] with the mutual information as similarity measure and default settings. The Gaussian blurring kernel has a standard deviation \(\sigma = 2\). The preprocessed image S has a size of \(168 \times 128 \times 84\) voxels. We trim its \(1\%\) highest values. We initialize the weights of the CNN by sampling from a Gaussian distribution, use Adadelta [15] for optimization and augment the training data with randomly transformed samples. A lesion is counted as detected if the center of a connected component in the thresholded heatmap is located within a radius of x pixels of the annotation. In our experiments we set \(x=3\) which corresponds to the average radius of the largest lesions. We implemented our algorithms in Python with Keras and Theano libraries and ran the experiments on a Nvidia GeForce GTX 1070 GPU.
Baseline Methods. We compare our method with four conventional approaches, using the same preprocessing (Sect. 2.1). The first method (a) thresholds the input image based on it intensities: EPVS are among the most hyperintense structures in the basal ganlgia. Both second and third methods compute saliency maps [16] using a neural network. Saliency maps are obtained by computing the gradient of the regression score with respect to the input image. These maps can be easily computed given any classification or regression network architecture. The third method (c) uses the same network architecture as ours, the second one (b) uses a regression network without the upsampling path. Finally we compared with a method similar to [5] where we replace the original classification objective with a regression. This network is the same as ours but without the upsampling part. The thresholds are chosen based on the lesion count estimated by each network as explain in Sect. 2.
Results. In Table 1 we report sensitivity, false discovery rate and average false positive per image (FPav). Our method (e) detects \(54.8\%\) of the lesions with on average 1.9 false positive per image. We found that the performance could be improved further by thresholding the weighted sum of the heatmap and the original voxel intensities of the image (f). This improves the sensitivity to \(62.0\%\) with 1.5 FPav. Considering the sensitivity, our method outperforms the other baseline methods by more than \(20\%\) (Table 1). By modifying the heatmap threshold to overestimate the lesion count, we can increase the sensitivity further - \(71.1\%\) - at the price of more false positive - 4.4 in average and a false discovery rate of \(60\%\) (Fig. 3).
4 Discussion
Our experiments show that using a fully convolutional network and optimizing its weights for a regression problem gives the best results on our dataset (Fig. 3 and Table 1).
The methods in [5, 6] produce coarse localization heatmaps on which it is impossible to locate small structures. In our early experiments we computed similar networks, optimized for classification. The resulting heatmaps highlight the whole ROI, which does not help to solve the detection problem. Using a regression network, the more precise output allows us to extract finer-scale information, but without an upsampling path, the results are still imprecise (see Fig. 3 method (d)). In our experiments, saliency maps [16] perform similarly to voxel intensity. By construction these maps are noisy (Fig. 3 method (b)).
Considering other works in automatic detection of EPVS, in [9] the authors compute an EPVS segmentation using a random forest classifier. They train this classifier with pixelwise ground truth annotations. They also use a 7 T scanner which produces high resolution images and eases small lesion detection. They report a sensitivity of \(78\%\) and a false discovery rate of \(55\%\). This is better than our \(71\%\) sensitivity and \(60\%\) false discovery rate. However contrary to [9] which requires pixelwise annotation, our method uses only weak labels for training. In [8, 10], the authors also compute segmentations, but the evaluation is limited to a correlation with a five-category visual score.
Overall we consider that our detection results are satisfactory, considering that, during training, our method did not use any ground truth information on the location or shape of the lesions, but only their count within the whole ROI.
5 Conclusion
We presented a novel 3D neural network to detect small structures from global image-level labels. Our method combines a fully convolutional architecture with global pooling. During training the weights are optimized to solve a regression problem, while during testing the network computes lesion heatmaps of the size of the input. We demonstrated the potential of our method to detect enlarged perivascular spaces in brain MRI. We obtained a sensitivity of \(62\%\) and on average 1.5 false positives per scan. Our method outperforms four other approaches with an increase of more than \(20\%\) in sensitivity. We believe this method can be applied to the detection of other brain lesions.
References
Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_49
Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A.: VoxResNet: deep voxelwise residual networks for volumetric brain segmentation. NeuroImage (2017)
Bortsova, G., van Tulder, G., Dubost, F., Peng, T., Navab, N., van der Lugt, A., Bos, D., de Bruijne, M.: Segmentation of intracranial arterial calcification with deeply supervised residual dropout networks. In: Descoteaux, M., et al. (eds.) MICCAI 2017, Part III. LNCS, vol. 10435, pp. 359–367. Springer, Cham (2017). doi:10.1007/978-3-319-66179-7_41
Long, J., Shelhamer, E. and Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)
Hwang, S., Kim, H.-E.: Self-transfer learning for weakly supervised lesion localization. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 239–246. Springer, Cham (2016). doi:10.1007/978-3-319-46723-8_28
Ronneberger, O., Fischer, P., Brox, T.: Convolutional networks for biomedical image segmentation. In: MICCAI, pp. 234–241 (2015)
Ramirez, J., Berezuk, C., McNeely, A.A., Scott, C.J., Gao, F., Black, S.E.: Visible Virchow-Robin spaces on magnetic resonance imaging of Alzheimer’s disease patients and normal elderly from the Sunnybrook Dementia Study. J. Alzheimers Dis. 43(2), 415–424 (2015)
Park, S.H., Zong, X., Gao, Y., Lin, W., Shen, D.: Segmentation of perivascular spaces in 7T MR image using auto-context model with orientation-normalized features. NeuroImage 134, 223–235 (2016)
Ballerini, L., Lovreglio, R., Hernandez, M., del C. Valdés Hernández, M., Maniega, S.M., Pellegrini, E., Wardlaw, J.M.: Application of the ordered logit model to optimising frangi filter parameters for segmentation of perivascular spaces. Procedia Comput. Sci. 90, 6167 (2016)
Desikan, R.S., Sgonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., Blacker, D., Buckner, R.L., Dale, A.M., Maguire, R.P., Hyman, B.T., Albert, M.S.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31(3), 968–980 (2006)
Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR (2014)
Hofman, A., Brusselle, G.G., Murad, S.D., van Duijn, C.M., Franco, O.H., Goedegebure, A., Ikram, M.A., Klaver, C.C., Nijsten, T.E., Peeters, R.P., Stricker, B.H.C.: The Rotterdam Study: 2016 objectives and design update. Eur. J. Epidemiol. 30(8), 661–708 (2015)
Klein, S., Staring, M., Murphy, K., Viergever, M.A., Pluim, J.P.W.: Elastix: a toolbox for intensity based medical image registration. TMI 29(1), 196–205 (2010)
Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv preprint. arxiv:1212.5701
Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. In: CVPR (2014)
Acknowledgments
This research was funded by The Netherlands Organisation for Health Research and Development (ZonMw) Project 104003005.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Dubost, F. et al. (2017). GP-Unet: Lesion Detection from Weak Labels with a 3D Regression Network. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D., Duchesne, S. (eds) Medical Image Computing and Computer Assisted Intervention − MICCAI 2017. MICCAI 2017. Lecture Notes in Computer Science(), vol 10435. Springer, Cham. https://doi.org/10.1007/978-3-319-66179-7_25
Download citation
DOI: https://doi.org/10.1007/978-3-319-66179-7_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66178-0
Online ISBN: 978-3-319-66179-7
eBook Packages: Computer ScienceComputer Science (R0)