1 Introduction

This paper addresses the problem of the detection of small structures in 3D images. We aim at developing a machine learning method requiring the least possible annotations for training. Several deep learning techniques [1,2,3] have recently been proposed for 3D segmentation. These methods use fully convolutional networks (FCN) [4] with a downsampling and upsampling path and can therefore detect small regions and fine details. Although efforts have been made to reduce the amount of annotations required with e.g., sparse annotations [1], these techniques still need pixel-wise annotations for training. Acquiring those is very time-consuming and often not feasible for large datasets.

In [5] the authors propose 2D localization networks requiring only global image labels for their training. They aggregate the last features maps of their network into scalars thanks to a global pooling layer and can therefore compute a classification. Heatmaps can be obtained as a weighted sum of these feature maps. These heatmaps indicate which regions of the image contributed the most to the classification results. Because of the downsampling introduced by the pooling operations in the network, the heatmaps are several times smaller than the original input image, which makes it impossible to detect very small structures. In [6] a similar technique is applied to 3D CT lung data. This network splits into two branches, one with fully connected layers for classification, and the other with a global pooling layer for localization. The loss is computed as a weighted sum of these two terms. However, as in [5], this network is not suitable for the detection of small structures.

In this paper we propose a method to learn fine pixelwise detection from image-level labels. By combining global pooling with a fully convolutional architecture including upsampling steps as in the popular U-Net [7], we compute a heatmap of the size of the input. This heatmap reveals the presence of the targeted structures. During the training, unlike [5] or [6] where the authors use a classification network, the weights of the network are optimized to solve a regression task, where the objective is to predict the number of lesions.

We evaluate our method on the detection of enlarged perivascular spaces (EPVS) in the basal ganglia in MRI. EPVS is an emerging biomarker for cerebral small vessel disease. The detection of EPVS is a challenging task - even for a human observer - because of the very small size and variable shape. In [8,9,10] the authors propose different EPVS segmentation methods. However detection of individual EPVS has not been addressed much in the literature. Only [9] proposes an automatic method for this, but this method requires pixel-wise annotated scans for training.

2 Methods

Our method takes as input a full 3D MRI brain scan, computes a smaller smooth region of interest (ROI) and inputs it to a FCN which computes a heatmap revealing the presence of the targeted brain lesions. The FCN is trained with weak labels. Its architecture changes between training and testing but the optimized weights stay the same.

2.1 Preprocessing

Scans are first registered to MNI space. A binary mask segmenting a region of interest (ROI) in the brain is computed with standard algorithms [11]. This mask is dilated and its borders are smoothed with a Gaussian kernel. After applying the mask, we crop the image around the ROI, trim its highest values and rescale its intensities such that the input to the network is \(S \in [0,1]^{h \times w \times d}\).

2.2 3D Regression Fully Convolutional Network

Figure 1 shows the architecture of our network. It is similar to the one of 3D U-Net [1] but is less deep and has, during training, a global pooling layer [12] before the last layer. The use of this layer is detailed in the next section. Our network has \(3 \times 3 \times 3\) convolutional layers. In the encoding path, these layers are alternated with \(2 \times 2 \times 2\) pooling layers to downsample the image and increase the size of the receptive field. In the deconding path we use \(2 \times 2 \times 2\) upsampling layers. Skip connections connect the encoding and decoding paths. We do not use padding. After each convolutional layer - except for the last one - we compute a rectified linear unit activation. The depth and number of feature maps of the network are set to fit the available GPU memory. See Fig. 1 for the architecture details.

Fig. 1.
figure 1

3D Regression FCN Architecture. Top-left: network architecture (see Sect. 2.2). Bottom-right: during training, the network is built to solve a regression problem and outputs \(\hat{y} \in \mathbb {R}\) computed according to Eq. (1). During testing, the global pooling layer is removed. Using Eq. (3), the network computes a heatmap \(M \in \mathbb {R}^{h \times w \times d}\) of the same size as the network input volume S.

We change the last layers of our network between training and testing. The testing step is performed with a standard fully convolutional architecture. The output is a heatmap of the size of the input. During training, we use only global image labels. To compute the loss function we need therefore to collapse the image output into a scalar. For this purpose, during training only, we introduce a global pooling layer before the last layer. In others word, we train a regression network and transform it to a segmentation network for testing (See Fig. 1). We detail this below.

Training - Regression Network. After the last convolutional layer, instead of stacking the voxels of each feature map and feeding them to a fully connected layer, we use a global pooling layer, which computes one pooling operation for each feature map. The resulting output of such a layer is a vector of scalars \(x \in \mathbb {R}^{n}\), with \(n \in \mathbb {N}\) the number of features maps of the preceding layer.

Let \(f_{i} \in \mathbb {R}^{h \times w \times d}\), for \(i \in \{1,2,..,n\}\), be the i-th feature map received by the global pooling layer. We can write the global pooling operation as \(g(f_{i}) = x_{i}\), with \(x_{i} \in \mathbb {R}\). Let G be the underlying mapping of the global pooling layer and \(f \in \mathbb {R}^{n \times h \times w \times d}\) the vector containing the n feature maps \(f_{i}\). We have \(G(f) \in \mathbb {R}^{n}\) and in the case of global max pooling we can write \((G(f))_{i} = \max \limits _{x,y,z} f_{i,x,y,z}\).

A convolution layer with a single output feature map \(\hat{y}\) and a filter size of \(1 \times 1 \times 1\) can be written as

$$\begin{aligned} \hat{y} = \sum _{i} w_{i} h_{i} + b, \end{aligned}$$
(1)

with \(w_{i} \in \mathbb {R}\), the weights, b the bias and \(h_{i}\) the i-th feature map of the preceding layer. Considering the layer following the global pooling layer, we can write Eq. (1) as

$$\begin{aligned} \hat{y} = \sum _{i} w_{i} (G(f))_{i} + b = w.G(f) + b, \end{aligned}$$
(2)

with . denoting the scalar product, \(\hat{y} \in \mathbb {R}\), \(b \in \mathbb {R}\) and \(w \in \mathbb {R}^{n}\). The output of the network \(\hat{y}\) is the estimated lesion count. There is no activation function applied after this last convolutional layer. To optimize the weights, we compute a root mean square loss \(l = \sqrt{\frac{1}{m}\sum _{t=1}^{m}(\hat{y}_{t} - y_{t})^2}\), where \(\hat{y}_{t} \in \mathbb {R}\) is the predicted count, \(y_{t} \in \mathbb {N}\) the ground truth for volume \(S_{t}\) and m the number of samples.

Testing - Detection Network. During testing we remove the global pooling layer, but we do not change the weights of the network. The \(h_{i}\) of Eq. (1) are the feature maps \(f_{i}\) defined earlier and the bias \(b^{l}\) can be omitted. We can thus rewrite Eq. (2) as

$$\begin{aligned} M = \sum _{i} w_{i} f_{i}, \end{aligned}$$
(3)

with \(M \in \mathbb {R}^{h \times w \times d}\) the new output of the network. M is a heatmap indicating the location of the targeted lesions. Note that computation of M is mathematically equivalent to the class map computation proposed in [5]. However M has the same size as the input S and the weights \(w_{i}\) are optimized for a regression problem instead of classification.

After computing the heatmap, we need to select an appropriate threshold before we can compute a segmentation and retrieve the location of the targeted lesions. This efficiently be done by using the estimate of the lesion count \(\hat{y}\) provided by the neural network: we only need to keep the global pooling layer while testing. The threshold is then selected so that the number of connected components in the thresholded heatmap is equal to the lesion count.

3 Experiments

We evaluate our method on a testing set of 30 MRI scans and compute the sensitivity, false discovery rate and average number of false positive per image (Fig. 2).

Data. Our scans were acquired with a 1.5 T scanner. Our dataset is a subset of the Rotterdam Scan Study [13] and contains 1642 3D PD-weighted MRI scans. Note that in our dataset PD-weighted images have a signal intensity similar to T2-weighted images, the modality commonly used to detect EPVS. The voxel resolution is \(0.49 \times 0.49 \times 0.8\,\text {mm}^{3}\). Each of the scans of dataset is annotated with a single, global image label: a trained observer counted the number of EPVS in the slice of the basal ganglia showing the anterior commissure. In a random subset of 30 scans, they marked the center of all EPVS visible in this same slice. These 30 scans are kept for testing set. The remaining dataset is randomly split into the following subsets: 1289 scans for training and 323 for validation.

Fig. 2.
figure 2

Examples of EPVS detection in the basal ganglia. Left: Ground truth on the intensities - the lesions are circled in green. Center: Heatmap from the neural network. True positives in green, false positives in blue and false negatives in red. Right: Segmentation of the heatmap.

Experimental Settings. The initial ROI segmentation of the BG is computed with the subcortical segmentation of FreeSurfer [11]. For registration to MNI space we use the rigid registration implemented in Elastix [14] with the mutual information as similarity measure and default settings. The Gaussian blurring kernel has a standard deviation \(\sigma = 2\). The preprocessed image S has a size of \(168 \times 128 \times 84\) voxels. We trim its \(1\%\) highest values. We initialize the weights of the CNN by sampling from a Gaussian distribution, use Adadelta [15] for optimization and augment the training data with randomly transformed samples. A lesion is counted as detected if the center of a connected component in the thresholded heatmap is located within a radius of x pixels of the annotation. In our experiments we set \(x=3\) which corresponds to the average radius of the largest lesions. We implemented our algorithms in Python with Keras and Theano libraries and ran the experiments on a Nvidia GeForce GTX 1070 GPU.

Table 1. Results of our method in comparison to the baselines. TPR stands for true positive rate (i.e. sensitivity), FPav is the number of false positives per image in average and FDR stands for false discovery rate. The comparing methods are described in Sect. 3.

Baseline Methods. We compare our method with four conventional approaches, using the same preprocessing (Sect. 2.1). The first method (a) thresholds the input image based on it intensities: EPVS are among the most hyperintense structures in the basal ganlgia. Both second and third methods compute saliency maps [16] using a neural network. Saliency maps are obtained by computing the gradient of the regression score with respect to the input image. These maps can be easily computed given any classification or regression network architecture. The third method (c) uses the same network architecture as ours, the second one (b) uses a regression network without the upsampling path. Finally we compared with a method similar to [5] where we replace the original classification objective with a regression. This network is the same as ours but without the upsampling part. The thresholds are chosen based on the lesion count estimated by each network as explain in Sect. 2.

Results. In Table 1 we report sensitivity, false discovery rate and average false positive per image (FPav). Our method (e) detects \(54.8\%\) of the lesions with on average 1.9 false positive per image. We found that the performance could be improved further by thresholding the weighted sum of the heatmap and the original voxel intensities of the image (f). This improves the sensitivity to \(62.0\%\) with 1.5 FPav. Considering the sensitivity, our method outperforms the other baseline methods by more than \(20\%\) (Table 1). By modifying the heatmap threshold to overestimate the lesion count, we can increase the sensitivity further - \(71.1\%\) - at the price of more false positive - 4.4 in average and a false discovery rate of \(60\%\) (Fig. 3).

Fig. 3.
figure 3

Comparison with baselines. Left: Heatmaps of the different methods. EPVS are circled in green on the original image (a). The saliency maps (b, c(FCN)) are very noisy. Our method (e) has finer details than a network without upsampling path (d) similar to [5]. Combining the intensities with our heatmap (f) offers the best results. Right: Free-ROC curves for each method, showing the FPav on the x-axis and the sensitivity on the y-axis.

4 Discussion

Our experiments show that using a fully convolutional network and optimizing its weights for a regression problem gives the best results on our dataset (Fig. 3 and Table 1).

The methods in [5, 6] produce coarse localization heatmaps on which it is impossible to locate small structures. In our early experiments we computed similar networks, optimized for classification. The resulting heatmaps highlight the whole ROI, which does not help to solve the detection problem. Using a regression network, the more precise output allows us to extract finer-scale information, but without an upsampling path, the results are still imprecise (see Fig. 3 method (d)). In our experiments, saliency maps [16] perform similarly to voxel intensity. By construction these maps are noisy (Fig. 3 method (b)).

Considering other works in automatic detection of EPVS, in [9] the authors compute an EPVS segmentation using a random forest classifier. They train this classifier with pixelwise ground truth annotations. They also use a 7 T scanner which produces high resolution images and eases small lesion detection. They report a sensitivity of \(78\%\) and a false discovery rate of \(55\%\). This is better than our \(71\%\) sensitivity and \(60\%\) false discovery rate. However contrary to [9] which requires pixelwise annotation, our method uses only weak labels for training. In [8, 10], the authors also compute segmentations, but the evaluation is limited to a correlation with a five-category visual score.

Overall we consider that our detection results are satisfactory, considering that, during training, our method did not use any ground truth information on the location or shape of the lesions, but only their count within the whole ROI.

5 Conclusion

We presented a novel 3D neural network to detect small structures from global image-level labels. Our method combines a fully convolutional architecture with global pooling. During training the weights are optimized to solve a regression problem, while during testing the network computes lesion heatmaps of the size of the input. We demonstrated the potential of our method to detect enlarged perivascular spaces in brain MRI. We obtained a sensitivity of \(62\%\) and on average 1.5 false positives per scan. Our method outperforms four other approaches with an increase of more than \(20\%\) in sensitivity. We believe this method can be applied to the detection of other brain lesions.