Abstract
Deep learning for medical image classification faces three major challenges: (1) the number of annotated medical images available for training is usually small; (2) regions of interest (ROIs) are relatively small, have unclear boundaries within the whole image, and may appear at arbitrary positions across the x, y (and also z in 3D images) dimensions, yet often only image-level labels are annotated and localized ROIs are unavailable; and (3) ROIs in medical images often appear in varying sizes (scales). We approach these three challenges with a Multi-Instance Multi-Scale (MIMS) CNN: (1) we propose a multi-scale convolutional layer, which extracts patterns at different receptive fields with a shared set of convolutional kernels, so that scale-invariant patterns are captured by this compact set of kernels; as this layer contains only a small number of parameters, training on small datasets becomes feasible; (2) we propose a “top-k pooling” scheme to aggregate feature maps in varying scales across multiple spatial dimensions, allowing the model to be trained using weak annotations within the multiple instance learning (MIL) framework. Our method is shown to perform well on three classification tasks involving two 3D and two 2D medical image datasets.
1 Introduction
Training a convolutional neural network (CNN) from scratch demands a massive amount of training images. The limited availability of medical images therefore encourages transfer learning, i.e., fine-tuning 2D CNN models pretrained on natural images [10]. A key difference between medical images and natural images is that regions of interest (ROIs) are relatively small, with unclear boundaries, within the whole image, and may appear multiple times at arbitrary positions across the x, y (and also z in 3D images) dimensions. On the other hand, annotations for medical images are often “weak”: only image-level labels are available, with no localized ROIs. In this setting, we can view each ROI as an instance in a bag of all image patches, and image-level classification falls within the Multiple-Instance Learning (MIL) framework [2, 6, 11].
Another challenge with medical images is that ROIs are often scale-invariant, i.e., visually similar patterns often appear in varying sizes (scales). A vanilla CNN would need an excessive number of convolutional kernels with varying receptive fields to fully cover these patterns, which means more parameters and a demand for more training data. Some previous works have attempted to learn scale-invariant patterns; for example, [8] adopted image pyramids, i.e., resizing the input image to different scales, processing each with the same CNN, and aggregating the outputs. However, our experiments show that image pyramids perform inconsistently across datasets and consume much more computational resources than vanilla CNNs.
This paper aims to address all the challenges above in a holistic framework. We propose two novel components: (1) a multi-scale convolutional layer (MSConv) that further processes feature maps extracted by a pretrained CNN, aiming to capture scale-invariant patterns with a shared set of kernels; and (2) a top-k pooling scheme that extracts and aggregates the highest activations from feature maps in each convolutional channel (across multiple spatial dimensions in varying scales), so that the model can be trained with image-level labels only.
The MSConv layer consists of a few resizing operators (with different output resolutions) and a shared set of convolutional kernels. First, a pretrained CNN extracts feature maps from input images. The MSConv layer then resizes them to different scales and processes each scale with the same set of convolutional kernels. Given the varying scales of the feature maps, the convolutional kernels effectively have varying receptive fields and are therefore able to detect scale-invariant patterns. As feature maps are much smaller than input images, the computation and memory overhead of the MSConv layer is insignificant.
The MSConv layer is inspired by ROI-pooling [1] and is closely related to Trident Network [5]. Trident Network uses shared convolutional kernels with different dilation rates to capture scale-invariant patterns. Its limitations include: (1) the receptive fields of dilated convolutions can only be integer multiples of the original receptive fields; and (2) dilated convolutions may overlook prominent activations within a dilation interval. In contrast, MSConv interpolates the input feature maps to any desired size before convolution, so that the scales are more refined and prominent activations are always retained for further convolution. [3] proposed a similar idea of resizing the input multiple times before convolution and aggregating the resulting feature maps by max-pooling. However, we observed empirically that activations at larger scales tend to dominate and effectively mask those at smaller scales. MSConv incorporates a batchnorm layer and a learnable weight for each scale to eliminate such biases. In addition, MSConv adopts multiple kernel sizes to capture patterns across a wider range of scales.
A core operation in an MIL framework is to aggregate features or predictions from different instances (pattern occurrences). Intuitively, the most prominent patterns are usually also the most discriminative, and thus the highest activations can summarize a set of feature maps with the same semantics (i.e., in the same channel). In this regard, we propose a top-k pooling scheme that selects the highest activations of a group of feature maps and takes their weighted average as the aggregate feature for downstream processing. Top-k pooling extends [9] with learnable pooling weights (instead of weights specified by a hyperparameter as in [9]) and a learnable magnitude-normalization operator.
The MSConv layer and top-k pooling comprise our Multi-Instance Multi-Scale (MIMS) CNN. To assess its performance, we evaluated 12 methods on three classification tasks: (1) classifying Diabetic Macular Edema (DME) on two 3D Retinal Optical Coherence Tomography (OCT) datasets; (2) classifying Myopic Macular Degeneration (MMD) on a 2D fundus image dataset; and (3) classifying microsatellite-instable (MSI) versus microsatellite-stable (MSS) tumors of colorectal cancer (CRC) patients on histology images. In most cases, MIMS-CNN achieved better accuracy than five baselines and six ablated models. Our experiments also verified that both the MSConv layer and top-k pooling make important contributions.
2 Multi-Instance Multi-Scale CNN
The architecture of our Multi-Instance Multi-Scale CNN is illustrated in Fig. 1. It consists of: (1) a pretrained 2D CNN to extract primary feature maps, (2) a multi-scale convolutional (MSConv) layer to extract scale-invariant secondary feature maps, (3) a top-k pooling operator to aggregate secondary feature maps, and (4) a classifier.
2.1 Multi-Scale Convolutional Layer
Due to limited training images, a common practice in medical image analysis is to extract image features using 2D CNNs pretrained on natural images. These features are referred to as the primary feature maps. Because of the domain gap between natural images and medical images, feeding primary feature maps directly into a classifier does not always yield good results. To bridge this gap, we propose to use an extra convolutional layer to extract more relevant features from the primary feature maps. This layer produces the secondary feature maps.
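As a concrete illustration, the sketch below extracts primary feature maps with a truncated pretrained backbone. It assumes torchvision's ResNet-101 (the feature extractor named in Sect. 3.2); the exact truncation point shown here is our assumption, chosen so that spatial feature maps (rather than a pooled vector) are returned.

```python
# A minimal sketch of primary feature map extraction, assuming torchvision's
# ResNet-101. Dropping the global average pooling and the FC head is our
# assumption; it keeps the convolutional trunk that outputs spatial maps.
import torch
import torchvision.models as models

backbone = models.resnet101(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

with torch.no_grad():
    img = torch.randn(1, 3, 512, 1024)       # e.g. one OCT slice
    primary = feature_extractor(img)          # (1, 2048, 16, 32) primary maps
```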
In order to capture scale-invariant ROIs, we resize the primary feature maps to different scales before convolution. Each scale corresponds to a separate pathway, and the weights of the convolutional kernels in all pathways are tied. In effect, this convolutional layer has multiple receptive fields on the primary feature maps. We name it the multi-scale convolutional (MSConv) layer.
More formally, let \(\varvec{x}\) denote the primary feature maps, \(\{F_{1},\cdots ,F_{N}\}\) denote all the output channels of the MSConv layer (see Note 1), and \(\{(h_{1},w_{1}),\cdots ,(h_{m},w_{m})\}\) denote the scale factors of the heights and widths (typically \(\frac{1}{4}\le h_{i}=w_{i}\le 2\)) adopted by the m resizing operators. The combination of the i-th scale and the j-th channel yields the ij-th secondary feature maps:

$$\varvec{y}_{ij}=F_{j}\left( \text {Resize}_{h_{i},w_{i}}(\varvec{x})\right) ,$$

where in theory \(\text {Resize}_{h_{i},w_{i}}(\cdot )\) could adopt any type of interpolation; our choice is bilinear interpolation.
For more flexibility, the convolutional kernels in MSConv can also have different sizes. With m resizing operators and n kernel sizes, the kernels effectively have up to \(m\times n\) different receptive fields. The multiple resizing operators and the varying kernel sizes complement each other and equip the CNN with scale-invariance.
Among \(\{\varvec{y}_{1j},\varvec{y}_{2j},\cdots ,\varvec{y}_{mj}\}\), feature maps at larger scales contain more elements and tend to contribute more of the top-k activations, hence dominating the aggregate feature and effectively masking out the feature maps at smaller scales. To remove such biases, the feature maps at different scales are passed through respective magnitude normalization operators. Each magnitude normalization operator consists of a batchnorm operator \(\text {BN}_{ij}\) and a learnable scalar multiplier \(sw_{ij}\). The scalar multiplier \(sw_{ij}\) adjusts the importance of the j-th channel at the i-th scale and is optimized with back-propagation.
The MSConv layer is illustrated in Fig. 1 and the left side of Fig. 2.
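The following is a minimal sketch of the MSConv layer, assuming PyTorch: bilinear resizing, one shared group of kernels per kernel size applied at every scale, and per-pathway magnitude normalization (batchnorm plus learnable per-channel multipliers \(sw_{ij}\)). The default scales mirror the \(\{\frac{i}{4}|i=2,3,4\}\) setting of Sect. 3.2; the kernel sizes (3 and 5) are our assumption, as the paper does not specify them.

```python
# Sketch of the MSConv layer. Kernel weights are shared across all scales,
# so the parameter count is independent of the number of resizing operators.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSConvLayer(nn.Module):
    def __init__(self, in_ch, out_ch, scales=(0.5, 0.75, 1.0),
                 kernel_sizes=(3, 5)):
        super().__init__()
        self.scales = scales
        # One shared group of kernels per kernel size (assumed sizes 3 and 5).
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )
        n_paths = len(scales) * len(kernel_sizes)
        # Magnitude normalization per (scale, kernel) pathway: a BatchNorm
        # plus a learnable per-channel scalar multiplier sw_ij.
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_ch) for _ in range(n_paths))
        self.sw = nn.Parameter(torch.ones(n_paths, out_ch))

    def forward(self, x):
        resized = [F.interpolate(x, scale_factor=s, mode='bilinear',
                                 align_corners=False) for s in self.scales]
        groups = []
        for j, conv in enumerate(self.convs):
            per_scale = []
            for i, xs in enumerate(resized):
                p = i * len(self.convs) + j        # pathway index (scale i, kernel j)
                y = self.sw[p].view(1, -1, 1, 1) * self.bns[p](conv(xs))
                per_scale.append(y)
            groups.append(per_scale)               # same channels across scales
        return groups
```

Returning the secondary feature maps grouped by kernel keeps channels with the same semantics together, which is what the pooling scheme of Sect. 2.2 expects.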
2.2 Top-k Pooling
Multiple Instance Learning (MIL) views the whole image as a bag and each ROI as an instance in the bag. Most existing MIL works [6, 11] are instance-based, i.e., they aggregate label predictions on instances to yield a bag prediction. In contrast, [2] adopted embedding-based MIL, which aggregates features (embeddings) of instances to yield bag features and then classifies the bag features. [2] showed that embedding-based MIL methods outperformed instance-based MIL baselines. Here we propose a simple but effective top-k pooling scheme that aggregates the most prominent features across a few spatial dimensions, as a new embedding-based MIL aggregation scheme.
Top-k pooling works as follows: given a set of feature maps with the same semantics, we find the k highest activation values and take a weighted average of them as the aggregate feature value. Intuitively, higher activation values are more important than lower ones, so the pooling weight should decrease as the ranking goes down. However, it may be sub-optimal to specify the weights manually as in [9]. Hence we adopt a data-driven approach that learns these weights automatically. More formally, given a set of feature maps \(\{\varvec{x}_{i}\}\), top-k pooling aggregates them into a single value:

$$\text {Pool}_{k}(\{\varvec{x}_{i}\})=\sum _{r=1}^{k}w_{r}a_{r},$$

where \(a_{1},\cdots ,a_{k}\) are the k highest activations within \(\{\varvec{x}_{i}\}\), and \(w_{1},\cdots ,w_{k}\) are nonnegative pooling weights to be learned, subject to the normalization constraint \(\sum _{r}w_{r}=1\). In practice, \(w_{1},\cdots ,w_{k}\) are initialized with exponentially decayed values and then optimized with back-propagation.
An important design choice in MIL is which spatial dimensions to pool over. Similar patterns, regardless of where they appear, contain similar information for classification; correspondingly, features in the same channel can be pooled together. On 2D images, we pool activations across the x, y axes of the secondary feature maps; on 3D images we pool across the x, y and z (slice) axes. In addition, feature maps in the same channel but at different scales (i.e., through different \(\text {Resize}_{h_{i},w_{i}}(\cdot )\) and the same \(F_{j}\)) encode the same semantics and should be pooled together. Eventually, all feature maps in the j-th channel, \(\{\varvec{y}_{\cdot j}\}=\varvec{y}_{1j},\varvec{y}_{2j},\cdots ,\varvec{y}_{mj}\), are pooled into a single value \(\text {Pool}_{k}(\{\varvec{y}_{\cdot j}\})\). Thus, following an N-channel MSConv layer, all feature maps are pooled into an N-dimensional feature vector representing the whole image. As typically \(N<100\), the downstream FC layer classifying this feature vector has only a small number of parameters and is less prone to overfitting.
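The sketch below is one possible realization of top-k pooling, again in PyTorch. The paper specifies nonnegative weights with \(\sum _{r}w_{r}=1\) and an exponentially decayed initialization, but not how the constraint is enforced; the softmax parameterization and the decay rate used here are our assumptions.

```python
# Sketch of top-k pooling with learnable weights. The softmax over free
# parameters (an assumption) keeps the weights nonnegative and summing to 1.
import torch
import torch.nn as nn

class TopKPool(nn.Module):
    def __init__(self, k=5, decay=0.7):
        super().__init__()
        self.k = k
        # Logits chosen so that softmax(logits) = decay^r / sum_r decay^r,
        # i.e. exponentially decayed initial weights (decay rate assumed).
        init = torch.log(torch.tensor([decay ** r for r in range(k)]))
        self.logits = nn.Parameter(init)

    def forward(self, maps):
        # `maps`: list of tensors (B, N, H_i, W_i), one per scale, all sharing
        # the same N channels. Flatten the spatial axes (and slices, for 3D
        # inputs folded into the batch of maps) and concatenate across scales,
        # then take the top-k activations per channel.
        flat = torch.cat([m.flatten(2) for m in maps], dim=2)  # (B, N, sum HW)
        topk, _ = flat.topk(self.k, dim=2)                     # (B, N, k)
        w = torch.softmax(self.logits, dim=0)                  # >= 0, sums to 1
        return (topk * w).sum(dim=2)                           # (B, N) vector
```

With \(k=5\) and m scales, each channel's pooled value summarizes the five strongest responses over all positions and scales, matching the aggregation \(\text {Pool}_{k}(\{\varvec{y}_{\cdot j}\})\) described above.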
Figure 2 illustrates top-k pooling applied to the feature maps of the j-th channel in m scales.
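To tie the pieces together, here is a hypothetical assembly of the four components of Fig. 1, reusing the feature extractor, MSConvLayer and TopKPool sketches above; the channel count and the binary classifier head are placeholders, not values from the paper.

```python
# Hypothetical end-to-end MIMS-CNN assembly: pretrained trunk -> MSConv ->
# top-k pooling -> FC classifier. Reuses the sketches defined above.
import torch
import torch.nn as nn

class MIMSNet(nn.Module):
    def __init__(self, backbone, n_channels=32, k=5, n_classes=2):
        super().__init__()
        self.backbone = backbone                      # pretrained conv trunk
        self.msconv = MSConvLayer(2048, n_channels)   # ResNet-101 trunk: 2048 ch
        self.pool = TopKPool(k=k)
        n_groups = len(self.msconv.convs)             # one channel group per kernel size
        self.fc = nn.Linear(n_groups * n_channels, n_classes)

    def forward(self, x):
        primary = self.backbone(x)                    # primary feature maps
        groups = self.msconv(primary)                 # secondary maps, per kernel group
        feat = torch.cat([self.pool(g) for g in groups], dim=1)
        return self.fc(feat)                          # image-level logits
```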
3 Experiments
3.1 Datasets
Three classification tasks involving four datasets were used for evaluation.
DME classification on OCT images. The following two 3D datasets acquired by the Singapore Eye Research Institute (SERI) were used:
(1) Cirrus dataset: 339 3D OCT images (239 normal, 100 DME). Each image has 128 slices of 512 × 1024 pixels. A 67–33% training/test split was used.
(2) Spectralis dataset: 197 3D OCT images (60 normal, 137 DME). Each image has \(25\sim 31\) slices of 497 × 768 pixels. A 50–50% training/test split was used.
MMD classification on fundus images:
(3) MMD dataset (acquired by SERI): 19,272 2D images (11,924 healthy, 631 MMD) of 900 × 600 pixels. A 70–30% training/test split was used.
MSI/MSS classification on CRC histology images:
(4) CRC-MSI dataset [4]: 93,408 2D training images (46,704 MSS, 46,704 MSI) and 98,904 test images (70,569 MSS, 28,335 MSI), all of 224 × 224 pixels.
3.2 Compared Methods
MIMS-CNN, 5 baselines, and 6 ablated models were compared. Unless otherwise specified, all methods used the ResNet-101 model (without the FC layer) pretrained on ImageNet for feature extraction, and top-k pooling (\(k=5\)) for feature aggregation.
MI-Pre. The ResNet feature maps are pooled by top-k pooling and classified.
Pyramid MI-Pre. Input images are scaled to \(\{\frac{i}{4}|i=2,3,4\}\) of original sizes, before being fed into the MI-Pre model.
MI-Pre-Conv. The ResNet feature maps are processed by an extra convolutional layer and aggregated by top-k pooling before classification. It is almost the same as the model in [6], except that [6] performs patch-level classification and aggregates patch predictions to obtain the image-level classification.
MIMS. The MSConv layer has 3 resizing operators that resize the primary feature maps to the following scales: \(\{\frac{i}{4}|i=2,3,4\}\). Two groups of kernels of different sizes were used.
MIMS-NoResizing. It is an ablated MIMS-CNN with all resizing operators removed. This is to evaluate the contribution of the resizing operators.
Pyramid MIMS. It is an ablated MIMS-CNN with all resizing operators removed; multi-scale processing is instead pursued with input image pyramids of scales \(\{\frac{i}{4}|i=2,3,4\}\). The MSConv kernels are configured identically to the above.
MI-Pre-Trident [5]. It extends MI-Pre-Conv with dilation factors 1, 2, 3.
SI-CNN [3]. It is an ablated MIMS-CNN with the batchnorms and scalar multipliers removed from the MSConv layer.
FeatPyra-4,5. It is a feature pyramid network [7] that extracts features from conv4_x and conv5_x in ResNet-101, processes each set of features with a respective convolutional layer, and classifies the aggregate features.
ResNet34-scratch. It is a ResNet-34 model trained from scratch.
MIMS-patchcls and MI-Pre-Conv-patchcls. They are ablated MIMS and MI-Pre-Conv models, respectively, evaluated on the 3D OCT datasets. They classify each slice and average the slice predictions to obtain the image-level classification.
3.3 Results
Table 1 lists the AUROC scores (averaged over three independent runs) of the 12 methods on the four datasets. All methods with an extra convolutional layer on top of a pretrained model performed well. The benefits of using pretrained models are confirmed by the performance gap between ResNet34-scratch and others. The two image pyramid methods performed significantly worse on some datasets, although they consumed twice as much computational time and GPU memory as other methods. MIMS-CNN almost always outperformed other methods.
The inferior performance of the two \(*\)-patchcls models demonstrated the advantages of top-k pooling for MIL. To further investigate its effectiveness, we trained MIMS-CNN on Cirrus with six MIL aggregation schemes: average-pooling (mean), max-pooling (max), top-k pooling with \(k=2,3,4,5\), and an instance-based MIL scheme: max-pooling over slice predictions (max-inst).
As can be seen in Table 2, the other three aggregation schemes (mean, max and max-inst) fell behind all top-k schemes, and the model tends to perform slightly better as k increases. This confirms that embedding-based MIL outperforms instance-based MIL.
4 Conclusions
Applying CNNs to medical images faces three challenges: datasets are small, annotations are often weak, and ROIs appear in varying scales. We proposed a framework to address all these challenges. The framework consists of two novel components: (1) a multi-scale convolutional layer on top of a pretrained CNN that captures scale-invariant patterns with only a small number of parameters, and (2) a top-k pooling operator that aggregates feature maps in varying scales across multiple spatial dimensions, facilitating training with weak annotations within the Multiple Instance Learning framework. Our method has been validated on three classification tasks involving four image datasets.
Notes
1. Each convolutional kernel yields multiple output channels with different semantics, so output channels are indexed separately, regardless of whether they come from the same kernel.
References
Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448 (2015)
Ilse, M., Tomczak, J.M., Welling, M.: Attention-based deep multiple instance learning. In: Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 2132–2141 (2018)
Kanazawa, A., Sharma, A., Jacobs, D.W.: Locally scale-invariant convolutional neural networks. In: NIPS Workshop on Deep Learning and Representation Learning (2014)
Kather, J.N.: Histological images for MSI vs. MSS classification in gastrointestinal cancer, FFPE samples. https://doi.org/10.5281/zenodo.2530835
Li, Y., Chen, Y., Wang, N., Zhang, Z.: Scale-aware trident networks for object detection. arXiv e-prints arXiv:1901.01892 (2019)
Li, Z., et al.: Thoracic disease identification and localization with limited supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Rasti, R., Rabbani, H., Mehridehnavi, A., Hajizadeh, F.: Macular OCT classification using a multi-scale convolutional neural network ensemble. IEEE Trans. Med. Imaging 37(4), 1024–1034 (2018)
Shi, Z., Ye, Y., Wu, Y.: Rank-based pooling for deep convolutional neural networks. Neural Networks 83, 21–31 (2016)
Tajbakhsh, N., et al.: Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans. Med. Imaging 35(5), 1299–1312 (2016)
Zhu, W., Lou, Q., Vang, Y.S., Xie, X.: Deep multi-instance networks with sparse label assignment for whole mammogram classification. In: Medical Image Computing and Computer Assisted Intervention - MICCAI 2017, pp. 603–611 (2017)
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp used for this research.