
1 Introduction

In the field of biomedical imaging and computer-aided diagnosis, image acquisition and analysis with more than one modality have been common practice for years, as different modalities contain abundant information that is complementary to each other. Taking the multi-modal segmentation task on the ISLES (Ischemic Stroke Lesion Segmentation) 2018 data set [1] as an example, each modality reveals a unique type of biological information about stroke-induced tissue changes. In particular, CBV, CBF, and MTT images are defined by blood characteristics, whereas raw CT images mainly reveal tissue contrast, which may be of less significance for identifying lesion regions. The combination of complementary information therefore enables comprehensive analysis and accurate segmentation.

In recent years, deep learning methodologies, especially convolutional neural networks (CNNs), have achieved state-of-the-art accuracy in medical image segmentation. With the rapidly increasing amount of multi-modal image data, CNN approaches have already been applied to multi-modal segmentation tasks [2,3,4,5,6]. However, the question of how to use multi-modal data effectively remains insufficiently investigated.

Most CNN-based segmentation approaches use multi-modal images by fusing the different modalities together. Some of these methods, known as early fusion, integrate multi-modal information at the level of the original images or low-level features. For instance, the early fusion in one study [3] learns information from MRI T1, MRI T2, and FA images, which are concatenated together as the input of the CNN model. This method does not take complex correlations between modalities into consideration. Other methods follow a late fusion strategy, using an independent network architecture for each modality and then combining the high-level feature maps. Late fusion shows better performance in [4], but it also significantly increases the complexity of the network and makes it more prone to over-fitting. Beyond early and late fusion, a recent study by Tseng et al. [6] presents a framework that combines four MRI modalities for brain tumor segmentation: the network first extracts low-level feature maps for each modality through multi-modal encoders and then fuses them with a cross-modality convolution block. Moreover, motivated by the concept of dense connection, [2] proposes a hyper-densely connected CNN that trains each modality in its own path and densely connects all the paths. In summary, previous methods treat all modalities equally, even though different modalities contribute unequally toward an accurate prediction.

Inspired by the success of the attention mechanism [7] and the LSTM structure [12], we propose MSAFusionNet, a multiple subspace attention method for fusing multiple modalities. This architecture is designed to effectively leverage the implicit correlations between different modalities. It uses our proposed inter-attention and generalized squeeze-and-excitation modules to generate attention weights and reweight the output feature maps by their importance. The network also integrates a U-Net architecture [9] and a densely-dilated convolution module [10], whose effectiveness has been demonstrated in medical and general image segmentation tasks.

In this study, we make four major contributions:

  • We propose MSAFusionNet to analyze multi-modal image data and achieve state-of-the-art segmentation accuracy on ISLES 2018.

  • We design a multiple subspace attention model to generate more representative fused feature maps.

  • We construct a multi-modal fusion network which leverages CNN-LSTMs for sequential multi-modal data.

  • We apply a densely-dilated U-Net backbone for the medical image segmentation task.

It is also worth noting that the proposed multiple subspace attention model can be easily embedded into other networks, and the multi-modal fusion network is compatible with any segmentation network backbone. We believe these two properties can promote the development of more effective methods.

2 Methods

2.1 Overview

MSAFusionNet, as shown in Fig. 1, consists of two major parts: a multi-modal fusion network (MFN) and an encoder-decoder network (EDN). MFN is designed to fuse multiple modalities and accepts input data in two formats: single images and sequences of images. In this study, following the ISLES 2018 data set, we consider the inputs to be 2D images and 3D volumes, each of the latter being a sequence of 2D images. The framework can be easily extended to 3D images and 4D volumes by replacing the 2D convolution operators with 3D convolution operators. EDN is built upon a modified U-Net structure in which a dense dilation convolution block replaces the original bottom layer. Moreover, a convolution layer feeds the output feature map of MFN into EDN, so the two networks can be trained jointly in an end-to-end manner.

Fig. 1. The overall framework of the proposed MSAFusionNet.

More specifically, MSAFusionNet leverages a novel multiple subspace attention (MSA) model which contains two types of modules: inter-attention (IA) modules and generalized squeeze-and-excitation (GSE) modules. The details of the MSA model are introduced in Sect. 2.2. In MFN, MSA serves to fully utilize information between and within modalities; in EDN, it helps extract more representative features and acts as the connection between the encoder and the decoder.

2.2 Multiple Subspace Attention (MSA)

As introduced, IA and GSE are the two modules of MSA. IA is an attention mechanism that embeds a subspace with an auxiliary gating signal, and GSE focuses on identifying the importance of a single dimension within a subspace.

Inter-Attention Module. Given a signal from one modality, information from the other modalities can serve as a helpful auxiliary through an appropriate gating mechanism. Moreover, feature maps at different scales can also augment each other. We therefore propose IA to facilitate information exchange among multiple input modalities and among feature maps at different scales.

Fig. 2. The inter-attention module.

Fig. 3. The generalized squeeze-and-excitation module.

As shown in Fig. 2, an IA module first concatenates a gating signal g with a feature map x into a concatenated feature map C. C is then analyzed by two convolution paths. One path serves as a gate function: it learns a gating attention map with values in [0, 1] via a convolution module with G layers in order to impose different attentions on C. The other path uses a convolution module with L layers to extract high-level joint features and then adds them to the low-level input features to preserve the original information. The output of the IA module is the element-wise product of the gating attention map and the feature map. The IA module, denoted as \(\mathcal {IA}(.,.)\), is given by:

$$\begin{aligned} \mathcal {IA}(x,g) = \theta (\mathcal {F}^{(L)}(C)+x) \otimes \sigma (\mathcal {F}^{(G)}(C)+C) \end{aligned}$$
(1)

where C is the concatenation of x and g. \(\mathcal {F}^{(G)}(.)\) and \(\mathcal {F}^{(L)}(.)\) denote convolution modules with G layers and L layers, respectively. \(\theta \) is an activation function such as ReLU or ELU, and \(\sigma \) is a sigmoid function to maintain the range of the gating attention map.
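For illustration, a minimal tf.keras sketch of Eq. (1) could look as follows; the kernel sizes, the ReLU activations, and the final 1×1 convolution that matches the channel widths of the two paths are our own assumptions rather than details specified above.

```python
from tensorflow.keras import layers


def inter_attention(x, g, G=2, L=2):
    """Sketch of Eq. (1): IA(x, g) = theta(F_L(C) + x) (*) sigma(F_G(C) + C)."""
    ch = x.shape[-1]
    c = layers.Concatenate(axis=-1)([x, g])                       # C = concat(x, g)

    # Gating path: G conv layers with a residual from C, squashed to [0, 1];
    # the 1x1 convolution matching the feature-path channels is an
    # implementation choice, not stated in the text.
    gate = c
    for _ in range(G):
        gate = layers.Conv2D(c.shape[-1], 3, padding="same", activation="relu")(gate)
    gate = layers.Add()([gate, c])
    gate = layers.Conv2D(ch, 1, activation="sigmoid")(gate)

    # Feature path: L conv layers plus a residual from the low-level input x.
    feat = c
    for _ in range(L):
        feat = layers.Conv2D(ch, 3, padding="same")(feat)
    feat = layers.Activation("relu")(layers.Add()([feat, x]))

    return layers.Multiply()([feat, gate])                        # element-wise product
```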

Generalized Squeeze-and-Excitation Module. The squeeze-and-excitation unit has proven successful in many studies by putting more attention on channels that carry more semantic information [13]. In this study, we extend it to a generalized squeeze-and-excitation (GSE) module that can learn the importance of each element along any axis.

The input of a GSE module is a parameter space of size \(\varPhi \times V\), where V is the dimension on which the attention is applied and \(\varPhi \) represents the remaining dimensions. As shown in Fig. 3, global average pooling, denoted as \(\mathcal {GAP}(.)\), is first applied to obtain local statistics along the attention dimension. Next, two cascaded densely connected convolution layers learn the underlying principal components of the subspace with dimension V. The resulting attention weight vector gates the original signal via an element-wise multiplication along the attention dimension. The GSE module, denoted as \(\mathcal {GSE}(.)\), is therefore described as:

$$\begin{aligned} \mathcal {GSE}(x) = x\otimes \sigma (W_{us}\cdot \theta (W_{ds}\cdot \mathcal {GAP}(x)+b_{ds})+b_{us}) \end{aligned}$$
(2)

where \(W_{us}\) and \(W_{ds}\) are the weights of the upsampling and downsampling convolution layers, respectively. \(b_{us}\) and \(b_{ds}\) are the corresponding biases. \(\theta \) is a ReLU function and \(\sigma \) is a sigmoid function.
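A minimal sketch of Eq. (2), assuming the attention dimension V is the channel axis of a 2D feature map and using densely connected layers with a placeholder reduction ratio r:

```python
from tensorflow.keras import layers


def generalized_se(x, r=4):
    """GSE(x) = x * sigmoid(W_us . relu(W_ds . GAP(x) + b_ds) + b_us)."""
    v = x.shape[-1]                                         # size of the attention dimension V
    s = layers.GlobalAveragePooling2D()(x)                  # squeeze: statistics along the attention axis
    s = layers.Dense(max(v // r, 1), activation="relu")(s)  # downsampling layer (W_ds, b_ds)
    s = layers.Dense(v, activation="sigmoid")(s)            # upsampling layer (W_us, b_us)
    s = layers.Reshape((1, 1, v))(s)                        # broadcast over the spatial axes
    return layers.Multiply()([x, s])                        # excite: gate the original signal
```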

2.3 Multi-modal Fusion Network (MFN)

The input data X contains a set of modalities in the format of 2D images, denoted as \(\mathbb {P}\), and a set of modalities in the format of 3D volumes, denoted as \(\mathbb {Q}\). We first generate the feature map of one modality i from the 2D modality set \(\mathbb {P}\) by an IA module, which is given by:

$$\begin{aligned} Y_{s}^{(i)}= \mathcal {IA}(\mathcal {P}_s(X_{2D}^{(i)}), \mathcal {R}_s(\{\mathcal {P}_s(X_{2D}^{(k)})| k\ne i, k\in \mathbb {P}\})) \end{aligned}$$
(3)

where \(\mathcal {P}_s(.)\) is a convolution unit with two layers and \(\mathcal {R}_s(.)\) is a channel reduction function applied to the concatenated feature maps of the modalities in \(\mathbb {P}\) other than i. Following a similar approach, an IA module is applied to obtain the feature map of one modality j from the 3D modality set \(\mathbb {Q}\), given by:

$$\begin{aligned} Y_{d}^{(j)}= \mathcal {IA}(\mathcal {P}_d(X_{3D}^{(j)}), \mathcal {R}_d(\{\mathcal {P}_d(X_{3D}^{(l)})| l\ne j, l\in \mathbb {Q}\})) \end{aligned}$$
(4)

where \(\mathcal {P}_d(.)\) represents a 3D convolution unit, which consists of two CNN-LSTM layers in our design, and \(\mathcal {R}_d(.)\) is a channel reduction function that works in the same manner as \(\mathcal {R}_s(.)\).
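As a rough sketch of \(\mathcal {P}_d(.)\), the two CNN-LSTM layers could be realized with Keras ConvLSTM2D; the filter count and kernel size below are placeholders.

```python
from tensorflow.keras import layers


def p_d(x_seq, filters=16):
    """x_seq: a (time, H, W, channels) sequence such as 4DPWI (batch axis implicit)."""
    h = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(x_seq)
    h = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=False)(h)
    return h   # a 2D feature map that can be fused with the 2D-modality features
```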

We concatenate \(\{Y_s^{(i)}|\ i\in \mathbb {P}\}\) into \(Y_s\), and \(\{Y_d^{(j)}|\ j\in \mathbb {Q}\}\) into \(Y_d\). \(Y_s\) and \(Y_d\) are then processed by GSE modules to adaptively calibrate attention-wise feature responses. They are further merged by IA modules, each using the other as the gating signal. Finally, a GSE module is applied to the concatenated IA results to generate the final combined feature map Y. The entire procedure is:

$$\begin{aligned} Y = \mathcal {GSE}\big (\mathcal {IA}\big (\mathcal {GSE}(Y_s), \mathcal {GSE}(Y_d)\big ),\mathcal {IA}\big (\mathcal {GSE}(Y_d), \mathcal {GSE}(Y_{s})\big )\big ) \end{aligned}$$
(5)
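The following sketch composes Eqs. (3)-(5), reusing the inter_attention and generalized_se sketches above; the channel reduction \(\mathcal {R}(.)\) is simplified to a 1×1 convolution, the reduced channel width ch is a placeholder, and for readability the code assumes each modality set contains at least two modalities.

```python
from tensorflow.keras import layers


def channel_reduce(tensors, ch):
    """R(.): concatenate the other modalities' feature maps and reduce the channels."""
    t = tensors[0] if len(tensors) == 1 else layers.Concatenate(axis=-1)(tensors)
    return layers.Conv2D(ch, 1, activation="relu")(t)


def fuse_modalities(feats_2d, feats_3d, ch=16):
    """feats_2d / feats_3d: per-modality feature maps produced by P_s(.) / P_d(.)."""
    # Eqs. (3) and (4): each modality is gated by the reduced map of the other modalities.
    y_s = [inter_attention(f, channel_reduce([o for j, o in enumerate(feats_2d) if j != i], ch))
           for i, f in enumerate(feats_2d)]
    y_d = [inter_attention(f, channel_reduce([o for j, o in enumerate(feats_3d) if j != i], ch))
           for i, f in enumerate(feats_3d)]
    y_s = generalized_se(layers.Concatenate(axis=-1)(y_s))
    y_d = generalized_se(layers.Concatenate(axis=-1)(y_d))
    # Eq. (5): cross-gate the two groups and calibrate the concatenated result.
    fused = layers.Concatenate(axis=-1)([inter_attention(y_s, y_d),
                                         inter_attention(y_d, y_s)])
    return generalized_se(fused)
```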

2.4 Encoder-Decoder Network (EDN)

U-Net with Dense Dilation Module. We adopt U-Net, which is commonly suggested as the encoder-decoder backbone for small medical data sets [9]. More specifically, we use a deep U-Net structure with five blocks, whose convolution layers have 16, 32, 64, 128, and 256 filters, respectively. Each block contains two convolution layers with the same kernel size of \(3\times 3\) and a dropout layer between them. The transitions between blocks are implemented by a \(2\times 2\) max pooling layer in the encoder and a transpose convolution layer with \(2\times 2\) strides in the decoder. All convolutions are followed by a batch normalization layer to facilitate the learning process.
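A sketch of one encoder block under these settings; the dropout rate is a placeholder, since its value is not given above.

```python
from tensorflow.keras import layers


def unet_encoder_block(x, filters, dropout_rate=0.2):
    """One encoder block: conv-BN-ReLU, dropout, conv-BN-ReLU, then 2x2 max pooling."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    skip = layers.Activation("relu")(x)                # kept for the decoder connection at this scale
    down = layers.MaxPooling2D(pool_size=(2, 2))(skip)
    return skip, down

# The five blocks use 16, 32, 64, 128, and 256 filters; the decoder mirrors
# them with Conv2DTranspose(filters, 2, strides=2) transitions.
```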

Different from the traditional U-Net structure [9], we replace the bottom block with a dense dilation module [11]. The dilation rates are set to \(\{1,2,3,4\}\) to achieve larger receptive fields. This design suits cases such as the ISLES 2018 data set, which exhibits a large range of lesion sizes: the enlarged receptive field helps large lesions, e.g., those that nearly cover the whole image, reveal clearer boundaries.
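A minimal sketch of a densely connected dilated-convolution block with rates \(\{1,2,3,4\}\); the exact wiring of [11] may differ, so this only illustrates the idea of concatenative connections between stacked dilated convolutions.

```python
from tensorflow.keras import layers


def dense_dilation_block(x, filters):
    """Stack dilated 3x3 convolutions with rates 1-4 and dense (concatenative) connections."""
    feats = [x]
    for rate in (1, 2, 3, 4):
        inp = feats[0] if len(feats) == 1 else layers.Concatenate(axis=-1)(feats)
        out = layers.Conv2D(filters, 3, padding="same",
                            dilation_rate=rate, activation="relu")(inp)
        feats.append(out)
    return layers.Concatenate(axis=-1)(feats[1:])
```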

MSA in EDN. As mentioned in Sect. 2.2, IA and GSE modules can be regarded as attentions on multiple subspaces with different dimensions, with or without an auxiliary gating signal. These two modules can be used in the encoder-decoder structure as well. We apply IA and GSE in EDN as follows. First, before each transition, a GSE module adjusts the interest in different feature maps along the channel axis. Second, an IA module is used in the connection between the encoder and the decoder at every scale: we gate the features from the encoder with the feature map from the decoder. This design reduces the dependence on potentially unnecessary areas of the encoder features, so that the connections leverage the information contained in the network more effectively.
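A sketch of such an attended skip connection, reusing the inter_attention sketch from Sect. 2.2; the concluding concatenation with the decoder features is our assumption of how the gated encoder features re-enter the decoder.

```python
from tensorflow.keras import layers


def attended_skip(encoder_feat, decoder_feat):
    """Gate the encoder features with the decoder feature map at the same scale."""
    gated = inter_attention(encoder_feat, decoder_feat)           # decoder map as the gating signal
    return layers.Concatenate(axis=-1)([gated, decoder_feat])     # usual U-Net concatenation afterwards
```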

2.5 Loss Function

For the ISLES 2018 segmentation task, the network is jointly trained with a hybrid loss function consisting of a cross-entropy loss and a DICE coefficient loss, given by:

$$\begin{aligned}&L = \frac{1}{|\mathbb {K}|}\sum \nolimits _{i\in \mathbb {K}}(\omega _i L_{DICE}^{(i)} + \lambda L_{CE}^{(i)}),\quad \omega _i = \frac{|\mathbb {K}|}{S^{(i)}\times M} \end{aligned}$$
(6)
$$\begin{aligned}&L_{DICE}^{(i)} =2 \frac{|Z^{(i)}\cap G^{(i)}|}{|Z^{(i)}|+|G^{(i)}|} \end{aligned}$$
(7)
$$\begin{aligned}&L_{CE}^{(i)} = \frac{1}{|\varOmega ^{(i)}|}\sum \nolimits _{j\in \varOmega ^{(i)}}[-g^{(i)}_j\log (z^{(i)}_j)-(1-g^{(i)}_j)\log (1-z^{(i)}_j)] \end{aligned}$$
(8)

where M is the number of scans, \(\mathbb {K}\) denotes the set of slices from all scans, \(S^{(i)}\) is the number of slices in the scan to which the i-th slice in \(\mathbb {K}\) belongs, \(\omega _i\) is the weight of the i-th slice, and \(\lambda \) is a relaxation parameter set to 0.5 in the following experiments. \(\varOmega ^{(i)}\) is the voxel set of the i-th slice, and \(Z^{(i)}=\{z^{(i)}_j|j\in \varOmega ^{(i)}\}\) and \(G^{(i)}=\{g^{(i)}_j|j\in \varOmega ^{(i)}\}\) are the predicted mask and the ground-truth mask of the i-th slice, respectively. Note that we use \(\omega _i\) to adjust the importance of each slice based on the number of slices in the corresponding scan, because the number of slices per scan varies widely in the ISLES 2018 data set. Since many scans contain only two slices, which limits the power of 3D convolution operators, we train the network at the slice level rather than the scan level. At the same time, the volumetric accuracy of the prediction over a whole scan is what ultimately matters, so \(\omega _i\) balances scans with many slices against those with only two.
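A per-slice sketch of Eqs. (6)-(8) in TensorFlow; the DICE term is written here as \(1-\mathrm {DICE}\) so that both terms are minimized, which is a common implementation choice rather than something stated explicitly above, and eps is a numerical-stability placeholder.

```python
import tensorflow as tf


def hybrid_slice_loss(y_true, y_pred, slice_weight, lam=0.5, eps=1e-6):
    """Per-slice term of Eq. (6); slice_weight corresponds to omega_i."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])

    # Soft DICE term (Eq. 7), negated into a loss.
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + eps) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)
    dice_loss = 1.0 - dice

    # Pixel-wise binary cross-entropy term (Eq. 8).
    ce = tf.reduce_mean(-y_true * tf.math.log(y_pred + eps)
                        - (1.0 - y_true) * tf.math.log(1.0 - y_pred + eps))

    # Eq. (6): omega_i weights the DICE term, lambda relaxes the CE term.
    return slice_weight * dice_loss + lam * ce


def slice_weight(total_slices, slices_in_scan, num_scans):
    """omega_i = |K| / (S_i * M)."""
    return total_slices / (slices_in_scan * num_scans)
```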

3 Experiments on the ISLES Data Set

3.1 Data Set and Pre-processing

We evaluate the performance of the proposed MSAFusionNet on the ISLES 2018 data set, with six modalities as inputs, namely CT, CBV, CBF, MTT, Tmax, and 4DPWI. The first five modalities are 2D images, while 4DPWI is a 3D volume. There are 63 patients and 94 scans for training, because the stroke lesion volumes of some patients are split into two scans. The number of slices per scan ranges from 2 to 22. Standard normalization is applied to each modality. Data augmentation, including random horizontal/vertical flips, width/height shifts, zoom, and rotation, is adopted to reduce over-fitting caused by the small data size.
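The augmentation configuration could be expressed with Keras' ImageDataGenerator as sketched below; the specific ranges are placeholders, since only the transformation types are named above.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical ranges for the named transformations.
augmenter = ImageDataGenerator(
    horizontal_flip=True,     # random horizontal flip
    vertical_flip=True,       # random vertical flip
    width_shift_range=0.1,    # width shift
    height_shift_range=0.1,   # height shift
    zoom_range=0.1,           # zoom
    rotation_range=15,        # rotation, in degrees
)
```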

3.2 Training and Evaluation

The network is implemented with TensorFlow. The Adam optimizer is used with an initial learning rate of 0.0003, and the learning rate is reduced by a factor of 0.1 if the loss does not improve for 30 consecutive epochs. The batch size is set to four slices per GPU, and training is stopped after 120 epochs. In the evaluation, we first conduct a series of ablation studies to show the effectiveness of each component of MSAFusionNet and then evaluate the entire MSAFusionNet by comparing it with other methods from the ISLES 2018 leader board.
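The stated training configuration maps to tf.keras roughly as follows; model, x_train, y_train, and hybrid_loss are assumed to be defined elsewhere (e.g., hybrid_loss wrapping the per-slice term from Sect. 2.5).

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

model.compile(optimizer=Adam(learning_rate=3e-4), loss=hybrid_loss)
reduce_lr = ReduceLROnPlateau(monitor="loss", factor=0.1, patience=30)  # reduce LR on plateau
model.fit(x_train, y_train,
          batch_size=4,            # four slices per GPU
          epochs=120,
          callbacks=[reduce_lr])
```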

The settings of the ablation studies are shown in Table 1. The baseline does not include MSA in either MFN or EDN; in addition, it ignores the sequential property of 4DPWI and applies CNNs separately to each image of 4DPWI. Compared with the baseline, No-MSA-FusionNet replaces those CNNs with CNN-LSTM layers for 4DPWI but still omits MSA. Sub-MSA-FusionNet adds MSA in EDN, and MSAFusionNet incorporates the complete design in Fig. 1.

In this experiment, we use DICE to evaluate the performance. Table 1 shows that the MSA model improves the performance significantly, whether it is applied in the fusion network or in the encoder-decoder backbone individually. It is also clear that the full MSAFusionNet outperforms the other settings. For the modality with sequential data, the CNN-LSTM layers bring marginal improvements over convolutions alone. These improvements are also visible in Fig. 4. The segmentation results produced by the full MSAFusionNet are much closer to the ground truth than those produced by any incomplete variant.

Table 1. The results of ablation studies.
Table 2. The results on the ISLES 2018 testing leader board.
Fig. 4. Segmentation results. Images are from four patients, and different modalities are shown. In particular, we show time points T0 and T1 of 4DPWI due to the page limit. We compare the results from MSAFusionNet (blue) with the ground truth (solid red), as well as the results from the ablation studies: Sub-MSA-FusionNet (purple), No-MSA-FusionNet (yellow), and the baseline (green). (Color figure online)

Moreover, we compare MSAFusionNet with the state-of-the-art methods on the testing leader board of ISLES 2018. As shown in Table 2, our method achieves the best AVD (Absolute Volume Difference) score, which is the most important criterion for lesion volumetry studies, with the other scores comparable to those of the top methods.

4 Conclusion and Future Work

In this paper, we present a novel framework called MSAFusionNet that can effectively integrate multi-modal medical image data, and we demonstrate its competitive performance on the ISLES 2018 segmentation challenge. We believe its applicability goes beyond ISLES 2018, because the proposed multiple subspace attention model can be easily embedded into other networks and the multi-modal fusion network is compatible with any segmentation backbone. In the future, we will identify and use additional multi-modal image data sets to further validate and improve our framework. It would also be interesting to extend the current framework to handle multi-modal fusion over time [14].