Deep Angular Embedding and Feature Correlation Attention for Breast MRI Cancer Analysis

Luyang Luo¹⁶,
Hao Chen¹⁷,
Xi Wang¹⁶,
Qi Dou¹⁸,
Huangjing Lin¹⁶,
Juan Zhou¹⁹,
Gongjie Li²⁰ &
…
Pheng-Ann Heng¹⁶

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11767)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

Accurate and automatic analysis of breast MRI plays a vital role in early diagnosis and successful treatment planning for breast cancer. Owing to the heterogeneous nature of tumors, precise diagnosis remains a challenging task. In this paper, we propose to identify breast tumors in MRI with a deep learning (DL) model trained by a Cosine Margin Sigmoid Loss (CMSL) and to localize possible cancer lesions with a COrrelation Attention Map (COAM) based on the learned features. The CMSL embeds tumor features onto a hyper-sphere and imposes a decision margin through cosine constraints. In this way, the DL model can learn more separable inter-class features and more compact intra-class features in the angular space. Furthermore, we utilize the correlations among feature vectors to generate attention maps that can accurately localize cancer candidates with only image-level labels. We build the largest breast cancer dataset, involving 10,290 DCE-MRI scan volumes, for developing and evaluating the proposed methods. The model driven by CMSL achieved a classification accuracy of 0.855 and an AUC of 0.902 on the testing set, with sensitivity and specificity of 0.857 and 0.852, respectively, outperforming competitive methods overall. In addition, the proposed COAM localized the cancer center more accurately than other state-of-the-art weakly supervised localization methods.

1 Introduction

Breast cancer is the most common malignancy affecting women worldwide [1]. Early diagnosis of breast cancer is essential for successful treatment planning, and Magnetic Resonance Imaging (MRI) plays a vital role in screening high-risk populations [2]. Clinically, radiologists use the Breast Imaging-Reporting and Data System (BI-RADS) to categorize breast lesions into levels that indicate different degrees of cancer risk, according to the phenotypic characteristics the lesions present in MRI images. However, such assessment suffers from inter-observer variance and often relies subjectively on the radiologists’ experience. Moreover, owing to the heterogeneous nature of tumors, tumors with the same pathological result (malignant or benign) can show diverse patterns and hence receive different BI-RADS assessments. In other words, tumors can possess ambiguous inter-class differences and large intra-class variance, which poses a severe challenge to accurate diagnosis of breast cancer.

Generally, there are two major tasks in breast MRI tumor analysis: identification of tumors and localization of cancer candidates. Recently, Deep Learning (DL) based approaches have demonstrated great potential in assisting the diagnosis of breast cancer in an automatic and efficient manner. Previous studies manually annotated tumors and deliberately extracted the corresponding slices or patches for classification [3, 4]. Such methods depended on careful annotations for both training and testing and could not easily be adapted to clinical application. Meanwhile, Amit et al. [5] proposed to first localize the lesions automatically and then classify cancer candidates in a second stage. Although the inference stage was thereby free of lesion delineation, the method still required annotations for model training. To get rid of manual lesion extraction, Maicas et al. [6] proposed to meta-train a breast MRI cancer classifier with only image-level labels. However, all of the mentioned studies were limited to small datasets and consequently lacked validation of their generalization. More importantly, the relatively low precision or specificity reported in these works implies that the problem of ambiguous inter-class difference and large intra-class variance has not yet been addressed.

To this end, we propose a Cosine Margin Sigmoid Loss (CMSL) to tackle the heterogeneity problem in breast tumor classification and a COrrelation Attention Map (COAM) for precise localization of cancer candidates, both with image-level labels only. The CMSL is extended from the cosine loss originally designed for face verification [7]. It embeds the deep feature vectors onto a hyper-sphere and learns a decision margin between classes in the angular feature space. As a result, the learned features possess more compact intra-class variance and more separable inter-class difference. In addition, we observe a Region of Interest (RoI) shifting problem when localizing cancer by class activation map [8]. Hence, we propose a novel weakly supervised method, i.e., COAM, to localize cancer candidates more accurately by leveraging deep feature correlations based on the Gram matrix. Furthermore, we build the largest breast DCE-MRI dataset, including 10,290 volume scans from 1715 subjects, to develop and evaluate our methods.

2 Methods

Our framework for breast MRI tumor analysis consists of two parts, as illustrated in Fig. 1. The first is tumor classification by a DL network driven by deep angular embedding. The second is weakly supervised localization of cancer candidates with a feature correlation attention map.

2.1 Cosine Margin Sigmoid Loss for Tumor Classification

The phenotype of tumors has ambiguous inter-class difference and large intra-class variance. Accordingly, the features learned by the DL model could inherit these characteristics. To address this issue, we start by revisiting the traditional sigmoid loss for the binary classification problem. Given the input feature vector x of the last fully connected (FC) layer and its corresponding label y, the binary sigmoid loss is as follows:

$$\begin{aligned} \mathcal {L}(w;x) =&-y\cdot \text {log}(p(y \mid x))-(1-y)\cdot \text {log}(1-p(y \mid x)) \end{aligned}$$

(1)

$$\begin{aligned} =&-y\cdot \text {log}(\frac{1}{1+e^{-w^{T} x}})-(1-y)\cdot \text {log}(1-\frac{1}{1+e^{-w^{T} x}}) \end{aligned}$$

(2)

where w is the weight parameter of the FC layer, and $p(y \mid x)$ represents the probability of x being classified to y. To distinguish different classes, the DL model is expected to give different predictions by adjusting the value of $w^{T}x$. Notice that $w^{T}x=\Vert w\Vert \Vert x\Vert cos\theta $, where $\theta $ is the angle between feature vector x and weight vector w, and $\Vert \cdot \Vert $ is the $L_{2}$ norm operation. Generally, the DL model would implicitly alter $\Vert w\Vert $ and $\Vert x\Vert $ in the Euclidean space and $cos\theta $ in the angular space. However, the aforementioned heterogeneity issue could lead to ambiguous features that are quite hard to discriminate. To this end, constraints on feature distances are considered to regulate the DL model for more separable inter-class features and more compact intra-class features [7]. Since Euclidean distance is not bounded and hence difficult to constrain, we prefer to add regularization on the angular distance which is bounded by $-1 \le cos\theta \le 1$. Specifically, we eliminate the influence of the norms $\Vert x\Vert $ and $\Vert w\Vert $ by modifying the computation of $p(y \mid x)$ to:

$$\begin{aligned} p(y \mid x) = \frac{1}{1+e^{-s\frac{w^{T}x}{\Vert w\Vert \Vert x\Vert }}} = \frac{1}{1+e^{-s\cdot cos\theta }} \end{aligned}$$

(3)

where s is a hyper-parameter adjusting the slope of the sigmoid function and controlling the back-propagated gradient values. If s is too small, the loss cannot converge to 0 because the sigmoid function is not able to reach its saturation area, given that $-1 \le cos\theta \le 1$. In contrast, if s is set to a large value, the sigmoid function could easily reach the saturation area and result in small gradients, which prevents the network from learning sufficient knowledge. Following [7], we refer to the loss with modified p in Eq. (3) as Normalized Sigmoid Loss (NSL), which focuses on separating features in the angular space with the decision boundary $cos\theta =0$ for both classes. Geometrically, we embed the feature vector and the weight vector onto a hyper-sphere whose radius is tuned by s. However, the ambiguous features can still lie near this boundary. Therefore, we add explicit guidance to NSL as follows:

$$\begin{aligned} \mathcal {L}(w;x) = -y\cdot \text {log}(\frac{1}{1+e^{-s \cdot (cos\theta -I(y) \cdot m)}})-(1-y)\cdot \text {log}(1-\frac{1}{1+e^{-s \cdot (cos\theta -I(y) \cdot m)}}) \end{aligned}$$

(4)

where $I(\cdot )$ is an indicator function. $I(y) = 1$ if $y = 1$ and $I(y) = -1$ otherwise. m is a hyper-parameter that changes the decision boundaries for separating two classes (0 and 1 for benign and malignant) to: $B_{0}: cos\theta + m < 0$ and $B_{1}: cos\theta - m > 0$. Hence a decision margin is imposed by m in the angular space to make the learned inter-class features more separable. Consequently, the distribution space of features shrinks, which eventually leads to more compact intra-class features. Figure 2 shows a comparison among different sigmoid functions and the corresponding geometric illustrations.
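To make Eq. (4) concrete, the following is a minimal NumPy sketch of the loss for a batch of feature vectors. It is an illustration rather than the training code used in this work: the function name `cmsl_loss`, the small constant `eps`, and the averaging over the batch are assumptions made for the example. The defaults s = 20 and m = 0.35 are the values reported in Sect. 3.1, and setting m = 0 recovers the NSL of Eq. (3).

```python
import numpy as np

def cmsl_loss(x, w, y, s=20.0, m=0.35):
    """Cosine margin sigmoid loss of Eq. (4) for a batch (illustrative sketch).

    x: (B, D) feature vectors, w: (D,) weight vector, y: (B,) labels in {0, 1}.
    Returns the batch-mean loss (averaging is an assumption) and cos(theta).
    """
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    eps = 1e-12  # assumption: guards against division by zero norms
    cos_theta = (x @ w) / (np.linalg.norm(x, axis=1) * np.linalg.norm(w) + eps)
    indicator = np.where(y == 1, 1.0, -1.0)  # I(y): +1 for malignant, -1 for benign
    z = s * (cos_theta - indicator * m)
    # Numerically stable forms:
    #   -log(sigmoid(z))     = logaddexp(0, -z)
    #   -log(1 - sigmoid(z)) = logaddexp(0,  z)
    per_sample = y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z)
    return per_sample.mean(), cos_theta
```

For a malignant sample with $cos\theta = 0.2$, the NSL (m = 0) already treats the sample as correctly classified, whereas the CMSL still penalizes it because $cos\theta - m < 0$; this is the effect of the margin described above.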

2.2 Feature Correlation Attention for Cancer Localization

With highly informative deep features learned by the network, localization of cancer candidates can provide further clinical reference. Our secondary goal is therefore to distinguish possible cancer from other lesion mimics and localize it. Deep learning studies commonly use the Class Activation Map (CAM) [8] to obtain regions of interest (RoIs) when only image-level labels are available. However, this method does not generalize well to our case because of an observed RoI shifting problem. As the CNN goes deeper, the receptive fields of neurons become larger, so the neighbors of the tumor feature also capture views of the tumor patch. Since the deep features could still be ambiguous, the classifier layer may tend to find discriminative patterns in these neighbors. As a consequence, the RoI generated by CAM would shift away from the desired target.

To tackle this problem, we draw on two insights about our task. First, feature vectors with the same semantics (malignant or normal) ought to have higher correlations with each other than with those of different semantics. Second, through a series of rectified linear units, the network would implicitly learn larger activation values for features related to suspicious cancer patches (with the label “1”) and smaller activation values for features related to normal patches (with the label “0”). Based on these two intuitions, we propose to leverage the Gram matrix [9] to find the RoI. Given the deep feature map $X \in \mathbb {R}^{H\times W \times S \times C}$ generated from the last activation layer, where H, W, S and C are the height, width, number of slices and number of channels, respectively, we first reshape X to $X' \in \mathbb {R}^{N \times C}$, where $N=H\times W \times S$. Afterwards, we compute an attention vector $M \in \mathbb {R}^{N}$ as follows:

$$\begin{aligned} M_{i} = \sum _{j=1}^{N}G_{i,j} = \sum _{j=1}^{N}\sum _{k=1}^{C}X'_{i,k}X'_{j,k} \end{aligned}$$

(5)

where $G \in \mathbb {R}^{N\times N}$ is the Gram matrix over the set of deep feature vectors in $X'$. Each entry $G_{i,j}$ is the inner product of $X'_{i}$ and $X'_{j}$, representing the correlation between the i-th and j-th vectors. The columns of the Gram matrix are then summed to form the attention vector M. Because our network is trained for binary classification, it learns a gap between the large activation values of feature vectors related to suspicious cancer patches and the small activation values of those related to normal patches. Correspondingly, the correlation values are also relatively large or small according to the activation values of the features. Inspired by [10], each column $G_{i}$ of the Gram matrix can be interpreted as a sub-attention map implying the network’s attention to the class that the i-th vector belongs to. Thus, Eq. (5) is actually an element-wise summation over all sub-attention maps. Moreover, since G is symmetric, $M_{i}$ also equals the sum of the entries of column $G_{i}$. Essentially, $\sum _{j=1}^{N} G_{i,j}$ indicates the importance of the i-th feature as determined by the sub-attention of the feature map at its i-th position. Finally, we simply reshape M to size $H\times W\times S$ to obtain an attention map purely based on the feature correlations. We refer to this method as COrrelation Attention Map (COAM). It is worth mentioning that COAM is related to the self-attention mechanism [10] and the stationary feature space representation [9]. However, our work differs in that the Gram matrix is not involved in any optimization stage and is directly used for attention generation.
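Eq. (5) can be sketched in a few lines of NumPy. The sketch below is illustrative only; the function names `coam` and `coam_explicit` are hypothetical. Since $M_{i} = \sum _{j} \langle X'_{i}, X'_{j} \rangle = \langle X'_{i}, \sum _{j} X'_{j} \rangle $, the first function avoids building the $N\times N$ Gram matrix, which may matter when N is large; the second follows Eq. (5) literally and is included only as a cross-check on small inputs.

```python
import numpy as np

def coam(feature_map):
    """Correlation attention map from a feature map of shape (H, W, S, C)."""
    feature_map = np.asarray(feature_map, dtype=np.float64)
    h, w, s, c = feature_map.shape
    x = feature_map.reshape(h * w * s, c)  # X' with N = H * W * S rows
    # M_i = <X'_i, sum_j X'_j>, so the N x N Gram matrix is never materialised.
    m = x @ x.sum(axis=0)
    return m.reshape(h, w, s)

def coam_explicit(feature_map):
    """Same map via the explicit Gram matrix G = X' X'^T (small N only)."""
    feature_map = np.asarray(feature_map, dtype=np.float64)
    h, w, s, c = feature_map.shape
    x = feature_map.reshape(h * w * s, c)
    gram = x @ x.T                          # G[i, j] = <X'_i, X'_j>
    return gram.sum(axis=1).reshape(h, w, s)
```

Both functions return a map on the feature grid; resizing it to the input resolution, as done for the localization experiments, is a separate interpolation step not shown here.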

3 Experiments and Results

3.1 Implementation Details

Dataset. We built the largest breast tumor Dynamic Contrast Enhanced (DCE) MRI dataset involving 10,290 scans from 1715 subjects, with 1137 cases containing malignant tumors and 578 cases containing benign tumors. All of the scans were conducted with a 1.5-T Siemens system. We collected 6 DCE-MRI subtraction scans and 1 non-fat suppressed T1 scan from each subject. BI-RADS categories were assessed by 3 radiologists. Pathological labels were given by biopsy or surgery diagnosis. The data were randomly divided into training, validation, and testing sets with 1204, 165, and 346 subjects, respectively.

Preprocessing. Frangi’s approach [11] was first applied to the slices of each non-fat suppressed T1 scan to detect evident edges. Next, thresholding, small connected component removal, and hole filling were employed to obtain coarse breast region masks. The 2D masks were then stacked into volumes and smoothed with a Gaussian filter. The resulting 3D masks were used to segment the DCE-MRI scans. Note that the two modalities were already registered by the scanner. Finally, we clipped and normalized the intensity values, concatenated the six subtraction scans, and fixed the image size to $340\times 220\times 128$ by cropping or padding.
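The final size-fixing step can be sketched as follows. The text above does not specify where the crop or padding is placed or which value is used for padding, so the centered placement and zero padding below are assumptions, as is the function name `crop_or_pad`. Only the leading three spatial axes are resized; a trailing channel axis holding the concatenated subtractions is left untouched.

```python
import numpy as np

def crop_or_pad(volume, target_shape=(340, 220, 128), pad_value=0.0):
    """Center-crop or pad the first three axes of `volume` to `target_shape`.

    Assumptions (not stated in the paper): cropping and padding are centered,
    and padded voxels are filled with `pad_value`.
    """
    volume = np.asarray(volume)
    for axis, target in enumerate(target_shape):
        size = volume.shape[axis]
        if size > target:
            # Keep the central `target` elements along this axis.
            start = (size - target) // 2
            volume = np.take(volume, np.arange(start, start + target), axis=axis)
        elif size < target:
            # Split the missing elements between both ends of the axis.
            before = (target - size) // 2
            after = target - size - before
            pad_width = [(0, 0)] * volume.ndim
            pad_width[axis] = (before, after)
            volume = np.pad(volume, pad_width, mode="constant",
                            constant_values=pad_value)
    return volume
```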

Training Strategy. We used 3D ResNet34 [12] as the base model and replaced the global average pooling layer and FC layer with a $1\times 1\times 1$ convolutional layer followed by a pooling layer. The hyper-parameters s and m were set to 20 and 0.35, respectively, similar to [7]. The learning rate was initially set to $10^{-4}$ and decreased by a factor of ten when the training error stagnated. The base model was trained until convergence and then employed to initialize all other methods.

3.2 Evaluation and Comparison

Tumor Classification. We compared several deep learning methods: (1) 2D MIL: a multi-instance method aggregating features from 2D slices by 2D ResNet34 [13]; (2) 3D ResNet: a 3D implementation of ResNet34; (3) 3D Sparse MIL: a sparse label assignment method [14]; (4) 3D DK-MT: a domain knowledge driven multi-task learning network [15]; (5) 3D ResNet+NSL: the normalized sigmoid loss based on (2); (6) 3D ResNet+CMSL: our proposed CMSL based on (2). We computed the accuracy, specificity, sensitivity, F1 score, and AUC as the evaluation metrics. Experimental results are reported in Table 1.
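For reference, the five metrics can be computed from labels and predicted scores as in the sketch below. This is a generic illustration and not the evaluation script used for Table 1: the function name `binary_metrics` and the decision threshold of 0.5 are assumptions, and the AUC is computed as the fraction of (malignant, benign) pairs ranked correctly, with ties counted as one half. Both classes are assumed to be present.

```python
import numpy as np

def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, F1 and AUC for binary labels.

    y_true: labels in {0, 1}; y_score: predicted probability of class 1.
    The threshold of 0.5 is an assumption made for this sketch.
    """
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=np.float64)
    y_pred = (y_score >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))

    # AUC as the probability that a random positive outscores a random negative.
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

    return {
        "accuracy": (tp + tn) / y_true.size,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "auc": float(auc),
    }
```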

Compared with 2D methods, 3D models achieved better results by utilizing information from one more dimension. Both 3D Sparse MIL and 3D DK-MT adopted additional knowledge, leading to better performance than the vanilla 3D ResNet. Notably, 3D DK-MT showed poor specificity, which may be due to imbalanced auxiliary knowledge (more BI-RADS 4 and 5 than 3) dominating the learning process. For 3D ResNet+NSL, which is based on deep angular embedding, simply taking the features into angular space without the margin constraint caused a certain performance decay. This indicates that the network could not learn sufficient knowledge if s is too large. In contrast, our proposed 3D ResNet+CMSL significantly improved the results, as the imposed cosine margin forces the network to learn more underlying discriminative patterns. Our method achieved the highest specificity, more than 7.9% higher than all other methods, while keeping a competitive sensitivity. It exceeded the other methods by over 2% in AUC, over 3% in accuracy and over 1.5% in F1 score, indicating that addressing the inter- and intra-class problem can improve the performance of breast tumor classification.

Table 1. Comparison of different methods on cancer classification.

Cancer Localization. We invited the radiologists to annotate 85 samples that were classified as malignant by our model. COAM and CAM were obtained and resized by interpolation to the same size as the original inputs. We then compared the two methods by computing the Euclidean distance between the center point of the annotation and the voxel position with the highest value in the attention map. The distance was then multiplied by the voxel spacing, i.e., 1.1 mm, to give the final measurement. The criterion is reported in the form of $mean\pm std$, where mean and std represent the mean value and standard deviation of the center distances over the 85 samples, respectively. Compared with the distance of $39.84\pm 8.82$ mm by CAM, COAM showed a significant advantage with $18.26\pm 13.65$ mm. Figure 3 shows a qualitative comparison of the two methods.
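The distance criterion described above can be sketched as follows. The function name `center_distance_mm` is hypothetical, and the sketch assumes that the attention map has already been resized to the input grid and that the annotation center is given in voxel coordinates on that same grid; the isotropic spacing of 1.1 mm is the value stated above.

```python
import numpy as np

def center_distance_mm(attention_map, annotation_center, spacing_mm=1.1):
    """Distance in mm between the attention peak and the annotated center.

    attention_map: array already resized to the input grid (assumption).
    annotation_center: voxel coordinates of the annotation center.
    """
    attention_map = np.asarray(attention_map)
    # Voxel position with the highest attention value.
    peak = np.unravel_index(np.argmax(attention_map), attention_map.shape)
    offset = (np.asarray(peak, dtype=np.float64)
              - np.asarray(annotation_center, dtype=np.float64))
    return float(np.linalg.norm(offset) * spacing_mm)
```

Applying this function to every annotated sample and taking `np.mean` and `np.std` of the results gives a summary in the $mean\pm std$ form used above.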

4 Conclusion

In this paper, we propose the cosine margin sigmoid loss for breast tumor classification and the correlation attention map for weakly supervised localization of cancer candidates based on MRI scans. First, we use a CMSL-driven deep network to learn more separable inter-class features and more compact intra-class features, which effectively tackles the heterogeneity problem of tumors. In addition, the proposed COAM leverages correlations among deep features to localize RoIs in a weakly supervised manner. Extensive experiments on our large-scale dataset demonstrate the efficacy of our methods, which outperform other state-of-the-art approaches significantly on both tasks. Our methods are general and can be extended to many other fields. Our future work will involve more cases without lesions when training the classification task, in order to suppress false positives in the localization stage.

References

1. DeSantis, C.E., et al.: Breast cancer statistics, racial disparity in mortality by state. CA Cancer J. Clin. 67(6), 439–448 (2017)
2. Kuhl, C., et al.: Prospective multicenter cohort study to refine management recommendations for women at elevated familial risk of breast cancer: the EVA trial. J. Clin. Oncol. 28(9), 1450–1457 (2010)
3. Zheng, H., Gu, Y., Qin, Y., Huang, X., Yang, J., Yang, G.-Z.: Small lesion classification in dynamic contrast enhancement MRI for breast cancer early detection. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 876–884. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_97
4. Amit, G., et al.: Classification of breast MRI lesions using small-size training sets: comparison of deep learning approaches. In: Medical Imaging 2017: Computer-Aided Diagnosis, vol. 10134. International Society for Optics and Photonics (2017)
5. Amit, G., et al.: Hybrid mass detection in breast MRI combining unsupervised saliency analysis and deep learning. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 594–602. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_68
6. Maicas, G., Bradley, A.P., Nascimento, J.C., Reid, I., Carneiro, G.: Training medical image analysis systems like radiologists. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11070, pp. 546–554. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00928-1_62
7. Wang, H., et al.: CosFace: large margin cosine loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
8. Zhou, B., et al.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
9. Gatys, L., Ecker, A.S., Bethge, M.: Texture synthesis using convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 262–270 (2015)
10. Fu, J., et al.: Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983 (2018)
11. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A.: Multiscale vessel enhancement filtering. In: Wells, W.M., Colchester, A., Delp, S. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 130–137. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0056195
12. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
13. Wu, J., et al.: Deep multiple instance learning for image classification and auto-annotation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
14. Zhu, W., Lou, Q., Vang, Y.S., Xie, X.: Deep multi-instance networks with sparse label assignment for whole mammogram classification. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 603–611. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66179-7_69
15. Liu, J., et al.: Integrate domain knowledge in training CNN for ultrasonography breast cancer diagnosis. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 868–875. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_96

Acknowledgement

This work was supported by Research Grants Council of Hong Kong Special Administrative Region under Project No. CUHK14225616 and Hong Kong Innovation and Technology Fund under Project No. ITS/426/17FP.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Luyang Luo, Xi Wang, Huangjing Lin & Pheng-Ann Heng
Imsight Medical Technology, Co., Ltd., Shenzhen, China
Hao Chen
Department of Computing, Imperial College London, London, UK
Qi Dou
Department of Radiology, The Fifth Medical Center of Chinese PLA General Hospital, Beijing, China
Juan Zhou
Beijing Image Diagnostic Center of Rimag, Beijing, China
Gongjie Li

Corresponding author

Correspondence to Hao Chen.

Editor information

Editors and Affiliations

University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Dinggang Shen
University of Georgia, Athens, GA, USA
Tianming Liu
Western University, London, ON, Canada
Terry M. Peters
Yale University, New Haven, CT, USA
Lawrence H. Staib
University of Strasbourg, Illkirch, France
Caroline Essert
United Imaging Intelligence, Shanghai, China
Sean Zhou
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Pew-Thian Yap
Western University, London, ON, Canada
Ali Khan

Cite this paper

Luo, L. et al. (2019). Deep Angular Embedding and Feature Correlation Attention for Breast MRI Cancer Analysis. In: Shen, D., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science(), vol 11767. Springer, Cham. https://doi.org/10.1007/978-3-030-32251-9_55

DOI: https://doi.org/10.1007/978-3-030-32251-9_55
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32250-2
Online ISBN: 978-3-030-32251-9
eBook Packages: Computer Science, Computer Science (R0)
