UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation

Yunhe Gao, Mu Zhou & Dimitris N. Metaxas

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12903)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

The Transformer architecture has proven successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network to enhance medical image segmentation. UTNet applies self-attention modules in both the encoder and the decoder to capture long-range dependencies at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism, along with relative position encoding, that reduces the complexity of the self-attention operation significantly, from $O(n^2)$ to approximately $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skip connections in the encoder. Our approach addresses the dilemma that Transformers require huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of Transformer layers in convolutional networks without the need for pre-training. We have evaluated UTNet on a multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness compared with state-of-the-art approaches, holding promise to generalize well to other medical image segmentation tasks.
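To make the complexity claim concrete, the following is a minimal sketch of one way self-attention cost can drop from $O(n^2)$ to roughly $O(n)$: keys and values are reduced to a fixed number of tokens before attention, so each of the n queries is scored against only k entries. The abstract does not describe the reduction itself, so the average pooling used here, the function names (`avg_pool_tokens`, `efficient_attention`, `score_count`) and the parameter `k_reduced` are all illustrative assumptions rather than the authors' implementation. Relative position encoding and learned projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def avg_pool_tokens(tokens, k_reduced):
    """Reduce n token vectors to k_reduced by averaging contiguous chunks.

    Assumption: pooling is a stand-in for whatever reduction the paper uses.
    Requires len(tokens) >= k_reduced so that no chunk is empty.
    """
    n = len(tokens)
    pooled = []
    for i in range(k_reduced):
        chunk = tokens[i * n // k_reduced:(i + 1) * n // k_reduced]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def efficient_attention(queries, keys, values, k_reduced):
    """Scaled dot-product attention against pooled keys and values.

    Each of the n queries is compared with k_reduced keys, so the score
    matrix has n * k_reduced entries instead of n * n.
    """
    small_keys = avg_pool_tokens(keys, k_reduced)
    small_values = avg_pool_tokens(values, k_reduced)
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in small_keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, small_values))
            for j in range(len(small_values[0]))
        ])
    return out


def score_count(n, k_reduced=None):
    """Number of query-key scores: n * n for full attention, n * k when reduced."""
    return n * (n if k_reduced is None else k_reduced)
```

For a 64 x 64 feature map (n = 4096 tokens) and a hypothetical k = 64, `score_count(4096)` gives 16,777,216 scores for full attention, while `score_count(4096, 64)` gives 262,144. With k held fixed, the cost grows linearly in n.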

Acknowledgement

This research was supported in part by NSF: IIS 1703883, NSF IUCRC CNS-1747778 and funding from SenseBrain, CCF-1733843, IIS-1763523, IIS-1849238, MURI-Z8424104-440149 and NIH: 1R01HL127661-01 and R01HL127661-05, and in part by the Centre for Perceptual and Interactive Intelligence (CPII) Limited, Hong Kong SAR.

Author information

Authors and Affiliations

Department of Computer Science, Rutgers University, Piscataway, USA
Yunhe Gao, Mu Zhou & Dimitris N. Metaxas
SenseBrain and Shanghai AI Laboratory and Centre for Perceptual and Interactive Intelligence, Shanghai, China
Mu Zhou

Corresponding author

Correspondence to Dimitris N. Metaxas.

Editor information

Editors and Affiliations

Erasmus MC - University Medical Center Rotterdam, Rotterdam, The Netherlands
Marleen de Bruijne
University of Basel, Allschwil, Switzerland
Philippe C. Cattin
Inria Nancy Grand Est, Villers-lès-Nancy, France
Stéphane Cotin
ICube, Université de Strasbourg, CNRS, Strasbourg, France
Nicolas Padoy
National Center for Tumor Diseases (NCT/UCC), Dresden, Germany
Stefanie Speidel
Tencent Jarvis Lab, Shenzhen, China
Yefeng Zheng
ICube, Université de Strasbourg, CNRS, Strasbourg, France
Caroline Essert

Electronic supplementary material

Supplementary material 1 (PDF, 1826 KB)

Cite this paper

Gao, Y., Zhou, M., Metaxas, D.N. (2021). UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation. In: de Bruijne, M., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science, vol 12903. Springer, Cham. https://doi.org/10.1007/978-3-030-87199-4_6

DOI: https://doi.org/10.1007/978-3-030-87199-4_6
Published: 21 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87198-7
Online ISBN: 978-3-030-87199-4
eBook Packages: Computer Science, Computer Science (R0)
