Abstract
Multimodal learning, here defined as learning from multiple input data types, has exciting potential for healthcare. However, current techniques rely on large multimodal datasets being available, which is rarely the case in the medical domain. In this work, we focus on improving the extracted image features which are fed into multimodal image-text Transformer architectures, evaluating on a medical multimodal classification task with dual inputs of chest X-ray images (CXRs) and the indication text passages in the corresponding radiology reports. We demonstrate that self-supervised Momentum Contrast (MoCo) pre-training of the image representation model on a large set of unlabelled CXR images improves multimodal performance compared to supervised ImageNet pre-training. MoCo yields a 0.6% absolute improvement in AUROC-macro when using the full MIMIC-CXR training set, and a 5.1% improvement when training is limited to 10% of the data.
To the best of our knowledge, this is the first demonstration of MoCo image pre-training for multimodal learning in medical imaging.
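MoCo trains the image encoder by contrasting two augmented views of the same CXR (a query and its positive key) against a large queue of keys from other images, using the InfoNCE loss. The sketch below is a minimal NumPy illustration of that loss, not the authors' implementation; the function name, array shapes, and the default temperature of 0.07 are assumptions for illustration.

```python
import numpy as np

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss as used in MoCo: each query embedding q[i] should
    match its positive key k_pos[i] against all keys in the queue.

    q:     (N, D) query embeddings
    k_pos: (N, D) positive key embeddings (other view of same image)
    queue: (K, D) negative key embeddings from the memory queue
    """
    # Normalise embeddings so dot products are cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)

    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)  # (N, 1) positive logits
    l_neg = q @ queue.T                               # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature

    # Cross-entropy with the positive key as the target class (index 0).
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[:, 0].mean())
```

In MoCo the key encoder is not trained by backpropagation but updated as an exponential moving average of the query encoder, which keeps the queue of negative keys consistent over time; this sketch shows only the loss computed on the resulting embeddings.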
Notes
- 1.
Due to limited computing power, we did not evaluate the contrastive learning approach proposed by [21], which was trained on 16–64 Cloud TPU cores.
References
Huang, S.C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit. Med. 3(1) (2020)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Jacenków, G., O’Neil, A.Q., Tsaftaris, S.A.: Indication as prior knowledge for multimodal disease classification in chest radiographs with transformers. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE (2022)
Hendricks, L.A., Mellor, J., Schneider, R., Alayrac, J.-B., Nematzadeh, A.: Decoupling the role of data, attention, and losses in multimodal transformers. Trans. Assoc. Comput. Linguistics 9, 570–585 (2021). https://doi.org/10.1162/tacl_a_00385. https://aclanthology.org/2021.tacl-1.35
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)
Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, R.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Johnson, A.E.W., et al.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
Johnson, A.E.W., et al.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6(1) (2019). https://doi.org/10.1038/s41597-019-0322-0
Goldberger, A.L., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
Demner-Fushman, D., et al.: Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inf. Assoc. 23(2), 304–310 (2016)
van Sonsbeek, T., Worring, M.: Towards Automated Diagnosis with Attentive Multi-modal Learning Using Electronic Health Records and Chest X-Rays. In: Syeda-Mahmood, T., Drechsler, K., Greenspan, H., Madabhushi, A., Karargyris, A., Linguraru, M.G., Oyarzun Laura, C., Shekhar, R., Wesarg, S., González Ballester, M.Á., Erdt, M. (eds.) CLIP/ML-CDS -2020. LNCS, vol. 12445, pp. 106–114. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60946-7_11
Liao, R., Moyer, D., Cha, M., Quigley, K., Berkowitz, S., Horng, S., Golland, P., Wells, W.M.: Multimodal representation learning via maximization of local mutual information. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) MICCAI 2021. LNCS, vol. 12902, pp. 273–283. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87196-3_26
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Sowrirajan, H., Yang, J., Ng, A.Y., Rajpurkar, P.: MoCo pretraining improves representation and transferability of chest X-ray models. In: Medical Imaging with Deep Learning, pp. 728–744. PMLR (2021)
Azizi, S., et al.: Big self-supervised models advance medical image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3478–3488 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Vu, Y.N.T., et al.: MedAug: contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation. In: Machine Learning for Healthcare Conference, pp. 755–769. PMLR (2021)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)
Kiela, D., Bhooshan, S., Firooz, H., Testuggine, D.: Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950 (2019)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Irvin, J., et al.: CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597 (2019)
Singh, A., et al.: MMF: A multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Zhang, Y., Chen, Q., Yang, Z., Lin, H., Lu, Z.: BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 6 (2019). https://doi.org/10.1038/s41597-019-0055-0
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014
A Per-Class Results
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dalla Serra, F., Jacenków, G., Deligianni, F., Dalton, J., O’Neil, A.Q. (2022). Improving Image Representations via MoCo Pre-training for Multimodal CXR Classification. In: Yang, G., Aviles-Rivero, A., Roberts, M., Schönlieb, CB. (eds) Medical Image Understanding and Analysis. MIUA 2022. Lecture Notes in Computer Science, vol 13413. Springer, Cham. https://doi.org/10.1007/978-3-031-12053-4_46
DOI: https://doi.org/10.1007/978-3-031-12053-4_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12052-7
Online ISBN: 978-3-031-12053-4
eBook Packages: Computer Science (R0)