Abstract
Multimodal learning, here defined as learning from multiple input data types, has exciting potential for healthcare. However, current techniques rely on large multimodal datasets being available, which is rarely the case in the medical domain. In this work, we focus on improving the extracted image features which are fed into multimodal image-text Transformer architectures, evaluating on a medical multimodal classification task with dual inputs of chest X-ray images (CXRs) and the indication text passages in the corresponding radiology reports. We demonstrate that self-supervised Momentum Contrast (MoCo) pre-training of the image representation model on a large set of unlabelled CXR images improves multimodal performance compared to supervised ImageNet pre-training. MoCo yields a 0.6% absolute improvement in AUROC-macro when using the full MIMIC-CXR training set, and a 5.1% improvement when training is limited to 10% of the data.
To the best of our knowledge, this is the first demonstration of MoCo image pre-training for multimodal learning in medical imaging.
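MoCo trains the image encoder by contrasting two augmented views of the same CXR (a query and its positive key) against a large queue of keys from other images, using the InfoNCE loss. The sketch below is a minimal NumPy illustration of that loss, not the authors' implementation; the function name, array shapes, and the default temperature of 0.07 are assumptions for illustration.

```python
import numpy as np

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """InfoNCE loss as used in MoCo: each query embedding q[i] should
    match its positive key k_pos[i] against all keys in the queue.

    q:     (N, D) query embeddings
    k_pos: (N, D) positive key embeddings (other view of same image)
    queue: (K, D) negative key embeddings from the memory queue
    """
    # Normalise embeddings so dot products are cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)

    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)  # (N, 1) positive logits
    l_neg = q @ queue.T                               # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature

    # Cross-entropy with the positive key as the target class (index 0).
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[:, 0].mean())
```

In MoCo the key encoder is not trained by backpropagation but updated as an exponential moving average of the query encoder, which keeps the queue of negative keys consistent over time; this sketch shows only the loss computed on the resulting embeddings.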
Notes
- 1.
Due to limited computing power, we did not evaluate the contrastive learning approach proposed by [21], which was trained on 16–64 Cloud TPU cores.
References
Huang, S.C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit. Med. 3(1) (2020)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Jacenków, G., O’Neil, A.Q., Tsaftaris, S.A.: Indication as prior knowledge for multimodal disease classification in chest radiographs with transformers. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE (2022)
Hendricks, L.A., Mellor, J., Schneider, R., Alayrac, J.-B., Nematzadeh, A.: Decoupling the role of data, attention, and losses in multimodal transformers. Trans. Assoc. Comput. Linguistics 9, 570–585 (2021). https://doi.org/10.1162/tacl_a_00385. https://aclanthology.org/2021.tacl-1.35
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)
Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, R.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Johnson, A.E.W., et al.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
Johnson, A.E.W., et al.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6(1) (2019). https://doi.org/10.1038/s41597-019-0322-0
Goldberger, A.L., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
Demner-Fushman, D., et al.: Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inf. Assoc. 23(2), 304–310 (2016)
van Sonsbeek, T., Worring, M.: Towards Automated Diagnosis with Attentive Multi-modal Learning Using Electronic Health Records and Chest X-Rays. In: Syeda-Mahmood, T., Drechsler, K., Greenspan, H., Madabhushi, A., Karargyris, A., Linguraru, M.G., Oyarzun Laura, C., Shekhar, R., Wesarg, S., González Ballester, M.Á., Erdt, M. (eds.) CLIP/ML-CDS -2020. LNCS, vol. 12445, pp. 106–114. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60946-7_11
Liao, R., Moyer, D., Cha, M., Quigley, K., Berkowitz, S., Horng, S., Golland, P., Wells, W.M.: Multimodal representation learning via maximization of local mutual information. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) MICCAI 2021. LNCS, vol. 12902, pp. 273–283. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87196-3_26
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Sowrirajan, H., Yang, J., Ng, A.Y., Rajpurkar, P.: MoCo pretraining improves representation and transferability of chest X-ray models. In: Medical Imaging with Deep Learning, pp. 728–744. PMLR (2021)
Azizi, S., et al.: Big self-supervised models advance medical image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3478–3488 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Vu, Y.N.T., et al.: MedAug: contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation. In: Machine Learning for Healthcare Conference, pp. 755–769. PMLR (2021)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)
Kiela, D., Bhooshan, S., Firooz, H., Testuggine, D.: Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950 (2019)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Irvin, J., et al.: CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597 (2019)
Singh, A., et al.: MMF: A multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Zhang, Y., Chen, Q., Yang, Z., Lin, H., Lu, Z.: BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 6 (2019). https://doi.org/10.1038/s41597-019-0055-0
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014
A Per-Class Results
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dalla Serra, F., Jacenków, G., Deligianni, F., Dalton, J., O’Neil, A.Q. (2022). Improving Image Representations via MoCo Pre-training for Multimodal CXR Classification. In: Yang, G., Aviles-Rivero, A., Roberts, M., Schönlieb, CB. (eds) Medical Image Understanding and Analysis. MIUA 2022. Lecture Notes in Computer Science, vol 13413. Springer, Cham. https://doi.org/10.1007/978-3-031-12053-4_46
DOI: https://doi.org/10.1007/978-3-031-12053-4_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12052-7
Online ISBN: 978-3-031-12053-4
eBook Packages: Computer Science (R0)