Abstract
Multimodal Federated Learning (MMFL) is an emerging machine learning paradigm that extends traditional Federated Learning (FL) to support the collaborative training of local models on data spanning multiple modalities. With vast amounts of multimodal data being generated and stored by the internet, sensors, and mobile devices, and with artificial intelligence models iterating rapidly, the demand for multimodal models is growing quickly. Although FL has been widely studied in recent years, most existing research has been conducted in unimodal settings. In the hope of inspiring further applications and research within the MMFL paradigm, we conduct a comprehensive review of the progress and challenges across various aspects of state-of-the-art MMFL. Specifically, we analyze the motivation for MMFL research, propose a new taxonomy of existing work, discuss available datasets and application scenarios, and offer perspectives on the opportunities and challenges facing MMFL.
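To make the training paradigm described above concrete, the following is a minimal sketch of one FedAvg-style communication round in a multimodal setting, where each client holds paired image and text features. The network architecture, dimensions, and concatenation-based fusion are illustrative assumptions for this sketch, not the method of any particular work surveyed here.

```python
# Minimal sketch: one FedAvg round over clients with paired image/text data.
# All model names, dimensions, and the late-fusion design are illustrative.
import copy
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    """Per-modality encoders followed by late (concatenation) fusion."""
    def __init__(self, img_dim=64, txt_dim=32, hidden=16, n_classes=4):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, img, txt):
        fused = torch.cat([self.img_enc(img), self.txt_enc(txt)], dim=-1)
        return self.head(fused)

def local_update(global_model, data, epochs=1, lr=0.01):
    """Client side: train a copy of the global model on local multimodal data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    img, txt, y = data
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(img, txt), y).backward()
        opt.step()
    return model.state_dict(), len(y)

def fedavg(global_model, client_states):
    """Server side: average client weights, weighted by local sample count."""
    total = sum(n for _, n in client_states)
    avg = {k: sum(sd[k] * (n / total) for sd, n in client_states)
           for k in global_model.state_dict()}
    global_model.load_state_dict(avg)

# One communication round over three clients with synthetic paired data;
# raw samples never leave the clients, only model weights are exchanged.
torch.manual_seed(0)
server = MultimodalNet()
clients = [(torch.randn(20, 64), torch.randn(20, 32), torch.randint(0, 4, (20,)))
           for _ in range(3)]
fedavg(server, [local_update(server, d) for d in clients])
```

Much of the MMFL literature reviewed in this survey can be read as variations on this loop, e.g., replacing the fusion strategy, personalizing the per-modality encoders, or handling clients whose modalities are missing or heterogeneous.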
Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62376151) and the Science and Technology Commission of Shanghai Municipality (Grant No. 22DZ2205600).
Author information
Contributions
H.P. wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest that are relevant to the content of this article.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pan, H., Zhao, X., He, L. et al. A survey of multimodal federated learning: background, applications, and perspectives. Multimedia Systems 30, 222 (2024). https://doi.org/10.1007/s00530-024-01422-9