SemiCMT: Contrastive Cross-Modal Knowledge Transfer for IoT Sensing with Semi-Paired Multi-Modal Signals

Published: 21 November 2024

Abstract

This paper proposes a novel contrastive cross-modal knowledge transfer framework, SemiCMT, for multi-modal IoT sensing applications. It effectively transfers the feature extraction capability (also called knowledge) learned from a source modality (e.g., acoustic signals) with abundant unlabeled training data to a target modality (e.g., seismic signals) that lacks sufficient training data, in a self-supervised manner with the help of only a small set of synchronized multi-modal pairs. The transferred model can be quickly finetuned for downstream target-modal tasks with only limited labels. The key design consists of three aspects: First, we factorize the latent embedding of each modality into shared and private components and perform knowledge transfer considering both the commonality and the gaps in modality information. Second, we enforce structural correlation constraints between the source modality and the target modality, pushing the target-modal embedding space toward symmetry with the source-modal embedding space by anchoring on additional source-modal samples, which extends the pair-matching objective of existing multi-modal contrastive frameworks. Finally, we conduct downstream task finetuning in the spherical space with a KNN classifier to better align with the structured modality embedding space. Extensive evaluations on five multi-modal IoT datasets are performed to validate the effectiveness of SemiCMT in cross-modal knowledge transfer, including a new self-collected dataset using seismic and acoustic signals for office activity monitoring. SemiCMT consistently outperforms existing self-supervised and knowledge transfer approaches by up to 36.47% on the finetuned target-modal classification tasks. The code and the self-collected dataset will be released at https://github.com/SJTU-RTEAS/SemiCMT.
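
For readers unfamiliar with these components, the sketch below is a minimal, hypothetical PyTorch illustration of three of the ingredients mentioned above: factorizing each modality's embedding into shared and private components, an InfoNCE-style objective over synchronized source/target pairs, and a cosine-distance KNN classifier operating on unit-normalized (spherical) embeddings. It is not the authors' released implementation (see the linked repository); the names (ModalityEncoder, paired_infonce, knn_predict), dimensions, and hyperparameters are assumptions for illustration, and the structural correlation constraint over additional source-modal anchors is omitted.

```python
# Minimal, illustrative sketch (assumed shapes and names; NOT the SemiCMT release).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Toy encoder mapping a raw signal window to a (shared, private) embedding pair."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)
        shared, private = z.chunk(2, dim=-1)   # factorized latent components
        return F.normalize(shared, dim=-1), F.normalize(private, dim=-1)


def paired_infonce(shared_src: torch.Tensor, shared_tgt: torch.Tensor, tau: float = 0.07):
    """InfoNCE over synchronized source/target pairs: matching indices are positives."""
    logits = shared_tgt @ shared_src.t() / tau                      # (B, B) similarities
    labels = torch.arange(shared_src.size(0), device=shared_src.device)
    return F.cross_entropy(logits, labels)


def knn_predict(train_emb, train_labels, query_emb, k: int = 5):
    """Cosine-similarity KNN on unit-normalized embeddings (spherical space)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(train_emb, dim=-1).t()
    knn_idx = sims.topk(k, dim=-1).indices                          # (Q, k) neighbors
    return train_labels[knn_idx].mode(dim=-1).values                # majority vote


if __name__ == "__main__":
    # Hypothetical setup: 32 synchronized acoustic/seismic windows of 512 samples each.
    src_enc, tgt_enc = ModalityEncoder(512), ModalityEncoder(512)
    x_src, x_tgt = torch.randn(32, 512), torch.randn(32, 512)
    s_src, _ = src_enc(x_src)
    s_tgt, _ = tgt_enc(x_tgt)

    loss = paired_infonce(s_src, s_tgt)
    loss.backward()                       # pulls the target encoder toward the source space
    print(f"contrastive loss: {loss.item():.3f}")

    # Downstream finetuning illustration: KNN over a handful of labeled target windows.
    labels = torch.randint(0, 4, (32,))
    preds = knn_predict(s_tgt.detach(), labels, s_tgt.detach())
    print("KNN predictions:", preds[:5].tolist())
```

Normalizing both the shared embeddings and the KNN queries places all comparisons on the unit sphere, which is one way to realize the "finetuning in the spherical space" idea described in the abstract.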


      Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 4
November 2024
1788 pages
EISSN: 2474-9567
DOI: 10.1145/3705705
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 November 2024
      Published in IMWUT Volume 8, Issue 4

      Author Tags

      1. Contrastive Learning
      2. Cross-Modal Knowledge Transfer
      3. Multi-Modal Time-Series Signals

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • The National Key R&D Program of China
      • China NSF grant
      • Alibaba Innovation Research Program
      • Tencent Rhino Bird Key Research Project
