SemiCMT: Contrastive Cross-Modal Knowledge Transfer for IoT Sensing with Semi-Paired Multi-Modal Signals

Published: 21 November 2024

Abstract

This paper proposes a novel contrastive cross-modal knowledge transfer framework, SemiCMT, for multi-modal IoT sensing applications. It effectively transfers the feature extraction capability (also called knowledge) learned from a source modality (e.g., acoustic signals) with abundant unlabeled training data to a target modality (e.g., seismic signals) that lacks sufficient training data, in a self-supervised manner with the help of only a small set of synchronized multi-modal pairs. The transferred model can be quickly finetuned for downstream target-modal tasks with only limited labels. The key design consists of three aspects: First, we factorize the latent embedding of each modality into shared and private components and perform knowledge transfer considering both the commonality and the gaps in modality information. Second, we enforce structural correlation constraints between the source modality and the target modality, pushing the target-modal embedding space toward symmetry with the source-modal embedding space by anchoring on additional source-modal samples, which extends the pair-matching objective of existing multi-modal contrastive frameworks. Finally, we conduct downstream task finetuning in the spherical space with a KNN classifier to better align with the structured modality embedding space. Extensive evaluations on five multi-modal IoT datasets are performed to validate the effectiveness of SemiCMT in cross-modal knowledge transfer, including a new self-collected dataset using seismic and acoustic signals for office activity monitoring. SemiCMT consistently outperforms existing self-supervised and knowledge transfer approaches by up to 36.47% on the finetuned target-modal classification tasks. The code and the self-collected dataset will be released at https://github.com/SJTU-RTEAS/SemiCMT.
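
For readers unfamiliar with these components, the sketch below is a minimal, hypothetical PyTorch illustration of three of the ingredients mentioned above: factorizing each modality's embedding into shared and private components, an InfoNCE-style objective over synchronized source/target pairs, and a cosine-distance KNN classifier operating on unit-normalized (spherical) embeddings. It is not the authors' released implementation (see the linked repository); the names (ModalityEncoder, paired_infonce, knn_predict), dimensions, and hyperparameters are assumptions for illustration, and the structural correlation constraint over additional source-modal anchors is omitted.

```python
# Minimal, illustrative sketch (assumed shapes and names; NOT the SemiCMT release).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Toy encoder mapping a raw signal window to a (shared, private) embedding pair."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)
        shared, private = z.chunk(2, dim=-1)   # factorized latent components
        return F.normalize(shared, dim=-1), F.normalize(private, dim=-1)


def paired_infonce(shared_src: torch.Tensor, shared_tgt: torch.Tensor, tau: float = 0.07):
    """InfoNCE over synchronized source/target pairs: matching indices are positives."""
    logits = shared_tgt @ shared_src.t() / tau                      # (B, B) similarities
    labels = torch.arange(shared_src.size(0), device=shared_src.device)
    return F.cross_entropy(logits, labels)


def knn_predict(train_emb, train_labels, query_emb, k: int = 5):
    """Cosine-similarity KNN on unit-normalized embeddings (spherical space)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(train_emb, dim=-1).t()
    knn_idx = sims.topk(k, dim=-1).indices                          # (Q, k) neighbors
    return train_labels[knn_idx].mode(dim=-1).values                # majority vote


if __name__ == "__main__":
    # Hypothetical setup: 32 synchronized acoustic/seismic windows of 512 samples each.
    src_enc, tgt_enc = ModalityEncoder(512), ModalityEncoder(512)
    x_src, x_tgt = torch.randn(32, 512), torch.randn(32, 512)
    s_src, _ = src_enc(x_src)
    s_tgt, _ = tgt_enc(x_tgt)

    loss = paired_infonce(s_src, s_tgt)
    loss.backward()                       # pulls the target encoder toward the source space
    print(f"contrastive loss: {loss.item():.3f}")

    # Downstream finetuning illustration: KNN over a handful of labeled target windows.
    labels = torch.randint(0, 4, (32,))
    preds = knn_predict(s_tgt.detach(), labels, s_tgt.detach())
    print("KNN predictions:", preds[:5].tolist())
```

Normalizing both the shared embeddings and the KNN queries places all comparisons on the unit sphere, which is one way to realize the "finetuning in the spherical space" idea described in the abstract.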


      Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 4
November 2024
1788 pages
EISSN: 2474-9567
DOI: 10.1145/3705705
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 November 2024
      Published in IMWUT Volume 8, Issue 4

      Author Tags

      1. Contrastive Learning
      2. Cross-Modal Knowledge Transfer
      3. Multi-Modal Time-Series Signals

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • The National Key R&D Program of China
      • China NSF grant
      • Alibaba Innovation Research Program
      • Tencent Rhino Bird Key Research Project
