DOI: 10.1609/aaai.v37i7.25967

Audio-visual contrastive learning with temporal self-supervision

Published: 07 February 2023

Abstract

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images, which capture only static scene appearance, videos also contain sound and temporal scene dynamics. To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting and integrates it with multi-modal contrastive objectives. As temporal self-supervision, we pose playback speed and direction recognition in both modalities and propose intra- and inter-modal temporal ordering tasks. Furthermore, we design a novel contrastive objective in which the usual pairs are supplemented with additional sample-dependent positives and negatives sampled from the evolving feature space. In our model, we apply such losses among video clips and between videos and their temporally corresponding audio clips. We verify our model design in extensive ablation experiments and evaluate the video and audio representations in transfer experiments to action recognition and retrieval on UCF101 and HMDB51, audio classification on ESC50, and robust video fingerprinting on VGG-Sound, with state-of-the-art results.
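The abstract names two ingredient families: temporal pretext tasks (recognizing playback speed and direction) and multi-modal contrastive objectives supplemented with sample-dependent positives and negatives. The sketch below illustrates the general shape of such objectives in PyTorch; it is a minimal sketch, not the authors' implementation, and the function names, feature dimensions, and speed/direction class set are illustrative assumptions.

```python
# Minimal PyTorch sketch (not the paper's released code) of two ingredients
# described in the abstract: (1) a cross-modal InfoNCE loss between temporally
# corresponding video/audio clips, optionally extended with extra feature-space
# positives, and (2) a playback speed/direction classification head as temporal
# self-supervision. Names, dimensions, and class counts are assumptions.
import torch
import torch.nn.functional as F


def cross_modal_nce(v, a, extra_pos=None, temperature=0.07):
    """v, a: (B, D) projected embeddings of temporally aligned video/audio clips.
    extra_pos: optional (B, D) additional positives for the video embeddings,
    e.g. neighbors drawn from the evolving feature space (an assumption)."""
    v = F.normalize(v, dim=1)
    a = F.normalize(a, dim=1)
    logits = v @ a.t() / temperature                     # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)   # diagonal = positives
    loss = F.cross_entropy(logits, targets)              # video -> audio
    loss = loss + F.cross_entropy(logits.t(), targets)   # audio -> video
    if extra_pos is not None:
        # treat the extra sample-dependent positives like a second positive set
        p = F.normalize(extra_pos, dim=1)
        loss = loss + F.cross_entropy(v @ p.t() / temperature, targets)
    return loss


class TemporalHead(torch.nn.Module):
    """Predicts which temporal transformation was applied to a clip, e.g.
    {1x, 2x, 4x} playback speed x {forward, reverse} = 6 classes (illustrative)."""

    def __init__(self, feat_dim=512, num_classes=6):
        super().__init__()
        self.fc = torch.nn.Linear(feat_dim, num_classes)

    def forward(self, feats, labels):
        # feats: (B, feat_dim) clip features; labels: (B,) transformation indices
        return F.cross_entropy(self.fc(feats), labels)
```

Per the abstract, losses of this form are applied both among video clips and between video clips and their temporally corresponding audio clips, and the speed/direction recognition task is posed in both modalities.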



Information & Contributors

Information

Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN: 978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Publication History

Published: 07 February 2023

Qualifiers

  • Research-article
  • Research
  • Refereed limited



Cited By

  • (2024) EquiAV. In Proceedings of the 41st International Conference on Machine Learning, 24327-24341. DOI: 10.5555/3692070.3693044. Online publication date: 21-Jul-2024.
  • (2024) Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning. In Proceedings of the 6th ACM International Conference on Multimedia in Asia, 1-8. DOI: 10.1145/3696409.3700225. Online publication date: 3-Dec-2024.
  • (2024) Audio-Visual Self-Supervision for Frame-Level Player-wise Offensive Shot Detection in Table Tennis Matches. In Proceedings of the 7th ACM International Workshop on Multimedia Content Analysis in Sports, 27-33. DOI: 10.1145/3689061.3689064. Online publication date: 28-Oct-2024.
  • (2024) Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning. In Proceedings of the 32nd ACM International Conference on Multimedia, 5025-5034. DOI: 10.1145/3664647.3681472. Online publication date: 28-Oct-2024.
  • (2024) Audio-Visual Contrastive Pre-train for Face Forgery Detection. ACM Transactions on Multimedia Computing, Communications, and Applications, 21(2), 1-16. DOI: 10.1145/3651311. Online publication date: 13-Mar-2024.
