
Learning Explicit and Implicit Dual Common Subspaces for Audio-visual Cross-modal Retrieval

Published: 17 February 2023

Abstract

Audio-visual tracks in video carry rich semantic information with potential value for many applications and research directions. Because audio and visual data have inconsistent distributions and heterogeneous representations, the modality gap between them makes direct comparison impossible. To bridge this gap, a frequently adopted approach projects audio-visual data into a common subspace that captures both the commonalities and the characteristics of the modalities, a problem studied extensively in prior work on modality-common and modality-specific feature learning. However, existing methods struggle with the tradeoff between the two: for example, when the modality-common feature is learned from the latent commonalities of audio-visual data, or from correlated features as aligned projections, the modality-specific feature can be lost. To resolve this tradeoff, we propose a novel end-to-end architecture that simultaneously projects audio-visual data into explicit and implicit dual common subspaces. The explicit subspace learns modality-common features and reduces the modality gap of explicitly paired audio-visual data; representation-specific details are discarded there to retain the common underlying structure of the audio-visual data. The implicit subspace learns modality-specific features: by minimizing the distance between audio-visual features and their corresponding labels, each modality separately pushes apart the features of different categories and thus preserves category-based distinctions. Comprehensive experiments on two audio-visual datasets, VEGAS and AVE, demonstrate that learning from two distinct common subspaces is effective for audio-visual cross-modal retrieval: our model outperforms state-of-the-art cross-modal models that learn features from a single common subspace by 4.30% and 2.30% in average MAP on VEGAS and AVE, respectively.
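
The dual-subspace idea described above can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under assumed dimensions, module names, and unweighted losses, not the authors' implementation: paired audio-visual projections are pulled together in an explicit subspace, while features in an implicit subspace are pulled toward learnable label anchors so that categories stay separated within each modality.

```python
# Minimal sketch of dual common-subspace learning (illustrative assumptions:
# linear projections, MSE losses, equal loss weights, toy dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSubspaceModel(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=1024, common_dim=64, num_classes=10):
        super().__init__()
        # Explicit subspace: modality-common features for explicitly paired data.
        self.audio_explicit = nn.Linear(audio_dim, common_dim)
        self.visual_explicit = nn.Linear(visual_dim, common_dim)
        # Implicit subspace: modality-specific, category-discriminative features.
        self.audio_implicit = nn.Linear(audio_dim, common_dim)
        self.visual_implicit = nn.Linear(visual_dim, common_dim)
        # Learnable label embeddings act as category anchors in the implicit subspace.
        self.label_embed = nn.Embedding(num_classes, common_dim)

    def forward(self, audio, visual, labels):
        a_exp = self.audio_explicit(audio)
        v_exp = self.visual_explicit(visual)
        a_imp = self.audio_implicit(audio)
        v_imp = self.visual_implicit(visual)
        # Explicit loss: shrink the gap between paired audio-visual projections.
        explicit_loss = F.mse_loss(a_exp, v_exp)
        # Implicit loss: pull each modality's features toward their label anchor,
        # which implicitly pushes apart features of different categories.
        anchors = self.label_embed(labels)
        implicit_loss = F.mse_loss(a_imp, anchors) + F.mse_loss(v_imp, anchors)
        return explicit_loss + implicit_loss

# Usage on a toy batch of 8 paired audio-visual samples.
model = DualSubspaceModel()
audio = torch.randn(8, 128)
visual = torch.randn(8, 1024)
labels = torch.randint(0, 10, (8,))
loss = model(audio, visual, labels)
loss.backward()
```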




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s
April 2023, 545 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572861
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 February 2023
Online AM: 22 September 2022
Accepted: 04 September 2022
Revised: 29 June 2022
Received: 15 January 2022
Published in TOMM Volume 19, Issue 2s

Author Tags

  1. Modality-common
  2. modality-specific
  3. explicit and implicit
  4. audio-visual cross-modal retrieval

Qualifiers

  • Research-article

Funding Sources

  • JSPS Scientific Research
  • KDDI research, Inc.

Cited By

  • (2024) Anchor-aware Deep Metric Learning for Audio-visual Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval, 211-219. DOI: 10.1145/3652583.3658067. Online publication date: 30-May-2024.
  • (2024) Learning Offset Probability Distribution for Accurate Object Detection. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(5), 1-24. DOI: 10.1145/3637214. Online publication date: 22-Jan-2024.
  • (2023) Triplet Loss with Curriculum Learning for Audio-Visual Retrieval. 2023 IEEE International Symposium on Multimedia (ISM), 206-207. DOI: 10.1109/ISM59092.2023.00038. Online publication date: 11-Dec-2023.
  • (2023) VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning. IEEE Access, 11, 51229-51240. DOI: 10.1109/ACCESS.2023.3280187. Online publication date: 2023.
  • (2023) Multi-scale network with shared cross-attention for audio–visual correlation learning. Neural Computing and Applications, 35(27), 20173-20187. DOI: 10.1007/s00521-023-08817-1. Online publication date: 19-Jul-2023.