HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval

Published: 26 April 2021

Abstract

The purpose of cross-modal retrieval is to discover the relationships between samples of different modalities and, given a query from one modality, to retrieve semantically similar samples from another. Because data from different modalities exhibit heterogeneous low-level features but semantically related high-level features, the central problem of cross-modal retrieval is how to measure similarity across modalities. In this article, we present a novel cross-modal retrieval method, named the Hybrid Cross-Modal Similarity Learning model (HCMSL for short). It aims to capture sufficient semantic information from both labeled and unlabeled cross-modal pairs, as well as from intra-modal pairs sharing the same classification label. Specifically, coupled deep fully connected networks are used to map cross-modal feature representations into a common subspace, and a weight-sharing strategy is applied between the two network branches to diminish cross-modal heterogeneity. Furthermore, two Siamese CNN models are employed to learn intra-modal similarity from samples of the same modality. Comprehensive experiments on real-world datasets clearly demonstrate that our proposed technique achieves substantial improvements over state-of-the-art cross-modal retrieval techniques.
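The coupled-branch design described above can be summarized in a short sketch. The following is a minimal PyTorch illustration of two fully connected branches with a weight-shared projection into a common subspace, paired with a contrastive-style similarity loss. The layer sizes, the margin, and the exact loss form are assumptions made here for illustration, not the article's configuration, and the Siamese CNN components for intra-modal learning are omitted.

```python
# Minimal sketch of the coupled-branch idea; all dimensions, names,
# and the loss form are illustrative assumptions, not the authors' exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledBranches(nn.Module):
    """Two fully connected branches mapping image and text features into a
    common subspace; the final projection layer is shared between branches
    to reduce cross-modal heterogeneity."""
    def __init__(self, img_dim=4096, txt_dim=300, hidden=1024, common=256):
        super().__init__()
        self.img_net = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        self.shared = nn.Linear(hidden, common)  # weight-shared projection

    def forward(self, img_feat, txt_feat):
        img_code = self.shared(self.img_net(img_feat))
        txt_code = self.shared(self.txt_net(txt_feat))
        return img_code, txt_code

def pair_loss(a, b, same_label, margin=0.5):
    """Contrastive-style objective (hypothetical form): pull together pairs
    with the same label, push apart mismatched pairs beyond a margin."""
    sim = F.cosine_similarity(a, b)     # shape: (batch,)
    pos = 1.0 - sim                     # matched pairs: reward high similarity
    neg = F.relu(sim - margin)          # mismatched pairs: penalize high similarity
    return torch.where(same_label.bool(), pos, neg).mean()

# Usage sketch: random vectors stand in for pre-extracted image/text features.
model = CoupledBranches()
img_code, txt_code = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = pair_loss(img_code, txt_code, same_label=torch.ones(8))
```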


    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 1s
    January 2021
    353 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3453990

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 April 2021
    Accepted: 01 July 2020
    Revised: 01 July 2020
    Received: 01 April 2020
    Published in TOMM Volume 17, Issue 1s

    Author Tags

    1. Cross-modal retrieval
    2. deep learning
    3. intra-modal semantic correlation
    4. hybrid cross-modal similarity

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Science and Technology Plan of Hunan Province
