HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval

Published: 26 April 2021

Abstract

The purpose of cross-modal retrieval is to discover the relationships between samples of different modalities and, given a query from one modality, to retrieve semantically similar samples from another. Because data from different modalities exhibit heterogeneous low-level features but semantically related high-level features, the central problem of cross-modal retrieval is how to measure similarity across modalities. In this article, we present a novel cross-modal retrieval method, named the Hybrid Cross-Modal Similarity Learning model (HCMSL for short). It aims to capture sufficient semantic information from both labeled and unlabeled cross-modal pairs, as well as from intra-modal pairs sharing the same classification label. Specifically, coupled deep fully connected networks are used to map cross-modal feature representations into a common subspace, and a weight-sharing strategy is applied between the two network branches to diminish cross-modal heterogeneity. Furthermore, two Siamese CNN models are employed to learn intra-modal similarity from samples of the same modality. Comprehensive experiments on real-world datasets clearly demonstrate that our proposed technique achieves substantial improvements over state-of-the-art cross-modal retrieval techniques.
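The coupled-branch design described above can be summarized in a short sketch. The following is a minimal PyTorch illustration of two fully connected branches with a weight-shared projection into a common subspace, paired with a contrastive-style similarity loss. The layer sizes, the margin, and the exact loss form are assumptions made here for illustration, not the article's configuration, and the Siamese CNN components for intra-modal learning are omitted.

```python
# Minimal sketch of the coupled-branch idea; all dimensions, names,
# and the loss form are illustrative assumptions, not the authors' exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledBranches(nn.Module):
    """Two fully connected branches mapping image and text features into a
    common subspace; the final projection layer is shared between branches
    to reduce cross-modal heterogeneity."""
    def __init__(self, img_dim=4096, txt_dim=300, hidden=1024, common=256):
        super().__init__()
        self.img_net = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        self.shared = nn.Linear(hidden, common)  # weight-shared projection

    def forward(self, img_feat, txt_feat):
        img_code = self.shared(self.img_net(img_feat))
        txt_code = self.shared(self.txt_net(txt_feat))
        return img_code, txt_code

def pair_loss(a, b, same_label, margin=0.5):
    """Contrastive-style objective (hypothetical form): pull together pairs
    with the same label, push apart mismatched pairs beyond a margin."""
    sim = F.cosine_similarity(a, b)     # shape: (batch,)
    pos = 1.0 - sim                     # matched pairs: reward high similarity
    neg = F.relu(sim - margin)          # mismatched pairs: penalize high similarity
    return torch.where(same_label.bool(), pos, neg).mean()

# Usage sketch: random vectors stand in for pre-extracted image/text features.
model = CoupledBranches()
img_code, txt_code = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = pair_loss(img_code, txt_code, same_label=torch.ones(8))
```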


    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 1s
    January 2021
    353 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3453990

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 April 2021
    Accepted: 01 July 2020
    Revised: 01 July 2020
    Received: 01 April 2020
    Published in TOMM Volume 17, Issue 1s

    Author Tags

    1. Cross-modal retrieval
    2. deep learning
    3. intra-modal semantic correlation
    4. hybrid cross-modal similarity

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Science and Technology Plan of Hunan Province
