DOI: 10.1145/2733373.2806240

Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment

Published: 13 October 2015

Abstract

Cross-modal retrieval is an active research topic that is important to many applications involving multi-modal data. Discovering an appropriate representation for multi-modal data and learning a ranking function are both essential to effective cross-media retrieval. Motivated by the assumption that a compositional cross-modal semantic representation (built over pairs of images and text) is better suited to cross-modal ranking, this paper exploits existing image-text databases to optimize a ranking function for cross-modal retrieval, called deep compositional cross-modal learning to rank (C2MLR). C2MLR learns a multi-modal embedding by optimizing a pairwise ranking problem while enforcing both local and global alignment. In particular, the local alignment (i.e., the alignment between visual objects and textual words) and the global alignment (i.e., the image-level and sentence-level alignment) are used collaboratively to learn a common multi-modal embedding space in a max-margin learning-to-rank manner. Experiments demonstrate the superiority of the proposed C2MLR, owing to its compositional multi-modal embedding.
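
The abstract describes learning a common image-text embedding with a max-margin pairwise ranking objective that combines local (region-word) and global (image-sentence) alignment. The paper's exact loss is not given on this page, so the sketch below is only an illustrative, generic bidirectional max-margin ranking loss over such a shared space; the scoring functions, the margin, the blending weight alpha, and all array shapes are assumptions for the example, not the authors' C2MLR formulation.

```python
# Minimal sketch (assumed formulation, not the paper's): a bidirectional
# max-margin ranking loss over a shared image-text embedding space that
# blends a "global" image-sentence score with a "local" region-word score.
import numpy as np

def global_score(img_emb, sent_emb):
    """Cosine similarity between an image-level and a sentence-level embedding."""
    return float(img_emb @ sent_emb /
                 (np.linalg.norm(img_emb) * np.linalg.norm(sent_emb) + 1e-8))

def local_score(region_embs, word_embs):
    """One plausible local alignment: average, over words, of each word's
    best-matching region similarity (raw dot products for simplicity)."""
    sims = region_embs @ word_embs.T            # (n_regions, n_words)
    return float(sims.max(axis=0).mean())

def pairwise_margin_loss(images, sentences, regions, words, margin=0.1, alpha=0.5):
    """Matched pair (i, i) must outscore every mismatched pair (i, j) and
    (j, i) by at least `margin`, in both retrieval directions."""
    n = len(images)

    def score(i, j):
        return (alpha * global_score(images[i], sentences[j])
                + (1.0 - alpha) * local_score(regions[i], words[j]))

    loss = 0.0
    for i in range(n):
        s_pos = score(i, i)
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin - s_pos + score(i, j))   # rank sentences for image i
            loss += max(0.0, margin - s_pos + score(j, i))   # rank images for sentence i
    return loss / (n * (n - 1))

# Toy usage with random embeddings in a shared 8-D space (hypothetical shapes).
rng = np.random.default_rng(0)
imgs  = rng.normal(size=(4, 8))       # 4 image-level embeddings
sents = rng.normal(size=(4, 8))       # 4 sentence-level embeddings
regs  = rng.normal(size=(4, 5, 8))    # 5 region embeddings per image
wrds  = rng.normal(size=(4, 6, 8))    # 6 word embeddings per sentence
print(pairwise_margin_loss(imgs, sents, regs, wrds))
```

In an objective of this style, every matched image-sentence pair has to beat every mismatched pair in both retrieval directions by the margin, which is what pushes the two modalities into a common space; the actual C2MLR model learns the embeddings jointly rather than scoring fixed random vectors as in this toy example.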

    Published In

    MM '15: Proceedings of the 23rd ACM international conference on Multimedia
    October 2015
    1402 pages
    ISBN:9781450334594
    DOI:10.1145/2733373

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. compositional embedding
    2. learning to rank
    3. local-global alignment

    Qualifiers

    • Research-article

    Funding Sources

    • National Basic Research Program of China
    • 863 Program
    • Zhejiang Provincial Natural Science Foundation of China
    • National Natural Science Foundation of China
    • Fundamental Research Funds for the Central Universities
    • China Knowledge Centre for Engineering Sciences and Technology

    Conference

MM '15: ACM Multimedia Conference
    October 26 - 30, 2015
    Brisbane, Australia

    Acceptance Rates

MM '15 paper acceptance rate: 56 of 252 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
