DOI: 10.1145/2733373.2806240

Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment

Published: 13 October 2015

Abstract

Cross-modal retrieval is an active research topic that is important to many applications involving multi-modal data. Discovering an appropriate representation for multi-modal data and learning a ranking function are both essential to effective cross-media retrieval. Motivated by the assumption that a compositional cross-modal semantic representation (built over pairs of images and text) is better suited to cross-modal ranking, this paper exploits existing image-text databases to optimize a ranking function for cross-modal retrieval, called deep compositional cross-modal learning to rank (C2MLR). C2MLR learns a multi-modal embedding by optimizing a pairwise ranking problem while enforcing both local and global alignment. In particular, the local alignment (i.e., the alignment between visual objects and textual words) and the global alignment (i.e., the image-level and sentence-level alignment) are used collaboratively to learn a common multi-modal embedding space in a max-margin learning-to-rank manner. Experiments demonstrate the superiority of the proposed C2MLR, owing to its compositional multi-modal embedding.
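
The abstract describes learning a common image-text embedding with a max-margin pairwise ranking objective that combines local (region-word) and global (image-sentence) alignment. The paper's exact loss is not given on this page, so the sketch below is only an illustrative, generic bidirectional max-margin ranking loss over such a shared space; the scoring functions, the margin, the blending weight alpha, and all array shapes are assumptions for the example, not the authors' C2MLR formulation.

```python
# Minimal sketch (assumed formulation, not the paper's): a bidirectional
# max-margin ranking loss over a shared image-text embedding space that
# blends a "global" image-sentence score with a "local" region-word score.
import numpy as np

def global_score(img_emb, sent_emb):
    """Cosine similarity between an image-level and a sentence-level embedding."""
    return float(img_emb @ sent_emb /
                 (np.linalg.norm(img_emb) * np.linalg.norm(sent_emb) + 1e-8))

def local_score(region_embs, word_embs):
    """One plausible local alignment: average, over words, of each word's
    best-matching region similarity (raw dot products for simplicity)."""
    sims = region_embs @ word_embs.T            # (n_regions, n_words)
    return float(sims.max(axis=0).mean())

def pairwise_margin_loss(images, sentences, regions, words, margin=0.1, alpha=0.5):
    """Matched pair (i, i) must outscore every mismatched pair (i, j) and
    (j, i) by at least `margin`, in both retrieval directions."""
    n = len(images)

    def score(i, j):
        return (alpha * global_score(images[i], sentences[j])
                + (1.0 - alpha) * local_score(regions[i], words[j]))

    loss = 0.0
    for i in range(n):
        s_pos = score(i, i)
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin - s_pos + score(i, j))   # rank sentences for image i
            loss += max(0.0, margin - s_pos + score(j, i))   # rank images for sentence i
    return loss / (n * (n - 1))

# Toy usage with random embeddings in a shared 8-D space (hypothetical shapes).
rng = np.random.default_rng(0)
imgs  = rng.normal(size=(4, 8))       # 4 image-level embeddings
sents = rng.normal(size=(4, 8))       # 4 sentence-level embeddings
regs  = rng.normal(size=(4, 5, 8))    # 5 region embeddings per image
wrds  = rng.normal(size=(4, 6, 8))    # 6 word embeddings per sentence
print(pairwise_margin_loss(imgs, sents, regs, wrds))
```

In an objective of this style, every matched image-sentence pair has to beat every mismatched pair in both retrieval directions by the margin, which is what pushes the two modalities into a common space; the actual C2MLR model learns the embeddings jointly rather than scoring fixed random vectors as in this toy example.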

    Published In

    MM '15: Proceedings of the 23rd ACM international conference on Multimedia
    October 2015
    1402 pages
    ISBN:9781450334594
    DOI:10.1145/2733373

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. compositional embedding
    2. learning to rank
    3. local-global alignment

    Qualifiers

    • Research-article

    Funding Sources

    • National Basic Research Program of China
    • 863 Program
    • Zhejiang Provincial Natural Science Foundation of China
    • National Natural Science Foundation of China
    • Fundamental Research Funds for the Central Universities
    • China Knowledge Centre for Engineering Sciences and Technology

    Conference

MM '15: ACM Multimedia Conference
    October 26 - 30, 2015
    Brisbane, Australia

    Acceptance Rates

MM '15 paper acceptance rate: 56 of 252 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
