Image-Text Embedding with Hierarchical Knowledge for Cross-Modal Retrieval

Authors: Sanghyun Seo, Juntae Kim

CSAI '18: Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence

Pages 350 - 353

https://doi.org/10.1145/3297156.3297244

Published: 08 December 2018

Abstract

Heterogeneous data embedding is the process of mapping different kinds of data into a common vector space of a given dimension. Image-text embedding, accordingly, maps image and text data, which have completely different characteristics, into a common vector space. In this paper, we propose an image-text embedding method that uses hierarchical knowledge, such as the coarse and fine labels of text data. The proposed method improves the training efficiency of the embedding model by fixing the coarse label vectors. In addition, the loss function is designed so that the negative sample is selected arbitrarily from the fine labels that have a hierarchical relationship with the coarse label, which increases the difference between the vectors of fine labels that share the same coarse label. As a result, when images, which are visual data, are mapped into the common vector space, their semantics become clearer. Experimental results show that embedding with hierarchical knowledge was performed successfully with the proposed method and that cross-modal retrieval can be performed efficiently with the embedding model.
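
The negative-sampling idea described in the abstract can be sketched roughly as follows. The abstract does not state the loss formula, so the hinge-style triplet loss below is an assumption, as are the toy label hierarchy, the two-dimensional vectors, and the function names `sample_sibling_negative` and `triplet_hinge_loss`. It is meant only as an illustration of drawing the negative from fine labels under the same coarse label, not as the authors' implementation.

```python
import random

# Hypothetical two-level hierarchy: coarse label -> fine labels.
HIERARCHY = {
    "vehicle": ["bus", "car", "truck"],
    "animal": ["cat", "dog", "horse"],
}


def sample_sibling_negative(fine_label, hierarchy, rng):
    """Return (coarse label, negative fine label) for a positive fine label.

    The negative is drawn from the fine labels that share the positive's
    coarse label, excluding the positive itself.
    """
    for coarse, fines in hierarchy.items():
        if fine_label in fines:
            siblings = [f for f in fines if f != fine_label]
            if not siblings:
                raise ValueError(f"{fine_label!r} has no sibling fine labels")
            return coarse, rng.choice(siblings)
    raise KeyError(fine_label)


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Assumed loss: push the anchor closer to the positive than to the
    negative by at least `margin` (zero once that gap is reached)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, margin + gap)


# Toy usage: an image embedding whose true fine label is "car".
rng = random.Random(0)
coarse, negative_label = sample_sibling_negative("car", HIERARCHY, rng)
label_vectors = {"bus": [0.0, 1.0], "car": [1.0, 0.0], "truck": [0.8, 0.2]}
image_embedding = [0.9, 0.1]
loss = triplet_hinge_loss(
    image_embedding, label_vectors["car"], label_vectors[negative_label]
)
print(coarse, negative_label, loss)
```

Restricting negatives to sibling fine labels means the loss spends its effort separating classes that are semantically close, which is the effect the abstract attributes to the design.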

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published In

CSAI '18: Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence

December 2018

641 pages

ISBN: 9781450366069

DOI: 10.1145/3297156

In-Cooperation

Shenzhen University: Shenzhen University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Research Foundation of Korea

Conference

CSAI '18

CSAI '18: 2018 2nd International Conference on Computer Science and Artificial Intelligence

December 8 - 10, 2018

Shenzhen, China
