DOI: 10.1145/3511808.3557710

Texture BERT for Cross-modal Texture Image Retrieval

Published: 17 October 2022

Abstract

We propose Texture BERT, a model that describes the visual attributes of texture using natural language. To capture the rich details in texture images, we introduce a group-wise compact bilinear pooling method that represents a texture image by a set of visual patterns. The similarity between a texture image and its language description is determined by cross-matching the set of visual patterns from the image against the set of word features from the description. We further exploit self-attention transformer layers to provide cross-modal context and enhance the effectiveness of matching. Texture BERT achieves state-of-the-art accuracy on both text retrieval and image retrieval tasks, demonstrating its effectiveness in describing texture through natural language.
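The cross-matching idea in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual formulation: the cosine similarity, the max-over-patterns aggregation, and the function name and array shapes are all invented for the example.

```python
import numpy as np

def cross_match_similarity(patterns: np.ndarray, words: np.ndarray) -> float:
    """Match a set of visual patterns (P, d) against word features (W, d).

    Each word feature is aligned with its best-matching visual pattern,
    and the per-word scores are averaged into one image-text similarity.
    """
    # L2-normalize both sets so the dot product equals cosine similarity
    p = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ p.T  # (W, P) pairwise cosine similarities
    # best-matching pattern per word, averaged over words
    return float(sim.max(axis=1).mean())
```

In the actual model, such a hand-crafted aggregation would be replaced by the learned self-attention transformer layers mentioned in the abstract, which supply cross-modal context before matching.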


Cited By

  • (2023) The Visual Language of Fabrics. ACM Transactions on Graphics 42(4), 1--15. DOI: 10.1145/3592391. Online publication date: 26-Jul-2023.
  • (2022) Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising. 2022 IEEE International Conference on Big Data (Big Data), 2150--2159. DOI: 10.1109/BigData55660.2022.10020922. Online publication date: 17-Dec-2022.


Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. computer vision
  2. cross-modal search
  3. retrieval

Qualifiers

  • Short-paper

Conference

CIKM '22

Acceptance Rates

CIKM '22 Paper Acceptance Rate 621 of 2,257 submissions, 28%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Article Metrics

  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 3
Reflects downloads up to 10 Nov 2024.

