DOI: 10.1145/3511808.3557710

Texture BERT for Cross-modal Texture Image Retrieval

Published: 17 October 2022

Abstract

We propose Texture BERT, a model that describes the visual attributes of texture using natural language. To capture the rich details in texture images, we introduce a group-wise compact bilinear pooling method that represents a texture image by a set of visual patterns. The similarity between a texture image and its language description is determined by cross-matching the set of visual patterns from the image against the set of word features from the description. We further exploit self-attention transformer layers to provide cross-modal context and enhance the effectiveness of matching. Texture BERT achieves state-of-the-art accuracy on both text retrieval and image retrieval tasks, demonstrating its effectiveness in describing texture through natural language.
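The cross-matching idea in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual formulation: the cosine similarity, the max-over-patterns aggregation, and the function name and array shapes are all invented for the example.

```python
import numpy as np

def cross_match_similarity(patterns: np.ndarray, words: np.ndarray) -> float:
    """Match a set of visual patterns (P, d) against word features (W, d).

    Each word feature is aligned with its best-matching visual pattern,
    and the per-word scores are averaged into one image-text similarity.
    """
    # L2-normalize both sets so the dot product equals cosine similarity
    p = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ p.T  # (W, P) pairwise cosine similarities
    # best-matching pattern per word, averaged over words
    return float(sim.max(axis=1).mean())
```

In the actual model, such a hand-crafted aggregation would be replaced by the learned self-attention transformer layers mentioned in the abstract, which supply cross-modal context before matching.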


Cited By

  • (2023) The Visual Language of Fabrics. ACM Transactions on Graphics 42(4), 1--15. DOI: 10.1145/3592391. Online publication date: 26-Jul-2023.
  • (2022) Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising. 2022 IEEE International Conference on Big Data (Big Data), 2150--2159. DOI: 10.1109/BigData55660.2022.10020922. Online publication date: 17-Dec-2022.


Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. computer vision
  2. cross-modal search
  3. retrieval

Qualifiers

  • Short-paper

Conference

CIKM '22

Acceptance Rates

CIKM '22 Paper Acceptance Rate 621 of 2,257 submissions, 28%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Article Metrics

  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 3
Reflects downloads up to 10 Nov 2024.

