DOI: 10.1145/3459637.3481937 · CIKM Conference Proceedings · Research Article

Multi-modal Dictionary BERT for Cross-modal Video Search in Baidu Advertising

Published: 30 October 2021

Abstract

Video advertisements are attractive to viewers and therefore favored by advertisers. Baidu, one of the leading search advertising platforms in China, is investing increasing effort in video advertising for its advertiser customers. Search-based video advertisement display is, in essence, a cross-modal retrieval problem, normally tackled through joint-embedding methods. However, because joint-embedding methods lack interactions between text features and image features, they cannot match the accuracy of their attention-based counterparts. Inspired by the great success of BERT in NLP tasks, many cross-modal BERT models have emerged and achieved excellent performance in cross-modal retrieval. Last year, Baidu also launched a cross-modal BERT, CAN, on its video advertising platform and achieved considerably better performance than the previous joint-embedding model. In this paper, we present our recent work on video advertisement retrieval, the Multi-modal Dictionary BERT (MDBERT) model. Compared with CAN and other cross-modal BERT models, MDBERT integrates a joint dictionary that is shared between video features and word features. It maps relevant word features and video features to the same codeword, fostering effective cross-modal attention. To support end-to-end training, we propose softening the codeword assignment. To enhance inference efficiency, we adopt product quantization, which achieves a fine-grained partition of the feature space at low cost. After MDBERT was launched on the Baidu video advertising platform, the conversion rate (CVR) increased by 3.34%, bringing a considerable revenue boost for advertisers on Baidu.
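The shared dictionary, soft codeword assignment, and product quantization described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the paper's implementation; the function name, shapes, and the temperature parameter are all assumptions. Both modalities would pass their features through the same codebooks, so related text and video sub-vectors are pulled toward the same codewords, while the softmax keeps the assignment differentiable for end-to-end training.

```python
import numpy as np

def soft_pq_assign(x, codebooks, tau=0.1):
    """Soft product-quantized codeword assignment (illustrative sketch).

    x:         (batch, dim) features from either modality (text or video).
    codebooks: (m, k, dim // m) -- one small codebook per subspace, shared
               across modalities. The Cartesian product of the m codebooks
               partitions the space into k**m cells at the cost of only
               m * k stored codewords.
    tau:       softmax temperature; smaller values approach hard assignment.
    """
    m, k, sub_dim = codebooks.shape
    b = x.shape[0]
    x = x.reshape(b, m, sub_dim)                       # split into m subspaces
    # Squared distance from each sub-vector to each codeword: (b, m, k)
    dist = ((x[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    # Softened assignment keeps the codeword mapping differentiable,
    # which is what permits end-to-end training.
    logits = -dist / tau
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    # Soft-quantized reconstruction, concatenated back to the full dimension
    quantized = np.einsum('bmk,mkd->bmd', w, codebooks)
    return quantized.reshape(b, -1)
```

At inference time, a hard variant of the same scheme (`argmin` over `dist`) reduces each feature to m codeword indices, which is how product quantization keeps the fine-level partition cheap.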



Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN: 9781450384469
DOI: 10.1145/3459637


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. advertisement
  2. computer vision
  3. cross-modal retrieval
  4. deep learning
  5. natural language processing
  6. search

Qualifiers

  • Research-article

Conference

CIKM '21

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



Cited By

  • (2024) MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 4747-4759. DOI: 10.1109/ICDE60146.2024.00361
  • (2022) U-BERT for Fast and Scalable Text-Image Retrieval. Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, 193-203. DOI: 10.1145/3539813.3545148
  • (2022) EGM: Enhanced Graph-based Model for Large-scale Video Advertisement Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4443-4451. DOI: 10.1145/3534678.3539061
  • (2022) Texture BERT for Cross-modal Texture Image Retrieval. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4610-4614. DOI: 10.1145/3511808.3557710
  • (2022) Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4655-4660. DOI: 10.1145/3511808.3557653
  • (2022) Detection of Customer Opinions with Deep Learning Method for Metaverse Collaborating Brands. 2022 International Conference on Data Analytics for Business and Industry (ICDABI), 603-607. DOI: 10.1109/ICDABI56818.2022.10041681
  • (2022) Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising. 2022 IEEE International Conference on Big Data (Big Data), 2150-2159. DOI: 10.1109/BigData55660.2022.10020922
  • (2022) Boost CTR Prediction for New Advertisements via Modeling Visual Content. 2022 IEEE International Conference on Big Data (Big Data), 2140-2149. DOI: 10.1109/BigData55660.2022.10020786
  • (2021) Assorted Attention Network for Cross-Lingual Language-to-Vision Retrieval. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2444-2454. DOI: 10.1145/3459637.3482233
  • (2021) MixBERT for Image-Ad Relevance Scoring in Advertising. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3597-3602. DOI: 10.1145/3459637.3482143
