DOI: 10.1145/3459637.3481937 · CIKM Conference Proceedings · Research Article

Multi-modal Dictionary BERT for Cross-modal Video Search in Baidu Advertising

Published: 30 October 2021

Abstract

Video advertisements are attractive to viewers and therefore favored by advertisers. Baidu, one of the leading search advertising platforms in China, is investing increasing effort in video advertising for its advertiser customers. Search-based video advertisement display is, in essence, a cross-modal retrieval problem, normally tackled through joint-embedding methods. However, because joint-embedding methods lack interactions between text features and image features, they cannot match the accuracy of their attention-based counterparts. Inspired by the great success of BERT in NLP tasks, many cross-modal BERT models have emerged and achieved excellent performance in cross-modal retrieval. Last year, Baidu also launched a cross-modal BERT, CAN, on its video advertising platform and achieved considerably better performance than the previous joint-embedding model. In this paper, we present our recent work on video advertisement retrieval, the Multi-modal Dictionary BERT (MDBERT) model. Compared with CAN and other cross-modal BERT models, MDBERT integrates a joint dictionary that is shared between video features and word features. It maps relevant word features and video features to the same codeword, fostering effective cross-modal attention. To support end-to-end training, we propose softening the codeword assignment. To enhance inference efficiency, we adopt product quantization, which achieves a fine-grained partition of the feature space at low cost. After MDBERT was launched on the Baidu video advertising platform, the conversion rate (CVR) increased by 3.34%, bringing a considerable revenue boost for advertisers on Baidu.
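The shared dictionary, soft codeword assignment, and product quantization described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the paper's implementation; the function name, shapes, and the temperature parameter are all assumptions. Both modalities would pass their features through the same codebooks, so related text and video sub-vectors are pulled toward the same codewords, while the softmax keeps the assignment differentiable for end-to-end training.

```python
import numpy as np

def soft_pq_assign(x, codebooks, tau=0.1):
    """Soft product-quantized codeword assignment (illustrative sketch).

    x:         (batch, dim) features from either modality (text or video).
    codebooks: (m, k, dim // m) -- one small codebook per subspace, shared
               across modalities. The Cartesian product of the m codebooks
               partitions the space into k**m cells at the cost of only
               m * k stored codewords.
    tau:       softmax temperature; smaller values approach hard assignment.
    """
    m, k, sub_dim = codebooks.shape
    b = x.shape[0]
    x = x.reshape(b, m, sub_dim)                       # split into m subspaces
    # Squared distance from each sub-vector to each codeword: (b, m, k)
    dist = ((x[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    # Softened assignment keeps the codeword mapping differentiable,
    # which is what permits end-to-end training.
    logits = -dist / tau
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    # Soft-quantized reconstruction, concatenated back to the full dimension
    quantized = np.einsum('bmk,mkd->bmd', w, codebooks)
    return quantized.reshape(b, -1)
```

At inference time, a hard variant of the same scheme (`argmin` over `dist`) reduces each feature to m codeword indices, which is how product quantization keeps the fine-level partition cheap.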



Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021
4966 pages
ISBN: 9781450384469
DOI: 10.1145/3459637


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. advertisement
  2. computer vision
  3. cross-modal retrieval
  4. deep learning
  5. natural language processing
  6. search

Qualifiers

  • Research-article

Conference

CIKM '21

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



Cited By

  • (2024) MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 4747-4759. DOI: 10.1109/ICDE60146.2024.00361
  • (2022) U-BERT for Fast and Scalable Text-Image Retrieval. Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, 193-203. DOI: 10.1145/3539813.3545148
  • (2022) EGM: Enhanced Graph-based Model for Large-scale Video Advertisement Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4443-4451. DOI: 10.1145/3534678.3539061
  • (2022) Texture BERT for Cross-modal Texture Image Retrieval. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4610-4614. DOI: 10.1145/3511808.3557710
  • (2022) Multi-scale Multi-modal Dictionary BERT For Effective Text-image Retrieval in Multimedia Advertising. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4655-4660. DOI: 10.1145/3511808.3557653
  • (2022) Detection of Customer Opinions with Deep Learning Method for Metaverse Collaborating Brands. 2022 International Conference on Data Analytics for Business and Industry (ICDABI), 603-607. DOI: 10.1109/ICDABI56818.2022.10041681
  • (2022) Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising. 2022 IEEE International Conference on Big Data (Big Data), 2150-2159. DOI: 10.1109/BigData55660.2022.10020922
  • (2022) Boost CTR Prediction for New Advertisements via Modeling Visual Content. 2022 IEEE International Conference on Big Data (Big Data), 2140-2149. DOI: 10.1109/BigData55660.2022.10020786
  • (2021) Assorted Attention Network for Cross-Lingual Language-to-Vision Retrieval. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2444-2454. DOI: 10.1145/3459637.3482233
  • (2021) MixBERT for Image-Ad Relevance Scoring in Advertising. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3597-3602. DOI: 10.1145/3459637.3482143
