Abstract
This is the first work to introduce the concepts of the Most Retrievable Image (MRI) and the Least Retrievable Image (LRI) in modern text-to-image retrieval systems. An MRI is associated with, and thus can be retrieved by, many unrelated texts, while an LRI is disassociated from, and thus not retrievable by, related texts. Both have important practical applications and implications. Due to their one-to-many nature, constructing MRIs and LRIs is fundamentally challenging. This research addresses this nontrivial problem by developing novel and effective loss functions to craft perturbations that corrupt the feature correlation between the visual and language spaces, thus enabling MRIs and LRIs. The proposed schemes are implemented based on CLIP, a state-of-the-art image and text representation model, to demonstrate MRIs and LRIs and their applications in privacy-preserving image sharing and malicious advertising. They are evaluated by extensive experiments with modern visual-language models on multiple benchmarks, including Paris, ImageNet, Flickr30k, and MSCOCO. The experimental results show the effectiveness and robustness of the proposed schemes for constructing MRIs and LRIs.
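To make the construction concrete, below is a minimal PGD-style sketch of how such a perturbation could be crafted against CLIP. The loss here (mean cosine similarity between the perturbed image and a set of unrelated captions, maximized within an \(\ell_\infty\) budget) is an illustrative stand-in rather than the paper's actual loss functions; the function name craft_mri, the step size alpha, and the iteration count are hypothetical choices.

import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()  # fp32 so gradients flow cleanly through the encoders
for p in model.parameters():
    p.requires_grad_(False)   # we only need gradients w.r.t. the perturbation

# CLIP's published pixel-normalization statistics.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def craft_mri(image, unrelated_captions, eps=16/255, alpha=2/255, steps=40):
    """Perturb `image` (a 1x3x224x224 tensor on `device` with values in [0, 1])
    so that it is retrieved by many unrelated text queries (an MRI). For an LRI,
    negate the loss and pass the image's related captions instead."""
    tokens = clip.tokenize(unrelated_captions).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        img_feat = model.encode_image(((image + delta) - CLIP_MEAN) / CLIP_STD)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        loss = (img_feat @ text_feat.t()).mean()  # mean cosine similarity to all captions
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                  # gradient-ascent step
            delta.clamp_(-eps, eps)                             # project into the L_inf ball
            delta.copy_((image + delta).clamp(0, 1) - image)    # keep pixel values valid
        delta.grad.zero_()
    return (image + delta).detach()

In this sketch, the symmetry between MRI and LRI reduces to a sign flip on the similarity objective, which matches the abstract's framing of corrupting the cross-modal feature correlation in both directions.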
Notes
- 1. Here, we set \(\varepsilon = 16/255\), which is commonly used in robustness analyses of image classification systems.
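For reference, under the standard \(\ell_\infty\) threat model (an assumption here, as the footnote does not name the norm), this budget constrains the perturbation \(\delta\) added to an image \(x\) as

\[ \|\delta\|_\infty \le \varepsilon = \frac{16}{255}, \qquad x_{\mathrm{adv}} = \operatorname{clip}_{[0,1]}(x + \delta), \]

i.e., no pixel may change by more than 16 of 255 intensity levels.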
Acknowledgements
This work was supported in part by the NSF under Grants CNS-2120279, CNS-1950704, CNS-1828593, CNS-2153358, and OAC-1829771, ONR under Grant N00014-20-1-2065, AFRL under Grant FA8750-19-3-1000, NSA under Grants H98230-21-1-0165 and H98230-21-1-0278, DoD CoE-AIML under Contract Number W911NF-20-2-0277, the Commonwealth Cyber Initiative, and InterDigital Communications, Inc.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, L., Ning, R., Li, J., Xin, C., Wu, H. (2022). Most and Least Retrievable Images in Visual-Language Query Systems. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_1
DOI: https://doi.org/10.1007/978-3-031-19836-6_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science, Computer Science (R0)