
DOI: 10.1145/3581783.3611758

Reservoir Computing Transformer for Image-Text Retrieval

Published: 27 October 2023

Abstract

Although the attention mechanism in transformers has proven successful in image-text retrieval, most transformer models suffer from a large number of parameters. Inspired by brain circuits that process information with recurrently connected neurons, we propose a novel Reservoir Computing Transformer Reasoning Network (RCTRN) for image-text retrieval. RCTRN employs a two-step strategy that addresses feature representation and the data distribution of the different modalities separately. Specifically, visual and textual features are passed through a unified meshed reasoning module, which encodes multi-level feature relationships with prior knowledge and aggregates the complementary outputs more effectively. A reservoir reasoning network (RRN) is proposed to optimize memory connections between features at different stages and to address the data-distribution mismatch introduced by the unified scheme. To investigate the significance of RRN's low power dissipation and low bandwidth in practical scenarios, we deployed the model in a wireless transmission system, demonstrating that RRN's optimization of data structures also confers a degree of robustness against channel noise. Extensive experiments on two benchmark datasets, Flickr30K and MS-COCO, demonstrate the superiority of RCTRN over state-of-the-art baselines in both retrieval performance and power dissipation.
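The reservoir reasoning network described above builds on reservoir computing, in which a fixed, randomly initialized recurrent network projects inputs into a high-dimensional state space and only a linear readout is trained. The sketch below is a minimal echo-state-network illustration of that general idea; the dimensions, data, and targets are made up for demonstration, and this is not the authors' RCTRN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
n_in, n_res = 4, 100

# Fixed random input and recurrent weights: the reservoir itself is never trained.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
# Rescale so the spectral radius is below 1, a common sufficient condition
# for the echo state property (fading memory of past inputs).
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def reservoir_states(inputs):
    """Drive the reservoir with a sequence of input vectors; collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)
        states.append(x)
    return np.array(states)

# Only the linear readout is fitted, here by ordinary least squares.
inputs = rng.standard_normal((50, n_in))
targets = rng.standard_normal((50, 2))   # dummy targets for the sketch
S = reservoir_states(inputs)             # (50, n_res) state matrix
W_out, *_ = np.linalg.lstsq(S, targets, rcond=None)
preds = S @ W_out                        # readout predictions
```

Because only the readout is optimized, training reduces to a single linear regression, which is the source of the low power dissipation the abstract highlights.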






Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States




Author Tags

  1. image-text retrieval
  2. reservoir computing
  3. transformer

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 213
  • Downloads (last 6 weeks): 12

Reflects downloads up to 26 Jan 2025


Cited By

  • (2025) Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3714431. Online publication date: 23 Jan 2025.
  • (2024) Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9749-9758. DOI: 10.1145/3664647.3686835. Online publication date: 28 Oct 2024.
  • (2024) A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 18-27. DOI: 10.1145/3664647.3681184. Online publication date: 28 Oct 2024.
  • (2024) Multi-Layer Probabilistic Association Reasoning Network for Image-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 34(10), pp. 9706-9717. DOI: 10.1109/TCSVT.2024.3394551. Online publication date: Oct 2024.
  • (2024) Do Keypoints Contain Crucial Information? Mining Keypoint Information to Enhance Cross-View Geo-Localization. 2024 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6. DOI: 10.1109/ICME57554.2024.10688249. Online publication date: 15 Jul 2024.
  • (2024) Improving Multimodal Rumor Detection via Dynamic Graph Modeling. Pattern Recognition, pp. 242-258. DOI: 10.1007/978-3-031-78456-9_16. Online publication date: 3 Dec 2024.