
DOI: 10.1145/3581783.3611758

Reservoir Computing Transformer for Image-Text Retrieval

Published: 27 October 2023

Abstract

Although the attention mechanism in transformers has proven successful in image-text retrieval, most transformer models suffer from a large number of parameters. Inspired by brain circuits that process information with recurrently connected neurons, we propose a novel Reservoir Computing Transformer Reasoning Network (RCTRN) for image-text retrieval. RCTRN employs a two-step strategy that addresses feature representation and the data distribution of the different modalities separately. Specifically, visual and textual features are passed through a unified meshed reasoning module, which encodes multi-level feature relationships with prior knowledge and aggregates the complementary outputs more effectively. A reservoir reasoning network (RRN) is proposed to optimize memory connections between features at different stages and to address the data-distribution mismatch introduced by the unified scheme. To investigate the significance of RRN's low power dissipation and low bandwidth in practical scenarios, we deployed the model in a wireless transmission system, demonstrating that RRN's optimization of data structures also confers a degree of robustness against channel noise. Extensive experiments on two benchmark datasets, Flickr30K and MS-COCO, demonstrate the superiority of RCTRN over state-of-the-art baselines in both retrieval performance and power dissipation.
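The reservoir reasoning network described above builds on reservoir computing, in which a fixed, randomly initialized recurrent network projects inputs into a high-dimensional state space and only a linear readout is trained. The sketch below is a minimal echo-state-network illustration of that general idea; the dimensions, data, and targets are made up for demonstration, and this is not the authors' RCTRN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
n_in, n_res = 4, 100

# Fixed random input and recurrent weights: the reservoir itself is never trained.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
# Rescale so the spectral radius is below 1, a common sufficient condition
# for the echo state property (fading memory of past inputs).
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def reservoir_states(inputs):
    """Drive the reservoir with a sequence of input vectors; collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)
        states.append(x)
    return np.array(states)

# Only the linear readout is fitted, here by ordinary least squares.
inputs = rng.standard_normal((50, n_in))
targets = rng.standard_normal((50, 2))   # dummy targets for the sketch
S = reservoir_states(inputs)             # (50, n_res) state matrix
W_out, *_ = np.linalg.lstsq(S, targets, rcond=None)
preds = S @ W_out                        # readout predictions
```

Because only the readout is optimized, training reduces to a single linear regression, which is the source of the low power dissipation the abstract highlights.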






Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States




Author Tags

  1. image-text retrieval
  2. reservoir computing
  3. transformer

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 213
  • Downloads (last 6 weeks): 12

Reflects downloads up to 26 Jan 2025


Cited By

  • (2025) Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3714431. Online publication date: 23 Jan 2025.
  • (2024) Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9749-9758. DOI: 10.1145/3664647.3686835. Online publication date: 28 Oct 2024.
  • (2024) A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 18-27. DOI: 10.1145/3664647.3681184. Online publication date: 28 Oct 2024.
  • (2024) Multi-Layer Probabilistic Association Reasoning Network for Image-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 34(10), pp. 9706-9717. DOI: 10.1109/TCSVT.2024.3394551. Online publication date: Oct 2024.
  • (2024) Do Keypoints Contain Crucial Information? Mining Keypoint Information to Enhance Cross-View Geo-Localization. 2024 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6. DOI: 10.1109/ICME57554.2024.10688249. Online publication date: 15 Jul 2024.
  • (2024) Improving Multimodal Rumor Detection via Dynamic Graph Modeling. Pattern Recognition, pp. 242-258. DOI: 10.1007/978-3-031-78456-9_16. Online publication date: 3 Dec 2024.