
Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning

Published: 22 January 2024

Abstract

Image captioning (IC), which brings vision to language, has drawn extensive attention. A crucial aspect of IC is the accurate depiction of visual relations among image objects. Visual relations encompass two primary facets: content relations and structural relations. Content relations, which comprise geometric position content (i.e., distances and sizes) and semantic interaction content (i.e., actions and possessive relations), capture the mutual correlations between objects. In contrast, structural relations pertain to the topological connectivity of object regions. Existing Transformer-based methods typically resort to geometric positions to enhance visual relations, yet shallow geometric content alone cannot precisely cover action-based content correlations or structural connection relations. In this article, we adopt a comprehensive perspective on the correlations between objects, incorporating both content relations (i.e., geometric and semantic relations) and structural relations, with the aim of generating plausible captions. To achieve this, we first construct a geometric graph from bounding-box features and a semantic graph from a scene graph parser to model the content relations. We further construct a novel topology graph that amalgamates the sparsity characteristics of the geometric and semantic graphs, enabling the representation of image structural relations. Second, we propose a unified approach to enrich image relation representations by integrating semantic, geometric, and structural relations into self-attention. Finally, in the language decoding stage, we further leverage the semantic relations as prior knowledge to generate accurate words. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model, improving CIDEr from 128.6% to 136.6%. Code has been released at https://github.com/CrossmodalGroup/ER-SAN/tree/main/VG-Cap.
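
To make the core idea concrete, the following is a minimal sketch, not the authors' released implementation (see the linked repository for that), of how semantic, geometric, and structural relation signals could be folded into a Transformer self-attention layer as additive, per-head biases. All module names, dimensions, and input conventions (geo_feat, sem_idx, topo_adj) are illustrative assumptions.

import torch
import torch.nn as nn

class RelationAwareSelfAttention(nn.Module):
    """Self-attention over object regions with relation-derived attention biases."""

    def __init__(self, d_model=512, n_heads=8, n_sem_relations=20):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Each relation source contributes one scalar bias per head and per object pair.
        self.geo_proj = nn.Linear(4, n_heads)                   # geometric: relative distances/sizes
        self.sem_emb = nn.Embedding(n_sem_relations, n_heads)   # semantic: parsed predicate indices
        self.str_gate = nn.Parameter(torch.zeros(n_heads))      # structural: learned weight on topology mask

    def forward(self, x, geo_feat, sem_idx, topo_adj):
        # x:        (B, N, d_model) region features
        # geo_feat: (B, N, N, 4) pairwise box geometry features
        # sem_idx:  (B, N, N) long, predicate index per object pair (0 = no relation)
        # topo_adj: (B, N, N) float in {0, 1}, merged sparsity of geometric/semantic graphs
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5        # (B, H, N, N)
        geo_bias = self.geo_proj(geo_feat).permute(0, 3, 1, 2)       # (B, H, N, N)
        sem_bias = self.sem_emb(sem_idx).permute(0, 3, 1, 2)         # (B, H, N, N)
        str_bias = self.str_gate.view(1, -1, 1, 1) * topo_adj.unsqueeze(1)
        attn = (scores + geo_bias + sem_bias + str_bias).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)

In this sketch the biases only reshape the attention distribution and leave the value pathway untouched; the topology term acts as a learned soft mask that merges the sparsity of the geometric and semantic graphs, echoing the role of the topology graph described in the abstract.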


Cited By

  • (2024) Towards Retrieval-Augmented Architectures for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–22. DOI: 10.1145/3663667. Online publication date: 12-Jun-2024.
  • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 229–239. DOI: 10.1145/3626772.3657727. Online publication date: 10-Jul-2024.
  • (2024) Improving radiology report generation with multi-grained abnormality prediction. Neurocomputing 600, 128122. DOI: 10.1016/j.neucom.2024.128122. Online publication date: Oct-2024.



    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
    May 2024, 650 pages
    EISSN: 1551-6865
    DOI: 10.1145/3613634
    • Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 22 January 2024
    Online AM: 25 December 2023
    Accepted: 20 December 2023
    Revised: 19 September 2023
    Received: 30 May 2023
    Published in TOMM Volume 20, Issue 5


    Author Tags

    1. Image captioning
    2. transformer
    3. scene graph
    4. topology graph

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Natural Science Foundation of China
    • Science Fund for Creative Research Groups
    • National Natural Science Foundation of China


    Article Metrics

    • Downloads (last 12 months): 444
    • Downloads (last 6 weeks): 68
    Reflects downloads up to 12 Nov 2024

