Abstract
Visual dialog is an attractive vision-language task in which a model must predict the correct answer given a question, the dialog history, and an image. Although researchers have offered diverse solutions for connecting text with vision, multi-modal information still interacts insufficiently for semantic alignment. To address this problem, we propose closed-loop reasoning with graph-aware dense interaction, which aims to discover cues through the dynamic structure of a graph and leverage them to enrich the dialog and image features. Moreover, we analyze the statistics of the linguistic entities hidden in the dialog to demonstrate the reliability of the graph construction. Experiments on two VisDial datasets show that our model achieves competitive results against previous methods. An ablation study and parameter analysis further demonstrate the effectiveness of our model.
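To make the idea of graph-aware dense interaction more concrete, the sketch below shows one plausible reading of it: dialog entities are treated as graph nodes, node features are propagated over a (hypothetical) adjacency structure, and image region features then densely attend to the updated nodes. All function names, tensor shapes, and the adjacency construction are illustrative assumptions made for exposition, not the authors' implementation.

```python
# Minimal, hypothetical sketch of graph-aware cross-modal interaction.
# Shapes, names, and the random entity graph are assumptions, not the paper's model.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_propagate(node_feats, adj):
    """One degree-normalized aggregation step over the dialog entity graph."""
    deg = adj.sum(axis=-1, keepdims=True) + 1e-6
    return (adj @ node_feats) / deg

def dense_interaction(region_feats, node_feats):
    """Cross-attention from image regions to graph nodes, added back residually."""
    scores = region_feats @ node_feats.T          # (regions, nodes) affinities
    attn = softmax(scores, axis=-1)
    return region_feats + attn @ node_feats       # graph-aware region features

# Toy usage: 36 image regions, 5 dialog entities, feature dimension 512.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))
entities = rng.normal(size=(5, 512))
adj = (rng.random((5, 5)) > 0.5).astype(float)    # hypothetical entity adjacency
entities = graph_propagate(entities, adj)
fused = dense_interaction(regions, entities)
print(fused.shape)                                # (36, 512)
```

In a full model, such fused representations would typically feed an answer decoder; the sketch stops at producing graph-aware region features.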
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (U21B2024, 62002257), the State Key Laboratory of Communication Content Cognition (Grant No. A02106), the Open Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. 20K04), the China Postdoctoral Science Foundation (2021M692395), and the Baidu Program. In addition, we sincerely thank the Baidu Program for providing the PaddlePaddle platform.
Additional information
Communicated by B-K. Bao.
Cite this article
Liu, A.-A., Zhang, G., Xu, N., et al.: Closed-loop reasoning with graph-aware dense interaction for visual dialog. Multimedia Systems 28, 1823–1832 (2022). https://doi.org/10.1007/s00530-022-00947-1