Abstract
Visual dialog is an attractive vision-language task in which a model must predict the correct answer given a question, the dialog history, and an image. Although researchers have offered diverse solutions for connecting text with vision, multi-modal information still interacts insufficiently for semantic alignment. To address this problem, we propose closed-loop reasoning with graph-aware dense interaction, which aims to discover cues through the dynamic structure of a graph and leverage them to enrich the dialog and image features. Moreover, we analyze the statistics of the linguistic entities hidden in the dialog to demonstrate the reliability of the graph construction. Experiments on two VisDial datasets show that our model achieves competitive results against previous methods. An ablation study and parameter analysis further demonstrate the effectiveness of our model.
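To make the idea of graph-aware dense interaction more concrete, the sketch below shows one plausible reading of it: dialog entities are treated as graph nodes, node features are propagated over a (hypothetical) adjacency structure, and image region features then densely attend to the updated nodes. All function names, tensor shapes, and the adjacency construction are illustrative assumptions made for exposition, not the authors' implementation.

```python
# Minimal, hypothetical sketch of graph-aware cross-modal interaction.
# Shapes, names, and the random entity graph are assumptions, not the paper's model.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_propagate(node_feats, adj):
    """One degree-normalized aggregation step over the dialog entity graph."""
    deg = adj.sum(axis=-1, keepdims=True) + 1e-6
    return (adj @ node_feats) / deg

def dense_interaction(region_feats, node_feats):
    """Cross-attention from image regions to graph nodes, added back residually."""
    scores = region_feats @ node_feats.T          # (regions, nodes) affinities
    attn = softmax(scores, axis=-1)
    return region_feats + attn @ node_feats       # graph-aware region features

# Toy usage: 36 image regions, 5 dialog entities, feature dimension 512.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))
entities = rng.normal(size=(5, 512))
adj = (rng.random((5, 5)) > 0.5).astype(float)    # hypothetical entity adjacency
entities = graph_propagate(entities, adj)
fused = dense_interaction(regions, entities)
print(fused.shape)                                # (36, 512)
```

In a full model, such fused representations would typically feed an answer decoder; the sketch stops at producing graph-aware region features.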
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (U21B2024, 62002257), the State Key Laboratory of Communication Content Cognition (Grant No. A02106), the Open Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. 20K04), the China Postdoctoral Science Foundation (2021M692395), and the Baidu Program. In addition, we sincerely thank the Baidu Program for providing the PaddlePaddle platform.
Additional information
Communicated by B-K. Bao.
Cite this article
Liu, A.-A., Zhang, G., Xu, N., et al.: Closed-loop reasoning with graph-aware dense interaction for visual dialog. Multimedia Systems 28, 1823–1832 (2022). https://doi.org/10.1007/s00530-022-00947-1