Abstract
The Visual Question Answering (VQA) task requires a deep understanding of both visual and textual content, together with access to the key information needed to answer the question. Most current works feed only the image and the question into the network, where the image features are over-sampled and the text features are under-sampled, resulting in insufficient alignment between image regions and question words. In this paper, we propose a Visual-Textual Semantic Alignment Network (VTSAN). Our network acquires visual-semantic tags from an object detector and takes the Image-Tag-Question triad \(\left\langle I, T, Q \right\rangle\) as its input. The tags serve as an intermediate medium between the key regions of the image and the key words of the question, and greatly enrich the text features, thereby significantly improving visual-textual semantic alignment. We demonstrate the effectiveness of the proposed network on the standard VQAv2 and VQA-CPv2 benchmarks. The experimental results show that it outperforms the baseline significantly, especially on counting questions.
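To make the triad input concrete, the following is a minimal sketch (not the authors' released code) of how the \(\left\langle I, T, Q \right\rangle\) input described above might be assembled: region features and their predicted tags come from an object detector, tags and question words share one word-embedding table so the tags act as a textual bridge, and a simple attention step aligns question words with regions through the tags. All module names, dimensions, and the residual fusion are illustrative assumptions; the paper's actual detector outputs, embeddings, and fusion network may differ.

```python
# Hypothetical sketch of an <I, T, Q> triad encoder; not the VTSAN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriadEncoder(nn.Module):
    """Encode image regions (I), detected tags (T), and question words (Q)."""

    def __init__(self, vocab_size=10000, word_dim=300, region_dim=2048, hidden=512):
        super().__init__()
        # Tags and question words share one embedding table, so the tags
        # serve as a textual bridge between image regions and question words.
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.region_proj = nn.Linear(region_dim, hidden)
        self.text_proj = nn.Linear(word_dim, hidden)

    def forward(self, region_feats, tag_ids, question_ids):
        # region_feats: (B, N, region_dim) pooled features from an object detector
        # tag_ids:      (B, N) class labels predicted for each detected region
        # question_ids: (B, L) tokenized question
        v = self.region_proj(region_feats)                # (B, N, H)
        t = self.text_proj(self.word_emb(tag_ids))        # (B, N, H)
        q = self.text_proj(self.word_emb(question_ids))   # (B, L, H)

        # Tags mediate alignment: attend question words over tags, then reuse
        # the same attention weights to pool the corresponding region features.
        attn = torch.softmax(q @ t.transpose(1, 2) / v.size(-1) ** 0.5, dim=-1)  # (B, L, N)
        aligned_regions = attn @ v                        # (B, L, H)
        fused = F.normalize(aligned_regions + q, dim=-1)  # simple residual fusion
        return fused.mean(dim=1)                          # (B, H) joint representation


if __name__ == "__main__":
    enc = TriadEncoder()
    B, N, L = 2, 36, 14
    out = enc(torch.randn(B, N, 2048),
              torch.randint(1, 10000, (B, N)),
              torch.randint(1, 10000, (B, L)))
    print(out.shape)  # torch.Size([2, 512])
```

The key design point this sketch illustrates is that tags live in the same embedding space as question words, so attention between question and tags is a purely textual matching problem, while the pooled output still carries the visual region features.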
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61976079 & 61672203, in part by Anhui Natural Science Funds for Distinguished Young Scholar under Grant 170808J08, and in part by Anhui Key Research and Development Program under Grant 202004a05020039.