Video question answering via traffic knowledge database and question classification

Published: 16 January 2024

Abstract

Video question answering (VideoQA) is the task of answering questions about videos: a model must understand the video content and combine it with relevant semantic context to answer questions of various types. Existing methods typically analyze the spatiotemporal correlations of the entire video to answer a question. However, for some simple questions the answer depends on only a single frame, and analyzing the entire video needlessly increases the learning cost; for some complex questions, the information contained in the video alone is insufficient to answer them fully. We therefore propose a VideoQA model based on question classification and a traffic knowledge database. Starting from the question itself, the model classifies questions into general scene questions and causal questions and processes the two types with different methods. For general scene questions, we first extract key frames of the video, converting the task into simpler image question answering, and then apply top-down and bottom-up attention mechanisms. For causal questions, we design a lightweight traffic knowledge database that supplies relevant traffic knowledge not originally present in VideoQA datasets to support the model's reasoning, and we process these questions with a question- and knowledge-guided aggregation graph attention network. Experimental results show that, while greatly reducing resource costs, our model outperforms models pretrained on millions of external samples on the TrafficQA dataset.
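The abstract describes a two-branch pipeline: classify the question, answer general scene questions by image QA over an extracted key frame, and answer causal questions by reasoning over a traffic knowledge database. Below is a minimal sketch of that routing idea only. Every name in it (the keyword classifier, middle-frame selection, the toy TRAFFIC_KB dictionary, the stubbed answer strings) is a hypothetical stand-in, not the authors' implementation: the paper uses a learned question classifier, bottom-up and top-down attention for the image-QA branch, and a question- and knowledge-guided aggregation graph attention network for the causal branch.

```python
# Sketch of the two-branch routing described in the abstract.
# All names below are hypothetical stand-ins for the paper's learned
# components (question classifier, key-frame extractor, traffic KB,
# and the two answering branches).

CAUSAL_CUES = ("why", "cause", "how could", "prevent", "avoid")

# Toy entries standing in for the paper's lightweight traffic knowledge
# database (assumed schema: textual cue -> related traffic knowledge).
TRAFFIC_KB = {
    "rear-end": "Rear-end collisions often result from tailgating or sudden braking.",
    "red light": "Running a red light violates right-of-way and risks side impacts.",
}


def classify_question(question: str) -> str:
    """Route a question to the causal or general-scene branch.
    A keyword heuristic stands in for the paper's learned classifier."""
    q = question.lower()
    return "causal" if any(cue in q for cue in CAUSAL_CUES) else "general"


def extract_key_frame(frames: list) -> object:
    """Pick one representative frame so the general-scene branch reduces
    to image QA; here, naively, the middle frame."""
    return frames[len(frames) // 2]


def retrieve_knowledge(question: str) -> list:
    """Fetch knowledge entries whose cue appears in the question."""
    q = question.lower()
    return [fact for cue, fact in TRAFFIC_KB.items() if cue in q]


def answer(question: str, frames: list) -> str:
    if classify_question(question) == "general":
        frame = extract_key_frame(frames)
        # The paper applies top-down and bottom-up attention to this frame.
        return f"[image QA over key frame {frame!r}]"
    facts = retrieve_knowledge(question)
    # The paper fuses retrieved knowledge with video features via a
    # question- and knowledge-guided aggregation graph attention network.
    return f"[graph-attention reasoning with {len(facts)} KB fact(s)]"


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3", "f4"]
    print(answer("What color is the traffic light ahead?", frames))
    print(answer("Why did the rear-end collision happen?", frames))
```

The point of the sketch is the cost argument from the abstract: simple questions never touch more than one frame, and complex questions draw on knowledge the video itself does not contain.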

Published In

Multimedia Systems, Volume 30, Issue 1
Feb 2024
905 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 16 January 2024
Accepted: 08 December 2023
Received: 01 May 2023

Author Tags

1. Video question answering
2. Knowledge
3. Transformer
4. Question classification

Qualifiers

• Research-article
