Video question answering via traffic knowledge database and question classification

Published: 16 January 2024

Abstract

Video question answering (VideoQA) is the task of answering questions about videos: a model must understand the video content and combine it with relevant semantic context to answer questions of various types. Existing methods typically analyze the spatiotemporal correlations of the entire video to answer a question. However, for some simple questions the answer depends on only a single frame, and analyzing the entire video needlessly increases the learning cost; for some complex questions, the information contained in the video alone is insufficient to answer them fully. We therefore propose a VideoQA model based on question classification and a traffic knowledge database. Starting from the question itself, the model classifies questions into general scene questions and causal questions and processes the two types with different methods. For general scene questions, we first extract key frames of the video, converting the task into simpler image question answering, and then apply top-down and bottom-up attention mechanisms. For causal questions, we design a lightweight traffic knowledge database that supplies relevant traffic knowledge not originally present in VideoQA datasets to support the model's reasoning, and we process these questions with a question- and knowledge-guided aggregation graph attention network. Experimental results show that, while greatly reducing resource costs, our model outperforms models pretrained on millions of external samples on the TrafficQA dataset.
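The abstract describes a two-branch pipeline: classify the question, answer general scene questions by image QA over an extracted key frame, and answer causal questions by reasoning over a traffic knowledge database. Below is a minimal sketch of that routing idea only. Every name in it (the keyword classifier, middle-frame selection, the toy TRAFFIC_KB dictionary, the stubbed answer strings) is a hypothetical stand-in, not the authors' implementation: the paper uses a learned question classifier, bottom-up and top-down attention for the image-QA branch, and a question- and knowledge-guided aggregation graph attention network for the causal branch.

```python
# Sketch of the two-branch routing described in the abstract.
# All names below are hypothetical stand-ins for the paper's learned
# components (question classifier, key-frame extractor, traffic KB,
# and the two answering branches).

CAUSAL_CUES = ("why", "cause", "how could", "prevent", "avoid")

# Toy entries standing in for the paper's lightweight traffic knowledge
# database (assumed schema: textual cue -> related traffic knowledge).
TRAFFIC_KB = {
    "rear-end": "Rear-end collisions often result from tailgating or sudden braking.",
    "red light": "Running a red light violates right-of-way and risks side impacts.",
}


def classify_question(question: str) -> str:
    """Route a question to the causal or general-scene branch.
    A keyword heuristic stands in for the paper's learned classifier."""
    q = question.lower()
    return "causal" if any(cue in q for cue in CAUSAL_CUES) else "general"


def extract_key_frame(frames: list) -> object:
    """Pick one representative frame so the general-scene branch reduces
    to image QA; here, naively, the middle frame."""
    return frames[len(frames) // 2]


def retrieve_knowledge(question: str) -> list:
    """Fetch knowledge entries whose cue appears in the question."""
    q = question.lower()
    return [fact for cue, fact in TRAFFIC_KB.items() if cue in q]


def answer(question: str, frames: list) -> str:
    if classify_question(question) == "general":
        frame = extract_key_frame(frames)
        # The paper applies top-down and bottom-up attention to this frame.
        return f"[image QA over key frame {frame!r}]"
    facts = retrieve_knowledge(question)
    # The paper fuses retrieved knowledge with video features via a
    # question- and knowledge-guided aggregation graph attention network.
    return f"[graph-attention reasoning with {len(facts)} KB fact(s)]"


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3", "f4"]
    print(answer("What color is the traffic light ahead?", frames))
    print(answer("Why did the rear-end collision happen?", frames))
```

The point of the sketch is the cost argument from the abstract: simple questions never touch more than one frame, and complex questions draw on knowledge the video itself does not contain.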

Published In

Multimedia Systems, Volume 30, Issue 1
Feb 2024
905 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 16 January 2024
Accepted: 08 December 2023
Received: 01 May 2023

Author Tags

1. Video question answering
2. Knowledge
3. Transformer
4. Question classification

Qualifiers

• Research-article
