Abstract
Vehicle searching from videos by textual descriptions is one of the most important tasks in traffic management towards smart cities. This paper proposes a method for retrieval of vehicles using a natural language-based query. Our method consists of two main components of textual extractor based on Bi-LSTM and visual extractor using ResNet-50 model. Both components extract hidden features from different modalities and then match them in a common space. This end-to-end process tries to build a textual-visual alignment model that will be utilized for the search phase. Our particularities in this framework are two-fold. In the video stream, we evaluate in detail the role of vehicle appearance compared to its motion. In the textual stream, we apply back-translation systems to enrich the textual dataset for the training phase. Experiments are conducted on AI City Challenging, showing the efficiency of each contribution in the overall framework. It confirms that not only appearance but additional motion cues are promising for vehicle retrieval, which provides the results of MRR, Rank@5 and Rank@10 are 0.2333, 0.3587 and 0.4837, respectively.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Islam, K.: Person search: new paradigm of person re-identification: a survey and outlook of recent works. Image Vis. Comput. 101, 103970 (2020)
Li, S., Xiao, T., Li, H., Zhou, B., Yue, D., Wang, X.: Person search with natural language description. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, IEEE Computer Society, July 2017, pp. 5187–5196 (2017)
Pham, T.T.T., et al.: Towards a large-scale person search by Vietnamese natural language: dataset and methods. Multimedia Tools. Appl. 81, 1–32 (2022). https://doi.org/10.1007/s11042-022-12138-1
Naphade, M., et al.: The 5th AI city challenge. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2021)
Feng, Q., Ablavsky, V., Sclaroff, S.: CityFlow-NL: tracking and retrieval of vehicles at city scale by natural language descriptions. CoRR abs/2101.04741 (2021)
Yu, Y., Ko, H., Choi, J., Kim, G.: End-to-end concept word detection for video captioning, retrieval, and question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3165–3173 (2017)
Bai, S., et al.: Connecting language and vision for natural language-based vehicle retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4034–4043 (2021)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Carreira, J., Zisserman, A.: Quo Vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308(2017)
Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7464–7473 (2019)
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612 (2017)
Wang, Z., Fang, Z., Wang, J., Yang, Y.: Visual-textual attributes alignment in person search by natural language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12357, pp. 402–420. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_24
Dzabraev, M., Kalashnikov, M., Komkov, S., Petiushko, A.: MDMMT: multidomain multimodal transformer for video retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3354–3363 (2021)
Park, E.J., Kim, H., Jeong, S., Kang, B., Kwon, Y.: Keyword-based vehicle retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4220–4227 (2021)
Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738 (2021)
Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 (2015)
Hoang, V.C.D., Koehn, P., Haffari, G., Cohn, T.: Iterative back-translation for neural machine translation. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 18–24 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M., Shen, Y.D.: Dual-path convolutional image-text embeddings with instance loss. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 16(2), 1–23 (2020)
Tang, Z., et al.: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8797–8806 (2019)
Voorhees, E.M., Tice, D.M.: The TREC-8 question answering track. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, European Language Resources Association (ELRA) (2000)
Pham, T.T.T., et al.: Person search by natural language description in Vietnamese using pre-trained visual-textual attributes alignment model. In: 2021 13th International Conference on Knowledge and Systems Engineering (KSE), pp. 1–6. IEEE (2021)
Acknowledgement
This research is funded by the Vietnam MPS under grant number BCN. 2020. T01. 04.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Can, QH. et al. (2022). Exploring the Effect of Vehicle Appearance and Motion for Natural Language-Based Vehicle Retrieval. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_5
Download citation
DOI: https://doi.org/10.1007/978-981-19-8234-7_5
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8233-0
Online ISBN: 978-981-19-8234-7
eBook Packages: Computer ScienceComputer Science (R0)