Abstract
Describing open-domain videos in natural language is a major challenge in video understanding, with potential applications ranging from assisting visually impaired people to managing large video collections. This paper presents an improved sequence-to-sequence video-to-text model (MM-BiS2VT) that incorporates multimodal feature fusion and a bidirectional language structure to improve on conventional methods. The model considers four kinds of features: RGB images, optical flow, spatiotemporal features, and audio. RGB image and optical flow features are extracted with ResNet152, and spatiotemporal features are obtained with an improved three-dimensional convolutional neural network. Audio features are added to complement the visual information and prove to be an important factor in the accuracy of the results. After these features are combined by a feature fusion method, bidirectional long short-term memory units (BiLSTMs) are used to generate descriptive sentences. The results indicate that fusing multimodal features yields better sentences than competing methods and that BiLSTMs also play a significant role in improving the accuracy of the outputs, suggesting that this work can serve as a useful reference for computer vision and video processing.
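The following minimal sketch (not the authors' implementation) illustrates the described pipeline, assuming per-frame RGB, optical-flow, spatiotemporal, and audio features have already been extracted: the modalities are fused per frame, encoded by a BiLSTM, and decoded into a word sequence. The class name, layer sizes, mean-pooled video context, and feature dimensions are illustrative assumptions.

```python
# Illustrative sketch of a multimodal BiLSTM encoder + LSTM decoder captioner.
# Feature dimensions and the fusion/decoding details are assumptions, not the
# authors' exact architecture.
import torch
import torch.nn as nn

class MultimodalBiS2VT(nn.Module):
    def __init__(self, rgb_dim=2048, flow_dim=2048, c3d_dim=4096,
                 audio_dim=128, hidden=512, vocab_size=10000):
        super().__init__()
        fused_dim = rgb_dim + flow_dim + c3d_dim + audio_dim
        self.proj = nn.Linear(fused_dim, hidden)          # fuse modalities per frame
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True,
                               bidirectional=True)         # BiLSTM over video frames
        self.decoder = nn.LSTM(2 * hidden + hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, rgb, flow, c3d, audio, captions):
        # each modality: (batch, frames, dim); captions: (batch, words) of word ids
        frames = torch.cat([rgb, flow, c3d, audio], dim=-1)
        enc_out, _ = self.encoder(self.proj(frames))        # (B, T, 2*hidden)
        video_ctx = enc_out.mean(dim=1, keepdim=True)       # simple mean-pooled context
        words = self.embed(captions)                        # (B, W, hidden)
        ctx = video_ctx.expand(-1, words.size(1), -1)       # repeat context per word
        dec_out, _ = self.decoder(torch.cat([ctx, words], dim=-1))
        return self.out(dec_out)                            # per-word vocabulary logits
```

In practice the logits would be trained with cross-entropy against the reference captions and evaluated with metrics such as BLEU and METEOR; attention over the encoder states is a common alternative to the mean-pooled context used here.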
Acknowledgements
This work was supported by the Science and Technology Support Program of Jiangsu Province through the projects "Research and Industrialization of Intelligent Video Processing Technology Based on GPU Parallel Computing" (BY 2016003-11) and "Application Platform and Industrialization of Efficient Cloud Computing for Big Data" (BA2015052). We also thank the researchers whose openly shared work this study builds on.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Du, X., Yuan, J., Hu, L. et al. Description generation of open-domain videos incorporating multimodal features and bidirectional encoder. Vis Comput 35, 1703–1712 (2019). https://doi.org/10.1007/s00371-018-1591-x