
Generating natural language tags for video information management

  • Original Paper
  • Published in Machine Vision and Applications (2017)

Abstract

This exploratory work is concerned with the generation of natural language descriptions that can be used for video retrieval applications. It is a step beyond keyword-based tagging, as it captures the relations between the keywords associated with videos. First, we prepare hand annotations consisting of descriptions for video segments crafted from a TREC Video dataset. Analysis of these data offers insights into humans' interests in video content. Second, we develop a framework for creating smooth and coherent descriptions of video streams. It builds on conventional image processing techniques that extract high-level features from individual video frames; a natural language description is then produced from these features. Although the feature extraction processes are error-prone at various levels, we explore approaches to combining their outputs into a coherent, smooth and well-phrased description by incorporating spatial and temporal information. Evaluation is performed by calculating ROUGE scores between human-annotated and machine-generated descriptions. Further, we introduce a task-based evaluation by human subjects, which provides a qualitative assessment of the generated descriptions.
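The ROUGE-based evaluation mentioned above compares n-gram overlap between human-annotated and machine-generated descriptions. As a rough illustration only, the sketch below computes unigram (ROUGE-1) recall, precision and F1 with clipped counts; the paper uses the standard ROUGE package, and the two example descriptions here are purely hypothetical.

```python
import re
from collections import Counter

def _tokens(text):
    """Lowercase and strip punctuation; a crude stand-in for real tokenisation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge_1(reference, candidate):
    """Return (recall, precision, F1) for unigram overlap, with clipped counts."""
    ref_counts = Counter(_tokens(reference))
    cand_counts = Counter(_tokens(candidate))
    overlap = sum((ref_counts & cand_counts).values())  # clipped (minimum) counts
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

# Hypothetical example: a hand annotation vs. a machine-generated description.
human = "A woman appears from the left. She is walking while a bike is in the background."
machine = "A woman is walking. A bike is visible in the background."
print(rouge_1(human, machine))
```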




Notes

  1. http://trecvid.nist.gov.

  2. Although annotations were also provided by TREC Video for these two video segments, they were not used in this study. TREC Video annotations differ from our hand annotations to some extent: they are shot-based, created for a single camera take, and one shot may show multiple humans performing multiple actions against different backgrounds. Descriptions of humans, their gender and their actions are present, and camera motion and angle, ethnicity and clothing are frequently stated; however, there is little detail about events or objects.

  3. en.wikipedia.org/wiki/Paul_Ekman.

  4. We plan to make this dataset public with the following structure: video ID, start time, end time, set of keywords, title, description and annotator ID.

  5. www.virtualffs.co.uk/In_a_Nutshell.html.

  6. One of the hand annotations for this video clip is as follows: ‘A woman appears from left. She is walking while a bike in the background. Later she comes across other humans’.

  7. The advantage of using GST, in comparison with alternative string similarity algorithms such as the longest common subsequence or edit distance, is its ability to detect block moves: the transposition of a substring of contiguous words is treated as a single move instead of each word being considered separately (a minimal sketch of the tiling procedure follows these notes).

  8. A tile is a consecutive subsequence of maximal length that forms a one-to-one pairing between the two input sentences.

  9. No comparison is made against keywords, since measuring fluency over keywords is not meaningful.
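To make the block-move behaviour described in notes 7 and 8 concrete, the sketch below is a simplified, unoptimised re-implementation of greedy string tiling over word tokens: it repeatedly marks the longest still-unmarked common run (a tile) until no run of at least `min_match` words remains. Wise's original algorithm accelerates this search with Karp–Rabin hashing and collects several maximal matches per pass; this version is illustrative only, and the example sentences are hypothetical.

```python
def greedy_string_tiling(a, b, min_match=2):
    """Simplified greedy string tiling over two token lists.

    Repeatedly finds the longest common run of still-unmarked tokens (a tile),
    marks those tokens in both sequences so they cannot be matched again, and
    stops once no unmarked common run of at least `min_match` tokens is left.
    Returns a list of (start_in_a, start_in_b, length) tiles, so a transposed
    block of contiguous words surfaces as a single tile rather than as many
    independent word-level matches.
    """
    marked_a, marked_b = [False] * len(a), [False] * len(b)
    tiles = []
    while True:
        best = (0, 0, 0)  # (i, j, length) of the longest unmarked run found
        for i in range(len(a)):
            if marked_a[i]:
                continue
            for j in range(len(b)):
                if marked_b[j]:
                    continue
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and not marked_a[i + k] and not marked_b[j + k]
                       and a[i + k] == b[j + k]):
                    k += 1
                if k > best[2]:
                    best = (i, j, k)
        if best[2] < min_match:
            break
        i, j, k = best
        for m in range(k):  # mark the tile in both sequences
            marked_a[i + m] = True
            marked_b[j + m] = True
        tiles.append(best)
    return tiles

# Hypothetical example: the trailing clause has moved, yet it is matched as one tile.
ref = "a woman appears from left and walks past a bike".split()
hyp = "a woman walks past a bike and appears from left".split()
print(greedy_string_tiling(ref, hyp))
# -> [(6, 2, 4), (2, 7, 3), (0, 0, 2)]: "walks past a bike", "appears from left", "a woman"
```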


Author information

Correspondence to Muhammad Usman Ghani Khan.

About this article

Cite this article

Khan, M.U.G., Gotoh, Y. Generating natural language tags for video information management. Machine Vision and Applications 28, 243–265 (2017). https://doi.org/10.1007/s00138-017-0825-7
