Generating natural language tags for video information management

  • Original Paper
Machine Vision and Applications

Abstract

This exploratory work is concerned with the generation of natural language descriptions that can be used for video retrieval applications. It is a step beyond keyword-based tagging, as it captures relations between the keywords associated with videos. Firstly, we prepare hand annotations consisting of descriptions for video segments crafted from a TREC Video dataset. Analysis of this data offers insight into what humans find interesting in video content. Secondly, we develop a framework for creating smooth and coherent descriptions of video streams. It builds on conventional image processing techniques that extract high-level features from individual video frames. A natural language description is then produced from these high-level features. Although the feature extraction processes are erroneous at various levels, we explore approaches to combining their outputs into a coherent, smooth and well-phrased description by incorporating spatial and temporal information. Evaluation is performed by calculating ROUGE scores between human-annotated and machine-generated descriptions. Further, we introduce a task-based evaluation by human subjects, which provides a qualitative assessment of the generated descriptions.
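
As a rough illustration of the ROUGE-based evaluation mentioned above, the sketch below computes a ROUGE-1 style unigram overlap between a machine-generated description and a human-annotated reference. It assumes plain whitespace tokenisation; the function name and example sentences are ours, and the scores reported in the paper are computed with ROUGE itself rather than this simplification.

    # Minimal ROUGE-1 sketch: unigram recall, precision and F1 between a
    # generated description and a human reference (illustrative only).
    from collections import Counter

    def rouge_1(candidate: str, reference: str) -> dict:
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())             # clipped unigram matches
        recall = overlap / max(sum(ref.values()), 1)
        precision = overlap / max(sum(cand.values()), 1)
        f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
        return {"recall": recall, "precision": precision, "f1": f1}

    if __name__ == "__main__":
        generated = "a woman is walking in the street"
        reference = "a woman walks along the street past a bike"
        print(rouge_1(generated, reference))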


Notes

  1. http://trecvid.nist.gov.

  2. Although annotations were also provided by TREC Video for these two video segments, they were not used for this study. TREC Video annotations differ from our hand annotations to some extent; they are shot-based, created for a single camera take, and one shot may show multiple humans performing multiple actions against different backgrounds. Descriptions of humans, their gender and their actions are given; camera motion and angle, ethnicity and clothing are also frequently stated. However, there is little detail about events or objects.

  3. en.wikipedia.org/wiki/Paul_Ekman.

  4. We plan to make this dataset public with the following structure: video ID, start time, end time, set of keywords, title, description and annotator ID (a sample record layout is sketched after these notes).

  5. www.virtualffs.co.uk/In_a_Nutshell.html.

  6. One of the hand annotations for this video clip is as follows: ‘A woman appears from left. She is walking while a bike in the background. Later she comes across other humans’.

  7. The advantage of using GST, in comparison with alternative string similarity algorithms such as the longest common subsequence or an edit distance, is its ability to detect block moves: the transposition of a substring of contiguous words is treated as a single move instead of each word being considered separately (see the sketch following these notes).

  8. A tile is a contiguous subsequence of maximal length that occurs as a one-to-one pairing between the two input sentences.

  9. No comparison is made against keywords since measuring fluency with keywords does not make sense.
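
For concreteness, the following sketch shows one way a record in the planned dataset release (note 4) could be laid out; the class name, field types and sample values are illustrative assumptions, not the released format.

    # Hypothetical record layout for the planned annotation release; field names
    # follow note 4, everything else (types, sample values) is illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class VideoAnnotation:
        video_id: str
        start_time: float              # seconds from the start of the source video
        end_time: float
        keywords: set = field(default_factory=set)
        title: str = ""
        description: str = ""
        annotator_id: str = ""

    example = VideoAnnotation(
        video_id="video_0001",         # placeholder identifier
        start_time=12.0,
        end_time=25.5,
        keywords={"woman", "walking", "bike"},
        title="Woman walking past a bike",
        description="A woman appears from left. She is walking while a bike "
                    "is in the background. Later she comes across other humans.",
        annotator_id="annotator_03",
    )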
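
To make the block-move behaviour described in notes 7 and 8 concrete, here is a minimal greedy string tiling sketch over word tokens: on each pass the longest common run of still-unmarked words is claimed as a tile, so a transposed block of contiguous words is matched as a single unit. The implementation and names are ours and simplified (a naive search rather than Karp–Rabin hashing), not the exact algorithm used in the paper.

    # Simplified greedy string tiling: repeatedly claim the longest unmarked
    # common word run (>= min_match) as a tile until no such run remains.
    def greedy_string_tiling(s1: str, s2: str, min_match: int = 2) -> list:
        a, b = s1.split(), s2.split()
        marked_a, marked_b = [False] * len(a), [False] * len(b)
        tiles = []
        while True:
            best = None                  # (length, i, j) of longest match this pass
            for i in range(len(a)):
                for j in range(len(b)):
                    k = 0
                    while (i + k < len(a) and j + k < len(b)
                           and a[i + k] == b[j + k]
                           and not marked_a[i + k] and not marked_b[j + k]):
                        k += 1
                    if k >= min_match and (best is None or k > best[0]):
                        best = (k, i, j)
            if best is None:
                break
            length, i, j = best
            for k in range(length):
                marked_a[i + k] = marked_b[j + k] = True
            tiles.append(" ".join(a[i:i + length]))
        return tiles

    # greedy_string_tiling("a woman walks past a bike", "past a bike a woman walks")
    # -> ['a woman walks', 'past a bike']: the transposed block counts as one tile.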

References

  1. Abella, A., Kender, J.R., Starren, J.: Description generation of abnormal densities found in radiographs. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, p. 42 (1995)

  2. Aker, A., Gaizauskas, R.: Generating image descriptions using dependency relational patterns. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1250–1258 (2010)

  3. Allen, J.F.: Towards a general theory of action and time. Artif. Intell. 23(2), 123–154 (1984)

  4. Bai, L., Li, K., Pei, J., Jiang, S.: Main objects interaction activity recognition in real images. Neural Comput. Appl. 1–14 (2015)

  5. Baiget, P., Fernández, C., Roca, X., Gonzàlez, J.: Trajectory-Based Abnormality Categorization for Learning Route Patterns in Surveillance. Springer, Berlin (2012)

  6. Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 190–200. Association for Computational Linguistics (2011)

  7. Cruz-Perez, C., Starostenko, O., Alarcon-Aquino, V., Rodriguez-Asomoza, J.: Automatic image annotation for description of urban and outdoor scenes. In: Innovations and Advances in Computing, Informatics, Systems Sciences, Networking and Engineering, pp. 139–147. Springer (2015)

  8. Das, D.: Human gait classification using combined HMM & SVM hybrid classifier. In: IEEE International Conference on Electronic Design, Computer Networks & Automated Verification (EDCAV), pp. 169–174 (2015)

  9. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389 (2014)

  10. Feng, Y., Lapata, M.: How many words is a picture worth? automatic caption generation for news images. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1239–1249 (2010)

  11. Filice, S., Da San Martino, G., Moschitti, A.: Structural representations for learning relations between pairs of texts. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, July. Association for Computational Linguistics (2015)

  12. Gitte, M., Bawaskar, H., Sethi, S., Shinde, A.: Content based video retrieval system. Int. J. Res. Eng. Technol. 3(6), 1 (2014)

  13. Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2712–2719. IEEE (2013)

  14. Hu, W.C., Yang, C.Y., Huang, D.Y., Huang, C.H.: Feature-based face detection against skin-color like backgrounds with varying illumination. J. Inf. Hiding Multimed. Signal Process. 2(2), 123–132 (2011)

  15. Khan, M.U.G., Gotoh, Y.: Describing video contents in natural language. In: Proceedings of the EACL Workshop, Avignon (2012)

  16. Khan, M.U.G., Al Harbi, N., Gotoh, Y.: A framework for creating natural language descriptions of video streams. Inf. Sci. 303, 61–82 (2015)

  17. Khan, M.U.G., Saeed, A.: Human detection in videos. J. Theor. Appl. Inf. Technol. 5(2), 1 (2009)

  18. Khan, M.U.G., Zhang, L., Gotoh, Y.: Towards coherent natural language description of video streams. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 664–671. IEEE (2011)

  19. Khan, M.U.G., Nawab, R.M.A., Gotoh, Y.: Natural language descriptions of visual scenes: corpus generation and analysis. In: Proceedings of the Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), pp. 38–47. Association for Computational Linguistics (2012)

  20. Kim, W., Park, J., Kim, C.: A novel method for efficient indoor–outdoor image classification. J. Signal Process. Syst. 61, 251–258 (2010)

  21. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pp. 423–430 (2003)

  22. Kojima, A., Takaya, M., Aoki, S., Miyamoto, T., Fukunaga, K.: Recognition and textual description of human activities by mobile robot. In: Proceedings of the 3rd International Conference on Innovative Computing Information and Control, pp. 53–53 (2008)

  23. Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: understanding and generating simple image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1601–1608 (2011)

  24. Lee, H., Morariu, V., Davis, L.S.: Clauselets: leveraging temporally related actions for video event analysis. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1161–1168 (2015)

  25. Lee, M.W., Hakeem, A., Haering, N., Zhu, S.C.: Save: a framework for semantic annotation of visual events. In: Proceedings of the Computer Vision and Pattern Recognition Workshops, pp. 1–8 (2008)

  26. Li, S., Kulkarni, G., Berg, T.L., Berg, A.C., Choi, Y.: Composing simple image descriptions using web-scale n-grams. In: Proceedings of the 15th Conference on Computational Natural Language Learning, pp. 220–228 (2011)

  27. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Proceedings of the ACL-04 Workshop (2004)

  28. Lin, D., Kong, C., Fidler, S., Urtasun, R.: Generating multi-sentence lingual descriptions of indoor scenes. arXiv preprint arXiv:1503.00064 (2015)

  29. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)

  30. Muller, P., Reymonet, A.: Using inference for evaluating models of temporal discourse. In: 12th International Symposium on Temporal Representation and Reasoning (2005)

  31. Nevatia, R., Zhao, T., Hongeng, S.: Hierarchical language-based representation of events in video streams. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, vol. 4, pp. 39–39 (2003)

  32. Pustejovsky, J., Castano, J., Ingria, R., Saurí, R., Gaizauskas, R., Setzer, A., Katz, G., Radev, D.: TimeML: robust specification of event and temporal expressions in text. In: Proceedings of the 5th International Workshop on Computational Semantics (2003)

  33. Pustejovsky, J., Ingria, B., Sauri, R., Castano, J., Littman, J., Gaizauskas, R., Setzer, A., Katz, G., Mani, I.: The Specification Language TimeML. The Language of Time: A Reader. Oxford University Press, Oxford (2004)

  34. Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B.: Coherent Multi-Sentence Video Description with Variable Level of Detail. Pattern Recognition, Lecture Notes in Computer Science, vol. 8753, pp. 184–195. Springer (2014)

  35. Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 433–440. IEEE (2013)

  36. Rosani, A., Conci, N., De Natale, F.G.B.: Human behavior understanding for assisted living by means of hierarchical context free grammars. In: IS&T/SPIE Electronic Imaging, pp. 90260E-90260E. International Society for Optics and Photonics (2014)

  37. Singh, B., Han, X., Wu, Z., Morariu, V.I., Davis, L.S.: Selecting relevant web trained concepts for automated event retrieval. arXiv preprint arXiv:1509.07845 (2015)

  38. Singh, D., Yadav, A.K., Kumar, V.: Human activity tracking using star skeleton and activity recognition using hmms and neural network. Int. J. Sci. Res. Publ. 4(5), 9 (2014)

  39. Smeaton, A.F., Over, P., Kraaij, W.: High-level feature detection from video in TRECVid: a 5-year retrospective of achievements. In: Multimedia Content Analysis, pp. 1–24 (2009)

  40. Stolcke, A.: SRILM—an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 901–904 (2002)

  41. Tan, C.C., Jiang, Y.-G., Ngo, C.-W.: Towards textually describing complex video contents with audio-visual concept classifiers. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 655–658. ACM (2011)

  42. Thomason, J., Venugopalan, S., Guadarrama, S., Saenko, K., Mooney, R.: Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING) (2014)

  44. Vallacher, R.R., Wegner, D.M.: A Theory of Action Identification. Psychology Press, Hove (2014)

  45. Verhagen, M., Mani, I., Sauri, R., Knippen, R., Jang, S.B., Littman, J., Rumshisky, A., Phillips, J., Pustejovsky, J.: Automating temporal annotation with TARSQI. In: Proceedings of the ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84 (2005)

  46. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1 (2001)

  47. Wilhelm, T., Böhme, H.-J., Gross, H.-M.: Classification of face images for gender, age, facial expression, and identity. In: Artificial Neural Networks: Biological Inspirations–ICANN 2005, pp. 569–574. Springer (2005)

  48. Wise, M.J.: String similarity via greedy string tiling and running Karp–Rabin matching. Online preprint, December (1993)

  49. Yan, F., Mikolajczyk, K.: Leveraging high level visual information for matching images and captions. In: Computer Vision–ACCV 2014, pp. 613–627. Springer (2015)

  50. Yang, Y., Teo, C.L., Daumé III, H., Fermüller, C., Aloimonos, Y.: Corpus-guided sentence generation of natural images. In: Proceedings of the EMNLP (2011)

  51. Yang, Y., Guha, A., Fermuller, C., Aloimonos, Y.: A cognitive system for understanding human manipulation actions. Adv. Cognit. Syst. 3, 67–86 (2014)

  52. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2T: image parsing to text description. Proc. IEEE 98(8), 1485–1508 (2010)

  53. Yu, H., Siskind, J.M.: Grounded language learning from video described with sentences. In: ACL (1), pp. 53–63 (2013)

  54. Zhang, L., Khan, M.U.G., Gotoh, Y.: Video scene classification based on natural language description. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 942–949. IEEE (2011)


Author information

Corresponding author

Correspondence to Muhammad Usman Ghani Khan.

About this article

Cite this article

Khan, M.U.G., Gotoh, Y. Generating natural language tags for video information management. Machine Vision and Applications 28, 243–265 (2017). https://doi.org/10.1007/s00138-017-0825-7
