Abstract
Numerous web videos with rich metadata are available on the Internet today. Metadata such as video tags creates opportunities for video search and multimedia content understanding, but it also poses a challenge: tags are usually annotated at the video level, while many of them describe only part of the video content. Localizing the parts or frames of a web video that are relevant to a given tag is therefore key to many applications and research tasks. In this paper, we propose combining a topic model with relevance filtering to localize relevant frames. Our method has three steps. First, we apply relevance filtering to assign relevance scores to video frames and obtain a raw relevant frame set by selecting the top-ranked frames. Second, we separate the frames into topics by mining their underlying semantics with latent Dirichlet allocation, and use the raw relevant frame set as a validation set to select relevant topics. Finally, we use the topical relevances to refine the raw relevant frame set and obtain the final results. Experimental results on two real web video databases validate the effectiveness of the proposed approach.
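The three steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that per-frame relevance scores (from relevance filtering) and per-frame topic distributions (for example, from latent Dirichlet allocation) have already been computed. The function names, the topic scoring rule (average topic probability over the raw set), and the blending weight `alpha` are all hypothetical choices made only for illustration.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Step 1: indices of the k highest-scoring frames (the raw relevant set)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def topic_relevance(
    topic_probs: Sequence[Sequence[float]], raw_set: Sequence[int]
) -> List[float]:
    """Step 2: score each topic by its mean probability over the raw set.

    Hypothetical rule: the raw set serves as a validation set, so a topic
    that explains many raw-set frames is treated as relevant.
    """
    n_topics = len(topic_probs[0])
    return [
        sum(topic_probs[i][t] for i in raw_set) / len(raw_set)
        for t in range(n_topics)
    ]


def refine(
    scores: Sequence[float],
    topic_probs: Sequence[Sequence[float]],
    k: int,
    alpha: float = 0.5,  # assumed blending weight, not taken from the paper
) -> Tuple[List[int], List[float]]:
    """Step 3: blend each frame score with its topical relevance and re-rank."""
    raw_set = top_k_indices(scores, k)
    rel = topic_relevance(topic_probs, raw_set)
    refined = [
        alpha * s + (1 - alpha) * sum(p * r for p, r in zip(probs, rel))
        for s, probs in zip(scores, topic_probs)
    ]
    return top_k_indices(refined, k), refined


# Toy input: five frames, two topics (all values are made up).
scores = [0.9, 0.8, 0.6, 0.55, 0.2]
topic_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.9, 0.1], [0.2, 0.8]]

raw_set = top_k_indices(scores, 3)
final_set, refined_scores = refine(scores, topic_probs, 3)
print(raw_set, final_set)
```

In this toy input, frame 2 is in the raw set on its score alone, but it belongs mostly to the topic that the raw set supports less. After refinement, frame 3, which shares the dominant topic, takes its place.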
Acknowledgments
This work was supported by the National Natural Science Funds of China (61033012, 61173104, 61202133).
Cite this article
Li, H., Yi, L., Liu, B. et al. Localizing relevant frames in web videos using topic model and relevance filtering. Machine Vision and Applications 25, 1661–1670 (2014). https://doi.org/10.1007/s00138-013-0537-6
DOI: https://doi.org/10.1007/s00138-013-0537-6