Abstract
This study presents an automatic approach for analyzing the structure of large-scale web videos based on visual and acoustic information. In our approach, video streams are macro-segmented by mining duplicate sequences. Both acoustic and visual information are used for mining so as to avoid missing true positives. Unlike TV data, where duplicate clips are quite similar, web videos contain severe visual and acoustic distortions. To handle these distortions, we present novel visual-acoustic feature schemes. A shot-based indexing algorithm and several temporal constraints are presented to mine the duplicate sequences: weak geometric verification is combined with direct hashing to achieve high efficiency and superior performance in image-based duplicate sequence detection, and dynamic programming is introduced to recall missed true positives in the audio-based part. Experiments conducted on a dataset of 500 h of videos with unknown content show that the F-measure of duplicate sequence mining for web videos reaches 95 % and that, in terms of both efficiency and detection performance, the proposed algorithm outperforms state-of-the-art approaches.
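As a rough illustration of the idea of mining duplicate sequences through shot-level direct hashing with a temporal constraint, the following minimal sketch may help. It is not the authors' method: it assumes each shot has already been reduced to a hashable signature, it omits the visual-acoustic features, the weak geometric verification, and the audio dynamic programming, and the names `mine_duplicate_sequences` and `min_len` are hypothetical.

```python
from collections import defaultdict


def mine_duplicate_sequences(signatures, min_len=2):
    """Return sorted (start_a, start_b, length) runs of repeated shot signatures.

    `signatures` is a list of hashable per-shot signatures (an assumption;
    the abstract does not specify how shots are encoded).
    """
    # Direct hashing: map each signature to the shot positions where it occurs.
    index = defaultdict(list)
    for pos, sig in enumerate(signatures):
        index[sig].append(pos)

    n = len(signatures)
    runs = []
    for positions in index.values():
        for i, a in enumerate(positions):
            for b in positions[i + 1:]:
                # Skip pairs that only continue a run starting one shot earlier.
                if a > 0 and signatures[a - 1] == signatures[b - 1]:
                    continue
                # Temporal constraint: consecutive shots must keep matching,
                # and the two occurrences must not overlap.
                length = 0
                while (b + length < n and a + length < b
                       and signatures[a + length] == signatures[b + length]):
                    length += 1
                if length >= min_len:
                    runs.append((a, b, length))
    return sorted(runs)


# Toy stream: the three-shot clip "a b c" appears twice.
shots = ["a", "b", "c", "x", "a", "b", "c", "y"]
repeats = mine_duplicate_sequences(shots)
```

Here `repeats` holds one run starting at shots 0 and 4 with a length of three shots. In a real system the exact-match lookup would presumably be replaced by distortion-tolerant features, which is the problem the abstract says the paper addresses.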
Acknowledgments
This work is sponsored by the collaborative research project (SEV01100474) between Beijing University of Posts and Telecommunications and France Telecom – Orange Lab Beijing, the National High Technology Research and Development Program of China (863 Program, No. 2012AA012505), and the National Natural Science Foundation of China (No. 61372169).
Cite this article
Dong, Y., Wang, L., Lian, S. et al. A novel feature fusion based framework for efficient shot indexing to massive web videos. Telecommun Syst 59, 401–413 (2015). https://doi.org/10.1007/s11235-014-9945-9
DOI: https://doi.org/10.1007/s11235-014-9945-9