Abstract
Traditional multimedia (video) retrieval systems rely on a keyword-based approach to keep the search process fast, although this approach has several shortcomings related to the way users can formulate their information needs. Typical Web multimedia retrieval systems illustrate this paradigm: the result of a search is a collection of thousands of multimedia documents, many of which are irrelevant to, or not fully exploited by, the typical user. Indeed, according to studies of user behavior, an individual is mostly interested in the first documents returned during a search session. A multimedia retrieval system should therefore model the multimedia content as precisely as possible, so that the first retrieved images are fully relevant to the user’s information need. For this purpose, the keyword-based approach proves clearly insufficient, and a high-level index and query language that combines modalities within expressive frameworks for video indexing and retrieval is essential to achieving significant retrieval performance. This paper presents a multi-faceted conceptual framework that integrates multiple characterizations of the visual and audio contents for automatic video retrieval. It relies on an expressive representation formalism handling high-level video descriptions and on a full-text query framework, in an attempt to take video indexing and retrieval beyond trivial low-level processes, keyword-annotation frameworks and state-of-the-art architectures that loosely couple visual and audio descriptions. Experiments on the multimedia topic search task of the TRECVID evaluation campaign validate our proposal.
Cite this article
Belkhatir, M. CLOVIS: towards precision-oriented text-based video retrieval through the unification of automatically-extracted concepts and relations of the visual and audio/speech contents. J Intell Inf Syst 34, 135–175 (2010). https://doi.org/10.1007/s10844-009-0083-x
DOI: https://doi.org/10.1007/s10844-009-0083-x