CLOVIS: towards precision-oriented text-based video retrieval through the unification of automatically-extracted concepts and relations of the visual and audio/speech contents

M. Belkhatir¹

Abstract

Traditional multimedia (video) retrieval systems rely on the keyword-based approach to keep the search process fast, but this approach has several shortcomings and limitations in how users can formulate their information needs. Typical Web multimedia retrieval systems illustrate this paradigm: a search returns a collection of thousands of multimedia documents, many of which are irrelevant or not fully exploited by the typical user. Indeed, studies of users’ behavior show that an individual is mostly interested in the first documents returned during a search session. A multimedia retrieval system must therefore model the multimedia content as precisely as possible, so that the first retrieved documents are fully relevant to the user’s information need. For this purpose, the keyword-based approach is clearly insufficient. A high-level index and query language that combines modalities within expressive frameworks for video indexing and retrieval is of great importance and is the only solution for achieving significant retrieval performance. This paper presents a multi-faceted conceptual framework integrating multiple characterizations of the visual and audio contents for automatic video retrieval. It relies on an expressive representation formalism handling high-level video descriptions and on a full-text query framework, in an attempt to take video indexing and retrieval beyond trivial low-level processes, keyword-annotation frameworks, and state-of-the-art architectures that loosely couple visual and audio descriptions. Experiments on the multimedia topic search task of the TRECVID evaluation campaign validate our proposal.

Author information

Authors and Affiliations

Center for Multimedia Computing, Communications and Applications Research, Monash University, Sunway Campus, Sunway, Malaysia
M. Belkhatir

Corresponding author

Correspondence to M. Belkhatir.

About this article

Cite this article

Belkhatir, M. CLOVIS: towards precision-oriented text-based video retrieval through the unification of automatically-extracted concepts and relations of the visual and audio/speech contents. J Intell Inf Syst 34, 135–175 (2010). https://doi.org/10.1007/s10844-009-0083-x

Received: 16 January 2009
Revised: 28 February 2009
Accepted: 01 March 2009
Published: 04 April 2009
Issue Date: April 2010
DOI: https://doi.org/10.1007/s10844-009-0083-x
