A Hamming Embedding Kernel with Informative Bag-of-Visual Words for Video Semantic Indexing

Published: 17 April 2014

Abstract

In this article, we propose a novel Hamming embedding kernel with informative bag-of-visual words to address two main problems of traditional bag-of-visual-words (BoW) approaches to video semantic indexing. First, Hamming embedding is employed to alleviate the information loss caused by SIFT quantization: the Hamming distances between keypoints that fall in the same cell are calculated and integrated into the SVM kernel to better discriminate between image samples. Second, to highlight concept-specific visual information, we propose to weight the visual words according to their informativeness for detecting specific concepts. We show that our proposed kernels can significantly improve the performance of concept detection.
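The abstract does not give the kernel's formula, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes each keypoint carries a visual-word index (its quantization cell) and a binary signature, counts keypoint pairs in the same cell whose signatures lie within a Hamming threshold, and scales each cell's contribution by a per-word weight. The function name `he_similarity`, the `threshold` parameter, and the `word_weights` vector are hypothetical, and this similarity is not guaranteed to be positive semidefinite.

```python
import numpy as np


def he_similarity(words_x, codes_x, words_y, codes_y,
                  n_words, threshold, word_weights=None):
    """Hamming-embedding-style similarity between two images (illustrative).

    words_*: 1-D int arrays, visual-word index of each keypoint.
    codes_*: 2-D 0/1 arrays, one binary signature per keypoint.
    word_weights: optional per-word informativeness weights (assumed given).
    """
    if word_weights is None:
        word_weights = np.ones(n_words)
    score = 0.0
    for w in range(n_words):
        cx = codes_x[words_x == w]  # signatures of image X in cell w
        cy = codes_y[words_y == w]  # signatures of image Y in cell w
        if len(cx) == 0 or len(cy) == 0:
            continue
        # Pairwise Hamming distances between all signatures in this cell.
        dist = (cx[:, None, :] != cy[None, :, :]).sum(axis=2)
        # Count close pairs and scale by the word's weight.
        score += word_weights[w] * np.count_nonzero(dist <= threshold)
    return float(score)


# Toy data: image X has three keypoints, image Y has two.
words_x = np.array([0, 0, 1])
codes_x = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]])
words_y = np.array([0, 1])
codes_y = np.array([[0, 0, 0, 1], [0, 1, 0, 1]])

unweighted = he_similarity(words_x, codes_x, words_y, codes_y,
                           n_words=2, threshold=1)
weighted = he_similarity(words_x, codes_x, words_y, codes_y,
                         n_words=2, threshold=1,
                         word_weights=np.array([1.0, 3.0]))
print(unweighted, weighted)  # 2.0 4.0
```

A matrix of such pairwise scores could in principle be passed to an SVM that accepts precomputed kernels, but how the paper integrates the distances and learns the weights is not stated on this page.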

Cited By

  • (2023) Deep Learning for Instance Retrieval: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45:6 (7270-7292). DOI: 10.1109/TPAMI.2022.3218591. Online publication date: 1-Jun-2023
  • (2016) Recursive ground truth estimator for social data streams. Proceedings of the 15th International Conference on Information Processing in Sensor Networks (1-12). DOI: 10.5555/2959355.2959369. Online publication date: 11-Apr-2016
  • (2016) Palmprint Recognition via Sparse Coding Spatial Pyramid Matching Representation of SIFT Feature. Biometric Recognition (235-243). DOI: 10.1007/978-3-319-46654-5_26. Online publication date: 21-Sep-2016
  • (2015) Kernelizing Spatially Consistent Visual Matches for Fine-Grained Classification. Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (155-162). DOI: 10.1145/2671188.2749328. Online publication date: 22-Jun-2015


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 10, Issue 3
    April 2014
    140 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/2602979
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 April 2014
    Accepted: 01 October 2013
    Revised: 01 August 2013
    Received: 01 April 2013
    Published in TOMM Volume 10, Issue 3

    Author Tags

    1. Bag-of-visual word
    2. Hamming embedding
    3. kernel optimization
    4. video semantic indexing

    Qualifiers

    • Research-article
    • Research
    • Refereed
