Abstract
The Bag of Visual Words (BoVW) is an established representation in computer vision. Taking inspiration from text mining, this representation has proved to be very effective in many domains. However, in most cases, standard term-weighting schemes are adopted (e.g., term-frequency or TF-IDF). Whether alternative weighting schemes could boost the performance of methods based on BoVW remains an open question. More importantly, it is unknown whether effective weighting schemes can be learned automatically from scratch. This paper sheds some light on both of these unknowns. On the one hand, we report an evaluation of the most common weighting schemes used in text mining, but rarely used in computer vision tasks. On the other hand, we propose an evolutionary algorithm capable of automatically learning weighting schemes for computer vision problems. We report empirical results of an extensive study on several computer vision problems. Results show the usefulness of the proposed method.
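As a point of reference for the standard schemes mentioned above, the following minimal sketch computes term-frequency and TF-IDF weights for a toy BoVW count matrix. It rests on assumptions that are not taken from the paper: TF is taken as the count normalized by the image's total number of visual words, IDF as the logarithm of the number of images divided by the number of images containing the word, and the function names and toy counts are hypothetical.

```python
import math


def tf_weights(counts):
    """Term frequency: normalize each image histogram so it sums to 1."""
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([c / total if total else 0.0 for c in row])
    return weighted


def tf_idf_weights(counts):
    """TF-IDF: scale each TF value by log(N / df) of its visual word."""
    n_images = len(counts)
    n_words = len(counts[0])
    # Document frequency: number of images in which each visual word occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    # Words that never occur get an IDF of 0 to avoid division by zero.
    idf = [math.log(n_images / d) if d else 0.0 for d in df]
    return [[tf * idf[j] for j, tf in enumerate(row)]
            for row in tf_weights(counts)]


# Hypothetical toy data: 3 images (rows) and 4 visual words (columns).
counts = [
    [2, 0, 1, 1],
    [0, 3, 1, 0],
    [1, 1, 2, 0],
]

tf = tf_weights(counts)
tfidf = tf_idf_weights(counts)
```

In this toy example the third visual word occurs in every image, so its IDF is zero and TF-IDF suppresses it entirely, whereas the fourth word occurs in a single image and receives the largest IDF.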
Notes
Note that the text mining community has proposed variants that aim to soften such assumptions, e.g., using n-grams [2]; still, the BoW remains very competitive with such formulations.
Please note that traditional weighting schemes have been proposed by researchers based on their own experiences and biases, making strong assumptions and relying on intuition.
Please note that in GP, for each individual, either mutation or crossover is performed each time, but not both. This is different from other variants like genetic algorithms.
Matlab files with the predefined partitions are available upon request.
PHOW is an extension to the raw BoVW formulation that aims at incorporating spatial information by means of a pyramidal structure, see [3] for details.
Please note that estimating the fitness function is quite efficient, as it is based on a fast approximation to a linear SVM, so the method can be used for most computer vision applications. Also, we emphasize that the fitness function is only estimated during the learning process, which has to be done only once and is usually performed offline.
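The note above on variation in GP (each individual undergoes either mutation or crossover, never both) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tuple encoding of individuals, the toy operators, the operator set, and the `p_crossover` parameter are hypothetical and are not the authors' implementation.

```python
import random

# Hypothetical operator set for toy weighting-scheme expressions.
OPS = ["+", "-", "*"]


def toy_crossover(parent_a, parent_b, rng):
    """Toy crossover: keep parent_a's root and left child, take parent_b's right child."""
    return (parent_a[0], parent_a[1], parent_b[2])


def toy_mutate(parent, rng):
    """Toy mutation: replace the root operator with a randomly chosen one."""
    return (rng.choice(OPS), parent[1], parent[2])


def vary(parent_a, parent_b, p_crossover, rng):
    """Apply exactly one operator: crossover with probability p_crossover, else mutation."""
    if rng.random() < p_crossover:
        return toy_crossover(parent_a, parent_b, rng), "crossover"
    return toy_mutate(parent_a, rng), "mutation"


rng = random.Random(0)
a = ("*", "tf", "idf")   # toy individual encoding tf * idf
b = ("+", "tf", "df")    # toy individual encoding tf + df

child_x, op_x = vary(a, b, 1.0, rng)  # p_crossover = 1.0 forces crossover
child_m, op_m = vary(a, b, 0.0, rng)  # p_crossover = 0.0 forces mutation
```

Because a single random draw selects the operator, no child is ever produced by both crossover and mutation in the same step, which is the behavior the note contrasts with typical genetic algorithms.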
References
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley, Boston
Bekkerman R, Allan J (2004) Using bigrams in text categorization. Technical Report, Department of Computer Science. University of Massachusetts, Amherst, vol 1003, pp 1–2
Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: Proceedings of the ICCV
Chang KW, Roth D (2011) Selective block minimization for faster convergence of limited memory large-scale linear models. In: SIGKDD conference on knowledge discovery and data mining. ACM
Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: International workshop on statistical learning in computer vision
Cummins R, O’Riordan C (2006) Evolving local and global weighting schemes in information retrieval. Inf Retr 9:311–330
Debole F, Sebastiani F (2003) Supervised term-weighting for automated text categorization. In: Proceedings of the 2003 ACM symposium on applied computing, SAC ’03. ACM, New York, pp 784–788
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Deselaers T, Pimenidis L, Ney H (2008) Bag of visual words for adult image classification and filtering. In: Proceedings of the international conference on pattern recognition. IEEE
Djuric N, Lan L, Vucetic S, Wang Z (2013) Budgetedsvm: a toolbox for scalable svm approximations. J Mach Learn Res 14:3813–3817
Escalante HJ, Garcia M, Morales A, Graff M, Montes M, Morales EF, Martinez J (2015) Term-weighting learning via genetic programming for text classification. Knowl Based Syst 83:176–189
Escalante HJ, Martinez-Carranza J, Escalera S, Ponce-López V, Baró X (2015) Improving bag of visual words representations with genetic programming. In: Proceedings of the 2015 international joint conference on neural networks. IEEE, pp 3674–3681
Escalante HJ, Montes M, Sucar E (2012) Semantic cohesion for image annotation and retrieval. Comput Sist 10(1):121–126
Escalante HJ, Sucar E, Morales E (2016) A naive bayes baseline for early gesture recognition. Pattern Recogn Lett 73:91–99
Escalera S, Baro X, Gonzalez J, Bautista MA, Madadi M, Reyes M, Ponce V, Escalante HJ, Shotton J, Guyon I (2014) ChaLearn looking at people challenge 2014: dataset and results. In: Proceedings of ECCV—chalearn workshop
Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In: Proceedings of the IEEE, CVPRW
Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1305
García-Limón M, Escalante HJ, Montes y Gómez M, Morales A, Morales E (2014) Towards the automated generation of term-weighting schemes for text categorization. In: Proceedings of GECCO Comp’14, (Late-breaking abstract), pp 1459–1460
Gonzalez-Gurrola LC, Moreno R, Escalante HJ, Martínez F, Carlos R (2015) Learning roadway surface disruption patterns using the bag of words representation. IEEE transactions on intelligent transportation systems (under review)
Grauman K, Leibe B (2010) Visual object recognition. Morgan and Claypool, San Rafael
Guyon I, Athitsos V, Jangyodsuk P, Escalante HJ (2014) The Chalearn gesture dataset (CGD 2011). Mach Vis Appl 25(8):1929–1951
Hernández-Vela A, Bautista MA, Perez-Sala X, Ponce-López V, Escalera S, Baró X, Pujol O, Angulo C (2014) Probability-based dynamic time warping and bag-of-visual-and-depth-words for human gesture recognition in rgb-d. Pattern Recognit Lett 50(1):112–121
Hoai M, De la Torre F (2012) Max-margin early event detectors. In: IEEE conference on computer vision and pattern recognition. IEEE, Providence, RI, pp 2863–2870
Hoai M, Lan Z, De la Torre F (2011) Joint segmentation and classification of human actions in video. In: IEEE conference on computer vision and pattern recognition. IEEE, Providence, RI, pp 3265–3272
Huang D, Yao S, Wang Y, De La Torre F (2014) Sequential max-margin event detectors. In: European conference on computer vision
Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term-weighting methods for automatic text categorization. Trans PAMI 31(4):721–735
Langdon WB, Poli R (2001) Foundations of genetic programming. Springer, Berlin
Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123
Lazebnik S, Schmid C, Ponce J (2004) Semi-local affine parts for object recognition. In: British machine vision conference, pp 779–788
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the computer vision and image processing conference, IEEE, pp 2169–2178
Lazebnik S, Schmid C, Ponce JA (2015) Maximum entropy framework for part-based texture and object recognition. In: IEEE international conference on computer vision, pp 832–838
Lopez-Monroy AP, Montes y Gomez M, Escalante HJ, Cruz-Roa A, Gonzalez FA (2015) Improving the bovw with discriminative n-grams and mkl. Neurocomputing 175:768–781
Luke S, Panait L (2002) Lexicographic parsimony pressure. In: Proceedings of the 2002 genetic and evolutionary computation conference, pp 829–836
Manchala S, Prasad VK, Janaki V (2014) Gmm based language identification system using robust features. Int J Speech Technol 17:99–105
Mirza-Mohammadi M, Escalera S, Radeva P (2009) Contextual-guided bag-of-visual-words model for multi-class object categorization. In: Proceedings of the CAIP. Springer, pp 748–756
Neverova N, Wolf C, Taylor GW, Nebout F (2014) Multi-scale deep learning for gesture detection and localization. In: Proceedings of the ECCV chalearn workshop on looking at people
Saffari A, Guyon I (2006) Quick start guide for clop. Technical report, TU Graz—CLOPINET
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24:513–523
Sebastiani F (2008) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47
Sidorov G, Gelbukh A, Gomez-Adorno H, Pinto D (2014) Soft similarity and soft cosine measure: similarity of features in vector space model. Comput Sist 18(3):491–504
Silva S, Almeida J (2003) Gplab-a genetic programming toolbox for matlab. In: Proceedings of the Nordic MATLAB conference, pp 273–278
Sivic J, Zisserman A (2003) Video google: a text retrieval approach to object matching in videos. Int Conf Comput Vis 2:1470–1477
Tirilly P, Claveau V, Gros P (2009) A review of weighting schemes for bag of visual words image retrieval. Technical report, IRISA
Turney P, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37:141–188
Vedaldi A, Fulkerson B (2010) VLFeat: an open and portable library of computer vision algorithms. In: Proceedings of the 18th ACM international conference on multimedia. ACM, pp 1469–1472
Wang J, Liu P, She FH, Nahavandi M, Kouzani A (2013) Bag-of-words representation for biomedical time series classification. Biomed Signal Process Control 8(6):634–644
Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: IEEE conference on computer vision and pattern recognition. IEEE, Providence, RI, pp 1290–1297
Xia L, Aggarwal JK (2013) Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: IEEE conference on computer vision and pattern recognition. IEEE, Portland, OR, pp 2834–2841
Yoo SJ (2004) Intelligent multimedia information retrieval for identifying and rating adult images. In: Proceedings of the international conference KES, vol 3213 of LNAI, pp 164–170. Springer
Zhang J, Marszałek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2):213–238
Zhang K, Lan L, Wang Z, Moerchen F (2012) Scaling up kernel svm on limited resources: A low-rank linearization approach. In: Proceedings of AISTATS 2012
Acknowledgments
This work was supported by CONACyT under Project Grant No. CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas de minería de textos) and Spanish Ministry of Economy and Competitiveness TIN2013-43478-P. Víctor Ponce-López is supported by Fellowship No. 2013FI-B01037 and Project TIN2012-38187-C03-02.
Additional information
This paper is an extended and improved version of [12] and it is being submitted to the Special Issue on Computational Intelligence for Vision and Robotics of the Neural Computing and Applications Journal.
Cite this article
Escalante, H.J., Ponce-López, V., Escalera, S. et al. Evolving weighting schemes for the Bag of Visual Words. Neural Comput & Applic 28, 925–939 (2017). https://doi.org/10.1007/s00521-016-2223-x
DOI: https://doi.org/10.1007/s00521-016-2223-x