Features for Image Retrieval: An Experimental Comparison
Thomas Deselaers (1), Daniel Keysers (2), and Hermann Ney (1)

(1) Human Language Technology and Pattern Recognition,
Computer Science Department, RWTH Aachen University, Germany
{deselaers,ney}@cs.rwth-aachen.de

(2) Image Understanding and Pattern Recognition,
German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
daniel.keysers@dfki.de
November 29, 2007
Abstract

An experimental comparison of a large number of different image descriptors for content-based image retrieval is presented. Many of the papers describing new techniques and descriptors for content-based image retrieval describe their newly proposed methods as most appropriate without giving an in-depth comparison with all methods that were proposed earlier. In this paper, we first give an overview of a large variety of features for content-based image retrieval and compare them quantitatively on four different tasks: stock photo retrieval, personal photo collection retrieval, building retrieval, and medical image retrieval. For the experiments, five different, publicly available image databases are used and the retrieval performance of the features is analysed in detail. This allows for a direct comparison of all features considered in this work and furthermore will allow a comparison of newly proposed features to these in the future. Additionally, the correlation of the features is analysed, which opens the way for a simple and intuitive method to find an initial set of suitable features for a new task. The article concludes with recommendations on which features perform well for which types of data. Interestingly, the often used, but very simple, colour histogram performs well in the comparison and can thus be recommended as a simple baseline for many applications.

1 Introduction

Image retrieval in general and content-based image retrieval (CBIR) in particular are well-known fields of research in information management in which a large number of methods have been proposed and investigated, but in which no satisfying general solutions exist yet. The need for adequate solutions is growing due to the increasing amount of digitally produced images in areas like journalism, medicine, and private life, requiring new ways of accessing images. For example, medical doctors have to access large amounts of images daily [1], home-users often have image databases of thousands of images [2], and journalists also need to search for images by various criteria [3, 4]. In the past, several CBIR systems have been proposed, and all these systems have one thing in common: images are represented by numeric values, called features or descriptors, that are meant to capture the properties of the images so as to allow meaningful retrieval for the user.

Only recently have some standard benchmark databases and evaluation campaigns been created which allow for a quantitative comparison of CBIR systems. These benchmarks allow for the comparison of image retrieval systems under different aspects: usability and user interfaces, combination with text retrieval, or overall performance of a system. However, to our knowledge, no quantitative comparison of the building blocks of the systems, the features that are used to compare images, has been presented so far. In [5] a method for comparing image retrieval systems was proposed that relies on the Corel database, which has restricted copyrights and is no longer commercially available today, and can therefore not be used for experiments that are meant to be a basis for other comparisons.

Another aspect of evaluating CBIR systems is the requirements of the users. In [3] and [4], studies of user needs in searching image archives are presented, and the outcome of both studies is that CBIR alone is very unlikely to fulfill these needs, but that semantic information obtained from metadata and textual information is an important additional knowledge source. Although the semantic analysis and understanding of images is today much further developed due to recent achievements in object detection and recognition, most of the specified requirements still cannot be satisfied fully automatically. Therefore, in this paper we compare the performance of a large variety of visual descriptors. These can then later be combined with the outcome of textual information retrieval as described e.g. in [6].
The main question we address in this paper is: Which features are suitable for which task in image retrieval? This
question is thoroughly investigated by examining the performance of a wide variety of different visual descriptors for four
different types of CBIR tasks.
The question of how well individual features perform is closely related to the question of which features can be combined to obtain good results for a particular task. Although we do not directly address this question here, the results from this paper lead to a new and intuitive method to choose an appropriate combination of features based on the correlation of the individual features.
For the evaluation of the features we use five different publicly available databases which are a good starting point to
evaluate the performance of new image descriptors.
Although various initiatives for the evaluation of CBIR systems have evolved, only a few of them have resulted in evaluation campaigns with participants and results: Benchathlon 1 was started in 2001 and located at the SPIE Electronic Imaging conference but has become smaller over time. TRECVID 2 is an initiative by TREC (the Text Retrieval Conference) on video retrieval in which video retrieval systems are compared. ImageCLEF 3 is part of the Cross-Language Evaluation Framework (CLEF) and started in 2003 with only one task, aiming at a combination of multi-lingual information retrieval with CBIR. In 2004, it comprised three tasks, one of them focused on visual queries; in 2005 and 2006 there were four tasks, one and two of them purely visual, respectively. We can observe that evaluation in the field of CBIR is at a far earlier stage than it is in textual information retrieval (e.g. the Text REtrieval Conference, TREC) or in speech recognition (e.g. the Hub4-DARPA evaluation). One reason for this is likely the smaller commercial impact that (content-based) image retrieval has had in the past. However, with the increasing amount of visual data available in various forms, this is likely to change in the future.
The main contributions of this paper are answers to the questions above, namely

• an extensive overview of features proposed for CBIR, including features that were proposed in the early days of CBIR and techniques that were proposed only recently in the object recognition and image understanding literature, as well as a subset of features from the MPEG7 standard,

• a quantitative analysis of the performance of these features for various CBIR tasks (in particular: stock photo retrieval, personal photo retrieval, building/touristic image retrieval, and medical image retrieval),

• pointing out a set of five databases from four different domains that can be used for benchmarking CBIR systems.

Note that we do not focus on the combination of features nor on the use of user feedback for content-based image retrieval in this paper; several other authors propose and evaluate approaches to these important issues [7, 8, 9, 10, 11]. Instead, we mainly investigate the performance of single features for different tasks.

1 http://www.benchathlon.net/
2 http://www-nlpir.nist.gov/projects/trecvid/
3 http://www.imageclef.org

1.1 State of the Art in Content-based Image Retrieval

This section gives an overview of the literature on CBIR. We mainly focus on different descriptors and image representations. More general overviews of CBIR are given in [12, 13, 14]. Two recent reviews of CBIR techniques are given in [15, 16].

In CBIR there are, roughly speaking, two main approaches: a discrete approach and a continuous approach [17]. (1) The discrete approach is inspired by textual information retrieval and uses techniques like inverted files and text retrieval metrics. This approach requires all features to be mapped to binary features; the presence of a certain image feature is treated like the presence of a word in a text document. (2) The continuous approach is similar to nearest neighbor classification. Each image is represented by a feature vector and these features are compared using various distance measures. The images with the lowest distances are ranked highest in the retrieval process. A first, though not exhaustive, comparison of these two models is presented in [17].

Among the first systems available were the QBIC system from IBM [18] and the Photobook system from MIT [19]. QBIC uses color histograms, a moment-based shape feature, and a texture descriptor. Photobook uses appearance features, texture features, and 2D shape features. Another well-known system is Blobworld [20], developed at UC Berkeley. In Blobworld, images are represented by regions that are found in an Expectation-Maximization-like (EM) segmentation process. In these systems, images are retrieved in a nearest-neighbor-like manner, following the continuous approach to CBIR. Other systems following this approach include SIMBA [21], CIRES [22], SIMPLIcity [23], IRMA [24], and our own system FIRE [25, 26]. The Moving Picture Experts Group (MPEG) defines a standard for content-based access to multimedia data in their MPEG-7 standard. In this standard, a set of descriptors for images is defined. A reference implementation of these descriptors is given in the XM Software4. A system that uses MPEG-7 features in combination with semantic web ontologies is presented in [27].

4 http://www.lis.ei.tum.de/research/bv/topics/mmdb/e mpeg7.html

In [28] a method that starts from low-level features and creates a semantic representation of the images is presented, and in [29] an approach to consistently fuse the efforts in various fields of multimedia information retrieval is presented.
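To make the distinction between the two retrieval paradigms above concrete, the following sketch contrasts a minimal inverted-file ("discrete") retrieval with distance-based ("continuous") retrieval. All function names and the toy data are invented for illustration and are not taken from any of the systems cited here.

```python
from collections import defaultdict

# Discrete approach: binarized features act like words in an inverted file.
def build_inverted_file(binary_features):
    """binary_features: {image_id: set of feature ids present in the image}."""
    index = defaultdict(set)
    for img, feats in binary_features.items():
        for f in feats:
            index[f].add(img)
    return index

def discrete_retrieve(index, query_feats):
    """Score images by the number of 'visual words' shared with the query."""
    scores = defaultdict(int)
    for f in query_feats:
        for img in index.get(f, ()):
            scores[img] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Continuous approach: nearest-neighbor search on real-valued feature vectors.
def continuous_retrieve(vectors, query, dist):
    """vectors: {image_id: feature vector}; lowest distance ranks first."""
    return sorted(vectors, key=lambda img: dist(vectors[img], query))

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

index = build_inverted_file({"a": {1, 2}, "b": {2, 3}})
print(discrete_retrieve(index, {2, 3}))  # → ['b', 'a']
```

In the discrete sketch, ties in the shared-word count would need an explicit tie-breaking rule; real systems additionally weight the words, e.g. with tf-idf-style schemes.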
In [30] the VIPER system is presented, which follows the discrete approach. VIPER is now publicly available as the GNU Image Finding Tool (GIFT) and several enhancements have been implemented over the last years. An advantage of the discrete approach is that methods from textual information retrieval, e.g. user interaction and storage handling, can easily be transferred. Nonetheless, most image retrieval systems follow the continuous approach, e.g. [18, 19, 20, 21], often using some optimization, for example pre-filtering and pre-classification [12, 23, 31], to achieve better runtime performance.

Recently, local image descriptors have been getting more attention within the computer vision community. The underlying idea is that objects in images consist of parts that can be modelled with varying degrees of independence. These approaches are successfully used for object recognition and detection [35, 36, 37, 38, 39, 40] and CBIR [26, 41, 42, 43]. For the representation of local image parts, SIFT features [44] and raw image patches are commonly used, and a bag-of-features approach, similar to the bag-of-words approach in natural language processing, is commonly taken. The features described in Section 3.7 also follow this approach and are strongly related to the modern approaches in object recognition. In contrast to the methods described above, the image is not modelled as a whole but rather image parts are modelled individually. Most approaches found in the literature on part-based object recognition learn (often complicated) models from a large set of training data. This approach is impractical for CBIR applications since it would require an enormous amount of training data on the one hand and would lead to tremendous computing times to create these models on the other hand. However, some of these approaches are applicable for limited-domain retrieval, e.g. on the IRMA database (cf. Section 5.3) [45].

We can clearly observe that many different image description features have been developed. However, only few works have quantitatively compared different features. Interesting insights can also be gained from the outcomes of the ImageCLEF image retrieval evaluations [32, 33], in which different systems are compared on the same task. The comparison is not easy because all groups use different retrieval systems, and text-based information retrieval is an important part of these evaluations. Due to the lack of standard tasks, in many papers on image retrieval new benchmark sets are defined to allow for a quantitative comparison of the proposed methods to a baseline system. A problem with this approach is that it is simple to create a benchmark for which one can show improved results [34].

Overview. The remainder of this paper is structured as follows. The next section describes the retrieval metric used to rank images given a feature and a distance measure, and the performance measures used to compare different settings. Section 3 gives an overview of 19 different image descriptors and distance measures which are used for the experiments. Section 4 presents a method to analyse the correlation of different image descriptor/distance combinations. In Section 5, five different benchmark databases are described that are used for the experiments presented in Section 6. The experimental section is subdivided into three parts: Section 6.1 directly compares the performance of the different methods for the different tasks, Section 6.2 describes the results of the correlation analysis, and Section 6.3 analyses the connection between the error rate and the mean average precision. The paper concludes with answers to the questions posed above.

2 Retrieval Metric

The CBIR framework used to conduct the experiments described here follows the continuous approach: images are represented by vectors that are compared using distance measures. For the experiments we use our CBIR system FIRE5. FIRE was designed as a research system with extensibility and flexibility in mind. For the evaluation of features, only one feature and one query image is used at a time, as described in the following.

5 freely available under the terms of the GNU General Public License at http://www-i6.informatik.rwth-aachen.de/∼deselaers/fire.html

Retrieval Metric. Let the database {x1, . . . , xn, . . . , xN} be a set of images represented by features. To retrieve images similar to a query image q, each database image xn is compared with the query image using an appropriate distance function d(q, xn). Then, the database images are sorted according to the distances such that d(q, xni) ≤ d(q, xni+1) holds for each pair of images xni and xni+1 in the sequence (xn1, . . . , xni, . . . , xnN). If a combination of different features is used, the distances are normalized to be in the same value range and a linear combination of the distances is then used to create the ranking.

To evaluate CBIR, several performance evaluation measures have been proposed [46] based on the precision P and the recall R:

    P = (number of relevant images retrieved) / (total number of images retrieved)

    R = (number of relevant images retrieved) / (total number of relevant images)

Precision and recall values are usually represented in a precision-recall graph R → P(R) summarizing the (R, P(R)) pairs for varying numbers of retrieved images. The most common way to summarize this graph in one value is the mean average precision, which is also used e.g. in the TREC and CLEF evaluations. The average precision AP for a single query q is the mean over the precision scores after each retrieved relevant item:

    AP(q) = (1/NR) · Σ_{n=1}^{NR} Pq(Rn),

where Rn is the recall after the nth relevant image was retrieved and NR is the total number of relevant documents for the query. The mean average precision MAP is the mean of the average precision scores over all queries:

    MAP = (1/|Q|) · Σ_{q∈Q} AP(q),

where Q is the set of queries q.
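The ranking and the evaluation measures defined in this section can be sketched as follows. The function names are illustrative and not taken from FIRE, and the sketch assumes the full database is ranked for every query.

```python
# Continuous-approach ranking plus AP/MAP as defined above.

def rank_database(query_vec, database, distance):
    """Return database indices sorted by ascending distance to the query."""
    return sorted(range(len(database)),
                  key=lambda n: distance(query_vec, database[n]))

def average_precision(ranking, relevant):
    """AP(q): mean of the precision values after each retrieved relevant item."""
    hits, precisions = 0, []
    for i, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            precisions.append(hits / i)  # precision after this relevant item
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(queries, database, relevance, distance):
    """MAP: mean of AP(q) over all queries (relevance: query index -> set)."""
    aps = [average_precision(rank_database(q, database, distance), relevance[qi])
           for qi, q in enumerate(queries)]
    return sum(aps) / len(aps)

# Toy example with 2-D features and the Euclidean distance:
euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
db = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
ranking = rank_database((0.1, 0.0), db, euclid)  # → [0, 1, 2]
```

Note that dividing by the total number of relevant images NR, as in the definition above, implicitly assigns precision zero to relevant images that are never retrieved.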
An advantage of the mean average precision is that it covers both precision- and recall-oriented aspects and is sensitive to the entire ranking.

We also report the classification error rate ER for all experiments. To do so, we consider only the most similar image according to the applied distance function. We consider a query image to be classified correctly if the first retrieved image is relevant; otherwise the query is misclassified:

    ER = (1/|Q|) · Σ_{q∈Q} { 0 if the most similar image is relevant/from the correct class,
                             1 otherwise }

This is particularly interesting if the database for retrieval consists of images labelled with classes, which is the case for some of the databases considered in this paper. For databases without defined classes but with selected query images and corresponding relevant images, the classes to be distinguished are "relevant" and "irrelevant" only.

This is in accordance with precision at document X being used as an additional performance measure in many information retrieval evaluations. The ER used here is equal to 1 − P(1), where P(1) is the precision after one document retrieved. In [47] it was experimentally shown that the error rate and P(50), the precision after 50 documents, are correlated with a coefficient of 0.96, and thus they essentially describe the same property. The precision-oriented evaluation is interesting because most search engines, both for images and text, return between 10 and 50 results for a query.

Using the ER, the image retrieval system can be viewed as a nearest neighbor classifier using the same features and the same distance function as the image retrieval system. The decision rule of this classifier can be written in the form

    q → r(q) = argmin_{k=1,...,K} { min_{n=1,...,Nk} d(q, x_nk) }.

The query image q is predicted to be from the same class as the database image that has the smallest distance to it. Here, x_nk denotes the nth image of class k.

3 Features for CBIR

In this section we give an overview of the features tested, with the intention to include as many features as possible. Obviously we cannot cover all features that have been proposed in the literature. For example, we have left out the Blobworld features [20] because comparing images based on these features requires user interaction to select the relevant regions in the query image. Furthermore, a variety of texture representations have not been included and we have not investigated different color spaces.

However, we have tried to make the selection of features as representative and as close to the state of the art as possible. Roughly speaking, the features can be grouped into the following types: (a) color representation, (b) texture representation, (c) local features, and (d) shape representation6. The features that are presented in the following are grouped according to these four categories in Table 1. Table 1 also gives timing information on feature extraction and retrieval time for a database consisting of 10 images7.

The distance function used to compare the features representing an image obviously also has a big influence on the performance of the system. Therefore, we state the distance function used for each feature in the corresponding section. We have chosen distance functions that are known to work well for the features used, as a discussion of their influence is beyond the scope of this paper. Different comparison measures for histograms are presented e.g. in [49, 50], and dissimilarity metrics for direct image comparison are presented in [51].

3.1 Appearance-based Image Features

The most straightforward approach is to directly use the pixel values of the images as features: the images are scaled to a common size and compared using the Euclidean distance. In this work, we have used a 32 × 32 down-sampled representation of the images and these have been compared using the Euclidean distance. It has been observed that for classifica-

6 Note that no features that fully cover the shapes in the images are included, since this would require an algorithm segmenting the images into meaningful regions; as fully-automatic segmentation of general images is an unsolved problem, it is not covered here. The features that we mark as representing shape only represent shape in a local (for the SIFT features) or very rough global context (for appearance-based image features). There are, however, overview papers on the shape features defined in MPEG7 which use databases consisting of segmented images for benchmarks [48].

7 These experiments have been carried out on a 1.8GHz machine with our standard C++ implementation of the software. The SIFT feature extraction was done with the software from Gyuri Dorko (http://lear.inrialpes.fr/people/dorko/downloads.html), the MPEG7 experiments were performed with the MPEG7 XM reference implementation (http://www.lis.ei.tum.de/research/bv/topics/mmdb/mpeg7.html), and the downscaling of images was performed using the ImageMagick library (http://www.imagemagick.org/). The timings include the time to load all data and initialize the system.
Table 1: Grouping of the features into different types: (a) color representation, (b) texture representation, (c) local features, (d) shape representation. The table also gives the time to extract the features from 10 images and to query 10 images in a 10-image database to give an impression of the computational costs of the different features (experiments were performed on a 1.8GHz machine).

Feature name                     Section  comp. measure   type  extr.[s]    retr.[s]
Appearance-based Image Features
  32×32 image                    3.1      Euclidean       abcd  0.25        0.19
  X×32 image                     3.1      IDM             abcd  0.25        9.72
Color Histograms                 3.2      JSD             a     0.77        0.16
Tamura Features                  3.3      JSD             b     14.24       0.13
Global Texture Descriptor        3.4      Euclidean       b     3.51        0.16
Gabor histogram                  3.5      JSD             b     8.01        0.12
Gabor vector                     3.5      Euclidean       b     8.68        0.17
Invariant Feature Histograms
  w. monomial kernel             3.6      JSD             ab    28.93       0.16
  w. relational kernel           3.6      JSD             ab    18.23       0.14
LF Patches
  global search                  3.7                      ac    4.69        7.13
  histograms                     3.7      JSD             ac    4.69+5.17   0.27
  signatures                     3.7      EMD             ac    4.69+3.37   0.55
LF SIFT
  global search                  3.7                      cd    11.91       0.48
  histograms                     3.7      JSD             cd    11.91+6.23  0.20
  signatures                     3.7      EMD             cd    11.91+4.50  0.16
MPEG 7: scalable color           3.8.1    MPEG7-internal  a     9.23        0.42
MPEG 7: color layout             3.8.2    MPEG7-internal  ad    0.27        0.33
MPEG 7: edge histogram           3.8.3    MPEG7-internal  b     1.03        0.43
tion and retrieval of medical radiographs, this method serves as a reasonable baseline [51].

In [51] different methods were proposed to directly compare images while accounting for local deformations. The proposed image distortion model (IDM) is shown to be a very effective means of comparing images with reasonable computing time. IDM clearly outperforms the Euclidean distance for optical character recognition and medical radiographs. The image distortion model is a non-linear deformation model; it was also successfully used to compare general photographs [52] and for sign language and gesture recognition [53]. In this work it is used as a second comparison measure to compare images directly. For this purpose, the images are scaled to a common width of 32 pixels while keeping the aspect ratio constant, i.e. the images may be of different heights.
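As a rough illustration of the two direct comparison measures used in this section, the following sketch implements the Euclidean distance on grayscale thumbnails and a strongly simplified variant of the IDM that only allows local pixel displacements. The parameter names are invented for this sketch, and the published IDM additionally uses local context and gradient information not shown here.

```python
# Direct appearance-based comparison: Euclidean vs. a simplified IDM.

def euclidean(a, b):
    """Squared Euclidean distance between two equally sized grayscale images."""
    return sum((pa - pb) ** 2
               for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def idm_distance(query, ref, warp=1):
    """For each query pixel, use the best-matching ref pixel within +/-warp."""
    h, w = len(query), len(query[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            candidates = [(query[i][j] - ref[y][x]) ** 2
                          for y in range(max(0, i - warp), min(h, i + warp + 1))
                          for x in range(max(0, j - warp), min(w, j + warp + 1))]
            total += min(candidates)  # best local match absorbs small shifts
    return total

# A shifted edge is penalized by the Euclidean distance but tolerated by IDM:
a = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
b = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]  # same vertical edge, shifted one pixel
print(euclidean(a, b), idm_distance(a, b))  # → 6 0.0
```

The local minimum over displacements is exactly what makes the model robust to small deformations, at the cost of the nested search that explains the much higher retrieval time of the IDM in Table 1.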
3.2 Color Histograms

Color histograms are among the most basic approaches and widely used in image retrieval [12, 18, 52, 49, 54]. To show performance improvements in image retrieval systems, systems using only color histograms are often used as a baseline. The color space is partitioned and for each partition the pixels with a color within its range are counted, resulting in a representation of the relative frequencies of the occurring colors. We use the RGB color space for the histograms; we observed only minor differences with other color spaces, which was also observed in [55]. In accordance with [49], we use the Jeffrey divergence or Jensen-Shannon divergence (JSD) to compare histograms:

    dJSD(H, H') = Σ_{m=1}^{M} [ Hm log( 2Hm / (Hm + H'm) ) + H'm log( 2H'm / (H'm + Hm) ) ],

where H and H' are the histograms to be compared and Hm is the mth bin of H.

3.3 Tamura Features

In [56] the authors propose six texture features corresponding to human visual perception: coarseness, contrast, directionality, line-likeness, regularity, and roughness. From experiments testing the significance of these features with respect to human perception, it was concluded that the first three features are very important. Thus, in our experiments we use coarseness, contrast, and directionality to create a histogram describing the texture [52] and compare these histograms using the Jeffrey divergence [49]. In the QBIC system [18] histograms of these features are used as well.

3.4 Global Texture Descriptor

In [52] a texture feature consisting of several parts is described: Fractal dimension measures the roughness of a surface; it is calculated using the reticular cell counting method [57]. Coarseness characterizes the grain size of an image and is calculated depending on the variance of the image. Entropy of pixel values is used as a measure of disorderedness in an image. The spatial gray-level difference statistics describe the brightness relationship of pixels within neighborhoods; this is also known as co-occurrence matrix analysis [58]. The circular Moran autocorrelation function measures the roughness of the texture; for its calculation a set of autocorrelation functions is used [59]. From these, we obtain a 43-dimensional vector consisting of one value for the fractal dimension, one value for the coarseness, one value for the entropy, 32 values for the difference statistics, and 8 values for the circular Moran autocorrelation function. This descriptor has been successfully used for medical images in [24].

3.5 Gabor Features

Gabor features have been widely used for texture analysis [31, 30]. Here we use two different descriptors derived from Gabor features:

• Mean and standard deviation: Gabor features are extracted at different scales and directions from the images and the mean and standard deviation of the filter responses are calculated. We extract Gabor features in five different orientations and five different scales, leading to a 50-dimensional vector.

• A bank of 12 different circularly symmetric Gabor filters is applied to the image, the energy for each filter in the bank is quantized into 10 bands, and a histogram of the mean filter outputs over image regions is computed to give a global measure of the texture characteristics of the image [30]. These histograms are compared using the JSD.

3.6 Invariant Feature Histograms

A feature is called invariant with respect to certain transformations if it does not change when these transformations are applied to the image. The transformations considered here are translation, rotation, and scaling. In this work, invariant feature histograms as presented in [60] are used. These features are based on the idea of constructing invariant features by integration, i.e. a certain feature function is integrated over the set of all considered transformations. The feature functions we have considered are monomial and relational functions [21] over the pixel intensities. Instead of summing over translation and rotation, we only sum over rotation and create a histogram over translation. This histogram is still invariant with respect to rotation and translation. The resulting histograms are compared using the JSD. Previous experiments have shown that the characteristics of invariant feature histograms and color histograms are very similar and that invariant feature histograms can sometimes outperform color histograms [26].

3.7 Local Image Descriptors

Image patches, i.e. small subimages of images, or features derived thereof are currently a very promising approach for object recognition, e.g. [40, 61, 62]. Obviously, object recognition and CBIR are closely related fields [63, 64], and for some clearly defined retrieval tasks, object recognition methods might actually be the only possible solution: e.g. when looking for all images showing a certain person, a face detection and recognition system would clearly deliver the best results [19, 65].

We consider two different types of local image descriptors or local features (LF): a) patches that are extracted from the images at salient points and dimensionality-reduced using a PCA transformation [40], and b) SIFT descriptors [44] extracted at Harris interest points [35, chapters 3, 4].

We employ three methods to incorporate local features into our image retrieval system. The methods are evaluated for both types of local features described above:

LF histograms. The first method follows [40]: local features are extracted from all database images and jointly clustered to form 2048 clusters. Then, for each of the local features, all information except the identifier of the most similar cluster center is discarded, and for each image a histogram of the occurring patch-cluster identifiers is created, resulting in a 2048-dimensional histogram per image. These histograms are then used as features in the retrieval process and are compared using the Jeffrey divergence. This method was shown to produce good performance in object recognition and detection tasks [40]. Note that the timing information in Table 1 does not include the time to create the cluster model, since this is only done once per database and can be computed offline.

LF signatures. The second method is derived from the method proposed in [66]. Local features are extracted from each database image and clustered for each image separately to form 32 clusters per image. Then, for each image, the parameters of the clusters, i.e. the mean and the variance, are saved and the corresponding cluster-identifier histogram of the extracted features is created. These "local feature signatures" are then used as features in the retrieval process and
are compared using the Earth Mover’s Distance (EMD) [67]. compactness allows visual signal matching functionality with
This method was shown to produce good performance in ob- high retrieval efficiency at very small computational costs. It
ject recognition and detection tasks [66].
allows for query-by-sketch queries because the descriptor captures the layout information of color features. This is a clear
LF global search. The third method is based on global advantage over other color descriptors. This approach closely
patch search and derived from the method presented in [62]. resembles the use of very small thumbnails of the images with
Here, local features are extracted from all database images a quantization of the colors used.
and stored in a KD tree to allow for efficient nearest neighbor
searching. Given a query image, we extract local features 3.8.3 MPEG 7: Edge Histogram
from the image in the same way as we did for the database
The edge histogram descriptor represents the spatial distribuimages and search for the k nearest neighbors for each of
tion of five types of edges, namely four directional edges and
the query-patches in the set of database-patches. Then, we
one non-directional edge. According to the MPEG-7 stancount how many patches from each of the database image
dard, the image retrieval performance can be significantly
were found for the query patches and the database images
improved if the edge histogram descriptor is combined with
with the highest number of patch-hits are returned. Note
other descriptors such as the color histogram descriptor. The
that the timing information in Table 1 does not include the
descriptor is scale invariant and supports rotation invariant
time to create the KD tree, since this is only done once for a
and rotation sensitive matching operations.
database and can be computed offline.
3.8 MPEG-7 Features

The Moving Picture Experts Group (MPEG) has defined several visual descriptors in their standard, referred to as the MPEG-7 standard8. An overview of these features can be found in [68, 69, 70, 71]. The MPEG initiative focuses strongly on features that are computationally inexpensive to obtain and to compare, and also strongly optimizes the features with respect to the memory required for storage.

Coordinated by the MPEG, a reference implementation of this standard has been developed9. This reference implementation was used in our framework for experiments with these features. Unfortunately, the software is not yet in a fully functional state and thus only three MPEG-7 features could be used in the experiments. For each of these features, we use the comparison measures proposed by the MPEG standard and implemented in the reference implementation. The feature types are briefly described in the following:

3.8.1 MPEG 7: Scalable Color Descriptor

The scalable color descriptor is a color histogram in the HSV color space that is encoded by a Haar transform. Its binary representation is scalable in terms of bin numbers and bit representation accuracy over a broad range of data rates. Retrieval accuracy increases with the number of bits used in the representation. We use the default setting of 64 coefficients.

3.8.2 MPEG 7: Color Layout Descriptor

This descriptor effectively represents the spatial distribution of the color of visual signals in a very compact form. This compact form allows for high retrieval efficiency at very small computational costs. It allows for query-by-sketch queries because the descriptor captures the layout information of color features. This is a clear advantage over other color descriptors. This approach closely resembles the use of very small thumbnails of the images with a quantization of the colors used.

3.8.3 MPEG 7: Edge Histogram

The edge histogram descriptor represents the spatial distribution of five types of edges, namely four directional edges and one non-directional edge. According to the MPEG-7 standard, the image retrieval performance can be significantly improved if the edge histogram descriptor is combined with other descriptors such as the color histogram descriptor. The descriptor is scale invariant and supports rotation invariant and rotation sensitive matching operations.

4 Correlation Analysis of Features for CBIR

After discussing various features, let us now assume that a set of features is given, some of which account for color, others for texture, and maybe others for shape. An interesting question then is how features that can be used in combination should be chosen. Automatic methods for feature selection have been proposed e.g. in [72, 73]. These automatic methods, however, do not directly explain why features are chosen, are difficult to manipulate from a user's perspective, and normally require labelled training data.

The method proposed here does not require training data but only analyses the correlations between the features themselves, and instead of automatically selecting a set of features it provides the user with information helping to select an appropriate set of features.

To analyze the correlation between different features, we analyze the correlation between the distances d(q, X) obtained for each feature for each of the images X from the database given a query q. For each pair of query image q and database image X we create a vector (d1(q, X), d2(q, X), ..., dm(q, X), ..., dM(q, X)), where dm(q, X) is the distance of the query image q to the database image X for the m-th feature. Then we calculate the correlation between the dm over all q ∈ {q1, ..., ql, ..., qL} and all X ∈ {X1, ..., Xn, ..., XN}.

The M × M covariance matrix Σ of the dm is calculated over all N database images and all L query images as

    Σij = (1/(N L)) sum_{l=1..L} sum_{n=1..N} (di(ql, Xn) − µi) · (dj(ql, Xn) − µj)    (1)

with µi = (1/(N L)) sum_{n=1..N} sum_{l=1..L} di(ql, Xn).

8 http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
9 http://www.lis.e-technik.tu-muenchen.de/research/bv/topics/mmdb/e_mpeg7.html
Given the covariance matrix Σ, we calculate the correlation matrix R as Rij = Σij / sqrt(Σii · Σjj). The entries of this correlation
matrix can be interpreted as similarities of different features.
A high value Rij means a high similarity between features i
and j. This similarity matrix can then be analyzed to find
out which features have similar properties and which do not.
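Eq. (1) and the normalization to R can be computed directly from the stacked distance vectors; a minimal sketch (each input vector is assumed to hold the M per-feature distances for one query/database-image pair, names are illustrative):

```python
def correlation_matrix(dist_vectors):
    """Correlation matrix R of per-feature distances.
    dist_vectors: one (d1, ..., dM) tuple per (query, database image)
    pair, i.e. N*L vectors in total.  Implements Eq. (1) and the
    normalization Rij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj)."""
    n = len(dist_vectors)
    M = len(dist_vectors[0])
    mu = [sum(v[i] for v in dist_vectors) / n for i in range(M)]
    cov = [[sum((v[i] - mu[i]) * (v[j] - mu[j]) for v in dist_vectors) / n
            for j in range(M)] for i in range(M)]
    return [[cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5 for j in range(M)]
            for i in range(M)]
```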
One way to do this is to visualize it using multi-dimensional
scaling [74, p. 84ff]. Multi-dimensional scaling (MDS) seeks
a representation of data points in a lower dimensional space
while preserving the distances between data points as well as
possible. To visualize this data by multi-dimensional scaling,
we convert the similarity matrix R into a dissimilarity matrix
D by setting Dij = 1 − |Rij |. For visualization purposes, we
choose a two-dimensional space for MDS.
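A minimal stress-minimizing embedding in the spirit of MDS can be sketched as follows. This is a plain gradient-descent toy (random initialization, fixed step size), not the classical-scaling or SMACOF algorithms usually found in MDS packages; the dissimilarities Dij = 1 − |Rij| are assumed to be precomputed:

```python
import random

def mds_2d(D, iters=3000, lr=0.02, seed=0):
    """Embed n items in 2-D so that pairwise Euclidean distances
    approximate the symmetric dissimilarity matrix D (D[i][i] == 0).
    Minimizes the raw stress sum_ij (||x_i - x_j|| - D_ij)^2 by
    gradient descent from a random initialization."""
    rng = random.Random(seed)
    n = len(D)
    X = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx, dy = X[i][0] - X[j][0], X[i][1] - X[j][1]
                d = max((dx * dx + dy * dy) ** 0.5, 1e-9)
                step = lr * (d - D[i][j]) / d  # derivative of one stress term
                X[i][0] -= step * dx
                X[i][1] -= step * dy
    return X

def embedded_distance(X, i, j):
    """Euclidean distance between embedded points i and j."""
    return ((X[i][0] - X[j][0]) ** 2 + (X[i][1] - X[j][1]) ** 2) ** 0.5
```

With Dij = 1 − |Rij| as input, the resulting 2-D scatter corresponds to the kind of visualization shown in Figure 9.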
5 Benchmark databases for CBIR

To cover a wide range of different applications in which CBIR is used, we propose benchmark databases from different domains. In the ImageCLEF evaluations large image retrieval benchmark databases have been collected. However, these are not suitable for the comparison of image features, as for most of the tasks textual information is supplied and necessary for an appropriate solution of the task. Table 2 gives an overview of the databases used in the evaluations. Although the databases presented here are small in comparison to other CBIR tasks, they represent a wide variety of tasks and allow for a meaningful comparison of feature performances.

The WANG database (Section 5.1), as a subset of the Corel stock photo collection, can be considered similar to stock photo searches. The UW database (Section 5.2) and the UCID database (Section 5.5) mainly consist of personal images and represent the home user domain. The ZuBuD database (Section 5.4) and the IRMA database (Section 5.3) are limited domain tasks for touristic/building retrieval and medical applications, respectively.

5.1 WANG database

The WANG database is a subset of 1,000 images of the Corel stock photo database which have been manually selected and which form 10 classes of 100 images each. One example of each class is shown in Figure 1. The WANG database can be considered similar to common stock photo retrieval tasks, with several images from each category and a potential user having an image from a particular category and looking for similar images which e.g. have cheaper royalties or which have not been used by other media. The 10 classes are used for relevance estimation: given a query image, it is assumed that the user is searching for images from the same class, and therefore the remaining 99 images from the same class are considered relevant and the images from all other classes are considered irrelevant.

Figure 1: One example image from each of the 10 classes of the WANG database (africa, beach, monuments, buses, food, dinosaurs, elephants, flowers, horses, mountains) together with their class labels.

10 http://www-i6.informatik.rwth-aachen.de/~deselaers/uwdb/index.html
11 http://www.cs.washington.edu/research/imagedatabase/groundtruth/

5.2 UW database

The database created at the University of Washington consists of a roughly categorized collection of 1,109 images. These images are partly annotated using keywords. The remaining images were annotated by our group to allow the annotation to be used for relevance estimation; our annotations are publicly available10.

The images are of various sizes and mainly include vacation pictures from various locations. There are 18 categories, for example "spring flowers", "Barcelona", and "Iran". Some example images with annotations are shown in Figure 2. The complete annotation consists of 6,383 words with a vocabulary of 352 unique words. On the average, each image has about 6 words of annotation. The maximum number of keywords per image is 22 and the minimum is 1. The database is freely available11. The relevance assessments for the experiments with this database were performed using the annotation: an image is considered relevant w.r.t. a given query image if the two images have a common keyword in the annotation. On the average, 59.3 relevant images correspond to each image. The keywords are rather general; thus for example images showing sky are relevant w.r.t. each other, which makes it quite easy to find relevant images (high precision is likely easy) but it can be extremely difficult to obtain a high recall since some images showing sky might have hardly
any visual similarity with a given query. This task can be considered a personal photo retrieval task, e.g. a user with a collection of personal vacation pictures is looking for images from the same vacation, or showing the same type of building.

Figure 2: Examples from the UW database with annotation.

Table 2: Summary of the databases used for the evaluation with database name, number of images in the database, number of query images, average number of relevant images per query, and a description how the queries are evaluated.

database      images   queries   avg. rel   query mode
WANG           1,000     1,000       99.0   leaving-one-out
UW             1,109     1,109       59.3   leaving-one-out
IRMA-10000    10,000     1,000      520.2   test & database images are disjoint
ZuBuD          1,005       105        5.0   test & database images are disjoint
UCID           1,338       262        3.5   leaving-one-out

5.3 IRMA-10000 database

The IRMA database consists of 10,000 fully annotated radiographs taken randomly from medical routine at the RWTH Aachen University Hospital. The images are split into 9,000 training and 1,000 test images. The images are subdivided into 57 classes. The IRMA database was used in the ImageCLEF 2005 image retrieval evaluation for the automatic annotation task. For CBIR, the relevances are defined by the classes: given a query image from a certain class, all database images from the same class are considered relevant. Example images along with their class numbers and textual descriptions of the classes are given in Figure 3. This task is a medical image retrieval task and is in practical use at the Department for Diagnostic Radiology of the RWTH Aachen University Hospital.

As all images from this database are gray value images, we evaluate neither the color histograms nor the MPEG7 scalable color descriptor since they only account for color information.

5.4 ZuBuD database

The "Zurich Buildings Database for Image Based Recognition" (ZuBuD) is a database which has been created by the Swiss Federal Institute of Technology in Zurich and is described in more detail in [75, 76].

The database consists of two parts: a training part of 1,005 images of 201 buildings, 5 of each building, and a query part of 115 images. Each of the query images contains one of the buildings from the main part of the database. The pictures of each building are taken from different viewpoints and some of them are also taken under different weather conditions and with two different cameras. Given a query image, only images showing exactly the same building are considered relevant. To give a more precise idea of this database, some example images are shown in Figure 4.

This database can be considered as an example for a mobile travel guide task, which attempts to identify buildings in pictures taken with a mobile phone camera and then obtains certain information about the building [75]. The ZuBuD database is freely available12.

12 http://www.vision.ee.ethz.ch/ZuBuD
13 http://vision.doc.ntu.ac.uk/datasets/UCID/ucid.html

5.5 UCID database

The UCID database13 was created as a benchmark database for CBIR and image compression applications [77]. In [78] this database was used to measure the performance of a CBIR system using compressed domain features. This database is
similar to the UW database as it consists of vacation images and thus poses a similar task.

For 264 images, manual relevance assessments among all database images were created, allowing for performance evaluation. The images that are judged to be relevant are images which are very clearly relevant, e.g. for an image showing a particular person, images showing the same person are searched, and for an image showing a football game, images showing football games are considered to be relevant. The relevance assumption used makes the task easy on one hand, because relevant images are very likely quite similar, but on the other hand, it makes the task difficult, because there are likely images in the database which have a high visual similarity but which are not considered relevant. Thus, it can be difficult to obtain high precision results using the given relevance assessment, but since only few images are considered relevant, high recall values might be rather easy to obtain. Example images are given in Figure 5.

Figure 3: Example images of the IRMA 10000 database along with their class and annotation.

Figure 4: a) A query image and the 5 images from the same building in the ZuBuD database. b) 6 images of different buildings in the ZuBuD database.
Figure 5: Example images from the UCID database.

6 Evaluation of the Features Considered

In this section we report the results of the experimental evaluation of the features. To evaluate all features on the given databases, we extracted the features from the images and executed experiments to test the particular features. For all experiments, we report the mean average precision and the classification error rate. The connection between the classification error rate and mean average precision shows the strong relation between CBIR and classification. Both performance measures have advantages. The error rate is very precision oriented and thus it is best if relevant images are retrieved early. In contrast, the mean average precision accounts for the average performance over the complete PR graph. Furthermore, we calculated the distance vectors mentioned in Section 4 for each of the queries performed to obtain a global correlation analysis of all features.
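The mean average precision used here is the mean, over all queries, of the per-query average precision; a minimal sketch of the per-query computation (names illustrative):

```python
def average_precision(ranked_ids, relevant):
    """Average precision of a single query: the sum of the precision
    values at the ranks where a relevant image is retrieved, divided
    by the total number of relevant images."""
    hits = 0
    total = 0.0
    for rank, img in enumerate(ranked_ids, start=1):
        if img in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

Averaging this value over all queries of a database yields the mean average precision reported in the tables below.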
6.1 Performance Evaluation of Features
The results from the single feature experiments are given in
Figures 6 and 7 and in Tables 3 and 4. The results are sorted
by the average of the classification error rates. The results
from the correlation analysis are given in Figure 9. Note that
the features ‘color histogram’ and ‘MPEG7 scalable color’
were not evaluated for the IRMA database because pure color
descriptors are not suitable for this gray-scale database.
It can clearly be seen that different features perform differently on the databases. Grouping the features by performance results in three groups: one group of five features clearly outperforms the other features (average error rate < 30%, average mean average precision ≈ 50%); a second group has average error rates of approximately 40% (respectively an average mean average precision of 40%); and a last group performs clearly worse.
The top group is led by the color histogram which performs
very well for all color tasks and has not been evaluated on the
IRMA data. When all databases are considered, the global
feature search (cf. Section 3.7) of SIFT features extracted at
Harris points [35, chapters 3, 4] performs best on the average. This good performance is probably partly due to the
big success on the ZuBuD database, where features of similar
type were observed to perform exceedingly well [79]. They
also perform well on the UCID database, where relevant images, in contrast to the UW task, are very close neighbours.
The possibly high dissimilarity between relevant images in the UW database thus explains the poor performance there.
However, the patch histograms outperform the SIFT features
on all other tasks as they include color information which obviously is very important for most of the tasks. They also
obtain a good performance for the IRMA data. It can be
observed that the error rates for the UCID database are very
high in comparison to the other databases, so the UCID task
can be considered to be harder than e.g. the UW task.
A similar result to the one obtained using color histogram
is obtained by the invariant feature histogram with monomial
kernel. This is not surprising, as it is very similar to a color histogram, except that it also partly accounts for local texture. It can be observed that the performance for the color
databases is nearly identical to the color histogram. The relatively bad ranking of these features in the tables is due to
the poor performance on the IRMA task. Leaving out the IRMA task, this feature would be ranked second overall. The high similarity of color histograms and
invariant feature histograms with monomial kernel can also
directly be observed in Figure 9 where it can be seen that
color histograms (point 1) and invariant feature histograms
with monomial kernel (point 11) have very similar properties.
The second group of features consists of the signatures of SIFT features, the appearance-based image features, and the MPEG 7 color layout descriptor.
Although the image thumbnails compared with the image
distortion model perform quite poorly for the WANG, the
UW, and the UCID tasks, they perform extremely well for
the IRMA task and reasonably well for the ZuBuD task. A
major difference between these tasks is that the first three
databases contain general color photographs of completely
unconstrained scenes, whereas the latter ones contain images
from limited domains only.
The simpler appearance-based feature of 32×32 thumbnails
of the images, compared using Euclidean distance, is the next
best feature, and again it can be observed that it performs
well for the ZuBuD and IRMA tasks only.
As expected, the MPEG7 color layout descriptor and
32×32 image thumbnails obtain similar results because they
both encode the spatial distribution of colors or gray values
in the images.
Among the texture features (Tamura texture histogram,
Gabor features, global texture descriptor, relational invariant feature histogram, and MPEG-7 edge histogram), the
Tamura texture histogram and the Gabor histogram outperform the others.
6.2 Correlation Analysis of Features
Figure 8 shows the average correlation of different features
over all databases. The darker a field in this image is, the
lower the correlation between the corresponding features; bright fields denote high correlations. Figure 9 shows the
visualizations of the outcomes of multi-dimensional scaling
of the correlation analysis. We applied the correlation analysis for the different tasks individually (4 top plots) and for all
tasks jointly (bottom plot). Multi-dimensional scaling was
used to translate the similarities of the different features into
distances in a two-dimensional space. The further away two
points are in the graph, the less similar the corresponding features are for CBIR, and conversely the closer together they
appear, the higher the similarity between these features.
For each of these plots, the corresponding distance vectors obtained from all queries with all database images
have been used (WANG database: 1,000,000 distance vectors, UW&UCID database: 194,482+350,557 distance vectors, IRMA database: 9,000,000 distance vectors, ZuBuD
database: 115,575 distance vectors, all databases: 10,660,614
distance vectors).
The figures show a very strong correlation between color
histograms (point 1) and invariant feature histograms with
monomial kernel (point 11). In fact, they lead to hardly any
Table 3: Error rate [%] for each of the features for each of the databases (sorted by average error rate over the databases).

feature                               wang    uw  irma  ucid  zubud  average
color histogram                       16.9  12.3     –  51.5    7.8     22.1
LF SIFT global search                 37.2  31.5  27.7  31.7    7.0     27.0
LF patches histogram                  17.9  14.6  24.9  58.0   24.4     28.0
LF SIFT histogram                     25.6  21.4  30.8  50.4   18.3     29.3
inv. feature histogram (monomial)     19.2  12.9  55.8  53.8    7.8     29.9
MPEG7: scalable color                 25.1  13.0     –  60.7   32.2     32.7
LF patches signature                  24.3  17.6  42.7  68.7   36.5     38.0
Gabor histogram                       30.5  20.5  44.9  74.1   24.4     38.9
32x32 image                           47.2  26.4  22.8  82.8   27.0     41.2
MPEG7: color layout                   35.4  21.2  47.7  75.2   27.0     41.3
Xx32 image                            55.9  26.7  21.4  83.2   20.9     41.6
Tamura texture histogram              28.4  16.8  33.0  63.4   84.4     45.2
LF SIFT signature                     35.1  20.9  99.3  58.4   20.0     46.7
gray value histogram                  45.3  23.0  42.6  86.6   47.0     48.9
LF patches global                     42.9  42.7  48.2  63.4   47.8     49.0
MPEG7: edge histogram                 32.8  22.9  99.3  69.9   23.5     49.7
inv. feature histogram (relational)   38.3  23.6  39.2  83.2   93.9     55.6
Gabor vector                          65.5  37.9  42.5  95.8   73.0     62.9
global texture feature                51.4  32.4  67.7  95.4   98.3     69.0

Figure 6: Classification error rate [%] for each of the features for each of the databases (sorted by average error rate over the databases). The different shades of gray denote different databases and the blocks of bars denote different features.
Table 4: Mean average precision [%] for each of the features for each of the databases (sorted in the same order as Table 3 to allow for easy comparison).

feature                               wang    uw  irma  ucid  zubud  average
color histogram                       50.5  63.0     –  43.3   75.6     58.1
LF SIFT global search                 38.3  63.6  20.9  62.5   81.3     53.3
LF patches histogram                  48.3  62.0  31.4  37.5   64.7     48.8
LF SIFT histogram                     48.2  62.3  32.7  44.7   68.0     51.2
inv. feature histogram (monomial)     47.6  62.6  24.4  41.6   71.0     49.5
MPEG7: scalable color                 46.7  63.9     –  37.9   54.3     50.7
LF patches signature                  40.4  59.9  23.0  27.6   42.6     38.7
Gabor histogram                       41.3  59.7  25.2  22.3   48.7     39.4
32x32 image                           37.6  60.1  40.9  14.0   41.9     38.9
MPEG7: color layout                   41.8  61.0  29.8  21.7   47.7     40.4
Xx32 image                            24.3  57.0  35.0  13.9   47.0     35.4
Tamura texture histogram              38.2  60.8  30.4  33.2   15.8     35.7
LF SIFT signature                     36.7  61.2  10.9  34.1   62.7     41.1
gray value histogram                  31.7  59.4  26.1  11.8   36.5     33.1
LF patches global                     30.5  55.7  17.6  30.3   38.5     34.5
MPEG7: edge histogram                 40.8  61.4  10.9  25.2   46.3     36.9
inv. feature histogram (relational)   34.9  59.7  24.1  14.4    6.3     27.9
Gabor vector                          23.7  56.3  27.7   4.7   15.9     25.7
global texture feature                26.3  56.5  16.4   6.7    2.6     21.7

Figure 7: Mean average precision for each of the features for each of the databases (sorted in the same order as Fig. 6 to allow for easy comparison).
differences in the experiments. For the databases consisting
of color photographs they outperform most other features. A
high similarity is also observed between the patch signatures
(point 14) and the MPEG7 color layout (point 2) for all tasks.
Two other features that are highly correlated are the two
methods that use local feature search for the two different
types of local features (points 5 and 12). The different comparison methods for local feature histograms/signature have
similar performances (3, 4 and 13, 14, respectively).
Another strong correlation can be observed between 32×32
image thumbnails (point 18) and the MPEG7 color layout
representation (point 2), which was to be expected as both of
these have a rough representation of the spatial distribution
of colors (resp. gray values) of the images.
Interestingly, the correlation between 32×32 images compared using Euclidean distance (point 18) and the X × 32 images compared using the image distortion model (point 19) is
low, with only some similarity for the IRMA and the ZuBuD
task. This is partly due to the exceedingly good performance
of the image distortion model for the IRMA task and partly
due to the missing invariance with respect to slight deformations in the images for the Euclidean distance. For example
in the ZuBuD task, the image distortion model can partly
compensate for the changes in the viewpoints which leads to
a much better performance.
Another interesting aspect is that the various texture features (MPEG7 edge histogram (6), global texture feature
(10), Gabor features (8, 7), relational invariant feature histogram (15), and Tamura texture histogram (17)) are not
correlated strongly. We conclude that none of the texture features is sufficient to completely describe the textural properties of an image. The Tamura texture histogram and the Gabor histogram outperform the other texture features; with Tamura features better in three and Gabor histograms clearly better in two of the five tasks, both are a good choice for texture representation.
To give a little insight into how these plots can be used
to select sets of features for a given task, we discuss how
features for the WANG database could be chosen in the following paragraph. Combined features are linearly combined
as described in Section 2. Here, all features are weighted
equally, but some improvement of the retrieval results can
be achieved by choosing different weights for the individual
features. In [80] we present an approach to automatically
learning a feature combination from a set of queries with
known relevant images using a discriminative maximum entropy model.
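A minimal sketch of this linear combination (the normalization of the per-feature distances to comparable ranges is assumed to have happened beforehand; all names are illustrative):

```python
def rank_images(query_dists, weights=None):
    """Rank database images by a weighted linear combination of
    per-feature distances.  query_dists maps an image id to the list
    [d_1, ..., d_M] of its per-feature distances to the query; equal
    weights reproduce the setting used in the experiments."""
    M = len(next(iter(query_dists.values())))
    w = weights if weights is not None else [1.0] * M
    def score(img):
        return sum(wi * di for wi, di in zip(w, query_dists[img]))
    return sorted(query_dists, key=score)
```

Setting a weight to zero removes the corresponding feature, so the incremental combinations of Table 5 correspond to switching individual weights on one after another.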
Finding a suitable set of features. Assume we are about
to create a CBIR system for a new database consisting of
general photographs. We extract features from the data and
create the according MDS plot (Figure 9, top left). Since we
know that we are dealing with general photographs, we start
with a simple color histogram (point 1). The plot now tells
us that invariant feature histograms with monomial kernel
(11) would not give us much additional information. Next,
we consider the various texture descriptors (points 6, 10, 15,
17, 7, 8) and choose one of these, say global texture features
(10) and maybe another: Tamura texture histograms (17).
Now we have covered color and texture and can consider a
global descriptor such as the image thumbnails (18) or a local
descriptor such as one of (12, 13, or 14) or (3, 4, or 5). After
adding a feature, the performance of the CBIR system can be
evaluated by the user. In Table 5 we quantitatively show the
influence of adding these features for the WANG database. It
can be seen that the performance is incrementally improved
by adding more and more features.
6.3 Connection Between Mean Average Precision and Error Rate
In Figures 10 and 11 the correlation between mean average precision and error rate is visualized database-wise and
feature-wise, respectively. The correlation of error rate and
mean average precision over all experiments presented in this
paper is 0.87. In the keys of the figures, the correlations per
database and per feature are given, respectively.
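The correlations quoted here are Pearson coefficients computed over the (error rate, mean average precision) pairs of the experiments; a minimal sketch:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long
    sequences, e.g. the error rates and mean average precisions of
    the same set of experiments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```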
From Figure 10 it can be seen that this correlation varies across the tasks between 0.67 and 0.99. For the UCID task,
this correlation is markedly strong with 0.99. The correlation
is lowest for the UW task which has a correlation of 0.67 and
which is the only task with a correlation below 0.8.
In Figure 11, the same correlation is analyzed feature-wise.
Here, the correlation values vary strongly between 0.4 and
1.0. The LF SIFT signature descriptor has the lowest correlation and the LF patches histogram descriptor also has a
low correlation of only 0.6. The two different image thumbnail descriptors have a correlation of 0.7. All other features
have correlation values greater than 0.8; thus it can be said that an image representation that works well for classification will generally work well for CBIR, and vice versa. For example, this effect can be observed when looking at the
results for the WANG and IRMA database for the color histograms and the X ×32 thumbnails. On the one hand, for the
WANG database, the color histograms perform very well for
error rate and mean average precision; in contrast, the image
Table 5: Combining features using the results from the correlation analysis described for the WANG database.

features               ER [%]   MAP [%]
color histograms         16.9      50.5
+ global texture         15.7      49.5
+ Tamura histograms      13.7      51.2
+ thumbnails             13.7      53.9
+ patch histograms       11.6      55.7
Figure 8: Correlation of the different features. Bright fields denote high and dark fields denote low correlation. Another representation of this information is given in Figure 9.
thumbnails perform poorly. On the other hand, the effect is
reversed for the IRMA database: here, the color histograms
perform poorly and the image thumbnails outstandingly well.
It can be observed that the performance increase (resp. decrease) is of the same magnitude for mean average precision and error rate. Thus, a feature that performs well for the task of classification on a certain dataset will most probably be a good choice for retrieval of images from that dataset, too.
7 Conclusion
We have discussed a large variety of features for image retrieval and a setup of five freely available databases that can
be used to quantitatively compare these features. From the experiments conducted it can be deduced which features perform well on which kind of task and which do not. In contrast
to other papers, we consider tasks from different domains
jointly and directly compare and analyze which features are
suitable for which task.
Which features are suitable for which task in CBIR?
The main question addressed in this paper, which features
are suitable for which task in image retrieval, has been thoroughly investigated:
One clear finding is that color histograms, often cited as
a baseline in CBIR, clearly are a reasonably good baseline
for general color photographs. However, approaches using local image descriptors outperform color histograms in various
tasks but usually at the cost of much higher computational
costs. If the images are from a restricted domain, as they are
in the IRMA and in the ZuBuD task, other methods should
be considered as a baseline, e.g. a simple nearest neighbor
classifier using thumbnails of the images.
Furthermore, it has been shown that, despite more than 30
years of research on texture descriptors, none of the texture features presented can convey a complete description of
the texture properties of an image. Therefore a combination
of different texture features will usually lead to best results.
It should be noted that for specialized tasks, such as finding
images that show certain objects, better methods exist today
that can learn models of particular objects from a set of training data. However, these approaches are computationally far
more expensive and always require relatively large amounts
of training data.
Although the selection of features tested was not exhaustive, it was wide, and the methods presented can easily be applied to other features to compare them to the ones examined here. On the one hand, the descriptors were selected such that features presented many years ago, such as color histograms [54], Tamura texture features [56], Gabor features, and spatial autocorrelation features [58], are compared with very recent features such as SIFT descriptors [44] and patches [40]. On the other hand, the features were selected such that descriptors accounting for color, texture, and (partly) shape, as well as local and global descriptors, are covered. We also included a subset of the standardized MPEG7 features.
All features have been thoroughly examined experimentally on a set of five databases. All of these databases are freely available, and pointers to their locations are given in this paper. This allows researchers to compare the findings of this work with other features that were not covered here or that will be presented in the future. The databases chosen are representative of four different tasks in which CBIR plays an important role.
Which features are correlated and how can features be combined? We conducted a correlation analysis of the features considered, showing which features have similar properties and which do not. The outcome of this analysis can serve as an intuitive aid for finding suitable combinations of features for certain tasks. In contrast to other methods for feature combination, the method presented here does not rely on training data or relevance judgements to find a suitable set of features. In particular, it indicates which features are not worth combining because they produce correlated distances. The method is not a fully automatic feature selection method, but the process of selecting features is demonstrated for one of the tasks with promising results. However, combining several features is not the focus of this paper, as this would exceed its scope, and a variety of known methods have covered this aspect, e.g. [7, 81, 82].
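The underlying idea can be sketched as follows: for two features, collect the distances they assign to the same query/database pairs and compute the Pearson correlation; a high correlation means the two features rank images almost identically, so combining them adds little. The threshold of 0.9 is an illustrative choice, not a value taken from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def redundant(feature_a_dists, feature_b_dists, threshold=0.9):
    """Two features whose distances on the same query/database pairs
    correlate strongly produce near-identical rankings; combining
    such features is not worthwhile."""
    return pearson(feature_a_dists, feature_b_dists) > threshold
```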
Another conclusion we have drawn from this work is that the intuitive assumption that classification of images and CBIR are strongly connected is justified. Both tasks are strongly related to the concept of similarity, which can be measured best if suitable features are available. In this paper, we have evaluated this assumption quantitatively by considering four different domains and analyzing the classification error rate for classification and the mean average precision for CBIR. It was clearly shown empirically that features that perform well for classification also perform well for CBIR, and vice versa. This strong connection allows us to transfer knowledge obtained in either classification or CBIR to the respective other task. For example, in the medical domain much research has been done on classifying whether an image shows a pathological case; some of the knowledge obtained in these studies can likely be transferred to the CBIR domain to help retrieve images from picture archiving systems.
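For reference, mean average precision, the retrieval measure used in this comparison, can be sketched as follows; representing a ranked result list as a list of boolean relevance flags is our simplification:

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of the precision values at each rank
    where a relevant image appears in the ranked result list."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """MAP: average of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)
```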
Future Work. Future work in CBIR certainly includes finding new and better image descriptors and methods to combine them appropriately. Furthermore, the achievements in object detection and recognition will certainly find their way into the CBIR domain, and a shift towards methods that automatically learn about the semantics of images is conceivable. First steps in this direction can be seen in [83], where a method is presented that learns how to compare never-seen objects and provides an image similarity measure that works on the object level. Methods for automatic image annotation are also related to CBIR: the automatic generation of textual labels for images allows textual information retrieval techniques to be used to retrieve images.
Acknowledgement
This work was partially funded by the DFG (Deutsche
Forschungsgemeinschaft) under contract NE-572/6. The authors would like to thank Gyuri Dorkó (formerly with INRIA Rhône-Alpes) for providing his SIFT feature extraction
software and the authors of the MPEG7 XM reference implementation.
References
[1] Müller H, Michoux N, Bandon D, Geissbuhler A. A review of content-based image retrieval systems in medical applications – clinical benefits and future directions. International Journal of Medical Informatics 2004;73:1–23.
[2] Sun Y, Zhang H, Zhang L, Li M. MyPhotos – A System for Home Photo Management and Processing. In: ACM Multimedia Conference. Juan-les-Pins, France; 2002. p. 81–82.
[3] Markkula M, Sormunen E. Searching for Photos – Journalists' Practices in Pictorial IR. In: Electronic Workshops in Computing – Challenge of Image Retrieval. Newcastle, UK; 1998. p. 1–13.
[4] Armitage LH, Enser PG. Analysis of user need in image archives. Journal of Information Science 1997;23(4):287–299.
[5] Shirahatti NV, Barnard K. Evaluating Image Retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 05). vol. 1. San Diego, CA, USA: IEEE; 2005. p. 955–961.
[6] Deselaers T, Weyand T, Keysers D, Macherey W, Ney H. FIRE in ImageCLEF 2005: Combining Content-based Image Retrieval with Textual Information Retrieval. In: Accessing Multilingual Information Repositories, 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005. vol. 4022 of Lecture Notes in Computer Science. Vienna, Austria; 2006. p. 652–661.
[7] Yavlinski A, Pickering MJ, Heesch D, Rüger S. A Comparative Study of Evidence Combination Strategies. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004). vol. 3. Montreal, Canada; 2004. p. 1040–1043.
[8] Heesch D, Rüger S. Performance Boosting with Three Mouse Clicks – Relevance Feedback for CBIR. In: European Conference on Information Retrieval Research. No. 2633 in LNCS. Pisa, Italy: Springer Verlag; 2003. p. 363–376.
[9] Müller H, Müller W, Marchand-Maillet S, Squire DM. Strategies for Positive and Negative Relevance Feedback in Image Retrieval. In: International Conference on Pattern Recognition. vol. 1 of Computer Vision and Image Analysis. Barcelona, Spain; 2000. p. 1043–1046.
[10] Müller H, Müller W, Squire DM, Marchand-Maillet S, Pun T. Learning Feature Weights from User Behavior in Content-Based Image Retrieval. In: Simoff S, Zaiane O, editors. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Workshop on Multimedia Data Mining MDM/KDD2000). Boston, MA, USA; 2000.
[11] MacArthur SD, Brodley CE, Shyu C-R. Relevance Feedback Decision Trees in Content-based Image Retrieval. In: Content-based Access of Image and Video Libraries. Hilton Head Island, SC, USA: IEEE; 2000. p. 68–72.
[12] Smeulders AWM, Worring M, Santini S, Gupta A, Jain R. Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(12):1349–1380.
[13] Forsyth DA, Ponce J. Computer Vision: A Modern Approach. Prentice Hall; 2002. p. 599–619.
[14] Rui Y, Huang T, Chang S. Image Retrieval: Current Techniques, Promising Directions and Open Issues. Journal of Visual Communication and Image Representation 1999;10(4):39–62.
[15] Datta R, Li J, Wang JZ. Content-based Image Retrieval – Approaches and Trends of the New Age. In: ACM Intl. Workshop on Multimedia Information Retrieval, ACM Multimedia. Singapore; 2005.
[16] Lew MS, Sebe N, Djeraba C, Jain R. Content-based Multimedia Information Retrieval: State of the Art and Challenges. ACM Transactions on Multimedia Computing, Communications and Applications 2006;2(1):1–19.
[17] de Vries AP, Westerveld T. A Comparison of Continuous vs. Discrete Image Models for Probabilistic Image and Video Retrieval. In: Proc. International Conference on Image Processing. Singapore; 2004. p. 2387–2390.
[18] Faloutsos C, Barber R, Flickner M, Hafner J, Niblack W, Petkovic D, et al. Efficient and Effective Querying by Image Content. Journal of Intelligent Information Systems 1994;3(3/4):231–262.
[19] Pentland A, Picard R, Sclaroff S. Photobook: Content-Based Manipulation of Image Databases. International Journal of Computer Vision 1996;18(3):233–254.
[20] Carson C, Belongie S, Greenspan H, Malik J. Blobworld: Image Segmentation Using Expectation-Maximization and its Application to Image Querying. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(8):1026–1038.
[21] Siggelkow S, Schael M, Burkhardt H. SIMBA – Search IMages By Appearance. In: DAGM 2001, Pattern Recognition, 23rd DAGM Symposium. vol. 2191 of Lecture Notes in Computer Science. Munich, Germany: Springer Verlag; 2001. p. 9–17.
[22] Iqbal Q, Aggarwal J. CIRES: A System for Content-Based Retrieval in Digital Image Libraries. In: International Conference on Control, Automation, Robotics and Vision. Singapore; 2002. p. 205–210.
[23] Wang JZ, Li J, Wiederhold G. SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001;23(9):947–963.
[24] Lehmann TM, Güld M-O, Deselaers T, Keysers D, Schubert H, Spitzer K, et al. Automatic Categorization of Medical Images for Content-based Retrieval and Data Mining. Computerized Medical Imaging and Graphics 2005;29(2):143–155.
[25] Deselaers T, Keysers D, Ney H. FIRE – Flexible Image Retrieval Engine: ImageCLEF 2004 Evaluation. In: Multilingual Information Access for Text, Speech and Images – Fifth Workshop of the Cross-Language Evaluation Forum, CLEF 2004. vol. 3491 of Lecture Notes in Computer Science. Bath, UK: Springer; 2005. p. 688–698.
[26] Deselaers T, Keysers D, Ney H. Features for Image Retrieval – A Quantitative Comparison. In: DAGM 2004, Pattern Recognition, 26th DAGM Symposium. vol. 3175 of Lecture Notes in Computer Science. Tübingen, Germany; 2004. p. 228–236.
[27] Bloehdorn S, Petridis K, Saathoff C, Simou N, Tzouvaras V, Avrithis Y, et al. Semantic Annotation of Images and Videos for Multimedia Analysis. In: European Semantic Web Conference (ESWC 05). Heraklion, Greece; 2005.
[28] Di Sciascio E, Donini FM, Mongiello M. Structured Knowledge Representation for Image Retrieval. Journal of Artificial Intelligence Research 2002;16:209–257.
[29] Meghini C, Sebastiani F, Straccia U. A Model of Multimedia Information Retrieval. Journal of the ACM 2001;48(5):909–970.
[30] Squire DM, Müller W, Müller H, Raki J. Content-Based Query of Image Databases, Inspirations from Text Retrieval: Inverted Files, Frequency-Based Weights and Relevance Feedback. In: Scandinavian Conference on Image Analysis. Kangerlussuaq, Greenland; 1999. p. 143–149.
[31] Park M, Jin JS, Wilson LS. Fast Content-Based Image Retrieval Using Quasi-Gabor Filter and Reduction of Image Feature. In: Southwest Symposium on Image Analysis and Interpretation. Santa Fe, NM; 2002. p. 178–182.
[32] Clough P, Müller H, Sanderson M. The CLEF Cross-Language Image Retrieval Track (ImageCLEF) 2004. In: Fifth Workshop of the Cross-Language Evaluation Forum (CLEF 2004). vol. 3491 of LNCS; 2005. p. 597–613.
[33] Clough P, Müller H, Deselaers T, Grubinger M, Lehmann T, Jensen J, et al. The CLEF 2005 Cross-Language Image Retrieval Track. In: Accessing Multilingual Information Repositories, 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005. vol. 4022 of Lecture Notes in Computer Science. Vienna, Austria; 2006. p. 535–557.
[34] Müller H, Marchand-Maillet S, Pun T. The Truth About Corel – Evaluation in Image Retrieval. In: Proceedings of The Challenge of Image and Video Retrieval (CIVR2002). vol. 2383 of LNCS. London, UK; 2002. p. 38–49.
[35] Dorkó G. Selection of Discriminative Regions and Local Descriptors for Generic Object Class Recognition. Ph.D. thesis. Institut National Polytechnique de Grenoble; 2006.
[36] Fei-Fei L, Perona P. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In: IEEE Conference on Computer Vision and Pattern Recognition. vol. 2. San Diego, CA, USA: IEEE; 2005. p. 524–531.
[37] Fergus R, Perona P, Zisserman A. Object Class Recognition by Unsupervised Scale-Invariant Learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 03). Blacksburg, VA; 2003. p. 264–271.
[38] Opelt A, Pinz A, Fussenegger M, Auer P. Generic Object Recognition with Boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006;28(3):416–431.
[39] Marée R, Geurts P, Piater J, Wehenkel L. Random Subwindows for Robust Image Classification. In: IEEE Conference on Computer Vision and Pattern Recognition; 2005. p. 34–40.
[40] Deselaers T, Keysers D, Ney H. Discriminative Training for Object Recognition Using Image Patches. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 05). vol. 2. San Diego, CA; 2005. p. 157–162.
[41] Jain S. Fast Image Retrieval Using Local Features: Improving Approximate Search Employing Seed-Grow Approach. Master's thesis. INPG, Grenoble; 2004.
[42] Schmid C, Mohr R. Local Grayvalue Invariants for Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997;19(5):530–534.
[43] van Gool L, Tuytelaars T, Turina A. Local Features for Image Retrieval. In: Veltkamp RC, Burkhardt H, Kriegel H-P, editors. State-of-the-Art in Content-Based Image and Video Retrieval. Kluwer Academic Publishers; 2001. p. 21–41.
[44] Lowe DG. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 2004;60(2):91–110.
[45] Deselaers T, Hegerath A, Keysers D, Ney H. Sparse Patch-Histograms for Object Classification in Cluttered Images. In: DAGM 2006, Pattern Recognition, 27th DAGM Symposium. vol. 4174 of Lecture Notes in Computer Science. Berlin, Germany; 2006. p. 202–211.
[46] Müller H, Müller W, Squire DM, Marchand-Maillet S, Pun T. Performance Evaluation in Content-Based Image Retrieval: Overview and Proposals. Pattern Recognition Letters (Special Issue on Image and Video Indexing, H. Bunke and X. Jiang, editors) 2001;22(5):593–601.
[47] Deselaers T, Keysers D, Ney H. Classification Error Rate for Quantitative Evaluation of Content-based Image Retrieval Systems. In: International Conference on Pattern Recognition (ICPR 2004). vol. 2. Cambridge, UK; 2004. p. 505–508.
[48] Bober M. MPEG-7 Visual Shape Descriptors. IEEE Transactions on Circuits and Systems for Video Technology 2001;11(6):716–719.
[49] Puzicha J, Rubner Y, Tomasi C, Buhmann J. Empirical Evaluation of Dissimilarity Measures for Color and Texture. In: International Conference on Computer Vision. vol. 2. Corfu, Greece; 1999. p. 1165–1173.
[50] Nölle M. Distribution Distance Measures Applied to 3-D Object Recognition – A Case Study. In: DAGM 2003, Pattern Recognition, 25th DAGM Symposium. vol. 2781 of Lecture Notes in Computer Science. Magdeburg, Germany: Springer Verlag; 2003. p. 84–91.
[51] Keysers D, Deselaers T, Gollan C, Ney H. Deformation Models for Image Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007;29(8):1422–1435.
[52] Deselaers T. Features for Image Retrieval. Master's thesis. Human Language Technology and Pattern Recognition Group, RWTH Aachen University. Aachen, Germany; 2003.
[53] Zahedi M, Keysers D, Deselaers T, Ney H. Combination of Tangent Distance and an Image Distortion Model for Appearance-Based Sign Language Recognition. In: DAGM 2005, Pattern Recognition, 26th DAGM Symposium. vol. 3663 of Lecture Notes in Computer Science. Vienna, Austria; 2005. p. 401–408.
[54] Swain MJ, Ballard DH. Color Indexing. International Journal of Computer Vision 1991;7(1):11–32.
[55] Smith JR, Chang S-F. Tools and Techniques for Color Image Retrieval. In: SPIE Storage and Retrieval for Image and Video Databases. vol. 2670; 1996. p. 426–437.
[56] Tamura H, Mori S, Yamawaki T. Textural Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man, and Cybernetics 1978;8(6):460–472.
[57] Haberäcker P. Praxis der Digitalen Bildverarbeitung und Mustererkennung. München, Wien: Carl Hanser Verlag; 1995.
[58] Haralick RM, Shanmugam B, Dinstein I. Texture Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics 1973;3(6):610–621.
[59] Gu ZQ, Duncan CN, Renshaw E, Mugglestone MA, Cowan CFN, Grant PM. Comparison of Techniques for Measuring Cloud Texture in Remotely Sensed Satellite Meteorological Image Data. Radar and Signal Processing 1989;136(5):236–248.
[60] Siggelkow S. Feature Histograms for Content-Based Image Retrieval. Ph.D. thesis. University of Freiburg, Institute for Computer Science. Freiburg, Germany; 2002.
[61] Fergus R, Perona P, Zisserman A. A Sparse Object Category Model for Efficient Learning and Exhaustive Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 05). vol. 2. San Diego, CA, USA: IEEE; 2005. p. 380–389.
[62] Paredes R, Perez-Cortes J, Juan A, Vidal E. Local Representations and a Direct Voting Scheme for Face Recognition. In: Workshop on Pattern Recognition in Information Systems. Setúbal, Portugal; 2001. p. 71–79.
[63] Vailaya A, Figueiredo MAT, Jain AK, Zhang H-J. Image Classification for Content-Based Indexing. IEEE Transactions on Image Processing 2001;10(1):117–130.
[64] Antani S, Kasturi R, Jain R. A Survey on the Use of Pattern Recognition Methods for Abstraction, Indexing and Retrieval of Images and Video. Pattern Recognition 2002;35:945–965.
[65] Deselaers T, Rybach D, Dreuw P, Keysers D, Ney H. Face-based Image Retrieval – One Step Toward Object-based Image Retrieval. In: Müller H, Hanbury A, editors. MUSCLE/ImageCLEF Workshop on Image and Video Retrieval Evaluation. Vienna, Austria; 2005. p. 25–32.
[66] Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, et al. A Comparison of Affine Region Detectors. International Journal of Computer Vision 2005;65(1/2).
[67] Rubner Y, Tomasi C, Guibas LJ. A Metric for Distributions with Applications to Image Databases. In: International Conference on Computer Vision. Bombay, India; 1998. p. 59–66.
[68] Eidenberger H. How Good Are the Visual MPEG-7 Features? In: Proceedings SPIE Visual Communications and Image Processing Conference. vol. 5150. Lugano, Italy; 2003. p. 476–488.
[69] Manjunath B, Ohm J-R, Vasudevan VV, Yamada A. Color and Texture Descriptors. IEEE Transactions on Circuits and Systems for Video Technology 2001;11(6):703–715.
[70] Ohm J-R. The MPEG-7 Visual Description Framework – Concepts, Accuracy and Applications. In: CAIP 2001. No. 2124 in LNCS; 2001. p. 2–10.
[71] Yang Z, Kuo C. Survey on Image Content Analysis, Indexing, and Retrieval Techniques and Status Report of MPEG-7. Tamkang Journal of Science and Engineering 1999;3(2):101–118.
[72] Vasconcelos N, Vasconcelos M. Scalable Discriminant Feature Selection for Image Retrieval and Recognition. In: CVPR 2004. vol. 2. Washington, DC, USA; 2004. p. 770–775.
[73] Najjar M, Ambroise C, Cocquerez J-P. Feature Selection for Semi-Supervised Learning Applied to Image Retrieval. In: ICIP 2003. vol. 3. Barcelona, Spain; 2003. p. 559–562.
[74] Hand D, Mannila H, Smyth P. Principles of Data Mining. Cambridge, MA: MIT Press; 2001.
[75] Shao H, Svoboda T, van Gool L. ZuBuD – Zurich Buildings Database for Image Based Recognition. Computer Vision Lab, Swiss Federal Institute of Technology. Zurich, Switzerland; 2003.
[76] Shao H, Svoboda T, Tuytelaars T, van Gool L. HPAT Indexing for Fast Object/Scene Recognition Based on Local Appearance. In: Conference on Image and Video Retrieval. vol. 2728 of LNCS. Urbana-Champaign, IL: Springer Verlag; 2003. p. 71–80.
[77] Schaefer G, Stich M. UCID – An Uncompressed Colour Image Database. In: Proc. SPIE Storage and Retrieval Methods and Applications for Multimedia. San Jose, CA, USA; 2004. p. 472–480.
[78] Schaefer G. CVPIC Colour/Shape Histograms for Compressed Domain Image Retrieval. In: DAGM 2004. vol. 3175 of LNCS. Tübingen, Germany; 2004. p. 424–431.
[79] Obdrzalek S, Matas J. Image Retrieval Using Local Compact DCT-Based Representation. In: DAGM 2003, Pattern Recognition, 25th DAGM Symposium. vol. 2781 of Lecture Notes in Computer Science. Magdeburg, Germany: Springer Verlag; 2003. p. 490–497.
[80] Deselaers T, Weyand T, Ney H. Image Retrieval and Annotation Using Maximum Entropy. In: Peters C, Clough P, Gey F, Karlgren J, Magnini B, Oard D, et al., editors. Evaluation of Multilingual and Multi-modal Information Retrieval – Seventh Workshop of the Cross-Language Evaluation Forum, CLEF 2006. vol. 4730 of Lecture Notes in Computer Science. Alicante, Spain; 2007. p. 725–734.
[81] Kittler J. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998;20(3):226–239.
[82] Heesch D, Rüger S. Combining Features for Content-Based Sketch Retrieval – A Comparative Evaluation of Retrieval Performance. In: European Colloquium on Information Retrieval Research. vol. 2291 of LNCS. Glasgow, Scotland, UK; 2002. p. 41–52.
[83] Nowak E, Jurie F. Learning Visual Similarity Measures for Comparing Never Seen Objects. In: CVPR 2007. Minneapolis, MN, USA; 2007.
[Figure 9 plots: scatter plots for the WANG, UW & UCID, IRMA, and ZuBuD databases plus one joint plot; the numeric point labels are decoded in the caption.]
Figure 9: Correlation of the different features visualized using multi-dimensional scaling. Features that lie close together have
similar properties. Top 4 plots: database-wise visualization, bottom plot: all databases jointly. The numbers in the plots
denote the individual features: 1: color histogram, 2: MPEG7: color layout, 3: LF SIFT histogram, 4: LF SIFT signature,
5: LF SIFT global search, 6: MPEG7: edge histogram, 7: Gabor vector, 8: Gabor histograms, 9: gray value histogram, 10:
global texture feature, 11: inv. feature histogram (monomial), 12: LF patches global, 13: LF patches histogram, 14: LF
patches signature, 15: inv. feature histogram (relational), 16: MPEG7: scalable color, 17: Tamura texture histogram, 18:
32x32 image, 19: Xx32 image.
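A layout like Figure 9 can be produced with classical (Torgerson) multi-dimensional scaling. The sketch below assumes NumPy and a symmetric dissimilarity matrix, e.g. one minus the pairwise distance correlations; the exact dissimilarity used for the figure may differ:

```python
import numpy as np

def classical_mds(dissimilarity, dims=2):
    """Classical (Torgerson) MDS: embed items so that pairwise Euclidean
    distances approximate the given dissimilarities."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest eigenvalues
    coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return coords                              # one 2-D point per feature
```

Features whose distance outputs are highly correlated receive a small dissimilarity and thus end up close together in the embedding.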
[Figure 10 plot: mean average precision [%] vs. classification error rate [%]; legend entries irma, ucid, uw, wang, zubud with correlations −0.81, −0.99, −0.67, −0.90, −0.93; overall correlation of ER and MAP: −0.86.]
Figure 10: Analysis of the correlation between classification error rate and mean average precision for the databases. The
numbers in the legend give the correlation for the experiments performed on the individual databases.
[Figure 11 plot: mean average precision [%] vs. classification error rate [%]; legend entries color histogram, MPEG7: color layout, LF SIFT histogram, LF SIFT signature, LF SIFT global search, MPEG7: edge histogram, Gabor vector, Gabor histogram, gray value histogram, global texture feature, inv. feature histogram (monomial), LF patches global, LF patches histogram, LF patches signature, inv. feature histogram (relational), MPEG7: scalable color, Tamura texture histogram, 32x32 image, Xx32 image, with per-feature correlations −1.0, −1.0, −0.9, −0.4, −1.0, −1.0, −1.0, −1.0, −0.6, −1.0, −1.0, −0.8, −0.6, −0.8, −1.0, −0.9, −0.9, −0.7, −0.7.]
Figure 11: Analysis of the correlation between classification error rate and mean average precision for the features. The
numbers in the legend give the correlation for the experiments performed using the individual features.