1 Introduction

Nowadays, many image repositories are available for almost every domain, such as architecture, astronomy, education, geology, medicine, multimedia, and remote sensing. This massive growth in the number of available images creates the need to store and retrieve images efficiently, increasing the importance of Content-Based Image Retrieval (CBIR) systems. Such systems have a multitude of real-life applications in crime prevention, digital libraries, medical diagnosis, the textile industry, traffic congestion analysis, and so on.

One of the main challenges in CBIR is to choose features that are sufficiently discriminative to infer how similar images are, while keeping them compact enough to ensure that the system is timely and computationally efficient. Furthermore, human perception of image similarity, which is subjective, semantic, and task-dependent, may not be captured by commonly used low-level features (e.g., color, shape, texture). This phenomenon is known as the semantic gap between the high-level concepts conveyed by an image (e.g., emotions, events, or objects) and the limited descriptive power of low-level visual features [2, 60, 86, 115]. For example, consider Fig. 1a (yellow car close to a green wall) as the query image of a retrieval system that only uses low-level features. Both Fig. 1b (lady with a yellow dress on green grass) and Fig. 1d (yellow car close to a tree) would be returned as results (due to the color similarity). However, it would be desirable for Fig. 1c (red car in the wilderness) to be returned instead of Fig. 1b, since Fig. 1a and c are semantically very similar (both depict a car).

Fig. 1

Example of a query image and three related images that illustrate the semantic gap between high-level concepts and low-level features: only Fig. 1d is similar according to both visual and semantic features. (figure best seen in color)

Over the years, multiple approaches have been proposed to mitigate the semantic gap: (i) generation of high-level features that mimic human perception using deep learning; (ii) multi-feature early, hierarchical, and late fusion methods to combine low-level features (and to a lower extent, low- and high-level features, and multiple high-level features); (iii) incorporation of human expertise, through relevance feedback, in the retrieval process, leading to perceptually and semantically more meaningful results.

Besides the semantic gap, another issue for CBIR systems concerns the choice of the best features for a certain domain. If we consider pictures of our everyday life, one might expect that the more information is available, the better a retrieval system’s performance will be. However, in certain domains, such as art or medicine, this is not necessarily true. Not only do images from those domains have characteristics that are quite different from everyday pictures, but their characteristics may also differ significantly within the same domain. For example, many medical images are only gray-level (e.g., radiography, computed tomography, Magnetic Resonance Imaging (MRI)), so shape and texture acquire increased relevance compared to color or semantic features [26, 103].

Unveiling the best combination of features to design novel CBIR systems for new datasets or domains usually involves a huge experimentation overhead and is highly task- and domain-specific, leading to CBIR systems fine-tuned for each domain. Consider, for instance, a scenario where multiple biomedical CBIR systems are available, each fine-tuned to a specific disease and diagnostic method (e.g., a shape-based CBIR tuned for brain cancer MRI scans, a color- and texture-based CBIR for breast cancer histopathological images, etc.). How can one leverage the existing fine-tuned CBIR systems to accommodate other diseases or diagnostic methods, or even other domains? Early fusion and hierarchical approaches would require even further experimentation to combine the different types of features, resulting in more fine-tuned systems. Although late fusion approaches are more robust and well suited to such a task, they lack interpretability, as it is unclear which CBIR system performs best for the task at hand. Relevance feedback techniques could also be used to tune the results and adapt the CBIR system, but they usually require constant feedback from the users. In the biomedical CBIR example above, it would be unrealistic and costly to expect medical specialists to give feedback on every result retrieved by a CBIR system.

To address these challenges, we present ExpertosLF, an interpretable late fusion technique that takes advantage of human feedback while requiring minimal effort and interaction. For that, we propose a novel application of online learning to late fuse multiple CBIR systems, under the framework of prediction with expert advice. Each CBIR system in the ensemble is assigned a weight that determines how much it contributes to the final set of images retrieved for a given query. The systems’ weights are updated in an online fashion, based on the quality of each system’s results, assessed by one or more human evaluators at each query. The resulting ensemble is independent of the dataset and domain, while being able to take advantage of previous experiments used to create the individual CBIR systems. ExpertosLF is designed to be interpretable, model-agnostic, modular, and scalable.

With this work, we aim to address the following research questions:

  • RQ1) Does our late fusion technique improve retrieval performance?

  • RQ2) Does the resulting ensemble perform as well as the best individual CBIR?

  • RQ3) Can we use the ensemble learned in an online setting in an offline setting?

  • RQ4) Are the CBIR experts in the resulting ensemble plausible considering the domain at hand?

Our contribution is threefold:

  1. A model-agnostic interpretable late fusion technique based on online learning with expert advice, which dynamically combines CBIR systems without knowing a priori which ones are the best for a given domain;

  2. Mitigation of the semantic gap between the low-level information of an image and its high-level semantic concepts, by studying the impact of combining both kinds of descriptors in CBIR in different domains;

  3. A set of extensive experiments on 13 benchmark datasets focusing on three different domains: Biomedical, Real, and Sketch.

ExpertosLF surpasses the performance of state of the art late fusion techniques for the majority of the datasets. It quickly converges to the performance of the best CBIR systems across domains, without any previous domain knowledge (in most cases, fewer than 25 queries need to receive human feedback). Moreover, the ensemble learned using our weighted late-fusion technique can be successfully applied to an offline scenario (i.e., in which there is no feedback available).

2 Related work

The typical flow of a CBIR system is depicted in Fig. 2. The first step consists of generating a set of features to accurately represent the content of each image in the database. These sets of features, also called descriptors, are used to compute the distance between the query image and each candidate image in the database, in order to retrieve the most similar images to the query image.

Fig. 2

Architecture of the typical CBIR setting. The retrieved images with a green border represent relevant images, while a red border represents non-relevant images for the given query. (figure best seen in color)

Ascertaining the most discriminative descriptors is highly dependent on both the type of images the CBIR system will handle (colored, black and white, or gray-level) and the domain at hand (e.g., art, medical, textile, remote sensing). For example, an image of a sunset will have more semantic and color information than an image from a medical examination (consider an X-ray or Computed Tomography scan, in which the shape of the organ under analysis is more prominent). Moreover, there is a semantic gap between high-level concepts, such as emotions, events, objects or activities conveyed by an image, and the limited descriptive power of low-level visual descriptors, as exemplified earlier in Section 1.

Here, we analyse how the semantic gap has been addressed in several CBIR works focusing on computational methods that: (i) propose novel low- and high-level descriptors (Sections 2.1 and 2.2), and combinations among them (Section 2.3); (ii) improve the retrieval process using human relevance feedback at each query (Section 2.4). Our analysis is focused on the last five years. For a more complete review, see [52, 77, 122, 123].

2.1 Low-level descriptors

Hand-crafted global and local low-level descriptors representing color, shape, and texture are widely used in current CBIR systems. Color is extensively used since it is a basic constituent of images, relatively robust to background complexity, and independent of orientation and image size. Shape is useful for matching objects based on their physical structure and profile. Texture is used to look for visual patterns with properties of homogeneity that are not achieved by the presence of a single color, and to describe how those patterns are spatially defined.

Global descriptors are extracted from the whole image, are easy to compute, and have lower dimensionality. Multiple descriptors have been proposed: color (e.g., Auto Color Correlogram (ACC) [38], Color Coherence Vectors (CCV) [89], Color Histogram (CH) [89], Color Moments (CM) [28], Opponent Histogram (OH) [105], and Reference Color Similarity (RCS) [48]), shape (e.g., Edge Histogram (EH) [19], and Zernike Moment Descriptor (ZMD) [47]), and texture (e.g., Gabor [64], Haralick [37], and Hybrid Directional Extrema Pattern (HDEP)).

Local descriptors are extracted from sub-images of a given image. They are robust to occlusion, changes in illumination and background, and geometric transformations; however, they are usually complex to compute and produce high-dimensional vectors [113]. The Local Binary Patterns (LBP) descriptor is widely used in color and texture retrieval since it reflects the correlation among pixels within a local area [32, 71]. Other local descriptors include Binary Robust Independent Elementary Features (BRIEF) [16], Binary Robust Invariant Scalable Keypoints (BRISK) [53], Fast Retina KeyPoint (FREAK) [4], Scale-Invariant Feature Transform (SIFT) [66], and Speeded Up Robust Features (SURF) [12].

Singular Value Decomposition (SVD)-based descriptors take advantage of the local spatial relationships of non-overlapping sub-regions of an image [32, 63, 106]. Radon transforms are useful for reconstructing objects, and receive special attention in the medical domain [9, 100, 101].

2.2 High-level descriptors

Low-level information is useful to discriminate images, but it often fails at capturing the high-level semantic concepts perceived by humans. To model high-level abstractions present in images, deep learning approaches have been proposed in recent years. Deep approaches are able to learn complex representations from large amounts of data in a supervised manner. Examples of such representations are Convolutional Neural Networks (CNNs) [49], which have been widely adopted for multiple tasks, such as classification, image segmentation, or object recognition. Recently, CNNs have also been explored in retrieval tasks.

The most common approach is to extract feature representations from a pre-trained CNN model by feeding images into the input layer of the model and taking activation values either from fully connected layers (to capture semantic information) or from convolutional layers using pooling techniques (to exploit spatial information). CNNs pre-trained on the ImageNet dataset are commonly used in CBIR systems [29]: AlexNet [88, 104, 110], Fast Convolutional Neural Network (FCNN) [118], VGG-19 [119]. Some authors have also proposed novel deep approaches: a CNN to retrieve images of different body organs [79], a scalable CNN-based face CBIR [98], a Convolutional Sparse Kernel Network for the medical domain [3], a Deep Belief Network for object-based retrieval [85], and a Fuzzy Neural Network to learn effective binary codes while enhancing interpretability [60].

In some domains, the number of images available is not sufficient to train a robust deep model, i.e., the model is prone to overfit. Transfer learning is beneficial in such situations, since features can be learnt in a resource-rich domain and then applied to a resource-scarce domain. Several works adopt CNNs pre-trained on natural images and apply them to their target domain: VGG-m and VGG-16 for landmarks/monuments [5], ResNet-50 for diabetic retinopathy [26], Capsule Networks with 3D CNNs to detect Alzheimer’s disease using MRI [50], VGG19 for brain tumors [93], Inception-ResNet-V2 for otoscope images [17], and DenseNet121 for chest X-ray images [94]. Most models were pre-trained on the ImageNet dataset [5, 26, 93, 94] (for the remaining ones, the information is missing or unclear).
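
To make the feature-extraction approach described above concrete, the following is a minimal sketch of extracting a high-level descriptor from a CNN pre-trained on ImageNet. It assumes PyTorch and torchvision with a ResNet-50 backbone, which is an illustrative choice rather than one of the exact models used in the cited works; the `extract_descriptor` helper is hypothetical.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a CNN pre-trained on ImageNet and drop its classification head,
# keeping the globally pooled convolutional features (2048-D for ResNet-50).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # expose the penultimate-layer activations
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_descriptor(image_path: str) -> torch.Tensor:
    """Return a high-level descriptor for one image (illustrative helper)."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        features = backbone(preprocess(img).unsqueeze(0))
    return features.squeeze(0)  # 2048-D vector usable for indexing and similarity
```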

2.3 Multi-feature fusion methods

Over the years, different approaches have been proposed to combine global and local descriptors (color and texture [13, 33, 44, 45, 107], color and shape [2, 25, 75, 113, 116], shape and texture [92, 103], and color, shape, and texture [7, 8, 14, 73, 81, 83, 86, 121]). To a lesser extent, authors have also proposed combinations of low- and high-level descriptors [57, 58], and an ensemble of high-level descriptors [36].

The most common approach is to extract multiple descriptors and combine them using an early fusion approach [2, 7, 14, 25, 45, 81, 103, 113, 116, 121]. In early fusion, the descriptors are extracted and concatenated into a single feature vector. The resulting vector is used to index all the images in the CBIR system and to search for the most similar ones. Usually, it is assumed that all descriptors have the same importance. However, that is not necessarily true: descriptors may not yield the same results for different categories of images. An alternative approach is to use weights to early fuse the descriptors, obtained using the Particle Swarm Optimization (PSO) algorithm [33], genetic algorithms [75], or weighted functions [10, 83]. The fusion of descriptors at different levels can benefit CBIR systems, but it requires mechanisms for selecting appropriate weights, which are usually highly dependent on the dataset in use, and still involves a lot of experimentation (since the parameter tuning of the descriptors is carried out based on the analysis of their performance in the proposed CBIR). Moreover, although a larger number of features may better represent the discriminative properties of images, it may lead to the curse of dimensionality.

Previous works focused mainly on single-resolution processing of an image; however, this may not be sufficient to capture varying levels of detail, since an image contains both high- and low-resolution objects, as well as both large and small objects [92]. Another approach is to combine descriptors in a hierarchical way by processing an image at multiple resolutions [8, 44, 45, 92]. This way, features that were not detected at a certain resolution will be detected at another one. Wavelets offer good energy compaction and multi-resolution capability [8]. As such, LBP, Legendre moments, Gabor, or similar descriptors are combined using the Discrete Wavelet Transform (DWT) to extract shape and texture information from an image at multiple resolutions.

A considerable body of work has been devoted to combining descriptors to create a single CBIR system. Another possibility is to combine multiple CBIR systems, which may be less dependent on the task or type of images, while taking advantage of experiments already performed to create the individual CBIR systems. Some authors have proposed hierarchical approaches to combine low-level CBIR systems [73, 107], and low- and high-level CBIR systems [57, 58]. When combining multiple CBIR systems hierarchically, the main idea is to use a single CBIR system to find the most relevant images [58, 73] or discard irrelevant images [57, 107], and then apply a second CBIR system to refine the search.

Late-fusion techniques are also used to combine CBIR systems (mostly based on low-level features). In late fusion approaches, multiple CBIR systems are created (each using one or more early-fused descriptors to index and search for the images), and their results are combined. They are usually split into two major groups: (i) similarity score-based rank list fusion and (ii) order-based rank list fusion [6]. In the first group, the similarity scores of each image (in each retrieved list) are merged using an aggregation function (e.g., minimum, median, or maximum) to form the final search result. In the second, a revised retrieval list is created as a function of the position in which images appear in the different rank lists. Such fusion techniques tend to be more robust and efficient than early fusion techniques.

Finally, Hamreras et al. [36] proposed to take advantage of ensemble learning to combine different CNNs. However, its scope is limited: the main focus was the identification of good parameters to form the ensemble (the number of neural networks to be used, and the number of hidden neurons in each network).

All these methods, to some extent, require a huge experimental overhead to find out which are the best combinations of descriptors or CBIR systems for each possible domain/dataset. Thus, in this work, we extend the late-fusion method so that it dynamically assigns a greater weight to the best CBIR systems for the domain/dataset at hand, taking advantage of relevance feedback provided by the user (when available).

2.4 Relevance feedback

Relevance feedback has been used in CBIR systems to modify the retrieval process in order to generate perceptually and semantically more meaningful results by involving the user in the retrieval process [65, 104]. The main idea is to present to the user the results from a given query, collect feedback about whether or not those results are relevant, and perform a new query based on that information; these steps are carried out iteratively until the user is satisfied with the results.

The most common types of feedback are explicit and implicit. With explicit feedback, the user explicitly indicates which images are relevant or not relevant (binary relevance feedback) or how relevant each image is (graded relevance feedback). With implicit feedback, the system automatically infers the user’s feedback from their behavior.

Different approaches have been proposed to reformulate the query according to the feedback received: finding an optimized query feature vector using Rocchio’s algorithm [11, 46, 67, 114], modifying the similarity measure so that relevant images have a high similarity value [1, 124], exploiting images’ geometrical and discriminant structures to learn a semantic subspace [39, 120], or separating relevant and non-relevant images using Bayesian Networks [87], CNN [56, 76, 78, 104], Clustering [27], Logistic Regression [30], Optimum Forest algorithm [54], and Support Vector Machine [82, 97, 112].

Active learning has also been used to reduce the annotation effort, by selecting which images should be annotated by the users [82, 97]. Moreover, Tang et al. [97] combined different active learning relevance feedback approaches carried out simultaneously, and then fused their results to improve the initial query.

All the aforementioned methods to mitigate the semantic gap helped further the development of CBIR systems, but they come with a number of drawbacks. Early and hierarchical fusion involve a huge experimentation overhead to choose the best set of descriptors among descriptors of the same category or across categories. Furthermore, early approaches are prone to suffer from the so-called curse of dimensionality. With late-fusion approaches, the merging of the most similar images is done at query level, i.e., no knowledge of which CBIR system is better is acquired over time. Moreover, regardless of the fusion technique in use, most CBIR solutions in the literature suffer from a lack of interpretability regarding which descriptors or CBIR systems are the best for a given domain or type of images. Finally, relevance feedback relies on user feedback (which is not always available), and the retrieval process is repeated multiple times until the user is satisfied.

To address some of these drawbacks, we propose to take advantage of human feedback (when available), not to improve the results for the current query, but to improve the late-fusion process for subsequent queries. Thus, our focus is to reward the CBIR systems that made the best contributions to the final set of images retrieved for the user, giving them a greater weight in the late fusion for future queries.

3 Dynamic late fusion of CBIR using online learning

Given several existing CBIR systems (each encompassing different descriptors or combinations of descriptors), how can we combine them so as to dynamically reach at least the performance of the best CBIR system, without knowing a priori which are the best ones for a given domain? To tackle this question, we frame our late fusion technique as a problem of prediction with expert advice, using online learning to dynamically find the best CBIR systems in the ensemble, making the most of minimal human interaction.

We start by providing some background on the prediction with expert advice online learning framework (Section 3.1), and then we describe how we adapt it to late fuse multiple CBIR systems (Section 3.2).

3.1 Prediction with expert advice

A problem of prediction with expert advice can be seen as a repeated game between a forecaster and the environment, in which the forecaster resorts to a set of weighted experts to provide the best forecast [18]. At each round t, the forecaster F consults the predictions \({p_{k}^{t}}\) in the decision space \(\mathcal {A}\) made by each expert k. Considering the experts’ predictions, the forecaster makes its own prediction, \(\hat {p}^{t}_{F}\in \mathcal {A}\). At the same time, the environment reveals an outcome \(y^{t}\) in the outcome space \(\mathcal {Y}\).

In order to learn the experts’ weights, an online learning algorithm can be used. A well-established algorithm for prediction with expert advice is the Exponentially Weighted Average Forecaster (EWAF) [18]. In EWAF, the prediction \(\hat {p}^{t}_{F}\) made by the forecaster is given by (1):

$$ \hat{p}^{t}_{F} = \frac{{\sum}_{k=1}^{K}\omega_{k}^{t-1} {p_{k}^{t}}}{{\sum}_{k=1}^{K}\omega_{k}^{t-1}}. $$
(1)

At the end of each round, the forecaster and each of the experts receive a non-negative loss based on the outcome \(y^{t}\) revealed by the environment (\({\ell _{F}^{t}}\) and \({\ell _{k}^{t}}\), respectively):

$$ {\ell_{F}^{t}}, {\ell_{k}^{t}} : \mathcal{A} \times \mathcal{Y} \rightarrow \mathbb{R} $$
(2)

The weights \({\omega _{1}^{t}}, \ldots , {\omega _{K}^{t}}\) of each expert k are then updated according to the loss incurred by each expert as shown in (3).

$$ {\omega_{k}^{t}}=\omega_{k}^{t-1}e^{-{\eta\ell_{k}^{t}}} $$
(3)

After T rounds, by setting:

$$ \eta = \sqrt{\frac{8\log K}{T}} $$
(4)

it can be shown that the forecaster’s regret for not following the best expert’s advice is bounded as follows:

$$ \sum\limits_{t=1}^{T}{\ell_{F}^{t}}-\min\limits_{k=1,\ldots,K}\sum\limits_{t=1}^{T}{\ell^{t}_{k}}\leq\sqrt{\frac{T}{2}\log K} $$
(5)

i.e., the forecaster quickly converges to the performance of the best expert [18].
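
For reference, the following is a minimal sketch of EWAF for scalar predictions, implementing the weighted average in (1) and the exponential update in (3). Losses are assumed to lie in [0, 1], and the renormalization step is only for numerical stability (it does not change the forecaster’s predictions).

```python
import numpy as np

class EWAF:
    """Exponentially Weighted Average Forecaster over K experts (sketch)."""

    def __init__(self, n_experts: int, eta: float):
        self.eta = eta
        self.weights = np.full(n_experts, 1.0 / n_experts)  # uniform start

    def predict(self, expert_predictions: np.ndarray) -> float:
        # Equation (1): weighted average of the experts' predictions.
        return float(np.dot(self.weights, expert_predictions) / self.weights.sum())

    def update(self, expert_losses: np.ndarray) -> None:
        # Equation (3): exponentially down-weight experts with higher loss.
        self.weights *= np.exp(-self.eta * expert_losses)
        self.weights /= self.weights.sum()  # renormalize for numerical stability
```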

3.2 CBIR late fusion with expert advice

We frame the late fusion of multiple CBIR systems as a problem of prediction with expert advice: given an ensemble of K CBIR systems, each system corresponds to an expert k = 1,…,K, associated with a weight ωk (all experts start with the same weight, ωk = \(\frac {1}{K}\)); all the possible sets of images that can be retrieved (from the database of images of each expert) correspond to the decision space \(\mathcal {A}\); the late fusion of the CBIR systems thus corresponds to the forecaster, i.e., the forecaster’s decision is the final set of images to be retrieved, combining images from multiple systems in the ensemble.

An overview of the learning process is depicted in Fig. 3 and Algorithm 1, and can be summed up as follows. At each round t, a query image \(q^{t}\) is given as input to all the experts \(m_{1},\ldots,m_{K}\), and each returns the images most similar to the query according to its descriptor(s), \(retrieved^{t}_{k}\) (line 5). Based on the experts’ selections, the forecaster selects the final set of retrieved images \(queryResult^{t}\) (line 7). Both the forecaster’s and each expert’s set of images are then evaluated with a quality score, reflecting how similar the query image and the ones retrieved by each expert are according to one (or more) human evaluators (e.g., a user searching for similar images of their dog in a search engine, or a doctor using a system to obtain medical examinations similar to those of their patient or a specific diagnosis) (line 9). This quality score is used at the end of the round to update the experts’ weights (lines 11–12).

Algorithm 1
Fig. 3

Overview of our late fusion of multiple CBIR systems with expert advice. (figure best seen in color)

In this paper, we propose to late fuse the images retrieved by the experts in the ensemble by taking into account the weights learned using EWAF. Each expert contributes a number of images \({I_{k}^{t}}\) proportional to its weight (line 6), as shown in (6). Note that merging the results from multiple experts deviates from the traditional EWAF formulation, in which only a single expert is selected by the forecaster at a time (since prediction with expert advice typically deals with single-value predictions).

$$ {I_{k}^{t}} = \lfloor\omega_{k} * N \rceil $$
(6)

We sort the images in each retrieved set (either the forecaster’s or each expert’s) in ascending order according to their distance to the query image (the smaller the distance, the more similar the images are; a perfect match corresponds to a distance of 0). To avoid duplicates, we skip images that have already been added to the set to be retrieved by some expert. We consider first the images retrieved by the experts with a lower weight (i.e., those that performed worse). This way, we ensure that each expert k contributes \({I_{k}^{t}}\) images.
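
A minimal sketch of this merge step is shown below; it assumes each expert returns a ranked list of image identifiers (most similar first) and that the weights have already been updated by EWAF. Function and variable names are illustrative, not the paper’s actual implementation.

```python
import numpy as np

def fuse_retrieved(expert_results, weights, n_results):
    """Late fuse ranked lists: expert k contributes round(w_k * N) images, Equation (6)."""
    weights = np.asarray(weights, dtype=float)
    quotas = np.rint(weights / weights.sum() * n_results).astype(int)
    fused, seen = [], set()
    # Visit experts from lowest to highest weight; an expert skips images already
    # added by a previously visited expert and moves on to its next candidates,
    # so each expert still contributes its full quota of unique images.
    for k in np.argsort(weights):
        taken = 0
        for image_id in expert_results[k]:
            if taken == quotas[k]:
                break
            if image_id not in seen:
                fused.append(image_id)
                seen.add(image_id)
                taken += 1
    return fused[:n_results]
```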

A key condition for applying online learning is the availability of feedback (which, in the case of EWAF, is based on the outcome of the environment). To simulate the feedback from human evaluators in a real-world scenario, we used the images’ category present in the datasets, curated by human annotators, as a feedback source to compute the loss and update the weight of each CBIR system. In other words, images belonging to the same category are considered as relevant for the remaining images within that category. We thus compute the loss for each expert k at a round t as:

$$ {\ell_{k}^{t}} = 1 - sim(relevant^{t}, retrieved_{k}^{t}) $$
(7)

where sim is computed using a set similarity measure that quantifies how similar the sets of relevant images \(relevant^{t}\) and retrieved images \(retrieved_{k}^{t}\) are for each expert k (the set similarity measures we experimented with are listed in Section 5.2). The weights of all the CBIR systems are then updated based on the loss received, according to (3).
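
Concretely, the loss in (7) and the update in (3) can be sketched as follows, where `set_similarity` stands for any of the measures listed in Section 5.2 and `eta` is the learning rate from (4).

```python
import numpy as np

def update_expert_weights(weights, eta, relevant, retrieved_per_expert, set_similarity):
    """Apply Equation (7) (loss = 1 - sim) followed by Equation (3) (exponential update)."""
    losses = np.array([1.0 - set_similarity(relevant, retrieved)
                       for retrieved in retrieved_per_expert])
    return np.asarray(weights) * np.exp(-eta * losses)
```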

Note that although we rely on the notion of relevance feedback from the user to learn the weights, we do not follow the traditional setup, i.e., we do not refine the query by iteratively asking which images are relevant until the user is satisfied with the results retrieved. We only need the user’s input once per query.

Fig. 4

Example of an iteration using our online CBIR setting. The orange arrows represent the new weight for each expert. (figure best seen in color)

In Fig. 4, we present a snapshot of the learning process, in which we consider the same query image as in Fig. 3, three CBIR experts, \(numResults^{t} = 8\), and the Jaccard index to compute the similarity between the sets (see Section 5.2). Following (6) and considering the weights assigned to each expert, Expert 3 will contribute five images to the final set, Expert 1 three images, and Expert 2 none, excluding possible duplicates. If we had considered the traditional EWAF setting, in which only one expert is selected by the forecaster, we would have a precision of 75% (6 out of 8 relevant images successfully retrieved). By late fusing the retrieved images from several CBIR experts, the precision increases to 87.5% (7 out of 8). Moreover, if there is an expert that clearly outperforms the others, its weight converges to 1, leading to EWAF’s traditional behavior of choosing only one expert.

4 Implementation details

We created two late fusion CBIR solutions based on expert advice. In the first solution, ExpertosLF_V, we considered four CBIR systems as experts representing low-level information (color, shape, texture, and joint). The first three represent the early fusion of color, shape, or texture descriptors, respectively. The joint expert represents the early fusion of three existing descriptors that already encompass multiple visual characteristics: color, shape, and texture. In the second solution, ExpertosLF_VS, we added a fifth CBIR expert that represents the semantic information.

One of our goals is to evaluate whether the experts in the resulting ensemble are plausible for the domain at hand. We focused on low- and high-level descriptors in order to study how the use of different kinds of information (visual or semantic) varies across the different domains, and whether the resulting ensemble reflects it.

Each CBIR system follows the typical architecture of a retrieval system, with a Database to store the images, a Descriptor Extraction module, a Descriptors DB (indexing structure), and a Similarity Comparison algorithm. In the following, we present the list of descriptors under consideration (Section 4.1), and detail each component of the CBIR architecture (Sections 4.2 and 4.3).

4.1 Low and High-level descriptors

We selected a diverse set of low-level descriptors representing color (Auto Color Correlogram (ACC) [38], Color Histogram (CH) [89], Itten Contrasts (IC) [41], Opponent Histogram (OH) [105], and Reference Color Similarity (RCS) [48]), shape (Edge Histogram (EH) [19] and Edges), texture (Tamura [96] and Haralick [37]), joint color, shape, and texture information (Color and Edge Directive Descriptor (CEDD) [20], Joint Composite Descriptor (JCD) [22], and Fuzzy Color and Texture Histogram (FCTH) [21]), as well as high-level descriptors that represent images’ semantic content using tags (Adjective-Noun Pairs (ANP), Adjectives, Nouns, and General Concepts (GC)) (see Tables 1 and 2).

Table 1 Summary of the color, shape, texture, and joint descriptors selected to study. The column ‘#’ indicates the feature vector length
Table 2 Summary of the semantic descriptors selected to study. The column ‘#’ indicates the feature vector length

The majority of the aforementioned descriptors were computed using jFeatureLib [35] and LIRE [61]. The remaining descriptors were implemented by us. The Edges descriptor can be seen as a simplified version of EH, where the number of edges of an image along the vertical, horizontal, 45°, 135°, non-directional, and all directions is counted and represented as a descriptor. Our implementation of the IC follows the details presented in [41].

We used SentiBank [15] to extract ANPs. Each image was annotated with the 10 ANPs with the highest probability. The ANPs descriptor has a dimensionality of 2089 (i.e., the number of pairs that can be identified by SentiBank), and the probability of each ANP was used as a feature. For each image, we divided each of the 10 ANPs into its adjective and noun, and computed the average of the probabilities (considering how many times each adjective or noun occurs in the image) to create the Adjectives and Nouns descriptors. Finally, the semantic tags were obtained automatically, using the Clarifai API deep learning pre-trained General model [117], to avoid relying on human-generated tags. The General model computes the probability of the presence of relevant general concepts in the image. It is able to identify over 11,000 concepts within an image, but the set of possible concepts is not known a priori. Furthermore, each image can only be annotated with at most 200 concepts. To devise the final set of most relevant concepts among the possible 11,000 general concepts identified by the model, we annotated each image from all the datasets used in our study (with 200 concepts). The probability given to each concept for each image is used as a feature to create the GC descriptor.
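
As an illustration of how the Adjectives and Nouns descriptors are derived from the top ANPs, the sketch below assumes each ANP is given as an "adjective noun" string together with its probability, and that fixed adjective and noun vocabularies define the descriptor positions; all names are illustrative.

```python
from collections import defaultdict

def _average_over_vocab(probs_by_term, vocabulary):
    """Fixed-length vector: mean probability of each vocabulary term (0 if absent)."""
    return [sum(probs_by_term[t]) / len(probs_by_term[t]) if probs_by_term[t] else 0.0
            for t in vocabulary]

def adjective_noun_descriptors(top_anps, adjective_vocab, noun_vocab):
    """Build the Adjectives and Nouns descriptors of one image from its top ANPs."""
    adj_probs, noun_probs = defaultdict(list), defaultdict(list)
    for anp, prob in top_anps:                      # e.g., ("shiny car", 0.83)
        adjective, noun = anp.split(" ", 1)
        adj_probs[adjective].append(prob)
        noun_probs[noun].append(prob)
    return (_average_over_vocab(adj_probs, adjective_vocab),
            _average_over_vocab(noun_probs, noun_vocab))
```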

4.2 Descriptor Extraction

Before computing any descriptor, we first resized each image to a maximum of 400 pixels on their larger dimension (width or height), keeping the original aspect ratio. Additionally, images of Digital Imaging and Communications in Medicine (DICOM) format were converted to the RGB color space (PNG or JPEG format).

Since we did not know a priori which were the best descriptors within each category, we conducted some preliminary tests. We started by testing each descriptor individually. All visual descriptors proved to be relevant for at least one of the datasets used, whereas the GC descriptor was consistently better than the remaining high-level descriptors (even when combined with the remaining high-level descriptors, those descriptors did not improve the system’s performance compared to using GC alone, for any of the datasets).

As mentioned earlier, each type of descriptor usually captures only one aspect of an image property. Thus, there is no single “best” descriptor that leads to accurate results regardless of the setting, which means that a combination of descriptors is usually needed to provide adequate retrieval results [89]. As such, we tested multiple early fused combinations of descriptors within each category (applying min-max normalization to each feature before fusing them). Given the high dimensionality of some descriptors, we applied Principal Component Analysis (PCA) to reduce their dimensionality in all the tests performed. We tested different numbers of principal components, accounting for 90 to 100% of the variance; the best result was achieved for 90%. Besides ensuring that the system was usable in a timely manner, this also improved the discriminative power of the resulting vector.
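
A minimal sketch of this early fusion step, assuming scikit-learn as a stand-in for the tools actually used: descriptors are min-max normalized per feature, concatenated, and reduced with PCA keeping enough components to explain 90% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def early_fuse(descriptor_matrices, variance_ratio=0.90):
    """Each matrix holds one descriptor (rows = images); return the fused, reduced vectors."""
    normalized = [MinMaxScaler().fit_transform(d) for d in descriptor_matrices]
    fused = np.hstack(normalized)                  # simple concatenation (early fusion)
    pca = PCA(n_components=variance_ratio)         # float in (0, 1): keep 90% of variance
    return pca.fit_transform(fused), pca
```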

4.3 Descriptors DB and similarity comparison

Indexing and searching play a fundamental role in Information Retrieval when dealing efficiently with large collections of data (in our case, image descriptors). Thus, we used the NB-Tree [34], an efficient indexing structure for high-dimensional data points, which exhibits low insertion and searching times. For the similarity comparison between the descriptor(s) computed for the query image and those available in the descriptors database, we used the k-Nearest Neighbors (kNN) query provided by the NB-Tree. kNN is commonly used in content-based retrieval, and we chose this implementation since it takes advantage of the indexing structure, which was optimized for high-dimensional data points. As expected, the smaller the distance between the descriptor(s) of the query image and the descriptor(s) of each retrieved image, the more similar the images are. Note that the NB-Tree was used as the kNN algorithm and indexing structure both in our dynamic late-fusion CBIR solutions and in the state of the art early and late-fusion techniques used for comparison in the experimental tests.
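
Since the NB-Tree implementation is not a commonly packaged library, the sketch below uses scikit-learn’s NearestNeighbors as a stand-in for the indexing structure and kNN query; the Euclidean metric and the `DescriptorIndex` class are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class DescriptorIndex:
    """Stand-in for the Descriptors DB plus kNN similarity comparison (sketch)."""

    def __init__(self, descriptors: np.ndarray, image_ids: list):
        self.image_ids = image_ids
        self.index = NearestNeighbors(metric="euclidean").fit(descriptors)

    def query(self, query_descriptor: np.ndarray, k: int = 10):
        # Smaller distances mean more similar images (0 is a perfect match).
        distances, indices = self.index.kneighbors(query_descriptor.reshape(1, -1),
                                                   n_neighbors=k)
        return [(self.image_ids[i], d) for i, d in zip(indices[0], distances[0])]
```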

5 Experimental setup

In this section, we present our experimental setup, in terms of: 1) the datasets in which we performed our experiments; 2) the similarity measures tested as a loss function for the update of experts’ weights; 3) the state of the art early and late fusion techniques under comparison; 4) the evaluation metrics used to report and analyze the performance of each retrieval system; 5) the computing infrastructure in which we ran our experiments.

5.1 Datasets

Experiments were conducted on 13 benchmark datasets divided into three main categories: Biomedical, Real, and Sketch (see Table 3).

Table 3 Summary of the datasets

In the Biomedical category, we used the BrainCE-MRI, BreakHis, COVID19-Rx, HAM10000, IRMA, and PlantPathology datasets. They depict brain and breast tumors, COVID-19, bone fractures, pneumonia, pigmented skin lesions, and leaf diseases (see Fig. 5). BrainCE-MRI contains 3064 T1-weighted contrast-enhanced images of three types of brain tumor: glioma, meningioma, and pituitary. BreakHis contains 7909 histopathological images of benign breast tumors (adenosis, fibroadenoma, phyllodes tumor, and tubular adenoma), and breast cancer (carcinoma, lobular carcinoma, mucinous carcinoma, and papillary carcinoma). COVID19-Rx contains 3886 chest X-ray images for COVID-19 positive cases along with normal and viral pneumonia images. HAM10000 contains 10015 multi-source dermatoscopic images of pigmented skin lesions: actinic keratoses and intraepithelial carcinoma/Bowen’s disease, basal cell carcinoma, benign keratosis-like lesions, dermatofibroma, melanocytic nevi, melanoma, and vascular lesions. IRMA, from the ImageCLEF initiative, contains 14410 images of scanned X-rays of various human body parts. PlantPathology contains 1821 high-quality, real-life symptom images of apple foliar diseases, with variable illumination, angles, surfaces, and noise.

Fig. 5

Example images from the Biomedical datasets. (figure best seen in color)

In the Real category, we used the CopyDays, COREL1K, COREL10K, and GHIM10K datasets, which depict realistic images of diverse aspects of everyday life (see Fig. 6). CopyDays contains 3212 personal holiday photos that were artificially manipulated (cropped, scaled, and strongly attacked). COREL1K contains 1000 images depicting African people, beaches, buildings, buses, dinosaurs, elephants, flowers, foods, horses, and mountains. COREL10K contains 10000 images representing buildings, sunsets, fish, flowers, cars, mountains, tigers, etc. GHIM10K contains 10000 images depicting cars, insects, mountains, ships, sunsets, etc.

Fig. 6

Example images from the Real datasets. (figure best seen in color)

In the Sketch category, we used $P, ImiSketchS, and mCali (see Fig. 7). These datasets have a large variety in the types of symbols represented (e.g. digits, furniture, mathematical, smiles), and the way they were drawn. $P contains 4802 images, drawn by 10 users, with gestures (multi-stroke without rotation) representing geometric shapes, letters or symbols; ImiSketchS contains 1871 images of furniture symbols (e.g., doors, or tables) drawn with multi-stroke and rotation; mCali contains 8159 symbols, drawn by 17 users, with gestures (with multi-stroke and rotation) representing geometric shapes, smiles, generic symbols and letters.

Fig. 7

Example images from the Sketch datasets. (figure best seen in color)

5.2 Similarity Measures

In order to compute the loss for each expert in our late fusion solution with expert advice, we tested four measures to quantify how similar the sets of relevant and retrieved images are: Jaccard index [42] (8), Otsuka-Ochiai coefficient [70, 72] (9), Overlap coefficient [74] (10), and Sørensen-Dice index [31, 95] (11).

$$ jaccard = \frac{\big| relevant \cap retrieved \big|} {\big| relevant \big| + \big| retrieved \big| - \big| relevant \cap retrieved \big|} $$
(8)
$$ otsukaOchiai = \frac{\big| relevant \cap retrieved \big|} {\sqrt{\Big(\big| relevant \big| * \big| retrieved \big| \Big)}} $$
(9)
$$ overlap = \frac{\big| relevant \cap retrieved \big|} {min \Big(\big| relevant \big| , \big| retrieved \big| \Big)} $$
(10)
$$ sorensenDice = \frac{2\ \big| relevant \cap retrieved \big|} {\big| relevant \big| + \big| retrieved \big|} $$
(11)

Sørensen-Dice and Jaccard are more rigid measures, since they penalize the existence of more retrieved images than relevant ones, and vice-versa. Otsuka-Ochiai is less rigid than the aforementioned measures, since it penalizes differences in cardinality between the sets of retrieved and relevant images less heavily, so it can be seen as a more balanced measure. Finally, Overlap prioritizes the existence of retrieved images that are relevant, even when the cardinalities of the sets differ, thus being the least restrictive measure.
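
The four measures (8)–(11) can be computed directly over Python sets of image identifiers, as sketched below (guards against empty sets are omitted for brevity).

```python
import math

def jaccard(relevant: set, retrieved: set) -> float:            # Equation (8)
    inter = len(relevant & retrieved)
    return inter / (len(relevant) + len(retrieved) - inter)

def otsuka_ochiai(relevant: set, retrieved: set) -> float:       # Equation (9)
    return len(relevant & retrieved) / math.sqrt(len(relevant) * len(retrieved))

def overlap(relevant: set, retrieved: set) -> float:             # Equation (10)
    return len(relevant & retrieved) / min(len(relevant), len(retrieved))

def sorensen_dice(relevant: set, retrieved: set) -> float:       # Equation (11)
    return 2 * len(relevant & retrieved) / (len(relevant) + len(retrieved))
```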

5.3 State of the art fusion techniques

To assess the quality of the ensemble of CBIR systems produced by our technique, we compared it to well-known state of the art fusion techniques.

The first technique is the widely used early fusion of the descriptors, followed by PCA to reduce the high dimensionality of the resulting feature vector (EF). Since we wanted to ensure that our technique performs as well as early fusion, we used EF as a baseline in our work. Next, we considered two late fusion techniques to combine the results of multiple CBIR systems. Let D be the set of all images in the database, and i an image from this set (\(i \in D\)). Each CBIR system j returns a list of the images most similar to the query (Lj), where each image i has an associated normalized similarity score (representing how similar it is to the query image), denoted as Sj(i). The goal of each late fusion method is to produce a final ranked list (Lf).

Late fusion techniques are usually split into two major groups: (i) order-based rank list fusion and (ii) similarity score-based rank list fusion. For the first group, we implemented a method based on the frequency of occurrence of each image in the lists Lj (FreqRankLF) [62]: Lf is sorted in descending order of image frequency. For the second group, we implemented a method based on the similarity score (SimRankLF) [69]: the scores Sj(i) are arranged in ascending order and the final list Lf is generated. If an image is present in more than one Lj, the lowest score is considered in the merging process.
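
Minimal sketches of the two baselines as described above; each result list is assumed to be a sequence of (image_id, score) pairs with normalized scores, and the function names mirror the acronyms rather than any published implementation.

```python
from collections import Counter

def freq_rank_lf(result_lists):
    """FreqRankLF: rank images by how often they appear across the lists L_j."""
    counts = Counter(image_id for results in result_lists for image_id, _ in results)
    return [image_id for image_id, _ in counts.most_common()]

def sim_rank_lf(result_lists):
    """SimRankLF: keep the lowest score per image, then sort the scores in ascending order."""
    best = {}
    for results in result_lists:
        for image_id, score in results:
            best[image_id] = min(score, best.get(image_id, float("inf")))
    return sorted(best, key=best.get)
```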

5.4 Retrieval metrics

To evaluate the performance of our approach, we used precision (12), recall (13), F1 score (14), and AveP (15). Precision allows us to identify the percentage of retrieved images that are relevant, while recall gives the percentage of relevant images that are successfully retrieved. The F1 score combines precision and recall in a balanced way (i.e., both metrics are evenly weighted). AveP evaluates whether the relevant images are ranked near the top of the retrieved list.

All the results are reported in terms of the average precision at the top-10 retrieved images (avgP@10), average F1 score (avgF1), and the mean Average Precision (mAP). For each query, the number of retrieved results is set to be equal to the number of relevant images for that query. Images belonging to the same category within the same dataset are considered as relevant. With this setup, precision and recall are equal to the F1 score, thus we did not report them individually.

$$ precision = \frac{\big| relevant \cap retrieved \big|} {\big|retrieved \big|} $$
(12)
$$ recall = \frac{\big| relevant \cap retrieved \big|} {\big|relevant \big|} $$
(13)
$$ avg F_{1} = \frac{{\sum}_{q=1}^{Q} F_{1} (q)} {Q}, F_{1} = \frac{2 * precision * recall} {precision + recall} $$
(14)
$$ mAP = \frac{{\sum}_{q=1}^{Q} AveP(q)} {Q}, AveP = \frac{{\sum}_{k=1}^{n} \big(precision(k)*relevant(k)\big)} {\big| relevant \big|} $$
(15)
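
For reference, the reported metrics can be sketched as follows, assuming each query produces a ranked list of image identifiers and a set of relevant identifiers.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved images that are relevant (used for avgP@10)."""
    return sum(1 for image_id in ranked[:k] if image_id in relevant) / k

def average_precision(ranked, relevant):
    """AveP in Equation (15): precision accumulated only at the relevant positions."""
    hits, accumulated = 0, 0.0
    for position, image_id in enumerate(ranked, start=1):
        if image_id in relevant:
            hits += 1
            accumulated += hits / position
    return accumulated / len(relevant)

def mean_average_precision(ranked_lists, relevant_sets):
    """mAP in Equation (15): mean of AveP over all queries."""
    return sum(average_precision(r, s)
               for r, s in zip(ranked_lists, relevant_sets)) / len(ranked_lists)
```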

5.5 Computing infrastructure

All the experiments were carried out on the same PC, running Arch Linux 5.12.7, with an Intel Core i7-8700 3.20GHz CPU, 64GB of memory, and two GeForce RTX 2080 8GB GDDR6.

6 Experimental results

In this section, we report the performance of our expert-based solutions. Recall that ExpertosLF_V encompasses four CBIR systems as experts: color (ACC, CH, IC, OH, and RCS), shape (EH and Edges), texture (Tamura and Haralick), and joint (CEDD, FCTH, and JCD, which combine color, shape, and texture information into a single expert). In ExpertosLF_VS, a fifth CBIR expert was added to represent the semantic information (GC). We considered the similarity measures presented in Section 5.2 to update the experts’ weights. We report the best one for each solution, although overall the differences between the measures are negligible.

We compared the performance of each solution to the following state of the art fusion techniques: a) early fusion of all the descriptors that compose our experts (EF); b) late fusion of the experts’ results using frequency (FreqRankLF); c) late fusion of the experts’ results using similarity (SimRankLF). We analysed our expert-based solutions in an online setting, in which we assumed that user feedback is always available (Section 6.1), and in an offline setting, in which we assumed that feedback is no longer available (Section 6.2).

6.1 Online setting

We started by shuffling each dataset and randomly selecting 1000 images to be used as queries. After each query, the weights of each CBIR expert in each expert-based CBIR solution were updated according to how relevant the retrieved images were.

6.1.1 Biomedical

In Tables 4 and 5, we present the results for the ExpertosLF_V and ExpertosLF_VS solutions, respectively, together with the best CBIR expert in each ensemble and the aforementioned state of the art techniques, for comparison purposes.

Table 4 Results for Biomedical datasets using only visual descriptors. Bold values indicate the best results
Table 5 Results for Biomedical datasets using visual and semantic descriptors. Bold values indicate the best results

For ExpertosLF_V, the best individual CBIR systems are shape and color. For the majority of the datasets, ExpertosLF_V performs as well as the best individual CBIR system, outperforming it on the HAM10000 dataset. When we include the semantic CBIR in the ensemble, the results achieved by ExpertosLF_VS are very similar to the ones obtained by ExpertosLF_V. The only exception is the PlantPathology dataset, which benefits from using the semantic expert. The Sørensen-Dice measure yields better results for ExpertosLF_V, while Overlap is equally useful when we add semantic information.

In Fig. 8, for each dataset, we present the evolution of the weights for each expert in the ExpertosLF_V and ExpertosLF_VS solutions, and the evolution of the F1 over the queries. Overall, both solutions converge quickly to the best individual expert. An interesting exception is COVID19-Rx, for which ExpertosLF_VS converges to a combination of three experts: shape, joint, and semantic. With this, we are able to achieve better results than the ones provided by the best expert individually (an increase of ≈ 0.02 for avgF1 and mAP).

To compare the different fusion techniques per domain, we report the difference between our technique and the remaining ones (Δ), averaged across the datasets of each domain. ExpertosLF_V and ExpertosLF_VS usually return more relevant images at the top of the retrieved set. This is also supported by an avgP@10 better than the avgF1, with an increase varying from 0.14 (BrainCE-MRI) to 0.41 (BreakHis). The early fusion of all experts (EF) is slightly better than ours when focusing on the top ten retrieved images (ΔavgP@10 = − 0.04 ± 0.07); for the remaining metrics, they perform similarly (ΔavgF1 = 0.01 ± 0.02, ΔmAP = 0.00 ± 0.02). The performance of our proposed technique surpasses both FreqRankLF (ΔavgP@10 = 0.48 ± 0.09, ΔavgF1 = 0.05 ± 0.04, ΔmAP = 0.07 ± 0.06) and SimRankLF (ΔavgP@10 = 0.11 ± 0.09, ΔavgF1 = 0.06 ± 0.04, ΔmAP = 0.07 ± 0.06).

Fig. 8

Evolution of the weights for each expert, and evolution of F1 over queries. (figure best seen in color)

6.1.2 Real

The results for ExpertosLF_V and ExpertosLF_VS solutions, best CBIR expert in each ensemble, and fusion techniques are presented in Tables 6 and 7.

Table 6 Results for Real datasets using only visual descriptors. Bold values indicate the best results
Table 7 Results for Real datasets using visual and semantic descriptors. Bold values indicate the best results

For ExpertosLF_V, the best individual CBIR systems are color and joint. ExpertosLF_V performs as well as the best individual CBIR system for almost all datasets, and outperforms the individual expert on the COREL1K dataset (except for avgP@10). When considering semantic information in the ensemble (ExpertosLF_VS), the semantic expert becomes the best CBIR system for COREL1K, COREL10K, and GHIM10K. Overall, the Sørensen-Dice and Otsuka-Ochiai similarity measures yield the best results for the majority of the datasets.

Figure 9 depicts the evolution of the weights for each expert, as well as the evolution of the F1 over queries. ExpertosLF_V and ExpertosLF_VS quickly converge to the best individual expert. The only exception is the CopyDays dataset, for which ExpertosLF_V converges to a combination of two experts: color and joint. The result achieved by the combination slightly outperforms that of the best individual expert (an increase of ≈ 0.04 for both avgF1 and mAP). Overall, our ExpertosLF_V and ExpertosLF_VS solutions have a very good retrieval performance, in particular with the inclusion of the semantic expert. Many relevant images are successfully retrieved, with more relevant images at the top of the retrieved set (avgP@10 is slightly better than the avgF1 for all datasets).

Fig. 9

Evolution of the weights for each expert, and evolution of F1 over queries. (figure best seen in color)

When considering only visual experts, EF performs slightly better than our technique (ΔavgP@10 = − 0.07 ± 0.07, ΔavgF1 = − 0.01 ± 0.03, ΔmAP = − 0.03 ± 0.06). Considering both visual and semantic experts, on average, our technique surpasses all the remaining techniques for all the metrics. In particular, it is considerably better than SimRankLF (ΔavgP@10 = 0.39 ± 0.32, ΔavgF1 = 0.38 ± 0.22, ΔmAP = 0.43 ± 0.25).

6.1.3 Sketch

In Tables 8 and 9, the results for the ExpertosLF_V and ExpertosLF_VS solutions, the best CBIR expert in each ensemble, and the state of the art fusion techniques are presented.

Table 8 Results for Sketch datasets using only visual descriptors. Bold values indicate the best results
Table 9 Results for Sketch datasets using visual and semantic descriptors. Bold values indicate the best results

Considering only visual descriptors, the best individual CBIR systems are shape and joint. For the $P and mCali datasets, ExpertosLF_V performs as well as the best individual CBIR system, while for ImiSketchS it outperforms the best individual expert. All datasets benefit from the inclusion of semantic information in ExpertosLF_VS. Overall, the Sørensen-Dice similarity measure yields better results.

In Fig. 10, we present the evolution of the weights for each expert, and the evolution of the F1 over queries. For the $P dataset, we can see that the ExpertosLF_VS solution converges to a combination of the best experts: shape and semantic. For the ImiSketchS dataset, the ExpertosLF_V solution also converges to a combination of the best experts: shape and joint. Our solutions return more relevant images at the top of the retrieved set (avgP@10 is always considerably better than the avgF1, with increases varying from 0.147 (ImiSketchS) to 0.375 (mCali)).

Fig. 10

Evolution of the weights for each expert, and evolution of F1 over queries. (figure best seen in color)

The early fusion of all experts, EF, achieves slightly better results when compared to ours, in particular when considering only visual experts (ΔavgP@10 = − 0.09 ± 0.07, ΔavgF1 = − 0.02 ± 0.02, ΔmAP = − 0.03 ± 0.01). Compared with the late fusion techniques, our technique achieves overall better results, being particularly better than SimRankLF (ΔavgP@10 = 0.38 ± 0.11, ΔavgF1 = 0.21 ± 0.06, ΔmAP = 0.26 ± 0.10) when considering both visual and semantic experts.

For all the domains under study, when considering both visual and semantic experts, our late fusion technique achieved, as desirable, similar performance to the baseline (ΔavgP@10 = − 0.03 ± 0.05, ΔavgF1 = 0.01 ± 0.03, ΔmAP = 0.01 ± 0.04). We would like to emphasize that our technique is not expected to always achieve better results than the baseline, since it needs a few queries to learn the best ensemble. Yet, our technique achieved better results than the two late fusion techniques tested: SimRankLF (ΔavgP@10 = 0.25 ± 0.23, ΔavgF1 = 0.19 ± 0.19, ΔmAP = 0.22 ± 0.21) and FreqRankLF (ΔavgP@10 = 0.44 ± 0.19, ΔavgF1 = 0.03 ± 0.04, ΔmAP = 0.09 ± 0.06).

6.2 Offline setting

Our late fusion technique relies on the existence of feedback from its users to gauge how well the experts are behaving, but such feedback may not always be available. Thus, we studied how many queries need to receive feedback from users in order to successfully learn the best set of weights to apply in an offline setting (for a given dataset of images).

For each dataset, we used the set of weights learned in the online setting up to learning round X to create the ExpertosLF_V and ExpertosLF_VS solutions (using our weighted late fusion technique). We tested multiple X values: 25, 50, 75, 100, 125, 250, 500, and 1000. Contrary to what happens in the online setting, the weights used are always the same throughout the offline queries, i.e., they are never updated.

In the offline setting, only the remaining images for each dataset are used as queries (i.e., we did not consider the images used as learning rounds in the online setting): 800 images for CopyDays, ImiSketchS, and PlantPathology, 2000 for BrainCE-MRI, COVID19-Rx, and $P, 6000 for BreakHis and mCali, and 9000 for COREL10K, GHIM10K, HAM10000, and IRMA datasets. We did not use COREL1K because it only has 1000 images.

This way, we ensure that the weights learned in the online setting are independent of the queries used in the offline setting. Moreover, it allows us to evaluate the quality of the ensembles learned using different query sizes, in particular, whether the performance deteriorates with the increase of the number of unseen queries.

6.2.1 Biomedical

Figure 11 depicts the evolution of the F1 for each expert individually and for both expert-based solutions, ExpertosLF_V and ExpertosLF_VS. For BrainCE-MRI (Fig. 11a), both ExpertosLF_V and ExpertosLF_VS surpass the best expert’s performance using the weights learnt up to X = 25 and 50, keeping a similar performance after that. For BreakHis (Fig. 11b) and IRMA (Fig. 11d), both ExpertosLF_V and ExpertosLF_VS achieve a performance similar to the best expert for X = 50. For COVID19-Rx (Fig. 11c), ExpertosLF_V performs similarly to the best expert for X = 25. ExpertosLF_VS also performs similarly to the best expert for X = 25, and it ends up surpassing it (best performance achieved for X = 100, with an avgF1 of 0.627 against 0.605 for the shape expert). HAM10000 (Fig. 11e) and PlantPathology (Fig. 11f) are the datasets for which our solutions take the longest to converge. On HAM10000, ExpertosLF_VS does not converge to the best expert’s performance until X = 1000, while for PlantPathology, ExpertosLF_V converges at X = 500. The remaining solutions converge at X = 50 (PlantPathology) and X = 75 (HAM10000).

Fig. 11

Biomedical: PlantPathology (Q = 800), BrainCE-MRI, COVID19-Rx (Q = 2000), BreakHis (Q = 6000), and IRMA and HAM10000 (Q = 9000). (figure best seen in color)

6.2.2 Real

In Fig. 12, we present the evolution of the F1 for each expert individually, and both expert-based solutions (ExpertosLF_V and ExpertosLF_VS). For all datasets, the weighted ensembles obtained from our expert-based solutions very quickly achieve the performance of the best individual expert.

Fig. 12

Real. CopyDays (Q = 800), COREL10K, and GHIM10K (Q = 9000). (figure best seen in color)

In CopyDays (Fig. 12a) and COREL10K (Fig. 12b), ExpertosLF_V achieves a performance similar to that of the best expert for X = 25; for COREL10K, ExpertosLF_VS also achieves a similar performance for X = 25, while for CopyDays, it does so for X = 100, and it performs even better for X = 75. In GHIM10K (Fig. 12c), ExpertosLF_V achieves the best performance for X = 75, being marginally better at X = 150, while ExpertosLF_VS quickly converges to the best expert at X = 25.

6.2.3 Sketch

Figure 13 depicts the evolution of the F1. Once again, for all datasets, the ensembles learnt with our solutions very quickly achieve (or surpass) the performance of the best individual expert. For $P (Fig. 13a), ImiSketchS (Fig. 13b), and mCali (Fig. 13c), ExpertosLF_V achieves the performance of the best expert for X = 25, and surpasses it for X = 50 (for ImiSketchS) and 75 (mCali). ExpertosLF_VS achieves the best expert’s performance for X = 50 for ImiSketchS, X = 25 for $P (surpassing the best expert performance for X = 125), and X = 25 for mCali.

Fig. 13
figure 13

Sketch. ImiSketchS (Q = 800), $P (Q = 2000), and mCali (Q = 6000). (figure best seen in color)

To sum up these results: in the online setting, for all the domains under study, our late fusion technique achieves similar results to the early fusion of all the experts (with the exception of the avgP@10 metric), and surpasses the FreqRankLF and SimRankLF late fusion techniques. We also validated that our solutions, ExpertosLF_V and ExpertosLF_VS, converge to the performance of the best experts individually, surpassing them in some cases. For the offline setting, the weighted ensembles obtained from our solutions very quickly achieve (or surpass) the performance of the best individual expert for almost all datasets.

7 Discussion

In this section, we discuss the results obtained in our experiments in the light of our research questions.

7.1 RQ1) Does our late fusion technique improve retrieval performance?

Our late fusion technique achieved similar or better results than the baseline (EF), and surpassed the FreqRankLF and SimRankLF late fusion techniques.

We believe that our technique achieves the best results because we acknowledge the subjectivity of human perception of image similarity by including human annotators in the loop to learn which CBIR systems are more suitable for the different domains. Furthermore, the fact that the best-performing CBIR systems vary across domains illustrates how task dependency affects the quality of the retrieval results: if, instead of using our dynamic ensemble of CBIR systems, one had committed to a single CBIR system, it would perform inconsistently across domains/tasks.

Moreover, our technique allows creating ensembles adapted to each domain without introducing a time overhead in the retrieval process. To assess this, we considered the largest dataset of each domain, using visual and semantic experts. We performed five runs to collect the average elapsed time of each query for each dataset and fusion technique, which we report in Fig. 14.

Fig. 14
figure 14

Average elapsed time of performing a query on the datasets IRMA, GHIM10K, and mCali, using different fusion approaches. (figure best seen in color)

Our technique performs queries more efficiently than FreqRankLF and SimRankLF. It takes slightly longer to compute results than the baseline (EF), but creating the CBIR system for the baseline (extracting and indexing the descriptors) is slower than for any of the late fusion techniques (e.g., for the IRMA dataset, creation takes around 20 hours with late fusion and 27 hours with early fusion).
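For reference, this measurement protocol can be reproduced with a simple harness such as the sketch below, where `retrieve` is a placeholder callable standing in for any of the fusion approaches under comparison (the interface is an assumption for illustration).

import time
import numpy as np

def average_query_time(retrieve, queries, runs=5):
    # Average elapsed time per query over several runs, as in our protocol.
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for q in queries:
            retrieve(q)
        per_run.append((time.perf_counter() - start) / len(queries))
    return float(np.mean(per_run)), float(np.std(per_run))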

7.2 RQ2) Does the resulting ensemble perform as well as the best individual CBIR?

As we have seen across the different domains for all the datasets under evaluation, our late fusion solutions based on expert advice indeed quickly converged to the performance of the best CBIR expert. They usually needed fewer than 25 queries to converge to the most suitable combination of weights for the CBIR experts.

Our solutions were also very quick to re-adapt the weight distribution to follow the current best CBIR. This can be observed in the plots depicting the evolution of the F1 over queries. For the majority of the datasets, our solutions achieved (or surpassed) the precision of the best CBIR expert(s) at a recall cut-off of 0.1 (BreakHis, PlantPathology, CopyDays, COREL1K, COREL10K, GHIM10K, ImiSketchS, and mCali), 0.2 (BrainCE-MRI and COVID19-Rx), or 0.3 (IRMA).

We observed a possible limitation of our technique: it tends to converge to a single expert in the ensemble, even if a combination of multiple experts yielded better results in previous iterations. This may be explained by the exponential behavior of EWAF, which tends to favor the highest-weighted expert over the remaining ones. We believe the same effect may occur with other late fusion techniques.
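A minimal sketch of the EWAF weight update illustrates this behavior: because each round multiplies a weight by an exponential factor of its loss, even a small but consistent loss gap makes the distribution collapse onto a single expert. The learning rate and loss values below are purely illustrative.

import numpy as np

def ewaf_update(weights, losses, eta):
    # Multiply each expert's weight by exp(-eta * loss) and renormalize.
    w = weights * np.exp(-eta * np.asarray(losses, dtype=float))
    return w / w.sum()

# Toy example: two experts whose per-round losses differ by only 0.05.
w = np.array([0.5, 0.5])
for _ in range(200):
    w = ewaf_update(w, losses=[0.30, 0.35], eta=0.5)
print(w)  # roughly [0.99, 0.01]: the mass concentrates on the better expert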

7.3 RQ3) Can we use the ensemble learned in an online setting in an offline setting?

The ensemble learned using our weighted late fusion technique, for each dataset across the three domains under study, was successfully applied to an offline scenario in which no feedback is available. The expert-based solutions (used to learn the weights) only needed to receive feedback on how good the retrieved results are for approximately 25 queries for the Real and Sketch domains, and around 50 queries for the Biomedical domain. After that, they are ready to be used in an offline scenario.

7.4 RQ4) Are the CBIR experts in the resulting ensemble plausible considering the domain in hand?

In Fig. 15, we present an overview of the distribution of the CBIR experts' weights for the ExpertosLF_V and ExpertosLF_VS solutions, considering either only visual CBIR experts or the combination of visual and semantic CBIR experts. We considered the combination of the CBIR experts' weights at the first iteration at which our solutions converged to (or surpassed) the performance of the best CBIR expert. Each horizontal bar encodes the distribution of the weights of each expert for each dataset. As we can see, the best experts vary across the different domains.

Fig. 15
figure 15

Distribution of the experts' weights per dataset. (figure best seen in color)

Leveraging the interpretability associated with each ensemble, we provide a thorough and detailed analysis of each solution. Our aim is to demonstrate that the distribution of the expert weights across domains is plausible and well-rooted in the type of images within each dataset.

In the Biomedical domain, and considering only visual descriptors, BrainCE-MRI is well described by the texture expert with a slight contribution from the remaining visual experts. This result can be explained by the fact that the images are gray-level and the shapes of brains and tumors are roughly similar (they may vary in size); thus, there is little color, shape, or joint information available. As a result, mainly texture information is able to account for differences in tissue characteristics, such as calcifications, fat, cysts, contrast enhancement, or signal intensity.

The BreakHis and PlantPathology datasets are best discriminated by the color expert alone. These results can be explained, respectively, by the kind of differences observed in the tissue of different breast tumors (BreakHis) and in how the leaves change with foliar diseases (PlantPathology). BreakHis presents images in shades of pink, white, and purple that reflect what percentage of the tumor forms normal duct structures, how large, dark, or irregular the cell nuclei are, and how many cells exist. PlantPathology depicts images mainly in shades of green, where the visual symptoms of a disease vary greatly between varieties, but color plays a major role in differentiating them. For example, in apple scab, the initial infection appears as black or olive-brown lesions, while in cedar apple rust, early symptoms are small, light yellow spots on the leaves that expand and turn bright orange.

COVID19-Rx, IRMA, and HAM10000 are better described by the shape expert. The latter also benefits slightly from the inclusion of the remaining experts, in particular the joint one. Both COVID19-Rx and IRMA depict gray-level images where mainly the size and shape of the opacities and pleural abnormalities (COVID19-Rx) or of the organs under study (IRMA) differ significantly. Color could also be relevant to capture the density of the whiter pixels in the COVID19-Rx dataset. Overall, our results are in line with the characteristics of the images. HAM10000 depicts images of skin lesions with different shapes in tones of pink (for the skin) and combinations of pink, brown, and black (for the lesion itself). Moreover, there is little texture information available in the images. It is widely known that color in skin lesions provides important morphologic information (melanin is the most important chromophore in pigmented skin lesions); however, our solution failed to capture that when using only visual experts: only the shape of the lesion itself was useful to distinguish among types of lesions.

The inclusion of the semantic tags expert was useful for half of the Biomedical datasets: COVID19-Rx, HAM10000, and PlantPathology. We believe this is due to the semantic richness of the identified terms: although they may not be suited to the domain, they may be sufficiently distinct to discriminate the images and improve the performance of the retrieval task. Interestingly, the inclusion of semantic information made the system adjust itself in a way that, in addition to the semantic tags, it benefits from the color and joint experts for HAM10000, and from the shape and joint experts for COVID19-Rx. These results are in line with what would be expected considering the characteristics of the images of those datasets, as described above.

In the Real domain, datasets are best described by either color (COREL10K), joint information (CopyDays), or the combination of both (COREL1K and GHIM10K). COREL1K also benefits slightly from the contribution of the texture and shape experts. These results are not surprising, since all datasets depict natural colored images with diverse colors, shapes, and textures. When semantic information was included, the semantic expert dominated the ensemble for almost all datasets. These results were also expected, since the semantic richness of the tags (which describe objects, people, emotions, events, etc.) is much more powerful than the visual content of the images alone. For the CopyDays dataset, it is useful to combine both visual (shape and joint) and semantic information. We believe this is because several images have been heavily manipulated (scaled, reduced in quality, with parts of the image painted over or blurred); thus, for many images, the semantic information is less discriminative.

In the Sketch domain, datasets are best discriminated by shape ($P), or a combination of shape and joint information (ImiSketchS and mCali). The latter datasets, in particular ImiSketchS, benefit slightly from the inclusion of the remaining visual experts. All datasets depict black-and-white images representing numbers, geometrical shapes, furniture, mathematical symbols, among others. As expected, shape plays an important role in our expert-based solution (demonstrated by the use of both the shape and joint experts). Similar to the Real domain, the semantic expert dominated the ensemble for the ImiSketchS and mCali datasets, and we believe this happens for the same reasons. The $P dataset benefits from the use of semantic information combined with shape.

Figure 16 depicts the difference between the performance achieved by our solution when using both visual and semantic experts and the performance achieved when using only visual experts. As we can see, the Biomedical domain is well described using mainly visual CBIR experts, while the Real and Sketch domains benefit from the use of semantic information.

Fig. 16
figure 16

Difference in performance between the ExpertosLF_VS solution (visual and semantic experts) and the ExpertosLF_V solution (visual experts only). Darker shades of green mean that ExpertosLF_VS performs better than ExpertosLF_V. (figure best seen in color)

8 Conclusions and future work

We presented a novel late fusion technique that applies online learning and prediction with expert advice to the problem of dynamically combining the best types of descriptors to discriminate images in a CBIR scenario, regardless of the dataset or domain at hand. We did so by leveraging relevance feedback that may be available in a realistic scenario.

Our late fusion solutions based on expert advice were indeed able to quickly learn the best descriptor sets in three distinct domains (Biomedical, Real, and Sketch), spanning a total of 13 benchmark datasets. The expert-based solutions achieved similar performance to that of the early fusion of all the experts, and surpassed existing state-of-the-art late fusion techniques (FreqRankLF and SimRankLF). They were also more efficient than these techniques, while achieving similar or better results. Moreover, our solutions guaranteed that the retrieval performance was as good as the best CBIR system in the ensemble. Finally, the ensembles learnt through our approach also proved useful in an offline setting (i.e., when human feedback is no longer available).

In this work, we focused mainly on low- and high-level descriptors (instead of using, for instance, CNN layers), since we intended to 1) ensure that the resulting ensembles were interpretable, and 2) study how the use of different kinds of information (visual or semantic) varied across the different domains, and whether the resulting ensemble reflected it. For future work, neural descriptors could also be included in the ensemble, since our technique is model-agnostic, modular, and scalable.

Another line of future work concerns the online learning framework itself. The framework used in this work, prediction with expert advice, assumes that the forecaster learns both its own loss and the loss of each expert after the environment's outcome is revealed (i.e., that the set of the most relevant images for a given query is known). However, this may not always be the case in CBIR. Thus, we intend to explore a related class of problems, multi-armed bandits [51, 84], in which the environment's outcome is unknown and only the forecaster learns its own loss, i.e., only the expert chosen by the forecaster receives feedback regarding its set of retrieved images.
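As a rough illustration of the difference in feedback, the sketch below shows an EXP3-style update (one standard bandit algorithm): only the loss of the expert actually selected for the query is observed, and an importance-weighted estimate stands in for the full loss vector. The learning rate and the omission of explicit exploration mixing are simplifications for illustration.

import numpy as np

def exp3_update(weights, chosen, observed_loss, eta):
    # Bandit feedback: only the chosen expert reveals its loss; build an
    # importance-weighted (unbiased) loss estimate and apply an exponential update.
    probs = weights / weights.sum()
    estimated = np.zeros_like(weights, dtype=float)
    estimated[chosen] = observed_loss / probs[chosen]
    w = weights * np.exp(-eta * estimated)
    return w / w.sum()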