
1 Introduction

Traditional single-label image categorization assumes that an image contains objects associated with only a single label, which obviously does not fit real-world situations. Since objects belonging to different classes, such as beach, sea and sky, can appear in one image, it is natural to assign multiple labels to an image to express its multiple semantic meanings. This is formulated as multi-label image categorization, a more prevailing but also more challenging and complex task. Typically, multi-label categorization is treated as a set of single-label categorization tasks; for example, Boutell et al. [1] built an individual classifier for each label and determined the labels of a new image with all of these classifiers. However, researchers found that this solution treated each label in isolation and ignored the correlations among labels, so approaches such as CLP [2] and CML-SVM [3] were proposed to exploit these correlations. Regretfully, all of them considered an image as an indivisible unit and only made use of global features, neglecting the fact that it is regions in the image, not the whole image, that give rise to the semantic labels. Under the multi-label learning framework, however, it is difficult to express the relationship between the whole image and its local regions precisely.

To tackle this issue, multi-instance multi-label learning (MIMLL) applies the multi-instance learning framework to multi-label problems. In multi-instance learning, an image is regarded as a bag consisting of several instances corresponding to the regions in the image; if at least one instance is associated with a label, the image is marked with this label. This is consistent with the situation where labels are characterized by their corresponding regions in the image rather than by the entire image. Given a training image dataset composed of a set of bags of instances, where each bag relates to multiple labels, MIMLL learns a classifier to predict all labels of an unseen bag [4].

In recent years, a number of effective algorithms have been presented in the field of MIMLL; the initial attempt was made by Zhou et al. [5]. They transformed the MIMLL task into a multi-instance or a multi-label learning task, which can be further transformed into a conventional learning task. Based on this, they presented two different models, MIML-Boost and MIML-SVM, which achieved good performance in scene classification. Considering the information lost in this degeneration process, Zhou et al. [6] then proposed D-MIMLSVM, which directly tackles multi-instance multi-label (MIML) problems in a regularization framework. However, these MIMLL algorithms neglected the contextual information hidden in the image; for example, related objects such as “car-building” and “ship-seaboard” often appear together in one image. To alleviate this problem, Zha et al. [7] proposed an integrated multi-label multi-instance learning approach called MLMIL based on hidden conditional random fields (HCRFs) [8] for fine image classification, which simultaneously takes the relationship between labels and regions and the correlations among labels into account during training. Ding et al. [4] solved the MIML problem by incorporating hierarchical context among instances and labels into the classification model. In their work, multiple graphs were adopted to construct the instance context, and a linear combination of several latent concepts linking low-level features and high-level semantic labels was used to construct the label context. In addition, considering that most existing algorithms can only deal with moderate-sized data, Yakhnenko et al. [9] proposed DMIML and Huang et al. [10] proposed MIMLfast to handle large datasets efficiently, and both also exploit the correlations among labels.

Most of these existing algorithms take contextual information into consideration to some extent, demonstrating that hidden contexts have a positive impact on image categorization performance. However, they focus only on the connections between labels and regions and on the correlations among labels, ignoring two other kinds of contexts. The first is the spatial context among adjacent instances. Regions in an image do not exist independently; for example, streets always appear along with surrounding objects such as houses, cars and pedestrians, which can be regarded as spatial context. Note that this spatial context is also able to capture the correlations among labels. The second is the latent probability distribution of instances. Instances inevitably follow a certain distribution in their feature space, which reflects the similarity of instances belonging to the same category. Considering the latent probability distribution of instances helps to better predict the label of a given instance and eventually improves the performance of image categorization.

To circumvent the limitations of existing MIMLL methods, we propose a novel approach for image categorization based on integrated contextual information, including the latent probability distribution of all instances, the spatial context among adjacent instances and the correlations between instances and labels. In our algorithm, we design a multi-instance conditional random fields (mi-CRFs) framework that jointly models these contexts. Specifically, to model the latent probability distribution of all instances, we introduce a Gaussian Mixture Model (GMM) to approximate the real distribution of positive and negative instances; to express the spatial context between an instance and its adjacent instances, we use the proportion of labels that the instance’s adjacent areas are associated with; to explore the correlations between instances and labels, we give the probability that each instance belongs to a label using a multi-class support vector machine (SVM). The contributions of this paper can be summarized as follows:

  1. The mi-CRFs integrates three different terms of contextual information into one model. Experimental results show that these terms are all useful and indispensable.

  2. In contrast to other MIMLL approaches, the mi-CRFs introduces GMM to construct the latent probability distribution of all instances.

  3. The mi-CRFs is the first to adopt the spatial context among adjacent instances, which can filter positive instances from each bag.

The remainder of this paper is organized as follows: Sect. 2 presents the construction of the three types of contexts we use. The formulation of the proposed mi-CRFs is described in Sect. 3. Section 4 shows our experimental results and the corresponding analysis. Finally, we conclude this work in Sect. 5.

2 Construction of Contextual Information

2.1 MIMLL Formulation

For convenience, we first give the framework of MIMLL. Let \( \partial \in R^{d} \) denote the input instance space and \( l = \{ \pm 1\}^{E} \) denote the set of \( E \) class labels. The target of MIMLL is to learn a function \( F_{MIMLL} :2^{\partial } \to 2^{l} \) from a given data set \( D = \{ B_{i} ,Y_{i} \}_{i = 1}^{N} \), where \( B_{i} = \{ x_{i,j} \}_{j = 1}^{{m_{i} }} \; \subseteq \;\partial \) denotes a bag consisting of \( m_{i} \) instances and \( Y_{i} = [y_{i,1} , \ldots ,y_{i,E} ] \in l \) is its corresponding label set. Here N is the number of bags; if \( B_{i} \) is associated with the label e, then \( y_{i,e} = 1\,(e = 1, \ldots ,E) \), otherwise \( y_{i,e} = - 1 \).
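As a concrete illustration of this formulation, the following Python sketch builds a tiny synthetic data set in the MIMLL shape: variable-sized bags of d-dimensional instances paired with ±1 label vectors. All names, sizes and values here are illustrative, not taken from the paper's data.

```python
import numpy as np

# Hypothetical toy data set: N = 3 bags; each bag B_i is an (m_i, d) array of
# instance feature vectors, and Y_i is a {+1, -1} vector over E labels.
d, E = 4, 3                                           # feature dim, label count
rng = np.random.default_rng(0)

bags = [rng.normal(size=(m, d)) for m in (2, 5, 3)]   # m_i varies per bag
labels = np.array([[+1, -1, +1],
                   [-1, +1, -1],
                   [+1, +1, -1]])                     # Y_i in {+-1}^E

# A bag carries label e iff y_{i,e} = +1
for B, Y in zip(bags, labels):
    positive = np.flatnonzero(Y == 1)
    print(B.shape[0], "instances, positive labels:", positive.tolist())
```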

2.2 Latent Probability Distribution of Instances

Theoretically, the probability distribution of instances in the feature space can be illustrated as in Fig. 1. Clearly, each category of instances forms a “cluster”; different clusters arise because different features are extracted from different categories of instances.

Fig. 1. Probability distribution of instances in the feature space.

From the perspective of probability statistics, a Gaussian distribution can generally be used to approximate the probability distribution of a given data set when we are unable to determine which distribution is appropriate. However, a single Gaussian distribution cannot fit multiple “clusters”; we therefore introduce the Gaussian Mixture Model (GMM) for MIMLL, defined as formula (1), where \( h_{i,j} \,(h_{i,j} \in Y_{i} ) \) denotes the label of the instance \( x_{i,j} \).

$$ P\left( {\left. {h_{i,j} } \right|\theta } \right) = \sum\limits_{k = 1}^{E} {P\left( k \right)P\left( {\left. {h_{i,j} } \right|k} \right) = \sum\limits_{k = 1}^{E} {\alpha_{k} \phi \left( {\left. {h_{i,j} } \right|\theta_{k} } \right)} } $$
(1)

In formula (1), E is the number of Gaussian components as well as the number of labels of the given data set, \( \emptyset (h_{i,j} |\theta_{k} ) \) is the Gaussian distribution density, and \( \alpha_{k} \) is the coefficient of the \( k^{th} \) component, representing the probability that the instance \( x_{i,j} \) comes from the \( k^{th} \) component; the coefficients must satisfy the constraint of formula (2). In addition, \( \theta_{k} = \left( {\mu_{k} , \sigma_{k}^{2} } \right) \) denotes the parameters of the \( k^{th} \) component, where \( \mu_{k} \) is the mean and \( \sigma_{k} \) the standard deviation. Once suitable parameters \( \alpha_{k} \), \( \mu_{k} \) and \( \sigma_{k} \) are found, the probability distribution of instances is determined.

$$ \sum\limits_{k = 1}^{E} {\alpha_{k} = 1} $$
(2)
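To make the GMM construction concrete, the sketch below fits a two-component mixture with scikit-learn on synthetic two-cluster data; the EM fit estimates the mixing coefficients \( \alpha_{k} \) (`weights_`), the means \( \mu_{k} \) and the covariances of formulas (1)–(2). The data and component count are illustrative, not the paper's actual instance features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic "clusters" standing in for E = 2 instance categories
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-3.0, size=(100, 2)),
               rng.normal(loc=+3.0, size=(100, 2))])

# One Gaussian component per label, as in formula (1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Mixing coefficients obey the constraint of formula (2): sum over alpha_k = 1
print(gmm.weights_.sum())            # ~1.0
probs = gmm.predict_proba(X[:1])     # posterior over components for one instance
```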

2.3 Spatial Context Among Adjacent Instances

Figure 2 gives a detailed description of the spatial context among adjacent instances. The right side is the constructed bag \( B_{i} \) of the image \( I_{i} \) shown on the left; it is divided into different object regions, each consisting of a number of dots that represent super-pixels. Edges between dots represent adjacency. For “Car”, the super-pixels around the edge of “Car” almost all belong to “Building” and “Road”, which is in line with the actual situation and provides theoretical feasibility for our model.

Fig. 2. Example of the spatial context among adjacent instances.

Here, for a given instance, we use the proportion of the labels its adjacent instances are associated with to represent the spatial context. Let \( I_{i} = \{ sp_{i,j} \}_{j = 1}^{{m_{i} }} \) denote that an image \( I_{i} \) is segmented into \( m_{i} \) super-pixels, and let \( c_{i,j} \) denote the position at which the super-pixel \( sp_{i,j} \) is located. First, we find the n nearest neighbors of the super-pixel \( sp_{i,j} \): \( nei_{i,j} = \{ sp_{i,1}^{nei} , \ldots ,sp_{i,n}^{nei} \} \). We then calculate the ratio of these n nearest neighbors belonging to each category according to formula (3), where \( r_{i,j}^{c} \) denotes the ratio of the n nearest neighbors of the \( j^{th} \) super-pixel in the image \( I_{i} \) belonging to the label \( c\,(c \in l) \), \( \left[ \cdot \right] \) is the indicator function whose value is 1 when the condition is satisfied and 0 otherwise, \( count( \cdot ) \) denotes the total number of elements, and \( h_{i,k} \) denotes the label of \( sp_{i,k}^{nei} \).

$$ r_{i,j}^{c} = \frac{1}{{count(nei_{i,j} )}}\sum\limits_{{k \in nei_{i,j} }} {\left[ {h_{i,k} = c} \right]} $$
(3)

By constructing the spatial context in this manner, we can select all possible labels of a super-pixel according to its set of n nearest neighbors when training the classifier for multi-label tasks. In our algorithm, we set n to 10 because experiments show that the overall categorization performance of our algorithm is best under this setting.
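A minimal sketch of formula (3): given the labels of a super-pixel's ten nearest neighbors, the ratio \( r_{i,j}^{c} \) for a label c is simply the indicator average. The neighbor labels below are made up for illustration.

```python
def neighbor_label_ratio(neighbor_labels, c):
    """Formula (3): fraction of the n nearest neighbors carrying label c."""
    neighbor_labels = list(neighbor_labels)
    return sum(h == c for h in neighbor_labels) / len(neighbor_labels)

# Hypothetical labels of the 10 nearest super-pixels of sp_{i,j}
nei = ["building", "building", "road", "road", "road",
       "car", "building", "road", "building", "road"]
print(neighbor_label_ratio(nei, "road"))      # 0.5
print(neighbor_label_ratio(nei, "building"))  # 0.4
```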

2.4 Correlations Between Instances and Labels

Correlations between instances and labels have been demonstrated to be useful by [4, 8]. Each instance corresponds to a unique label, whose prior probability is given by the labels of the bag the instance is situated in. Given the label set \( Y_{i} \) of the bag \( B_{i} \), one direct way to exploit the correlation between instance \( x_{i,j} \) and its label \( h_{i,j} \) is to calculate the posterior probability \( P(h_{i,j} |x_{i,j} ,Y_{i} ) \), and we adopt a multi-class SVM to calculate it.
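One way to realize this posterior, sketched below on synthetic data, is an SVM with Platt-scaled probability outputs (scikit-learn's `probability=True`); conditioning on the bag's label set \( Y_{i} \) would additionally restrict and renormalize over the labels present in \( Y_{i} \) (not shown). This is an illustrative sketch under those assumptions, not the paper's exact training setup.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training instances with known labels (two classes)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(+2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# probability=True enables Platt scaling so the SVM outputs P(h | x)
svm = SVC(probability=True, random_state=0).fit(X, y)
post = svm.predict_proba(np.array([[-2.0, -2.0]]))
print(post)  # posterior over the two classes for the query instance
```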

3 Multi-instance Conditional Random Fields Algorithm

HCRFs were proposed by Quattoni et al. [8] as a robust algorithm for exploring the labels of local regions in object detection. HCRFs consider that each observation sequence corresponds to a set of hidden variables, which represent all possible sources of the observation sequence. Inspired by HCRFs, the labels of the bag \( B_{i} \) can be regarded as the observation sequence and the labels of the instances in this bag can be expressed by a set of hidden variables: \( H_{i} = \{ h_{i,1} , \ldots ,h_{{i,m_{i} }} \} \in l \). In this paper, we propose a modified conditional random field for MIMLL: multi-instance conditional random fields (mi-CRFs).

3.1 Algorithm Framework

The framework of the proposed mi-CRFs algorithm contains four main steps, as illustrated in Fig. 3.

Fig. 3. Algorithm framework.

First, it constructs the training data set by segmenting each image into multiple super-pixels; in other words, it constructs the instances of each bag. Second, it initializes the labels of each instance with the labels of its corresponding bag, and then estimates these labels to select the most probable label for each instance using mi-SVM [11]. Third, after obtaining the estimated label of each instance, it constructs the corresponding context potential for each type of contextual information, integrates the three context potentials into a total potential, and obtains the posterior distribution of instance labels. Finally, it uses BFGS [12] to maximize the likelihood of the posterior distribution obtained in the third step. After obtaining all labels of the instances in a bag, we mark the bag with these labels. We describe the details of our algorithm in Sect. 3.2.

3.2 Formulation of Mi-CRFs

Firstly, we define the integrated potential of mi-CRFs in formula (4); it consists of three parts: \( \emptyset_{ins} \), \( \emptyset_{nei} \) and \( \emptyset_{pair} \).

$$ \begin{aligned} & E\left( {H_{i} } \right) = \sum\limits_{j} {\phi_{unary} \left( {x_{i,j} } \right) + \sum\limits_{j < k} {\phi_{pair} \left( {x_{i,j} ,x_{i,k} } \right)} } \\ & \,\,\,\,\,\,\,\,\,\,\,\,\, = \sum\limits_{j} {\phi_{ins} \left( {x_{i,j} } \right) + \sum\limits_{j} {\phi_{nei} \left( {x_{i,j} } \right) + \sum\limits_{j < k} {\phi_{pair} \left( {x_{i,j} ,x_{i,k} } \right)} } } \\ \end{aligned} $$
(4)

Secondly, we give the details of the three parts in formula (4). \( \emptyset_{ins} \) is a unary potential denoting the correlations between instances and labels; as illustrated in Sect. 2.4, we can give a probability for the label of each instance, so \( \emptyset_{ins} \) can be defined as formula (5).

$$ \phi_{ins} \left( {H_{i} \left| {B_{i} } \right.,Y_{i} } \right) = - \sum\limits_{j} {\log P\left( {h_{i,j} \left| {x_{i,j} } \right.,Y_{i} } \right), \, 1 \le j \le m_{i} } $$
(5)

\( \emptyset_{nei} \) is also a unary potential, but it denotes the spatial context among adjacent instances. Following Sect. 2.3, \( \emptyset_{nei} \) is given in formula (6).

$$ \phi_{nei} \left( {h_{i,j} \left| {nei_{i,j} } \right.} \right) = - \log r_{i,j}^{c} $$
(6)

\( \emptyset_{pair} \) is a pairwise potential capturing the latent probability distribution of instances. It is defined in formula (7), where \( \varphi_{m} (x_{i,j} ,x_{i,k} ) \) denotes a Gaussian kernel whose specific form is given in formula (8); there, \( P_{i,j} \) and \( P_{i,k} \) represent the positions of the instances \( x_{i,j} \) and \( x_{i,k} \) respectively. By designing such a Gaussian kernel, adjacent instances can be classified into the same category. \( \theta_{\alpha } \), \( \theta_{\beta } \) and \( \theta_{\gamma } \) are the parameters of the Gaussian kernel.

$$ \phi_{pair} \left( {h_{i,j} ,h_{i,k} } \right) = \left\{ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 0 & {\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,{\text{if }}h_{i,j} = h_{i,k} ,} \\ \end{array} } \\ {\begin{array}{*{20}c} {\sum\limits_{m = 1}^{E} {\alpha_{m} \varphi_{m} \left( {x_{i,j} ,x_{i,k} } \right)} } & {otherwise.} \\ \end{array} } \\ \end{array} } \right. $$
(7)
$$ \begin{aligned} & \varphi_{m} \left( {x_{i,j} ,x_{i,k} } \right) = w^{\left( 1 \right)} \exp \left( { - \frac{{\left| {p_{i,j} - p_{i,k} } \right|^{2} }}{{2\theta_{\alpha }^{2} }} - \frac{{\left| {x_{i,j} - x_{i,k} } \right|^{2} }}{{2\theta_{\beta }^{2} }}} \right) \\ & \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, + \;w^{\left( 2 \right)} \exp \left( { - \frac{{\left| {p_{i,j} - p_{i,k} } \right|^{2} }}{{2\theta_{\gamma }^{2} }}} \right) \\ \end{aligned} $$
(8)
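The two potentials above can be sketched directly in Python. The weights \( w^{(1)} \), \( w^{(2)} \), the kernel bandwidths and the per-kernel coefficients \( \alpha_{m} \) below are illustrative placeholders (the paper learns them), and for brevity a single shared kernel stands in for all m in formula (7).

```python
import numpy as np

def gaussian_kernel(p_j, p_k, x_j, x_k,
                    w1=1.0, w2=1.0, theta_a=10.0, theta_b=1.0, theta_g=3.0):
    """Formula (8): appearance kernel (position + features) plus smoothness kernel."""
    d_pos = np.sum((p_j - p_k) ** 2)
    d_feat = np.sum((x_j - x_k) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_a ** 2) - d_feat / (2 * theta_b ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_g ** 2))
    return appearance + smoothness

def pairwise_potential(h_j, h_k, p_j, p_k, x_j, x_k, alphas):
    """Formula (7): zero when the labels agree, an alpha-weighted kernel sum otherwise."""
    if h_j == h_k:
        return 0.0
    return sum(a * gaussian_kernel(p_j, p_k, x_j, x_k) for a in alphas)

p = np.array([0.0, 0.0]); q = np.array([0.0, 0.0])   # identical positions
x = np.array([1.0, 1.0]); z = np.array([1.0, 1.0])   # identical features
print(pairwise_potential(0, 0, p, q, x, z, [0.5, 0.5]))  # 0.0: labels agree
print(pairwise_potential(0, 1, p, q, x, z, [0.5, 0.5]))  # 2.0: w1 + w2 at zero distance
```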

Thirdly, a CRF [13] must be constructed on a graph; here we choose a fully-connected graph, since every pair of nodes in it is connected by an edge, which makes it comprehensive and stable. In addition, a CRF can be represented as a Gibbs distribution, whose general form is defined in formula (9).

$$ P(Y_{i} ) \propto \exp \left( { - E (Y_{i} )} \right) $$
(9)
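Formula (9) is simply a softmax over negative energies: the sketch below normalizes the (hypothetical) energies of three candidate labelings by the partition function Z, so lower energy yields higher probability.

```python
import numpy as np

def gibbs(energies):
    """Formula (9): P(Y) proportional to exp(-E(Y)), normalized by the partition function Z."""
    e = np.exp(-np.asarray(energies, dtype=float))
    return e / e.sum()

# Hypothetical energies of three candidate labelings
p_y = gibbs([1.0, 2.0, 3.0])
print(p_y)  # decreasing probabilities that sum to 1
```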

In mi-CRFs, a set of hidden label variables \( H_{i} = \{ h_{i,1} , \ldots ,h_{{i,m_{i} }} \} \in l \) is defined; similarly to MLMIL [7], the posterior probability of the label vector \( Y_{i} \) of the bag \( B_{i} \) is obtained by summing over all possible values of \( H_{i} \), as defined in formula (10), where \( Z\left( {B_{i} } \right) = \mathop \sum \limits_{Y} \mathop \sum \limits_{h} exp( - E(Y_{i} ,h_{i,j} )) \) is the partition function.

$$ P\left( {Y_{i} \left| {B_{i} } \right.} \right) = \sum\limits_{j = 1}^{{m_{i} }} {P\left( {Y_{i} ,h_{i,j} \left| {B_{i} } \right.} \right)} = \frac{1}{{Z\left( {B_{i} } \right)}}\sum\limits_{j = 1}^{{m_{i} }} {\exp \left( { - E\left( {Y_{i} ,h_{i,j} } \right)} \right)} $$
(10)

In the task of MIML image categorization, an image corresponds to a bag and a super-pixel corresponds to an instance. mi-CRFs establishes a fully-connected graph for each bag, and then obtains the corresponding potential and the expression of the posterior probability. As in the traditional CRF [13] model, the final optimization target of mi-CRFs is to maximize the posterior probability of the hidden sequence, as defined in formula (11).

$$ H_{i}^{*} = \mathop {\arg \hbox{max} }\nolimits_{{H_{i} \in \left\{ { \pm 1} \right\}^{{m_{i} }} }} P\left( {H_{i} \left| {Y_{i} ,B_{i} } \right.} \right) $$
(11)

Now mi-CRFs is established. Since it uses a fully-connected graph, we use the mean-field approximation for inference and adopt BFGS [12] to optimize the parameters.

4 Experiments and Analysis

To evaluate the performance of mi-CRFs, we design a set of experiments on two data sets: MSRC and Corel 1000. In our implementation, all images are first segmented into super-pixels using SLIC [14], and then a 367-dimension feature vector is extracted for each super-pixel to represent the instance, comprising 3-dimension RGB, a 64-dimension color histogram and a 300-dimension dense SIFT histogram. Following [7], we perform 5-fold cross validation on each data set. For each data set, images are randomly and equally split into five parts, subject to the constraint that each partition contains at least five positive images of each class. We choose one of the five parts as the testing set and the remaining four as the training set. We compare our method with several state-of-the-art algorithms for MIML image categorization: MIML-Boost [5], MIML-SVM [5], MLMIL [7], DMIML [9] and CMIML [4]. We use the widely adopted AUC (Area Under Curve) [15] to evaluate the performance of these approaches, because it describes the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample [7].
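As a pointer to how such AUC numbers are computed, the sketch below scores one label over six hypothetical test bags with scikit-learn's `roc_auc_score`; the ±1 ground truth follows the labeling convention of Sect. 2.1, and the scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (+-1) and classifier scores for one label
y_true = np.array([1, 1, 1, -1, -1, -1])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])

# AUC = probability a random positive is ranked above a random negative [15]
auc = roc_auc_score(y_true, scores)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```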

The MSRC data set consists of 591 images with 23 classes. Each image is marked with at least one label at the image level and also has pixel-level ground truth with the corresponding labels. Note that we only use the image-level ground truth to train our model. Following [7], we treat “horse” and “mountain” as “void” because each of them has only a few positive samples; thus there are 21 labels in total. Table 1 presents the average AUC on the MSRC data set of mi-CRFs and the compared methods. Clearly, mi-CRFs achieves the best overall performance.

Table 1. Average AUC on MSRC data set of different approaches.

The Corel 1000 data set contains 1000 images with 10 classes. All images have been manually marked with 1–5 labels. Table 2 presents the average AUC on the Corel 1000 data set of mi-CRFs and the compared methods; mi-CRFs again achieves the best overall performance on this data set.

Table 2. Average AUC on Corel 1000 data set of different approaches.

Figure 4 gives the detailed AUC on each category of the MSRC data set. As illustrated in Fig. 4, mi-CRFs achieves the best performance on most categories but performs poorly on “car”, “flower”, “bird”, “dog” and “boat”. We find that the images on which mi-CRFs does not perform well share several common characteristics: (1) one object contains too many subtypes; for example, different kinds of flowers vary in color, size, appearance and angle, and boats include yachts, fishing boats and canoes, which differ greatly from each other; (2) backgrounds of images in the same category are complex and varied; for example, birds can fly in the sky and over water, and can appear on the shore and on the grass. These characteristics make it difficult for mi-CRFs to capture useful contextual information, resulting in unsatisfactory classification performance on these categories.

Fig. 4. AUC on each category of MSRC data set of different approaches.

In addition, to verify the assumption that the three types of contextual information are useful and indispensable, we design two contrastive experiments and give the results in Table 3 and Fig. 5.

Table 3. Average AUC on MSRC data set of mi-CRFs with different combinations of different contexts.
Fig. 5. AUC on each category of MSRC data set of mi-CRFs with different combinations of different contexts.

As can be seen from Table 3, only when mi-CRFs takes the integrated contextual information into consideration does it achieve the best performance. The average AUC of mi-CRFs declines when any type of context is removed, and drops to 0.517 when all contexts are removed. A reasonable explanation for 0.517 is that the prior information then depends only on the instance labels estimated by mi-SVM [11] and has nothing to do with instance features or contextual information, so the result is close to 0.5. Moreover, each context has a different influence on the performance of our model, and removing the spatial context among adjacent instances leads to the largest drop (14%) compared with the other two types of contexts. Figure 5 gives the detailed results on each category, showing the different impact of different contexts on different labels. The spatial context among adjacent instances is also demonstrated to be the most significant on most categories, while it has little influence on “dog” and “boat”; images associated with these two labels always have complex backgrounds. For example, the background of a dog can be grass, road or even water, and grass backgrounds are almost as frequent as road backgrounds. The result for “flower” seems abnormal, which may be related to the enormous variation in color, size, appearance and angle of flowers.

5 Conclusion

In this paper, we propose a multi-instance multi-label framework that combines three different types of contextual information and apply it to image categorization. Our approach not only considers the latent probability distribution of instances and the spatial context among adjacent instances, but also flexibly models the correlations between instances and labels. Experiments on the MSRC and Corel 1000 data sets demonstrate the excellent performance of our approach and show that the three types of contexts we use are all helpful and indispensable to the overall improvement of the proposed algorithm.