Keywords

1 Introduction

Recently, several models have been proposed to identify specific categories of objects in images, such as face identification in digital cameras [13]. The difficulty of these approaches is that distinct models may be needed to identify different object classes. One way to cope with this scenario is to devise a strategy capable of identifying image regions with a high probability of containing objects, i.e., object-related regions. According to [12], an approach to object detection should group pixels into uniform and homogeneous regions with respect to some characteristic(s), in a way that leads to the separation between foreground objects and background, making this task closely related to figure-ground segmentation and/or to semantic image segmentation [13].

Segmentation methods divide images into regions (or disjoint segments) that, in an ideal situation, correspond to (parts of) the real objects portrayed by them. State-of-the-art methods to detect object-related regions generally depend on image segmentation [2, 3, 5]. In this work, we hypothesize that an accurate segmentation method is essential for improving object detection rates. A hierarchical segmentation method generates multiple disjoint segments at different levels of detail (see Fig. 1 for an example). A coarse level of detail can be generated by simple merges of segments belonging to finer detail levels [9]. According to [8], a hierarchical method must respect two principles of multi-scale analysis: (i) the causality principle, which states that a contour of a region present at a given scale (or level of detail) \(k_1\) must be present at any other scale \(k_2 < k_1\); and (ii) the location principle, which states that the contours of the regions should neither move nor deform when the scale changes. This work evaluates the adoption of the Hierarchical Graph-Based segmentation method (hGB) [9, 10] to compute the Superpixel Straddling (SS) measure used in the state-of-the-art object detection approach proposed in [2]. We also evaluate the use of SLIC [1] (a state-of-the-art method to generate superpixels), since the SS measure is intended to quantify the relationship between an object-related window and the superpixels intersected by it. The main contributions of this work are two-fold: (i) it shows that hierarchical segmentation outperforms its non-hierarchical counterpart; and (ii) it gives evidence that using a state-of-the-art superpixel method does not yield the same improvement.

Fig. 1.
figure 1

Example of a result obtained by a hierarchical image segmentation method.

The paper is organized as follows. Section 2 presents related work, while Sect. 3 describes the objectness measurement in detail. Section 4 presents concepts about hierarchical graph-based segmentation used in this work. Section 5 presents experimental results of our approach together with a comparative analysis. Finally, we draw some conclusions in Sect. 6.

2 Related Work

In [3], the authors use foreground and background seeds, which are groups of pixels of regular size. The input image is mapped onto a weighted graph, and graph cuts with different parameters and seeds are used to obtain regions. Each generated segment is then ranked based on a learned scoring function. Similarly, in [5], the method generates background and foreground segments which are then ranked. In [11], multiple segmentations are computed using the Graph-Based method (GB) [7] on different colour spaces. Afterwards, these segments are grouped and reported as object-related regions. Finally, in [2] the authors proposed a method which creates a random collection of windows over the image. Then, the windows are scored and ranked based on an objectness measurement. Since this measurement is used in this work, it is described in more detail in Sect. 3.

3 Objectness Measurement

In [2], the authors proposed a window score, named objectness measurement, which combines several measures in order to positively identify an object inside a given window. Several measures were investigated in [2], and the authors demonstrated that the three most important ones are the following.

Multiscale Saliency (MS): this measure is based on saliency and analyzes the spectral residual of the image's FFT. The saliency map \(I\) of image \(f\) is calculated (for each pixel \(p\)) using Eq. 1.

$$\begin{aligned} I(p) = g(p) * \mathcal{F}^{-1}[\exp (\mathcal{R}(f) + \mathcal{P}(f))]^2 \end{aligned}$$
(1)

in which \(\mathcal F\) is the FFT, \(\mathcal{R}(f)\) and \(\mathcal{P}(f)\) are the spectral residual and the phase spectrum of image \(f\), respectively, and \(g\) is a Gaussian filter. This calculation is done at different scales (see Fig. 2a) and for each color channel. For each scale \(s\), Eq. 1 is used to calculate the saliency \(I_{MS}^s(p)\). Then, the saliency of a window \(w\) at scale \(s\) is calculated using Eq. 2.

$$\begin{aligned} MS(w, \theta _{MS}^s) = \sum _{\{p\in w \ | \ I_{MS}^s(p)\ge \theta _{MS}^s\}} \frac{I_{MS}^s(p) \times |\{p \in w \ | \ I_{MS}^s(p) \ge \theta _{MS}^s\}|}{|w|} \end{aligned}$$
(2)
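The two equations above can be sketched as follows. This is a minimal, dependency-free Python/NumPy illustration, not the implementation of [2]: the Gaussian filter \(g\) is approximated by a naive box blur, and the function names and window format (y0, x0, y1, x1) are our own conventions.

```python
import numpy as np

def box_blur(a, k):
    """Naive k x k mean filter (a stand-in for the Gaussian g)."""
    pad = k // 2
    p = np.pad(a, pad, mode="edge")
    out = np.zeros_like(a)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def spectral_residual_saliency(channel, k=3):
    """Saliency map via the spectral residual of the FFT (sketch of Eq. 1).

    `channel` is a 2-D float array (one colour channel at one scale).
    """
    F = np.fft.fft2(channel)
    log_amp = np.log(np.abs(F) + 1e-8)   # log amplitude spectrum
    phase = np.angle(F)                  # phase spectrum P(f)
    residual = log_amp - box_blur(log_amp, k)   # spectral residual R(f)
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return box_blur(sal, k)              # smoothing by g

def ms_score(sal, window, theta):
    """MS(w, theta) from Eq. 2: density of salient pixels inside w."""
    y0, x0, y1, x1 = window
    patch = sal[y0:y1, x0:x1]
    salient = patch[patch >= theta]      # pixels above the threshold
    return salient.sum() * salient.size / patch.size
```

In the full approach this computation is repeated per scale and per color channel, with one learned threshold \(\theta_{MS}^s\) per scale.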

Color Contrast (CC): this measure calculates the dissimilarity between a window \(w\) and a surrounding window \(Surr(w, \theta _{CC})\) which is bigger than \(w\) by a factor of \(\theta _{CC}\) – see Eq. 3.

$$\begin{aligned} \frac{|Surr(w, \theta _{CC})|}{|w|} = \theta _{CC}^2 - 1 \end{aligned}$$
(3)

The CC score is the \(\chi ^2\) distance between their LAB histograms \(h\), as stated by Eq. 4.

$$\begin{aligned} CC(w,\theta _{CC}) = \chi ^2(h(w), h(Surr(w, \theta _{CC}))) \end{aligned}$$
(4)
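Equations 3 and 4 can be sketched in Python/NumPy as follows. This is an illustrative sketch, not the implementation of [2]: the LAB image is assumed to be normalized to [0, 1] per channel, and the helper names and window format (y0, x0, y1, x1) are our own conventions.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two normalised histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def surround(window, theta_cc, shape):
    """Rectangle enlarged around w by factor theta_cc, clipped to the image."""
    y0, x0, y1, x1 = window
    cy, cx = (y0 + y1) / 2, (x0 + x1) / 2
    hh, hw = (y1 - y0) * theta_cc / 2, (x1 - x0) * theta_cc / 2
    return (max(0, int(cy - hh)), max(0, int(cx - hw)),
            min(shape[0], int(cy + hh)), min(shape[1], int(cx + hw)))

def cc_score(lab_image, window, theta_cc, bins=8):
    """CC(w, theta_cc) from Eq. 4: chi-square distance between the LAB
    histogram of w and that of the surrounding ring Surr(w) minus w."""
    y0, x0, y1, x1 = window
    Y0, X0, Y1, X1 = surround(window, theta_cc, lab_image.shape[:2])
    inside = np.zeros(lab_image.shape[:2], bool)
    inside[y0:y1, x0:x1] = True
    outer = np.zeros_like(inside)
    outer[Y0:Y1, X0:X1] = True
    ring = outer & ~inside               # Surr(w) \ w

    def hist(mask):
        px = lab_image[mask].reshape(-1, 3)
        h, _ = np.histogramdd(px, bins=bins, range=[(0, 1)] * 3)
        return h.ravel() / max(h.sum(), 1)

    return chi2_distance(hist(inside), hist(ring))
```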

Superpixel Straddling (SS): according to [2], a superpixel \(s_p\) (defined as a small region of uniform color or texture that ideally preserves object boundaries) is "bad" for a window \(w\) when it straddles (crosses) the borders of \(w\). Thus, a window with fewer straddling superpixels should have a higher score. This score is calculated using Eq. 5.

$$\begin{aligned} SS(w,\theta _{SS}) = 1 - \sum _{s_p\in S(\theta _{SS})} \frac{min(|s_p \backslash w|, |s_p \cap w|)}{|w|} \end{aligned}$$
(5)

in which \(S(\theta _{SS})\) is the set of superpixels obtained using a segmentation method (the authors in [2] used GB method described in [7]) with a “scale” parameter \(\theta _{SS}\), \(|s_p \backslash w|\) represents the area of \(s_p\) outside \(w\), while \(|s_p \cap w|\) stands for the area of \(s_p\) inside \(w\).
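Given a superpixel label map produced by any of the segmentation methods (GB, hGB or SLIC), Eq. 5 reduces to a short loop. A minimal sketch, where the window format (y0, x0, y1, x1) and the function name are our own conventions:

```python
import numpy as np

def ss_score(labels, window):
    """SS(w, theta_SS) from Eq. 5.

    `labels` is an integer superpixel map; theta_SS is implicit in how
    `labels` was generated (scale for GB/hGB, superpixel count for SLIC).
    """
    y0, x0, y1, x1 = window
    area_w = (y1 - y0) * (x1 - x0)
    inside = np.zeros(labels.shape, bool)
    inside[y0:y1, x0:x1] = True
    penalty = 0.0
    for sp in np.unique(labels):
        mask = labels == sp
        a_in = np.count_nonzero(mask & inside)   # |s_p ∩ w|
        a_out = np.count_nonzero(mask) - a_in    # |s_p \ w|
        penalty += min(a_in, a_out) / area_w     # straddling penalty
    return 1.0 - penalty
```

A superpixel coinciding exactly with the window contributes no penalty (SS = 1), while a window strictly inside one large superpixel is maximally penalized.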

Fig. 2.
figure 2

Example of the three measures used: (a) MS (Multiscale Saliency); (b) CC (Color Contrast); and (c) SS (Superpixel Straddling). (Color figure online)

Learning the Parameters: in [2], using supervised training, the best parameter values \(\theta _{MS}^s,\theta _{CC},\theta _{SS}\) of each measure were computed in order to maximize the object detection rate. Finally, several combinations of measures were tested, and the combination of all three measures described above (MS+CC+SS) obtained the best object detection results.

4 Hierarchical Graph-Based Segmentation

According to [8], a hierarchical segmentation method must respect the two principles of multi-scale analysis. Thus, hierarchical segmentation is able to maintain spatial and neighborhood information between segments even when the scale changes [9]. In fact, in [9], the authors proposed a strategy for transforming a non-hierarchical method into a completely hierarchical one; they applied that approach to GB in order to obtain a new method called hGB, which is able to compute a hierarchy of partitions for all scales, as shown in Fig. 1. A comparison between GB and hGB segmentation results is shown in Fig. 3. One can easily notice that, for the GB method [7], the number of regions does not decrease when the scale parameter increases, nor are the contours of the regions stable (see Fig. 3b–d). In contrast, the hGB method [9, 10] respects both the causality and location principles (see Fig. 3e–g).

Fig. 3.
figure 3

Example of GB and hGB segmentation results.

In both methods, an image is transformed into an undirected graph \(G = (V,E)\), in which \(V\) is a finite set (representing pixels) and \(E\) is a subset of \(\{\{x,y\} \subseteq V \ |\ x \ne y\}\) (representing pixel neighborhood). To every tree \(T\) spanning the set \(V\) of the image pixels, to every map \(w:E\mapsto \mathbb {N}\) that weights the edges of \(T\) and to every threshold \(\lambda \in \mathbb {N}\), one may associate the partition \(\mathcal{P}_{\lambda }^{w}\) of \(V\) induced by the connected components of the graph made from \(V\) and the edges with weight below \(\lambda \). It is well known [4] that for any two values \(\lambda _1\) and \(\lambda _2\) such that \(\lambda _1\ge \lambda _2\), the partitions \(\mathcal{P}_{\lambda _1}^{w}\) and \(\mathcal{P}_{\lambda _2}^{w}\) are nested and \(\mathcal{P}_{\lambda _1}^{w}\) is coarser than \(\mathcal{P}_{\lambda _2}^{w}\). Hence, the set \(\mathcal{H}^{w}=\{\mathcal{P}_{\lambda }^{w}~|~\lambda \in \mathbb {N}\}\) is a hierarchy of partitions induced by the weight map \(w\).
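The construction of a partition \(\mathcal{P}_{\lambda }^{w}\) as the connected components of the sub-graph keeping only edges with weight below \(\lambda \) can be sketched with a union-find structure. This is an illustrative Python sketch under our own naming conventions, not the implementation of [9]:

```python
class UnionFind:
    """Disjoint-set forest with path compression."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def partition_at(n_vertices, weighted_edges, lam):
    """Partition P_lambda^w: connected components of the graph that
    keeps only the edges with weight below lam.

    `weighted_edges` is a list of ((u, v), w) pairs; the function
    returns one representative label per vertex.
    """
    uf = UnionFind(n_vertices)
    for (u, v), w in weighted_edges:
        if w < lam:
            uf.union(u, v)
    return [uf.find(v) for v in range(n_vertices)]
```

Running this for two thresholds \(\lambda _1 \ge \lambda _2\) illustrates the nesting property: any two vertices merged at \(\lambda _2\) are also merged at \(\lambda _1\).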

The method hGB [9] does not explicitly produce a hierarchy of partitions, but instead produces a weight map \({{L}}\) (scales of observation) from which the desired hierarchy \(\mathcal{H}^{{{L}}}\) can be inferred on a given \(T\). It starts from a minimum spanning tree \(T\) of the edge-weighted graph built from the image. In order to compute the scale \({{L}({e})}\) associated with each edge of \(T\), the hGB method iteratively considers the edges of \(T\) in non-decreasing order of their original weights \(w\). The new weight map \({{L}}\) is initialized to \(\infty \) for every edge; then, for each edge \(e\) linking two vertices \(x\) and \(y\), the following steps are performed:

  (i) Find the region \(X\) of \(\mathcal{P}_{w(e)}^{w}\) that contains \(x\);

  (ii) Find the region \(Y\) of \(\mathcal{P}_{w(e)}^{w}\) that contains \(y\);

  (iii) Compute the hierarchical observation scale \({L}(e)\).

At step (iii), the hierarchical scale \({S}_{Y}({X})\) of \(X\) relative to \(Y\) is needed to obtain the value \({L}(e)\). Intuitively, \({S}_{Y}({X})\) is the lowest observation scale at which some sub-region of \(X\), namely \(X^{*}\), will be merged to \(Y\). Then, the hierarchical scale \({{L}}(e)\) is simply set to:

$$\begin{aligned} {{L}}(e) = \max \{{S}_{Y}({X}),{S}_{X}({Y})\}, \end{aligned}$$
(6)

in which,

$$\begin{aligned} S_{Y}(X) = [Dif(X,Y) - Int(X)]\times |X| \end{aligned}$$
(7)

with \(Dif(X,Y)\) and \(Int(X)\) defined analogously to [7]. Thus, the internal difference \(Int(X)\) of a region \(X\) is the highest edge weight among all edges linking two vertices of \(X\) in the MST; and the difference \(Dif(X,Y)\) between two neighboring regions \(X\) and \(Y\) is the smallest edge weight among all edges that link \(X\) to \(Y\). Due to lack of space, additional details are omitted; the reader should refer to [9].
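A toy sketch of Eqs. 6 and 7 in Python, where regions are represented directly by the lists of their internal edge weights rather than extracted from the MST (an illustrative simplification under our own naming, not the hGB implementation of [9]):

```python
def internal_difference(region_edges):
    """Int(X): largest MST edge weight inside region X (0 if singleton)."""
    return max(region_edges, default=0)

def difference(cross_edges):
    """Dif(X, Y): smallest weight among the edges linking X to Y."""
    return min(cross_edges)

def hierarchical_scale(region_x_edges, region_y_edges, cross_edges,
                       size_x, size_y):
    """L(e) from Eq. 6, as the max of the two relative scales of Eq. 7:
    S_Y(X) = [Dif(X, Y) - Int(X)] * |X| and symmetrically for S_X(Y)."""
    dif = difference(cross_edges)
    s_y_of_x = (dif - internal_difference(region_x_edges)) * size_x
    s_x_of_y = (dif - internal_difference(region_y_edges)) * size_y
    return max(s_y_of_x, s_x_of_y)
```

For instance, with \(Int(X)=2\), \(|X|=2\), \(Int(Y)=4\), \(|Y|=3\) and \(Dif(X,Y)=5\), the two relative scales are 6 and 3, so \(L(e)=6\).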

5 Experimental Results

In order to evaluate the adoption of the hGB method [9] as a tool for solving the object detection problem, we use the well-known PASCAL VOC 2007 dataset [6] with 20 object categories, which is also used in [2]. As in [2], during the supervised training for obtaining parameter values, the train+val subsets were used and only images belonging to 6 categories (bird, car, cat, cow, dog, and sheep) were considered, ignoring images marked as difficult. Therefore, 1,183 images from a total of 1,587 were selected for training. The SS measure was trained three times, i.e., one distinct training for each segmentation approach (GB, hGB, and SLIC) in order to obtain the best \(\theta _{SS}\) value for each one (for GB and hGB, \(\theta _{SS}\) represents the scale value used to obtain a partition, while for SLIC it is the number of superpixels).

During testing, the test subset of the dataset was used, selecting only images belonging to categories that were not used in training. According to [2], this is a way to show the effectiveness of the approach on unseen classes. In addition, images marked as difficult were also ignored during testing. Thus, 2,941 images from a total of 7,610 were selected. Since windows are chosen randomly for each new image, during the tests we use the same list of random windows to evaluate all combinations of measures for a given image.

Following the criterion adopted in the PASCAL VOC 2007 challenge [6], if \(W\) and \(O\) represent the window area to be tested and the object area in the ground-truth image, respectively, an object appearing in the ground-truth image is considered covered if \(|W \cap O|/|W \cup O|\) is greater than 0.5. To assess the results as in [2], we adopt the same test methodology: the evaluation was executed varying the number of random windows up to a limit of 1,000, and the detection rate DR (i.e., the percentage of objects covered by the random windows) was calculated. The evaluation was executed 10 times; average results are shown in Fig. 4, where the average DR values are presented for distinct combinations of measures as a function of the number of random windows (#WIN). The area under the curve (AUC) for each combination is also shown.
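The coverage criterion and the detection rate can be sketched as follows (Python, with boxes given as (y0, x0, y1, x1) tuples; the function names are our own conventions):

```python
def iou(win, obj):
    """PASCAL-style intersection-over-union for two axis-aligned boxes
    (y0, x0, y1, x1); an object is considered covered when iou > 0.5."""
    y0 = max(win[0], obj[0]); x0 = max(win[1], obj[1])
    y1 = min(win[2], obj[2]); x1 = min(win[3], obj[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(win) + area(obj) - inter)

def detection_rate(objects, windows, thr=0.5):
    """DR: fraction of ground-truth objects covered by at least one window."""
    covered = sum(any(iou(w, o) > thr for w in windows) for o in objects)
    return covered / len(objects)
```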

Fig. 4.
figure 4

Results – average DR \(\times \) #WIN.

As one can see in Fig. 4, the hGB method helps improve the object detection rate when 100 or more windows are used. Moreover, a large difference can be noticed at 1,000 windows, which may indicate that the segments obtained by hGB (which respects the causality and location principles) have an increasingly positive impact on the result as the number of windows grows. In contrast, the use of the SLIC method does not seem to have a significant impact on the detection rate when compared with GB for 100 or more windows.

As shown by the AUC values, hGB outperforms GB and SLIC by 11% and 6.25%, respectively. However, hGB (and also SLIC) presents a higher average execution time than GB – see Table 1.

Finally, a qualitative assessment indicates that hGB seems to be better for detecting more than one object in the same image. This can be seen in Fig. 5, where heatmaps are used to show the object probability distribution (warmer regions represent stronger evidence of object presence in the original image).

Table 1. Average execution time per image of each combination (in seconds).

6 Conclusion and Future Work

This paper proposed the use of a hierarchical graph-based segmentation to improve object detection. Experimental results have shown that the modified approach outperforms the previous state-of-the-art version by 11%, albeit with a higher average execution time. When compared to a state-of-the-art method for generating superpixels, the modified approach consumes almost the same amount of time per image while still exhibiting an improvement of 6.25% in detection rate.

Fig. 5.
figure 5

Comparison between results: (a) original image; (b) ground-truth; (c) random windows over image; (d) heatmap for MS+CC+SS_GB; (e) heatmap for MS+CC+SS_SLIC; and (f) heatmap for MS+CC+SS_HGB.

As future work, we plan to investigate the use of segments from different detail levels to produce the best score for a given window. This could help deal with partially occluded objects and with cluttered scenes.