1 Introduction

Many biomedical applications, such as phenotyping [1] and tracking [2], rely on instance segmentation, which aims not only to group pixels into semantic categories but also to segment individuals from the same category. This task is challenging because objects of the same class can get crowded together without obvious boundary clues.

A prevalent class of approaches used for biomedical images is based on semantic segmentation, obtaining instances through per-pixel classification [3, 4]. Although this approach generates good object coverage, crowded objects are often mistakenly regarded as one connected region. DCAN [4] predicts the object contour explicitly to separate touching glands. However, segmentation by contours is very unreliable in many cases, since a few misclassified pixels can break a continuous boundary.

Another major class of approaches, such as Mask-RCNN [7], refines the bounding boxes obtained from object detection methods [5, 6]. Object detection methods rely on non-maximum suppression (NMS) to remove duplicate predictions resulting from exhaustive search. This becomes problematic when the bounding boxes of two objects overlap with a large ratio: one valid object will be suppressed. A finer shape representation, star-convex polygons, is used by [8] with the intention of reducing false suppression. However, it is only suitable for roundish objects [8, 9].

Fig. 1.

(a) In images of repeated patterns, different pixels, such as X and Y, can have similar content in their receptive fields. (c)(e) demonstrate the convergence of the embedding loss on image (a) in a 3-dimensional space (background ignored). In (e), both local and global constraints form 3 clusters which are orthogonal to each other. While adjacencies A, B and C are well separated under local constraints, B and C belong to the same cluster under global constraints. The better discriminative property of local constraints is also reflected by the mean angle of neighbors (mAN). In addition, it is a desired property that distant objects, such as B and D, can occupy the same space.

In this work, we propose to get instances by grouping pixels based on an object-aware embedding. A deep neural network is trained to assign each pixel an embedding vector. Pixels of the same object will have similar directions in the embedding space, while spatially close objects are orthogonal to each other. Since our method performs pixel-level grouping, it is not affected by different object shapes and does not suffer from the false suppression problem. At the same time, it avoids the fusion of adjacent objects that affects semantic segmentation based methods.

Some recent research [10, 12, 13] proposes the use of embedding vectors to distinguish individual objects in driving scenes and natural images. These approaches force each object to occupy a different part of the embedding space. This global constraint is actually not necessary, and could even be detrimental, for biomedical images that often contain repeated local patterns. For example, the content in the receptive fields of pixels X and Y (Fig. 1(a)) is very similar, both with one object above and one below. The network has no clear clue to assign X and Y different embeddings. Forcing them to be different is likely to hinder training. Furthermore, the global constraint is inefficient in terms of embedding space utilization. There is no risk of distant objects being merged; thus, they can share the same embedding space, such as B and D in Fig. 1.

The main contributions of our work are as follows: (1) we propose to train the embedding mapping while only constraining adjacent objects to be different, (2) a novel loss with a clear geometric interpretation (adjacent instances live in orthogonal spaces), (3) a multi-task network head for embedding training and for obtaining segmentations from embeddings, which can be applied to any backbone network.

Our method is compared with several strong competing approaches. It yields comparable or better results on two data sets: a combined fluorescence microscopy data set of BBBC006 and the part of DSB2018 used by [8], and the CVPPP2017 leaf segmentation data set.

2 Method

Our approach has two output branches taking the same feature map as input: the embedding branch and the distance regression branch (Fig. 2). Both consist of two convolutional layers. The last layer of the embedding branch uses linear activation and each filter outputs one dimension of the embedding vector.

The distance regression branch has a single-channel output with ReLU activation. We regress the distance from an object pixel to the closest boundary pixel (normalized within each object). The distance map is used to help obtain segmentations from the embedding map; details are described in Sect. 2.2.

The background is treated as a standalone object that is adjacent to all other objects. For distance regression, background pixels are set to zero. It is worth mentioning that the distance map alone provides enough cues to separate objects. However, we argue that it is not optimal for obtaining accurate segmentations, since both object and background pixels have small values around the boundaries, which is ambiguous and sensitive to small perturbations. In this work, the distance regression plays the role of roughly locating the objects.
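The regression target can be precomputed from the instance labels. Below is a minimal sketch, assuming NumPy/SciPy and an integer label image; the function name and the use of the Euclidean distance transform are our choices for illustration, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def normalized_distance_map(labels):
    """labels: 2-D int array, 0 = background, 1..K = object ids."""
    dist = np.zeros(labels.shape, dtype=np.float32)
    for k in np.unique(labels):
        if k == 0:
            continue                       # background pixels stay at 0
        mask = labels == k
        # distance of every object pixel to the nearest non-object pixel
        d = distance_transform_edt(mask)
        d_max = d.max()
        if d_max > 0:
            dist[mask] = d[mask] / d_max   # normalize within the object
    return dist
```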

Fig. 2.

Our framework consists of two branches: the distance regression branch predicts the normalized distance from a pixel to the closest boundary, while the embedding branch is responsible for mapping the feature map to the embedding space. The distance map and embedding map are combined to get segmentations (Sect. 2.2). We visualize the embedding as RGB images, taking every 3 channels as one image. (Color figure online)

2.1 Loss Function

The training loss consists of two parts: \(L_{reg}\) and \(L_{emb}\), which supervise the learning of the distance regression branch and the embedding branch, respectively. We use \(\lambda _1=5\) to put more emphasis on the embedding training.

$$\begin{aligned} L=L_{reg}+\lambda _1 L_{emb} \end{aligned}$$

We minimize the mean squared error for the distance regression, with each pixel weighted to balance the foreground and background frequency.

Intuitively, embeddings of the same object should end up at similar positions in the embedding space, while different objects should be discriminable. So naturally, the embedding loss is formulated as the sum of two terms: the consistency term \(L_{con}\) and the discriminative term \(L_{dis}\).

To give a specific formula, we have to determine how “similarity” is measured. While Euclidean distance is used by many works [10, 11], we construct the loss with cosine distance, which decouples from the output range of different networks: \(D(e_i,e_j) = 1-\frac{e_i^Te_j}{\Vert e_i\Vert _2\Vert e_j\Vert _2}\), where \(e_i, e_j\in \mathbb {R} ^D\) are the embeddings of pixels i and j. The cosine distance ranges from 0, meaning exactly the same direction, to 2, meaning the opposite, with 1 indicating orthogonality.
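For illustration, the cosine distance can be written in a few lines of PyTorch; the function name and the epsilon guard are ours, not from the paper.

```python
import torch

def cosine_distance(ei, ej, eps=1e-8):
    """D(e_i, e_j) in [0, 2]: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    ei = ei / (ei.norm(dim=-1, keepdim=True) + eps)
    ej = ej / (ej.norm(dim=-1, keepdim=True) + eps)
    return 1.0 - (ei * ej).sum(dim=-1)
```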

Instead of pushing each object pair as far apart as possible [10, 11, 13] in the embedding space (global constraint), we only push adjacent objects into each other’s orthogonal space (local constraint). As shown in Fig. 1, far away objects can occupy the same position in the embedding space, which uses the space more effectively. In the embedding map in Fig. 2, only a few colors appear repeatedly, while adjacent objects still have different colors.

Let’s say that there are K objects within an image with \((M_1, M_2, \dots , M_K)\) pixels respectively. The loss can be written as follows:

$$\begin{aligned} L_{reg} =&~ \frac{1}{\sum _{k=1}^{K}M_k} \sum _{k=1}^{K} \sum _{p=1}^{M_k} w_p(d_p - \widehat{d}_p)^2\\ L_{emb} =&~ L_{con}+ L_{dis} \\ =&~ \frac{1}{\sum _{k=1}^{K}M_k} \sum _{k=1}^{K} \sum _{p=1}^{M_k} w_p D(e_p, u_k) + \frac{1}{K} \sum _{k=1}^{K} \frac{1}{|N_d(k)|}\sum _{n\in N_d(k)} |1-D(u_k, u_n)| \end{aligned}$$

where \(d_p\) and \(\widehat{d}_p\) are the regression output and ground truth of pixel p, \(e_p\) is the embedding of pixel p, \(u_k\) is the mean embedding (normalized) of object k, and \(w_p\) is the factor for balancing the foreground and background frequency. \(N_d(k)\) denotes the neighbors of object k. An object is considered a neighbor of object k if the shortest distance between them is less than d.
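A hedged sketch of how \(L_{con}\) and \(L_{dis}\) could be implemented follows, assuming a flattened per-pixel embedding tensor, per-pixel object ids and weights, and a precomputed neighbor list \(N_d(k)\); this is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn.functional as F

def embedding_loss(emb, obj_ids, weights, neighbors):
    """emb: (P, D); obj_ids: (P,) long; weights: (P,); neighbors: {k: [ids]}."""
    ids = torch.unique(obj_ids)
    # mean embedding u_k of each object, normalized to unit length
    means = {int(k): F.normalize(emb[obj_ids == k].mean(dim=0), dim=0) for k in ids}

    emb_n = F.normalize(emb, dim=1)
    # consistency term: weighted cosine distance of each pixel to its object mean
    u = torch.stack([means[int(k)] for k in obj_ids])      # (P, D)
    d_pix = 1.0 - (emb_n * u).sum(dim=1)                   # D(e_p, u_k)
    l_con = (weights * d_pix).sum() / weights.numel()

    # discriminative term: push adjacent objects towards orthogonality (D = 1)
    l_dis = 0.0
    for k in means:
        if not neighbors.get(k):
            continue
        d_kn = torch.stack([1.0 - means[k] @ means[n] for n in neighbors[k]])
        l_dis = l_dis + (1.0 - d_kn).abs().mean()          # |1 - D(u_k, u_n)|
    return l_con + l_dis / max(len(means), 1)
```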

2.2 Postprocessing

Since objects form clusters in the embedding space, a clustering method that does not require the number of clusters to be specified (e.g. mean shift [14]) can be employed to obtain segmentations from the embedding. However, due to the time complexity of mean shift, even processing medium-size images takes tens of seconds. Since our embedding space has a good geometric interpretation, we propose a simple but effective way to obtain segmentations:

  1. Threshold the distance map to get the central region of each object. We use \(T_c=0.7\) in our experiments.

  2. Compute the mean embedding \(u_k\) of each seed region.

  3. Iteratively perform morphological dilation with a \(3\times 3\) kernel. A frontier pixel with embedding \(e_i\) is included in object k if it is not yet assigned to another object and \(D(e_i, u_k)\) is smaller than \(T_e=0.3\).

  4. Stop when no new pixels are included.

Threshold \(T_e\) is determined based on the fact that a pixel embedding should be closer to the ground truth object than to any other in terms of angle. Thus, we set the midpoint \(45^{\circ }\) as the boundary, \(T_e=1-\cos ({45}^{\circ })\approx 0.3\).
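The four steps above could be implemented roughly as follows. This is an illustrative sketch assuming NumPy/SciPy inputs `dist` of shape H x W and `emb` of shape H x W x D; the seed labeling via connected components and the per-object growing loop are simplifications, not the authors' code.

```python
import numpy as np
from scipy.ndimage import label, binary_dilation

def segment(dist, emb, t_c=0.7, t_e=0.3):
    emb = emb / (np.linalg.norm(emb, axis=-1, keepdims=True) + 1e-8)
    seeds, n_obj = label(dist > t_c)                  # step 1: central regions
    out = seeds.copy()
    means = [emb[seeds == k].mean(axis=0) for k in range(1, n_obj + 1)]
    means = [m / (np.linalg.norm(m) + 1e-8) for m in means]   # step 2: u_k

    changed = True
    while changed:                                    # steps 3-4: grow objects
        changed = False
        for k in range(1, n_obj + 1):
            dilated = binary_dilation(out == k, structure=np.ones((3, 3), bool))
            frontier = dilated & (out == 0)           # not yet assigned
            d = 1.0 - emb[frontier] @ means[k - 1]    # cosine distance to u_k
            accept = np.zeros(out.shape, bool)
            accept[frontier] = d < t_e
            if accept.any():
                out[accept] = k
                changed = True
    return out
```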

3 Results

3.1 Data Sets and Evaluation Metrics

In order to compare different methods, we chose two data sets that reflect typical phenomena in biomedical images:

BBBC006 + PartDSB2018: We combined the fluorescence microscopy images of cells used by [8] (part of DSB2018) and BBBC006. BBBC006 is a larger data set containing more densely distributed cells. We removed a small number of images without objects or with obvious labeling mistakes. The data were randomly split into 1003 training images and 230 test images. The evaluation metric was the average precision (AP) over a range of IoU (intersection over union) thresholds from 0.5 to 0.95.
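For reference, a rough sketch of this metric under our reading (DSB2018-style AP with instances matched by IoU above each threshold) is given below; since matches with IoU above 0.5 are mutually exclusive, a simple count suffices. This is not an official implementation.

```python
import numpy as np

def average_precision(ious, thresholds=np.arange(0.5, 1.0, 0.05)):
    """ious: (n_pred, n_gt) pairwise IoU matrix of predicted vs. GT instances."""
    scores = []
    for t in thresholds:
        match = ious > t
        # each prediction/GT can match at most one counterpart when IoU > 0.5
        tp = min(match.any(axis=1).sum(), match.any(axis=0).sum())
        fp = ious.shape[0] - tp
        fn = ious.shape[1] - tp
        scores.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(scores))
```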

CVPPP2017: Compared to the roundish cells, the leaves in CVPPP2017 have more complex shapes and exhibit more overlap or contact. We randomly sampled 648 images for training and 162 images for testing. The results were evaluated in terms of symmetric best dice (SBD), foreground-background dice (FBD), difference in count (DiC) and absolute DiC [1].

Fig. 3.

Qualitative results of the cell segmentation and leaf segmentation for four approaches. In the first row, correct matches (\(IoU\ge 0.6\)) are highlighted in blue, while false positives are marked in red. The second row shows the leaf segmentation results with color-coded instances. (Color figure online)

3.2 Competing Methods

Unet: We employed the widely used Unet [3] to perform 3-label segmentation (object, contour, background). Since many objects are in contact, we introduced a 2-pixel boundary to separate them.
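The text does not specify how the 2-pixel boundary class is generated; one plausible way, sketched below with SciPy morphology, is to erode each instance mask and label the removed rim as the contour class. This is an assumption for illustration, not the authors' procedure.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def three_label_gt(labels):
    """labels: 2-D int array, 0 = background, 1..K = object ids."""
    gt = np.zeros(labels.shape, dtype=np.uint8)       # 0 = background class
    for k in np.unique(labels):
        if k == 0:
            continue
        mask = labels == k
        inner = binary_erosion(mask, iterations=2)    # shrink by ~2 pixels
        gt[inner] = 1                                  # object class
        gt[mask & ~inner] = 2                          # ~2-pixel contour class
    return gt
```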

Mask-RCNN: Mask-RCNN [7] localizes objects by proposal classification and non-max suppression (NMS). Afterwards, segmentation is performed on each object bounding box. We generated 1000 proposals with anchor scales (8, 16, 32, 64, 128) for the cell data set and 50 proposals with scales (16, 32, 64, 128, 256) for the leaf data set. The NMS threshold was set to 0.9 for both data sets.

Stardist: Star-convex polygons are used by [8] as a finer shape representation. Without an explicit segmentation step, the final segmentation is obtained by combining distances from center to boundary in 32 radial directions. The final step of Stardist consists of NMS to suppress overlapping polygons.

For comparability, all methods except Mask-RCNN used a simplified U-net [8] (3 pooling and 3 upsampling layers) as the backbone network and were trained from scratch. Mask-RCNN (ResNet-101 [15] backbone) was fine-tuned from a model pretrained on the MS COCO data set.

Table 1. Average precision (AP) for different IoU thresholds on the cell data set. Different values of d for defining neighbors (Sect. 2.1) are tested (-d10, -d30 and -d100). To highlight the effect of the local constraint, a 4-dimensional embedding is trained additionally.

3.3 Results and Discussion

The Unet had the lowest mean AP in Table 1. The AP value decreased rapidly with increasing IoU because of the false fusion of adjacent cells. Both Stardist and Mask-RCNN can handle most adjacent objects, but when a few cells form a tight roundish cluster, both methods are likely to fail. Mask-RCNN yielded the best score in the high IoU range, which is the benefit of an explicit segmentation step: masks are better aligned with the object boundary. Qualitative results in Fig. 3 show that our method is better at distinguishing objects that are in contact. This is also reflected by the highest AP of our method for \(IoU < 0.7\).

The leaf segmentation results better reflect the characteristics of each approach. As shown in Fig. 3, the Unet outlines the leaves accurately, but merges several instances into one (green and yellow). All other approaches proved to be object-aware. However, Mask-RCNN missed leaf B, because the bounding box of B is almost identical to that of A. Stardist avoids such false suppression by using a better shape representation, which comes at the expense of losing finer structures, such as the petioles. This is easy to understand, since Stardist obtains a mask by fitting a polygon based on discrete radial directions. In contrast, our method not only avoids such misses, but also produces good contours.

Local vs. Global Constraint: To demonstrate the effect of the local constraint, we tested the method with different d: larger d treats more objects as neighbors (a large enough d is equivalent to the global constraint). The best result is always achieved at \(d=10\), which only takes objects in contact or almost in contact as neighbors. In the case of dimension 4, the performance drop on the cell data set is especially significant at \(d=100\) due to the inefficient use of the embedding space. The same drop happens at \(d=30\) on the leaf segmentation data set.
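The neighbor set \(N_d(k)\) can be computed directly from the label image; one straightforward (if not the most efficient) sketch based on the distance transform is given below. Whether the background is included as an extra "object" is omitted here, and the names are illustrative.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def neighbor_sets(labels, d=10):
    """Return {k: [n, ...]} with n a neighbor of k if their masks are closer than d."""
    ids = [k for k in np.unique(labels) if k != 0]
    nbrs = {k: [] for k in ids}
    for k in ids:
        # distance of every pixel to the nearest pixel of object k
        dist_k = distance_transform_edt(labels != k)
        for n in ids:
            if n != k and dist_k[labels == n].min() < d:
                nbrs[k].append(n)
    return nbrs
```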

Incomplete Object Mask: Inconsistent embeddings within an object (Fig. 4) sometimes occur near the boundary, leading to incomplete segmentations. This is why our method does not perform as well as Mask-RCNN in the high IoU range. The reason for this inconsistency deserves further study.

Table 2. Evaluation results on the CVPPP2017 data set. See Sect. 3.2 for method name abbreviations.
Fig. 4.

Embeddings within the same object are not completely consistent (white arrows) in some cases. (Color figure online)

4 Conclusion and Outlook

Our proposed approach not only outlines objects accurately, but is also free from false object suppression and object fusion. The local constraint (orthogonality of neighboring objects) makes full use of the embedding space and gives a good geometric interpretation. Our method is especially attractive for images containing a large number of objects that are repeated and in contact, and yields state-of-the-art results even with a lightweight backbone network.

Since our approach generates embeddings that live in orthogonal spaces, if this space can be aligned with the standard basis by rotation, segmentations can be obtained directly from the embeddings. An alternative approach to bypass postprocessing would be to add sparsity constraints on the embedding vector during training. We will test the feasibility of these two methods in the future.