Abstract
Automatic instance segmentation is a problem that occurs in many biomedical applications. State-of-the-art approaches either perform semantic segmentation or refine object bounding boxes obtained from detection methods. Both suffer from crowded objects to varying degrees, merging adjacent objects or suppressing a valid object. In this work, we assign an embedding vector to each pixel through a deep neural network. The network is trained to output embedding vectors of similar directions for pixels from the same object, while adjacent objects are orthogonal in the embedding space, which effectively avoids the fusion of objects in a crowd. Our method yields state-of-the-art results even with a lightweight backbone network on a cell segmentation data set (BBBC006 + DSB2018) and a leaf segmentation data set (CVPPP2017). The code and model weights are publicly available (https://github.com/looooongChen/instance_segmentation_with_pixel_embeddings/).
This work was supported by the Deutsche Forschungsgemeinschaft (Research Training Group 2416 MultiSenses-MultiScales).
1 Introduction
Many biomedical applications, such as phenotyping [1] and tracking [2], rely on instance segmentation, which aims not only to group pixels in semantic categories but also to segment individuals from the same category. This task is challenging because objects of the same class can get crowded together without obvious boundary clues.
A prevalent class of approaches used for biomedical images is based on semantic segmentation, obtaining instances through per-pixel classification [3, 4]. Although this approach generates good object coverage, crowded objects are often mistakenly regarded as one connected region. DCAN [4] predicts the object contour explicitly to separate touching glands. However, segmentation by contours is very unreliable in many cases, since a few misclassified pixels can break a continuous boundary.
Another major class of approaches, such as Mask-RCNN [7], refines the bounding boxes obtained from object detection methods [5, 6]. Object detection methods rely on non-maximum suppression (NMS) to remove duplicate predictions resulting from exhaustive search. This becomes problematic when the bounding boxes of two objects overlap with a large ratio: one valid object will be suppressed. A finer shape representation, star-convex polygons, is used by [8] with the intention of reducing false suppression. However, it is only suitable for roundish objects [8, 9].
In this work, we propose to obtain instances by grouping pixels based on an object-aware embedding. A deep neural network is trained to assign each pixel an embedding vector. Pixels of the same object have similar directions in the embedding space, while spatially close objects are orthogonal to each other. Since our method performs pixel-level grouping, it is not affected by different object shapes and it does not suffer from the false suppression problem. Unlike semantic segmentation based methods, it also avoids the fusion of adjacent objects.
Some recent research [10, 12, 13] proposes the use of embedding vectors to distinguish individual objects in driving scenes and natural images. These approaches force each object to occupy a different part of the embedding space. This global constraint is actually not necessary, and can even be detrimental, for biomedical images that often contain repeated local patterns. For example, the contents of the receptive fields of pixels X and Y (Fig. 1(a)) are very similar, both with one object above and one below. The network has no clear clue to assign X and Y different embeddings; forcing them to be different is likely to hinder training. Furthermore, the global constraint is inefficient in terms of embedding space utilization. There is no risk of distant objects being merged, thus they can share the same embedding space, such as B and D in Fig. 1.
The main contributions of our work are as follows: (1) we propose to train the embedding mapping while only constraining adjacent objects to be different, (2) a novel loss with a clear geometric interpretation (adjacent instances live in orthogonal spaces), (3) a multi-task network head for embedding training and for obtaining segmentations from embeddings, which can be attached to any backbone network.
Our method is compared with several strong competing approaches. It yields comparable or better results on two data sets: a combined fluorescence microscopy data set consisting of BBBC006 and the part of DSB2018 used by [8], and the CVPPP2017 leaf segmentation data set.
2 Method
Our approach has two output branches taking the same feature map as input: the embedding branch and the distance regression branch (Fig. 2). Both consist of two convolutional layers. The last layer of the embedding branch uses linear activation and each filter outputs one dimension of the embedding vector.
The distance regression branch has a single-channel output with ReLU activation. We regress the distance from an object pixel to the closest boundary pixel (normalized within each object). The distance map is used to help obtain segmentations from the embedding map; details are given in Sect. 2.2.
The background is treated as a standalone object that is adjacent to all other objects. For distance regression, background pixels are set to zero. It is worth mentioning that the distance map alone already provides enough cues to separate objects. However, we argue that it is not optimal for obtaining accurate segmentations, since both object and background pixels have small values around the boundaries, which is ambiguous and sensitive to small perturbations. In this work, the distance regression plays the role of roughly locating the objects.
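As an illustration only (not the authors' implementation), a PyTorch-style sketch of such a two-branch head might look like this; the intermediate activation, channel counts and embedding dimension are assumptions:

```python
import torch
import torch.nn as nn

class InstanceHead(nn.Module):
    """Two-branch head on top of a shared backbone feature map.

    - embedding branch: two conv layers, the last with linear activation,
      one output channel per embedding dimension
    - distance branch: two conv layers, single-channel output with ReLU,
      regressing the normalized distance to the closest boundary pixel
    """
    def __init__(self, in_channels=64, embedding_dim=8):
        super().__init__()
        self.embedding = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, embedding_dim, 3, padding=1),  # linear activation
        )
        self.distance = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 1, 3, padding=1),
            nn.ReLU(inplace=True),  # distances are non-negative
        )

    def forward(self, features):
        return self.embedding(features), self.distance(features)
```

Such a head can be attached to any backbone that outputs a dense feature map at the desired resolution.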
2.1 Loss Function
The training loss \(L = L_{reg} + \lambda _1 L_{emb}\) consists of two parts: \(L_{reg}\) and \(L_{emb}\), which supervise the learning of the distance regression branch and the embedding branch, respectively. We use \(\lambda _1=5\) to put more emphasis on the embedding training.
We minimize the mean squared error for the distance regression, with each pixel weighted to balance the foreground and background frequency.
Intuitively, embeddings of the same object should end up at similar positions in the embedding space, while different objects should be discriminable. So naturally, the embedding loss is formulated as the sum of two terms: the consistency term \(L_{con}\) and the discriminative term \(L_{dis}\).
To give a specific formula, we have to determine how “similarity” is measured. While Euclidean distance is used by many works [10, 11], we construct the loss with cosine distance, which is decoupled from the output range of different networks: \(D(e_i,e_j) = 1-\frac{e_i^Te_j}{\Vert e_i\Vert _2\Vert e_j\Vert _2}\), where \(e_i, e_j\in \mathbb {R} ^D\) are the embeddings of pixels i and j. The cosine distance ranges from 0, meaning exactly the same direction, to 2, meaning opposite directions, with 1 indicating orthogonality.
Instead of pushing each object pair as far apart as possible [10, 11, 13] in the embedding space (global constraint), we only push adjacent objects into each other’s orthogonal space (local constraint). As shown in Fig. 1, far away objects can occupy the same position in the embedding space, which uses the space more effectively. In the embedding map in Fig. 2, only a few colors appear repeatedly, while adjacent objects still have different colors.
Suppose there are K objects within an image with \((M_1, M_2, \dots , M_K)\) pixels, respectively. \(L_{reg}\) is the pixel-weighted mean squared error between the predicted and ground-truth distance maps, \(L_{con}\) averages the cosine distance between each pixel embedding and the mean embedding of its object, and \(L_{dis}\) penalizes mean embeddings of neighboring objects that deviate from orthogonality,
where \(d_p\) and \(\widehat{d}_p\) are the regression output and ground truth of pixel p, \(e_p\) is the embedding of pixel p, \(u_k\) is the mean embedding (normalized) of object k, and \(w_p\) is the factor for balancing the foreground and background frequency. \(N_d(k)\) denotes the neighbors of object k. An object is considered a neighbor of object k if its shortest distance to object k is less than d.
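The exact formulas are given in the paper and the published code; purely to illustrate the ingredients described above, a NumPy sketch of the two loss parts might look as follows (the hinge form of the discriminative term and the normalization constants are assumptions, not necessarily the authors' exact formulation):

```python
import numpy as np

def cosine_distance(a, b):
    """D(a, b) = 1 - cos(angle); 0 = same direction, 1 = orthogonal, 2 = opposite."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - np.sum(a * b, axis=-1)

def distance_loss(pred, target, w):
    """Pixel-weighted MSE for the distance regression branch (L_reg)."""
    return np.mean(w * (pred - target) ** 2)

def embedding_loss(emb, labels, neighbors):
    """emb: (H, W, D) pixel embeddings, labels: (H, W) instance ids
    (0 = background, treated as a standalone object adjacent to all others),
    neighbors: dict mapping instance id k to its neighbor set N_d(k)."""
    ids = list(np.unique(labels))
    # mean embedding (normalized) per object
    mu = {}
    for k in ids:
        m = emb[labels == k].mean(axis=0)
        mu[k] = m / (np.linalg.norm(m) + 1e-8)

    l_con, l_dis = 0.0, 0.0
    for k in ids:
        # consistency: pixel embeddings should align with their object's mean
        l_con += cosine_distance(emb[labels == k], mu[k]).mean()
        # discriminative: neighboring objects should be (at least) orthogonal,
        # i.e. cosine distance >= 1 (hinge form is an assumption)
        for j in neighbors.get(k, ()):
            l_dis += max(0.0, 1.0 - cosine_distance(mu[k], mu[j]))
    return (l_con + l_dis) / len(ids)
```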
2.2 Postprocessing
Since objects form clusters in the embedding space, a clustering method that does not require specifying the number of clusters (e.g. mean shift [14]) can be employed to obtain segmentations from the embedding. However, due to the time complexity of mean shift, even processing medium-size images takes tens of seconds. Since our embedding space has a clear geometric interpretation, we propose a simple but effective way to obtain segmentations:
1. Threshold the distance map to obtain the central (seed) region of each object. We use \(T_c=0.7\) in our experiments.
2. Compute the mean embedding \(u_k\) of each seed region.
3. Iteratively perform morphological dilation with a \(3\times 3\) kernel. A frontier pixel with embedding \(e_i\) is added to object k if it is not yet assigned to another object and \(D(e_i, u_k)\) is smaller than \(T_e=0.3\).
4. Stop when no new pixels are included.
Threshold \(T_e\) is determined based on the fact that a pixel embedding should be closer, in terms of angle, to its ground truth object than to any other object. Thus, we set the midpoint \(45^{\circ }\) as the boundary, \(T_e=1-\cos ({45}^{\circ })\approx 0.3\).
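A minimal NumPy/SciPy sketch of this seeded-dilation procedure could look as follows (a simplification for illustration; \(T_c\), \(T_e\) and the seed extraction follow the steps above, everything else is an assumption):

```python
import numpy as np
from scipy import ndimage

def segment(emb, dist, t_c=0.7, t_e=0.3):
    """emb: (H, W, D) pixel embeddings, dist: (H, W) predicted distance map."""
    emb = emb / (np.linalg.norm(emb, axis=-1, keepdims=True) + 1e-8)
    # 1. threshold the distance map to obtain seed regions (object centers)
    seeds, n = ndimage.label(dist > t_c)
    labels = seeds.copy()
    if n == 0:
        return labels
    # 2. mean embedding (normalized) of each seed region
    mu = np.stack([emb[seeds == k].mean(axis=0) for k in range(1, n + 1)])
    mu = mu / (np.linalg.norm(mu, axis=-1, keepdims=True) + 1e-8)
    # 3./4. iteratively dilate each object with a 3x3 kernel, claiming unassigned
    #       frontier pixels whose cosine distance to the object's mean is below t_e
    struct = np.ones((3, 3), dtype=bool)
    changed = True
    while changed:
        changed = False
        for k in range(1, n + 1):
            frontier = ndimage.binary_dilation(labels == k, struct) & (labels == 0)
            if not frontier.any():
                continue
            d = 1.0 - np.sum(emb[frontier] * mu[k - 1], axis=-1)
            accept = d < t_e
            if accept.any():
                ys, xs = np.nonzero(frontier)
                labels[ys[accept], xs[accept]] = k
                changed = True
    return labels
```

In use, `segment(embedding_map, distance_map)` returns an instance label map with 0 as background.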
3 Results
3.1 Data Sets and Evaluation Metrics
In order to compare different methods, we chose two data sets that reflect typical phenomena in biomedical images:
BBBC006 + PartDSB2018: We combined the fluorescence microscopy images of cells used by [8] (part of DSB2018) with BBBC006. BBBC006 is the larger data set, containing more densely distributed cells. We removed a small number of images without objects or with obvious labeling mistakes. The data were randomly split into 1003 training images and 230 test images. The evaluation metric is the average precision (AP) over a range of IoU (intersection over union) thresholds from 0.5 to 0.95.
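For reference, the per-threshold score in the DSB2018 convention counts a predicted instance as a true positive when its IoU with a ground-truth instance exceeds the threshold and uses precision = TP/(TP+FP+FN); a simplified sketch (greedy matching, details such as tie-breaking may differ from the official evaluation) might look like this:

```python
import numpy as np

def score_at_threshold(pred, gt, t):
    """pred, gt: (H, W) instance label maps (0 = background).
    Returns TP / (TP + FP + FN) at IoU threshold t."""
    pred_ids = [i for i in np.unique(pred) if i != 0]
    gt_ids = [i for i in np.unique(gt) if i != 0]
    matched, tp = set(), 0
    for p in pred_ids:
        pm = pred == p
        for g in gt_ids:
            if g in matched:
                continue
            gm = gt == g
            inter = np.logical_and(pm, gm).sum()
            union = np.logical_or(pm, gm).sum()
            if union > 0 and inter / union > t:
                tp += 1
                matched.add(g)
                break
    fp, fn = len(pred_ids) - tp, len(gt_ids) - tp
    return tp / max(tp + fp + fn, 1)

def average_precision(pred, gt, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Mean score over IoU thresholds 0.5, 0.55, ..., 0.95."""
    return float(np.mean([score_at_threshold(pred, gt, t) for t in thresholds]))
```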
CVPPP2017: Compared to the roundish cells, the leaves in CVPPP2017 have more complex shapes and exhibit more overlap or contact. We randomly sampled 648 images for training and 162 images for testing. The results were evaluated in terms of symmetric best dice (SBD), foreground-background dice (FBD), difference in count (DiC) and absolute DiC [1].
3.2 Competing Methods
Unet: We employed the widely used Unet [3] to perform 3-label segmentation (object, contour, background). Since many objects are in contact, we introduced a 2-pixel boundary to separate them.
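Such a 3-class target can be derived from the instance labels; the sketch below is one plausible way to do it (the use of scikit-image's `find_boundaries` with `mode='thick'`, which marks pixels on both sides of an instance edge and thus yields a roughly 2-pixel boundary, is an assumption, not the exact procedure used for training):

```python
import numpy as np
from skimage.segmentation import find_boundaries

def three_class_target(instances):
    """Convert an instance label map (0 = background) into a 3-class target:
    0 = background, 1 = object, 2 = boundary (roughly 2 px wide)."""
    target = (instances > 0).astype(np.uint8)            # object vs. background
    boundary = find_boundaries(instances, mode='thick')  # both sides of each edge
    target[boundary] = 2
    return target
```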
Mask-RCNN: Mask-RCNN [7] localizes objects by proposal classification and non-max suppression (NMS). Afterwards, segmentation is performed on each object bounding box. We generated 1000 proposals with anchor scales (8, 16, 32, 64, 128) for the cell data set and 50 proposals with scales (16, 32, 64, 128, 256) for the leaf data set. The NMS threshold was set to 0.9 for both data sets.
Stardist: Star-convex polygons are used by [8] as a finer shape representation. Without an explicit segmentation step, the final segmentation is obtained by combining distances from center to boundary in 32 radial directions. The final step of Stardist consists of NMS to suppress overlapping polygons.
For comparability, all methods except Mask-RCNN used a simplified U-net [8] (3 pooling and 3 upsampling layers) as the backbone network and were trained from scratch. Mask-RCNN (ResNet-101 [15] backbone) was fine-tuned from a model pretrained on the MS COCO data set.
3.3 Results and Discussion
The Unet had the lowest mean AP in Table 1. Its AP decreased rapidly with increasing IoU threshold because of the false fusion of adjacent cells. Both Stardist and Mask-RCNN can handle most adjacent objects, but when a few cells form a tight roundish cluster, both methods are likely to fail. Mask-RCNN yielded the best score in the high IoU range, which is the benefit of an explicit segmentation step: masks are better aligned with the object boundary. Qualitative results in Fig. 3 show that our method is better at distinguishing objects that are in contact. This is also reflected by the highest AP of our method for \(IoU < 0.7\).
The leaf segmentation results better reflect the characteristics of each approach. As shown in Fig. 3, the Unet outlines the leaves accurately, but merges several instances into one (green and yellow). All other approaches proved to be object-aware. However, Mask-RCNN missed leaf B, because the bounding box of B is almost identical to that of A. Stardist avoids such false suppression by using a better shape representation, which comes at the expense of losing finer structures, such as the petioles. This is easy to understand, since Stardist obtains a mask by fitting a polygon based on discrete radial directions. In contrast, our method not only avoids misses, but also produces a good contour.
Local vs. Global Constraint: To demonstrate the effect of the local constraint, we tested the method with different d: a larger d treats more objects as neighbors (a sufficiently large d is equivalent to the global constraint). The best result is always achieved at \(d=10\), which only takes objects in contact or almost in contact as neighbors. With embedding dimension 4, the performance drop on the cell data set is especially significant at \(d=100\) due to the inefficient use of the embedding space. The same drop happens at \(d=30\) on the leaf segmentation data set.
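How the neighbor set \(N_d(k)\) is computed is not spelled out here; one straightforward way (an assumption, shown only for illustration) is to dilate each object mask by d pixels and collect all other objects it then overlaps:

```python
import numpy as np
from scipy import ndimage

def neighbors_within(labels, d):
    """labels: (H, W) instance ids (0 = background).
    Returns {k: set of instance ids whose distance to object k is less than d},
    using d iterations of binary dilation as a distance approximation."""
    out = {}
    for k in np.unique(labels):
        if k == 0:
            continue
        grown = ndimage.binary_dilation(labels == k, iterations=int(d))
        out[k] = set(np.unique(labels[grown])) - {0, k}
    return out
```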
Incomplete Object Mask: Inconsistent embeddings within an object (Fig. 4) sometimes occur near the boundary, leading to incomplete segmentations. This is why our method does not perform as well as Mask-RCNN in the high IoU range. The reason for this inconsistency deserves further study.
4 Conclusion and Outlook
Our proposed approach not only outlines objects accurately, but is also free from false object suppression and object fusion. The local constraint (orthogonality of neighboring objects) makes full use of the embedding space and gives it a clear geometric interpretation. Our method is especially attractive for images containing a large number of repeated objects in contact, and yields state-of-the-art results even with a lightweight backbone network.
Since our approach generates embeddings that live in orthogonal spaces, segmentations could be obtained directly from the embeddings if this space could be aligned with the standard basis by a rotation. An alternative way to bypass postprocessing would be to add sparsity constraints on the embedding vectors during training. We will test the feasibility of these two approaches in the future.
References
Scharr, H., et al.: Leaf segmentation in plant phenotyping: a collation study. Mach. Vis. Appl. 27(4), 585–606 (2016)
Ulman, V., et al.: An objective comparison of cell-tracking algorithms. Nat. Methods 14, 1141–1152 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Chen, H., Qi, X., Yu, L., Dou, Q., Qin, J., Heng, P.A.: DCAN: deep contour-aware networks for accurate gland segmentation. In: 2016 CVPR, pp. 2487–2496 (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: 28th NIPS, pp. 91–99 (2015)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 ICCV, pp. 2980–2988 (2017)
Schmidt, U., Weigert, M., Broaddus, C., Myers, G.: Cell detection with star-convex polygons. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 265–273. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_30
Jetley, S., Sapienza, M., Golodetz, S., Torr, P.H.: Straight to shapes: real-time detection of encoded shapes. In: 2017 CVPR, pp. 4207–4216 (2017)
De Brabandere, B., Neven, D., Van Gool, L.: Semantic instance segmentation with a discriminative loss function. CoRR (2017)
Fathi, A., et al.: Semantic instance segmentation via deep metric learning. CoRR (2017)
De Brabandere, B., Neven, D., Van Gool, L.: Semantic instance segmentation for autonomous driving. In: 2017 CVPR Workshop, pp. 478–480 (2017)
Kong, S., Fowlkes, C.C.: Recurrent pixel embedding for instance grouping. In: CVPR, pp. 9018–9028 (2018)
Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 CVPR, pp. 770–778 (2016)