Keywords

1 Introduction

1.1 Problem Definition

Automatic cell counting is to obtain the number of certain types of cells in a medical image like microscopic images. It is of great interest to a wide range of medical scenarios [2, 4, 23]. An example is the diagnosis and treatment of breast cancer, which is one of the most common female diseases leading to death worldwide. The number of proliferating (e.g. Ki67 positive) tumor cells is an important index associated with the severity of disease clinically. One available method of quantization involves counting the nuclei of proliferating cells using traditional image analysis techniques on a microscopic image. However, it has been proven to be challenging because of inability to distinguish tumor cells from surrounding normal tissue like vessels, fat and fibrous tissue [11], especially in reality the resolution of input medical image could be very high, at the same time the target cells could easily be extremely dense. Consequently, it is quite difficult to manually count target cells one by one. This is the principal motivation of automatic cell counting.

1.2 Background

From the perspective of computer vision community, automatic cell counting is a branch task of the object counting problem. Many methods have chosen to fulfill object counting task following the detection pipeline [17, 20, 21, 27, 30, 32]. In this case, an object detection framework is designed to localize each object (e.g. cell, head or vehicle) one by one, after that a counter naturally takes all the detected objects and produces the final count. Following the success of deep learning applied in computer vision like detection, segmentation and localization [13, 14, 33], most recent counting-by-detection works choose learning based approach, where each training object has annotation information, sometimes dot annotation indicating the centroid of the object [8, 18, 31], sometimes bounding-box annotation around the object [30]. However, it is well known that the problem of detecting and localizing individual object instance is far from being solved, especially in real world application of cell counting, where the cell density can be extremely high [7, 15, 29]. For example, the number of cells can easily reach or exceed thousands per image, and the cells also show huge variations in terms of type, size, shape and appearance etc. Another related work is Fully Convolutional Neural Network (FCNN) [26], which has remarkable result in semantic segmentation and spatial density prediction (in some way, object counting can be seen as an integration over spatial density prediction). To build the end-to-end and pixel-to-pixel FCNN, its training phase requires the pixel-wise annotation, which is strongly supervised information and gives much benefit to FCNN, consequently work like [9] is proposed to compensate.

Fig. 1.
figure 1

Counting-by-detection framework, most of which requires the pixel-wise annotation, is not a good choice in cell counting task, where cells could be extremely dense.

Figure 1 shows a cell image example with its two level of annotations: (a) pixel-wise annotation, where the nuclei of each cell is dot-annotated; (b) global count, where the total number of cell in the image is provided. Most counting-by-detection framework takes the strongly supervised pixel-wise annotation as input during training, and then generate global count (i.e. the total number of target cells) for test image. However, the automatic cell counting task is to predict a global count only, thus it is a limitation and also unnecessary to design an object counting framework relying on the more expensive annotation data, which will make system unpractical in the real-world application of cell counting.

Fig. 2.
figure 2

Cell counting is better to be modeled as a regression problem rather than a classification problem.

Here we expand the scope of our related work study from purely cell counting to all the object counting problem. Object (including cell, people, vehicle, etc.) counting task [3, 7, 8, 15, 18, 19, 24, 29, 35] benefits a lot from the superior performance of Convolutional Neural Network (CNN) in recent years. Several work [3, 7, 15] exploit CNN to enable their counting system work in dense and crowded environment. While [24, 29] focus on extracting deep features from pre-trained or fine-tuned CNN models. However a multi-class classification CNN is usually trained and used. That means the counting problem is cast as a classification problem, where the cell count is treated as class ID, and images with the same number of objects are seen as belonging to the same class. During its test phase, each test image is predicted with an integer class label (for example: 25, 26 or 53 in Fig. 2), which indicates the number of objects in the image. However as we know in object classification task, training images of different categories (for example: cat, bicycle and airplane) are independent to each other, and softmax loss function is used in classification CNN architectures [13, 14]. Actually, a classification-orientated CNN model treats cell counts 25 and 53 as far apart as counts 25 and 26. However from the nature of cell counting, the distance between different cell counts is very important. It is beneficial to treat the object counting task as a regression task [29], where counts 25 and 26 should be closer and that could be correctly reflected in the regression setting.

Taking all the discussion above into consideration, the advantages of cell counting approach proposed in this paper can be summarized into the following aspects:

  1. 1.

    To capture the relationship between RGB cell image and its overall cell count, we cast the cell counting task as developing an end-to-end regression framework, which is more suitable for counting task compared to counting by detection. Additionally, instead of being applied for classification purpose, a convolutional neural network architecture with Euclidean loss function is used for regression.

  2. 2.

    As the final output, the proposed approach not only estimate the global number of certain cells in an image but also produce the spatial density prediction, which is able to describe the local cell density of an image sub-region. In many clinical imaging systems [10, 28, 34], researchers have confirmed that the topographic map that illustrate the cell density distribution is a valuable tool correlated with the disease diagnose and treatment.

  3. 3.

    We utilize several mainstream CNN architectures (including the Deep Residual Network [13], AlexNet [14]) into our regression model. To the best of our knowledge, this is the first piece of work to expand the deep residual network from classification, detection, segmentation to object counting.

2 The Proposed Framework

2.1 System Overview

In this section, we give a general overview on the proposed approach, details of every part are provided in the following sections. In this paper, we propose a supervised learning framework for cell counting task shown in Fig. 3.

In the training phase, a Convolutional Neural Network (CNN) is utilized to build a regression model between image patch and its cell count number. We employ several kinds of CNN architectures and use Euclidean loss function during training, to enable the regression model fit for the cell counting task. To prepare the training data, we generate a large number of square patches from every training image. Along with each training patch, there is a patch count number, which indicates the number of target cells present in the patch. After that, patch rotation is performed on the collected training patches for the purpose of making the system more robust to rotational variance and data augmentation.

Fig. 3.
figure 3

System overview

In the test phase, one test image is cropped into a number of overlapping test patches with the same size as the training patches in the sliding-window manner. Each of these test patches is passed into the CNN-based regression model, and then the estimated cell count of the input test patch is output from the last layer of the CNN model. After predicting cell counts for all the patches, we perform a 2-D Linear Interpolation over the estimated cell count and its corresponding x-y coordinates to build a heatmap, which provides a spatial density prediction as shown in Fig. 3. Lastly, integrating these interpolated counts on pixel locations provides us the final count on the test image. The whole procedure is illustrated in Fig. 3.

2.2 Data Preparation and Processing

In the real application of automatic cell counting, the resolution of input medical image could be very high, at the same time the target cells could be very dense. Consequently, it is quite difficult to manually count target cells one by one. This is the original motivation of automatic cell counting. Considering the nature of these medical images, which need automatic cell counting, data preparation and processing is naturally necessary. In this work, we crop image into consistent patches and then perform training and prediction over these patches, in order to (1) make the approach more robust to scale variance, (2) avoid resizing original microscope image, which could cause information loss, (3) prepare more training data to prevent the CNN based regression model from overfitting during training.

The proposed method operates by first partitioning image to smaller patches. Patches are generated in a sliding window manner: from the top-left corner of a large W-by-H image with a certain patch size and stride size. Usually, stride size is set smaller than half of path size to ensure that adjacent patches have overlapping region. To construct training data, every training patch is accompanied by a patch count, which is an integer indicating how many cells exist in the patch. Then for data augmentation, a training patch is rotated from 0 degree to 360 degree with a certain rotation step, for example 30 degrees.

2.3 Convolutional Neural Network Regression Model

2.3.1 Classification vs. Regression for Counting

As we know, in a CNN-based classification model, the network outputs a vector whose size is the same size as the number of classes. The i-element in the vector describes the confidence score that the input image belongs to the i-th class. During test phase, the index with the highest confidence score is selected as the final classification result. Softmax loss is widely used for classification problem.

However for counting problems, it is not proper to take cell count number as class index. The reason why regression is a better choice than classification for counting task has been explained in detail in introduction part. In our counting-by-regression model, the difference between ground-truth value and the estimated value can be better preserved during calculating the error. This information is quite helpful for optimizing the CNN weights more accurately in the back-propagation phase. The layer of our regression model outputs a single number, indicating the number of cells that our model predicts. In our model, we employ two kinds of CNN architectures, the first one is AlexNet [14] which consists of 5 convolution layers + 3 fully connected layers; the other is the deep residual network (ResNet) [13] which we will explain in the next section. In both of these architectures, the loss function is defined as the Euclidean loss, which measures the sum of squares of differences between the ground truth and prediction. We train the AlexNet model from scratch with Softmax and Euclidean loss layer respectively, the performance improvement of regression over classification is experimentally explained in detail at Sect. 3.4.

2.3.2 Deep Residual Network for Regression

Since the 2012 ImageNet competition, convolutional neural networks have become popular in large scale image recognition tasks, several milestone networks (including AlexNet [14] VGGNet [22] and GoogLeNet [6], etc.) have been proposed. Recently, the introduction of residual connections into deep convolutional networks has yielded state-of-the-art performance in the 2015 ILSVRC challenge. This raises the question of whether there is any benefit in introducing deep residual network (ResNet) [13] in to the cell counting task. In the following section, we are going to explain the network architecture and its components used in this paper.

Convolutional layer. It consists of a set of learnable filters. During the forward pass, we slide each filter along the width and height of the input volume and compute dot products between its weights and the activation map from previous layer. Intuitively, the filters will be trained to be active to some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer.

Pooling layer. It works by down-sampling the convolutional features using the max operation (max-pooling) or average operation (average-pooling). Pooling layer is usually inserted between successive convolutional layers, in order to reduce the amount of network parameters and also to control overfitting.

Batch Normalization layer. In the ResNet architecture, authors [13] deploy Batch Normalization (BN) layer [25] right after each convolution and before activation. As we know, normalization is often used as a pre-processing step to make the data consistent. When the input flows through a deep network, the weights and parameters adjust the values of the input, sometimes making the data too big or too small again. Batch Normalization layer allows us to normalize the data in each mini-batch across the network rather than just performing normalization once in the beginning, thus this problem is largely avoided. [25] has demonstrated that batch normalization helps to boost the learning speed and also increase the overall accuracy.

Fully-Connected layer. As the name implies, each neuron in a Fully-Connected (FC) layer has full connections to all neurons in the previous layer. After gathering all the responses from previous layers into each of its neuron, fully connected layer is responsible for computing a class-specific confidence vectors, where its each neuron outputs a score for a certain class. For example, the ResNet ends with a 1000-way fully-connected layer, on which the class with maximum score is selected as its final predicted label.

Overall Architecture. Different from other CNN architectures, ResNet consists of a number of Residual Blocks. Each residual block is a made up of Convolutional layer, Batch Normalization layer and a shortcut that connects the original input with the output as shown in Fig. 4 (a) and (b), where a Residual Block with Identity Shortcut (RB-IS) and a Residual Block with Projection Shortcut (RB-PS) is illustrated, respectively. The mathematical model of residual block can be summarized as:

$$\begin{aligned} y_{l}=F(X_{l},\left\{ W_{l} \right\} ) +h(X_{l})\qquad X_{l+1}=f(y_{l}) \end{aligned}$$
(1)
$$\begin{aligned} h(X_{l}) = \left\{ \begin{array}{ll} X_{l}\ \ \ &{} {identity\, mapping}\\ W_{p}X_{l}\ \ \ &{} {projection\, mapping}\\ \end{array} \right. \end{aligned}$$
(2)

\(X_{l}\) and \(X_{l+1}\) are the input and output of the l-th residual block, \(F(X_{l},\left\{ W_{l} \right\} )\) stands for the residual function, and f is a activation function (e.g. ReLU). \(h(X_{l})\) represents the shortcut connection: identity mapping or projection mapping. If the dimension of \(X_{l}\) and \(X_{l+1}\) is the same, the identity shortcuts is used; otherwise a linear projection \(W_{p}\) is performed on the shortcut connections to match the dimension, that is projection mapping. The central idea [13] of ResNet is to learn the additive residual function F with respect to \(h(X_{l})\), with a key choice of using an identity mapping and/or projection mapping.

Figure 4 (a) and (b) show two types of Residual Blocks, which are used in different layers of ResNet model (c) according to whether the dimensions of input and output are the same. \(Nr_{1}\), \(Nr_{2}\), \(Nr_{3}\) and \(Nr_{4}\) represent the number of residual blocks used in four sections of ResNet model. For example, \(Nr_{1}\) = 3, \(Nr_{2}\) = 4, \(Nr_{3}\) = 6 and \(Nr_{4}\) = 3 in ResNet-50. Additionally, it has been demonstrated that pre-trained network can be adjusted to be effective for other computer vision tasks. We modify the last fully-connected layer of ResNet from outputting a 1000-D vector to outputting 1 item indicating the predicted number of cells. Additionally, we replace the softmax loss with Euclidean loss. After that, we perform fine-tuning on the weights in fully-connected layer of the ResNet using cell datasets, the parameters of previous layers are preserved. Finally, we obtain three ResNet based regression models for cell counting.

Fig. 4.
figure 4

(a) RB-IS stands for the Residual Block with Identity Shortcut; (b) RB-PS is the Residual Block with Projection Shortcut; (c) An illustration of the architecture that we used in this paper.

3 Experiment and Performance

3.1 Datasets

First, we describe the three cell datasets, on which the proposed method and other comparison methods are evaluated. The first dataset [12] involves 100 H&E stained histology images of colorectal adenocarcinomas. A total of 29,756 nuclei were marked at/around the center with over 22,000 labeled with the cell type. The second dataset [1] consists of 200 highly-realistic synthetic emulations of fluorescence microscopic images of bacterial cells. The third dataset comprise of 55 high-resolution RGB images, each of them is a microscopic image of proliferative tumor cells area with a resolution of 1920-by-2560 pixels. The tumor cell size is about 10 to 20 pixels in diameter or 10 \(\upmu \)m in physical length.

Fig. 5.
figure 5

Example images from the three evaluation datasets and their dotted annotation. (Color figure online)

Table 1. Details of three datasets: Size is the image size; Ntr/Nte is the number of images selected for training and testing; AC indicates the average number of cells; MinC-MaxC is the minimum and maximum numbers of cells in a dataset.

To build this Ki-67 cell image dataset, a 10X microscopic field representing the highest proliferative area was acquired using a Nikon Eclipse E600 microscope with 0.25 aperture and a QImaging Micropublisher 5.0 RTV camera equipped with a Sony ICX282 CCD, finally it gives us 24-bit color pictures with a resolution of 1920 \(\times \) 2560 pixels. All of the three evaluation cell datasets have their dotted annotation available, which represents the location of cells as shown in Fig. 5. For the three datasets, we randomly select images for training and testing. Details of the three evaluation datasets are summarized in Table 1.

3.2 Implementation Details

The proposed method is implemented in Matlab, and we utilize Caffe [33], a fully open source implementation of Convolutional Neural Network, which affords clear access to Matlab/Python with support for GPU computation. As discussed in image partion part, each original RGB images is partitioned under certain rotation, stride size, and patch size. After taking experiments under different settings, we use stride size = 30 (pixels), patch size = 60 \(\times \) 60 (pixels) and rotation step = 30 for the nuclei data; stride size = 20 (pixels), patch size = 40 \(\times \) 40 (pixels) and rotation step = 30 for the bacterial data; stride size = 50 (pixels), patch size = 200 \(\times \) 200 (pixels) and rotation step = 30 for the Ki-67 cell data. All the experiments are run on a machine with Intel Core i7-4790K CPU@4.00 GHz \(\times \) 8 and GPU GeForce GTX TITAN Black/PCIe/SSE2.

3.3 Evaluation Metric and Counting Examples

In all the experiments, we use the Mean Relative Error (MRE) and Mean Absolute Error (MAE) as the metric for quantitative evaluation: where N is the total number of test images, \(t_{i}\) and \(p_{i}\) are the true and predicted numbers of cells in the i-th test image. MRE and MAE are defined as follows:

$$\begin{aligned} MRE=\dfrac{1}{N}\sum _{i=1}^{N}\dfrac{|t_{i}-p_{i}|}{t_{i}} \end{aligned}$$
(3)
$$\begin{aligned} MAE=\dfrac{1}{N}\sum _{i=1}^{N}|t_{i}-p_{i}| \end{aligned}$$
(4)
Fig. 6.
figure 6

Spatial density prediction and counting results on the cell datasets. Figure 6 provides several cell counting results on the three evaluation datasets. Original cell image is shown in on left side; the middle panel shows the patch level prediction, which is a middle result of our cell counting result; The right panel shows the spatial density prediction map (measured in number/square pixel) as well as the global counts of ground-truth and our prediction. From the patch level prediction curve, we can see that our estimated counts for each patch approximate the pattern of ground truth counts well.

3.4 Counting Performance Using Different Models

First, we are going to investigate the performance difference between Classification (C) model and Regression (R) model for the cell counting task. The whole framework follows the pipeline shown in Fig. 3. As for the CNN architecture, we employ AlexNet (5conv + 3fc) and ResNet (50 layers) separately. And softmax loss function and Euclidean loss function are used respectively in the classification and regression model. We conduct this comparison experiment on all the three cell datasets and evaluate the performance in terms of Mean Relative Error and Mean Absolute Error (std also provided). Table 2 shows that on all the three datasets regression model shows lower prediction error by considerable margins than the classification model, which experimentally support the discussion in Sect. 2.3.1. It is also necessary to note that the AlexNet based regerssion model outperforms the ResNet based classification model. Table 3 shows the counting performances using ResNets with different number of layers. The 50/101/152-layer ResNet based regression models are used in this experiment. We can observe that ResNet-152 model shows the lowest prediction error, followed by ResNet-101 and ResNet-50 respectively.

Table 2. Counting performance in terms of MRE and MAE±std comparison between Classification (C) and Regression (R) model
Table 3. Counting performance in terms of MRE and MAE±std using different models.

3.5 Comparison with State of the Art

We carry out experimental performance comparison between our method and three other state-of-the-art approaches (presented in [5, 16, 24]) on three evaluation datasets. The counting result from ResNet-152 regression model is used as our approach result during this comparison. Figure 7 provides the cell counts of ground-truth and four predictions of “Le.count” [16], “Le.detect” [5], “DeepFeat” [24] and “The proposed” on every test image. To quantify Fig. 7, Table 4 reports the performance in terms of Mean Relative Error (MRE) and Mean Absolute Error (MAE) over the three evaluation datasets.

The proposed method has achieved very competitive result on Nuclei dataset and Ki67 dataset, MRE = 16.66 % and 6.41 % respectively. The images from Nuclei and Ki67 datasets contain 310.22 and 2045.85 cells on average, our proposed method is able to predict with only 14.93 and 108.30 cells in terms of mean absolute error; while other three methods gives 33.89–71.80 and 189.35–259.67 error cells on average.

On Bacterial dataset, the proposed method gives MRE = 4.50 %, but [16] makes 2.4 % further improvement over our result. The central idea of [16] is to estimate a density function whose integral over any image region gives the count of objects within that region. In its learning phase, each cell is dot-annotated and is assigned a real-valued Sift feature vector describing the local appearance. It means that for each cell, [16] needs its x-y coordinate on an image and then compute the Sift feature on the image sub-region around this cell. In comparison, the proposed method only takes the number of cells as annotation to an image patch during training. As one can imagine, for an image containing hundreds to thousands of cells, the complexity and time consuming of [16] will increase greatly. Furthermore, when it comes to the much more dense datasets (Nuclei and Ki67), the Sift descriptor based learning of [16] becomes less reliable.

It is also necessary to mention that the MRE values of all the four methods on Nuclei dataset are higher than those on other two datasets, because Nuclei dataset has several test images, which only contains a few cells e.g. 1, 4 or 8. For example, predicting the cell count from 1 (ground truth) to 2 or from 4 (ground truth) to 6 will greatly affect the final MRE value.

Table 4. Counting performance (MRE) and (MAE±std) comparison on the three evaluation datasets.
Fig. 7.
figure 7

The estimated count versus ground-truth of different approaches on the three evaluation datasets.

4 Conclusions

In this paper, we propose a novel regression based framwork for cell counting. As the output, spatial density map and global cell count are provided. The proposed method is able to handle dense cell microscopy image, where the cells also present huge variation in appearance. We have experimentally demonstrated that the proposed approach achieved superior performance compared with several recent related methods.