
1 Introduction

Lung cancer is one of the most common cancers and the leading cause of cancer-related death worldwide. Clinical trials [1] show that early screening of pulmonary nodules in high-risk subjects allows lung cancer to be detected at early stages, thus greatly reducing lung cancer death rates. Low-dose CT is an effective and recommended modality for pulmonary nodule screening. However, reading hundreds of slice images per CT scan is time-consuming; under this labor-intensive workload, radiologists can spend only very limited time on each scan, which leads to missed pulmonary nodules and hence decreases the efficacy of CT-based screening. Therefore, it is highly desirable to develop a computer-aided approach that automatically discovers pulmonary nodules in clinical CT scans.

CT pulmonary nodule detection is technically challenging because (1) nodule sizes vary greatly, from 3 mm to 3 cm; and (2) the appearance of nodules is inconsistent, as shown in Fig. 1. Most existing computer-aided detection (CAD) systems for pulmonary nodules consist of two stages [2]: (1) candidate generation and (2) false positive (FP) reduction. In the first stage, a detection network (e.g., faster RCNN [3]) is typically applied to find all nodules in the lung fields while filtering out as many background positions (negatives) as possible. To keep a high sensitivity, detection precision is often compromised, and many false positives are generated in this stage. In the second stage, a binary classifier (e.g., a CNN) is trained to reject the false positive detections produced in the first stage.

Fig. 1. Different density types of pulmonary nodules in CT scans. From left to right: solid nodule, calcific nodule, ground glass opacification (GGO), and mixed GGO.

In this work, we formulate CT pulmonary nodule detection as a cascade learning problem. Instead of using a two-stage approach, we train multiple stages of detection networks in a cascade fashion and chain them into an ensemble model to gradually separate background from pulmonary nodules. Compared to existing pulmonary nodule detection systems, our work has two major contributions. First, we use detection networks not only in the first stage but also in the subsequent stages, as we empirically found that a binary CNN classifier is prone to overfitting in FP reduction. Second, a feature pyramid network [4] is used as the backbone detection network because it performs detection on multi-scale feature maps, which better handles the large size variation of pulmonary nodules than faster RCNN [3], which performs detection on a single-scale feature map. In addition, we propose a new cascade paradigm called “Relu cascade”. At each cascade level, a Relu strategy is used to aggregate the judgements from all previous cascade levels before filtering false positives. Validated on a large internal CT dataset, the Relu strategy greatly improves the detection performance of pulmonary nodules compared to the conventional cascade, which filters false positives independently at each level in the inference phase.

2 Method

2.1 3D Feature Pyramid Network for Pulmonary Nodule Detection

The sizes of pulmonary nodules vary greatly, from 3 mm to 3 cm. Conventional deep learning approaches, e.g., faster RCNN [3], are not suitable for detecting objects with large scale variations, because they detect objects at a single scale. For example, faster RCNN uses interleaved convolutions and down-sampling to extract context information and detects objects on the final coarse feature maps. Using such methods for pulmonary nodule detection will inevitably miss small nodules. To detect nodules of all sizes, detection has to be performed at various scales. To this end, we extend the feature pyramid network (FPN) [4] to 3D pulmonary nodule detection. We derive our FPN architecture from the popular segmentation network V-Net [5], as shown in Fig. 2. The difference is that the top level in the ascending path of V-Net is removed, as this level contains very fine-grained features, which are generally unnecessary for object detection. In our FPN, detection is performed separately on feature maps of four different scales by attaching a detection head onto each feature map. The detection head at each scale consists of two convolution layers and outputs an anchor likelihood map \( P \) and an anchor regression map \( R \). The nodule detection process follows the conventional region proposal network [3]: positive anchor boxes are first identified on \( P \) and then adjusted using the predicted anchor regression maps \( R \). Finally, redundant bounding boxes are suppressed using non-maximum suppression.
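To make the per-scale detection head concrete, a minimal PyTorch sketch is given below. The channel widths, kernel sizes, and number of anchors per voxel are illustrative assumptions, not the exact values of our implementation.

```python
import torch
import torch.nn as nn

class DetectionHead3D(nn.Module):
    """Per-scale detection head: a shared 3D conv followed by two 1x1x1 convs
    that output an anchor likelihood map P and an anchor regression map R
    (6 offsets per anchor). Channel widths and anchor counts are illustrative."""
    def __init__(self, in_channels=64, mid_channels=64, num_anchors=3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv3d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1 likelihood per anchor; 6 regression targets (dx, dy, dz, dw, dh, dd)
        self.cls_conv = nn.Conv3d(mid_channels, num_anchors, kernel_size=1)
        self.reg_conv = nn.Conv3d(mid_channels, num_anchors * 6, kernel_size=1)

    def forward(self, feature_map):
        x = self.shared(feature_map)
        P = torch.sigmoid(self.cls_conv(x))  # anchor likelihood map
        R = self.reg_conv(x)                 # anchor regression map
        return P, R

# One head would be attached to each of the four pyramid scales, e.g.:
# heads = nn.ModuleList(DetectionHead3D(c) for c in (32, 64, 128, 256))
```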

Fig. 2. The architecture of our feature pyramid network.

Training Loss of FPN:

The training loss of FPN is the summation of losses of all four different scales. The loss at a single scale is given as:

$$ Loss\left( P,R \right) = \lambda L_{cls} \left( P,P^{*} \right) + L_{reg} \left( R,R^{*} \right), $$
(1)

where \( L_{cls} \) and \( L_{reg} \) are the anchor classification and regression losses, respectively, and \( \lambda \) is a balancing weight between them. Conventionally, the classification loss is measured over all anchor boxes. However, since positive and negative anchor boxes are highly imbalanced (about 1:\( 10^{6} \)) in this task, we use online hard negative example mining (OHEM) to pick the top K hard negative anchors in a mini-batch, and then use focal loss [6] to measure the classification loss over all positive anchors and the top K hard negative anchors. Mathematically, the classification loss is defined as

$$ L_{cls} \left( P,P^{*} \right) = - \sum\nolimits_{i}^{N_{pos}} \alpha \left( 1 - p_{i} \right)^{\gamma} \log p_{i} - \sum\nolimits_{i}^{N_{neg}} 1_{i \in TopK} \left( 1 - \alpha \right) p_{i}^{\gamma} \log \left( 1 - p_{i} \right), $$
(2)

where \( N_{pos} \) and \( N_{neg} \) are the numbers of positive and negative anchors, \( p_{i} \) is the classification likelihood of the \( i \)-th anchor, and \( 1_{i \in TopK} \) is an indicator function that admits only the top K hard negative anchors into the loss calculation. \( \gamma \) and \( \alpha \) are the parameters of focal loss. We set \( K = 5000N_{pos} \), \( \alpha = 0.9 \), and \( \gamma = 2 \) in our experiments.
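A minimal sketch of this classification loss (Eq. 2) in PyTorch is given below; the flattened tensor layout, hard 0/1 anchor labels, and numerical-stability clamping are our assumptions.

```python
import torch

def focal_loss_with_ohem(p, labels, alpha=0.9, gamma=2.0, k_ratio=5000):
    """Focal loss over all positive anchors and the top-K hardest negative
    anchors (K = k_ratio * N_pos), as in Eq. (2).
    p: predicted anchor likelihoods in (0, 1); labels: 1 positive, 0 negative."""
    eps = 1e-6
    p = p.clamp(eps, 1.0 - eps)
    pos, neg = labels == 1, labels == 0

    # Positive term: -alpha * (1 - p)^gamma * log(p)
    pos_loss = -(alpha * (1.0 - p[pos]) ** gamma * torch.log(p[pos])).sum()

    # Negative term: -(1 - alpha) * p^gamma * log(1 - p); keep only the
    # top-K negatives with the largest loss (online hard negative mining)
    neg_loss_all = -((1.0 - alpha) * p[neg] ** gamma * torch.log(1.0 - p[neg]))
    k = min(int(neg.sum().item()), k_ratio * max(int(pos.sum().item()), 1))
    neg_loss = neg_loss_all.topk(k).values.sum()

    return pos_loss + neg_loss
```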

The regression loss \( L_{\text{reg}} \) follows the definition of faster RCNN:

$$ L_{reg} \left( R,R^{*} \right) = \sum\nolimits_{i} 1_{l_{i} = 1} \, L1_{smooth} \left( r_{i} - r_{i}^{*} \right), $$
(3)

where \( 1_{{l_{i} = 1}} \) is an indicator function to pick only the positive anchors, \( L1_{smooth} \) is the smooth L1 loss, \( r_{i}^{*} \) is a 6-tuple (\( \varDelta x^{*} ,\varDelta y^{*} , \varDelta z^{*} , \varDelta w^{*} , \varDelta h^{*} , \varDelta d^{*} \)) defined for the \( i \)-th positive anchor box, and \( r_{i} \) is the same 6-tuple predicted by the network. The definitions of \( \varDelta^{*} \) values are given below:

$$ \varDelta x^{*} = \frac{x^{*} - x_{a}}{w_{a}},\; \varDelta y^{*} = \frac{y^{*} - y_{a}}{h_{a}},\; \varDelta z^{*} = \frac{z^{*} - z_{a}}{d_{a}},\; \varDelta w^{*} = \log \left( \frac{w^{*}}{w_{a}} \right),\; \varDelta h^{*} = \log \left( \frac{h^{*}}{h_{a}} \right),\; \varDelta d^{*} = \log \left( \frac{d^{*}}{d_{a}} \right), $$
(4)

where \( x_{a} \), \( y_{a} \), \( z_{a} \), \( w_{a} \), \( h_{a} \) and \( d_{a} \) are the 3D box center and size of the positive anchor box, \( x^{*} \), \( y^{*} \), \( z^{*} \), \( w^{*} \), \( h^{*} \), and \( d^{*} \) are the 3D box center and size of the ground-truth box that has the largest overlap ratio with the positive anchor box.
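As a concrete illustration of Eq. (4), a minimal sketch of the anchor target encoding and the inverse decoding used at detection time follows; the function and variable names are ours.

```python
import numpy as np

def encode_anchor_targets(gt_box, anchor_box):
    """Regression targets of Eq. (4). Boxes are (x, y, z, w, h, d):
    3D box center and size. gt_box is the ground-truth box with the
    largest overlap with the (positive) anchor box."""
    x, y, z, w, h, d = gt_box
    xa, ya, za, wa, ha, da = anchor_box
    return np.array([
        (x - xa) / wa, (y - ya) / ha, (z - za) / da,
        np.log(w / wa), np.log(h / ha), np.log(d / da),
    ])

def decode_anchor_prediction(pred, anchor_box):
    """Inverse transform: apply predicted offsets to an anchor box."""
    dx, dy, dz, dw, dh, dd = pred
    xa, ya, za, wa, ha, da = anchor_box
    return np.array([
        xa + dx * wa, ya + dy * ha, za + dz * da,
        wa * np.exp(dw), ha * np.exp(dh), da * np.exp(dd),
    ])
```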

2.2 Relu Cascade of FPNs for Pulmonary Nodule Detection

Compared to the background (negatives), the number of pulmonary nodules (positives) per CT scan is very limited, e.g., 2–3 per scan. Given the imbalanced positive and negative training dataset, it is difficult for a single detection network (essentially a classifier) to clearly separate massive background positions from limited nodule positions. A classic solution to this imbalanced problem is cascade learning [7], which trains multiple classifiers one by one to gradually filter massive negatives from limited positives.

The workflow of cascade learning with FPNs is shown in Fig. 3. Specifically, a first FPN is trained with positive patches sampled around all nodule locations and random negative patches sampled from the background. The trained FPN then forms a cascade with one level. It is applied to all training images, and a likelihood threshold is determined by choosing the maximum value that still correctly classifies all positive anchor boxes (100% sensitivity). Doing so inevitably produces many false positive detections. In the next round, negative patches are randomly sampled only from these false positive locations, and a second FPN is trained to reduce these false positives. By chaining the two FPNs together, the second FPN can filter out negatives that the first FPN fails to reject (i.e., its false positives). The same process continues until all false positives are eliminated or a specified number of FPNs is obtained.
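A minimal sketch of the per-level threshold selection and false-positive harvesting described above is given below, assuming the candidate anchor likelihoods and their nodule/background labels have already been collected from the training set; the names are ours.

```python
import numpy as np

def select_threshold_and_false_positives(likelihoods, is_nodule):
    """likelihoods: per-candidate anchor likelihoods on the training set;
    is_nodule: boolean mask of candidates that match an annotated nodule.
    Returns the largest threshold that keeps 100% sensitivity and the
    indices of the surviving false positives, which become the negative
    sampling locations for the next cascade level."""
    threshold = likelihoods[is_nodule].min()        # keep every nodule
    survivors = likelihoods >= threshold
    false_positive_idx = np.where(survivors & ~is_nodule)[0]
    return threshold, false_positive_idx
```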

Fig. 3. The paradigm of conventional cascade with feature pyramid networks (FPNs). (Color figure online)

The drawback of the conventional cascade at inference time is illustrated in Fig. 3, where the Heaviside block denotes \( \text{Heaviside}\left( P_{i} - T_{i} \right) \), \( P_{i} \) and \( T_{i} \) are the anchor likelihood map and threshold at level \( i \), respectively, and the circled red cross denotes element-wise multiplication. As we can see, each level in the conventional cascade filters out negatives based only on its own judgement, without referring to peer levels. As long as one level classifies a sample as negative, the sample location is filtered out. This strategy can be very risky because FPNs trained at later cascade levels are more specialized and more prone to overfitting the training data than FPNs at earlier levels. The reason is that the negative training samples of later FPNs are extracted from the false positives left by the previous cascade levels. As the cascade level increases, the number of false positives decreases, so the later FPNs are exposed to only a limited set of negative samples, making them very sensitive and prone to overfitting.

To address this problem, we propose a new cascade called “Relu cascade”, as shown in Fig. 4. The training process is the same as that of the conventional cascade. The difference lies in the inference stage, where the Relu cascade introduces two operators to aggregate the judgements from all previous cascade levels before filtering out negatives.

Fig. 4. The paradigm of Relu cascade with feature pyramid networks (FPNs).

1. The threshold Relu operator is defined as \( P_{i}^{relu} = \max \left( 0,\, P_{i}^{avg} - T_{i} \right) + 1_{P_{i}^{avg} \ge T_{i}} \cdot T_{i} \), where \( P_{i}^{avg} \), \( P_{i}^{relu} \) and \( T_{i} \) are the input anchor likelihood map, output anchor likelihood map, and likelihood threshold at cascade level \( i \), respectively.

2. The conditional mean operator is defined as:

$$ P_{i}^{avg} = \begin{cases} \left( P_{i - 1}^{relu} + P_{i} \right)/2, & \text{if } P_{i - 1}^{relu} > 0 \\ 0, & \text{if } P_{i - 1}^{relu} = 0 \end{cases} $$
(5)

Unlike the conventional cascade (Fig. 3), which uses the Heaviside function for binary judgement, our Relu cascade (Fig. 4) uses a “threshold Relu” operator to filter out negative samples while still preserving the anchor likelihoods estimated by previous cascade levels. The preserved anchor likelihoods are treated as prior information. The conditional mean operator then combines this prior with the current likelihood estimation to derive more reliable anchor likelihoods. Since each cascade level in the Relu cascade utilizes not only its own judgement but also the judgements from previous levels, the overfitting problem of the conventional cascade is largely relieved, and the detection performance is further boosted.
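The inference-time aggregation of the Relu cascade can be sketched as follows; treating the first level as having no prior (i.e., \( P_{1}^{avg} = P_{1} \)) is our assumption.

```python
import numpy as np

def relu_cascade_aggregate(likelihood_maps, thresholds):
    """Relu cascade inference: at each level, the conditional mean (Eq. 5)
    averages the running likelihood with the current level's prediction,
    and the threshold Relu passes that average through unchanged where it
    reaches the level's threshold and suppresses it to zero elsewhere.
    likelihood_maps: per-level anchor likelihood maps P_i (same shape);
    thresholds: per-level thresholds T_i."""
    p_relu = None
    for p_i, t_i in zip(likelihood_maps, thresholds):
        if p_relu is None:
            p_avg = p_i                                          # level 1: no prior
        else:
            p_avg = np.where(p_relu > 0, (p_relu + p_i) / 2.0, 0.0)  # Eq. (5)
        p_relu = np.where(p_avg >= t_i, p_avg, 0.0)              # threshold Relu
    return p_relu  # nonzero entries are the final nodule candidates
```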

3 Results

Dataset and Parameter Setting:

Our experimental data consists of 2595 CT scans from 2595 patients. The in-plane resolution is 0.6–0.8 mm, and the slice thickness is 5 mm. Pulmonary nodules in each CT scan were first annotated as 3D bounding boxes by a senior radiologist using our internal tool. The annotations were then reviewed and confirmed by another senior radiologist with more than 20 years of experience. The distribution of pulmonary nodule sizes is shown in Fig. 5 (right). We can see that a large fraction (47.7%) of the pulmonary nodules in our dataset are smaller than 4 mm, which poses a great challenge for automatic nodule detection algorithms.

Fig. 5. Left: quantitative comparison of FROC curves of faster RCNN and FPN. Right: the distribution of nodule sizes in our dataset (nodule size is defined as the average of the largest in-plane major axis and the corresponding maximum perpendicular minor axis).

In the experiments, we randomly split the dataset into two parts: (1) a training set of 1989 CT scans with 2932 pulmonary nodules; and (2) a testing set of 606 CT scans with 1076 pulmonary nodules. All CT scans are first resampled to the same spatial resolution (i.e., 0.7 mm × 0.7 mm × 1.0 mm) before any processing step. During training, due to the limitation of GPU memory, we sample random 3D patches of size 96 × 96 × 96 voxels from the CT scans to train the FPNs. \( \lambda \) in the FPN loss (Eq. 1) is set to 1000. We evaluate the detection performance of different algorithms using two metrics: (1) the average precision (AP) of the precision-recall (PR) curve; and (2) the free-response ROC (FROC) curve.
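As an illustration of the resampling step, a minimal sketch using SciPy is shown below; the (z, y, x) axis ordering and linear interpolation order are assumptions, not prescriptions of our pipeline.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_spacing(volume, spacing, target=(1.0, 0.7, 0.7)):
    """Resample a CT volume (z, y, x) from its native voxel spacing (mm)
    to a fixed target spacing (here 1.0 x 0.7 x 0.7 mm, matching the
    resolution used in our experiments)."""
    factors = np.asarray(spacing, dtype=float) / np.asarray(target, dtype=float)
    return zoom(volume, factors, order=1)  # linear interpolation
```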

Impact of Multi-scale Feature Maps:

We first evaluated the importance of multi-scale feature maps in pulmonary nodule detection by comparing faster RCNN and FPN. Faster RCNN utilizes only the coarsest feature maps for pulmonary nodule detection. It tends to miss small pulmonary nodules. By detecting pulmonary nodules at various scales of feature maps, FPN outperforms faster RCNN by a large margin (0.12 increase in AP) as shown in both Fig. 5 (left) and Table 1, which justifies the importance of multi-scale feature maps.

Table 1. Quantitative comparison of average precisions of different algorithms.

Evaluation of Different Loss Functions:

We evaluated the detection performance of different loss functions under the FPN framework. The loss functions under comparison are: (1) weighted cross entropy loss, (2) focal loss, (3) cross entropy loss with online hard negative example mining (cross entropy + OHEM), and (4) focal loss with OHEM (focal loss + OHEM). The best performance of each loss function is reported in Fig. 6. Focal loss with OHEM achieves the best detection performance. Without OHEM, the number of negatives in a mini-batch far exceeds the number of positives, which limits training efficacy. With OHEM, the detection performance of focal loss is slightly better than that of cross entropy, as shown in Fig. 6 (right). We also found that training converges faster with focal loss than with cross entropy loss.

Fig. 6. Quantitative comparison of FPN detection performance using different loss functions. Left: FROC curve; right: PR curve.

Evaluation of Cascade Learning:

We evaluated the performance gain of cascade learning in Tables 1 and 2 and Fig. 7. As shown in Table 2, cascade learning improves the detection performance gradually as more FPNs are added. In the experiments, we trained up to 5 FPNs, weighing the performance gain against the increase in inference time. Compared with the conventional cascade, the Relu cascade is clearly better at all cascade levels, which justifies the effectiveness of aggregating judgements from previous cascade levels.

Table 2. The detection performance at different cascade levels and using different cascades
Fig. 7. PR curves without cascade (FPN), with conventional cascade, and with Relu cascade.

Inference Time and GPU Memory Cost:

We use PyTorch to implement the training part of our algorithm and build the inference part from scratch using cuDNN. This allows us to fully optimize the inference speed and GPU memory cost of the entire algorithm. After optimization, the runtime of our algorithm is 6 to 12 s per scan on a GeForce Titan Xp graphics card, and the runtime GPU memory cost is about 3 to 5 gigabytes, depending on the size of the lung fields.

4 Conclusion and Discussion

In this paper, we propose a new cascade paradigm, the Relu cascade, and combine it with feature pyramid networks (FPNs) for pulmonary nodule detection. We show that, compared with faster RCNN, FPN greatly improves the detection performance on our dataset, which contains many small pulmonary nodules. Using the proposed Relu cascade, the detection performance can be further boosted. The methodology proposed in this paper is very general, and we will validate its performance on other computer-aided detection problems in the near future.