1 Introduction

Abdominal aortic aneurysm (AAA), an enlargement or widening of the abdominal aorta, commonly occurs in males older than 65 years, with a prevalence of 4 to 8% [5]. Untreated aneurysms tend to grow and may eventually rupture, with mortality rates exceeding 90\(\%\). As most AAAs are asymptomatic until critical bleeding, incidental detection of AAAs is essential. However, on routine abdominal computed tomography (CT) exams, only 65\(\%\) of AAAs are incidentally identified [2]. This low reporting rate makes it difficult to provide timely intervention for patients. Indeed, it is common for AAAs to be first diagnosed at a point where a patient is already at risk for rupture [7]. Furthermore, in routine clinical practice, the size of an AAA is determined by manual measurement of the maximal aortic diameter, which is time-consuming and prone to high inter-reader variability.

Consequently, a variety of computer-aided diagnosis techniques have been proposed over the past decade for automated aorta segmentation. Many of these earlier methods used classical computer vision techniques that required prior knowledge, such as external seed points for initialization [3]. Driven by the ever-increasing capability of deep learning, neural networks have recently been used for aorta segmentation on CT angiography [6]. However, these previous deep learning algorithms focused only on CT exams with contrast, while incidental identification of AAAs on scans without contrast is equally important but more challenging. Additionally, most previous works concentrated on the task of automated aortic segmentation [6, 9, 11], and very few studies have investigated the more applied task of AAA detection, which has much greater clinical relevance than segmentation alone.

In this paper, we demonstrate a deep-learning solution (DeepAAA) for automated aorta segmentation and AAA detection on both contrast and non-contrast CT series. Specifically, we develop a variant of a 3D U-Net [1] for aorta segmentation on abdominal CT scans. The proposed method handles series with varying numbers of images. We then apply ellipse fitting to the segmented aortic contours and estimate the largest aortic diameter. DeepAAA is a general solution, achieving a high detection rate for AAAs on both contrast and non-contrast CT scans and working with variable image resolutions and slice thicknesses. Furthermore, our solution demonstrates strong generalizability and performance relative to literature-reported values for radiologist sensitivity at AAA detection.

2 Cohort and Annotation

Image data consisted of contrast and non-contrast CT examinations of the abdomen and pelvis performed between January 2005 and April 2017 by the Massachusetts General Hospital Department of Radiology. The investigators obtained local Institutional Review Board approval for the project and selected two datasets from the database. The two datasets differ in their capture dates and in the imaging equipment used, as characterized in Table 1.

Table 1. Comparison between primary and additional validation data sets

2.1 Primary Data Set

The primary dataset was used for the training and initial validation of the model and contained 321 studies (223 unique patients). These were selected based on a keyword search of study reports ensuring a mixture of positive and negative cases of AAA. The query was biased to largely include studies captured between 2005 and 2007. Of the studies selected, there were 217 (67.6\(\%\)) males and 104 (32.4\(\%\)) females with a mean age of 70.3 years; 153 (47.7\(\%\)) CT scans with contrast and 168 (52.3\(\%\)) without; 247 (76.9\(\%\)) studies with AAA present and 74 (23.1\(\%\)) without AAA. For each study, the axial series was used for aorta segmentation and AAA detection. Slice thickness of the images ranged from 2 to 10 mm, while the number of images for each series varied from 40 to 384.

To generate a ground-truth aortic segmentation, the abdominal aorta was manually contoured on the axial scans slice-by-slice until the aortic bifurcation. Each study was annotated by 1 to 4 CT technologists under supervision of 2 radiologists. Based on the clinical definition [2], the presence of AAA was determined by applying a 3.0 cm threshold to the maximum aortic diameter as defined by the manual segmentations.

As many exams were annotated by multiple annotators, a partial assessment of inter-rater variability was possible. Of the 153 contrast studies, 124 were annotated by at least 2 independent technologists, leading to 517 pairwise comparisons. The non-contrast data, however, contained only 10 studies where more than one segmentation was performed, resulting in only 16 pairwise comparisons. The average inter-rater Dice on contrast series was \(0.95\,\pm \,0.03\), while on non-contrast series, it was \(0.90\,\pm \,0.08\). Given the small number of samples, the inter-rater variability on non-contrast data should not be considered definitive but suggests roughly similar levels of agreement. For the subsequent analysis, one reference segmentation per dataset was selected randomly as ground truth.
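The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy masks are hypothetical, and each mask is flattened to a sequence of 0/1 voxel labels.

```python
from itertools import combinations

def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    # Two empty masks are treated as perfect agreement (an assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total

def pairwise_dice(annotations):
    """Dice for every unordered pair of annotators on one study."""
    return [dice(a, b) for a, b in combinations(annotations, 2)]

# Toy study with three hypothetical annotators (not data from the paper).
study = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
scores = pairwise_dice(study)
```

A study with k annotators contributes k(k-1)/2 pairs, so studies with 3 or 4 annotators contribute 3 or 6 comparisons each, which is how a modest number of studies can yield several hundred pairwise scores.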

2.2 Additional Validation Set

An additional validation set was used to test the robustness of the model to changes in imaging equipment, imaging department capture protocols, and patient demographics. All of these factors may vary significantly over time at a single site, and thus, we selected 57 studies (57 unique patients) predominantly captured between 2012 and 2016 for this dataset. The studies were selected to include a mixture of positive and negative cases of AAA through keyword search of study reports. All negative studies were manually verified to not contain a AAA. To assess the model against radiologist-reported ground truth and validate the post-processing stages which generate the AAA measurement, the maximum aortic diameter and presence of AAA were sourced from radiology reporting rather than being derived from manual segmentations (as was done for the primary data set).

3 Methods

We achieve AAA detection via two sequential steps: (1) aorta segmentation and (2) aorta contour fitting for the estimation of the largest cross-sectional diameter. For abdominal aortic segmentation, we developed a variant of a 3D U-Net [1] which accepts series with varying numbers of images. As discussed in Sect. 2, our dataset contained a wide distribution of image counts and slice thicknesses, as abdominal studies may also cover other regions of the body, including the pelvis or thorax. It is thus essential to develop an algorithm that adapts to variability along the axial dimension. The 3D U-Net architecture we used contained 4 down/upsampling modules (plus the bottleneck layer), 2 convolutional layers per module, and 32 initial features in the network. The convolutional kernel size was 3 \(\times \) 3 \(\times \) 3 in both the downsampling and upsampling path, while the 3D pooling kernels were 2 \(\times \) 2 \(\times \) 1 to preserve image count. Batch normalization was applied before each ReLU activation, and dropout regularization was utilized at the bottleneck layer with a dropout rate of 0.2. A 1 \(\times \) 1 \(\times \) 1 convolutional layer with softmax activation over two classes (background and aorta) was applied at the output layer and thresholded at 0.5 to generate the binary aorta mask.
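To make the effect of the 2 \(\times \) 2 \(\times \) 1 pooling concrete, the sketch below tracks feature-map shapes through the encoder. It is shape bookkeeping only, not the network itself. Two things are assumptions not stated in the text: that convolutions are padded so they preserve spatial size, and that the feature count doubles at each level as in the standard U-Net. The 512 \(\times \) 512 in-plane size in the example is also only illustrative.

```python
def encoder_shapes(height, width, n_slices, levels=4, base_features=32):
    """Feature-map shape (H, W, slices, channels) at each encoder level.

    Assumes size-preserving convolutions and channel doubling per level;
    neither detail is given in the paper.
    """
    shapes = []
    for level in range(levels + 1):  # 4 downsampling modules plus bottleneck
        shapes.append((height, width, n_slices, base_features * 2 ** level))
        # 2 x 2 x 1 pooling halves the in-plane size and keeps the slice count.
        height, width = height // 2, width // 2
    return shapes

# Illustrative series of 40 axial images.
shapes = encoder_shapes(512, 512, 40)
```

Because the pooling never touches the axial dimension, the slice count is identical at every level, so a series with 40 images and one with 384 images pass through the same layers without resampling along that axis.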

The model was trained with the RMSprop optimizer using a learning rate of 0.0001. Weights selected for evaluation were those that minimized the loss on the validation set, which were not in general the last epoch weights. The loss function was a smoothed negative Dice coefficient:

$$\begin{aligned} D=-\frac{2 \sum _{i=1}^N p_i g_i + 1}{\sum _{i=1}^N p_i + \sum _{i=1}^N g_i + 1} \end{aligned} \tag{1}$$

similar, but not identical, to that used in [8]. The summation is over all N voxels in a scan, \(p_i\) is the predicted aorta probability and \(g_i\) is the ground truth classification for voxel i. The additional ones in the numerator and denominator avoid division by zero and yield a perfect score for a correct, empty segmentation.
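A plain-Python rendering of Eq. (1) may help check the behaviour of the smoothing terms. This is a sketch over flat lists of voxel values rather than the tensor implementation used for training, and the function name is hypothetical.

```python
def smoothed_dice_loss(pred, truth):
    """Eq. (1): negative Dice with +1 added to numerator and denominator.

    pred  -- predicted aorta probabilities p_i, one per voxel
    truth -- ground-truth labels g_i (0 or 1), one per voxel
    """
    overlap = sum(p * g for p, g in zip(pred, truth))
    return -(2.0 * overlap + 1.0) / (sum(pred) + sum(truth) + 1.0)

# Correct empty segmentation: no aorta predicted, none present.
loss_empty = smoothed_dice_loss([0, 0, 0, 0], [0, 0, 0, 0])
# Perfect non-empty segmentation.
loss_perfect = smoothed_dice_loss([1, 1, 0, 0], [1, 1, 0, 0])
# Prediction that misses the aorta entirely.
loss_miss = smoothed_dice_loss([0, 0, 1, 0], [1, 1, 0, 0])
```

Both the empty and the perfect case evaluate to the minimum value of -1, while the complete miss gives -0.25, so the loss stays finite and well ordered even for scans that contain no aorta voxels.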

In order to build a general AAA detector that worked with both contrast and non-contrast CT scans, we mixed both types of CT images for model training. All the experiments were implemented utilizing the Keras deep learning library with the Tensorflow backend on NVIDIA DGX-1 Volta.

After aorta segmentation, we applied ellipse fitting [4] image-by-image to the contours of the aorta. The largest aortic diameters (d) were thus assigned by the long axis of the ellipses. For the regions where the aorta was not parallel to the axial CT scans, angle correction was applied to retrieve the true aorta diameter, i.e. \(d \cos \theta \), where \(\theta \) was the angle between the secant plane of the aorta and the axial scan. Based on the definition of AAA, predicted positives were the studies where the largest diameter of the aorta segment was greater than 3 cm. We then compared the predicted results with the ground truth annotations.
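The measurement and decision rule can be sketched as below. The ellipse fit itself is omitted: the sketch assumes that each image already has a fitted long axis and an angle \(\theta \), and how \(\theta \) is estimated is not specified here. All names and example values are hypothetical.

```python
import math

AAA_THRESHOLD_CM = 3.0  # clinical threshold on the largest aortic diameter

def corrected_diameter(long_axis_cm, theta_rad):
    """Project the in-plane long axis onto the true cross-section: d * cos(theta)."""
    return long_axis_cm * math.cos(theta_rad)

def largest_diameter(slices):
    """slices: iterable of (long_axis_cm, theta_rad), one entry per image."""
    return max(corrected_diameter(d, t) for d, t in slices)

def has_aaa(slices):
    """Predicted positive when the largest corrected diameter exceeds 3 cm."""
    return largest_diameter(slices) > AAA_THRESHOLD_CM

# Hypothetical series: the middle image cuts a tilted aorta obliquely.
series = [(2.4, 0.0), (3.4, math.radians(40)), (2.9, 0.0)]
flagged = has_aaa(series)
```

In this toy series the oblique image has an apparent long axis of 3.4 cm, which would cross the threshold on its own, but the angle correction reduces it to about 2.6 cm and the study is not flagged. This is the kind of false positive the correction is meant to avoid.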

4 Results

4.1 Training and Cross-validation on Primary Data Set

To assess model validity and repeatability, the primary dataset was divided into 5 folds such that no patient was repeated between folds. Cross-validation was performed by selecting folds \(\{n,n+1,n+2\} \bmod 5\) as training, fold \((n+3) \bmod 5\) as validation, and the remaining fold as test, for \(n\in \{0,\ldots ,4\}\). For each combination, the weights with the best validation score after 100 epochs were selected.
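The fold rotation can be written out explicitly. This short sketch only enumerates fold indices; assigning patients to folds so that no patient spans two folds is a separate step not shown here, and the function name is hypothetical.

```python
def fold_assignments(n_folds=5):
    """Train/validation/test fold indices for each rotation of the folds."""
    splits = []
    for n in range(n_folds):
        train = [(n + k) % n_folds for k in range(3)]  # folds n, n+1, n+2
        val = (n + 3) % n_folds
        test = (n + 4) % n_folds  # the one remaining fold
        splits.append({"train": train, "val": val, "test": test})
    return splits

splits = fold_assignments()
# splits[0] -> {"train": [0, 1, 2], "val": 3, "test": 4}
# splits[3] -> {"train": [3, 4, 0], "val": 1, "test": 2}
```

Over the five rotations every fold serves as the test fold exactly once, so each study in the primary dataset receives exactly one held-out prediction.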

Table 2. Results of 5-fold cross-validation. Delta is predicted minus reference largest diameter. Standard deviations combined using pooled variance.

Inference on each test study was evaluated in terms of Dice score relative to the reference segmentation and in terms of the maximum diameter of the aorta evaluated on the inferred segmentation versus the same calculation on the reference segmentation. The detailed results of this cross-validation are presented in Table 2. Over the 5 folds, the average Dice score ranged from 0.883 to 0.894, with an overall average Dice score of \(0.887\pm 0.111\). The mean error in the estimated diameter is consistently within one standard deviation of zero. There may be a slight bias towards smaller diameters, as 4 of the 5 folds had negative means, but this bias is small, with an overall mean of −1.3 ± 7.3 mm.
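For reference, combining per-fold standard deviations by pooled variance is usually done as in the sketch below. The exact formula used for Table 2 is not given, so the degrees-of-freedom weighting here is an assumption (the common textbook definition), and the example numbers are invented rather than taken from the table.

```python
import math

def pooled_std(stds, counts):
    """Pooled standard deviation: sqrt(sum((n_i - 1) * s_i^2) / sum(n_i - 1))."""
    numerator = sum((n - 1) * s ** 2 for s, n in zip(stds, counts))
    denominator = sum(n - 1 for n in counts)
    return math.sqrt(numerator / denominator)

# Two hypothetical folds of 11 studies each with standard deviations 3 and 4.
combined = pooled_std([3.0, 4.0], [11, 11])
```

Here the pooled variance is (10 * 9 + 10 * 16) / 20 = 12.5, giving a combined standard deviation of about 3.54. Note that pooling in this way reflects within-fold spread only and ignores differences between fold means.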

For a final set of weights, the complete primary dataset was randomly split into training (80%), validation (10%), and test sets (10%). Training was performed for 300 epochs and the weights with lowest validation loss were selected.

Fig. 1. DeepAAA aorta segmentation (red overlay) and the largest aortic diameter estimation (yellow crosses, the long axis of ellipse fitting [green curves] of the aorta segment): (a–c) Aneurysm with thrombus on contrast CT. (d–f) Large aneurysm on non-contrast CT where aortic boundary is hard to segment. (g–i) Normal aorta. (Color figure online)

As shown in Fig. 1, DeepAAA successfully segments the aorta on both contrast and non-contrast CT images, and works well with more challenging cases where blood clots are present or the aortic boundary is unclear in the images. We achieve high performance on aortic segmentation with an average Dice coefficient of 0.91, which yields high sensitivity (0.91) and high specificity (0.95) on AAA detection (Table 3). We further examine the error in the largest aortic diameter measurement (\(d_{pred} - d_{true}\)). We find that the algorithm tends to underestimate the aorta size, but the 2.02 mm average discrepancy is well within the 10 mm gradations on which clinical decisions are generally based.

Table 3. Performance of DeepAAA on segmentation and detection

4.2 Testing Model Robustness on the Additional Validation Set

Using the final model trained in Sect. 4.1, we performed inference on studies from the additional validation set described in Sect. 2.2. Each study was labelled for the presence of a AAA via the radiology report, and for those studies with positive findings, the maximum aortic diameter was also extracted.

For each study, the model’s outputs were compared to the study labels and the model’s overall performance was measured in terms of sensitivity/specificity for detecting AAA and mean error in the maximum diameter. Table 3, last row, summarizes these results, along with a comparison to the model’s performance on the held-out test set for the same metrics. During the process we noted that some studies in this additional validation set extended into thoracic anatomy, and model inference of this region was removed manually in post-processing.
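The study-level metrics described above can be computed as in the following sketch. The function names and the toy labels are hypothetical and do not reproduce the values in Table 3.

```python
def sensitivity_specificity(predicted, reported):
    """Sensitivity and specificity from per-study boolean AAA labels."""
    pairs = list(zip(predicted, reported))
    tp = sum(1 for p, r in pairs if p and r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    # Return NaN when a class is absent and the ratio is undefined.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

def mean_diameter_error(pred_mm, ref_mm):
    """Mean of predicted minus reference maximum diameter, in mm."""
    deltas = [p - r for p, r in zip(pred_mm, ref_mm)]
    return sum(deltas) / len(deltas)

# Toy example with five hypothetical studies.
predicted = [True, True, False, False, True]
reported = [True, True, True, False, False]
sens, spec = sensitivity_specificity(predicted, reported)
bias = mean_diameter_error([41.0, 52.0], [43.0, 55.0])
```

In this toy case two of three reported aneurysms are detected (sensitivity 2/3), one of two negatives is correctly cleared (specificity 1/2), and the negative mean error of -2.5 mm corresponds to underestimating the diameter.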

5 Discussion

While AAAs are rarely missed when they are the leading indication for a study, the rate of detection decreases significantly when the AAA is an incidental finding. DeepAAA aims to provide a “second set of eyes” and reduce the rate of missed incidental findings. Therefore, to properly contextualize model performance, it is important to quantify this rate of misdiagnosis. Claridge et al., in a retrospective analysis of 3246 abdominal CT scans and their reports, found that only 65% of AAAs were detected by radiologists [2]. DeepAAA exceeds the sensitivity they found (Table 4) while achieving a high specificity (Table 3) and localizes the suspected AAA for radiologist confirmation. Thus, a parallel read from our algorithm could potentially provide a significant reduction in missed AAAs and offer significant clinical value, enabling early detection and treatment of AAA.

Many observers have noted that machine learning models applied to radiology may not generalize well [10]. Changing the equipment used to capture input images and changing the demographics of the underlying patient cohorts tend to reduce model performance. This lack of generalizability would significantly hamper a model’s clinical utility because deployment at sites other than where the model was trained may result in surprising under-performance. To test DeepAAA’s ability to generalize, we simulated a significant change in input data by creating a second cohort of validation data (Sect. 2.2) acquired from different patients using different equipment more than five years after the original training data were acquired. The model showed higher specificity (100%) and reduced mean error in diameter prediction with only slightly lower sensitivity (85%), indicating that the model is robust and has not over-fit to cohort- or equipment-related idiosyncrasies of the original training data.

Table 4. Comparison between DeepAAA and literature reported performance of radiologists on AAA reporting for routine abdominal CT according to aneurysm size

Future work would involve extending the DeepAAA model beyond the abdominal region to include segmentation of the thoracic aorta. Thoracic aortic aneurysms (TAA), although not nearly as prevalent as AAA, are still a significant source of mortality and generally affect a younger population. In addition, models to predict AAA growth or rupture would be of significant clinical value in guiding more targeted surveillance programs and therapy.