Keywords

1 Introduction

Idiopathic scoliosis is the most common type of spinal deformity [1] and is typically found in adolescents. Scoliosis in adolescents may progress during their growth period, and it can cause further deterioration of spinal imbalance, back pain, neurological deficits, and even cardiopulmonary compromise. Therefore, early detection and regular monitoring of curve progression are very important. Cobb angle [2], defined as the angle between the most tilted vertebras in anterior-posterior view X-rays, is the gold standard for assessment of curve severity and plays an important role in making surgical decision of each patient. In clinical practice, there are five lumbar vertebras and twelve thoracic vertebras in each X-ray for assessment. The physicians first find four landmarks and draw a quadrilateral accordingly on each vertebra (Fig. 1), then observe the most tilted position and extend the straight line from the edge of relevant quadrilaterals to get the Cobb angle. For manual measurement, the landmarks are determined by subjective judgement, which leads to large inter-rater variation, and the measurement takes a long time because some trials are required to get the final results. In order to reduce the variability of Cobb angle measurement and improve clinical efficiency, in this paper, we propose an automated Cobb angle estimation framework, which can provide reliable assistance for the physicians.

Fig. 1.
figure 1

(a) The vertebras labeled by physicians, four landmarks are connected to a quadrilateral on each vertebra. (b) The anterior-posterior view X-ray with Cobb angle.

Previous attempts on automated Cobb angle estimation can be divided to indirect estimation and direct estimation according to their respective methodology. The indirect estimation methods are based on vertebra location and segmentation. Such as [3] uses the snake model to segment the vertebras. [4] applies the particle model to determine the curvature on radiographical spinal images. After the segmentation, these methods attempt to fit straight lines based on the segmentation result or contour of each vertebra, and calculate the Cobb angle based on each pair of lines. The direct estimation methods aim to directly find all the landmarks of vertebras on the whole image, which is just like the manual measurement of physicians. Such as [5] uses the structured multi-output regression to predict the landmarks. [6] uses a special Boost-net to transform the feature space and finds the landmarks. [7] considers the relationship of adjacent vertebras, then uses a MVC-net to detect all the landmarks. After landmark prediction, the Cobb angle result can be obtained by rules.

The indirect and direct estimation methods have their own advantages and disadvantages. In indirect estimation methods, the whole vertebra area is large and robust against the noise, so that it can always be located and segmented. However, for each vertebra, the straight line fitting is sensitive to the segmentation result. Even a small change of segmentation result can lead to great change of the straight line, which may cause large error on Cobb angle measurement. In direct estimation methods, once the landmarks are predicted, the Cobb angle can be directly obtained by rules. However, a single landmark is less robust than a vertebra area, which can be severely influenced by the noise in the background, and lead to large deviation of Cobb angle.

The instinctive thought is to take advantages in both methods. Based on this motivation, we design a two-stage framework for the Cobb angle measurement. In last few years, the deep learning techniques especially the CNN methods have been developed rapidly, such as Resnet [8] and U-net [9] have reached high performance in medical image applications. Therefore we use CNN methods in our task. We first detect each vertebra in original X-ray and get the vertebra bounding box sequence. The bounding boxes provide the local vertebra areas, which are used as the area limitation for the landmark estimation. We design a CNN with an unsupervised enhancement module to predict four landmarks on each local vertebra area, which can remove the noise in the background. Finally we calculate the Cobb angle by rules. Compared to the indirect estimation and direct estimation methods, the experiment resultsFootnote 1 show that our hybrid method achieves much more precise landmark location and smaller error on Cobb angle. Therefore it can provide reliable diagnostic support for the physicians.

2 Methods

2.1 System Framework

Given an anterior-posterior view X-ray, our system can output the four landmarks of each vertebra and Cobb angle. The system framework is shown in Fig. 2. Our system takes advantages of the indirect and direct estimation methods to get better performance. The input X-ray is first sent to a vertebra detection network. It outputs the filtered vertebra boxes sequence, which provides the area limitation for the landmark estimation. Then each local vertebra area in the sequence is sent to the landmark estimation network to get four landmarks. Based on the vertebra sequence with landmarks, we can calculate the Cobb angle by rules. More details are explained in the following sections.

Fig. 2.
figure 2

Our two-stage system framework. The structure of landmark estimation network is shown above, and the number of channels is under each feature map.

2.2 Vertebra Detection Network

The vertebra detection network is used to find the boundary of each vertebra from the original X-rays. Since the missing vertebra bounding boxes will never be found in the following steps, the accurate detection of bounding boxes is required for the landmark estimation. The Mask R-CNN [10] is the state of the art method for object detection and segmentation, which performance is even better than the Faster R-CNN [11]. For these reasons, we adopt the Mask R-CNN structure as our vertebra detection network. For network training, we connect four labeled landmarks to get each vertebra area and generate the mask with pixel information, then we use the mask to train the network.

The vertebra detection network outputs all the vertebra bounding boxes \([x_{lt},y_{lt},x_{rb},y_{rb}]\) for each X-ray, which represents the x-coordinate and y-coordinate of the left top and right bottom vertex of a bounding box. We use a filter to remove the wrong detection results. The center point of a bounding box can be calculated as \([x_{c},y_{c}] = [\frac{x_{lt}+x_{rb}}{2},\frac{y_{lt}+y_{rb}}{2}]\). We order the bounding box sequence from the top to bottom according to \(y_{c}\). The distance of the adjacent center points \([x_{c1},y_{c1}],[x_{c2},y_{c2}]\) are calculated by \(d = \sqrt{(x_{c1}-x_{c2})^{2} + (y_{c1}-y_{c2})^{2}}\). If the two distances of one center point and its neighbors are both larger than a threshold D, then this center point is an outlier and the relevant bounding box is removed. After the refinement, we can get the filtered vertebra bounding box sequence for the landmark estimation.

2.3 Landmark Estimation Network

The landmark estimation network (LE-NET) we design is shown in Fig. 2. The local vertebra area is firstly extracted from original X-rays according to the relevant bounding box \([x_{lt},y_{lt},x_{rb},y_{rb}]\), which provides the area limitation. The LE-NET receives the local vertebra area and resizes it to \(64\times 64\). It outputs a 8-dimension vector T after the convolution, max pooling and full connection. An extra enhancement module is inserted into LE-NET: we use the full connection operation to reconstruct the 2048-dimension feature vector based on the 512-dimension feature vector. An unsupervised mean square loss is used to minimize the difference between the original and reconstructed 2048-dimension feature vector, so that the 512-dimension vector is forced to be a compression representation of the 2048-dimension vector, which can provide better features for the landmark prediction. In order to deal with different resolution of detected bounding boxes, we use the normalized coordinate in [0, 1] rather than the absolute coordinate to predict the landmarks. To achieve this goal, a sigmoid function is used to transfer T to the final prediction vector \(pre = \frac{1}{e^{-T}+1}\), which includes the normalized x-coordinate and y-coordinate of four landmarks.

For network training, we extract the bounding boxes which include the four groundtruth landmarks of each vertebra, then randomly enlarge the bounding boxes to get the training data. Assuming that the height and width of an enlarged bounding box are [hw] (it will be resized to \(64\times 64\) before training), we first transform the absolute coordinate of the relevant four landmarks in original X-ray to the local bounding box, then the absolute coordinate vector \([x_{1},y_{1},x_{2},y_{2},x_{3},y_{3},x_{4},y_{4}]\) in the bounding box can be transformed to normalized coordinate vector \(gt = [\frac{x_{1}}{w},\frac{y_{1}}{h},\frac{x_{2}}{w},\frac{y_{2}}{h},\frac{x_{3}}{w},\frac{y_{3}}{h},\frac{x_{4}}{w},\frac{y_{4}}{h}]\), which is used as the label of training data. Therefore the loss function can be defined as

$$\begin{aligned} loss = \sum \limits _{i=1}^{8}(gt(i)-pre(i))^{2} + \lambda \sum \limits _{j=1}^{2048}(O(j)-R(j))^{2} \end{aligned}$$
(1)

Where the O and R represent the original and reconstructed 2048-dimension vector respectively. \(\lambda \) is the weight of reconstruction error. We use formula 1 to train the LE-NET. When we predict the landmarks in bounding box \([x_{lt},y_{lt},x_{rb},y_{rb}]\) output by vertebra detection network, once we get the predict vector \(pre=[x_{1}^{p},y_{1}^{p},x_{2}^{p},y_{2}^{p},x_{3}^{p},y_{3}^{p},x_{4}^{p},y_{4}^{p}]\), we can transform the normalized coordinate of four landmarks to absolute coordinate in original X-ray by using

$$\begin{aligned} x_{i}^{a} = x_{i}^{p}*(x_{rb}-x_{lt})+x_{lt}\ \ \ y_{i}^{a} = y_{i}^{p}*(y_{rb}-y_{lt})+y_{lt}(i=1,2,3,4) \end{aligned}$$
(2)

Since we have ordered the vertebra bounding boxes, after we get the absolute coordinate of all the landmarks, we connect two top landmarks and two bottom landmarks, and get the slope of two straight lines on each vertebra. We calculate the angle for each two vertebras, by choosing the bottom line on the top vertebra and the top line on the bottom vertebra (Fig. 1). The largest angle from all the results is chosen as the final Cobb angle result.

3 Experiments

3.1 Dataset Description and Parameters Setting

For our experiments, the experiment environment is running on ubuntu 14.04 with gpu 1080Ti, and the relevant software is python with tensorflow. The dataset of anterior-posterior view X-rays provided by a local hospital has 1200 images in total and the size of each X-ray (Fig. 1) is \(491\times 957\). For all the images, four landmarks of each vertebra and the Cobb angle are labeled by five professional physicians with cross validation. Therefore the labeling results are thought as the gold standard in our experiments.

The stochastic gradient descent (SGD) optimizer with momentum is used for both networks in our system framework. The X-rays are randomly divided to training set, validation set and test set. The scale is set to 7:2:1. For vertebra detection network, the initial learning rate and momentum are set to 0.001 and 0.9. The batch size is set to 4 and the number of training epoches is set to 60. The threshold D is set to 50 pixels according to the size of X-rays. For LE-NET, the \(\lambda \) in formula 1 is set to 0.01. The initial learning rate and momentum are set to 0.0001 and 0.9. The batch size is set to 32 and the number of training epoches is set to 100. For both two networks, the training step is repeated for 10 times, and the average performance is considered.

3.2 Evaluation and Discussion

We evaluate the vertebra location performance of vertebra detection network. The groundtruth bounding boxes are generated from four labeled landmarks. We calculate the precision, recall and Fscore of the predicted bounding boxes. A predicted box is regarded as a correct one if its intersection over union (IOU) with a groundtruth box is larger than 0.5. The results are shown in Table 1.

Table 1. The location performance of vertebra detection network.

From the results, the vertebra detection network achieves very high performance. This is because a vertebra area is large and robust against noise, which can be easily located by Mask R-CNN. With the filter, some wrong boxes are removed and the precision increases. The bounding boxes can be sent to the LE-NET based on the high performance of vertebra location. For landmark estimation, we compare the visualized results of our two-stage framework and the direct use of LE-NET on whole X-rays. An example is shown in Fig. 3.

Fig. 3.
figure 3

(a) The groundtruth landmarks. (b) The landmarks predicted by the direct use of LE-NET on the whole X-ray. (3) The landmarks predicted by our two-stage framework.

From Fig. 3, the landmarks obtained by the direct use of LE-NET have large deviation from the groundtruth. Compared to the vertebra area, the single landmark is easier to be effected by the noise in background. By using the area limitation, the background is removed so that the landmarks can be predicted much more precisely. The scaled mean absolute error (MAE) metric is used to evaluate the deviation of landmarks:

$$\begin{aligned} scaled\ MAE = \frac{1}{N}\sum \limits _{i=1}^{N}mean(|gt^{l}_{i}-pre^{l}_{i}|) \end{aligned}$$
(3)

Where N is the number of vertebras and mean() is the element-wise arithmetic mean of vector. \(gt^{l}_{i}\) and \(pre^{l}_{i}\) represent the normalized coordinate of groundtruth vector and prediction vector of four landmarks. For Cobb angle evaluation, two widely used evaluation metrics are the circular mean absolute error (CMAE) and symmetric mean absolute error (SMAPE), which are defined as

$$\begin{aligned}&CMAE = \frac{1}{M}\sum \limits _{i=1}^{M}|gt^{a}_{i}-pre^{a}_{i}|\ \ \ SMAPE = \frac{1}{M}\sum \limits _{i=1}^{M}\frac{|gt^{a}_{i}-pre^{a}_{i}|}{gt^{a}_{i}+pre^{a}_{i}}\times 100\% \end{aligned}$$
(4)

Where \(gt^{a}_{i}\) and \(pre^{a}_{i}\) represent groundtruth and system output Cobb angle respectively. M is the number of images. We compare our two-stage framework with both the indirect and direct estimation methods. Since the Mask R-CNN can also output the segmentation results for each vertebra bounding box, it can be used in the indirect estimation method. We also use several straight line fitting methods based on the segmentation results of Mask R-CNN. Some previous works such as the snake model in [3], the Boost-net in [6] and the MVC-net in [7] are also considered for the comparison. The results are shown in Table 2.

Table 2. Comparison of different methods for landmark location and Cobb angle measurement. The hough transformation (HT), canny operator (CA) and least square method (LS) are used in the indirect estimation methods.

From the results, our two-stage framework has the best performance on both landmarks and Cobb angle. With the enhancement module, the performance achieves further improvement, because the output feature vector is more representative for the original bounding box. For indirect estimation methods, although the vertebra area can be located, the slope of the fitted straight line is sensitive to the segmentation results. The wrong segmentation of a small area can lead to large error of Cobb angle. Our method uses the vertebra bounding boxes rather than directly use the segmentation results, which can avoid this problem. For direct estimation methods, there is a large deviation of landmarks in these methods, because the landmarks are influenced by the noise in the background, which causes large landmarks prediction error and severely effects the Cobb angle result. In fact, the location of four landmarks depends on the vertebra itself, which does not need the global information. The background only makes distraction to the landmark estimation. By using the local vertebra areas, our framework removes the background to avoid this problem. Our method takes advantages in both indirect and direct estimation methods, which achieves much more precise landmark location and smaller error on Cobb angle estimation.

4 Conclusion

In this paper, we propose an automated Cobb angle estimation method, which can provide efficient and reliable assistance for the physicians. The landmark location is very important for the Cobb measurement, and it is effected by the noise in X-rays. We design a two-stage framework to do the landmark estimation with area limitation, which takes advantages of indirect and direct estimation methods. Our method achieves more precise landmark location and smaller error on Cobb angle than the previous works. In the future work, we will continue to improve the performance of landmark estimation and Cobb angle measurement.