1 Introduction

One of the consequences of the numerous algorithms proposed for segmenting organs, tissues, the spine etc. involves analysing their anatomical shapes, eventually contributing towards population studies [1], disease characterisation [2], survival analysis [3], etc. Employing convolutional neural networks (CNN) for this task involves processing voxelised data due to its Euclidean nature. Such voluminous representation, however, is inefficient, especially when the masks are binary and the shape information corresponds to its surface profile. Alternatively, surface meshes (a collection of vertices, edges, and faces) or active contours could be used. Since the data is no longer Euclidean, a conventional CNN is unusable. Graph convolutional networks (GCN) [4] were thus developed by redefining the notion of ‘neighbourhood’ and ‘convolution’ for meshes and graphs. However, if the number of nodes is high, GCNs (esp. spectral) become bulky. Moreover, each mesh is treated as a domain, making mesh registration a requisite.

An alternative surface representation is a set of 3D points in space, referred to as the point clouds (PC). A PC represents the surface just with a set of N vertices, thus avoiding both the cubic-complexity of voxel-based representations and the \(N \times N\) dimensional, sparse, adjacency matrix of meshes. However, despite their representational effectiveness, PCs are permutation invariant and do not describe data on a structured grid, preventing the usage of standard convolution. To this end, we work with an architecture capable of processing PCs (point-net, [5]), and design a network capable of reconstructing PCs thereby extracting shape signatures in an unsupervised manner.

Uncertainty and Latent Space Modelling. Unlike supervised learning on PCs [6], we set out to obtain shape signatures from PCs without supervision, building towards a relatively less explored topic of auto-encoding point clouds. This involves mapping the PC to a latent vector and reconstructing it back. Since the PCs are unordered, PC-specific reconstruction losses replace traditional ones [7, 8]. Extending auto-encoders (AE) based on such a loss, we propose to improve its representational capacity by regularising the latent space to make it compact and by modelling the variance that exists in a PC population. We claim that this results in learning improved shape signatures, validating the claim by employing the extracted features for unsupervised vertebral fracture detection.

Fig. 1.
figure 1

Variation among vertebral shapes: compare the higher variation between healthy (blue) vertebrae of different classes (T3, top and L1, bottom) w.r.t the relatively lower variation within-class between fractured (red) and healthy vertebrae. (Color figure online)

Vertebral Fracture Detection. There exists an inherent shape variation in vertebral shapes within the spine of a single patient (e.g. cervical–thoracic–lumbar) along with a natural variation in a vertebra’s shape in a population (e.g. L1 across patients, cf. Fig. 1). Additionally, osteoporotic fractures start without significant shape change and progress into a vertebral collapse. Hence, fracture detection in vertebrae is non-trivial. Added to this, limited availability of fractured vertebrae makes the learning of supervised classifiers non-trivial. In literature, several classification systems exist mainly based on vertebral height measurement [9] or analysing sub-regions of the spine in sagittal slices [10]. However, an explicit shape-based approach seems absent. Evaluating the representational ability of the proposed AE architectures, we seek to analyse vertebral shapes and eventually detect vertebral fractures using the extracted latent shape features.

Our Contribution. Summarising the contributions of this work: (1) We build on existing point-net-based architectures to propose a point-cloud auto encoder (pAE). (2) Reinforcing this architecture, we incorporate latent space modelling and a more challenging uncertainty quantification. (3) We present a comprehensive analysis of the reconstruction capabilities of our pAEs by investigating their utility in detecting vertebral fractures. We work with an in-house, clinical dataset (\(\sim \)1500 vertebrae) achieving an area-under-curve (AUC) of >75% in detecting fractures, even without employing texture or intensity-based features.

2 Methodology

We present this section in two stages: First, we introduce the notation used in this work and describe a point-net-based architecture capable of efficiently auto-encoding point clouds. Second, we build on this architecture to model the natural variance in vertebrae while regularising the latent space.

2.1 Auto-Encoding Point Clouds

Given accurate voxel-wise segmentation of a vertebra, a point cloud (PC) can be extracted as a set of N points denoted by \(X = \{p_i\}_{i=0}^N\), where \(p_i\) represents a point by its 3D coordinate \((x_i, y_i, z_i)\). Additionally, \(p_i\) could also represent other point specific features such as normal, radius of curvature etc. So, each vertebra is represented by a PC of dimension \(N \times m\) (in this work, \(N=2048\) vertices and \(m=3\) coordinates, with the vertices randomly subsampled from a higher resolution mesh). Recall the lack of a regular coordinate space associated with the PC and that any permutation of these N points represents the same PC. Thus, a unique variant of deep networks is incorporated for processing PCs.

Fig. 2.
figure 2

Point cloud auto-encoder (pAE): architectural details of decoding path constructing a point cloud from a latent vector. Top arm is convolutional while bottom arm is fully-connected. Transposed convolution (\(-\cdot -\cdot \text {channels}\)) have a stride of 2. Since encoder is an adapted point-net [5], we detail its architecture in the supplement.

Architecture. An AE consists of an encoder mapping the PC to the latent vector and a decoder reconstructing the PC back from this latent vector, i.e \(X \mapsto z \mapsto X\). As the encoder, we employ a variant of the point-net architecture [5]. The latent vector, z, respects the permutation invariance of the PC and represents its shape signature. As a decoder, taking cues from [7], we construct a combination of an up-convolutional and dense branches taking z as input and predicting \(\hat{X}\), the reconstructed X. The convolutional path, owing to its neighbourhood processing, models the ‘average’ regions, while the dense path reconstructs the finer structures. This combination of the point-net and the decoder forms our point cloud auto-encoding (pAE, or interchangeably AE) architecture as illustrated in Fig. 2.

Loss. Reconstructing point clouds requires comparing the predicted PC with the actual PC to back-propagate the loss during training. However, owing to the unordered nature of PCs, usual regression losses cannot be employed. Two prominent candidates for such a task are the Chamfer distance and the Earth Mover (EM) distance [7]. We observed that minimising EM distance ignores the natural variation in shapes (e.g. the processes of the vertebrae) and reconstructs only a mean representation (e.g. the vertebral body), as validated in [7]. Since we intend to model the natural variance in the data, using EM distance is undesirable in our case. We thus employ the Chamfer distance computed as:

$$\begin{aligned} d_{ch}(X, \hat{X}) = \mathcal {L}_{ae} = \sum _{p \in X} \min _{\hat{p} \in \hat{X}} || p - \hat{p} ||_2^2 + \sum _{\hat{p} \in \hat{X}} \min _{p \in X} || p - \hat{p} ||_2^2. \end{aligned}$$
(1)

In essence, \(d_{ch}\) is the distance between a point in X and its nearest neighbour in \(\hat{X}\) and vice versa.

2.2 Probabilistic Reconstruction

From a generative modelling perspective, an AE can be seen to predict the parameters of Gaussian distribution imposed on X, i.e. \(p_{\varTheta }(X) = \mathcal {N}(X | \hat{X}, \hat{\mathrm {\Sigma }})\), parameterised by the weights of the AE denoted by \(\varTheta \). Determining the distribution parameters, viz. optimising for the AE weights, now involves maximising the log-likelihood of X, resulting in:

$$\begin{aligned} \varTheta ^* = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _\varTheta \log p_\varTheta (X) = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _\varTheta \frac{1}{2}(X - \hat{X})^T\hat{\mathrm {\Sigma }}^{-1}(X - \hat{X}) + \frac{1}{2}\log |\hat{\mathrm {\Sigma }}|. \end{aligned}$$
(2)

This perspective towards auto-encoding enables us to extend the pAE to encompass the data variance (\(\hat{\mathrm {\Sigma }}\)) while modelling the latent space, as described in following sections. It is important to note that the difference \(X-\hat{X}\) is not well defined for point clouds, requiring us to opt for alternatives.

Assuming \(\mathrm {\Sigma } = \mathbb {I}\), implying an independence among the elements of X and an element-wise unit variance, results in the familiar mean squared error (MSE), \(\mathcal {L} = ||X - \hat{X}||^2\). Based on the parallels between MSE and the Chamfer distance (Eq. 1), we design \(\sigma \)-AE and \(\sigma \)-VAE, as illustrated in Fig. 3.

Fig. 3.
figure 3

Probabilistic reconstruction architectures: \(\sim \) indicates a sampling operation. Since a point’s variance has a smaller scale compared to its mean, the variance is predicted using a softplus activation (added with \(\epsilon = 10^{-6}\) for stabilising divisions) and uses a layer parallel to the one predicting the mean.

\(\varvec{\sigma }\)-AE. The assumption of unit covariance, as in AE, is inherently restrictive. However, modelling an unconstrained covariance matrix is infeasible due to quadratic complexity. A practical compromise is the independence assumption. Thus, representing covariance as, \(\mathrm {\Sigma } = diag\{\hat{\sigma }_{p_1}^2, \dots , \hat{\sigma }_{p_i}^2, \dots , \hat{\sigma }_{p_N}^2\}\), where \(\hat{\sigma }_{p_i}^2\) denotes the variance corresponding to \(p_i\), Eq. (2) morphs to a loss function as:

$$\begin{aligned} \mathcal {L} = \sum _{\hat{p} \in \hat{X}} \sigma _{\hat{p}}^{-2} ||p_i - \hat{p_i}||^2 + \log \sigma _{\hat{p}}^2 \end{aligned}$$
(3)

This optimisation models the aleoteric uncertainty [11]. Equation 3 is an attenuated MSE, where a high variance associated to a point down-weighs its contribution to the loss. However, due to the lack of a reference grid in the point cloud space, the notion of uncertainty being associated to a data point (eg. pixel, spatial location etc.) is absent. We propose to associate the notion of variance to every point, \(\hat{p_i}\). This results in the variance-modelling Chamfer distance:

$$\begin{aligned} \mathcal {L}_{\sigma ae} = \sum _{p \in X} \min _{\hat{p} \in \hat{X}} \sigma _{\hat{p}}^{-2}|| p - \hat{p} ||_2^2 + \sum _{\hat{p} \in \hat{X}} \sigma _{\hat{p}}^{-2} \min _{p \in X} || p - \hat{p} ||_2^2 + \log \sigma _{\hat{p}}^2 \end{aligned}$$
(4)

Observe the slight abuse of notation in Eq. 4, wherein the variance at a predicted point, \(\sigma _{\hat{p}}\), actually represents the variance of the coordinate elements of p, i.e \(\{\sigma _{\hat{x}}, \sigma _{\hat{y}}, \sigma _{\hat{z}}\}\). Current notation is chosen to avoid clutter.

Variational and \(\varvec{\sigma }\)-Variational AE. An alternative approach for modelling p(X) involves modelling its dependency over a latent variable z, which is distributed according to a known prior p(z). A variational auto-encoder (VAE) operates on these principles and involves maximising a lower bound on the log-evidence (referred to as ELBO) of the data described as below:

$$\begin{aligned} \log p(X) \ge \mathbb {E}_{z\sim q_\phi (z|X)}\big [\log p_\theta (X|z)\big ] - \text {KL}\big [q_\phi (z|X)~||~p_\theta (z)\big ], \end{aligned}$$
(5)

where \(q_\phi (z|x)\) is the approximate posterior of z learnt by the encoder and parameterised by \(\phi \). \(p_\theta (X|z)\) is the data likelihood modelled by the decoder and parameterised by \(\theta \). \(p_\theta (z)\) is the prior on z.

Maximising ELBO is equivalent to maximising the log-likelihood of X while minimising the Kullback-Leibler divergence between the approximate and true prior. Representing the combination as \(\mathcal {L}_{rec} + \beta \mathcal {L}_{KL}\), where \(\mathcal {L}_{rec}\) is the reconstruction loss seen is earlier sections. \(\beta \) is a scaling factor weighing the contribution of the two losses appropriately. Standard practice assigns Gaussian distributions for \(q_\phi (z|x) \sim \mathcal {N}(z | \varvec{\mu }_z, \varvec{\sigma }_z)\) and \(p(z) \sim \mathcal {N}(z | \mathbf {0}, \mathbf {1})\) (cf. Fig. 3). Thus, \(\mathcal {L}_{KL}\) models the latent space to follow a Gaussian distribution inline with the prior. Incorporating this into the point cloud domain, results in an objective function for a PC-based VAE (or \(\sigma \)-VAE) as \(\mathcal {L}_{vae} = \mathcal {L}_{ae/\sigma ae} + \beta \mathcal {L}_{KL}\). Thus, \(\sigma \)-VAE acts as a AE capable of modelling the data variance while regularising the latent space. The prior on the latent space also imparts point cloud generation capabilities to \(\sigma \)-VAE.

2.3 Detecting Fractures as Anomalies

Examining the descriptive ability of our pAE architectures in auto-encoding PCs, we utilise them for detecting vertebral fractures. Assuming the AE is trained only on ‘normal’ patterns, a fracture can be detected as an ‘anomaly’ based on its ‘position’ in latent space. We inspect two measures for this purpose:

  1. 1.

    Reconstruction error or Chamfer distance: AEs trained on healthy samples fail to accurately reconstruct anomalous ones, resulting in a high \(d_{ch}\).

  2. 2.

    Reconstruction probability or likelihood [12]: Expected likelihood \(\mathbb {E}\big [p_\varTheta (X)\big ]\) of an input can be computed for \(\sigma \) – architectures (cf. Eq. 2). For any input PC, \(X_{in}\), it is computed by \(\mathcal {N}(X_{in} | \mu _\varTheta ,\varSigma _\varTheta )\) with the predicted mean and variances. We expect fractured vertebrae to be less likely than healthy ones.

Intuitively, relying on the reconstruction error or likelihood for detecting anomalies requires the learnt ‘healthy’ latent space to be representative. Both \(\sigma \)-AE and the VAE work towards this objective. In \(\sigma \)-AE, predictive variance down-weighs the loss due to highly uncertain points in the PC. This suppresses the interference due to natural variation in the vertebral PCs. On the other hand, VAE acts directly on the latent space by modelling the encoding uncertainty (\(X\mapsto z\)). The \(\sigma \)-VAE encompasses both these features.

Inference. A given vertebral PC is reconstructed and the reconstruction error and (or) likelihood are computed. This vertebra is said to be fractured if the reconstruction error is greater than a threshold, \(T_{rec}\), or its likelihood is lesser than a threshold, \(T_{l}\). \(T_{rec}\) and \(T_{l}\) and determined on the validation set.

3 Experiments and Discussion

We present this section in two parts: first, we explore the auto-encoding, variance modelling, and generative capabilities of our AE networks. Second, we deploy these architecture to detect vertebral fractures without supervision.

Data preparation: We evaluate our architecture on an in-house dataset with accurate voxel-level segmentations converted into PCs. The dataset consists of 1525 healthy and 155 fractured vertebrae, denoted as (\(1525H + 155F\)) vertebrae. Since we intend to learn the distribution of healthy vertebrae, we do not use any fractured vertebrae during training. The validation and test sets consists of (\(50H + 55F\)) and (\(100H+100F\)) vertebrae, respectively. For the supervised baselines, the train set needs to contain fractured vertebrae. Thus, validation and test sets were altered to contain (\(50H + 55F\)) and (\(55H + 55F\)) vertebrae.

Training: The architecture of the encoder and the decoder is similar across all architectures (cf. Fig. 3) except for the layers predicting variance. PCs are augmented online by perturbing the points with Gaussian noise and random rotations (\(\pm 15^\circ \)). Finally, the PCs are median-centred to origin and normalised to have the same surface area. The networks are trained until convergence using an Adam optimiser with an initial learning rate of \(5 \times 10^{-4}\). Specific to the VAE, we use KL-annealing by increasing \(\beta \) from 0 to 0.1.

Fig. 4.
figure 4

Characteristics of \({\sigma }\)-VAE: (a) Comparison of TSNE embeddings of simple pAE with \(\sigma \)-VAE. Observe transition in clusters being inline with vertebral indices. Note that embedding becomes compact for a VAE. (b) A PC and its reconstruction coloured with \(\log (\sigma ^2)\) of every point. Observe high variance in vertebral processes. (c) Example generations from decoder with \(z\sim \mathcal {N}(\mathbf {0}, \mathbf {1})\). (Color figure online)

Qualitative Evaluation of AE Architectures. We investigate if meaningful shape features can be learnt without supervision. Validating this, in Fig. 4a, we plot a TSNE embedding of the test set latent vectors learnt by a naive pAE and \(\sigma \)-VAE trained only on healthy vertebrae. Observe the clusters formed based on the vertebral index and the transition between the indices. This corresponds to the natural variation of vertebral shapes in a human spine. Indicating the fractured vertebrae in the embedding, we highlight their degree of similarity with the healthy counterparts. Also, observe that embedding is more regularised representing a Gaussian in case of \(\sigma \)-VAE, indicating the continuity of the learnt latent space. Figure 4b shows the predictive variance modelled by the \(\sigma \)-VAE. Posterior elements of a vertebrae are the most varying among population. Observe this being captured by the variance in the vertebral process regions. Lastly, illustrating \(\sigma \)-VAE’s generative capabilities, Fig. 4c shows vertebral PC samples generated by sampling the latent vector, \(z\sim \mathcal {N}(\mathbf {0}, \mathbf {1})\).

Vertebral Fracture Detection. Evaluating the reconstruction quality of our pAE architectures, we employ them to detect fractures as anomalies. As baselines, we choose two supervised approaches: (1) point-net (PN), the encoding part in our pAE architectures, cast as a binary classifier and (2) the same point-net trained with median frequency balancing the classes (ref. as PN\(_{bal}\)) to accentuate the loss from minority fractured class. We report their performance in Table 1, over 3-fold cross-validation while retaining the ratio of healthy to fractured vertebrae in the data splits. Frequency balancing improves the F1 score significantly, albeit not at the level of the proposed anomaly detection schemes.

Fig. 5.
figure 5

Reconstructions: healthy (top) and fractured (bottom) vertebral PCs. Observe pAE’s ‘healthy’ reconstruction of a fracture. Errors and log-probabilities are normalised to [0, 1] within PC for visualisation, but anomaly detection works on un-normalised values.

Table 1. Performance comparison of unsupervised and supervised fracture detection approaches. Measures: Precision (P), Recall (R), F1-score, and area-under-ROC curve (AUC) computed by varying thresholds on recon. error and recon. log-probabilities. Since supervised models have no threshold selection, AUC is not reported.

Reconstruction for fracture detection: When detecting fractures based on reconstruction error (\(d_{ch}\)), we observe that a naive pAE already out-performs the supervised classifiers (cf. Table 1). On top of this, we see that latent space modelling and variance modelling individually offer an improvement in F1-scores while increasing the AUC, indicating a stable detection of fractures. The performance of both \(\sigma \)-AE and \(\sigma \)-VAE is similar indicating the role of loss attenuation. However, the advantage of explicitly regularising the latent space for \(\sigma \)-VAE can be seen in likelihood-based anomaly detection, where \(\sigma \)-VAE outperforms \(\sigma \)-AE. Figure 5 compares a reconstruction of a healthy and fractured vertebrae of the same vertebral level. Note the high reconstruction error and a low log-likelihood spatially corresponding to the deformity due to fracture.

4 Conclusions

We presented point-cloud-based auto-encoding architectures for extracting descriptive shape features. Improving their description, we incorporated variance and latent space-modelling capability using specially defined PC specific losses. The former captures the natural variance in the data while the latter regularises the latent space to be continuous. Deploying these networks for the task of unsupervised fracture detection, we achieved an AUC of 76% without using any intensity or textural features. Future work will combine the extracted shape signatures with textural features e.g. bone density and trabecular texture of vertebrae to perform fracture-grade classification.