1 Introduction

Alzheimer’s Disease (AD) is a chronic neurodegenerative disease that severely impacts patients’ thinking, memory, and behavior. AD is listed as the sixth leading cause of death in the United States, threatening 5.7 million Americans and 44 million individuals worldwide [1]. Over the past decade, neuroimaging measures have been widely studied to predict disease status and/or cognitive performance [14]. However, many predictive models share a critical limitation: they routinely perform learning at each time point separately, ignoring the longitudinal variations among temporal brain phenotypes. This raises two issues. First, since AD is a progressive neurodegenerative disorder, multiple consecutive neuroimaging records are usually required to monitor disease progression, so it would be beneficial to explore the temporal relations among the longitudinal records of the biomarkers. Second, the records of neuroimaging biomarkers are often missing at some time points for some participants over the course of AD progression, because it is difficult to conduct medical scans consistently across a large group of participants. More specifically, higher mortality risk and cognitive impairment hinder older adults from staying in studies requiring multiple visits, which results in incomplete data.

To overcome the first limitation, many studies [3, 9, 17, 18] explore the temporal structures of brain phenotypes over time. However, these models often formulate the temporal neuroimaging records as a tensor, which inevitably complicates the prediction problem. To handle the second limitation of data inconsistency, most longitudinal studies of AD only utilize the samples with complete temporal records for model analysis and discard the samples with fewer time points. But discarding observed samples may neglect substantial valuable information in the data. To address this problem, data imputation methods [4, 10] have been used to generate missing records over AD progression, after which temporal regression studies can be conducted. However, missing data imputation can introduce undesirable artifacts, which in turn can worsen the predictive power of longitudinal models.

To deal with the longitudinal prediction problem with incomplete temporal inputs, we propose a novel method that learns an enriched biomarker representation integrating the baseline neuroimaging records of the entire cohort with the dynamic measures across all following time points of each individual. First, it learns a global projection from the baseline biomarker records of all the participants to preserve the global structure of a given neuroimaging dataset. Then, for each participant we learn a local projection from the available follow-up neuroimaging records in the subsequent years to maintain the local data structure of that participant. Finally, a soft constraint enforces consistency between the global and local projections. A schematic illustration of this formulation is shown in Fig. 1. Armed with the learned projections, we can transform the records of inconsistent sizes in a neuroimaging dataset into fixed-length vectors. With these fixed-length biomarker representations, we can make use of conventional regression methods to predict clinical scores.

Fig. 1. Illustration of the proposed framework to learn the global and local projections. The blank plots denote the absence of scans for the participants of an AD dataset.

2 Problem Formalization and the Solution Algorithm

In the task of representation learning for predicting cognitive outcomes, our goal is to learn a fixed-length biomarker representation vector for a participant from her or his baseline neuroimaging measures and all available follow-up measures. For the observed neuroimaging measures of a participant, we denote this information as \(\mathcal { X } _ i = \{ \mathbf { x } _ i , \mathbf { X } _ i\}\), where \(i = 1,2, \cdots , m\) is the index of a participant in a neuroimaging dataset. In every \(\mathcal { X } _ i\), \(\mathbf { x } _ i\in \mathfrak {R}^ { d }\) is the baseline neuroimaging measure of the i-th participant, where d denotes the dimensionality of the neuroimaging feature set, and \(\mathbf { X } _ i = \left[ \mathbf { x } _ { 1 }^i, \ldots , \mathbf { x } _ {n_i}^i \right] \in \mathfrak {R}^ { d \times n_i }\) collects all available follow-up biomarker records of the i-th participant after the baseline, where \(n_i\) denotes the number of available follow-up neuroimaging records of the same participant. We note that \(n_i\) varies over the dataset, because many participants of a cohort may have missing records at certain time points.
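To make the ragged data layout concrete, here is a minimal sketch (in NumPy, with illustrative names of our own choosing, not from the paper) of how such a cohort could be held in memory; note that the number of follow-up columns differs per participant:

```python
import numpy as np

d = 90                       # feature dimension (e.g., the 90 VBM ROIs used later)
rng = np.random.default_rng(0)

# Each participant i contributes a baseline vector x_i (shape (d,))
# and a follow-up matrix X_i (shape (d, n_i)) with participant-specific n_i.
cohort = []
for i in range(5):
    n_i = int(rng.integers(2, 6))                 # 2..5 available follow-ups
    cohort.append({
        "x": rng.standard_normal(d),              # baseline record x_i
        "X": rng.standard_normal((d, n_i)),       # follow-up records X_i
    })
# The ragged second dimension is why a single fixed-size design matrix
# (or a complete tensor) cannot be formed without imputation.
```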

Because the neuroimaging measures usually reside in a high-dimensional space, they can be redundant and noisy. Thus, we learn compact representations of these measures via a global projection that keeps only the most useful information of an input dataset. Principal Component Analysis (PCA) is the right tool for this goal, because it learns from the input data a linear projection that preserves as much information as possible in a low-dimensional subspace. Mathematically, PCA finds the projection \(\mathbf { W } \in \mathfrak {R}^ { d \times r }\) (usually \(r \ll d\)) that minimizes the following reconstruction error:

$$\begin{aligned} \mathcal { J } _ { \text{ Global } } ( \mathbf { W } ) = \sum _ { i = 1 } ^ { m } \left\| \mathbf { x } _ { i } - \mathbf { W } \mathbf { W }^ { T } \mathbf { x } _ { i } \right\| _2^2, \quad s.t. \ \mathbf { W } ^ { T } \mathbf { W } = \mathbf { I }. \end{aligned}$$
(1)
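For intuition, the minimizer of Eq. (1) is spanned by the top-r left singular vectors of the baseline data matrix; below is a minimal NumPy sketch of our own, under the standard PCA assumption that the measures are mean-centered first (the paper does not state its centering convention):

```python
import numpy as np

def global_projection(Xb, r):
    """Return W (d x r, orthonormal columns) minimizing Eq. (1),
    i.e., the top-r principal directions of the baseline matrix Xb (d x m)."""
    Xb = Xb - Xb.mean(axis=1, keepdims=True)   # PCA assumes centered data
    U, _, _ = np.linalg.svd(Xb, full_matrices=False)
    return U[:, :r]

# Reconstruction error of Eq. (1) on random stand-in data:
Xb = np.random.randn(90, 544)
W = global_projection(Xb, r=10)
Xc = Xb - Xb.mean(axis=1, keepdims=True)
err = np.sum((Xc - W @ (W.T @ Xc)) ** 2)       # sum of squared residuals
```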

Usually the neuroimaging measures of each participant do not experience drastic changes in a short time; thus we also want to preserve the local consistency in the projected space for each participant. To uncover the local consistency among the neuroimaging records of each participant, we keep the local pairwise patterns of the measures in the projected subspace. Locality preserving projections (LPP) is the right tool for this purpose. Given a pairwise similarity matrix \(\mathbf {S}_i \in \mathfrak {R}^{n_i \times n_i}\) for the i-th participant of the neuroimaging dataset, LPP preserves the local relationships and maximizes the smoothness of the data manifold in the embedding space by minimizing the following objective:

$$\begin{aligned} \mathcal { J } _ { \text{ Local } } ( \mathbf { W }_i ) = \sum _{\mathbf {x}_{j}^i, \mathbf {x}_{k}^i \in \mathbf {X}_i } s_{jk}^i \left\| \mathbf {W}_i^T \mathbf {x}_{j}^i - \mathbf {W}_i^T \mathbf {x}_{k}^i \right\| _2^2, \quad s.t. \ \mathbf { W }_i ^ { T } \mathbf { W }_i = \mathbf { I }, \end{aligned}$$
(2)

where \(s_{jk}^i\) is the pairwise relationship coefficient between the available records of the i-th participant at the j-th and k-th time points.
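To make Eq. (2) concrete, here is a small sketch that evaluates the local objective for one participant; the heat-kernel affinity used for \(s_{jk}^i\) below is a common LPP choice and an assumption on our part, since the text does not fix the similarity:

```python
import numpy as np

def heat_kernel_similarity(X_i, sigma=1.0):
    """s_jk = exp(-||x_j - x_k||^2 / (2 sigma^2)); one common LPP affinity
    (an assumption -- the choice of S_i is not specified here)."""
    sq = np.sum((X_i[:, :, None] - X_i[:, None, :]) ** 2, axis=0)
    return np.exp(-sq / (2.0 * sigma ** 2))

def local_objective(W_i, X_i, S_i):
    """Evaluate Eq. (2): sum_jk s_jk ||W_i^T x_j - W_i^T x_k||^2."""
    Y = W_i.T @ X_i                                # projected records (r, n_i)
    sq = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
    return float(np.sum(S_i * sq))
```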

To integrate the global and local consistencies of the neuroimaging measures at the same time, following the same idea as in our previous works [6, 7, 15, 16], we develop a combined objective from the two objectives above and minimize:

$$\begin{aligned} \mathcal { J }_ { \ell _ { 2 }^2 } (\mathcal {W})&= \sum _ { i = 1 } ^ { m } \left\| \mathbf { x } _ { i } - \mathbf { W } \mathbf { W }^ { T } \mathbf { x } _ { i } \right\| _2^2 + \alpha \sum _ { i = 1 } ^ { m } \sum _{\mathbf {x}_{j}^i, \mathbf {x}_{k}^i \in \mathbf {X}_i } s_{jk}^i \left\| \mathbf {W}_i^T \mathbf {x}_{j}^i - \mathbf {W}_i^T \mathbf {x}_{k}^i \right\| _2^2 \nonumber \\&+ \beta \sum _ { i = 1 } ^ { m } \left\| \mathbf { W } - \mathbf { W } _i \right\| _F^2, \qquad s.t. \quad \mathbf { W } ^ { T } \mathbf { W } = \mathbf { I }, \ \mathbf { W }_i ^ { T } \mathbf { W }_i = \mathbf { I } , \end{aligned}$$
(3)

where we denote \(\mathcal {W} = \{ \mathbf {W} , \mathbf {W}_1 , \cdots , \mathbf {W}_m\}\). Through the third term of Eq. (3), the projections \(\mathbf {W}_i\) learned from each participant are aligned with the projection \(\mathbf {W}\) learned globally from the baseline measures of the entire dataset, such that the information conveyed by the global projection \(\mathbf {W}\) learned from all the participants as a whole can be transferred to each individual participant.
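Putting the three terms together, here is a hedged evaluation sketch of Eq. (3), reusing the naming conventions of the sketches above:

```python
import numpy as np

def combined_objective(W, Ws, Xb, Xs, Ss, alpha, beta):
    """Evaluate Eq. (3). Xb: (d, m) baselines; Ws/Xs/Ss: per-participant
    local projections, follow-up matrices, and similarity matrices."""
    J = np.sum((Xb - W @ (W.T @ Xb)) ** 2)             # global PCA term
    for W_i, X_i, S_i in zip(Ws, Xs, Ss):
        Y = W_i.T @ X_i
        sq = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
        J += alpha * np.sum(S_i * sq)                  # local LPP term, Eq. (2)
        J += beta * np.sum((W - W_i) ** 2)             # soft alignment term
    return float(J)
```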

Taking into account that the squared \(\ell _2\)-norm is notoriously sensitive to outliers in a dataset, which often causes traditional regression models to fail, we improve the robustness of the enriched representations learned by our new method by substituting the squared \(\ell _2\)-norm terms in Eq. (3) with their not-squared counterparts as follows:

$$\begin{aligned} \mathcal { J }_ { \ell _ { 2} } ( \mathcal {W})&= \sum _ { i = 1 } ^ { m } \left\| \mathbf { x } _ { i } - \mathbf { W } \mathbf { W }^ { T } \mathbf { x } _ { i } \right\| _{2} + \alpha \sum _ { i = 1 } ^ { m } \sum _{\mathbf {x}_{j}^i, \mathbf {x}_{k}^i \in \mathbf {X}_i } s_{jk}^i \left\| \mathbf {W}_i^T \mathbf {x}_{j}^i - \mathbf {W}_i^T \mathbf {x}_{k}^i \right\| _2 \nonumber \\&+ \beta \sum _ { i = 1 } ^ { m } \left\| \mathbf { W } - \mathbf { W } _i \right\| _F^2,\qquad s.t. \quad \mathbf { W } ^ { T } \mathbf { W } = \mathbf { I }, \ \mathbf { W }_i ^ { T } \mathbf { W }_i = \mathbf { I }. \end{aligned}$$
(4)

Upon solving the optimization problem in Eq. (4), we learn a fixed-length representation for each participant by computing \(\{\mathbf {y}_i =\mathbf { W }_i^T \mathbf {x}_i \}_{i=1}^m\), which can be readily fed into traditional machine learning models.
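A sketch of this final step, with random orthonormal stand-ins in place of the learned projections (all names here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import Ridge

m, d, r = 544, 90, 10
rng = np.random.default_rng(1)
# Stand-ins for the learned local projections W_i and baseline measures x_i:
Ws = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(m)]
xs = [rng.standard_normal(d) for _ in range(m)]
mmse = rng.standard_normal(m)                        # hypothetical target scores

Y = np.stack([Ws[i].T @ xs[i] for i in range(m)])    # (m, r) enriched features
model = Ridge(alpha=1.0).fit(Y, mmse)                # any standard regressor
```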

Although the motivation behind the formulation of our new method in Eq. (4) is clear and justifiable, the objective is non-smooth and in general difficult to solve efficiently. Using the optimization framework presented in our earlier work [8], which solves non-smooth objectives with not-squared \(\ell _2\)-norm terms, we can solve Eq. (4) by an iterative procedure [8, Algorithm 1] whose key step is to minimize the following objective:

$$\begin{aligned} \mathcal { J }^{\text {R}}_ { \ell _ { 2 } } ( \mathcal {W})&= {\text {tr}} \left[ \left( \mathbf { X } - \mathbf { W } \mathbf { W }^ { T } \mathbf { X } \right) \mathbf {\Gamma } \left( \mathbf { X } - \mathbf { W } \mathbf { W }^ { T } \mathbf { X } \right) ^T \right] + \alpha \sum _ { i = 1 } ^ { m } {\text {tr}} \left( \mathbf { W }_i^T \mathbf { X }_i \mathbf { L }_i \mathbf { X }_i^T \mathbf { W }_i\right) \nonumber \\&+\beta \sum _ { i = 1 } ^ { m } \left\| \mathbf { W } - \mathbf { W } _i \right\| _F^2, \quad s.t. \quad \mathbf { W } ^ { T } \mathbf { W } = \mathbf { I }, \ \mathbf { W }_i ^ { T } \mathbf { W }_i = \mathbf { I }, \end{aligned}$$
(5)

where \(\mathbf {X} = [\mathbf {x}_1, \mathbf {x}_2, \cdots , \mathbf {x}_m] \in \mathfrak {R}^{d\times m}\) summarizes all the baseline measurements, and \(\mathbf {\Gamma } \in \mathfrak {R}^{m \times m }\) is a diagonal matrix whose i-th diagonal element is \(\gamma ^i =\frac{1}{2\sqrt{ \left\| \mathbf { x } _ { i } - \mathbf { W } \mathbf { W }^ { T } \mathbf { x } _ { i } \right\| _ { 2 } ^ { 2 } + \delta }}\). Defining \(\theta _{jk}^i = \frac{1}{2 \sqrt{ \left\| \mathbf {W}_i^T \mathbf {x}_{j}^i - \mathbf {W}_i^T \mathbf {x}_{k}^i \right\| _ { 2 } ^ { 2 } + \delta }}\) and \(\tilde{\mathbf {S}}_i \in \mathfrak {R}^{n_i \times n_i}\) whose elements are \(\tilde{s}_{jk}^i =\theta _{jk}^i s_{jk}^i \), in Eq. (5) we compute \(\mathbf {L}_i = \mathbf {D}_i -\tilde{\mathbf {S}}_i\), where \(\mathbf {D}_i\) is a diagonal matrix whose diagonal entries are the column (or row) sums of \(\tilde{\mathbf {S}}_i\), i.e., \(d_{jj}^i=\sum _{k} \tilde{s}_{jk}^i \).
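A sketch of this reweighting step, following the definitions above directly (the small constant \(\delta\) guards against division by zero):

```python
import numpy as np

def reweighting(W, Ws, Xb, Xs, Ss, delta=1e-8):
    """Recompute Gamma and the reweighted Laplacians L_i of Eq. (5)
    from the current projections."""
    R = Xb - W @ (W.T @ Xb)                                  # residuals (d, m)
    Gamma = np.diag(1.0 / (2.0 * np.sqrt(np.sum(R**2, axis=0) + delta)))
    Ls = []
    for W_i, X_i, S_i in zip(Ws, Xs, Ss):
        Y = W_i.T @ X_i
        sq = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)
        S_tilde = S_i / (2.0 * np.sqrt(sq + delta))          # s~_jk = theta_jk * s_jk
        D = np.diag(S_tilde.sum(axis=1))
        Ls.append(D - S_tilde)                               # L_i = D_i - S~_i
    return Gamma, Ls
```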

The objective in Eq. (5) can be solved by the Alternating Direction Method of Multipliers (ADMM) [2] that minimizes the following equivalent objective:

$$\begin{aligned} \mathcal { J }^{\text {ADMM}}_ { \ell _ { 2 } } ( \mathcal {W},\mathcal {P})&={\text {tr}} \left[ \left( \mathbf { X } - \mathbf { W } \mathbf { W }^ { T } \mathbf { X } \right) \mathbf {\Gamma } \left( \mathbf { X } - \mathbf { W } \mathbf { W }^ { T } \mathbf { X } \right) ^T \right] \nonumber \\&+ \alpha \sum _ { i = 1 } ^ { m } {\text {tr}} \left( \mathbf { W }_i^T \mathbf { X }_i \mathbf { L }_i \mathbf { X }_i^T \mathbf { W }_i\right) + \beta \sum _ { i = 1 } ^ { m } \left\| \mathbf { P } - \mathbf { P } _i \right\| _F^2 + \frac{\mu }{2} \left\| \mathbf { W } - \mathbf {P} + \frac{1}{\mu } \mathbf {\Lambda } \right\| _F^2 \nonumber \\&+ \sum _ { i = 1 } ^ { m } \frac{\mu }{2} \left\| \mathbf { W }_i - \mathbf {P}_i + \frac{1}{\mu } \mathbf {\Lambda }_i \right\| _F^2, \quad s.t. \quad \mathbf {P} ^ { T } \mathbf {P} = \mathbf { I }, \ \mathbf {P}_i ^ { T } \mathbf {P}_i = \mathbf { I }, \end{aligned}$$
(6)

where \(\mathcal {P}=\{\mathbf {P},\mathbf {P}_1,\mathbf {P}_2,\cdots ,\mathbf {P}_m\} \), \(\mathbf {\Lambda } \in \mathfrak {R}^{d \times r}\) is the Lagrangian multiplier for the constraint \( \mathbf { W } = \mathbf {P} \), and \(\mathbf {\Lambda }_i \in \mathfrak {R}^{d \times r}\) is the Lagrangian multiplier for the constraint \( \mathbf { W }_i = \mathbf {P}_i\). Algorithm 1 summarizes the solution to Eq. (6).
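For illustration, the two generic ADMM updates look as follows. This is a simplified sketch of our own (a single gradient step for the W-subproblem, with the \(\beta\) coupling between \(\mathbf{P}\) and the \(\mathbf{P}_i\) folded away), not the exact updates of [8]:

```python
import numpy as np

def nearest_orthonormal(M):
    """Solve min_P ||P - M||_F^2 s.t. P^T P = I via SVD (Procrustes)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def admm_iteration(W, P, Lam, grad_smooth, mu=1.0, lr=1e-2):
    """One simplified iteration for a single (W, P, Lam) triple in Eq. (6);
    grad_smooth(W) is the gradient of the smooth data terms w.r.t. W."""
    W = W - lr * (grad_smooth(W) + mu * (W - P) + Lam)  # W-subproblem (one step)
    P = nearest_orthonormal(W + Lam / mu)               # P-subproblem
    Lam = Lam + mu * (W - P)                            # dual ascent
    return W, P, Lam
```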

3 Experiments

Data used in our experiments were obtained from the ADNI database (adni.loni.usc.edu). We downloaded 1.5T MRI scans and the demographic information for 821 ADNI-1 participants. We performed voxel-based morphometry (VBM) on the MRI data following [12] and extracted mean modulated gray matter (GM) measures for 90 target regions of interest (ROI). These measures were adjusted for the baseline intracranial volume (ICV) using regression weights derived from the Healthy Control (HC) participants at baseline. We also downloaded the longitudinal scores of the participants on the Mini-Mental State Examination (MMSE) cognitive assessment. The time points examined in this study for both imaging biomarkers and cognitive assessments include baseline (BL), Month 6 (M6), Month 12 (M12), Month 18 (M18), Month 24 (M24), and Month 36 (M36). All participants used in our enriched biomarker representation study are required to have a BL MRI measurement, a BL cognitive score, and at least two available measures from M6/M12/M18/M24/M36. A total of 544 subjects are selected for our study, among which there are 92 AD samples, 205 Mild Cognitive Impairment (MCI) samples, and 247 HC samples.
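One common way to carry out such an ICV adjustment is sketched below; the exact regression recipe used in the ADNI preprocessing is an assumption on our part:

```python
import numpy as np

def adjust_for_icv(gm, icv, hc_mask):
    """Regress the ICV effect out of each ROI measure, with slopes
    estimated on healthy controls (HC) only.
    gm: (n, p) GM measures; icv: (n,); hc_mask: boolean (n,)."""
    adjusted = gm.copy()
    icv_centered = icv - icv[hc_mask].mean()
    for j in range(gm.shape[1]):
        slope = np.polyfit(icv[hc_mask], gm[hc_mask, j], deg=1)[0]
        adjusted[:, j] = gm[:, j] - slope * icv_centered
    return adjusted
```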

Algorithm 1. The iterative procedure that solves Eq. (6).

Experiment Settings. To validate the usefulness of our proposed method, we compare the performance in predicting cognitive outcomes using two types of neuroimaging inputs—the learned enriched representation and the BL biomarker measurement. In our experiments, several methods proven to generalize well are leveraged: ridge regression (RR), Lasso, support vector regression (SVR), and convolutional neural networks (CNN). For the RR, Lasso, and SVR models, we conduct a standard 5-fold cross-validation to fine-tune the parameters and compute the root mean square error (RMSE) between the predicted and ground-truth values of the cognitive scores on the testing data. For the SVR model, the linear kernel is used, and its box constraint parameter is likewise fine-tuned by a grid search in a standard 5-fold cross-validation. For the CNN regression model, we randomly select 70% of the neuroimaging measurements as the training set, 20% as the validation set, and the remaining 10% as the testing set; the evaluation metrics are reported on the testing set. We construct a two-layer convolutional architecture for cognitive outcome prediction, and the dropout technique [13] is leveraged to reduce overfitting in the CNN model and prevent complex co-adaptations on the training data. For the model parameters, \(\alpha \) and \(\beta \) are fine-tuned by searching the grid \(\{10^{-5},\cdots ,10^{-1},1,10,\cdots ,10^{5}\}\).
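A sketch of the tuning protocol for the linear models; the data arrays here are random placeholders standing in for the enriched features and MMSE targets:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Placeholder data (illustrative only): enriched features and MMSE targets.
rng = np.random.default_rng(2)
Y_train, Y_test = rng.standard_normal((435, 10)), rng.standard_normal((109, 10))
mmse_train, mmse_test = rng.standard_normal(435), rng.standard_normal(109)

grid = [10.0 ** k for k in range(-5, 6)]              # 1e-5 ... 1e5
models = {
    "RR":    GridSearchCV(Ridge(), {"alpha": grid}, cv=KFold(5)),
    "Lasso": GridSearchCV(Lasso(max_iter=10000), {"alpha": grid}, cv=KFold(5)),
    "SVR":   GridSearchCV(SVR(kernel="linear"), {"C": grid}, cv=KFold(5)),
}
for name, m in models.items():
    m.fit(Y_train, mmse_train)                        # 5-fold CV grid search
    rmse = mean_squared_error(mmse_test, m.predict(Y_test)) ** 0.5
    print(name, "RMSE:", rmse)
```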

Experiment Results. From Table 1, we can see that the proposed enriched neuroimaging representation is consistently better than the baseline representation across the four different methods, RR, Lasso, SVR, and CNN. This can be attributed to the following reasons. First, the original baseline neuroimaging biomarker representation only contains the information of a participant at a single time point, and hence does not benefit from the correlations across different measures over time. Instead, our proposed enriched neuroimaging biomarker representation captures not only the baseline measurement but also the temporal information conveyed by the longitudinal biomarkers over AD progression; that is, it integrates the neuroimaging measurement at a fixed time point with the dynamic temporal changes. As AD is a progressively degenerative disease, incorporating information from subjects’ follow-up visits benefits the prediction model. Second, the original baseline neuroimaging measurements exhibit high dimensionality and can be redundant and noisy, so traditional methods easily suffer from “the curse of dimensionality”. Via the projection, we map the baseline measurement into a low-dimensional space, thereby mitigating this issue.

Table 1. RMSE values (smaller is better) when predicting MMSE score using the VBM biomarkers.
Fig. 2. Frequency map of the regions of interest of the brain.

Apart from the cognitive outcome prediction task, another primary goal of our regression analysis is to identify a subset of imaging biomarkers that are highly correlated with AD progression. Therefore, we examine the imaging biomarkers identified for each participant by the proposed method. Summarizing all the patients’ projections allows us to determine the most important biomarkers across patients: we select the 10 most frequently appearing regions of interest from each patient’s projection coefficients to construct a global frequency map, shown in Fig. 2. The bilateral hippocampus, amygdala, fusiform, and putamen regions are found among the top biomarkers selected by our model, which accords well with established clinical knowledge [5, 11].
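Our reading of this counting procedure is sketched below; the per-ROI importance score (the row norm of each \(\mathbf{W}_i\)) is an assumption, as the text does not specify how coefficients are ranked:

```python
import numpy as np

def roi_frequency_map(Ws, top_k=10):
    """Count, across participants, how often each ROI appears among the
    top_k ROIs by projection-coefficient magnitude in W_i (d x r)."""
    counts = np.zeros(Ws[0].shape[0], dtype=int)
    for W_i in Ws:
        importance = np.linalg.norm(W_i, axis=1)    # per-ROI row norm
        counts[np.argsort(importance)[-top_k:]] += 1
    return counts          # larger count = more frequently selected ROI
```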

4 Conclusion

Missing data is a critical challenge in longitudinal AD studies. In this paper, we proposed a formulation to learn a consistent-length representation for all the participants in the ADNI dataset. The enriched fixed-length biomarker representation simultaneously captures the global consistency from the baseline measurements and the local pairwise patterns from the available follow-up measurements of each participant. Our results show that the learned enriched representation outperforms the baseline measurement in predicting clinical scores. Furthermore, the identified biomarkers are highly suggestive and strongly agree with existing research findings, which supports the validity of our approach.