1 Introduction

This paper describes our method submitted to the ABCD Neurocognitive Prediction Challenge 2019. The task of the challenge is to predict fluid intelligence solely from structural T1-weighted magnetic resonance images (MRI). The challenge uses data from the Adolescent Brain Cognitive Development (ABCD) Study.

In this approach, we first extract features from the MRI scans and then predict fluid intelligence with an automated machine learning (AutoML) approach. For the feature extraction, we use the volume measurements provided by the challenge’s organizers. We rely on AutoML because determining a good machine learning pipeline is a tedious and error-prone task for humans. A typical ML pipeline includes various types of preprocessing that can be applied to the input features. Afterwards, an appropriate model needs to be selected and its hyper-parameters tuned to achieve high predictive performance. The goal of AutoML is to automate the whole machine learning pipeline. A recent overview of AutoML approaches, together with an analysis of the results of the ChaLearn AutoML Challenges over the last four years, is given in [5]. AutoML has not yet been widely explored in the medical field, with PubMed listing only four articles [1, 7, 10, 14], none of which study MRI or neuroscience.

2 Data

Data was provided by the Adolescent Brain Cognitive Development (ABCD) Study [13], which recruited children aged 9–10. Challenge participants were given access to T1-weighted MRI scans from 3,736 children for training, 415 children for validation, and 4,402 children for testing. Fluid intelligence scores were residualized to account for confounding due to sex at birth, ethnicity, highest parental education, parental income, parental marital status, and image acquisition site. Residualized fluid intelligence scores were provided for the training and validation data, but not for the test data. All data was obtained from the National Institute of Mental Health Data Archive.

3 Methods

Our proposed pipeline for the prediction of fluid intelligence from T1-weighted MRI scans builds on the Automated Machine Learning (AutoML) framework summarized in Fig. 1. Scans were acquired according to the acquisition protocol of the Adolescent Brain Cognitive Development (ABCD) study. For the parcellation of the brain and the estimation of the volume of each region of interest, we relied on the work of the challenge’s organizers.

Fig. 1. Overview of our proposed AutoML pipeline for the prediction of fluid intelligence from T1-weighted MRI scans.

3.1 Feature Preprocessing

We used volume measurements of 122 regions of interest, extracted by the challenge’s organizers from each T1-weighted MRI scan based on the SRI24 atlas [15]. We normalized all volume measurements while accounting for outliers by subtracting the median and dividing by the range between the 5th and 95th percentiles. Thus, we reduce the impact of outliers and still obtain approximately centered features with equal scale. Finally, the provided residualized fluid intelligence scores in the training data were standardized to zero mean and unit variance; the same transformations derived from the training data were applied to the features and scores of the validation and test data. Additional preprocessing steps were selected without human interaction, as described in the next section.
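The normalization above can be sketched as follows; this is a minimal illustration with NumPy, and the function names are ours, not taken from the challenge code:

```python
import numpy as np

def robust_scale_fit(X):
    """Estimate per-feature center and scale on the training data:
    the median, and the range between the 5th and 95th percentiles."""
    center = np.median(X, axis=0)
    p5, p95 = np.percentile(X, [5, 95], axis=0)
    return center, p95 - p5

def robust_scale_apply(X, center, scale):
    """Subtract the training median and divide by the percentile range."""
    return (X - center) / scale

# Fit on the training volumes only; re-use the same statistics
# for the validation and test data.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=10.0, size=(200, 5))
center, scale = robust_scale_fit(X_train)
X_scaled = robust_scale_apply(X_train, center, scale)
```

Fitting the statistics on the training set alone avoids leaking information from the validation and test sets into the preprocessing.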

3.2 Automated Machine Learning

For the prediction of the residualized fluid intelligence score, we used automated machine learning that leverages recent advances in Bayesian optimization, meta-learning, and ensemble construction. For every machine learning task, the fundamental problem is to decide which machine learning algorithm to use and whether and how to pre-process features. This task is extremely challenging, because there is no single algorithm that performs best on all datasets, and the performance of machine learning methods depends to a large extent on their hyper-parameter settings, which can vary from one task to the next. Here, we use AutoML to produce test set predictions of the residualized fluid intelligence score without human input within a given computational budget. Specifically, we employ Combined Algorithm Selection and Hyperparameter optimization (CASH) [3].

Let \(\mathcal {A} = \{ A^{(1)}, \ldots , A^{(R)} \}\) be a set of machine learning algorithms, and \(\varLambda ^{(j)}\) be the domain of the hyper-parameters of each algorithm. Further, we define \(\mathcal {D}_\text {train} = \{ (\mathbf {x}_1, y_1), \ldots , (\mathbf {x}_n, y_n) \}\) to be the training set, which we split into K cross-validation folds to obtain \(\{ \mathcal {D}_\text {train}^{(1)}, \ldots , \mathcal {D}_\text {train}^{(K)} \}\) and \(\{ \mathcal {D}_\text {valid}^{(1)}, \ldots , \mathcal {D}_\text {valid}^{(K)} \}\) with \(\mathcal {D}_\text {train}^{(k)} = \mathcal {D}_\text {train} \backslash \mathcal {D}_\text {valid}^{(k)}\). We then solve the CASH optimization problem

$$\begin{aligned} \mathop {{{\,\mathrm{\arg \!\min }\,}}}\limits _{A^{(j)} \in \mathcal {A}, \varvec{\varTheta } \in \varLambda ^{(j)}}\quad \frac{1}{K} \sum _{k=1}^K \frac{1}{|\mathcal {D}_\text {valid}^{(k)}|} \sum _{i=1}^{|\mathcal {D}_\text {valid}^{(k)}|} \left( y_i - \hat{f}_{A_{\varvec{\varTheta }}^{(j)}}(\mathbf {x}_i\,|\,\mathcal {D}_\text {train}^{(k)}) \right) ^2 \end{aligned}$$
(1)

where \(\hat{f}_{A_{\varvec{\varTheta }}^{(j)}}(\mathbf {x}_i\,|\,\mathcal {D}_\text {train}^{(k)})\) denotes the prediction on the validation set of model \(A^{(j)}\) with hyper-parameters \(\varvec{\varTheta }\) trained on \(\mathcal {D}_\text {train}^{(k)}\). This optimization problem can be solved via Sequential Model-based Algorithm Configuration (SMAC), a technique for Bayesian black-box optimization that uses a random-forest-based surrogate model [6]. The main idea of SMAC is to use the surrogate model to predict an algorithm’s performance for a given hyper-parameter configuration. The surrogate can interpolate from observed hyper-parameter configurations to previously unseen regions of the hyper-parameter space, which enables the search to focus on promising configurations.
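For intuition, the CASH problem in Eq. (1) can be illustrated with a toy stand-in: a random search over two candidate algorithms (closed-form ridge regression and a k-nearest-neighbour regressor), each with its own hyper-parameter domain, scored by K-fold cross-validated mean squared error. SMAC replaces the random sampling with a surrogate-guided search; everything below is an illustrative sketch, not the actual challenge pipeline.

```python
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_va, alpha):
    """Closed-form ridge regression with regularization strength alpha."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ y_tr)
    return X_va @ w

def knn_fit_predict(X_tr, y_tr, X_va, k):
    """k-nearest-neighbour regression: mean of the k closest training targets."""
    dists = np.linalg.norm(X_va[:, None, :] - X_tr[None, :, :], axis=2)
    idx = np.argsort(dists, axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

def cv_mse(fit_predict, X, y, param, K=5):
    """The objective of Eq. (1): average validation MSE over K folds."""
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        va = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        pred = fit_predict(X[tr], y[tr], X[va], param)
        errs.append(np.mean((y[va] - pred) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=120)

# Random search over (algorithm, hyper-parameter) pairs.
candidates = [(ridge_fit_predict, 10.0 ** rng.uniform(-3, 2)) for _ in range(10)]
candidates += [(knn_fit_predict, int(rng.integers(1, 20))) for _ in range(10)]
best = min(candidates, key=lambda c: cv_mse(c[0], X, y, c[1]))
```

The winner of this search jointly fixes the algorithm and its hyper-parameters, which is exactly the pair the arg min in Eq. (1) ranges over.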

We employed the auto-sklearn toolkit (version 0.5.0), which, for a user-provided computational budget in terms of run time and memory, searches for the best machine learning pipeline to predict the residualized fluid intelligence score by combining components of the scikit-learn machine learning framework (version 0.18.2) [12]. Figure 1 depicts an overview of the AutoML framework. For data preprocessing, AutoML can choose from 11 algorithms for data transformations, such as principal component analysis. For feature preprocessing, 6 feature-wise transformations are available, such as transforming each feature to have zero mean and unit variance. Finally, AutoML can choose from 13 regression models. After evaluating various machine learning pipelines, each comprising a data transformation, a feature transformation, and a regression model, the best M pipelines are combined via ensemble selection [2] to form the final prediction model. We used a budget that consisted of a total run time of 40 h, where each pipeline was limited to 6 min and 4 GB of memory. The final ensemble size was \(M=50\).
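Our auto-sklearn setup with this budget can be sketched roughly as follows. This is a configuration sketch based on the auto-sklearn 0.5 API, not the exact challenge code; `X_train` and `y_train` stand for the normalized volume features and standardized scores from Sect. 3.1:

```python
import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=40 * 60 * 60,  # total budget: 40 h
    per_run_time_limit=6 * 60,             # 6 min per pipeline
    ml_memory_limit=4096,                  # 4 GB per pipeline (in MB)
    ensemble_size=50,                      # M = 50 pipelines in the final ensemble
)
automl.fit(X_train, y_train)
y_pred = automl.predict(X_test)
```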

3.3 Feature Importance

While complex prediction pipelines are potentially powerful, their black-box nature is often a barrier to employing such a model in clinical research. We use Shapley values to explain the predictions of our final ensemble of prediction pipelines. Shapley values are a classic solution in game theory for distributing credit among players participating in a cooperative game [16, 17]. They were first proposed as a feature importance measure for linear models in the presence of multicollinearity [8]. A Shapley value assigns an importance value \(\phi _j\) to each feature j that reflects its effect on the model’s prediction. To compute this effect, the model \(f(\cdot )\) needs to be retrained on all possible feature subsets \(\mathcal {S} \subseteq \mathcal {F} \backslash \{j\}\) of the set of all features \(\mathcal {F}\). Given a feature vector \(\mathbf {x} \in \mathbb {R}^{|\mathcal {F}|}\), the j-th Shapley value can then be computed as the weighted average of all prediction differences:

$$\begin{aligned} \phi _j(\mathbf {x}) = \sum _{\mathcal {S} \subseteq \mathcal {F} \backslash \{j\}} \frac{|\mathcal {S}|!(|\mathcal {F}|-|\mathcal {S}|-1)!}{|\mathcal {F}|!} \left( \hat{f}_{\mathcal {S} \cup \{j\}}( \mathbf {x}^{\mathcal {S} \cup \{j\}} ) - \hat{f}_\mathcal {S}( \mathbf {x}^{\mathcal {S}} ) \right) , \end{aligned}$$
(2)

where \(\hat{f}_\mathcal {S}( \mathbf {x}^{\mathcal {S}} )\) denotes the prediction of a model trained and evaluated on the feature subset \(\mathcal {S}\). The exact computation of Shapley values requires evaluating all \(2^{|\mathcal {F}|}\) possible feature subsets, which is only feasible when the data consist of no more than a few dozen features. To address this problem, we employ the recently proposed SHAP (SHapley Additive exPlanations) values, which belong to the class of additive feature importance measures [9]. Since the exact computation of SHAP values is likewise prohibitive, we approximate them using the model-agnostic KernelSHAP approach proposed in [9]. To obtain a global measure of feature importance, we compute the average magnitude of SHAP values across all N subjects in the data:

$$\begin{aligned} \bar{\phi }_j = \frac{1}{N} \sum _{i=1}^N | \phi _j(\mathbf {x}_i)|. \end{aligned}$$
(3)
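To make Eq. (2) concrete, the sketch below computes exact Shapley values for a toy additive model with three features, where the model "retrained" on a subset \(\mathcal {S}\) simply drops the missing terms. This is a simplifying assumption made for illustration; in general, each subset requires genuinely retraining the model.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, n_features):
    """Exact Shapley values via Eq. (2): f(S, x) is the prediction of the
    model 'retrained' on feature subset S, evaluated on x."""
    features = range(n_features)
    phi = [0.0] * n_features
    for j in features:
        rest = [k for k in features if k != j]
        for size in range(len(rest) + 1):
            for S in combinations(rest, size):
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                phi[j] += weight * (f(set(S) | {j}, x) - f(set(S), x))
    return phi

# Toy additive model: the subset version just drops the missing terms.
weights = [0.5, -2.0, 1.0]
def f(S, x):
    return sum(weights[k] * x[k] for k in S)

x = [1.0, 2.0, -1.0]
phi = exact_shapley(f, x, 3)  # per-feature attributions for this subject
```

For an additive model, \(\phi _j\) reduces to the j-th term itself, and the attributions sum to the difference between the full model's prediction and the empty model's (the efficiency property of Shapley values). Averaging \(|\phi _j(\mathbf {x}_i)|\) over subjects then yields the global importance of Eq. (3).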

4 Results

The performance of the final ensemble is summarized in Table 1. It reveals that predicting residualized fluid intelligence from MRI-derived volume measurements is a challenging task. In particular, the proposed model struggles to reliably predict residualized fluid intelligence at the extremes of the distribution, i.e., very low or very high values. Consequently, we observe a relatively high mean squared error, which is an order of magnitude larger than the mean absolute error. Moreover, the large difference between the performance on the training data and the validation data indicates that overfitting seems to be an issue.

Table 1. Performance on training, validation and test set. MSE: mean squared error. MAE: mean absolute error.

In total, we evaluated 2,608 machine learning pipelines (see Table 2). The components of our final ensemble of 50 machine learning pipelines are summarized in Table 3. Principal component analysis [11] was selected most often (15 times) for data preprocessing. The final ensemble comprised linear and non-linear regression models, with ensembles of randomized regression trees [4] being selected most frequently (14 times). Looking at the top-performing pipelines in the ensemble, we noticed that combining principal component analysis with a tree-based ensemble was a frequent choice (5 of the top 10 performing pipelines).

Table 2. Summary of evaluated machine learning pipelines.
Table 3. Overview of selected components in the final ensemble of \(M=50\) pipelines selected by AutoML. Each pipeline consists of one data preprocessing step, one feature preprocessing step, and one regressor.
Fig. 2. (a) Top 20 features sorted by mean absolute SHAP value \(\bar{\phi }_j\). (b) SHAP values of the top 20 features for each subject in the training data. In each row, SHAP values \(\phi _j\) for each subject are plotted horizontally, stacking vertically to avoid overlap. Each dot is colored by the value of that feature, from low (blue) to high (red). If the impact of a feature on the model’s prediction varies smoothly as its value changes, this coloring will also appear smooth. (Color figure online)

Next, we inspected which MRI-derived features the model deems most important by computing SHAP values for each feature and subject in the training data. Figure 2 lists the top 20 features by mean absolute SHAP value \(\bar{\phi }_j\). The top-ranked feature is pons white matter volume (\(\bar{\phi } = 0.0183\)), followed by left parahippocampal gyrus volume (\(\bar{\phi } = 0.0155\)) and left lateral ventricle cerebrospinal fluid volume (\(\bar{\phi } = 0.0148\)). However, we note that individual SHAP values are rather small, which suggests that fluid intelligence is not strongly influenced by any single brain region, but by a complex inter-relationship between different regions. The individual, subject-specific SHAP values depicted in Fig. 2b indicate that larger left and right parahippocampal gyrus volumes are associated with a decrease in predicted fluid intelligence, while larger pons white matter volume is associated with an increase.

5 Conclusion

We proposed an AutoML model for the prediction of fluid intelligence from T1-weighted magnetic resonance images based on more than 2,600 evaluated machine learning pipelines. Our experiments demonstrate that it is challenging for our ensemble to reliably predict fluid intelligence from MRI scans. In particular, errors on the validation and test data were more than four times higher than on the training data, which is evidence for overfitting. We analyzed the final model’s predictions using SHAP values. Results revealed that top ranked features still explain only a small fraction of the fluid intelligence score. Therefore, we concluded that current features derived from MRI are insufficient to robustly measure fluid intelligence. While current features are generic descriptors of the brain anatomy, we believe future research should focus on deriving tailor-made features from MRI, specific to the prediction of fluid intelligence, which could then be used to improve our understanding of the neurobiology underlying fluid intelligence.