
1 Introduction

Human brain function can be assessed non-invasively using two MR techniques that offer whole-brain coverage and relatively high spatial resolution: blood-oxygen-level-dependent (BOLD) fMRI [10] and arterial spin labeling (ASL) perfusion MRI [2]. BOLD fMRI is more widely used, offering high temporal and spatial resolution, but it provides only relative values; it is sensitive to low-frequency drift and suffers from susceptibility gradient-induced artifacts. By contrast, ASL MRI measures cerebral blood flow (CBF) in the physical unit of ml/100 g/min, and its quantitative nature makes it insensitive to low-frequency drift. Because it measures signal from the capillary bed, ASL MRI is potentially more accurate for localizing functional activation than BOLD fMRI, whose signal is often dominated by oxygen-level changes in venous vessels rather than at the activation site. However, ASL MRI has a lower signal-to-noise ratio (SNR) and lower temporal resolution, and has only gained increasing visibility in recent years. An important but still open question is how to fuse the benefits of these two complementary functional imaging modalities. Since many completed and ongoing large-scale fMRI projects acquired, or are acquiring, only BOLD fMRI, a related question is whether CBF signal can be reliably extracted from BOLD fMRI. Solving both questions requires understanding the elusive relationship between the two imaging modalities.

Theoretically, ASL MRI works by magnetically labeling arterial blood water as an endogenous tracer using radio-frequency (RF) pulses [2]. The perfusion-weighted MR image is acquired after the labeled spins reach the imaging plane. To remove the background signal, a control image is also acquired with the same ASL imaging sequence, but modulated to avoid labeling the arterial blood so that the background signal under the influence of RF pulses is the same as in the labeling condition. The perfusion-weighted signal is then extracted from the difference between the label (L) image and the control (C) image and converted into a quantitative CBF measure using an appropriate compartment model [1]. Because of the short longitudinal relaxation time (T1) of blood water and the post-labeling transit process, only a small portion of tissue water can be labeled, resulting in a low SNR [15]. Thus, ASL often acquires many L/C pairs to improve the SNR of the mean perfusion map. In practice, a typical 3–6 min scan allows 10–50 L/C pairs, which provides only minor to moderate SNR improvement through averaging across this limited number of measurements. The interleaved labeling and non-labeling procedure halves the temporal resolution of ASL MRI compared with regular dynamic MR imaging, and the relatively long labeling and post-labeling delay time before data acquisition reduces it further. Ideally, these drawbacks could be avoided entirely if CBF could be extracted from BOLD fMRI. From a technical point of view, ASL MRI can be acquired with many different imaging sequences; indeed, the gradient-echo-weighted BOLD imaging sequence is still widely used to acquire ASL MRI data. It is therefore theoretically reasonable to hypothesize that CBF can be extracted from BOLD fMRI. The challenge is to find an appropriate model for the unknown BOLD-CBF relationship.

A canonical BOLD-CBF model has been proposed in [4], but it requires data acquired under a gas challenge, and its underlying assumption that the gas challenge does not change the cerebral metabolic rate of oxygen may be inaccurate. Without extra experiments, there is no analytic way to extract quantitative CBF from BOLD fMRI. Alternatively, a learning-based approach might solve this problem: over the years, machine learning, and especially deep learning, has achieved astonishing success in modeling various highly complex data relationships [7].

Deep learning (DL) is motivated by the hierarchical learning in the visual system [3]. The most widely used deep neural networks, often called convolutional neural networks (CNNs), consist of multiple layers of receptive-field-constrained local filters trained layer by layer through error backpropagation [6]. The local feature extraction, hierarchical abstraction, and step-wise backpropagation of CNNs, together with training strategies such as weight dropout, batch normalization, skip connections, and residual learning, make CNNs highly flexible and capable of modeling nonlinear functions buried in large datasets. Because medical image processing is often hindered by unknown nonlinear processes or transforms, DL may provide a versatile tool for it, as increasingly demonstrated in a variety of applications including image segmentation [11] and image reconstruction [13]. Specific to ASL MRI, DL has been adopted to improve the SNR of ASL CBF maps [5, 16]. Most relevant to this study, Xie et al. [17] piloted pairwise label-to-control image prediction using a CNN. Since the ASL MRI used in their so-called super-ASL network was acquired with a gradient-echo-weighted BOLD fMRI sequence, this suggests the feasibility of directly extracting CBF from BOLD fMRI.

The purpose of this study was to build and validate a DL-based BOLD-ASL relationship learning model that predicts CBF signal directly from BOLD fMRI; we dubbed the network BOA-Net. Different from the super-ASL work, we used concurrent ASL MRI and BOLD fMRI acquired with a dual-echo ASL MRI sequence [12], so the network does not need to account for physiological or signal-drift-induced differences between the BOLD fMRI and the ASL CBF. Another contribution is a new CNN architecture based on dilated convolution [19] and the wide activation residual block [20].

2 Methods

2.1 Problem Formulation

Denote the CBF image generated by the i-th L/C pair by \(y_i\) and the BOLD image (the 2nd echo) acquired after the i-th C image by \(x_i\). Given the same brain structure and the short acquisition interval, we want to build a parametric regression model \(f_{\varTheta }\) that learns the mapping \(f_{\varTheta }(x_i)\rightarrow y_i\), where \(i = 1,2,...,N\) and N is the total number of one subject's CBF maps. \(\varTheta \) denotes the parameters of the model, adjusted through the training process. The model, typically a CNN, can be learned by minimizing the loss function \(\sum _i L(f_{\varTheta }(x_i), y_i)\), where L can be either the mean squared error or the mean absolute error between the prediction and the reference.
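As a toy illustration of this objective (not the actual BOA-Net, whose \(f_{\varTheta }\) is a CNN), the following sketch evaluates the summed per-pair L1 loss for a hypothetical per-voxel affine model on synthetic 64 \(\times \) 64 images:

```python
import numpy as np

# Toy illustration of the training objective sum_i L(f_Theta(x_i), y_i).
# f_theta below is a hypothetical per-voxel affine model standing in for the CNN.

def f_theta(x, theta):
    """Apply a per-voxel affine model: scale by w, shift by b."""
    w, b = theta
    return w * x + b

def total_l1_loss(theta, xs, ys):
    """Sum over all N pairs of the mean absolute error (L1 loss)."""
    return sum(np.abs(f_theta(x, theta) - y).mean() for x, y in zip(xs, ys))

rng = np.random.default_rng(0)
xs = [rng.random((64, 64)) for _ in range(5)]   # synthetic BOLD images x_i
ys = [2.0 * x + 1.0 for x in xs]                # synthetic CBF references y_i
print(total_l1_loss((2.0, 1.0), xs, ys))        # the true parameters give loss 0.0
```

Training amounts to searching over \(\varTheta \) (here, the pair (w, b)) for the values that drive this summed loss toward its minimum.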

Since we do not have gold-standard CBF maps to use as training references, using the low-SNR ASL CBF images as references could result in an inaccurate BOA-Net. Interestingly, a recent study [8] showed that training with noisy references does not necessarily yield an inaccurate model. Inspired by that work, we propose a noisy-reference-based BOA-Net. Instead of the L2 norm, we chose the L1 norm as the loss function to reduce sensitivity to outliers, which are common in ASL MRI [9].
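The robustness argument can be seen in a small numerical sketch (toy voxel values, assumed purely for illustration): for a constant predictor, the L2 loss is minimized by the mean of the targets, which a single outlier drags far away, while the L1 loss is minimized by the median, which barely moves:

```python
import numpy as np

# Toy CBF-like target values with one outlier voxel (assumed values).
targets = np.array([48.0, 50.0, 52.0, 49.0, 51.0, 300.0])

# For a constant predictor c, the L2 loss sum (c - t_i)^2 is minimized by the
# mean, while the L1 loss sum |c - t_i| is minimized by the median.
l2_optimal = targets.mean()      # pulled far toward the outlier
l1_optimal = np.median(targets)  # barely affected by the outlier

print(l2_optimal)  # about 91.67: badly biased by the single outlier
print(l1_optimal)  # 50.5: close to the typical value
```

The same mean-versus-median behavior carries over voxel-wise to image regression, which is why the L1 norm is the safer choice when the references contain ASL outliers.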

Fig. 1.

A schematic illustration of the proposed DWAN network (left) and the wide activation residual block (right). All residual blocks in DWAN are wide activation residual blocks. The numbers in the figure give the number of feature maps at the corresponding positions.

2.2 Network Architecture

Figure 1 shows the architecture of the DWAN used in BOA-Net and the wide activation residual block. The two-path DilatedNet [5] was used to extract both local and global contextual features, and wide activation residual blocks were adapted to expand data features and pass more information through the network [20]. In DWAN, each pathway contains 4 wide activation residual blocks. Inside each block, the first convolution layer expands the number of input feature maps by a factor of 4; after a ReLU layer, the following convolution layer shrinks the number of feature maps back to the input size. The pathways differ in that the first convolution layer of each of the 4 wide activation residual blocks in the global pathway uses a dilation rate of 2, 4, 8, and 16, respectively. The convolution kernel size was \(3\times 3\). A \(3\times 3\) convolution link [20] from the input layer to the output layer implements the residual learning of DWAN.
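The expand-activate-shrink-add pattern of a wide activation residual block can be sketched in NumPy as follows; to keep the example self-contained, per-pixel (1 \(\times \) 1) linear maps stand in for the 3 \(\times \) 3 convolutions, and the weights are random placeholders rather than learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def wide_activation_block(x, expand=4):
    """Sketch of a wide activation residual block.

    The channel count expands by `expand` (4, as in DWAN), passes through a
    ReLU, shrinks back to the input size, and is added to the input (residual
    connection). 1x1 per-pixel maps stand in for the 3x3 convolutions, and the
    weights are random placeholders, not trained values.
    """
    h, w, c = x.shape
    w1 = rng.standard_normal((c, c * expand)) * 0.1  # expansion "conv" weights
    w2 = rng.standard_normal((c * expand, c)) * 0.1  # shrinking "conv" weights
    out = np.maximum(x @ w1, 0.0)                    # expand channels + ReLU
    out = out @ w2                                   # shrink back to c channels
    return x + out                                   # residual addition

x = rng.standard_normal((64, 64, 32))
y = wide_activation_block(x)
print(y.shape)  # (64, 64, 32): output shape matches the input
```

Because the block's output shape equals its input shape, four such blocks can be chained inside each pathway, with the global pathway additionally dilating its first convolutions.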

2.3 Data Preparation and Model Training

ASL and BOLD fMRI data were acquired with the dual-echo ASL sequence [12] from 50 young healthy subjects at Hangzhou Normal University, each of whom provided signed written informed consent. The experiment and the consent form were approved by the local IRB. Imaging parameters were: labeling time/delay time/TR/TE1/TE2 = 1.4 s/1.5 s/4.5 s/11.7 ms/68 ms, 90 acquisitions (90 BOLD images and 45 C/L image pairs), FOV = 22 cm, matrix = \(64\times 64\), 16 slices with a thickness of 5 mm plus a 1 mm gap. We used ASLtbx [14] to preprocess the ASL images with the procedures described in [9].

The BOA-Net was trained on CBF maps (input and reference) from 23 subjects; data from 4 other subjects were used for validation, and the remaining 23 subjects' CBF maps were used for testing. For each subject, we extracted slices 7 to 11 of the 3D ASL CBF maps, giving a total of \(27 \times 5 \times 45 = 6075\) 2D CBF maps for training and validation. Each 2D CBF map was 64 \(\times \) 64 pixels. U-Net [18] and DilatedNet [5], two popular CNN structures widely used in medical imaging, were implemented for comparison with our DWAN-based BOA-Net.

We also compared the effects of training with smoothed versus non-smoothed CBF maps: CBF maps generated from L/C pairs with Gaussian smoothing are called smoothed CBFs, and those generated without it are called non-smoothed CBFs. The suffixes 'sm' and 'nsm' were appended to each model's name to indicate training with smoothed or non-smoothed CBFs, respectively. We used peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) to quantitatively compare the performance of DWAN with U-Net and DilatedNet. When calculating PSNR and SSIM, all predicted results were compared with the genuine mean CBF maps from smoothed ASL data.
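PSNR, for instance, depends only on the mean squared error and the data range of the reference; a minimal sketch (using a synthetic gradient image as a stand-in for a mean CBF map) is:

```python
import numpy as np

def psnr(pred, ref):
    """Peak signal-to-noise ratio in dB, with the data range taken from the
    reference image (one common convention; implementations vary)."""
    mse = np.mean((pred - ref) ** 2)
    data_range = ref.max() - ref.min()
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.tile(np.linspace(0.0, 100.0, 64), (64, 1))  # synthetic reference map
pred = ref + 1.0                                     # constant 1-unit error
print(round(psnr(pred, ref), 2))  # 40.0 dB: range 100, MSE 1
```

SSIM additionally compares local luminance, contrast, and structure statistics; in practice a library implementation (e.g. scikit-image's `structural_similarity`) is used rather than hand-rolled code.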

All networks were implemented with Keras and TensorFlow. The networks were trained using the adaptive moment estimation (ADAM) algorithm with a base learning rate of 0.001 and a batch size of 64. All experiments were performed on a PC with an Intel(R) Core(TM) i7-5820k CPU @3.30 GHz and an Nvidia GeForce Titan Xp GPU.
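For reference, a single ADAM parameter update with the base learning rate of 0.001 can be sketched as follows (toy scalar parameter and gradient; the actual training used Keras's built-in optimizer with its default moment decay rates):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update using bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad             # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2        # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)                # bias correction (t = step index)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
print(theta)  # ~0.999: the first step moves by ~lr regardless of gradient scale
```

The per-parameter normalization by \(\sqrt{\hat{v}}\) is what makes a single base learning rate workable across all of a network's layers.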

We used SNR to measure the image quality of ASL CBF, calculated as the mean signal of a grey matter (GM) region-of-interest (ROI) divided by the standard deviation of a white matter (WM) ROI in slice 9. The similarity of the mean CBF from the BOA-Net outputs to the genuine mean CBF maps from ASL data was evaluated by the correlation coefficient of CBF values across all test subjects (n = 23), computed at each voxel for BOA-Net_sm and BOA-Net_nsm separately. The correlation coefficient maps were thresholded at r > 0.3 for comparison and display.
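This SNR measure reduces to a ratio of two ROI statistics; a minimal sketch on a synthetic slice (toy masks and values, not the actual GM/WM segmentation) is:

```python
import numpy as np

def roi_snr(cbf_slice, gm_mask, wm_mask):
    """SNR = mean signal in the grey matter ROI / std of the white matter ROI."""
    return cbf_slice[gm_mask].mean() / cbf_slice[wm_mask].std()

# Synthetic 4x4 "slice": top half grey matter, bottom half white matter.
cbf = np.zeros((4, 4))
gm_mask = np.zeros((4, 4), dtype=bool); gm_mask[:2, :] = True
wm_mask = np.zeros((4, 4), dtype=bool); wm_mask[2:, :] = True
cbf[gm_mask] = 60.0                        # GM CBF ~60 ml/100 g/min (toy value)
cbf[wm_mask] = np.tile([19.0, 21.0], 4)    # WM values alternating, std = 1.0
print(roi_snr(cbf, gm_mask, wm_mask))      # 60.0
```

With this definition, suppressing WM fluctuation (the denominator) raises SNR even when the GM mean is unchanged, which is exactly the effect denoising is expected to have.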

3 Results

Figure 2 shows the results of BOLD-based CBF prediction for one representative subject. Compared with the genuine mean CBF map from the acquired ASL MRI, the CBF map produced by BOA-Net showed substantially improved quality in terms of suppressed noise and better perfusion contrast between tissues. Moreover, BOA-Net recovered CBF signals near the air-brain boundaries. The signal loss of the genuine mean CBF in the prefrontal region was caused by signal loss in the BOLD images.

Fig. 2.

From left to right: A. BOLD fMRI (input to BOA-Net_sm and BOA-Net_nsm); B. mean CBF maps from 45 smoothed L/C pairs; C. mean CBF maps from the output of BOA-Net_sm; D. CBF map from one smoothed L/C pair (reference for BOA-Net_sm); E. mean CBF maps from 45 non-smoothed L/C pairs; F. mean CBF maps from the outputs of BOA-Net_nsm; G. CBF map from one non-smoothed L/C pair (reference for BOA-Net_nsm). From top to bottom: slices 8, 9, 10, and 11.

Fig. 3.

Notched box plots of the SNR (left) and correlation coefficient maps between the genuine mean CBF and the output of BOA-Net (right). Original_nsm and original_sm denote the genuine mean CBF maps from non-smoothed and smoothed ASL data; BOA-Net_nsm and BOA-Net_sm denote the mean CBF maps from the outputs of BOA-Net_nsm and BOA-Net_sm. The correlation coefficient maps between the genuine mean CBF and the output of BOA-Net_sm are shown in the top row, and those for BOA-Net_nsm in the bottom row. Only 2 axial slices are shown. Correlation coefficients less than 0.3 were thresholded to 0.

Figure 3 shows box plots of the SNR and the spatial correlations for BOA-Net_sm and BOA-Net_nsm. The average SNRs of the genuine mean CBF maps from non-smoothed and smoothed ASL data were 6.96 and 12.64, respectively, while the average SNRs of the mean CBF maps from the outputs of BOA-Net_nsm and BOA-Net_sm were 12.26 and 15.11. BOA-Net_sm improved SNR by 19.54% over the mean CBF maps of smoothed ASL, while BOA-Net_nsm achieved a 76.15% SNR improvement over the mean CBF maps of non-smoothed ASL. The correlation coefficient at each voxel was calculated between the genuine mean CBF map and the network output. As Figure 3 shows, the outputs of BOA-Net_sm and BOA-Net_nsm correlated strongly with the genuine mean CBF, indicating that both networks can correctly predict individual subjects' CBF patterns.

Table 1. The average PSNR and SSIM from different CNN architectures used in BOA-Net_sm and BOA-Net_nsm

Table 1 shows the PSNR and SSIM of the mean CBF maps predicted by the different models. DWAN achieved the highest PSNR and SSIM in both the BOA-Net_sm and BOA-Net_nsm categories. Figure 4 provides a visual comparison of the mean CBF maps predicted from BOLD fMRI using the different CNN architectures. DWAN suppressed more noise than DilatedNet while recovering more detail than U-Net. Moreover, DWAN_nsm had better perfusion contrast than DWAN_sm, while DWAN_sm recovered more signal near the air-brain boundaries.

Fig. 4.

Two representative slices of the mean CBF maps produced by different processing methods. The three columns on the left side are mean CBF maps of outputs of U-Net_nsm, DilatedNet_nsm, and DWAN_nsm respectively. The three columns on the right side are mean CBF maps of outputs of U-Net_sm, DilatedNet_sm, and DWAN_sm

4 Discussion and Conclusion

To our knowledge, this study represents the first effort to extract quantitative CBF from BOLD fMRI. Compared with the genuine mean CBF from ASL data, BOA-Net can provide CBF measurements with higher SNR and higher temporal resolution, both inherited from BOLD fMRI (the higher SNR is also partly contributed by DL-based denoising). For existing datasets without ASL MRI, this provides a unique opportunity to generate a new functional imaging modality. For future studies, it offers an opportunity to avoid the ASL MRI scan altogether, though this will require further evaluation, especially in diseased populations. Even if an ASL MRI scan is still needed, its scan time can be substantially shortened, with the reduced SNR compensated by the CBF estimated by BOA-Net. Because this study was tested only on data from a dual-echo ASL sequence, future work will also aim at extending it to different datasets.