

Arterial Spin Labeling Image Synthesis From Structural MRI Using Improved Capsule-Based Networks

2020, IEEE Access

Received September 19, 2020, accepted September 28, 2020, date of publication October 1, 2020, date of current version October 14, 2020. Digital Object Identifier 10.1109/ACCESS.2020.3028113

WEI HUANG 1,2, MINGYUAN LUO 3, XI LIU 1, PENG ZHANG 4, AND HUIJUN DING 3
1 Department of Computer Science, School of Information Engineering, Nanchang University, Nanchang 330031, China
2 Informatization Office, Nanchang University, Nanchang 330031, China
3 Lab of Medical UltraSound Image Computing (MUSIC), School of Biomedical Engineering, Shenzhen University, Shenzhen 518060, China
4 School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China

Corresponding author: Wei Huang (n060101@e.ntu.edu.sg)

This work was supported in part by the National Natural Science Foundation of China under Grant 61862043 and Grant 61971352, in part by the Natural Science Foundation of Jiangxi Province under Grant S2020RCDT2K0033, and in part by the Natural Science Foundation of Shaanxi Province under Grant 2018JM6015.

The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.

ABSTRACT Medical image synthesis has received much attention in recent years, as ample medical images can be synthesized by diverse deep learning models to alleviate the lack of data in many medical imaging applications. However, most medical image synthesis methods still incorporate the well-known pooling operation in their convolutional neural networks-based / generative adversarial networks-based models, through which image details are inevitably lost. In order to tackle this problem, improved capsule-based networks, in which no pooling operation is executed and spatial details of images can be effectively preserved thanks to the equivariance characteristics of capsule models, are proposed in this paper to synthesize arterial spin labeling images for the first time. Technically, three important issues in constructing improved capsule-based networks, including the depth of basic convolutions, the layer of capsules, and the capacity of capsules, are thoroughly investigated. Comprehensive experiments made up of region-based / voxel-based partial volume corrections and dementia diseases diagnosis based on two different datasets are conducted. The superiority of the improved capsule-based networks introduced in this paper is substantiated from the statistical point of view.

INDEX TERMS Image analysis, image generation, computer aided diagnosis, capsule, arterial spin labeling, deep learning.

I. INTRODUCTION
Arterial spin labeling (ASL) is widely acknowledged as a non-invasive magnetic resonance imaging (MRI) technique, and it has received increasing research attention in the field of dementia diseases diagnosis beginning only in recent years [1]–[4]. ASL has become a popular indicator in dementia diseases diagnosis for two reasons. First, ASL requires no injection of an external contrast enhancement agent (e.g., gadolinium) into patients while being scanned, which totally avoids anaphylactoid reactions for elders. Second, cerebral blood flow (CBF) within each voxel is proportional to its ASL signal. In this way, brain atrophy of demented patients can be directly revealed by relatively low CBF in certain regions of ASL images, compared with those of normal people.
An illustration of acquiring ASL images via actual scanning is depicted in Fig. 1. Technically, an ASL image is produced as the direct difference between a label image and a control image. There are three key steps when acquiring ASL images via actual scanning. The first step is to acquire label images, which is displayed in subplot-a of Fig. 1. When arterial blood flows into Area 1 of subplot-a, it is magnetically labeled via a 180° radio-frequency inversion pulse, and water molecules within the arterial blood are utilized as the endogenous "tracers". These tracers then flow into Area 2 of subplot-a, driven by the blood circulation, after one transit time. The label images are acquired within Area 2, therein. The second step is to acquire control images. The acquisition occurs at Area 4 of subplot-b, which is in fact the same region as Area 2 of subplot-a. The only difference is that Area 3 of subplot-b applies no radio-frequency inversion pulse as realized in Area 1, thus arterial blood in acquiring such control images is not magnetically labeled. The third step is to produce ASL images by subtracting label images from control images, which is illustrated in subplot-c of Fig. 1. One thing to note here is that the green arrow in subplot-c represents control images obtained from Area 4 of subplot-b, while the orange arrow in subplot-c indicates label images obtained from Area 2 of subplot-a. Hence, the red arrow in subplot-c represents their direct difference, which denotes the ASL images. Furthermore, on the right-hand side of each subplot in Fig. 1, corresponding example images (i.e., label images in subplot-a, control images in subplot-b, and ASL images in subplot-c) are also displayed from the transverse view. The scale unit of color bars in these example images is mL/100 g/min.

FIGURE 1. An illustration of key steps in ASL images acquisition via actual scanning and corresponding example images.

Although ASL is promising in dementia diseases diagnosis, there are two challenging problems to be properly tackled before ASL can become really applicable in clinical diagnosis. First, ASL is a relatively new imaging modality in dementia diseases diagnosis. Thus, ASL images are unfortunately lacking in many well-established image-based dementia datasets. For instance, major imaging modalities in the well-known ADNI-1 / Go / 2 / 3 datasets include structural MRI, DWI (i.e., diffusion-weighted imaging), PET, etc. However, ASL is not among them [5]. This data insufficiency is certainly not beneficial for the thorough investigation and wide utilization of ASL images. Second, even if ASL images can be extensively obtained, the notorious problem of partial volume effects (i.e., PVE), which is essential in ASL image processing, needs to be carefully tackled [6], [7]. Generally speaking, PVE can be considered as the loss of apparent activities closely related to the signal contamination problem, and the limited resolution of an imaging system is often considered as one of the main causes of PVE [2]–[4], [6], [7].
It can be summarized based on the above descriptions that, in order to make ASL images more applicable and reliable in contemporary dementia diseases diagnosis, ASL images should be synthesized from another more common imaging modality (i.e., solving Problem 1), and those synthesized ASL images still need to be corrected regarding the PVE problem (i.e., solving Problem 2). In this study, several improved capsule-based models are proposed. They aim to synthesize ASL images from structural MRI images. These synthesized ASL images also undergo rigorous PVE correction tests to prove their reliability. It is worth mentioning that the capsule model [8] was investigated to realize retinal image synthesis in [9]. In that study, the public DIARETDB1 retinal image dataset [10] was incorporated to train the capsule model for fulfilling the retinal image synthesis task. Nevertheless, our study is significantly different, as the ASL image synthesis problem has never before been solved via capsule-based models, and these models are newly proposed after thoroughly investigating three important factors in constructing the capsule network (i.e., the depth of basic convolutions, the layer of capsules, and the capacity of capsules).

The organization of this paper is as follows. In Section II, a review of recent developments in deep learning-based medical image synthesis is comprehensively introduced. In Section III, technical details of the improved capsule-based models newly introduced in this study are elaborated. In Section IV, extensive experiments are conducted to substantiate the superiority of improved capsule-based models, both qualitatively and quantitatively. Comprehensive analyses based on the 355-demented-patients dataset constructed by our group as well as the public ADNI-1 dataset are executed from the statistical perspective. In Section V, the conclusion of this study is drawn.

Main contributions of this study can be summarized as follows.
• First, it is the first attempt to successfully synthesize ASL images using capsule-based networks.
• Second, three important issues in designing capsule-based networks, including the depth of basic convolutions, the layer of capsules, and the capacity of capsules, are thoroughly investigated for shortlisting optimal model architectures in this synthesis study.
• Third, the effectiveness of improved capsule-based networks for synthesizing ASL images has been comprehensively verified through a series of rigorous tests, which are composed of region-based PVE, voxel-based PVE, and dementia diseases diagnosis tests. It is promising that the dementia diseases diagnosis performance based on both a 355-demented-patients dataset and the popular ADNI-1 dataset can be significantly improved with the help of synthesized ASL images obtained from the improved capsule-based networks introduced in this study.

II. A REVIEW OF RECENT DEVELOPMENTS IN DEEP LEARNING-BASED MEDICAL IMAGE SYNTHESIS
It is widely acknowledged that deep learning models often need to be trained on large-scale datasets in order to achieve high generalization performance. Although large-scale datasets composed of general images are commonly seen nowadays, it is still difficult to build up large-scale medical image datasets. The reasons why large-scale medical image datasets are difficult to construct can be interpreted as follows.
First, patient privacy is a major concern in the medical imaging domain. Informed consents need to be obtained from patients, and the data de-sensitization operation is often indispensable. Second, the increasing costs of acquiring medical images, such as MRI, CT (i.e., computed tomography), PET (i.e., positron emission tomography), etc., when building up large-scale medical image datasets of multi-modalities cannot be neglected. Other adverse factors, including patient discomfort during prolonged scanning durations, the unavailability of full-set scanners, etc., further hinder the construction of large-scale medical image datasets at the current stage.

Although large-scale medical image datasets acquired by actual scanning are challenging to construct, high-quality medical images can be synthesized as an alternative [11], [12]. Recent studies have demonstrated that diverse medical image modalities, including MRI images [13]–[19], PET images [20]–[22], CT / X-Ray images [23]–[25], ultrasound images [26], mammography images [27], [28], eye (including retinal, fundus, and glaucoma) images [29]–[32], and endoscopic images [33], can be successfully synthesized. Generally, machine learning techniques are widely acknowledged to provide a profound impact on medical image synthesis, and the synthesis task itself can be considered as finding a good mapping from the source image to the target image [34]. As machine learning techniques in recent years have been evolving from shallow learning to deep learning, most medical image synthesis studies can also be categorized into shallow learning-based synthesis and deep learning-based synthesis, therein.

For shallow learning-based medical image synthesis studies, well-known shallow learning techniques are heavily relied on. For instance, the classic Bayesian model was utilized in [13] to construct a modality propagation scheme to synthesize MRI images with glioma-bearing brains. Also, the classic idea of dictionary learning, which was popular in the era of shallow learning, was utilized in [14] to propose a weakly coupled and geometry co-regularized joint model for MRI image synthesis. It is also necessary to mention that the mapping from the source image to the target image is often explicitly represented by shallow learning models in these studies, whose model structures are often not as sophisticated as those of deep learning models. Therefore, the generalization capability of shallow learning models in these studies still has room to be further improved.

For deep learning-based medical image synthesis studies, their capabilities of implicitly representing non-linear and complicated mappings from the source image to the target image have been greatly improved, thanks to these sophisticated deep learning models. It is also important to point out that the current trend in deep learning has been evolving from deep discriminative learning to deep generative learning. Thus, most deep learning-based medical image synthesis studies can be further divided into deep discriminative learning-based synthesis and deep generative learning-based synthesis, accordingly. For deep discriminative learning-based synthesis studies, many well-established deep discriminative learning models, including CNN (i.e., convolutional neural network) [35] and ResNet (i.e., deep residual network) [36], have been vastly investigated and utilized.
For instance, both CNN and ResNet models were incorporated as building blocks to construct a U-net model [37] for fulfilling the multi-modal MRI image synthesis task in [15]. For deep generative learning-based synthesis studies, either the original GAN model itself [38] or its diverse GAN derivatives have been widely incorporated. Some representative studies belonging to this category are introduced as follows. In [16], an edge-aware GAN model, which integrated the important edge information to delineate boundaries of objects in medical images, was introduced to realize MRI image synthesis. Similarly, the important edge information was represented via a sketch guidance module in [17], to bring about a SkrGAN model for synthesizing various medical image modalities. In [18], both the pixel-wise loss and the perceptual loss were fused together within a conditional GAN model [39], to construct a new pGAN model for realizing multi-contrast MRI image synthesis. In [21] and [22], a locality adaptive fusion mechanism and a U-net model were separately incorporated within the 3D conditional GAN model [39], to synthesize high-quality PET images from their corresponding low-dose ones.

It can be summarized from most contemporary deep learning-based medical image synthesis studies that both conventional deep discriminative learning models (e.g., CNN and ResNet) and popular deep generative learning models (e.g., GAN and its many derivatives) have been receiving much popularity. However, it is also necessary to point out that most above-mentioned deep learning-based studies still heavily incorporate the well-known pooling operation in their CNN-based / GAN-based models, from which image details will inevitably be lost. As a result, the quality of synthesized images deteriorates, therein. In order to tackle this challenging problem, improved capsule-based networks, in which no pooling operation is executed and spatial details of images are effectively preserved thanks to the equivariance characteristics of capsule models [8], are introduced in this study to realize the ASL image synthesis task. Technical details are elaborated as follows.

III. METHODOLOGY
A. THE ARCHITECTURE OF THE ORIGINAL CAPSULE MODEL
The architecture of the original capsule model is illustrated in Fig. 2. It is obvious to note that the original capsule model is mainly composed of three sub-structures.

FIGURE 2. The basic architecture of the original capsule model.

The first one is denoted as the "ReLu Conv.1" in Fig. 2, in which W, H, C, and D stand for the width of the image, the height of the image, the channel of the model (i.e., which is equivalent to the image depth), and the dimension of the capsule, respectively. The main purpose here is to convert the original image patch input (i.e., of the dimension W × H × C) into its latent and primary representation (i.e., of the dimension W′ × H′ × C′) using a one-time convolution (i.e., Conv.1) and the following rectified linear unit (i.e., ReLu). It can be perceived that the first sub-structure of the original capsule model actually performs the same type of latent representation conversion as the conventional CNN model. Moreover, the second sub-structure of the original capsule model is denoted as the "PrimaryCaps", which has N capsules (i.e., N = W″ × H″ × C″ in Fig. 2) to perform vectorial convolutions. It is important to point out that vectorial convolutions are significantly different from conventional scalar convolutions adopted in CNN, and the outcome of the vectorial convolution realized by one capsule in the "PrimaryCaps" is often of D dimensions. After that, the third sub-structure of the original capsule model, which is denoted as the "Capsules Layer", is realized by a non-linear squashing activation function and the classic dynamic routing scheme [8].

It can be summarized based on the above description that the latent and primary representation obtained by the "ReLu Conv.1" actually acts as an input of the two following core sub-structures in the original capsule model. Also, the "PrimaryCaps" and the "Capsules Layer" core sub-structures together perform as the semantic bridge between the primary representation and the high-level output (i.e., of the dimension N′ × D′) of the capsule model. Since all three structures play important roles in the capsule model, they need to be comprehensively investigated and carefully designed for guaranteeing optimal synthesis performance in this study.
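To make the squashing activation mentioned above concrete, the following is a minimal sketch in PyTorch (the framework the authors report using in Section IV); the function name and tensor shapes here are illustrative, not taken from the paper's released code.

import torch

def squash(s, dim=-1, eps=1e-8):
    # v = (|s|^2 / (1 + |s|^2)) * (s / |s|): short vectors are shrunk toward
    # zero while long vectors approach (but never reach) unit length, so a
    # capsule's vector norm can be read as an "existence" probability.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

# Example: outputs of N = 32 primary capsules, each a D = 8 dimensional vector.
u = torch.randn(32, 8)
v = squash(u)
print(v.norm(dim=-1).max().item() < 1.0)  # True: every output norm stays below 1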
B. IMPROVED CAPSULE-BASED MODELS FOR REALIZING ASL IMAGE SYNTHESIS
Three important issues, including the number of basic convolutions, the number of capsule layers, and the capacity of capsules, are emphasized and carefully considered for guaranteeing optimal synthesis performance of the improved capsule-based models in this study. Details are as follows.

1) THE NUMBER OF BASIC CONVOLUTIONS
Fig. 3 illustrates three improved capsule-based models after taking the number of basic convolutions into consideration. It can be noticed that the numbers of basic convolutions before the "PrimaryCaps" sub-structure in the three models vary significantly (i.e., 1, 6, and 9) in Fig. 3. "CNN1+CapsNet" in Fig. 3 is similar to the original capsule model in Fig. 2, except that two BN (i.e., batch normalization) operations are executed before and after the basic convolution to avoid the notorious overfitting problem. "CNN6+CapsNet" and "CNN9+CapsNet" obviously contain more basic convolutions before the "PrimaryCaps" sub-structure, compared with "CNN1+CapsNet". The motivation here is to investigate the optimal setting of basic convolutions in improved capsule-based models for obtaining latent and primary features.

FIGURE 3. Three improved capsule-based models after taking the number of basic convolutions into consideration (i.e., from left to right: "CNN1+CapsNet", "CNN6+CapsNet", "CNN9+CapsNet").

2) THE NUMBER OF CAPSULE LAYERS
Fig. 4 illustrates three other improved capsule-based models, after taking the number of capsule layers into consideration. It can be observed that all three models in Fig. 4 have one basic convolution (i.e., "CNN1") for obtaining the latent and primary representation from the original input. However, their main difference resides in the number of capsule layers utilized in the "Capsules Layer" sub-structure. To be specific, there are 2, 3, and 4 capsule layers in "CNN1+CapsNet3", "CNN1+CapsNet4", and "CNN1+CapsNet5", respectively. Since capsules can be viewed as a special form of neurons in CNN, incorporating different numbers of capsule layers in improved capsule-based models is similar to adopting different numbers of scalar convolution-based layers in CNN. In this way, high-level latent representations obtained by multi-layer vectorial convolutions can be obtained.

FIGURE 4. Three improved capsule-based models after taking the layer of capsules into consideration (i.e., from left to right: "CNN1+CapsNet3", "CNN1+CapsNet4", "CNN1+CapsNet5").
Similar to the previous discussion about the number of basic convolutions, the motivation of adopting different numbers of capsule layers here is to find out the optimal setting of vectorial convolution layers in improved capsule-based models for fulfilling ASL image synthesis.

3) THE CAPACITY OF CAPSULES
The capacity of capsules is proposed in this study as a new concept for the first time. The capacity can be explicitly defined as [the number of capsules × the dimension of capsules] (e.g., N × D and N′ × D′ in Fig. 2). Fig. 5 illustrates two improved capsule-based models after taking the capacity of capsules into consideration when fulfilling the ASL image synthesis task. To be specific, the increase of capacities is realized by adding more capsules into the "PrimaryCaps" sub-structure, which makes the capsule-based models wider.

FIGURE 5. Two improved capsule-based models after taking the capacity of capsules into consideration (i.e., from left to right: "CNN1+CapsNet2 Wide", "CNN1+CapsNet3 Wide").

An illustration of the comparison between the wide capsule model and the original "narrow" capsule model is depicted in Fig. 6. Detailed explanations are as follows. For the original capsule model in Fig. 6, its capacities of capsules in the "PrimaryCaps" and "Capsules Layer" sub-structures can be calculated as 80 (i.e., 8 × 10) and 160 (i.e., 16 × 10), respectively, which makes the flow within the original capsule model "from narrow to wide" (i.e., 80→160). After increasing the number of capsules in the "PrimaryCaps" sub-structure of the wide capsule model from 10 to 64, its capacity of "PrimaryCaps" reaches 512 (i.e., 8 × 64), which makes the flow within the wide capsule model "from wide to narrow" instead (i.e., 512→160).

FIGURE 6. The comparison between the wide capsule model and the original "narrow" capsule model.

The reason why this new "from wide to narrow" flow in the wide capsule model is more favorable can be explained as follows. Since the "PrimaryCaps" and "Capsules Layer" sub-structures together act as a semantic bridge between the primary latent representation obtained from basic convolutions and the high-level understanding output of the capsule model, this new "from wide to narrow" flow actually complies well with the classic FC (i.e., fully-connected) operation in CNN, which also maps a great number of feature maps into a limited number of groups. Therefore, this new "from wide to narrow" flow in the wide capsule model, brought by increasing the capacity of capsules, can actually be considered as an alternative FC in capsule-based models.

The mathematical proof related to the improvement of the wide capsule model over the original "narrow" capsule model is explained as follows. Let us denote the numbers of capsules in the "PrimaryCaps" and "Capsules Layer" of Fig. 6 as N_1 and N_2, respectively. Furthermore, the dimensions of capsules in the "PrimaryCaps" and "Capsules Layer" of Fig. 6 are denoted as D_1 and D_2, respectively. For the wide capsule model, the following two constraints exist, i.e., N_1 > N_2 and D_1 < D_2. The reason is that high-level layers in a capsule model (i.e., the "Capsules Layer" of Fig. 6) normally aim to represent more complicated objects; therefore, D_2 should be larger than D_1 for describing the complexity (i.e., D_1 < D_2). Furthermore, since the capacity in the wide capsule model decreases from "PrimaryCaps" to "Capsules Layer" (i.e., N_1 D_1 > N_2 D_2), together with D_1 < D_2, the other constraint N_1 > N_2 follows. Also, subtracting the same term (i.e., N_2 D_1) from both sides of the inequality N_1 D_1 > N_2 D_2 gives N_1 D_1 − N_2 D_1 > N_2 D_2 − N_2 D_1. After factoring out the common terms, the inequality (N_1 − N_2) D_1 > (D_2 − D_1) N_2 is obtained. Therefore, the final inequality can be easily derived as follows.

\frac{N_1 - N_2}{N_2} > \frac{D_2 - D_1}{D_1}.  (1)

It can be read from the above inequality that the ratio of decreasing the number of capsules in the wide capsule model is larger than the ratio of increasing the dimension of capsules in the wide capsule model. In other words, when representing more complicated objects in the wide capsule model, there is no need to include all low-level characteristics (i.e., latent features) of such objects. The above finding is similar to the idea of distillation in deep learning [40], and the classic FC operation in CNN models mentioned above also complies with the same idea. It is also worth highlighting that the above new findings are proposed in this study for the first time.
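Plugging the concrete capacities quoted above for Fig. 6 into inequality (1) gives a quick sanity check (the numbers 64, 8, 10, and 16 are read off from the wide model described above):

\[
N_1 = 64,\; D_1 = 8,\; N_2 = 10,\; D_2 = 16:\qquad N_1 D_1 = 512 > N_2 D_2 = 160,
\]
\[
\frac{N_1 - N_2}{N_2} = \frac{54}{10} = 5.4 \;>\; \frac{D_2 - D_1}{D_1} = \frac{8}{8} = 1,
\]

i.e., the relative reduction in the number of capsules (540%) indeed exceeds the relative growth in capsule dimension (100%), matching the "from wide to narrow" (512→160) flow.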
C. THE ENERGY FUNCTION TO REALIZE MEDICAL IMAGE SYNTHESIS BASED ON IMPROVED CAPSULE-BASED NETWORKS
In this study, voxel-wise deviations between real ASL images and their corresponding synthesized ASL images are utilized to construct the energy function of improved capsule-based models for realizing their trainings. Since real ASL images often include outliers, the sparsity-induced L1 loss is utilized as the regression loss, rather than the conventional L2 loss that is more sensitive to outliers. Furthermore, as the original L1 loss is not smooth at zero, the smooth L1 loss is chosen as an alternative in this study. Let τ_1 denote the real structural MRI image set and τ_2 the real ASL control (or label) image set in this study; s_i ∈ τ_1 and a_i ∈ τ_2 are the structural MRI image and the real ASL control (or label) image of the ith patient, respectively. Given the mapping function from structural MRI images to ASL control (or label) images as S(·), a′_i = S(s_i) represents the synthesized ASL control (or label) image from the structural MRI image s_i. The smooth L1 loss can be mathematically described as Eq. 2:

L = \frac{1}{m} \sum_{i=1}^{m} H(a_i, a'_i) = \frac{1}{m} \sum_{i=1}^{m} H(a_i, S(s_i)),  (2)

where m indicates the number of real structural MRI / ASL control (or label) images. It is easy to perceive that Eq. 2 aims to minimize the mean of errors between synthesized and real ASL control (or label) images. Moreover, H in Eq. 2 can be efficiently computed using the piece-wise function described in Eq. 3:

H(x, y) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} \frac{1}{2}(x_i - y_i)^2, & \text{if } |x_i - y_i| \le 1, \\ |x_i - y_i| - \frac{1}{2}, & \text{otherwise}. \end{cases}  (3)

Optimizing the smooth L1 loss of improved capsule-based models in this study can be realized by the conventional SGD (i.e., stochastic gradient descent) algorithm. Also, other more efficient algorithms, such as the PGD (i.e., proximal gradient descent) algorithm, can be conveniently incorporated based on Eqs. 2 & 3 in this study as well.
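As a minimal sketch of Eqs. 2 & 3, assuming PyTorch (the framework Section IV reports for the implementation); the tensor shapes are illustrative only.

import torch
import torch.nn.functional as F

def smooth_l1(a, a_hat):
    # Eq. 3: quadratic for |x_i - y_i| <= 1, linear otherwise, averaged over
    # all voxels (and, via .mean(), over the batch too, as in Eq. 2).
    d = (a - a_hat).abs()
    return torch.where(d <= 1.0, 0.5 * d ** 2, d - 0.5).mean()

a = torch.rand(4, 28, 28, 12)      # real ASL control (or label) patches a_i
a_hat = torch.rand(4, 28, 28, 12)  # synthesized patches a'_i = S(s_i)
print(smooth_l1(a, a_hat).item())
# Matches PyTorch's built-in smooth L1 loss with its default threshold of 1:
print(F.smooth_l1_loss(a_hat, a).item())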
It is also valuable to point out that capsule-based networks can work as clustering models, in which the clustering in each iteration is actually fulfilled by the dynamic routing in the capsule model [41]. In this way, excellent performance of capsule-based models can be expected.

D. THE MATHEMATICAL PROOF RELATED TO THE EXCELLENT PERFORMANCE OF CAPSULE-BASED NETWORKS
For conventional CNN models, it is necessary to learn diverse latent features via various layers and multiple neurons. In order to ensure the generalization capability, the numbers of layers and neurons within a CNN model are normally enormous in contemporary CNN models, and the amount of data to train such a complicated model is also very large. For capsule-based networks, however, capsules can be shared thanks to their equivariance characteristics. Therefore, the number of shared capsules won't be large, and it is practicable to incorporate a smaller amount of training data in capsule-based networks. The mathematical proof related to the excellent performance of capsule-based networks is as follows. Given the output of the ith capsule on the lth layer in a capsule-based network as u_i, the weight between the ith capsule on the lth layer and its connected jth capsule on the (l + 1)th layer is denoted as W_ij. The input of the jth capsule on the (l + 1)th layer can be calculated as follows:

\hat{u}_{j|i} = W_{ij} u_i.  (4)

Moreover, given the output of the jth capsule on the (l + 1)th layer as v_j, the probability that the input \hat{u}_{j|i} influences the output v_j can be mathematically calculated as follows:

p_{i|j} = \frac{\exp\langle \hat{u}_{j|i}, v_j \rangle}{\sum_i \exp\langle \hat{u}_{j|i}, v_j \rangle},  (5)

where ⟨·, ·⟩ denotes the inner product. Hence, v_j can be further represented as follows:

v_j = \mathrm{squash}\Big(\sum_i p_{i|j} \hat{u}_{j|i}\Big) = \mathrm{squash}\Big(\sum_i \frac{\exp\langle \hat{u}_{j|i}, v_j \rangle}{\sum_i \exp\langle \hat{u}_{j|i}, v_j \rangle} \hat{u}_{j|i}\Big),  (6)

in which v_j can be considered as a high-level feature fusion made up of multiple low-level features \hat{u}_{j|i} that are fed in from different capsules i. It is very important to mention that the above fusion procedure is quite similar to the well-known clustering procedure, in which v_j can be viewed as the clustering center of different \hat{u}_{j|i}. Meanwhile, the distance between the clustering center and different components is proportional to the inner product ⟨\hat{u}_{j|i}, v_j⟩. To sum up, the capsule operation is a generalized clustering operation. The above mathematical proof can also be substantiated by a recently published paper [41], in which the essential dynamic routing in capsule-based networks was solved via EM routing based on a Gaussian mixture model (GMM). It further ensures that the capsule operation is a generalized clustering operation, and the excellent performance of capsule-based networks can be expected from the clustering perspective.
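The fusion of Eqs. 4–6 can be sketched as an iterative routine. The snippet below is a schematic PyTorch rendering, not the authors' code, and it follows Eq. 5 as printed (normalization over the input capsules i; the routing of [8] instead applies the softmax over output capsules j).

import torch

def squash(s, dim=-1, eps=1e-8):
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: [N_in, N_out, D_out] holds the predictions u_hat_{j|i} = W_ij u_i (Eq. 4).
    b = torch.zeros(u_hat.shape[0], u_hat.shape[1])   # agreement logits per (i, j) pair
    v = None
    for _ in range(n_iters):
        p = torch.softmax(b, dim=0)                   # Eq. 5: p_{i|j}
        v = squash((p.unsqueeze(-1) * u_hat).sum(0))  # Eq. 6: weighted fusion, then squash
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)      # update via <u_hat_{j|i}, v_j>
    return v                                          # [N_out, D_out] "cluster centers"

u_hat = torch.randn(32, 10, 16)  # 32 input capsules routed to 10 output capsules of dim 16
print(dynamic_routing(u_hat).shape)  # torch.Size([10, 16])

Read this way, each v_j is the center of a cluster of predictions, which is exactly the clustering interpretation given above.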
IV. EXPERIMENTAL ANALYSES
A. DATA ACQUISITION
Improved capsule-based models for realizing ASL image synthesis in this study have been comprehensively evaluated based on a multi-modal MRI image dataset constructed by our group. There are in total 355 real patients in this dataset, including 38 AD (i.e., Alzheimer's disease) patients, 185 MCI (i.e., mild cognitive impairment) patients, and 132 NCI (i.e., no cognitive impairment) patients as normal controls. The average age of these patients is 70.56 ± 7.20 years. For MRI images of the dataset, both high-resolution MPRAGE (i.e., magnetization prepared rapid acquisition gradient echo) T1-weighted MRI images (i.e., utilized as structural MRI) [42] and pseudo-continuous ASL images are acquired from each individual patient, via a SIEMENS 3T TIM Trio MR scanner at the affiliated hospital of Nanchang University. Acquisition parameters mainly include: labeling duration = 1500 ms, post-labeling delay = 1500 ms, TR/TE = 4000/9.1 ms, ASL voxel size = 3 × 3 × 5 mm³. Spatial resolutions of structural MRI and ASL images are 192 × 256 × 256 and 64 × 64 × 21, respectively. The labeling of dementia diseases (i.e., AD / MCI / NCI) in this dataset was fulfilled by our senior clinicians by consensus, and informed consents were obtained from all patients for conducting this study. Moreover, both qualitative (i.e., Subsection IV-B) and quantitative evaluations (i.e., Subsection IV-C) have been conducted from the statistical point of view in this study. Essential pre-processing of raw MRI images, including motion corrections, intra-modality registrations (i.e., within structural MRI and ASL images independently, using the first slice as the reference slice), inter-modality registrations (i.e., from ASL images to structural MRI images), etc., has been conducted using the popular SPM toolbox [43]. Furthermore, in Section IV-C.4, additional experiments of ASL image synthesis based on the well-known ADNI-1 dataset are conducted to further substantiate the superiority of the improved capsule-based models proposed in this study.

B. QUALITATIVE EVALUATIONS
All 8 improved models introduced in Section III-B are capsule-based, and they are compared with conventional CNN-based models, including a 7-layer CNN model (i.e., denoted as "CNN7") and a 12-layer CNN model (i.e., denoted as "CNN12"), for synthesizing ASL images from structural MRI images in this part. Main structures of the two compared CNN-based models are illustrated in Fig. 7.

FIGURE 7. Main structures of two CNN-based models for comparison in ASL image synthesis of this study.

It is important to point out that fewer FCs are incorporated in "CNN7" than in "CNN12", as a potential risk of overfitting can thereby be avoided. The reasons to incorporate "CNN7" and "CNN12" for comparisons in this study are explained as follows. For the improved capsule-based models introduced in this study, the maximum layer count is 11 (i.e., "CNN9+CapsNet") and the minimum is 3 (i.e., "CNN1+CapsNet"). Therefore, "CNN7" and "CNN12" are compared as they are equipped with comparable numbers of layers as the introduced capsule-based models. It is also valuable to mention that there is no pooling operation and the stride equals 1 in both the "CNN7" and "CNN12" models. The reason is that this is the optimal setting in which no loss of information within the CNN models is incurred in the two compared models. Therefore, their performance in synthesizing ASL images should be more appreciated than that of other models with non-unit strides (e.g., stride = 2) and pooling (e.g., max-pooling) operations.
Furthermore, implementation details of all the above-mentioned 10 deep learning-based models for synthesizing ASL images are described as follows. The patch size of structural MRI images that are utilized as the input of all improved capsule-based models is consistently 28 × 28 × 12. Batch sizes of the improved capsule-based models are 32 (i.e., for "CNN1+CapsNet2 Wide"), 128 (i.e., for "CNN1+CapsNet", "CNN1+CapsNet3", "CNN1+CapsNet4", and "CNN9+CapsNet"), 192 (i.e., for "CNN1+CapsNet3 Wide"), 256 (i.e., for "CNN6+CapsNet"), and 512 (i.e., for "CNN1+CapsNet5"), respectively. Epochs of the improved capsule-based models are 6 (i.e., for "CNN1+CapsNet3 Wide"), 10 (i.e., for "CNN1+CapsNet", "CNN1+CapsNet2 Wide", "CNN1+CapsNet5", "CNN6+CapsNet", and "CNN9+CapsNet"), and 50 (i.e., for "CNN1+CapsNet3" and "CNN1+CapsNet4"), respectively. The learning rate of all improved capsule-based models is 0.01. For the "CNN7" and "CNN12" models, their patch sizes of structural MRI images are both set as 8 × 8 × 12. Furthermore, the batch size, epoch, and learning rate of the two CNN-based models are 5120, 50, and 0.01, respectively. All of the above parameters were tuned via trial and error for achieving optimal performance of ASL image synthesis in each individual model. All model implementations are realized via Pytorch 0.4.0 on Ubuntu 16 OS, using a workstation equipped with hardware including an Intel Xeon CPU i7-7700K, 32 GB RAM, and a Nvidia Titan XP GPU card.
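The configuration above can be summarized as the following schematic training loop. This is a sketch only: `model` is a stand-in for a capsule-based synthesizer, the data are random toy tensors, and only the reported optimizer, learning rate, and loss are taken from the text.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for paired sMRI / ASL patches of size 28 x 28 x 12 (flattened).
smri = torch.rand(256, 28 * 28 * 12)
asl = torch.rand(256, 28 * 28 * 12)
loader = DataLoader(TensorDataset(smri, asl), batch_size=128, shuffle=True)

model = torch.nn.Linear(28 * 28 * 12, 28 * 28 * 12)  # placeholder synthesizer S(.)
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate reported above

for epoch in range(10):                              # e.g., 10 epochs, as for several models
    for s, a in loader:
        loss = F.smooth_l1_loss(model(s), a)         # energy function of Eqs. 2 & 3
        opt.zero_grad()
        loss.backward()
        opt.step()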
Figs. 8, 9, and 10 demonstrate ASL image synthesis outcomes of example AD, MCI, and NCI patients obtained by all compared deep learning-based models, respectively. To be specific, 12 slices, from the 7th to the 18th out of all 21 slices of the synthesized ASL image outcomes, are demonstrated in Figs. 8–10, as these middle slices contain rich brain information. It can be noticed from these figures that the first row indicates real ASL images obtained by actual scanning (i.e., denoted as the "golden standard"), while the remaining 10 rows depict synthesized ASL image outcomes obtained by different models. The two columns in Figs. 8–10 depict ASL control and label images, which are the essential elements to produce ASL images as illustrated in Fig. 1.

FIGURE 8. ASL image synthesis outcomes obtained by all compared models based on one example AD patient in this study (i.e., Rows 1: golden standard, 2: CNN1+CapsNet, 3: CNN1+CapsNet2 Wide, 4: CNN1+CapsNet3, 5: CNN1+CapsNet3 Wide, 6: CNN1+CapsNet4, 7: CNN1+CapsNet5, 8: CNN6+CapsNet, 9: CNN9+CapsNet, 10: CNN7, 11: CNN12; Columns I: ASL control images, II: ASL label images; images are displayed from the transverse view and the scale unit of color bars is a.u.).

FIGURE 9. ASL image synthesis outcomes obtained by all compared models based on one example MCI patient in this study (i.e., rows and columns organized as in Fig. 8).

FIGURE 10. ASL image synthesis outcomes obtained by all compared models based on one example NCI patient in this study (i.e., rows and columns organized as in Fig. 8).

Qualitative analyses are as follows. First, for the improved capsule-based models taking the number of basic convolutions into consideration (i.e., "CNN1+CapsNet", "CNN6+CapsNet", and "CNN9+CapsNet"), Rows 2 brought by "CNN1+CapsNet" in Figs. 8–10 obviously suffer from a serious blurring problem (i.e., the mosaic phenomenon is easy to identify in Rows 2). Unfortunately, Rows 9 brought by "CNN9+CapsNet" are also deteriorated. Among them, Rows 8 brought by "CNN6+CapsNet" are the most satisfactory. It can be summarized based on the above observation that the number of basic convolutions in improved capsule-based models for synthesizing ASL images in this study needs to be properly determined. In other words, too few basic convolutions are not adequate for obtaining a proper latent representation for the later capsule-based operations (i.e., "CNN1+CapsNet"), while too many basic convolutions are also prone to deteriorate the performance (i.e., "CNN9+CapsNet").

Second, for the improved capsule-based models taking the number of capsule layers into consideration (i.e., "CNN1+CapsNet3", "CNN1+CapsNet4", and "CNN1+CapsNet5"), Rows 4 brought by "CNN1+CapsNet3" are blurrier than Rows 6 brought by "CNN1+CapsNet4" in Figs. 8–10. For Rows 7 brought by "CNN1+CapsNet5", the synthesized ASL control images in Figs. 9 and 10 for the MCI and NCI patients are significantly deteriorated. Therefore, the number of capsule layers in improved capsule-based models for ASL image synthesis also needs to be properly determined.

Third, for the improved capsule-based models taking the capacity of capsules into consideration (i.e., "CNN1+CapsNet2 Wide" and "CNN1+CapsNet3 Wide"), Rows 3 and Rows 5 depict synthesis outcomes for "CNN1+CapsNet2 Wide" and "CNN1+CapsNet3 Wide", respectively. Taking "CNN1+CapsNet3" (i.e., Rows 4) and "CNN1+CapsNet3 Wide" (i.e., Rows 5) for instance, the synthesis quality in Rows 5 is significantly better than that of Rows 4 (e.g., less blurring), in all three figures. Hence, it is valuable to take the issue of capacity into consideration when constructing capsule-based networks.

C. QUANTITATIVE EVALUATIONS
More detailed quantitative evaluations are conducted in this part to reveal which improved capsule-based model is superior. Experiments are conducted from the statistical point of view. Two different types of quantitative tests, including corrections of the challenging PVE problem in ASL images and the multi-modal MRI-based dementia diseases diagnosis, are incorporated. Details are as follows.

FIGURE 11. Boxplot of CBF calculated from real and synthesized ASL images obtained from all compared deep learning-based models in this study after applying the region-based PVE correction using 5 × 5 neighbors.

TABLE 1. Multiple comparison test in terms of CBF based on real and synthesized ASL images after conducting the 5 × 5 region-based PVE correction.

1) THE REGION-BASED PVE CORRECTION
The PVE correction problem can be mathematically described as follows [44], [45]. Provided voxel i, its magnetization M^C on the ASL control image and its magnetization M^L on the ASL label image can be explicitly represented as Eqs. 7 & 8:
M^C = P_{GM} \cdot M^C_{GM} + P_{WM} \cdot M^C_{WM} + P_{CSF} \cdot M^C_{CSF},  (7)
M^L = P_{GM} \cdot M^L_{GM} + P_{WM} \cdot M^L_{WM} + P_{CSF} \cdot M^L_{CSF},  (8)

where M^C and M^L denote the magnetizations of voxel i on the ASL control and label images, respectively (i.e., they are known parameters after obtaining ASL control and label images via actual scanning); P_GM, P_WM, and P_CSF are the fractional tissue volumes of GM (i.e., gray matter), WM (i.e., white matter), and CSF (i.e., cerebrospinal fluid) on voxel i, all of which are known after segmenting GM / WM / CSF tissues on co-registered MPRAGE T1-weighted images by the popular SPM toolbox. M^C_GM denotes the GM magnetization within the ASL control image, and M^L_GM denotes the GM magnetization within the ASL label image. Moreover, the terms M^⋆_∗ (i.e., ⋆ represents the ASL control (C) or label (L) image, while ∗ denotes the specific tissue among GM / WM / CSF) in Eqs. 7 & 8 are the unknowns to be solved in the PVE correction task. Since M^C_CSF and M^L_CSF are often assumed to be equivalent in contemporary clinical studies [44], [45], Eqs. 7 & 8 actually formulate a typical problem of indefinite equations (i.e., 2 equations with 5 unknowns). In order to solve the above problem, conventional region-based PVE correction methods were proposed in [44], [45], and they mainly follow the classic idea of linear regression to solve the 5 unknowns as (P^T P)^{-1} P^T M, in which P is of the size n² × 3, whose n² rows correspond to the n² voxels obtained from the n × n neighborhood of voxel i and whose 3 columns correspond to P_GM, P_WM, and P_CSF of all n² voxels; M denotes an n² × 2 matrix with M^L and M^C of all n² voxels as its two columns; T and −1 indicate the transpose and inverse of a matrix, respectively.
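The region-based correction therefore reduces, per voxel, to an ordinary least-squares fit over the n × n neighborhood. The sketch below (NumPy, synthetic numbers, n = 5) illustrates the closed form (P^T P)^{-1} P^T M; note that it solves all six tissue magnetizations unconstrained, whereas the studies cited above additionally tie M^C_CSF to M^L_CSF.

import numpy as np

n = 5                                      # 5 x 5 neighborhood around voxel i
rng = np.random.default_rng(0)

# P: n^2 x 3 fractional tissue volumes (P_GM, P_WM, P_CSF) of the neighbors;
# Dirichlet samples guarantee each row sums to one.
P = rng.dirichlet((1.0, 1.0, 1.0), size=n * n)

# Ground-truth 3 x 2 tissue magnetizations (rows GM/WM/CSF, columns control/label).
X_true = np.array([[60.0, 55.0], [20.0, 18.0], [5.0, 5.0]])
M = P @ X_true + 0.5 * rng.standard_normal((n * n, 2))  # noisy n^2 x 2 measurements

# Closed-form linear regression (P^T P)^{-1} P^T M from the text.
X_hat = np.linalg.solve(P.T @ P, P.T @ M)
print(X_hat)  # approximately recovers X_true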
Since PVE is an essential and challenging problem in ASL image processing, both real ASL images acquired by actual scanning (i.e., as the "golden standard") and synthesized ASL images obtained by all synthesis models need to undergo the rigorous PVE correction process, in order to verify the effectiveness of synthesized ASL images. It is easy to perceive that synthesized ASL images which, after the PVE correction, are most similar to real ASL images after the same PVE correction process will be highly appreciated. The notion of similarity here is quantitatively reflected by the perfusion signal (i.e., CBF), which is calculated based on the well-known kinetic function introduced in [46].

Figs. 11 and 12 demonstrate boxplots of CBF calculated based on real / synthesized ASL images after applying the region-based PVE correction method with small 5 × 5 neighbors (i.e., n = 5) and large 15 × 15 neighbors (i.e., n = 15), respectively. Technically, in Figs. 11 and 12, a red horizontal line across each box represents the median of CBF, while the upper and lower quartiles of CBF are depicted by blue lines above and below the median in each box. A vertical dashed line is drawn from the upper and lower quartiles to their most extreme data points, which are within a 1.5 inter-quartile range (i.e., IQR). Each data point beyond the ends of the 1.5 IQR is marked via a red star symbol. It can be observed in Fig. 11 that the boxes of "CNN6+CapsNet", "CNN1+CapsNet2 Wide", and "CNN1+CapsNet3 Wide" are close to that of the golden standard. The above observation complies very well with the previous qualitative analyses that the number of basic convolutions (i.e., "CNN6+CapsNet") and the capacity of capsules (i.e., "CNN1+CapsNet2 Wide" and "CNN1+CapsNet3 Wide") play important roles in determining the synthesis performance of improved capsule-based models in this study.

FIGURE 12. Boxplot of CBF calculated from real and synthesized ASL images obtained from all compared deep learning-based models in this study after applying the region-based PVE correction using 15 × 15 neighbors.

TABLE 2. Multiple comparison test in terms of CBF based on real and synthesized ASL images after conducting the 15 × 15 region-based PVE correction.

Another detailed quantitative analysis is carried out in Tables 1 and 2 based on multiple comparison tests from the statistical perspective. In each row of the tables, there are two kinds of evaluations. One is a single-value estimation of the difference (i.e., using the GoldenStandard minus one compared method); the other is a 95% confidence interval (i.e., CI). In statistics, a CI is a special form of interval estimator for a parameter (i.e., here, the difference of the GoldenStandard minus one compared method). Instead of estimating the parameter by a single value, a CI provides an interval estimation which is likely to include the estimated parameter within a specified interval. For instance, the single-value estimation of the difference between the GoldenStandard and "CNN6+CapsNet" in Table 1 is 2.9839, and the difference is likely to fall within a 95% CI of [1.3667, 4.6011]. Since this single-value estimation as well as the 95% CI (i.e., both the lower and upper bounds) are the closest to the GoldenStandard compared with those of the other synthesis models in Table 1, "CNN6+CapsNet" is superior. In Table 2, similar analyses can be conducted, and "CNN1+CapsNet3 Wide" is the closest to the GoldenStandard. Hence, properly setting the number of basic convolutions (i.e., "CNN6+CapsNet") and increasing the capacity of capsules (i.e., "CNN1+CapsNet3 Wide") are quantitatively suggested to be beneficial, based on the above region-based PVE corrections.
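For the interval estimates quoted above, a standard paired-difference construction would look like the following sketch (synthetic data; the exact multiple-comparison procedure behind Tables 1–3 is not spelled out in the text, so the t-based CI here is an assumption).

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
cbf_real = rng.normal(50.0, 8.0, size=355)             # golden-standard CBF per patient
cbf_synth = cbf_real - rng.normal(3.0, 4.0, size=355)  # CBF from one synthesis model

d = cbf_real - cbf_synth              # "GoldenStandard minus one compared method"
est = d.mean()                        # single-value estimation of the difference
half = stats.t.ppf(0.975, d.size - 1) * d.std(ddof=1) / np.sqrt(d.size)
print(f"difference = {est:.4f}, 95% CI = [{est - half:.4f}, {est + half:.4f}]")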
2) THE VOXEL-BASED PVE CORRECTION
It is necessary to point out that region-based PVE correction methods mainly rely on neighboring voxels, which are likely to bring about a blurring problem in ASL image correction outcomes, since a great number of voxels are actually shared when tackling the PVE problem on nearby voxels (i.e., the percentage of the same utilized voxels significantly increases from 80% to 93.33% when the neighbor size enlarges from 5 × 5 to 15 × 15 [2]–[4]). Therefore, an alternative "voxel-based" PVE correction method, which concentrates on each individual voxel itself without incorporating neighboring voxels when tackling the PVE problem, has received much popularity in recent years [2]–[4]. In this part, the synthesized ASL images obtained by all compared synthesis models undergo the rigorous voxel-based PVE correction test.

TABLE 3. Multiple comparison test in terms of CBF based on real and synthesized ASL images after conducting the voxel-based PVE correction.

FIGURE 13. Boxplot of CBF calculated from real and synthesized ASL images obtained from all compared deep learning-based models in this study after applying the voxel-based PVE correction.

Fig. 13 delineates the boxplot of CBF calculated from real ASL images and synthesized ASL images obtained by all compared models after applying the voxel-based PVE correction. It can be summarized that "CNN1+CapsNet3 Wide" has the box closest to the golden standard, which is a strong indicator that the ASL image synthesis outcomes obtained by "CNN1+CapsNet3 Wide" perform better than others after applying the voxel-based PVE correction. There are also other interesting observations from Fig. 13. For the comparison among "CNN1+CapsNet", "CNN6+CapsNet", and "CNN9+CapsNet", the box of "CNN6+CapsNet" is the highest. It further substantiates the previous conclusion that the depth of basic convolutions should be properly determined. For the comparison among "CNN1+CapsNet3", "CNN1+CapsNet4", and "CNN1+CapsNet5", the box of "CNN1+CapsNet4" is the highest, which also verifies the importance of properly determining the number of capsule layers. Also, both "CNN1+CapsNet2 Wide" and "CNN1+CapsNet3 Wide" demonstrate outstanding performance compared with their corresponding "narrow" deep learning models, which indicates that increasing the capacity of capsules in the improved capsule-based models of this study is beneficial for synthesizing high-quality ASL images.

Table 3 demonstrates statistics of the multiple comparison test based on real and synthesized ASL images after applying the voxel-based PVE correction. It can be noticed that "CNN1+CapsNet3 Wide" has the smallest difference from the golden standard (i.e., 4.2635 and [2.8274, 5.6996]), which quantitatively substantiates the previous conclusion in Fig. 13 from the statistical perspective. Another interesting comparison is shown in Fig. 14, in which M^C_GM obtained after applying the voxel-based PVE correction as well as the region-based PVE correction (i.e., with 5 × 5 neighbors) based on the same example AD patient are illustrated and compared. It can be observed that more brain details, which cannot be observed in M^C_GM after fulfilling the region-based PVE correction (i.e., Column II), can be clearly identified in M^C_GM after conducting the voxel-based PVE correction (i.e., Column I). Since the voxel-based PVE correction is helpful to preserve brain details (i.e., cortex regions are highly important in clinical dementia diseases diagnosis), ASL images after applying the voxel-based PVE correction are utilized in the following multi-modal MRI-based dementia diseases diagnosis experiments.

FIGURE 14. M^C_GM after fulfilling the voxel-based PVE correction (i.e., Column I) and the region-based PVE correction (i.e., Column II) obtained by different synthesis models based on one example AD patient in this study (i.e., Rows 1: Golden Standard, 2: CNN1+CapsNet, 3: CNN1+CapsNet2 Wide, 4: CNN1+CapsNet3, 5: CNN1+CapsNet3 Wide, 6: CNN1+CapsNet4, 7: CNN1+CapsNet5, 8: CNN6+CapsNet, 9: CNN9+CapsNet, 10: CNN7, 11: CNN12; images are displayed from the transverse view and the scale unit of color bars is a.u.).

3) MULTI-MODAL MRI-BASED DEMENTIA DISEASES DIAGNOSIS
In this part, all real / synthesized ASL images after fulfilling the voxel-based PVE correction are utilized to differentiate the progression of dementia diseases (i.e., AD, MCI, NCI). There are in total 7 diagnosis tools implemented in this study, including 4 well-known shallow learning-based tools and 3 popular deep learning-based tools.
The 4 shallow learning-based tools are linear regression (i.e., denoted as "LR"), support vector regression (i.e., denoted as "SVR"), triple-class SVM (i.e., denoted as "SVM"), and support vector ranking (i.e., denoted as "Ranking"). The 3 deep learning-based tools are denoted as "CNN-2", "CNN-20", and "ResNet-56", respectively. Implementation details are as follows. For the 4 shallow learning-based tools, low-level hand-crafted features suggested by senior clinicians in our group are utilized. Technically, 8 regions, including the left & right hippocampus, the left & right parahippocampal gyrus, the left & right putamen, and the left & right thalamus, are segmented out from both structural MRI and real / synthesized ASL images of each individual patient via the IBA-SPM toolbox [47]. Means and standard deviations of these regions are calculated to construct 16-dimensional low-level feature vectors for both structural MRI and ASL images. "SVR", "SVM", and "Ranking" are implemented using the popular SVM-light toolbox, and the well-known Gaussian radial basis function (i.e., Gaussian RBF) is adopted as the non-linear kernel, with its Gaussian width learned via the classic radius margin bound algorithm. For the 3 deep learning-based tools, their model structures are illustrated in Fig. 15. It can be observed that two-channel inputs simultaneously feeding both structural MRI and real / synthesized ASL images into these deep learning diagnosis tools are employed (i.e., multi-modal MRI-based diagnosis). It is necessary to mention that, when structural MRI is solely utilized, only the single channel for feeding structural MRI is active. Other settings of the 3 deep learning-based diagnosis tools include: batch size = 64, epoch = 90, and learning rate = 0.01.

FIGURE 15. Detailed structures of three deep learning-based dementia diseases diagnosis tools (i.e., from left to right: "CNN-2", "CNN-20", "ResNet-56").

Table 4 demonstrates statistics of accuracies of dementia diseases diagnosis based on structural MRI and real / synthesized ASL images of the 355 demented patients in this study via a 5-fold cross validation strategy. Technically, the 355 patients are divided into 5 groups of equal size (i.e., 71 patients per group). In each fold, 4 groups are utilized for training and the remaining group is incorporated for testing. It can be learned from Table 4 that the statistics of only incorporating structural MRI for dementia diseases diagnosis are considered as the reference for the performance evaluation. Based on average accuracies calculated from the diagnosis outcomes of the 7 different tools, it can be summarized that the performance of dementia diseases diagnosis degrades when adding synthesized ASL images obtained from "CNN7" (i.e., −39.67%), "CNN12" (i.e., −38.95%), "CNN9+CapsNet" (i.e., −52.48%), "CNN1+CapsNet3" (i.e., −21.68%), and "CNN1+CapsNet5" (i.e., −44.36%). It is reasonable to summarize that both simple deep learning models (e.g., "CNN7" and "CNN12") and unsuitable model designs in improved capsule-based models (i.e., too many basic convolutions in "CNN9+CapsNet", unsuitable numbers of capsule layers in "CNN1+CapsNet3" and "CNN1+CapsNet5", etc.) are not beneficial to improving the dementia diseases diagnosis performance when incorporating synthesized ASL images in this multi-modal MRI-based dementia diseases diagnosis task.
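A sketch of this protocol follows: 16-dimensional feature vectors (region means and standard deviations) classified with a triple-class RBF SVM under 5-fold cross validation. scikit-learn stands in here for the SVM-light toolbox used by the authors, and all arrays are synthetic placeholders.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_patients, n_regions = 355, 8        # 8 segmented regions per patient

# Mean and standard deviation per region -> 16-dimensional feature vector.
region_means = rng.normal(50.0, 10.0, (n_patients, n_regions))
region_stds = rng.normal(5.0, 1.0, (n_patients, n_regions))
features = np.hstack([region_means, region_stds])   # 355 x 16

labels = rng.integers(0, 3, n_patients)             # 0: NCI, 1: MCI, 2: AD (toy labels)

clf = SVC(kernel="rbf", gamma="scale")              # triple-class SVM with Gaussian RBF
print(cross_val_score(clf, features, labels, cv=5).mean())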
For the other improved capsule-based models, they all yield performance improvements when their synthesized ASL images are incorporated for dementia diseases diagnosis (i.e., +4.67% by "CNN1+CapsNet", +18.44% by "CNN6+CapsNet", +7.91% by "CNN1+CapsNet4", +17.38% by "CNN1+CapsNet2 Wide", and +23.03% by "CNN1+CapsNet3 Wide"). Among all these improved capsule-based models, "CNN1+CapsNet3 Wide" (i.e., +23.03%), "CNN1+CapsNet2 Wide" (i.e., +17.38%), and "CNN6+CapsNet" (i.e., +18.44%) are significantly better than the others. To be specific, "CNN1+CapsNet3 Wide" dominates when incorporating "LR", "Ranking", and "ResNet-56" as diagnosis tools. "CNN1+CapsNet2 Wide" performs the best when employing "CNN-2" and "CNN-20" as diagnosis tools. "CNN6+CapsNet" is superior when using "SVR" and "SVM" as diagnosis tools. It is also suggested from Table 4 that "CNN1+CapsNet3 Wide" achieves the highest average accuracy based on all diagnosis outcomes (i.e., 0.6299). Since "CNN1+CapsNet3 Wide" increases its number of capsule layers compared with "CNN1+CapsNet2 Wide" (i.e., whose average accuracy is 0.6010) as well as increases its capacity of capsules compared with "CNN1+CapsNet3" (i.e., whose average accuracy is 0.4010), its performance boost suggests that comprehensively and properly taking various issues (e.g., "the number of capsule layers" and "the capacity of capsules" as mentioned above) into consideration in constructing capsule-based networks should be beneficial to improving the ASL image synthesis performance in this study. The above conclusion can also be further substantiated through comparisons among "CNN1+CapsNet", "CNN6+CapsNet", and "CNN9+CapsNet", in which "CNN6+CapsNet" is superior (i.e., 0.6064 against 0.5359 and 0.2433). It is valuable to mention that the number of basic convolutions should also be properly chosen (i.e., according to our pre-trials, "CNN6+CapsNet" has the optimal performance when the number of basic convolutions is no larger than 20).

4) ADDITIONAL EXPERIMENTS OF ASL IMAGE SYNTHESIS AND DEMENTIA DISEASES DIAGNOSIS BASED ON THE POPULAR ADNI-1 DATASET
In this part, additional experiments based on the popular ADNI-1 dataset are conducted to further reveal the superiority of the improved capsule-based models proposed in this study. The public ADNI-1 dataset belongs to the important ADNI datasets family [5], and it also has an imbalanced class distribution (i.e., including 200 AD patients, 400 MCI patients, and 200 NCI patients), like the 355-demented-patients dataset. Since structural MRI images are available but ASL is not in the ADNI-1 dataset, the main task here is to synthesize ASL images from the available structural MRI images in the ADNI-1 dataset, based on the synthesis models learned from the previous 355-demented-patients dataset. The main purpose is to verify the effectiveness of incorporating previously unavailable but newly synthesized ASL images to improve the dementia diseases diagnosis performance in the ADNI-1 dataset.

FIGURE 16. The visualization of example synthesized ASL images obtained by "CNN1+CapsNet3 Wide" (i.e., 1st column), "CNN1+CapsNet4" (i.e., 2nd column), and "CNN6+CapsNet" (i.e., 3rd column) from sMRI of AD (i.e., 1st row), MCI (i.e., 2nd row), and NCI (i.e., 3rd row) patients based on the popular ADNI-1 dataset (i.e., realized by the 3D Slicer software with color renderings).
4) ADDITIONAL EXPERIMENTS OF ASL IMAGE SYNTHESIS AND DEMENTIA DISEASES DIAGNOSIS BASED ON THE POPULAR ADNI-1 DATASET
In this part, additional experiments based on the popular ADNI-1 dataset are conducted to further reveal the superiority of the improved capsule-based models proposed in this study. The public ADNI-1 dataset belongs to the important ADNI dataset family [5] and, like the 355-demented-patients dataset, has an imbalanced class distribution (i.e., 200 AD patients, 400 MCI patients, and 200 NCI patients). Since structural MRI was acquired but ASL is not available in the ADNI-1 dataset, the main task here is to synthesize ASL images from the available structural MRI images in ADNI-1, based on the synthesis models learned from the previous 355-demented-patients dataset. The main purpose is to verify the effectiveness of incorporating previously unavailable but newly synthesized ASL images to improve the dementia diseases diagnosis performance on the ADNI-1 dataset. Spatial resolutions of structural MRI images in the ADNI-1 dataset are set as 64 × 64 × 21, consistent with those in the first dataset of this study.

Three improved capsule-based models verified in Section IV-C.3, i.e., ‘‘CNN6+CapsNet’’, ‘‘CNN1+CapsNet4’’, and ‘‘CNN1+CapsNet3 Wide’’, are shortlisted for the additional experiments in this part. The reason is that each of these models concentrates on one of the three important issues in designing capsule-based networks (i.e., the depth of basic convolutions for ‘‘CNN6+CapsNet’’, the layer of capsules for ‘‘CNN1+CapsNet4’’, and the capacity of capsules for ‘‘CNN1+CapsNet3 Wide’’). Also, these three improved capsule-based models perform outstandingly, as suggested in Table 4. Fig. 16 provides the visualization of example synthesized ASL images obtained by these three capsule-based models, based on structural MRI images of AD, MCI, and NCI patients from the popular ADNI-1 dataset; the visualization is realized by the well-known 3D Slicer software with color renderings [48].

FIGURE 16. The visualization of example synthesized ASL images obtained by ‘‘CNN1+CapsNet3 Wide’’ (1st column), ‘‘CNN1+CapsNet4’’ (2nd column), and ‘‘CNN6+CapsNet’’ (3rd column) from structural MRI of AD (1st row), MCI (2nd row), and NCI (3rd row) patients, based on the popular ADNI-1 dataset (realized by the 3D Slicer software with color renderings).

It is valuable to point out that the synthesized ASL images obtained by the three improved capsule-based models are reasonable: hot regions (indicated by the red color) become less and less obvious with the progression of dementia diseases (i.e., NCI→MCI→AD). This complies well with the clinical understanding that blood flow tends to decrease (i.e., fewer hot regions) as the severity of dementia diseases increases.

TABLE 5. Statistics of accuracies in dementia diseases diagnosis based on structural MRI and synthesized ASL images obtained by three improved capsule-based models on the popular ADNI-1 dataset.

In Table 5, statistics of accuracies in dementia diseases diagnosis based on structural MRI and the synthesized ASL images obtained by the three improved capsule-based models on the popular ADNI-1 dataset are reported. The seven diagnosis tools utilized in Table 4 are implemented in Table 5 as well. The average accuracy of dementia diseases diagnosis based on structural MRI images obtained by actual scanning is 0.6228, which is utilized as the reference for later evaluations. Columns 3, 4, and 5 of Table 5 demonstrate the average accuracies of diagnosis based on both structural MRI and the synthesized ASL images obtained by ‘‘CNN1+CapsNet3 Wide’’, ‘‘CNN1+CapsNet4’’, and ‘‘CNN6+CapsNet’’, respectively. The last row of Table 5 (‘‘Boost’’) indicates the percentage of improvement compared with the reference, e.g., (0.7757 − 0.6228) / 0.6228 = +24.55%. The reliability of synthesized ASL images in dementia diseases diagnosis is suggested by the significant boosts of +24.55%, +18.53%, and +14.11% in Table 5, which are strong indicators of the superiority of the improved capsule-based models (i.e., ‘‘CNN1+CapsNet3 Wide’’, ‘‘CNN1+CapsNet4’’, and ‘‘CNN6+CapsNet’’) proposed in this study. Therefore, the reliability of the synthesized high-quality ASL images in dementia diseases diagnosis is substantiated on the popular ADNI-1 dataset as well.
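The ‘‘Boost’’ row can be reproduced directly from the reported averages. A minimal sketch follows; the helper name is ours, and only the accuracy values quoted above are used.

```python
def boost(accuracy, reference):
    """Relative improvement over the structural-MRI-only reference,
    as reported in the last row of Table 5."""
    return (accuracy - reference) / reference

REFERENCE = 0.6228  # diagnosis with actually scanned structural MRI only

# 'CNN1+CapsNet3 Wide' combined with structural MRI reaches 0.7757:
print(f"{boost(0.7757, REFERENCE):+.2%}")  # -> +24.55%
```

The boosts of the other two shortlisted models (+18.53% and +14.11%) follow from the same formula applied to their respective average accuracies.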
The superiority of the newly introduced capsule-based networks in this study has also been revealed through a comparison with the state-of-the-art in ASL image synthesis. In [3], an unbalanced and multi-channel-based deep learning model was proposed to fulfill the ASL image synthesis task, and it is generally acknowledged as the state-of-the-art in ASL image synthesis. The same 355-demented-patients dataset was incorporated in that study, and the same 7 diagnosis tools (i.e., ‘‘LR’’, ‘‘SVR’’, ‘‘SVM’’, ‘‘Ranking’’, ‘‘CNN-2’’, ‘‘CNN-20’’, and ‘‘ResNet-56’’) were also employed to fulfill the dementia diseases diagnosis task based on synthesized ASL images. It turns out that the average diagnosis accuracy of the state-of-the-art in [3] is 0.6248, whereas ‘‘CNN1+CapsNet3 Wide’’ in Table 4 achieves a higher average accuracy of 0.6299. Therefore, the superiority of the newly introduced capsule-based networks in this study is further revealed.

V. CONCLUSION
In this study, several improved capsule-based models are introduced to realize ASL image synthesis for the first time. Technically, three important issues in designing capsule-based models, including the number of basic convolutions, the number of capsule layers, and the capacity of capsules, are thoroughly investigated. It is suggested that high-quality synthesized ASL images can be obtained from the improved capsule-based models.

Although this study proposes improved capsule-based models for synthesizing ASL images from structural MRI images, these models can be extended to synthesize medical images of other modalities. Future efforts will be spent on synthesizing ultrasound images based on capsule-based models, which should be carefully designed or modified after taking the characteristics of ultrasound images (e.g., low SNR and low quality) into consideration.

ACKNOWLEDGMENT
(Mingyuan Luo and Xi Liu contributed equally to this work.)

REFERENCES
[1] E. S. Musiek, Y. Chen, M. Korczykowski, B. Saboury, P. M. Martinez, J. S. Reddin, A. Alavi, D. Y. Kimberg, D. A. Wolk, P. Julin, A. B. Newberg, S. E. Arnold, and J. A. Detre, ‘‘Direct comparison of fluorodeoxyglucose positron emission tomography and arterial spin labeling magnetic resonance imaging in Alzheimer’s disease,’’ Alzheimer’s Dementia, vol. 8, no. 1, pp. 51–59, Jan. 2012.
[2] W. Huang, ‘‘A novel disease severity prediction scheme via big pair-wise ranking and learning techniques using image-based personal clinical data,’’ Signal Process., vol. 124, pp. 233–245, Jul. 2016.
[3] W. Huang, M. Luo, X. Liu, P. Zhang, H. Ding, W. Xue, and D. Ni, ‘‘Arterial spin labeling images synthesis from sMRI using unbalanced deep discriminant learning,’’ IEEE Trans. Med. Imag., vol. 38, no. 10, pp. 2338–2351, Oct. 2019.
[4] W. Huang, S. Zeng, M. Wan, and G. Chen, ‘‘Medical media analytics via ranking and big learning: A multi-modality image-based disease severity prediction study,’’ Neurocomputing, vol. 204, pp. 125–134, Sep. 2016.
[5] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. Jack, W. Jagust, J. Q. Trojanowski, A. W. Toga, and L. Beckett, ‘‘The Alzheimer’s disease neuroimaging initiative,’’ Neuroimag. Clin., vol. 15, no. 4, pp. 869–877, Nov. 2005.
[6] A. Ahlgren, R. Wirestam, E. T. Petersen, F. Stahlberg, and L. Knutsson, ‘‘Partial volume correction of brain perfusion estimates using the inherent signal data of time-resolved arterial spin labeling,’’ NMR Biomed., vol. 27, no. 9, pp. 1112–1122, Jul. 2014.
[7] L. Hernandez-Garcia, A. Lahiri, and J. Schollenberger, ‘‘Recent progress in ASL,’’ NeuroImage, vol. 187, pp. 3–16, Feb. 2019.
[8] S. Sabour, N. Frosst, and G. Hinton, ‘‘Dynamic routing between capsules,’’ in Proc. Adv. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 3856–3866.
[9] A. Jimenez-Sanchez, S. Albarqouni, and D. Mateus, ‘‘Capsule networks against medical imaging data challenges,’’ in Proc. MICCAI Workshops on Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Granada, Spain: Springer, 2018, pp. 150–160.
[10] V. Kalesnykiene, J. Kamarainen, R. Voutilainen, J. Pietilä, H. Kälviäinen, and H. Uusitalo. DIARETDB1 Diabetic Retinopathy Database and Evaluation Protocol. Accessed: Oct. 1, 2020. [Online]. Available: https://www.it.lut.fi/project/imageret/diaretdb1/doc/diaretdb1_techreport_v_1_1.pdf
[11] B. Yu, Y. Wang, L. Wang, D. Shen, and L. Zhou, ‘‘Medical image synthesis via deep learning,’’ in Deep Learning in Medical Image Analysis (Advances in Experimental Medicine and Biology), vol. 1213. 2020, pp. 23–44, doi: 10.1007/978-3-030-33128-3_2.
[12] X. Yi, E. Walia, and P. Babyn, ‘‘Generative adversarial network in medical imaging: A review,’’ Med. Image Anal., vol. 58, Dec. 2019, Art. no. 101552.
[13] N. Cordier, H. Delingette, M. Le, and N. Ayache, ‘‘Extended modality propagation: Image synthesis of pathological cases,’’ IEEE Trans. Med. Imag., vol. 35, no. 12, pp. 2598–2608, Dec. 2016.
[14] Y. Huang, L. Shao, and A. F. Frangi, ‘‘Cross-modality image synthesis via weakly coupled and geometry co-regularized joint dictionary learning,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 815–827, Mar. 2018.
[15] A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, ‘‘Multimodal MR synthesis via modality-invariant latent representation,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 803–814, Mar. 2018.
[16] B. Yu, L. Zhou, L. Wang, Y. Shi, J. Fripp, and P. Bourgeat, ‘‘Ea-GANs: Edge-aware generative adversarial networks for cross-modality MR image synthesis,’’ IEEE Trans. Med. Imag., vol. 38, no. 7, pp. 1750–1762, Jul. 2019.
[17] T. Zhang, H. Fu, Y. Zhao, J. Cheng, M. Guo, Z. Gu, B. Yang, Y. Xiao, S. Gao, and J. Liu, ‘‘SkrGAN: Sketching-rendering unconditional generative adversarial networks for medical image synthesis,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 777–785.
[18] S. U. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Cukur, ‘‘Image synthesis in multi-contrast MRI with conditional generative adversarial networks,’’ IEEE Trans. Med. Imag., vol. 38, no. 10, pp. 2375–2388, Oct. 2019.
[19] D. Nie, R. Trullo, J. Lian, L. Wang, C. Petitjean, S. Ruan, Q. Wang, and D. Shen, ‘‘Medical image synthesis with deep convolutional adversarial networks,’’ IEEE Trans. Biomed. Eng., vol. 65, no. 12, pp. 2720–2730, Dec. 2018.
[20] I. Polycarpou, G. Soultanidis, and C. Tsoumpas, ‘‘Synthesis of realistic simultaneous positron emission tomography and magnetic resonance imaging data,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 703–711, Mar. 2018.
[21] Y. Wang, L. Zhou, B. Yu, L. Wang, C. Zu, D. S. Lalush, W. Lin, X. Wu, J. Zhou, and D. Shen, ‘‘3D auto-context-based locality adaptive multimodality GANs for PET synthesis,’’ IEEE Trans. Med. Imag., vol. 38, no. 6, pp. 1328–1339, Jun. 2019.
[22] Y. Wang, B. Yu, L. Wang, C. Zu, D. S. Lalush, W. Lin, X. Wu, J. Zhou, D. Shen, and L. Zhou, ‘‘3D conditional generative adversarial networks for high-quality PET image estimation at low dose,’’ NeuroImage, vol. 174, pp. 550–562, Jul. 2018.
[23] G. Zeng and G. Zheng, ‘‘Hybrid generative adversarial networks for deep MR to CT synthesis using unpaired data,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 759–767.
[24] H. Salehinejad, S. Valaee, T. Dowdell, E. Colak, and J. Barfett, ‘‘Generalization of deep neural networks for chest pathology classification in X-rays using generative adversarial networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 990–994.
[25] H. Salehinejad, E. Colak, T. Dowdell, J. Barfett, and S. Valaee, ‘‘Synthesizing chest X-ray pathology for training deep convolutional neural networks,’’ IEEE Trans. Med. Imag., vol. 38, no. 5, pp. 1197–1206, May 2019.
[26] Y. Zhou, S. Giffard-Roisin, M. De Craene, S. Camarasu-Pop, J. D’Hooge, M. Alessandrini, D. Friboulet, M. Sermesant, and O. Bernard, ‘‘A framework for the generation of realistic synthetic cardiac ultrasound and magnetic resonance imaging sequences from the same virtual patients,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 741–754, Mar. 2018.
[27] Y. Ren, Z. Zhu, Y. Li, D. Kong, R. Hou, L. Grimm, J. Marks, and J. Lo, ‘‘Mask embedding for realistic high-resolution medical image synthesis,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 422–430.
[28] G. Jiang, Y. Lu, J. Wei, and Y. Xu, ‘‘Synthesize mammogram from digital breast tomosynthesis with gradient guided cGANs,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 801–809.
[29] P. Costa, A. Galdran, M. I. Meyer, M. Niemeijer, M. Abramoff, A. M. Mendonca, and A. Campilho, ‘‘End-to-end adversarial retinal image synthesis,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 781–791, Mar. 2018.
[30] Y. Zhou, X. He, S. Cui, F. Zhu, L. Liu, and L. Shao, ‘‘High-resolution diabetic retinopathy image synthesis manipulated by grading and lesions,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 505–513.
[31] X. Wang, M. Xu, L. Li, Z. Wang, and Z. Guan, ‘‘Pathology-aware deep network visualization and its application in glaucoma image synthesis,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 423–431.
[32] A. Diaz-Pinto, A. Colomer, V. Naranjo, S. Morales, Y. Xu, and A. F. Frangi, ‘‘Retinal image synthesis and semi-supervised learning for glaucoma assessment,’’ IEEE Trans. Med. Imag., vol. 38, no. 9, pp. 2211–2218, Sep. 2019.
[33] T. Kanayama, Y. Kurose, K. Tanaka, K. Aida, S. Satoh, M. Kitsuregawa, and T. Harada, ‘‘Gastric cancer detection from endoscopic images using synthesis by GAN,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 530–538.
[34] A. F. Frangi, S. A. Tsaftaris, and J. L. Prince, ‘‘Simulation and synthesis in medical imaging,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 673–679, Mar. 2018.
[35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[36] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ 2015, arXiv:1512.03385. [Online]. Available: http://arxiv.org/abs/1512.03385
[37] O. Ronneberger, P. Fischer, and T. Brox, ‘‘U-Net: Convolutional networks for biomedical image segmentation,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Cham, Switzerland: Springer, 2015, pp. 234–241.
[38] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial networks,’’ 2014, arXiv:1406.2661. [Online]. Available: http://arxiv.org/abs/1406.2661
[39] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’ 2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.1784
[40] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge in a neural network,’’ 2015, arXiv:1503.02531. [Online]. Available: http://arxiv.org/abs/1503.02531
[41] G. Hinton, S. Sabour, and N. Frosst, ‘‘Matrix capsules with EM routing,’’ in Proc. Int. Conf. Learn. Represent., Vancouver, BC, Canada, 2018. [Online]. Available: https://openreview.net/forum?id=HJWLfGWRb
[42] M. Brant-Zawadzki, G. D. Gillan, and W. R. Nitz, ‘‘MP RAGE: A three-dimensional, T1-weighted, gradient-echo sequence–initial experience in the brain,’’ Radiology, vol. 182, no. 3, pp. 769–775, Mar. 1992.
[43] SPM12—Statistical Parametric Mapping Toolbox. Accessed: Oct. 1, 2020. [Online]. Available: https://www.fil.ion.ucl.ac.uk/spm/software/spm12/
[44] I. Asllani, A. Borogovac, and T. R. Brown, ‘‘Regression algorithm correcting for partial volume effects in arterial spin labeling MRI,’’ Magn. Reson. Med., vol. 60, no. 6, pp. 1362–1371, Dec. 2008.
[45] M. A. Chappell, A. R. Groves, B. J. MacIntosh, M. J. Donahue, P. Jezzard, and M. W. Woolrich, ‘‘Partial volume correction of multiple inversion time arterial spin labeling MRI data,’’ Magn. Reson. Med., vol. 65, no. 4, pp. 1173–1183, Feb. 2011.
[46] G. S. Pell, D. L. Thomas, M. F. Lythgoe, F. Calamante, A. M. Howseman, D. G. Gadian, and R. J. Ordidge, ‘‘Implementation of quantitative FAIR perfusion imaging with a short repetition time in time-course studies,’’ Magn. Reson. Med., vol. 41, no. 4, pp. 829–840, Apr. 1999.
[47] Individual Brain Atlases Using Statistical Parametric Mapping (IBA-SPM) Software. Accessed: Oct. 1, 2020. [Online]. Available: http://www.thomaskoenig.ch/Lester/ibaspm.htm
[48] 3D Slicer. Accessed: Oct. 1, 2020. [Online]. Available: https://www.slicer.org

WEI HUANG received the B.Eng. and M.Eng. degrees from the Harbin Institute of Technology, China, and the Ph.D. degree from Nanyang Technological University, Singapore. He then worked at the University of California San Diego, San Diego, CA, USA, and at the Agency for Science, Technology and Research, Singapore, as a Postdoctoral Research Fellow. He is currently with the Department of Computer Science and serves as the Director of the Informatization Office, Nanchang University, China. His main research interests include machine learning, pattern recognition, medical image processing, and multimedia. He has published nearly 100 academic journal/conference papers, including in the IEEE TRANSACTIONS ON MEDICAL IMAGING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, Pattern Recognition, MICCAI, and ACM Multimedia. He has been the principal investigator of 15 national/provincial grants, including three NSF-China projects and three NSF key projects in Jiangxi, China. He received the Jiangxi Provincial Natural Science Award, the Most Interesting Paper Award of ICME-ASMMC, and the Best Paper Award of MICCAI-MLMI.
He has also been named a Provincial Academic Leader and a Provincial Young Scientist of Jiangxi Province.

MINGYUAN LUO received the B.Eng. and M.Eng. degrees from Nanchang University, under the supervision of Prof. W. Huang. He has published several academic articles in well-known international journals and conference proceedings, including the IEEE TRANSACTIONS ON MEDICAL IMAGING, IEEE ACCESS, MICCAI, and ACM Multimedia. His research interests include medical image processing, machine learning, computer vision, and pattern recognition.

XI LIU received the B.Eng. degree in 2017 and the M.Eng. degree from Nanchang University, under the supervision of Prof. W. Huang. She has published several academic papers in well-known international journals and conference proceedings, including the IEEE TRANSACTIONS ON MEDICAL IMAGING, IEEE ACCESS, and MICCAI. Her research interests mainly include computer vision and pattern recognition.

PENG ZHANG received the B.E. degree from Xi’an Jiaotong University, China, in 2001, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2011. He is currently a Full Professor with the School of Computer Science, Northwestern Polytechnical University, China. He has published nearly 100 research articles, including in CVPR, ACM Multimedia, Neurocomputing, Signal Processing, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, and the IEEE TRANSACTIONS ON MEDICAL IMAGING. He has been the PI of three NSFC grants. His current research interests include computer vision, pattern recognition, and machine learning. He is also the Chief Scientist of Mekitec OY, Finland.

HUIJUN DING received the B.Eng. degree in electronic engineering and information science from the University of Science and Technology of China (USTC), Hefei, in 2006, and the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore, in 2011. She was then a Postdoctoral Research Fellow with the Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), before joining Shenzhen University, China, in 2013. Her current research interests include speech and image processing, objective measures, and nanomaterial-enabled acoustic devices.