Received September 19, 2020, accepted September 28, 2020, date of publication October 1, 2020, date of current version October 14, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3028113
Arterial Spin Labeling Image Synthesis From
Structural MRI Using Improved
Capsule-Based Networks
WEI HUANG 1,2, MINGYUAN LUO 3, XI LIU 1, PENG ZHANG 4, AND HUIJUN DING 3
1 Department of Computer Science, School of Information Engineering, Nanchang University, Nanchang 330031, China
2 Informatization Office, Nanchang University, Nanchang 330031, China
3 Lab of Medical UltraSound Image Computing, MUSIC, School of Biomedical Engineering, Shenzhen University, Shenzhen 518060, China
4 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
Corresponding author: Wei Huang (n060101@e.ntu.edu.sg)
This work was supported in part by the National Natural Science Foundation of China under Grant 61862043 and Grant 61971352, in part
by the Natural Science Foundation of Jiangxi Province under Grant S2020RCDT2K0033, and in part by the Natural Science Foundation of
Shaanxi Province under Grant 2018JM6015.
ABSTRACT Medical image synthesis has gained considerable popularity in recent years, as diverse deep learning models can synthesize ample medical images to alleviate the lack of data in many medical imaging applications. However, most medical image synthesis methods still incorporate the well-known pooling operation in their convolutional neural network-based / generative adversarial network-based models, through which image details are inevitably lost. To tackle this problem, improved capsule-based networks, in which no pooling operation is executed and the spatial details of images are effectively preserved thanks to the equivariance characteristics of capsule models, are proposed in this paper to synthesize arterial spin labeling images for the first time. Technically, three important issues in constructing improved capsule-based networks, namely the depth of basic convolutions, the number of capsule layers, and the capacity of capsules, are thoroughly investigated. Comprehensive experiments comprising region-based / voxel-based partial volume corrections and dementia disease diagnosis on two different datasets are conducted. The superiority of the improved capsule-based networks introduced in this paper is substantiated from the statistical point of view.
INDEX TERMS Image analysis, image generation, computer aided diagnosis, capsule, arterial spin labeling,
deep learning.
I. INTRODUCTION
Arterial spin labeling (ASL) is widely acknowledged as a non-invasive magnetic resonance imaging (MRI) technique, and it has received increasing research attention in the field of dementia disease diagnosis only in recent years [1]–[4]. It is also necessary to point out that ASL has become a popular indicator in dementia disease diagnosis for two reasons. First, ASL requires no injection of an external contrast enhancement agent (e.g., gadolinium) into patients during scanning, which completely avoids anaphylactoid reactions in elderly patients. Second, the cerebral blood flow (CBF) within each voxel is proportional to its ASL signal. In this way, the brain atrophy of demented patients can be directly
The associate editor coordinating the review of this manuscript and
approving it for publication was Jenny Mahoney.
VOLUME 8, 2020
revealed by the relatively low CBF of certain regions within ASL images, compared with those of normal people.
An illustration of acquiring ASL images via actual scanning is depicted in Fig. 1. Technically, an ASL image is produced as the direct difference between a label image and a control image. There are three key steps when acquiring ASL images via actual scanning. The first step is to acquire label images, which is displayed in subplot-a of Fig. 1. When arterial blood flows into Area 1 of subplot-a, it is magnetically labeled via a 180° radio-frequency inversion pulse, and the water molecules within the arterial blood are utilized as endogenous ‘‘tracers’’. These tracers then flow into Area 2 of subplot-a, driven by the blood circulation, after one transit time. The label images are acquired within Area 2.
The second step is to acquire control images. The acquisition
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
181137
W. Huang et al.: ASL Image Synthesis From Structural MRI Using Improved Capsule-Based Networks
FIGURE 1. An illustration of key steps in ASL images acquisition via
actual scanning and corresponding example images.
occurs in Area 4 of subplot-b, which is in fact the same region as Area 2 of subplot-a. The only difference is that Area 3 of subplot-b applies no radio-frequency inversion pulse as realized in Area 1; thus, the arterial blood in acquiring such control images is not magnetically labeled. The third step is to produce ASL images by subtracting label images from control images, which is illustrated in subplot-c of Fig. 1. Note that the green arrow in subplot-c represents control images obtained from Area 4 of subplot-b, while the orange arrow in subplot-c indicates label images obtained from Area 2 of subplot-a. Hence, the red arrow in subplot-c represents their direct difference, which denotes the ASL images. Furthermore, on the right-hand side of each subplot in Fig. 1, the corresponding example images (i.e., label images in subplot-a, control images in subplot-b, and ASL images in subplot-c) are also displayed from the transverse view. The scale unit of the color bars in these example images is mL/100 g/min.
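The control-minus-label subtraction described above can be sketched with synthetic volumes; the array shape matches the ASL resolution reported later in Section IV-A, but the intensity values below are illustrative, not real signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative volumes (64 x 64 x 21 matches the ASL resolution reported
# later in the paper); intensities here are synthetic, not real signal.
control = rng.uniform(90.0, 110.0, size=(64, 64, 21))
# Labeled inflowing blood slightly lowers the measured signal.
label = control - rng.uniform(0.5, 2.0, size=control.shape)

# An ASL (perfusion-weighted) image is the direct difference
# between the control image and the label image.
asl = control - label

print(asl.shape)  # (64, 64, 21)
```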
Although ASL is promising in dementia disease diagnosis, two challenging problems must be properly tackled before ASL can become truly applicable in clinical diagnosis. First, ASL is a relatively new imaging modality in dementia disease diagnosis. Thus, ASL images are unfortunately absent from many well-established image-based dementia datasets. For instance, the major imaging modalities in the well-known ADNI-1 / Go / 2 / 3 datasets include structural MRI, DWI (i.e., diffusion-weighted imaging), PET, etc. However, ASL is not among them [5]. This data insufficiency is certainly not beneficial for the thorough investigation and wide
utilization of ASL images. Second, even if ASL images can be extensively obtained, the notorious problem of partial volume effects (PVE), which is essential in ASL image processing, needs to be carefully tackled [6], [7]. Generally speaking, PVE can be regarded as the loss of apparent activity closely related to the signal contamination problem, and the limited resolution of an imaging system is often considered one of its main causes [2]–[4], [6], [7].
It can be summarized from the above descriptions that, in order to make ASL images more applicable and reliable in contemporary dementia disease diagnosis, ASL images should be synthesized from another, more common imaging modality (i.e., to solve Problem 1), and those synthesized ASL images still need to be corrected with regard to the PVE problem (i.e., to solve Problem 2). In this study, several improved capsule-based models are proposed. They aim to synthesize ASL images from structural MRI images. These synthesized ASL images also undergo rigorous PVE correction tests to prove their reliability.
It is worth mentioning that the capsule model [8] was investigated to realize retinal image synthesis in [9], where the public DIARETDB1 retinal image dataset [10] was incorporated to train the capsule model for the retinal image synthesis task. Nevertheless, our study is significantly different from that work, as the ASL image synthesis problem has never been solved via capsule-based models, and the models here are newly proposed after thoroughly investigating three important factors in constructing the capsule network (i.e., the depth of basic convolutions, the number of capsule layers, and the capacity of capsules). The organization of this paper is as follows. In Section II, a review of recent developments in deep learning-based medical image synthesis is comprehensively introduced. In Section III, the technical details of the improved capsule-based models newly introduced in this study are elaborated. In Section IV, extensive experiments are conducted to substantiate the superiority of the improved capsule-based models, both qualitatively and quantitatively. Comprehensive analyses based on the 355-demented-patients dataset constructed by our group as well as the public ADNI-1 dataset are executed from the statistical perspective. In Section V, the conclusion of this study is drawn.
The main contributions of this study can be summarized as follows.
• First, it is the first attempt to successfully synthesize ASL images using capsule-based networks.
• Second, three important issues in designing capsule-based networks, including the depth of basic convolutions, the number of capsule layers, and the capacity of capsules, are thoroughly investigated for shortlisting optimal model architectures in this synthesis study.
• Third, the effectiveness of improved capsule-based networks for synthesizing ASL images has been comprehensively verified through a series of rigorous tests, which are composed of region-based PVE, voxel-based PVE, and dementia disease diagnosis tests.
It is promising that the dementia disease diagnosis performance based on both a 355-demented-patients dataset and the popular ADNI-1 dataset can be significantly improved with the help of the synthesized ASL images obtained from the improved capsule-based networks introduced in this study.
II. A REVIEW OF RECENT DEVELOPMENTS IN DEEP
LEARNING-BASED MEDICAL IMAGE SYNTHESIS
It is widely acknowledged that deep learning models often need to be trained on large-scale datasets in order to achieve high generalization performance. Although large-scale datasets composed of general images are commonly seen nowadays, it is still difficult to build up large-scale medical image datasets, for the following reasons. First, patient issues are of high concern in the medical imaging domain: informed consent needs to be obtained from patients, and the data de-sensitization operation is often indispensable. Second, the increasing costs of acquiring medical images, such as MRI, CT (i.e., computed tomography), PET (i.e., positron emission tomography), etc., when building up large-scale medical image datasets of multiple modalities cannot be neglected. Other adverse factors, including patient discomfort during prolonged scanning, the unavailability of full-set scanners, etc., further hinder the construction of large-scale medical image datasets at the current stage.
Although large-scale medical image datasets acquired by actual scanning are challenging to construct, high-quality medical images can alternatively be synthesized [11], [12]. Recent studies have demonstrated that diverse medical image modalities, including MRI images [13]–[19], PET images [20]–[22], CT / X-ray images [23]–[25], ultrasound images [26], mammography images [27], [28], eye (including retinal, fundus, and glaucoma) images [29]–[32], and endoscopic images [33], can be successfully synthesized. Generally, machine learning techniques are widely acknowledged to have a profound impact on medical image synthesis, and the synthesis task itself can be considered as finding a good mapping from the source image to the target image [34]. As machine learning techniques have evolved in recent years from shallow learning to deep learning, most medical image synthesis studies can also be categorized into shallow learning-based synthesis and deep learning-based synthesis.
Shallow learning-based medical image synthesis studies rely heavily on well-known shallow learning techniques. For instance, the classic Bayesian model was utilized in [13] to construct a modality propagation scheme to synthesize MRI images with glioma-bearing brains. Also, the classic idea of dictionary learning, which was popular in the era of shallow learning, was utilized in [14] to propose a weakly coupled and geometry co-regularized joint model for MRI image synthesis. It is also necessary to mention that the mapping from the source image to the target image
is often explicitly represented by shallow learning models in these studies, whose model structures are often not as sophisticated as those of deep learning models. Therefore, the generalization capability of the shallow learning models in these studies still has room for further improvement.
For deep learning-based medical image synthesis studies, the capability of implicitly representing non-linear and complicated mappings from the source image to the target image has been greatly improved, thanks to sophisticated deep learning models. It is also important to point out that the current trend in deep learning has been evolving from deep discriminative learning to deep generative learning. Thus, most deep learning-based medical image synthesis studies can be further divided into deep discriminative learning-based synthesis and deep generative learning-based synthesis.
For deep discriminative learning-based synthesis studies, many well-established deep discriminative learning models, including the CNN (i.e., convolutional neural network) [35] and ResNet (i.e., deep residual network) [36], have been vastly investigated and utilized. For instance, both CNN and ResNet models were incorporated as building blocks to construct a U-net model [37] for fulfilling the multi-modal MRI image synthesis task in [15]. For deep generative learning-based synthesis studies, either the original GAN model itself [38] or its diverse derivatives have been widely incorporated. Some representative studies belonging to this category are introduced as follows. In [16], an edge-aware GAN model, which integrated the important edge information to delineate the boundaries of objects in medical images, was introduced to realize MRI image synthesis. Similarly, the important edge information was represented via a sketch guidance module in [17], bringing about a SkrGAN model for synthesizing various medical image modalities. In [18], both the pixel-wise loss and the perceptual loss were fused within a conditional GAN model [39] to construct a new pGAN model for realizing multi-contrast MRI image synthesis. In [21] and [22], a locality-adaptive fusion mechanism and a U-net model were separately incorporated within the 3D conditional GAN model [39] to synthesize high-quality PET images from their corresponding low-dose ones. It can be summarized from most contemporary deep learning-based medical image synthesis studies that both conventional deep discriminative learning models (e.g., CNN and ResNet) and popular deep generative learning models (e.g., GAN and its many derivatives) have been receiving much popularity.
However, it is also necessary to point out that most of the above-mentioned deep learning-based studies still heavily incorporate the well-known pooling operation in their CNN-based / GAN-based models, through which image details are inevitably lost. As a result, the quality of the synthesized images deteriorates. In order to tackle this challenging problem, improved capsule-based networks, in which no pooling operation is executed and the spatial details of images are effectively
preserved thanks to the equivariance characteristics of capsule models [8], are introduced in this study to realize the ASL image synthesis task. Technical details are elaborated as follows.
III. METHODOLOGY
A. THE ARCHITECTURE OF THE ORIGINAL CAPSULE MODEL
FIGURE 2. The basic architecture of the original capsule model.
The architecture of the original capsule model is illustrated in Fig. 2. It is obvious that the original capsule model is mainly composed of three sub-structures. The first one is denoted as ‘‘ReLu Conv.1’’ in Fig. 2, in which W, H, C, and D stand for the width of the image, the height of the image, the channel of the model (i.e., equivalent to the image depth), and the dimension of the capsule, respectively. The main purpose here is to convert the original image patch input (i.e., of dimension W × H × C) into its latent and primary representation (i.e., of dimension W′ × H′ × C′) using a one-time convolution (i.e., Conv.1) followed by a rectified linear unit (i.e., ReLu). It can be perceived that the first sub-structure of the original capsule model actually performs the same type of latent representation conversion as the conventional CNN model. Moreover, the second sub-structure of the original capsule model is denoted as the ‘‘PrimaryCaps’’, which has N capsules (i.e., N = W′′ × H′′ × C′′ in Fig. 2) to perform vectorial convolutions. It is important to point out that vectorial convolutions are significantly different from the conventional scalar convolutions adopted in CNN, and the outcome of the vectorial convolution realized by one capsule in the ‘‘PrimaryCaps’’ is often of D dimensions. After that, the third sub-structure of the original capsule model, which is denoted as the ‘‘Capsules Layer’’, is realized by a non-linear squashing activation function and the classic dynamic routing scheme [8].
It can be summarized from the above description that the latent and primary representation obtained by ‘‘ReLu Conv.1’’ actually acts as an input of the two following core sub-structures in the original capsule model. Also, the ‘‘PrimaryCaps’’ and ‘‘Capsules Layer’’ core sub-structures together perform as the semantic bridge between the primary representation and the high-level output (i.e., of dimension N′ × D′) of the capsule model. Since all three sub-structures play important roles in the capsule model, they need to be comprehensively investigated and carefully designed to guarantee optimal synthesis performance in this study.
B. IMPROVED CAPSULE-BASED MODELS FOR REALIZING ASL IMAGE SYNTHESIS
Three important issues, including the number of basic convolutions, the number of capsule layers, and the capacity of capsules, are emphasized and carefully considered to guarantee the optimal synthesis performance of the improved capsule-based models in this study. Details are as follows.
1) THE NUMBER OF BASIC CONVOLUTIONS
Fig. 3 illustrates three improved capsule-based models after taking the number of basic convolutions into consideration. It can be noticed that the numbers of basic convolutions before the ‘‘PrimaryCaps’’ sub-structure in the three models vary significantly (i.e., 1, 6, and 9) in Fig. 3. ‘‘CNN1+CapsNet’’ in Fig. 3 is similar to the original capsule model in Fig. 2, except that two BN (i.e., batch normalization) operations are executed before and after the basic convolution to alleviate the notorious overfitting problem. ‘‘CNN6+CapsNet’’ and ‘‘CNN9+CapsNet’’ obviously contain more basic convolutions before the ‘‘PrimaryCaps’’ sub-structure than ‘‘CNN1+CapsNet’’. The motivation here is to investigate the optimal setting of basic convolutions in improved capsule-based models for obtaining latent and primary features.
2) THE NUMBER OF CAPSULE LAYERS
Fig. 4 illustrates three other improved capsule-based models, after taking the number of capsule layers into consideration. It can be observed that all three models in Fig. 4 have one basic convolution (i.e., ‘‘CNN1’’) for obtaining the latent and primary representation from the original input. However, their main difference resides in the number of capsule layers utilized in the ‘‘Capsules Layer’’ sub-structure. To be specific, there are 2, 3, and 4 capsule layers in ‘‘CNN1+CapsNet3’’, ‘‘CNN1+CapsNet4’’, and ‘‘CNN1+CapsNet5’’, respectively. Since capsules can be viewed as a special form of neurons in CNN, incorporating different numbers of capsule layers in improved capsule-based models is similar to adopting different numbers of scalar convolution-based layers in CNN. In this way, high-level latent representations obtained by multi-layer vectorial convolutions can be obtained. Similar to the previous discussion about the number of basic convolutions, the motivation for adopting different numbers of capsule layers here is to find out the optimal setting of vectorial convolution layers in improved capsule-based models for fulfilling ASL image synthesis.
FIGURE 4. Three improved capsule-based models after taking the layer of
capsules into consideration (i.e., from left to right: ‘‘CNN1+CapsNet3’’,
‘‘CNN1+CapsNet4’’, ‘‘CNN1+CapsNet5’’).
FIGURE 3. Three improved capsule-based models after taking the number
of basic convolutions into consideration (i.e., from left to right:
‘‘CNN1+CapsNet’’, ‘‘CNN6+CapsNet’’, ‘‘CNN9+CapsNet’’).
3) THE CAPACITY OF CAPSULES
The capacity of capsules is proposed in this study as a new concept for the first time. The capacity can be explicitly defined as [the number of capsules × the dimension of capsules] (e.g., N × D and N′ × D′ in Fig. 2). Fig. 5 illustrates two improved capsule-based models, after taking the capacity of capsules into consideration when fulfilling the ASL image synthesis task. To be specific, the increase of capacity is realized by adding more capsules to the ‘‘PrimaryCaps’’ sub-structure, which makes the capsule-based model wider. An illustration of the comparison between the wide capsule model and the original ‘‘narrow’’ capsule model is depicted in Fig. 6. Detailed explanations are as follows. For the original capsule model in Fig. 6, its capacities of capsules in the ‘‘PrimaryCaps’’
and ‘‘Capsules Layer’’ sub-structures can be calculated as 80 (i.e., 8 × 10) and 160 (i.e., 16 × 10), respectively, which makes the flow within the original capsule model ‘‘from narrow to wide’’ (i.e., 80→160). After increasing the number of capsules in the ‘‘PrimaryCaps’’ sub-structure of the wide capsule model from 10 to 64, its ‘‘PrimaryCaps’’ capacity reaches 512 (i.e., 8 × 64), which makes the flow within the wide capsule model ‘‘from wide to narrow’’ instead (i.e., 512→160).
The reason why this new ‘‘from wide to narrow’’ flow in the wide capsule model is more favorable can be explained as follows. Since the ‘‘PrimaryCaps’’ and ‘‘Capsules Layer’’ sub-structures together act as a semantic bridge between the primary latent representation obtained from basic convolutions and the high-level understanding output of the capsule model, this new ‘‘from wide to narrow’’ flow actually complies well with the classic FC (i.e., fully-connected) operation in CNN, which also maps a great number of feature maps into a limited number of groups. Therefore, this new ‘‘from wide to narrow’’ flow brought about by increasing the capacity of capsules can actually be considered as an alternative FC in capsule-based models.
FIGURE 5. Two improved capsule-based models after taking the capacity of capsules into consideration (i.e., from left to right: ‘‘CNN1+CapsNet2 Wide’’, ‘‘CNN1+CapsNet3 Wide’’).
FIGURE 6. The comparison between the wide capsule model and the original ‘‘narrow’’ capsule model.
The mathematical proof related to the improvement of the wide capsule model over the original ‘‘narrow’’ capsule model is as follows. Denote the numbers of capsules in the ‘‘PrimaryCaps’’ and ‘‘Capsules Layer’’ of Fig. 6 as N_1 and N_2, respectively. Furthermore, denote the dimensions of capsules in the ‘‘PrimaryCaps’’ and ‘‘Capsules Layer’’ of Fig. 6 as D_1 and D_2, respectively. For the wide capsule model, the following two constraints hold: N_1 > N_2 and D_1 < D_2. The reason is that high-level layers in a capsule network (i.e., the ‘‘Capsules Layer’’ of Fig. 6) normally aim to represent more complicated objects; therefore, D_2 should be larger than D_1 to describe this complexity (i.e., D_1 < D_2). Furthermore, since the capacity in the wide capsule model decreases from the ‘‘PrimaryCaps’’ to the ‘‘Capsules Layer’’ (i.e., N_1 D_1 > N_2 D_2), the other constraint, N_1 > N_2, is met.
Also, subtracting the same term (i.e., N_2 D_1) from both sides of the inequality N_1 D_1 > N_2 D_2 yields N_1 D_1 − N_2 D_1 > N_2 D_2 − N_2 D_1. After factoring out the common factors, the inequality (N_1 − N_2) D_1 > (D_2 − D_1) N_2 can be obtained. Therefore, the final inequality can be easily derived as follows.
(N_1 − N_2) / N_2 > (D_2 − D_1) / D_1.  (1)
It can be read from the above inequality that the relative decrease in the number of capsules in the wide capsule model is larger than the relative increase in the dimension of capsules. In other words, when representing more complicated objects in the wide capsule model, there is no need to include all low-level characteristics (i.e., latent features) of such objects. The above finding is similar to the idea of distillation in deep learning [40], and the classic FC operation in CNN models mentioned above also complies with the same idea. It is also worth highlighting that the above new findings are proposed in this study for the first time.
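Inequality (1) can be checked numerically with the capsule counts and dimensions read off the wide model in Fig. 6; treating N1 = 64, D1 = 8 (‘‘PrimaryCaps’’) and N2 = 10, D2 = 16 (‘‘Capsules Layer’’) as illustrative values:

```python
# Capsule counts/dimensions read off the wide model in Fig. 6
# (treated here as illustrative values).
N1, D1 = 64, 8    # "PrimaryCaps": 64 capsules of dimension 8 -> capacity 512
N2, D2 = 10, 16   # "Capsules Layer": 10 capsules of dimension 16 -> capacity 160

assert N1 * D1 > N2 * D2       # "from wide to narrow" capacity flow
assert N1 > N2 and D1 < D2     # the two stated constraints

lhs = (N1 - N2) / N2           # relative decrease in capsule number
rhs = (D2 - D1) / D1           # relative increase in capsule dimension
print(lhs, rhs, lhs > rhs)     # 5.4 1.0 True
```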
C. THE ENERGY FUNCTION TO REALIZE MEDICAL IMAGE SYNTHESIS BASED ON IMPROVED CAPSULE NETS
In this study, the voxel-wise deviations between real ASL images and their corresponding synthesized ASL images are utilized to construct the energy function that trains the improved capsule-based models. Since real ASL images often include outliers, the sparsity-induced L1 loss is utilized as the regression loss, rather than the conventional L2 loss, which is more sensitive to outliers. Furthermore, as the original L1 loss is not smooth at zero, the smooth L1 loss is chosen as an alternative in this study.
Let τ_1 denote the real structural MRI image set and τ_2 the real ASL control (or label) image set in this study; s_i ∈ τ_1 and a_i ∈ τ_2 are the structural MRI image and the real ASL control (or label) image of the i-th patient, respectively. Given the mapping function from structural MRI images to ASL control (or label) images as S(·), a′_i = S(s_i) represents the synthesized ASL control (or label) image from the structural MRI image s_i. The smooth L1 loss can be mathematically described as Eq. 2.
described as Eq. 2.
L=
m
m
i=1
i=1
1X
1X
H (ai , a′i ) =
H (ai , S(si )),
m
m
(2)
where m indicates the number of real structural MRI / ASL control (or label) image pairs. It is easy to perceive that Eq. 2 aims to minimize the mean error between synthesized and real ASL control (or label) images. Moreover, H in Eq. 2 can be efficiently computed using the piece-wise function described in Eq. 3.
H(x, y) = (1/n) Σ_{i=1}^{n} { (1/2)(x_i − y_i)², if |x_i − y_i| ≤ 1; |x_i − y_i| − 1/2, otherwise },  (3)
Optimizing the smooth L1 loss of the improved capsule-based models in this study can be realized by the conventional SGD (i.e., stochastic gradient descent) algorithm. Also, other more efficient algorithms, such as the PGD (i.e., proximal gradient descent) algorithm, can be conveniently incorporated based on Eqs. 2 & 3 in this study as well.
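Eqs. 2 and 3 can be implemented in a few lines; the numpy sketch below is a minimal illustration of the smooth L1 loss with synthetic data, not the authors' actual training code.

```python
import numpy as np

def smooth_l1(x, y):
    """Smooth L1 loss of Eq. 3: quadratic for small errors (|d| <= 1),
    linear (and hence outlier-robust) for large errors."""
    d = np.abs(x - y).ravel()
    per_elem = np.where(d <= 1.0, 0.5 * d ** 2, d - 0.5)
    return per_elem.mean()

# Synthesized vs. real ASL patches for m = 3 "patients" (synthetic data).
rng = np.random.default_rng(0)
real = [rng.normal(size=(8, 8)) for _ in range(3)]
synth = [a + rng.normal(scale=0.1, size=a.shape) for a in real]

# Eq. 2: mean of per-patient smooth L1 losses.
L = np.mean([smooth_l1(a, s) for a, s in zip(real, synth)])
print(L >= 0.0)
```

Note how the linear branch caps the gradient magnitude at 1 for large residuals, which is exactly why this loss is less sensitive to outlying ASL voxels than the L2 loss.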
It is also valuable to point out that, capsule-based networks
can work as clustering models, in which the clustering in each
iteration is actually fulfilled by the dynamic routing in the
capsule model [41]. In this way, excellent performance of
capsule-based models can be expected.
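This clustering view of dynamic routing can be illustrated with a toy numpy sketch: capsule predictions are fused into a single output capsule via an inner-product softmax followed by a squashing step. The single-output simplification, the capsule sizes, and the iteration count are all illustrative assumptions, not the exact routing procedure of [8] or the EM routing of [41].

```python
import numpy as np

def squash(v, eps=1e-9):
    n2 = np.sum(v ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def route(u_hat, iters=3):
    """Toy dynamic routing: predictions u_hat (N_in, D) are fused into one
    output capsule v, with coupling weights that behave like soft cluster
    assignments around the running center v."""
    logits = np.zeros(u_hat.shape[0])
    for _ in range(iters):
        p = np.exp(logits) / np.exp(logits).sum()     # softmax coupling
        v = squash((p[:, None] * u_hat).sum(axis=0))  # weighted fusion + squash
        logits = u_hat @ v                            # agreement <u_hat, v>
    return v

u_hat = np.random.default_rng(2).normal(size=(10, 16))  # 10 predictions, dim 16
v = route(u_hat)
print(v.shape, float(np.linalg.norm(v)) < 1.0)
```

Each iteration re-centers v on the predictions that agree with it, which is the clustering behavior the dynamic routing argument relies on.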
D. THE MATHEMATICAL PROOF RELATED TO THE EXCELLENT PERFORMANCE OF CAPSULE-BASED NETWORKS
For conventional CNN models, it is necessary to learn diverse latent features via the various layers and multiple neurons of the model. In order to ensure generalization capability, the numbers of layers and neurons within contemporary CNN models are normally enormous, and the amount of data needed to train such a complicated model is also very large. For capsule-based networks, however, capsules can be shared thanks to their equivariance characteristics. Therefore, the number of shared capsules does not have to be large, and it is practicable to train capsule-based networks with a smaller amount of data.
The mathematical proof related to the excellent performance of capsule-based networks is as follows. Denote the output of the i-th capsule on the l-th layer in a capsule-based network as u_i, and the weight between the i-th capsule on the l-th layer and its connected j-th capsule on the (l + 1)-th layer as W_ij. The input of the j-th capsule on the (l + 1)-th layer can be calculated as follows.
û_{j|i} = W_ij u_i.  (4)
Moreover, given the output of the j-th capsule on the (l + 1)-th layer as v_j, the probability that the input û_{j|i} influences the output v_j can be mathematically calculated as follows.
p_{i|j} = exp⟨û_{j|i}, v_j⟩ / Σ_i exp⟨û_{j|i}, v_j⟩,  (5)
where ⟨·, ·⟩ denotes the inner product. Hence, v_j can be further represented as follows.
v_j = squash(Σ_i p_{i|j} û_{j|i}) = squash(Σ_i (exp⟨û_{j|i}, v_j⟩ / Σ_i exp⟨û_{j|i}, v_j⟩) û_{j|i}),  (6)
in which v_j can be considered as a high-level feature fusion made up of multiple low-level features û_{j|i} fed in from different capsules i. It is very important to mention that the above fusion procedure is quite similar to the well-known clustering procedure, in which v_j can be viewed as the clustering center of the different û_{j|i}. Meanwhile, the distance between the clustering center and the different components is proportional to the inner product ⟨û_{j|i}, v_j⟩. To sum up, the capsule operation is a generalized clustering operation.
The above mathematical proof can also be substantiated by a recently published paper [41], in which the essential dynamic routing in capsule-based networks was solved via EM routing based on a Gaussian mixture model (GMM). This further ensures that the capsule operation is a generalized clustering operation, and the excellent performance of capsule-based networks can be expected from the clustering perspective.
IV. EXPERIMENTAL ANALYSES
A. DATA ACQUISITION
The improved capsule-based models for realizing ASL image synthesis in this study have been comprehensively evaluated on a multi-modal MRI image dataset constructed by our group. There are in total 355 real patients in this dataset, including 38 AD (i.e., Alzheimer's disease) patients, 185 MCI (i.e., mild cognitive impairment) patients, and 132 NCI (i.e., non-cognitive impairment) patients as normal controls. The average age of these patients is 70.56 ± 7.20 years. For the MRI images of the dataset, both high-resolution MPRAGE (i.e., magnetization prepared rapid acquisition gradient echo) T1-weighted MRI images (utilized as structural MRI) [42] and pseudo-continuous ASL images were acquired from each individual patient, via a SIEMENS 3T TIM Trio MR scanner equipped at the affiliated hospital of Nanchang University. The acquisition parameters mainly include: labeling duration = 1500 ms, post-labeling delay = 1500 ms, TR/TE = 4000/9.1 ms, ASL voxel size = 3 × 3 × 5 mm³. The spatial resolutions of the structural MRI and ASL images are 192 × 256 × 256 and 64 × 64 × 21, respectively. The labeling of dementia diseases (i.e., AD / MCI / NCI) in this dataset was fulfilled by our senior clinicians by consensus, and informed consent was obtained from all patients for conducting this study.
Moreover, both qualitative (i.e., Subsection IV-B) and
quantitative evaluations (i.e., Subsection IV-C) have been
conducted from the statistical point of view in this study.
Essential pre-processing based on raw MRI images, including
motion corrections, intra-modality registrations (i.e., within
structural MRI and ASL images independently, using the
first slice as the reference slice), inter-modality registrations (i.e., from ASL images to structural MRI images), etc.
have been conducted using the popular SPM toolbox [43].
Furthermore, in Section IV-C.4, additional experiments of
ASL image synthesis based on the well-known ADNI-1
dataset are conducted to further substantiate the superiority
of improved capsule-based models proposed in this study.
B. QUALITATIVE EVALUATIONS
All 8 improved models introduced in Section III-B are
capsule-based, and they are compared with conventional
CNN-based models, including a 7-layer CNN model (i.e.,
denoted as ‘‘CNN7’’) and a 12-layer CNN model (i.e.,
denoted as ‘‘CNN12’’), for synthesizing ASL images from
structural MRI images in this part.

W. Huang et al.: ASL Image Synthesis From Structural MRI Using Improved Capsule-Based Networks

FIGURE 7. Main structures of two CNN-based models for comparison in
ASL image synthesis of this study.

Main structures of the two compared CNN-based models are
illustrated in Fig. 7.
It is important to point out that fewer FCs are incorporated in
‘‘CNN7’’ than in ‘‘CNN12’’, so that a potential risk of overfitting
can be avoided. The reasons to incorporate ‘‘CNN7’’ and
‘‘CNN12’’ for comparisons in this study are as follows. Among
the improved capsule-based models introduced in this
study, the maximum number of layers is 11 (i.e., ‘‘CNN9+CapsNet’’)
and the minimum is 3 (i.e., ‘‘CNN1+CapsNet’’). Therefore,
‘‘CNN7’’ and ‘‘CNN12’’ are compared, as they are
equipped with numbers of layers comparable to the introduced
capsule-based models in this study. It is also valuable to mention that
there is no pooling operation and the stride equals 1 in both the
‘‘CNN7’’ and ‘‘CNN12’’ models. This is the optimal setting,
in which no loss of information occurs within the two compared models.
Therefore, their performance in synthesizing ASL images
should be more appreciated than that of other models with non-unit
strides (e.g., stride = 2) and pooling (e.g., max-pooling)
operations.
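Why the unit-stride, no-pooling setting loses no spatial information can be checked with the standard convolution output-size formula; the kernel size and padding below are illustrative assumptions, not the actual configurations of ‘‘CNN7’’ / ‘‘CNN12’’:

```python
def conv_out(i, k, s=1, p=0):
    # Output spatial size of a convolution: floor((i + 2p - k) / s) + 1.
    return (i + 2 * p - k) // s + 1

# Unit stride with 'same' padding preserves the feature-map resolution,
# so no spatial detail is discarded between layers:
print(conv_out(28, k=3, s=1, p=1))   # 28

# A stride of 2 (or, equivalently, pooling) halves the map and throws
# spatial detail away, which is what the compared models here avoid:
print(conv_out(28, k=3, s=2, p=1))   # 14
```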
Furthermore, implementation details of all above-mentioned
10 deep learning-based models for synthesizing ASL images
are described as follows. The patch size of structural MRI
images utilized as the input of all improved capsule-based
models is consistently 28 × 28 × 12. Batch sizes of all
improved capsule-based models are 32 (i.e., for ‘‘CNN1+CapsNet2 Wide’’),
128 (i.e., for ‘‘CNN1+CapsNet’’, ‘‘CNN1+CapsNet3’’,
‘‘CNN1+CapsNet4’’, and ‘‘CNN9+CapsNet’’), 192 (i.e., for
FIGURE 8. ASL image synthesis outcomes obtained by all compared
models based on one example AD patient in this study (i.e., Rows 1:
golden standard, 2: CNN1+CapsNet, 3: CNN1+CapsNet2 Wide, 4:
CNN1+CapsNet3, 5: CNN1+CapsNet3 Wide, 6: CNN1+CapsNet4, 7:
CNN1+CapsNet5, 8: CNN6+CapsNet, 9: CNN9+CapsNet, 10: CNN-7, 11:
CNN-12; Columns I: ASL control images, II: ASL label images; images are
displayed from the transverse view and the scale unit of color bars is a.u.).
‘‘CNN1+CapsNet3 Wide’’), 256 (i.e., for ‘‘CNN6+Caps
Net’’), and 512 (i.e., for ‘‘CNN1+CapsNet5’’), respectively.
Epochs of improved capsule-based models are 6 (i.e., for
‘‘CNN1+CapsNet3 Wide’’), 10 (i.e., for ‘‘CNN1+Caps
Net’’, ‘‘CNN1+CapsNet2 Wide’’, ‘‘CNN1+CapsNet5’’,
‘‘CNN6 + CapsNet’’ and ‘‘CNN9+CapsNet’’), 50 (i.e., for
‘‘CNN1 + CapsNet3’’ and ‘‘CNN1+CapsNet4’’), respectively. The learning rate of all improved capsule-based models is 0.01. For ‘‘CNN7’’ and ‘‘CNN12’’ models, their patch
sizes of structural MRI images are both set as 8 × 8 × 12.
Furthermore, the batch size, epoch, and learning rate of the two
CNN-based models are 5120, 50, and 0.01, respectively. All
above parameters underwent trial-and-error tuning to achieve the optimal
ASL image synthesis performance of each individual
model. All model implementations are realized via PyTorch
0.4.0 on the Ubuntu 16 OS, using a workstation equipped
with main hardware including an Intel Xeon CPU i7-7700K,
32 GB RAM, and an NVIDIA Titan Xp GPU card.
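The patch-based input described above (28 × 28 × 12 structural-MRI patches for the capsule models, 8 × 8 × 12 for the CNN baselines) can be sketched as follows; the non-overlapping stride and the toy volume size are assumptions for illustration only, not the paper's actual sampling scheme:

```python
import numpy as np

def extract_patches(vol, patch=(28, 28, 12), stride=(28, 28, 12)):
    """Tile a 3-D MRI volume into patches (illustrative sketch; border
    voxels not covered by a full patch are simply dropped here)."""
    px, py, pz = patch
    sx, sy, sz = stride
    out = []
    for x in range(0, vol.shape[0] - px + 1, sx):
        for y in range(0, vol.shape[1] - py + 1, sy):
            for z in range(0, vol.shape[2] - pz + 1, sz):
                out.append(vol[x:x + px, y:y + py, z:z + pz])
    return np.stack(out)

vol = np.zeros((64, 64, 21), dtype=np.float32)  # toy volume on the ASL grid
patches = extract_patches(vol)
print(patches.shape)                            # (4, 28, 28, 12)
```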
Figs. 8, 9, and 10 demonstrate ASL image synthesis outcomes of example AD, MCI, and NCI patients obtained
by all compared deep learning-based models, respectively.
FIGURE 9. ASL image synthesis outcomes obtained by all compared
models based on one example MCI patient in this study (i.e., Rows 1:
golden standard, 2: CNN1+CapsNet, 3: CNN1+CapsNet2 Wide, 4:
CNN1+CapsNet3, 5: CNN1+CapsNet3 Wide, 6: CNN1+CapsNet4, 7:
CNN1+CapsNet5, 8: CNN6+CapsNet, 9: CNN9+CapsNet, 10: CNN-7, 11:
CNN-12; Columns I: ASL control images, II: ASL label images; images are
displayed from the transverse view and the scale unit of color bars is a.u.).
To be specific, 12 slices (the 7th to the 18th) out of all
21 slices of the synthesized ASL image outcomes are demonstrated in Figs. 8–10, as these middle slices contain rich
brain information. It can be noticed from these figures that
the first row indicates real ASL images obtained by actual
scanning (i.e., denoted as the ‘‘golden standard’’), while
the remaining 10 rows depict synthesized ASL image outcomes
obtained by different models. Two columns in Figs. 8–10
depict ASL control and label images, which are essential
elements to produce ASL images as illustrated in Fig. 1.
Qualitative analyses are as follows.
First, for improved capsule-based models taking the
number of basic convolutions into consideration (i.e.,
‘‘CNN1+CapsNet’’, ‘‘CNN6+CapsNet’’, and ‘‘CNN9+
CapsNet’’), Rows 2, brought by ‘‘CNN1+CapsNet’’
in Figs. 8–10, obviously suffer from a serious blurring problem (i.e., the mosaic phenomenon is easily
identified in Rows 2). Unfortunately, Rows 9, brought
by ‘‘CNN9+CapsNet’’, are also deteriorated. Among
them, Rows 8, brought by ‘‘CNN6+CapsNet’’, are the
most satisfactory. It can be summarized based on the
FIGURE 10. ASL image synthesis outcomes obtained by all compared
models based on one example NCI patient in this study (i.e., Rows 1:
golden standard, 2: CNN1+CapsNet, 3: CNN1+CapsNet2 Wide, 4:
CNN1+CapsNet3, 5: CNN1+CapsNet3 Wide, 6: CNN1+CapsNet4, 7:
CNN1+CapsNet5, 8: CNN6+CapsNet, 9: CNN9+CapsNet, 10: CNN-7, 11:
CNN-12; Columns I: ASL control images, II: ASL label images; images are
displayed from the transverse view and the scale unit of color bars is a.u.).
above observation that the number of basic convolutions
in improved capsule-based models for synthesizing ASL
images in this study needs to be properly determined.
In other words, too few basic convolutions are not adequate for obtaining a proper latent representation for the
later capsule-based operations (i.e., ‘‘CNN1+CapsNet’’),
while too many basic convolutions are prone to deteriorate the
performance (i.e., ‘‘CNN9+CapsNet’’). Second, for
improved capsule-based models taking the number of capsule layers into consideration (i.e., ‘‘CNN1+CapsNet3’’,
‘‘CNN1+CapsNet4’’, and ‘‘CNN1+CapsNet5’’), Rows
4, brought by ‘‘CNN1+CapsNet3’’, are more blurred than
Rows 6 brought by ‘‘CNN1+CapsNet4’’ in Figs. 8–10.
For Rows 7 brought by ‘‘CNN1+CapsNet5’’, synthesized
ASL control images in Figs. 9 and 10 for the MCI
and NCI patients are significantly deteriorated. Therefore,
the number of capsule layers in improved capsule-based
models for ASL image synthesis also needs to be properly determined. Third, for improved capsule-based models taking the capacity of capsules into consideration (i.e.,
‘‘CNN1+CapsNet2 Wide’’ and ‘‘CNN1+CapsNet3 Wide’’),
Rows 3 and Rows 5 depict synthesis outcomes for
‘‘CNN1+CapsNet2 Wide’’ and ‘‘CNN1+CapsNet3 Wide’’,
respectively. Taking ‘‘CNN1+CapsNet3’’ (i.e., Row 4) and
‘‘CNN1+CapsNet3 Wide’’ (i.e., Row 5) for instance, the synthesis quality of Row 5 is significantly better than that of
Row 4 (e.g., less blurring) in all three figures. Hence, it is
valuable to take the issue of capacity into consideration, when
constructing capsule-based networks.
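To make the ‘‘capacity’’ issue concrete, one can count the parameters of the transformation matrices Wij between two capsule layers; the capsule counts and dimensions below are hypothetical and only illustrate how widening the capsules grows a layer's capacity:

```python
def capsule_transform_params(n_in, d_in, n_out, d_out):
    # One d_out x d_in transformation matrix W_ij per (input, output)
    # capsule pair, so the parameter count is n_in * n_out * d_out * d_in.
    return n_in * n_out * d_out * d_in

# Hypothetical "narrow" vs "wide" configurations (not the paper's actual ones):
narrow = capsule_transform_params(n_in=32, d_in=8, n_out=10, d_out=16)
wide = capsule_transform_params(n_in=32, d_in=16, n_out=10, d_out=32)
print(narrow, wide)  # doubling both capsule dimensions quadruples capacity
```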
C. QUANTITATIVE EVALUATIONS
More detailed quantitative evaluations are conducted in this
part to reveal which improved capsule-based model is superior. Experiments are carried out from the statistical point
of view. Two different types of quantitative tests, including
corrections of the challenging PVE (i.e., partial volume effect) problem in ASL images
and the multi-modal MRI-based dementia diseases diagnosis,
are incorporated. Details are as follows.
FIGURE 11. Boxplot of CBF calculated from real and synthesized ASL
images obtained from all compared deep learning-based models in this
study after applying the region-based PVE correction using 5 × 5
neighbors.
TABLE 1. Multiple comparison test in terms of CBF based on real and
synthesized ASL images after conducting the 5 × 5 region-based PVE
correction.
1) THE REGION-BASED PVE CORRECTION
The PVE correction problem can be mathematically
described as follows [44], [45]. Provided voxel i, its magnetization $M^C$ on the ASL control image and its magnetization
$M^L$ on the ASL label image can be explicitly represented as
Eqs. 7 & 8:

$M^C = P_{GM} \cdot M_{GM}^{C} + P_{WM} \cdot M_{WM}^{C} + P_{CSF} \cdot M_{CSF}^{C}$,   (7)

$M^L = P_{GM} \cdot M_{GM}^{L} + P_{WM} \cdot M_{WM}^{L} + P_{CSF} \cdot M_{CSF}^{L}$,   (8)
where $M^C$ and $M^L$ denote the magnetizations of voxel i on the ASL
control and label images, respectively (i.e., they are known
parameters after obtaining the ASL control and label images via
actual scanning); $P_{GM}$, $P_{WM}$, and $P_{CSF}$ are the fractional tissue
volumes of GM (i.e., gray matter), WM (i.e., white matter),
and CSF (i.e., cerebrospinal fluid) on voxel i, all of which
are known after segmenting GM / WM / CSF tissues on
co-registered MPRAGE T1-weighted images by the popular
SPM toolbox. $M_{GM}^{C}$ denotes the GM magnetization within the
ASL control image, and $M_{GM}^{L}$ denotes the GM magnetization
within the ASL label image. Moreover, the $M_{*}^{\star}$ (i.e., $\star$ represents
the ASL control (C) or label (L) image, while $*$ denotes the
specific tissue among GM / WM / CSF) in Eqs. 7 & 8 are
the unknowns to be solved in the PVE correction task. Since $M_{CSF}^{C}$
and $M_{CSF}^{L}$ are often assumed to be equivalent in contemporary clinical studies [44], [45], Eqs. 7 & 8 actually formulate a
typical problem of indefinite equations (i.e., 2 equations with
5 unknowns). In order to solve the above problem, conventional region-based PVE correction methods were proposed
in [44], [45]; they mainly follow the classic idea of linear
regression and solve the 5 unknowns as $(P^T P)^{-1} P^T M$, in which
$P$ is of size $n^2 \times 3$, whose $n^2$ rows correspond to the $n^2$
voxels obtained from the $n \times n$ neighborhood of voxel i and whose 3
columns correspond to $P_{GM}$, $P_{WM}$, and $P_{CSF}$ of all $n^2$ voxels;
$M$ denotes an $n^2 \times 2$ matrix with $M^L$ and $M^C$ of all $n^2$ voxels as
its two columns; $T$ and $-1$ indicate the transpose and inverse
of a matrix, respectively.
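The linear-regression solution above can be sketched directly in NumPy. The tissue fractions and magnetizations below are synthetic stand-ins, and a least-squares solver is used in place of the explicit inverse $(P^T P)^{-1} P^T$ for numerical stability:

```python
import numpy as np

def region_based_pve(P, M):
    """P: (n*n, 3) fractional volumes [P_GM, P_WM, P_CSF] of the n x n
    neighborhood; M: (n*n, 2) with columns [M_L, M_C]. Returns the 3 x 2
    tissue magnetizations, i.e. (P^T P)^(-1) P^T M via least squares."""
    X, *_ = np.linalg.lstsq(P, M, rcond=None)
    return X

# Synthetic 5 x 5 neighborhood with known ground-truth magnetizations.
rng = np.random.default_rng(1)
P = rng.dirichlet(alpha=[4, 3, 2], size=25)       # 25 rows of fractions summing to 1
true_X = np.array([[60.0, 55.0],                  # GM  (label, control)
                   [20.0, 18.0],                  # WM
                   [5.0, 5.0]])                   # CSF (assumed equal, per the text)
M = P @ true_X                                    # noiseless observed magnetizations
X = region_based_pve(P, M)
print(np.round(X, 3))                             # recovers true_X
```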
Since PVE is an essential and challenging problem in
ASL image processing, both real ASL images acquired by
actual scanning (i.e., as the ‘‘golden standard’’) and synthesized ASL images obtained by all synthesis models need
to undergo the rigorous PVE correction process, in order
to verify the effectiveness of synthesized ASL images. It is
easy to perceive that synthesized ASL images which, after the
PVE correction, are most similar to real ASL images after the
same PVE correction process
will be highly appreciated. The notion of similarity here is
quantitatively reflected by the perfusion signal (i.e., CBF),
which is calculated based on the well-known kinetic function
introduced in [46]. Figs. 11 and 12 demonstrate boxplots
of CBF calculated based on real / synthesized ASL images
after applying the region-based PVE correction method with
small 5 × 5 neighbors (i.e., n = 5) and large 15 × 15
neighbors (i.e., n = 15), respectively. Technically,
in Figs. 11 and 12, a red horizontal line across each box
represents the median of CBF, while upper and lower quartiles of CBF are depicted by blue lines above and below
the median in each box. A vertical dashed line is drawn
from the upper and lower quartiles to their most extreme
data points that lie within 1.5 times the inter-quartile range (i.e.,
IQR). Each data point beyond the ends of the 1.5 IQR is marked
via a red star symbol. It can be observed in Fig. 11 that
boxes of ‘‘CNN6+CapsNet’’, ‘‘CNN1+CapsNet2 Wide’’,
and ‘‘CNN1+CapsNet3 Wide’’ are close to that of the
golden standard. The above observation complies very well
FIGURE 12. Boxplot of CBF calculated from real and synthesized ASL
images obtained from all compared deep learning-based models in this
study after applying the region-based PVE correction using 15 × 15
neighbors.
TABLE 2. Multiple comparison test in terms of CBF based on real and
synthesized ASL images after conducting the 15 × 15 region-based PVE
correction.
TABLE 3. Multiple comparison test in terms of CBF based on real and
synthesized ASL images after conducting the voxel-based PVE correction.
FIGURE 13. Boxplot of CBF calculated from real and synthesized ASL
images obtained from all compared deep learning-based models in this
study after applying the voxel-based PVE correction.
with the previous qualitative analyses that, the number
of basic convolutions (i.e., ‘‘CNN6+CapsNet’’) and the
capacity of capsules (i.e., ‘‘CNN1+CapsNet2 Wide’’ and
‘‘CNN1+CapsNet3 Wide’’) play important roles in determining the synthesis performance of improved capsule-based
models in this study.
Another detailed quantitative analysis is carried out
in Tables 1 and 2 based on multiple comparison tests from
the statistical perspective. In each row of the tables, there
are two kinds of evaluations. One is a single-value estimation of the difference (i.e., using the GoldenStandard
minus one compared method); the other is a 95% confidence
interval (i.e., CI). In statistics, a CI is a special form of
interval estimator for a parameter (i.e., the difference using
the GoldenStandard minus one compared method in this
study). Instead of estimating the parameter by a single value,
a CI provides an interval estimation which is
likely to include the estimated parameter within a specified range. For instance, the single-value estimation of the
difference between the GoldenStandard and ‘‘CNN6+CapsNet’’
in Table 1 is 2.9839, and the difference is likely to
fall within a 95% CI of [1.3667, 4.6011]. Since this single-value
estimation as well as the 95% CI (i.e., both the lower and
upper bounds) are the closest to the GoldenStandard
compared with those of other synthesis models in Table 1,
‘‘CNN6+CapsNet’’ is superior. In Table 2, similar analyses
can be conducted, and ‘‘CNN1+CapsNet3 Wide’’ is
the closest to the GoldenStandard. Hence, properly setting
the number of basic convolutions (i.e., ‘‘CNN6+CapsNet’’)
and increasing the capacity of capsules (i.e., ‘‘CNN1+CapsNet3 Wide’’) are
quantitatively suggested to be beneficial, based on the above
region-based PVE corrections.
2) THE VOXEL-BASED PVE CORRECTION
It is necessary to point out that region-based PVE correction
methods mainly rely on neighboring voxels, which are likely
to bring about the blurring problem in ASL image correction outcomes, since a great number of voxels are actually
shared when tackling the PVE problem on nearby voxels
(i.e., the percentage of shared voxels significantly increases from 80% to 93.33% when the neighborhood
size enlarges from 5 × 5 to 15 × 15 [2]–[4]). Therefore,
an alternative ‘‘voxel-based’’ PVE correction method, which
concentrates on each individual voxel itself without incorporating neighboring voxels when tackling the PVE problem,
has received much popularity in recent years [2]–[4]. In this part,
synthesized ASL images obtained by all compared synthesis
models undergo the rigorous voxel-based PVE correction
test.
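The shared-voxel percentages quoted above follow from simple window geometry: the n × n neighborhoods of two horizontally adjacent voxels share n × (n − 1) of their n × n voxels, i.e., a fraction of (n − 1)/n. A quick check:

```python
def shared_fraction(n):
    # Two n x n windows centered on adjacent voxels overlap in
    # n * (n - 1) out of n * n voxels.
    return (n - 1) / n

print(f"{shared_fraction(5):.2%}")    # 80.00%
print(f"{shared_fraction(15):.2%}")   # 93.33%
```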
Fig. 13 delineates the boxplot of CBF calculated from real
ASL images and synthesized ASL images obtained by all
compared models after applying the voxel-based PVE correction. It can be summarized that ‘‘CNN1+CapsNet3 Wide’’
has the box closest to the golden standard,
which is a strong indicator that ASL image synthesis
outcomes obtained by ‘‘CNN1+CapsNet3 Wide’’ perform
better than others after applying the voxel-based PVE correction. There are also other interesting observations from
Fig. 13. For the comparison among ‘‘CNN1+CapsNet’’,
FIGURE 15. Detailed structures of three deep learning-based dementia
diseases diagnosis tools (i.e., from left to right: ‘‘CNN-2’’, ‘‘CNN-20’’,
‘‘ResNet-56’’).
FIGURE 14. $M_{GM}^{C}$ after fulfilling the region-based PVE correction (i.e.,
Column I) and the voxel-based PVE correction (i.e., Column II) obtained by
different synthesis models based on one example AD patient in this study
(i.e., Rows 1: Golden Standard, 2: CNN1+CapsNet, 3:
CNN1+CapsNet2 Wide, 4: CNN1+CapsNet3, 5: CNN1+CapsNet3 Wide; 6:
CNN1+CapsNet4; 7: CNN1+CapsNet5; 8: CNN6+CapsNet; 9:
CNN9+CapsNet; 10: CNN7; 11: CNN12; images are displayed from the
transverse view and the scale unit of color bars is a.u.).
‘‘CNN6+CapsNet’’, and ‘‘CNN9+CapsNet’’, the box of
‘‘CNN6+CapsNet’’ is the highest. It further substantiates
the previous conclusion that, the depth of basic convolutions should be properly determined. For the comparison among ‘‘CNN1+CapsNet3’’, ‘‘CNN1+CapsNet4’’, and
‘‘CNN1+CapsNet5’’, the box of ‘‘CNN1+CapsNet4’’ is
the highest, which also verifies the importance of properly determining the number of capsule layers. Also, both
‘‘CNN1+CapsNet2 Wide’’ and ‘‘CNN1+CapsNet3 Wide’’
demonstrate outstanding performance compared with their
corresponding ‘‘narrow’’ deep learning models, which indicates that increasing the capacity of capsules in improved
capsule-based models of this study is beneficial for synthesizing high-quality ASL images.
Table 3 demonstrates statistics of the multiple comparison test based on real and synthesized ASL images
after applying the voxel-based PVE correction. It can be
noticed that ‘‘CNN1+CapsNet3 Wide’’ has the nearest difference
to the golden standard (i.e., 4.2635 and [2.8274, 5.6996]),
which quantitatively substantiates the previous conclusion
in Fig. 13 from the statistical perspective. Another interesting
comparison is shown in Fig. 14, in which the $M_{GM}^{C}$ obtained
after applying the voxel-based PVE correction as well as
the region-based PVE correction (i.e., with 5 × 5 neighbors)
based on the same example AD patient are illustrated and
compared. It can be observed that more brain details, which
cannot be observed in $M_{GM}^{C}$ after fulfilling the region-based
PVE correction (i.e., Column II), can be clearly identified in
$M_{GM}^{C}$ after conducting the voxel-based PVE correction (i.e.,
Column I). Since the voxel-based PVE correction is helpful
to preserve brain details (i.e., cortex regions are highly important in clinical dementia diseases diagnosis), ASL images
after applying the voxel-based PVE correction will be utilized
in the following multi-modal MRI-based dementia diseases
diagnosis experiments.
3) MULTI-MODAL MRI-BASED DEMENTIA DISEASES
DIAGNOSIS
In this part, all real / synthesized ASL images after fulfilling
the voxel-based PVE correction are utilized to differentiate
the progression of dementia diseases (i.e., AD, MCI, NCI).
There are 7 diagnosis tools implemented in this study in total,
including 4 well-known shallow learning-based tools and
3 popular deep learning-based tools. The 4 shallow learning-based tools are linear regression (i.e., denoted as ‘‘LR’’),
support vector regression (i.e., denoted as ‘‘SVR’’), triple-class SVM (i.e., denoted as ‘‘SVM’’), and support vector
ranking (i.e., denoted as ‘‘Ranking’’). The 3 deep learning-based tools are denoted as ‘‘CNN-2’’, ‘‘CNN-20’’, and
‘‘ResNet-56’’, respectively. Implementation details of them
are as follows. For 4 shallow learning-based tools, low-level
hand-crafted features suggested by senior clinicians in our
group are utilized. Technically, 8 regions including the left &
right hippocampus, the left & right parahippocampal gyrus,
the left & right putamen, and the left & right thalamus are
segmented out from both structural MRI and real / synthesized ASL images of each individual patient via the IBA-SPM
toolbox [47]. Means and standard deviations of these regions
are calculated to construct 16-dimensional low-level feature
vectors for both structural MRI and ASL images. ‘‘SVR’’,
‘‘SVM’’, and ‘‘Ranking’’ are implemented using the
popular SVM-light toolbox, and the well-known Gaussian
radial basis function (i.e., Gaussian RBF) is adopted as the
non-linear kernel, with its Gaussian width learned via
the classic radius-margin bound algorithm. For the 3 deep
learning-based tools, their model structures are illustrated
in Fig. 15. It can be observed that, two-channel inputs simultaneously feeding both structural MRI and real / synthesized ASL images into these deep learning diagnosis tools
are employed (i.e., multi-modal MRI-based diagnosis). It is
necessary to mention that, when structural MRI is solely
utilized, only the single channel for feeding structural MRI is
active. Other settings of the 3 deep learning-based diagnosis
tools include: batch size = 64, epoch = 90, and learning
rate = 0.01.
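The 16-dimensional hand-crafted feature construction for the shallow tools (mean and standard deviation over 8 segmented regions) can be sketched as follows; the random masks below are stand-ins for the actual IBA-SPM segmentations:

```python
import numpy as np

REGIONS = ["l_hippocampus", "r_hippocampus",
           "l_parahippocampal", "r_parahippocampal",
           "l_putamen", "r_putamen",
           "l_thalamus", "r_thalamus"]

def region_features(image, masks):
    """Mean and standard deviation of each of the 8 regions,
    concatenated into a 16-D low-level feature vector."""
    feats = []
    for name in REGIONS:
        vals = image[masks[name]]        # voxels inside this region's mask
        feats.extend([vals.mean(), vals.std()])
    return np.array(feats)

rng = np.random.default_rng(2)
image = rng.normal(size=(64, 64, 21))                               # toy ASL-grid image
masks = {name: rng.random((64, 64, 21)) < 0.02 for name in REGIONS} # toy boolean masks
features = region_features(image, masks)
print(features.shape)   # (16,)
```

One such vector per modality (structural MRI and ASL) feeds each shallow tool, matching the description above.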
Table 4 demonstrates statistics of accuracies of dementia
diseases diagnosis based on structural MRI and real / synthesized ASL images of 355 demented patients in this study via
a 5-fold cross-validation strategy. Technically, the 355 patients
are divided into 5 groups of equal size (i.e., 71 patients
per group). In each fold, 4 of the groups are utilized for
training and the remaining group is incorporated for testing. It can
be learned from Table 4 that, statistics of only incorporating
structural MRI for dementia diseases diagnosis are considered as the reference for the performance evaluation. Based
on average accuracies calculated from diagnosis outcomes
of 7 different tools, it can be summarized that the performance
of dementia diseases diagnosis will degrade when adding
synthesized ASL images obtained from ‘‘CNN-7’’ (i.e.,
−39.67%), ‘‘CNN-12’’ (i.e., −38.95%), ‘‘CNN9+CapsNet’’
(i.e., −52.48%), ‘‘CNN1+CapsNet3’’ (i.e., −21.68%), and
‘‘CNN1+CapsNet5’’ (i.e., −44.36%). It is reasonable to
summarize that, both simple deep learning models (e.g.,
‘‘CNN-7’’ and ‘‘CNN-12’’) and unsuitable model designs in
improved capsule-based models (i.e., too many basic convolutions in ‘‘CNN9+CapsNet’’, too many layers of capsules
in ‘‘CNN1+CapsNet3’’ and ‘‘CNN1+CapsNet5’’, etc.) are
not beneficial to improve the dementia diseases diagnosis
performance, when incorporating synthesized ASL images
in this multi-modal MRI-based dementia diseases diagnosis
task.
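The 5-fold protocol above (355 patients split into 5 equal groups of 71, with 4 groups for training and 1 for testing per fold) can be sketched as follows; the shuffling seed is an arbitrary assumption:

```python
import numpy as np

def five_fold_splits(n_patients=355, n_folds=5, seed=0):
    """Yield (train, test) index arrays for each of the 5 folds."""
    idx = np.random.default_rng(seed).permutation(n_patients)
    folds = np.array_split(idx, n_folds)           # 5 groups of 71 patients
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, test                          # 284 train / 71 test per fold

for train, test in five_fold_splits():
    print(len(train), len(test))                   # 284 71
```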
The other improved capsule-based models all
show performance improvements when their synthesized
ASL images are incorporated for dementia diseases diagnosis (i.e., +4.67% by ‘‘CNN1+CapsNet’’, +18.44%
by ‘‘CNN6+CapsNet’’, +7.91% by ‘‘CNN1+CapsNet4’’,
+17.38% by ‘‘CNN1+CapsNet2 Wide’’, and +23.03% by
‘‘CNN1+CapsNet3 Wide’’). Among all these improved
capsule-based models, ‘‘CNN1+CapsNet3 Wide’’ (i.e.,
+23.03%), ‘‘CNN1+CapsNet2 Wide’’ (i.e., +17.38%) and
‘‘CNN6+CapsNet’’ (i.e., +18.44%) are significantly better than others. To be specific, ‘‘CNN1+CapsNet3 Wide’’
dominates when incorporating ‘‘LR’’, ‘‘Ranking’’, and
‘‘ResNet-56’’ as diagnosis tools. ‘‘CNN1+CapsNet2 Wide’’
performs the best when employing ‘‘CNN-2’’ and ‘‘CNN-20’’ as diagnosis tools. ‘‘CNN6+CapsNet’’ is superior when
using ‘‘SVR’’ and ‘‘SVM’’ as diagnosis tools. It is also
suggested from Table 4 that, ‘‘CNN1+CapsNet3 Wide’’
achieves the highest average accuracy based on all diagnosis
outcomes (i.e., 0.6299). Since ‘‘CNN1+CapsNet3 Wide’’
increases its number of capsule layers compared with
‘‘CNN1+CapsNet2 Wide’’ (i.e., whose average accuracy
is 0.6010) as well as increases its capacity of capsules
compared with ‘‘CNN1+CapsNet3’’ (i.e., whose average
accuracy is 0.4010), its performance boost suggests that
comprehensively and properly taking various issues (e.g.,
‘‘the number of capsule layers’’ and ‘‘the capacity of capsules’’ as mentioned above) into consideration in constructing
capsule-based networks should be beneficial to improve the
ASL image synthesis performance in this study. The above
conclusion can also be further substantiated by comparisons
among ‘‘CNN1+CapsNet’’, ‘‘CNN6+CapsNet’’, and
‘‘CNN9+CapsNet’’, in which ‘‘CNN6+CapsNet’’ is superior (i.e., 0.6064 against 0.5359 and 0.2433). It is valuable
to mention that, the number of basic convolutions should
also be properly chosen (i.e., according to our pre-trials,
‘‘CNN6+CapsNet’’ has the optimal performance when the
number of basic convolutions is no larger than 20).
4) ADDITIONAL EXPERIMENTS OF ASL IMAGE SYNTHESIS
AND DEMENTIA DISEASES DIAGNOSIS BASED ON THE
POPULAR ADNI-1 DATASET
In this part, additional experiments based on the popular
ADNI-1 dataset are conducted to further reveal the superiority of improved capsule-based models proposed in this study.
The public ADNI-1 dataset belongs to the important ADNI
dataset family [5], and it also has an imbalanced class distribution (i.e., including 200 AD patients, 400 MCI patients,
and 200 NCI patients), like the 355-demented-patient dataset.
Since structural MRI images were acquired but ASL images
are not available in the ADNI-1 dataset, the main
task here is to synthesize ASL images from the available structural MRI images in the ADNI-1 dataset, based on the
synthesis models learned from the previous 355-demented-patient
dataset. The main purpose here is to verify the effectiveness of
incorporating previously unavailable but newly synthesized
ASL images to improve the dementia diseases diagnosis
FIGURE 16. The visualization of example synthesized ASL images obtained by ‘‘CNN1+CapsNet3 Wide’’ (i.e., 1st column), ‘‘CNN1+CapsNet4’’ (i.e., 2nd
column), and ‘‘CNN6+CapsNet’’ (i.e., 3rd column) from sMRI of AD (i.e., 1st row), MCI (i.e., 2nd row), and NCI (i.e., 3rd row) patients based on the
popular ADNI-1 dataset (i.e., realized by the 3D slicer software with color renderings).
performance in the ADNI-1 dataset. Spatial resolutions of
structural MRI images in the ADNI-1 dataset are set as
64 × 64 × 21, which is consistent with those in the first dataset
of this study.
Three improved capsule-based models verified within
Section IV-C.3, i.e., ‘‘CNN6+CapsNet’’, ‘‘CNN1+Caps
Net4’’, and ‘‘CNN1+CapsNet3 Wide’’, are shortlisted to
conduct additional experiments in this part. The reason
is that each of the above-mentioned models
concentrates on one of the three important issues in designing capsule-based networks (i.e., the depth of basic convolutions for ‘‘CNN6+CapsNet’’, the number of capsule layers
for ‘‘CNN1+CapsNet4’’, and the capacity of capsules for
‘‘CNN1+CapsNet3 Wide’’). Also, these three improved
capsule-based models perform outstandingly as suggested
in Table 4.
Fig. 16 provides the visualization of example synthesized ASL images obtained by these three improved
CapsNets-based models, based on structural MRI images of
AD, MCI, and NCI patients from the popular ADNI-1 dataset.
It is necessary to point out that, the visualization in Fig. 16
is realized by the well-known 3D Slicer software with color
renderings [48]. It is valuable to point out that, synthesized
ASL images obtained by the three improved capsule-based
models are reasonable, as hot regions (i.e., indicated by the
red color) become less and less obvious with the progression
of dementia diseases (i.e., NCI→MCI→AD). It also complies well with the clinical understanding that the blood flow
tends to decrease (i.e., fewer hot regions) when the severity
of dementia diseases increases.
In Table 5, statistics of accuracies in dementia diseases
diagnosis based on structural MRI and synthesized ASL
images obtained by the three improved capsule-based models based on the popular ADNI-1 dataset are reported. The
seven diagnosis tools which are utilized in Table 4 are also
implemented in Table 5. It can be noticed that, the average
TABLE 4. Statistics of accuracies in dementia diseases diagnosis based on structural MRI and real / synthesized ASL images.
TABLE 5. Statistics of accuracies of dementia diseases diagnosis based on structural MRI and synthesized ASL images obtained by three improved
capsule-based models based on the popular ADNI-1 dataset.
accuracy of dementia diseases diagnosis based on structural MRI images obtained by actual scanning is 0.6228,
which is also utilized as the reference for later evaluations.
For Columns 3, 4, and 5 in Table 5, average accuracies
of diagnosis based on both structural MRI and synthesized ASL images obtained by ‘‘CNN1+CapsNet3 Wide’’,
‘‘CNN1+CapsNet4’’, and ‘‘CNN6+CapsNet’’, are demonstrated, respectively. It is necessary to point out that, the last
row ‘‘Boost’’ in Table 5 indicates the percentage of improve=
ment compared with the reference (e.g., 0.7757−0.6228
0.6228
+24.55%). The reliability of synthesized ASL images in
dementia diseases diagnosis can be suggested via the significant boost of +24.55%, +18.53%, and +14.11% in Table 5,
which are strong indicators of the superiority of improved
capsule-based models (i.e., ‘‘CNN1+CapsNet3 Wide’’,
‘‘CNN1+CapsNet4’’, and ‘‘CNN6+CapsNet’’) proposed in
this study. Therefore, the reliability of synthesized high-quality ASL images in dementia diseases diagnosis based on
the popular ADNI-1 dataset can be substantiated as well.
The superiority of the newly introduced capsule-based
networks in this study has also been revealed after comparing with the state-of-the-art in ASL image synthesis. In [3], an unbalanced and multi-channel-based deep
learning model was proposed to fulfill the ASL image
synthesis task, and it is generally acknowledged as the
state-of-the-art in ASL image synthesis. The same
355-demented-patient dataset was incorporated, and the same
7 diagnosis tools (i.e., ‘‘LR’’, ‘‘SVR’’, ‘‘SVM’’, ‘‘Ranking’’,
‘‘CNN-2’’, ‘‘CNN-20’’, ‘‘ResNet-56’’) were also employed
to fulfill the dementia diseases diagnosis task based on
synthesized ASL images in that study as well. It turns out
that, the average diagnosis accuracy of the state-of-the-art
in [3] is 0.6248. In Table 4, ‘‘CNN1+CapsNet3 Wide’’
is capable of achieving a higher average accuracy of 0.6299.
Therefore, the superiority of the newly introduced capsulebased networks in this study can be further revealed.
V. CONCLUSION
In this study, several improved capsule-based models are
introduced to realize ASL image synthesis for the first time.
Technically, three important issues in designing capsule-based models, including the number of basic convolutions,
the number of capsule layers, and the capacity of capsules,
are thoroughly investigated. It is suggested that high-quality
synthesized ASL images can be obtained from improved
capsule-based models.
Although this study proposes improved capsule-based
models for synthesizing ASL images from structural MRI
images, these models can be extended for synthesizing medical images of other modalities. Future efforts will be spent
on synthesizing ultrasound images based on capsule-based
models, which should be carefully designed or modified after taking the characteristics of ultrasound images (e.g., low SNR and overall low image quality) into consideration.
ACKNOWLEDGMENT
(Mingyuan Luo and Xi Liu contributed equally to this work.)
REFERENCES
[1] E. S. Musiek, Y. Chen, M. Korczykowski, B. Saboury, P. M. Martinez,
J. S. Reddin, A. Alavi, D. Y. Kimberg, D. A. Wolk, P. Julin, A. B. Newberg,
S. E. Arnold, and J. A. Detre, ‘‘Direct comparison of fluorodeoxyglucose positron emission tomography and arterial spin labeling magnetic
resonance imaging in Alzheimer’s disease,’’ Alzheimer’s Dementia, vol. 8,
no. 1, pp. 51–59, Jan. 2012.
[2] W. Huang, ‘‘A novel disease severity prediction scheme via big pair-wise
ranking and learning techniques using image-based personal clinical data,’’
Signal Process., vol. 124, pp. 233–245, Jul. 2016.
[3] W. Huang, M. Luo, X. Liu, P. Zhang, H. Ding, W. Xue, and D. Ni,
‘‘Arterial spin labeling images synthesis from sMRI using unbalanced
deep discriminant learning,’’ IEEE Trans. Med. Imag., vol. 38, no. 10,
pp. 2338–2351, Oct. 2019.
[4] W. Huang, S. Zeng, M. Wan, and G. Chen, ‘‘Medical media analytics via
ranking and big learning: A multi-modality image-based disease severity
prediction study,’’ Neurocomputing, vol. 204, pp. 125–134, Sep. 2016.
[5] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. Jack, W. Jagust,
J. Q. Trojanowski, A. W. Toga, and L. Beckett, ‘‘The Alzheimer’s disease
neuroimaging initiative,’’ Neuroimag. Clin., vol. 15, no. 4, pp. 869–877,
Nov. 2005.
[6] A. Ahlgren, R. Wirestam, E. T. Petersen, F. Stahlberg, and L. Knutsson,
‘‘Partial volume correction of brain perfusion estimates using the inherent
signal data of time-resolved arterial spin labeling,’’ NMR Biomed., vol. 27,
no. 9, pp. 1112–1122, Jul. 2014.
[7] L. Hernandez-Garcia, A. Lahiri, and J. Schollenberger, ‘‘Recent progress
in ASL,’’ NeuroImage, vol. 187, pp. 3–16, Feb. 2019.
[8] S. Sabour, N. Frosst, and G. Hinton, ‘‘Dynamic routing between capsules,’’
in Proc. Adv. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017,
pp. 3856–3866.
[9] A. Jimenez-Sanchez, S. Albarqouni, and D. Mateus, ‘‘Capsule networks
against medical imaging data challenges,’’ in Proc. Workshop Intravascular Imag. Comput. Assist. Stenting Large-Scale Annotation Biomed.
Data Expert Label Synth., Med. Image Comput. Comput.-Assist. Intervent.
Granada, Spain: Springer, 2018, pp. 150–160.
[10] V. Kalesnykiene, J. Kamarainen, R. Voutilainen, J. Pietilä, H. Kälviäinen, and H. Uusitalo. DIARETDB1 Diabetic Retinopathy Database and Evaluation Protocol. Accessed: Oct. 1, 2020. [Online]. Available: https://www.it.lut.fi/project/imageret/diaretdb1/doc/diaretdb1_techreport_v_1_1.pdf
[11] B. Yu, Y. Wang, L. Wang, D. Shen, and L. Zhou, ‘‘Medical image synthesis via deep learning,’’ in Deep Learning in Medical Image Analysis (Advances in Experimental Medicine and Biology), vol. 1213. 2020,
pp. 23–44, doi: 10.1007/978-3-030-33128-3_2.
[12] X. Yi, E. Walia, and P. Babyn, ‘‘Generative adversarial network in
medical imaging: A review,’’ Med. Image Anal., vol. 58, Dec. 2019,
Art. no. 101552.
[13] N. Cordier, H. Delingette, M. Lê, and N. Ayache, ‘‘Extended modality propagation: Image synthesis of pathological cases,’’ IEEE Trans. Med. Imag., vol. 35, no. 12, pp. 2598–2608, Dec. 2016.
[14] Y. Huang, L. Shao, and A. F. Frangi, ‘‘Cross-modality image synthesis via
weakly coupled and geometry co-regularized joint dictionary learning,’’
IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 815–827, Mar. 2018.
[15] A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, ‘‘Multimodal
MR synthesis via modality-invariant latent representation,’’ IEEE Trans.
Med. Imag., vol. 37, no. 3, pp. 803–814, Mar. 2018.
[16] B. Yu, L. Zhou, L. Wang, Y. Shi, J. Fripp, and P. Bourgeat, ‘‘Ea-GANs:
Edge-aware generative adversarial networks for cross-modality MR image
synthesis,’’ IEEE Trans. Med. Imag., vol. 38, no. 7, pp. 1750–1762,
Jul. 2019.
[17] T. Zhang, H. Fu, Y. Zhao, J. Cheng, M. Guo, Z. Gu, B. Yang, Y. Xiao, S. Gao, and J. Liu, ‘‘SkrGAN: Sketching-rendering unconditional generative adversarial networks for medical image synthesis,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 777–785.
[18] S. U. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Cukur, ‘‘Image
synthesis in multi-contrast MRI with conditional generative adversarial
networks,’’ IEEE Trans. Med. Imag., vol. 38, no. 10, pp. 2375–2388,
Oct. 2019.
[19] D. Nie, R. Trullo, J. Lian, L. Wang, C. Petitjean, S. Ruan, Q. Wang, and
D. Shen, ‘‘Medical image synthesis with deep convolutional adversarial
networks,’’ IEEE Trans. Biomed. Eng., vol. 65, no. 12, pp. 2720–2730,
Dec. 2018.
[20] I. Polycarpou, G. Soultanidis, and C. Tsoumpas, ‘‘Synthesis of realistic simultaneous positron emission tomography and magnetic resonance
imaging data,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 703–711,
Mar. 2018.
[21] Y. Wang, L. Zhou, B. Yu, L. Wang, C. Zu, D. S. Lalush, W. Lin, X. Wu,
J. Zhou, and D. Shen, ‘‘3D auto-context-based locality adaptive multimodality GANs for PET synthesis,’’ IEEE Trans. Med. Imag., vol. 38,
no. 6, pp. 1328–1339, Jun. 2019.
[22] Y. Wang, B. Yu, L. Wang, C. Zu, D. S. Lalush, W. Lin, X. Wu, J. Zhou,
D. Shen, and L. Zhou, ‘‘3D conditional generative adversarial networks
for high-quality PET image estimation at low dose,’’ NeuroImage, vol. 174,
pp. 550–562, Jul. 2018.
[23] G. Zeng and G. Zheng, ‘‘Hybrid generative adversarial networks for
deep MR to CT synthesis using unpaired data,’’ in Medical Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China:
Springer, 2019, pp. 759–767.
[24] H. Salehinejad, S. Valaee, T. Dowdell, E. Colak, and J. Barfett, ‘‘Generalization of deep neural networks for chest pathology classification in
X-Rays using generative adversarial networks,’’ in Proc. IEEE Int. Conf.
Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 990–994.
[25] H. Salehinejad, E. Colak, T. Dowdell, J. Barfett, and S. Valaee, ‘‘Synthesizing chest X-ray pathology for training deep convolutional neural networks,’’ IEEE Trans. Med. Imag., vol. 38, no. 5, pp. 1197–1206, May 2019.
[26] Y. Zhou, S. Giffard-Roisin, M. De Craene, S. Camarasu-Pop, J. D’Hooge,
M. Alessandrini, D. Friboulet, M. Sermesant, and O. Bernard, ‘‘A framework for the generation of realistic synthetic cardiac ultrasound and magnetic resonance imaging sequences from the same virtual patients,’’ IEEE
Trans. Med. Imag., vol. 37, no. 3, pp. 741–754, Mar. 2018.
[27] Y. Ren, Z. Zhu, Y. Li, D. Kong, R. Hou, L. Grimm, J. Marks, and J. Lo,
‘‘Mask embedding for realistic high-resolution medical image synthesis,’’ in Medical Image Computing and Computer Assisted Intervention—
MICCAI. Shenzhen, China: Springer, 2019, pp. 422–430.
[28] G. Jiang, Y. Lu, J. Wei, and Y. Xu, ‘‘Synthesize mammogram from
digital breast tomosynthesis with gradient guided cGANs,’’ in Medical
Image Computing and Computer Assisted Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 801–809.
[29] P. Costa, A. Galdran, M. I. Meyer, M. Niemeijer, M. Abramoff,
A. M. Mendonca, and A. Campilho, ‘‘End-to-end adversarial retinal
image synthesis,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 781–791,
Mar. 2018.
[30] Y. Zhou, X. He, S. Cui, F. Zhu, L. Liu, and L. Shao, ‘‘High-resolution diabetic retinopathy image synthesis manipulated by grading and lesions,’’ in
Medical Image Computing and Computer Assisted Intervention—MICCAI.
Shenzhen, China: Springer, 2019, pp. 505–513.
[31] X. Wang, M. Xu, L. Li, Z. Wang, and Z. Guan, ‘‘Pathology-aware deep
network visualization and its application in glaucoma image synthesis,’’ in
Medical Image Computing and Computer Assisted Intervention—MICCAI.
Shenzhen, China: Springer, 2019, pp. 423–431.
[32] A. Diaz-Pinto, A. Colomer, V. Naranjo, S. Morales, Y. Xu, and
A. F. Frangi, ‘‘Retinal image synthesis and semi-supervised learning
for glaucoma assessment,’’ IEEE Trans. Med. Imag., vol. 38, no. 9,
pp. 2211–2218, Sep. 2019.
[33] T. Kanayama, Y. Kurose, K. Tanaka, K. Aida, S. Satoh, M. Kitsuregawa,
and T. Harada, ‘‘Gastric cancer detection from endoscopic images using
synthesis by GAN,’’ in Medical Image Computing and Computer Assisted
Intervention—MICCAI. Shenzhen, China: Springer, 2019, pp. 530–538.
[34] A. F. Frangi, S. A. Tsaftaris, and J. L. Prince, ‘‘Simulation and synthesis in
medical imaging,’’ IEEE Trans. Med. Imag., vol. 37, no. 3, pp. 673–679,
Mar. 2018.
[35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learning applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[36] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
recognition,’’ 2015, arXiv:1512.03385. [Online]. Available: http://arxiv.
org/abs/1512.03385
[37] O. Ronneberger, P. Fischer, and T. Brox, ‘‘U-Net: Convolutional networks
for biomedical image segmentation,’’ in Medical Image Computing and
Computer Assisted Intervention—MICCAI. Cham, Switzerland: Springer,
2015, pp. 234–241.
[38] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial networks,’’ 2014, arXiv:1406.2661. [Online]. Available: http://arxiv.org/abs/
1406.2661
[39] M. Mirza and S. Osindero, ‘‘Conditional generative adversarial nets,’’
2014, arXiv:1411.1784. [Online]. Available: http://arxiv.org/abs/1411.
1784
[40] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge in a neural
network,’’ 2015, arXiv:1503.02531. [Online]. Available: http://arxiv.org/
abs/1503.02531
[41] G. Hinton, S. Sabour, and N. Frosst, ‘‘Matrix capsules with EM routing,’’ in Proc. Int. Conf. Learn. Represent., Vancouver, BC, Canada, 2018. [Online]. Available: https://openreview.net/forum?id=HJWLfGWRb
[42] M. Brant-Zawadzki, G. D. Gillan, and W. R. Nitz, ‘‘MP RAGE: A three-dimensional, T1-weighted, gradient-echo sequence-initial experience in the brain,’’ Radiology, vol. 182, no. 3, pp. 769–775, Mar. 1992.
[43] SPM12—Statistical Parametric Mapping Toolbox. Accessed: Oct. 1, 2020.
[Online]. Available: https://www.fil.ion.ucl.ac.uk/spm/software/spm12/
[44] I. Asllani, A. Borogovac, and T. R. Brown, ‘‘Regression algorithm correcting for partial volume effects in arterial spin labeling MRI,’’ Magn. Reson.
Med., vol. 60, no. 6, pp. 1362–1371, Dec. 2008.
[45] M. Chappell, ‘‘Partial volume correction of multiple inversion time arterial spin labeling MRI data,’’ Magn. Reson. Med., vol. 65, pp. 1173–1183, Feb. 2011.
[46] G. S. Pell, D. L. Thomas, M. F. Lythgoe, F. Calamante, A. M. Howseman,
D. G. Gadian, and R. J. Ordidge, ‘‘Implementation of quantitative FAIR
perfusion imaging with a short repetition time in time-course studies,’’
Magn. Reson. Med., vol. 41, no. 4, pp. 829–840, Apr. 1999.
[47] Individual Brain Atlases using Statistical Parametric Mapping
(IBA-SPM) Software. Accessed: Oct. 1, 2020. [Online]. Available:
http://www.thomaskoenig.ch/Lester/ibaspm.htm
[48] 3D Slicer. Accessed: Oct. 1, 2020. [Online]. Available: https://
www.slicer.org
WEI HUANG received the B.Eng. and M.Eng.
degrees from the Harbin Institute of Technology,
China, and the Ph.D. degree from Nanyang Technological University, Singapore. He then worked
with the University of California San Diego,
San Diego, CA, USA, as well as the Agency
for Science Technology and Research, Singapore,
as a Postdoctoral Research Fellow. He is currently with the Department of Computer Science, and
acts as the Director of the Informatization Office,
Nanchang University, China. His main research interests include machine
learning, pattern recognition, medical image processing, and multimedia.
He has published nearly 100 academic journal/conference papers, including
the IEEE TRANSACTIONS ON MEDICAL IMAGING, the IEEE TRANSACTIONS ON
MULTIMEDIA, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO
TECHNOLOGY, Pattern Recognition, MICCAI, and ACM Multimedia. He has
been acting as the principal investigator of 15 national/provincial grants,
including three NSF-China projects and three NSF key projects in Jiangxi,
China. He received the Jiangxi Provincial Natural Science Award, the Most Interesting Paper Award of ICME-ASMMC, and the Best Paper Award of MICCAI-MLMI. He has been named a provincial academic leader and a provincial young scientist of Jiangxi Province.
MINGYUAN LUO received the B.Eng. and
M.Eng. degrees from Nanchang University, under
the supervision of Prof. W. Huang. He has published several academic articles in well-known
international journals and conference proceedings, including the IEEE TRANSACTIONS ON MEDICAL
IMAGING, IEEE ACCESS, MICCAI, and ACM Multimedia. His research interests include medical
image processing, machine learning, computer
vision, and pattern recognition.
XI LIU received the B.Eng. degree in 2017, and
the M.Eng. degree from Nanchang University,
under the supervision of Prof. W. Huang. She
has published several academic papers in well-known international journals and conference proceedings, including the IEEE TRANSACTIONS ON
MEDICAL IMAGING, IEEE ACCESS, and MICCAI. Her
research interests mainly include computer vision
and pattern recognition.
PENG ZHANG received the B.E. degree from
Xi'an Jiaotong University, China, in 2001, and
the Ph.D. degree from Nanyang Technological
University, Singapore, in 2011. He is currently
a Full Professor with the School of Computer Science, Northwestern Polytechnical University, China. He has published nearly
100 research articles, including CVPR, ACM Multimedia, Neurocomputing, Signal Processing, the
IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE
TRANSACTIONS ON MULTIMEDIA, and the IEEE TRANSACTIONS ON MEDICAL
IMAGING. He has been acting as the PI in three grants of NSFC. His current
research interests include computer vision, pattern recognition, and machine
learning. He is also the Chief Scientist in Mekitec OY, Finland.
HUIJUN DING received the B.Eng. degree in
electronic engineering and information science
from the University of Science and Technology
of China (USTC), Hefei, in 2006, and the Ph.D.
degree from the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore, in 2011. Afterward,
she was a Postdoctoral Research Fellow with the
Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK), before
joining Shenzhen University, China, in 2013. Her current research interests
include speech and image processing, objective measures, and nanomaterial-enabled acoustic devices.