
1 Introduction

Quantification of the left ventricle (LV) from cardiac imaging is among the most clinically important and most frequently demanded tasks for the identification and diagnosis of cardiac disease [6], yet it remains challenging due to the high variability of cardiac structure across subjects and the complicated global/regional temporal dynamics. Full quantification, i.e., simultaneously quantifying all LV indices including two areas, six regional wall thicknesses (RWT), three LV dimensions, and one phase (as shown in Fig. 1), provides more detailed information for comprehensive cardiac function assessment but is even more challenging, since the uncertain relatedness within and between the different types of indices may hinder the learning procedure from better convergence and generalization. In this work, we propose a newly designed deep multitask learning network, FullLVNet, for full quantification of the LV that respects both intra- and inter-task relatedness.

In clinical practice, reliable quantification relies on measurements over the segmented myocardium, which is usually obtained by manually contouring the myocardial borders [13] or by manually correcting contours [3, 7] generated by LV segmentation algorithms [9]. However, manual contouring is time-consuming, suffers from high inter-observer variability, and is typically limited to the end-diastolic (ED) and end-systolic (ES) frames, which makes it insufficient for dynamic function analysis. LV segmentation, despite recent advances, remains a difficult problem due to the lack of edge information and the presence of shape variability. Most existing segmentation methods for cardiac MR images [4, 9, 10] require strong prior information and user interaction to obtain reliable results, which may prevent them from efficient clinical application.

Fig. 1. Illustration of LV indices to be quantified for short-axis view cardiac images. (a) Cavity (blue) and myocardium (orange) areas. (b) Directional dimensions of cavity (red arrows). (c) Regional wall thicknesses (red arrows). A: anterior; AS: anteroseptal; IS: inferoseptal; I: inferior; IL: inferolateral; AL: anterolateral. (d) Phase (systole or diastole).

In recent years, direct methods without segmentation have grown in popularity for cardiac volume estimation [1, 2, 14, 17,18,19,20]. Although these methods achieve effective performance by leveraging state-of-the-art machine learning techniques, they suffer from the following limitations. (1) Lack of powerful task-aware representation: vulnerable hand-crafted or task-unaware features are not capable of capturing sufficient task-relevant cardiac structures. (2) Lack of temporal modeling: handling each frame independently, without assistance from its neighbors, cannot guarantee consistency and accuracy. (3) Not end-to-end learning: separately learned representation and regression models cannot be optimal for each other. (4) Not full quantification: cardiac volume alone is not sufficient for comprehensive global, regional, and dynamic function assessment.

In this paper, we propose a newly designed multitask learning network (FullLVNet), which consists of a specially tailored deep CNN for expressive feature embedding, two subsequent parallel RNN modules for temporal dynamic modeling, and four linear models for the final estimation. During the final estimation, FullLVNet improves generalization by (1) modeling intra-task relatedness through group lasso regularization within each regression task; and (2) modeling inter-task relatedness with three phase-guided constraints that penalize violation of the temporal behavior of the LV indices. After being trained with a two-step strategy, FullLVNet is capable of delivering accurate results for all the considered indices of the cardiac LV.

2 Multitask Learning for Full Quantification of Cardiac LV

The proposed FullLVNet models full quantification of the cardiac LV as a multitask learning problem. Three regression tasks \(\{y_{area}^{s,f}, y_{dim}^{s,f}, y_{rwt}^{s,f}\}\) and one classification task \(y_{phase}^{s,f}\) are simultaneously learned to predict frame-wise values of the above-mentioned LV indices from cardiac MR sequences \(\mathcal {X}=\{X^{s,f}\}\), where \(s=1\cdots S\) indexes the subject and \(f=1\cdots F\) indexes the frame. The objective of FullLVNet is:

$$\begin{aligned} W_{optimal}=\min _{W}\frac{1}{S\times F}\sum _{s,f}\sum _{t}L_t(\hat{y}_t^{s,f}(X^{s,f}|W),y_t^{s,f})+\lambda \mathcal {R}(W) \end{aligned}$$
(1)

where \(t\in \{area, dim, rwt, phase\}\) denotes a specific task, \(\hat{y}_t\) is the estimated result for task t, \(L_t\) is the loss function of task t, and \(\mathcal {R}(W)\) denotes the regularization of the parameters in the network.
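For concreteness, the snippet below sketches how the terms of (1) can be assembled in code. It is a minimal, illustrative PyTorch-style Python sketch (the actual implementation uses Caffe), and all names are our own.

```python
import torch

def multitask_objective(per_sample_losses, reg_term, lam=1.0):
    """Eq. (1): average each task's per-frame losses over all S*F samples,
    sum across tasks, and add the weighted regularization R(W) of Sec. 2.2."""
    data_term = sum(losses.mean() for losses in per_sample_losses.values())
    return data_term + lam * reg_term
```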

Fig. 2. Overview of FullLVNet, which combines a deep CNN (details shown on the left) for feature embedding, two RNN modules for temporal dynamic modeling, and four linear models for the final estimation. Intra- and inter-task relatedness are modeled in the final estimation to improve generalization.

2.1 Architectures of FullLVNet

Figure 2 shows an overview of FullLVNet. A deep CNN first extracts expressive and task-aware features from the cardiac images, which are then fed to the RNN modules for temporal dynamic modeling. The final estimations are given by four linear models that take the outputs of the RNN modules as input. To improve the generalization of FullLVNet, both intra- and inter-task relatedness are carefully modeled through group lasso and phase-guided constraints on the linear models.

CNN for deep feature embedding. To obtain expressive and task-aware features, we design a specially tailored deep CNN for cardiac images, as shown in the left of Fig. 2. Powerful representations can be obtained by transfer learning [12] from well-known deep architectures in computer vision when labeled data are limited. However, transfer learning may incur (1) restrictions on the network architecture, resulting in an incompatible or redundant model; and (2) restrictions on the input channels and dimensions, requiring image resizing and channel expansion. We instead reduce the number of filters in each layer to avoid model redundancy. As for the kernel size of convolution and pooling, \(5\times 5\), instead of the frequently used \(3\times 3\), is deployed to introduce more shift invariance. Dropout and batch normalization are adopted to ease the training procedure. As shown in our experiments, our CNN is very effective for cardiac images even without transfer learning. As a feature embedding network, our CNN maps each cardiac image \(X^{s,f}\) into a fixed-length low-dimensional vector \(e^{s,f}=f_{cnn}(X^{s,f}|w_{cnn})\in \mathcal {R}^{100}\).
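As a rough illustration of such an embedding network, the sketch below builds a small CNN with \(5\times 5\) kernels, batch normalization, dropout, and a 100-dimensional output. It is PyTorch-style Python rather than the Caffe implementation, and the number of layers and filters are assumptions, not the exact configuration of Fig. 2.

```python
import torch
import torch.nn as nn

class CardiacCNN(nn.Module):
    """Illustrative feature-embedding CNN: a single-channel cardiac frame
    (e.g., 75x75 after cropping) -> 100-d embedding. Layer sizes are
    assumptions, not the exact FullLVNet configuration."""

    def __init__(self, embed_dim=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(kernel_size=5, stride=2, padding=2),   # 5x5 pooling as in the text
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(kernel_size=5, stride=2, padding=2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Sequential(nn.Dropout(0.5), nn.Linear(64, embed_dim))

    def forward(self, x):                 # x: (batch, 1, H, W)
        h = self.features(x).flatten(1)   # (batch, 64)
        return self.embed(h)              # (batch, embed_dim)
```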

RNNs for temporal dynamic modeling. Accurate modeling of cardiac temporal dynamics assists the quantification of the current frame with information from its neighbors. RNNs, especially when LSTM units [5] are deployed, specialize in temporal dynamic modeling and have been employed for cardiac image segmentation [11] and key frame recognition [8] in cardiac sequences. In this work, two RNN modules, shown by the green and yellow blocks in Fig. 2, are deployed for the regression tasks and the classification task, respectively. For the three regression tasks, the indices to be estimated are mainly related to the spatial structure of the cardiac LV in each frame. For the classification task, the cardiac phase is mainly related to the structural difference between successive frames. Therefore, the two RNN modules are designed to capture these two kinds of dependencies. The outputs of the RNN modules are \(\{h_m^{s,1},\ldots ,h_m^{s,F}\}=f_{rnn}([e^{s,1},\ldots ,e^{s,F}]|w_{m}), m\in \{rnn1,rnn2\}\).
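A minimal sketch of the two temporal modules, assuming single-layer LSTMs over the per-frame embeddings (the hidden size is an assumption):

```python
import torch
import torch.nn as nn

class TemporalModules(nn.Module):
    """Two parallel LSTMs over the F per-frame embeddings: rnn1 feeds the three
    regression heads, rnn2 feeds the phase classification head."""

    def __init__(self, embed_dim=100, hidden=100):
        super().__init__()
        self.rnn1 = nn.LSTM(embed_dim, hidden, batch_first=True)  # spatial structure per frame
        self.rnn2 = nn.LSTM(embed_dim, hidden, batch_first=True)  # frame-to-frame change (phase)

    def forward(self, e):                 # e: (batch, F, embed_dim)
        h1, _ = self.rnn1(e)              # (batch, F, hidden) for area/dim/RWT
        h2, _ = self.rnn2(e)              # (batch, F, hidden) for phase
        return h1, h2
```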

Final estimation. With the outputs of RNN modules, all the LV indices can be estimated with a linear regression/classification model:

$$\begin{aligned} {\left\{ \begin{array}{ll} \hat{y}_{t}^{s,f}=w_{t}h_{rnn1}^{s,f}+b_t, &{}where ~t\in \{area, dim, rwt\} \\ p(\hat{y}_{t}^{s,f}=0)=\frac{1}{1+\exp (w_{t}h_{rnn2}^{s,f}+b_t)}, &{} t=phase \end{array}\right. } \end{aligned}$$
(2)

where \(w_{t}\) and \(b_t\) are the weight and bias terms of the linear model for task t, 0 and 1 denote the two cardiac phases, diastole and systole, respectively, and \(p(\hat{y}_{phase}^{s,f}=1)=1-p(\hat{y}_{phase}^{s,f}=0)\). For the loss functions in (1), the Euclidean distance and cross-entropy are employed for the regression tasks and the classification task, respectively.

$$\begin{aligned} L_{t} = {\left\{ \begin{array}{ll} \frac{1}{2}\Vert \hat{y}_t^{s,f}-y_t^{s,f}\Vert _2^2, &{}where ~t\in \{area, dim, rwt\}\\ -\log p(\hat{y}_{t}^{s,f}=y_{t}^{s,f}), &{}t=phase \end{array}\right. } \end{aligned}$$
(3)
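The linear heads of (2) and the losses of (3) could be realized as follows, as a sketch under the same PyTorch-style assumptions; the output sizes follow the 2 areas, 3 dimensions, 6 RWT, and 1 phase label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Heads(nn.Module):
    """Linear estimation heads of Eq. (2)."""

    def __init__(self, hidden=100):
        super().__init__()
        self.area = nn.Linear(hidden, 2)
        self.dim = nn.Linear(hidden, 3)
        self.rwt = nn.Linear(hidden, 6)
        self.phase = nn.Linear(hidden, 1)   # sigmoid of this logit gives p(phase = systole), as in Eq. (2)

    def forward(self, h1, h2):              # h1, h2: (batch, F, hidden)
        return {"area": self.area(h1), "dim": self.dim(h1),
                "rwt": self.rwt(h1), "phase": self.phase(h2).squeeze(-1)}

def task_losses(pred, target):
    """Eq. (3): Euclidean loss for the regression tasks, cross-entropy for phase,
    each averaged over the S*F frames."""
    losses = {t: 0.5 * ((pred[t] - target[t]) ** 2).sum(dim=-1).mean()
              for t in ("area", "dim", "rwt")}
    # target["phase"] is 0 (diastole) / 1 (systole); binary cross-entropy with
    # logits is the two-class cross-entropy of Eq. (3).
    losses["phase"] = F.binary_cross_entropy_with_logits(pred["phase"], target["phase"].float())
    return losses
```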

2.2 Intra-task and Inter-task Relatedness

Significant correlations exist between the multiple outputs of each task and between those of different tasks; we refer to these as intra- and inter-task relatedness, respectively. Intra-task relatedness can be effectively modeled by the well-known group lasso regularization, while inter-task relatedness is modeled by three phase-guided constraints. Improved generalization can be achieved when both are fully leveraged in our FullLVNet.

Intra-task relatedness based on group lasso. Group lasso, also known as L1/L2 regularization, is well suited to modeling relatedness within groups of outputs, i.e., within each of the three regression tasks. It enforces common feature selection across related outputs with the L2 norm, and encourages sparse selection of the most relevant features for each task with the L1 norm. In this way, the relevant features of different tasks can be well disentangled. To leverage this advantage, group lasso is applied to the weight parameters of the three regression models in (2):

$$\begin{aligned} \mathcal {R}_{intra}=\sum _t \sum _i\Vert w_t(i)\Vert _2, ~for ~t\in \{area, dim, rwt\} \end{aligned}$$
(4)

where \(w_t(i)\) denotes the ith column of \(w_t\).
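A minimal sketch of the group lasso term in (4), applied to the weight matrices of the three regression heads from the sketch above (column-wise L2 norms, summed over columns and tasks):

```python
def group_lasso(heads):
    """Eq. (4): sum over tasks t and input units i of ||w_t(i)||_2, where
    w_t(i) is the i-th column of the weight matrix of head t."""
    reg = 0.0
    for t in ("area", "dim", "rwt"):
        w = getattr(heads, t).weight            # nn.Linear weight: (outputs, hidden)
        reg = reg + w.norm(p=2, dim=0).sum()    # L2 over outputs, L1 over columns
    return reg
```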

Inter-task relatedness based on phase-guided constraints. Three phase-guided constraints are proposed to model inter-task relatedness, i.e., the relatedness between the cardiac phase and the other LV indices. The cardiac phase indicates the temporal dynamics of the LV myocardium within a cardiac cycle, and the other LV indices change accordingly: (1) cavity area and LV dimensions increase in the diastole phase and decrease in the systole phase; (2) myocardium area and RWT decrease in the diastole phase and increase in the systole phase. Effectively modeling such intrinsic phase-guided relatedness ensures that the estimated LV indices are consistent with the temporal dynamics of the LV. To penalize violation of this inter-task relatedness, three phase-guided constraints are applied to the predicted areas, dimensions, and RWT:

$$\begin{aligned} \mathcal {R}_{inter}^{area}=\frac{1}{2S\times F}&\sum _{s,f}[\mathbbm {1}(y_{phase}^{s,f}=0)(\max (-z_{area}^{s,f,1},0)+\max (z_{area}^{s,f,2},0)) \nonumber \\&+\mathbbm {1}(y_{phase}^{s,f}=1)(\max (z_{area}^{s,f,1},0)+\max (-z_{area}^{s,f,2},0))] \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {R}_{inter}^{dim}=\frac{1}{S\times F}\sum _{s,f}[\mathbbm {1}(y_{phase}^{s,f}=0)\max (-\bar{z}_{dim}^{s,f},0)+ \mathbbm {1}(y_{phase}^{s,f}=1)\max (\bar{z}_{dim}^{s,f},0)] \end{aligned}$$
(6)
$$\begin{aligned} \mathcal {R}_{inter}^{rwt}=\frac{1}{S\times F}\sum _{s,f}[\mathbbm {1}(y_{phase}^{s,f}=0)\max (\bar{z}_{rwt}^{s,f},0)+ \mathbbm {1}(y_{phase}^{s,f}=1)\max (-\bar{z}_{rwt}^{s,f},0)] \end{aligned}$$
(7)

where \(\mathbbm {1}(\cdot )\) is the indicator function, \(z_t^{s,f}=\hat{y}_t^{s,f}-\hat{y}_t^{s,f-1}\) for \(t\in \{area,dim,rwt\}\), \(z_t^{s,f,i}\) denotes the ith output of \(z_t\), and \(\bar{z}_t\) denotes the average value of \(z_t\) across its multiple outputs. In total, our regularization term becomes

$$\begin{aligned} \mathcal {R}(W)=\lambda _1\mathcal {R}_{intra}+\lambda _2(\mathcal {R}_{inter}^{area}+\mathcal {R}_{inter}^{dim}+\mathcal {R}_{inter}^{rwt}) \end{aligned}$$
(8)
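A sketch of the phase-guided constraints (5)-(7) and the total regularization (8), reusing the group_lasso and Heads sketches above; the ordering of the two area outputs (cavity first, then myocardium) and the handling of the first frame (which has no predecessor) are assumptions.

```python
import torch

def phase_guided_reg(pred, phase_label):
    """Eqs. (5)-(7): penalize frame-to-frame changes of the estimated indices
    that contradict the labeled phase (0 = diastole, 1 = systole).

    pred[t]:     (batch, F, K_t) estimated indices; phase_label: (batch, F).
    """
    # frame-to-frame differences z_t^{s,f} = y_hat^{s,f} - y_hat^{s,f-1}
    z = {t: pred[t][:, 1:] - pred[t][:, :-1] for t in ("area", "dim", "rwt")}
    dia = (phase_label[:, 1:] == 0).float()          # diastole frames
    sys = (phase_label[:, 1:] == 1).float()          # systole frames

    cav, myo = z["area"][..., 0], z["area"][..., 1]  # assumed output ordering
    r_area = 0.5 * (dia * (torch.clamp(-cav, min=0) + torch.clamp(myo, min=0))
                    + sys * (torch.clamp(cav, min=0) + torch.clamp(-myo, min=0))).mean()

    zdim = z["dim"].mean(dim=-1)                      # average over the 3 dimensions
    r_dim = (dia * torch.clamp(-zdim, min=0) + sys * torch.clamp(zdim, min=0)).mean()

    zrwt = z["rwt"].mean(dim=-1)                      # average over the 6 RWT
    r_rwt = (dia * torch.clamp(zrwt, min=0) + sys * torch.clamp(-zrwt, min=0)).mean()

    return r_area + r_dim + r_rwt

def total_reg(heads, pred, phase_label, lam1=1.0, lam2=1.0):
    """Eq. (8): R(W) = lam1 * R_intra + lam2 * (R_area + R_dim + R_rwt)."""
    return lam1 * group_lasso(heads) + lam2 * phase_guided_reg(pred, phase_label)
```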

3 Dataset and Configurations

Our FullLVNet is validated on short-axis cardiac MR images of 145 subjects. The temporal resolution is 20 frames per cardiac cycle, resulting in a total of 2900 images in the dataset. The pixel spacings range from 0.6836 mm/pixel to 2.0833 mm/pixel, with a mode of 1.5625 mm/pixel. The ground truth values are computed from manually obtained contours of the LV myocardium. Within each subject, frames are labeled as either diastole or systole phase according to the obtained cavity area values. In our experiments, two landmarks, i.e., the junctions of the right ventricular wall with the left ventricle, are manually marked for each image to provide a reference for ROI cropping and the division of the LV myocardial segments. The cropped images are resized to \(80\times 80\). The network is implemented in Caffe with the SGD solver. Five-fold cross validation is employed for performance evaluation and comparison. Data augmentation is conducted by randomly cropping images of size \(75\times 75\) from the resized images.
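The following sketch illustrates, but is not the authors' exact pipeline for, the ROI cropping around the two landmarks, resizing to \(80\times 80\), and random \(75\times 75\) cropping for augmentation; the ROI size heuristic is an assumption.

```python
import numpy as np

def preprocess(image, landmarks, out_size=80):
    """Crop a square ROI centered between the two RV-LV junction landmarks,
    then resize to out_size x out_size. The ROI half-width heuristic
    (1.5x the landmark distance) is an assumption."""
    (r1, c1), (r2, c2) = landmarks
    cr, cc = int((r1 + r2) / 2.0), int((c1 + c2) / 2.0)
    half = int(1.5 * np.hypot(r1 - r2, c1 - c2))
    roi = image[max(0, cr - half): cr + half, max(0, cc - half): cc + half]
    # simple nearest-neighbour resize to avoid extra dependencies
    rows = np.linspace(0, roi.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, roi.shape[1] - 1, out_size).astype(int)
    return roi[np.ix_(rows, cols)]

def random_crop(image, crop=75):
    """Data augmentation: random 75x75 crop from the 80x80 resized image."""
    r = np.random.randint(0, image.shape[0] - crop + 1)
    c = np.random.randint(0, image.shape[1] - crop + 1)
    return image[r:r + crop, c:c + crop]
```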

Two-step training strategy. We apply a two-step strategy to train our network, to alleviate the difficulties caused by the different learning rates and loss functions in multitask learning [15, 16]. First, the CNN embedding, the first RNN module, and the three regression models are learned together with no back-propagation from the classification task, to obtain accurate predictions for the regression tasks; with the obtained CNN embedding, the second RNN module and the linear classification model are then learned while the rest of the network is kept frozen. As shown in the experiments, this strategy delivers excellent performance for all the considered tasks.
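A sketch of this two-step schedule, assuming the modules from the previous sketches and a plain SGD optimizer (the original uses Caffe's SGD solver; epoch counts and learning rate below are illustrative):

```python
import torch

def two_step_training(cnn, temporal, heads, loader, epochs=(30, 15), lr=0.01):
    """Step 1: train CNN + rnn1 + regression heads (phase branch untouched).
    Step 2: freeze them, train rnn2 + phase head only."""
    # ---- Step 1: regression tasks ----
    params1 = (list(cnn.parameters()) + list(temporal.rnn1.parameters())
               + list(heads.area.parameters()) + list(heads.dim.parameters())
               + list(heads.rwt.parameters()))
    opt1 = torch.optim.SGD(params1, lr=lr, momentum=0.9)
    for _ in range(epochs[0]):
        for x, target in loader:                       # x: (batch, F, 1, H, W)
            e = cnn(x.flatten(0, 1)).unflatten(0, x.shape[:2])
            pred = heads(*temporal(e))
            losses = task_losses(pred, target)
            loss = (losses["area"] + losses["dim"] + losses["rwt"]
                    + total_reg(heads, pred, target["phase"]))
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # ---- Step 2: phase classification, rest of the network frozen ----
    for p in params1:
        p.requires_grad_(False)
    params2 = list(temporal.rnn2.parameters()) + list(heads.phase.parameters())
    opt2 = torch.optim.SGD(params2, lr=lr, momentum=0.9)
    for _ in range(epochs[1]):
        for x, target in loader:
            with torch.no_grad():
                e = cnn(x.flatten(0, 1)).unflatten(0, x.shape[:2])
            loss = task_losses(heads(*temporal(e)), target)["phase"]
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```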

4 Results and Analysis

FullLVNet is extensively validated under the different configurations in Table 1. From the last column, we can see that FullLVNet successfully delivers accurate predictions for all the considered indices, with average Mean Absolute Errors (MAE) of 1.41 ± 0.72 mm, 2.68 ± 1.64 mm, and 190 ± 128 mm\(^2\) for RWT, dimensions, and areas, respectively. For reference, the maximums of these indices in our dataset are 24.4 mm, 81.0 mm, and 4936 mm\(^2\). The error rate (1 - accuracy) for phase identification is 10.4%. Besides, the effectiveness of intra- and inter-task relatedness is demonstrated by the results in the third and fourth columns: intra-task relatedness brings clear improvements for all the tasks, while inter-task relatedness brings a further moderate improvement. Compared to a recent direct multi-feature based method [18], which we adapt to our full quantification task, FullLVNet shows remarkable advantages even without intra- and inter-task relatedness.

Table 1. Performance of FullLVNet under different configurations (e.g., intra/N means only intra-task relatedness is included) and its competitor for LV quantification. Mean Absolute Error (MAE) is used for the three regression tasks and prediction error rate is used for the phase identification task.

5 Conclusions

We propose a multitask learning network, FullLVNet, for full quantification of the LV, which includes three regression tasks and one classification task. By taking advantage of expressive feature embeddings from a deep CNN and effective temporal dynamic modeling from RNNs, and by leveraging intra- and inter-task relatedness through group lasso regularization and phase-guided constraints, FullLVNet is capable of delivering state-of-the-art accuracy for all the considered tasks.