1 Introduction

Low-dose computed tomography (CT) reconstruction requires careful regularization design to control noise while preserving crucial image features. Traditional regularizers have been based on mathematical models like total variation, whereas newer methods are based on models that are learned from training data, especially regression neural network (NN) models. Deep convolutional NN (CNN) methods in an early stage map low- to high-quality images: specifically, they “denoise” the artifacts in the low-quality images obtained by applying some basic solvers to raw data or measurements. However, the greater mapping capability (i.e., higher the NN complexity) can increase the overfitting risks [15]. There exist several ways to prevent NNs from overfitting, e.g., increasing the dataset size, reducing the neural network complexity, and dropout. However, in solving large-scale inverse problems in imaging, the first scheme is limited in training CNNs from large-scale images; the second scheme does not effectively remove complicated noise features; and the third scheme has limited benefits when applied to convolutional layers.

An alternative way to regulate overfitting of regression CNNs in inverse imaging problems is combining them with model-based image reconstruction (MBIR) that considers imaging physics or image formation models, and noise statistics in measurements. BCD-Net [4] is an iterative regression CNN that generalizes a block coordinate descent (BCD) MBIR method using learned convolutional regularizers [5]. Each layer (or iteration) of BCD-Net consists of image denoising and MBIR modules. In particular, the denoising modules use layer-wise regression CNNs to effectively remove layer-dependent noise features. Many existing works can be viewed as a special case of BCD-Net. For example, RED [11] and MoDL [1] are special cases of BCD-Net, because they use identical image denoising modules across layers or only consider quadratic data-fidelity terms (e.g., the first term in (P1)) in their MBIR modules.

This paper modifies BCD-Net that uses convolutional autoencoders in its denoising modules [4], and applies the modified BCD-Net to low-dose CT reconstruction. First, for fast CT reconstruction, we apply the Accelerated Proximal Gradient method using a Majorizer (APG-M), e.g., FISTA [2], to MBIR modules using the statistical CT data-fidelity term. Second, this paper provides the sequence convergence guarantee of BCD-Net when applied to low-dose CT reconstruction. Third, it investigates the generalization capability of BCD-Net for low-dose CT reconstruction, compared to a state-of-the-art deep (non-iterative) regression NN, FBPConvNet [8]. Numerical results with the extended cardiac-torso (XCAT) phantom show that applying faster numerical solvers (e.g., APG-M) to MBIR modules leads to faster and more accurate BCD-Net; regardless of numerical solvers of MBIR modules, BCD-Net significantly improves the reconstruction accuracy, compared to the state-of-the-art MBIR method using learned transforms [15]; given identical denoising CNN architectures, BCD-Net achieves better image quality, compared to a state-of-the-art iterative NN architecture, ADMM-Net [14]. Numerical results with clinical data show that BCD-Net generalizes significantly better than FBPConvNet [8] that lacks MBIR modules.

figure a

2 BCD-Net for Low-Dose CT Reconstruction

2.1 Architecture

This section modifies the architecture of BCD-Net in [4] for CT reconstruction. For the image denoising modules, we use layer-wise autoencoding CNNs that apply exponential function to trainable thresholding parameters. (The trainable thresholding parameters replace the bias terms, since biases can differ greatly for different objects in imaging problems.) The layer-wise denoising CNNs are particularly useful to remove layer-dependent artifacts in reconstructed images at the previous layers, without greatly increasing their parameter dimensions. In low-dose CT reconstruction, for example, the CNNs at the early and later layers remove streak artifacts and Gaussian-like noise, respectively. MBIR modules aim to regularize overfitting artifacts by combining information drawn from the data-fidelity term and output of denoising modules. Different from the image denoising and single-coil magnetic resonance imaging applications in [4], the MBIR modules of CT reconstruction BCD-Net involve iterative solvers. For fast CT reconstruction in particular, we apply a fast numerical solver, APG-M, to the MBIR modules. Algorithm 1 shows the architecture of the modified BCD-Net for CT reconstruction.

Image Denoising Module. For the \(l\mathrm {th}\) layer image denoising module, we use a convolutional autoencoder in the following form:

$$\begin{aligned} \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} (\cdot ) = \frac{1}{R} \sum _{k=1}^{K} {\mathbf {d}}_k^{(l+1)} \circledast \mathcal {T}_{\exp (\alpha _k^{(l+1)})} ( {\mathbf {e}}_k^{(l+1)} \circledast (\cdot ) ), \end{aligned}$$
(1)

where \(\varvec{{ \uptheta }}^{(l+1)} := \{ {\mathbf {d}}_k^{(l+1)}, {\mathbf {e}}_k^{(l+1)}, \alpha _k^{(l+1)} : k \,=\, 1,\ldots ,K, l \,=\, 1,\ldots ,L \}\) is a parameter set of the \(l\mathrm {th}\) convolutional autoencoder, \({\mathbf {d}}_k^{(l+1)} \in \mathbb {R}^R\), \({\mathbf {e}}_k^{(l+1)} \in \mathbb {R}^R\), and \(\exp (\alpha _k^{(l+1)})\) are the \(k\mathrm {th}\) decoding and encoding filters, and thresholding value at the \(l\mathrm {th}\) layer, respectively, the convolution operator \(\circledast \) uses the circulant boundary condition without the filter flip, \(\mathcal {T}_{a}(\cdot )\) is the soft-thresholding operator with the thresholding parameter a, R and K are the size and number of the filters, respectively, and L is the number of layers in BCD-Net. Different from the original convolutional autoencoder in [4], we included the exponential function \(\exp (\cdot )\) to prevent the thresholding parameters \(\{ \alpha _k \}\) from becoming zero during training [6]. The factor 1/R comes from the relation between convolution-perspective and patch-based trainings [6]. By applying the trained convolutional autoencoder in (1) to the \(l\mathrm {th}\) layer input \({\mathbf {x}}^{(l)}\) (i.e., reconstructed image at the \((l-1)\mathrm {th}\) layer), we obtain the “denoised” image \({\mathbf {z}}^{(l+1)} = \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} ({\mathbf {x}}^{(l)})\). We next feed \({\mathbf {z}}^{(l+1)}\) into the \(l\mathrm {th}\) layer MBIR module.

MBIR Module. The \(l\mathrm {th}\) layer MBIR module uses the \(l\mathrm {th}\) layer denoised image \({\mathbf {z}}^{(l+1)}\), and reconstructs an image \({\mathbf {x}} \in \mathbb {R}^{N}\) from post-log measurement \({\mathbf {y}} \in \mathbb {R}^{M}\) by solving the following statistical MBIR problem:

$$\begin{aligned} {\mathbf {x}}^{(l+1)} = \mathop {\mathrm {argmin}}\limits _{{\mathbf {x}} \succeq \mathbf {0}} F({\mathbf {x}}; {\mathbf {y}}, {\mathbf {z}}^{(l+1)}) :=\frac{1}{2}\Vert {\mathbf {y}} - {\mathbf {A}} {\mathbf {x}}\Vert ^2_{{\mathbf {W}}} + \frac{\beta }{2} \Vert {\mathbf {x}} - {\mathbf {z}}^{(l+1)}\Vert ^2_2, \end{aligned}$$
(P1)

where \({\mathbf {A}} \in \mathbb {R}^{ M \times N}\) is a CT scan system matrix, \({\mathbf {W}} \in \mathbb {R}^{M \times M}\) is a diagonal weighting matrix with elements \(\{ W_{m,m} \,=\, \rho _m^2/(\rho _m + \sigma ^2) \!:\! \forall m \}\) based on a Poisson-Gaussian model for the pre-log measurements \(\varvec{{ \uprho }}\in \mathbb {R}^M\) with electronic readout noise variance \(\sigma ^2\) [15], and \(\beta >0\) is a regularization parameter. To rapidly solve (P1), we apply APG-M, a generalized version of APG (e.g., FISTA [2]) that uses M-Lipschitz continuous gradients [5]. Initialized with \({\mathbf {v}}^{(0)} = \bar{{\mathbf {x}}}^{(0)} = {\mathbf {x}}^{(l)}\) and \(t_0 = 1\), the APG-M updates are

$$\begin{aligned} \bar{{\mathbf {x}}}^{(j+1)}&= \Big [{\mathbf {v}}^{(j)} + {\mathbf {M}}^{-1} \big ( \mathbf {A}^T \mathbf {W}(\mathbf {y}- \mathbf {A}{\mathbf {v}}^{(j)}) - \beta ({\mathbf {v}}^{(j)} - \mathbf {{\mathbf {z}}}^{(l+1)}) \big ) \Big ]_+, \end{aligned}$$
(2)
$$\begin{aligned} {\mathbf {v}}^{(j+1)}&= \bar{{\mathbf {x}}}^{(j+1)}+\frac{t_{j} -1 }{t_{j+1}} ( \bar{{\mathbf {x}}}^{(j+1)} - \bar{{\mathbf {x}}}^{(j)}), \quad \hbox {where} \ t_{j+1} = (1+\sqrt{1+4t_j^2}) / 2, \end{aligned}$$
(3)

for \( j = 0,\ldots ,J-1\), where the operator \([\cdot ]_+\) is the proximal operator obtained by considering the non-negativity constraint in (P1) and clips the negative values, and J is the number of APG-M iterations. We design the diagonal majorizer \({\mathbf {M}} \in \mathbb {R}^{N \times N}\) in (2) as follows [5]: \({\mathbf {M}} = \mathrm {diag}( {\mathbf {A}}^{T}{\mathbf {W}}{\mathbf {A}} \mathbf {1} ) + \beta \mathbf {I}\succeq \nabla ^2 F({\mathbf {x}}; {\mathbf {y}}, {\mathbf {z}}^{(l+1)}) = {\mathbf {A}}^{T}{\mathbf {W}}{\mathbf {A}} + \beta \mathbf {I}\), where \(\mathrm {diag}(\cdot )\) converts a vector into a diagonal matrix. The \(l\mathrm {th}\) layer reconstructed image \({\mathbf {x}}^{(l+1)}\) is given by the \(J\mathrm {th}\) APG-M update, i.e., \({\mathbf {x}}^{(l+1)} = \bar{{\mathbf {x}}}^{(J)}\), and fed into the next BCD-Net layer as an input.

figure b

2.2 Training BCD-Net

This section proposes a BCD-Net training framework for CT reconstruction, based on the image denoising and MBIR modules defined in the previous section. The training process requires I high-quality training images, \(\{ {\mathbf {x}}_i : i = 1,\ldots ,I \}\), and I training measurements simulated via CT physics, \(\{ {\mathbf {y}}_i : i = 1,\ldots ,I \}\). Algorithm 2 summarizes the training framework.

At the \(l\mathrm {th}\) layer, we optimize the parameters \(\varvec{{ \uptheta }}^{(l+1)}\) of the \(l\mathrm {th}\) convolutional autoencoder in (1) from I training pairs \(( {\mathbf {x}}_i, {\mathbf {x}}_i^{(l)})\), where \({\mathbf {x}}_i^{(l)}\) is the \(i\mathrm {th}\) reconstructed training image at the \((l-1)\mathrm {th}\) layer. Our patch-based training loss function at the \(l\mathrm {th}\) layer is

$$\begin{aligned} \varvec{{ \uptheta }}^{(l+1)} = \mathop {\mathrm {argmin}}\limits _{\{ {\mathbf {D}}, \varvec{\alpha }, {\mathbf {E}} \}} \frac{1}{R \tilde{P}}\Vert \widetilde{{\mathbf {X}}} - {\mathbf {D}} \mathcal {T}_{\exp (\varvec{\alpha })} ( {\mathbf {E}}^T \widetilde{{\mathbf {X}}}^{(l)} ) \Vert _F^2, \end{aligned}$$
(P2)

where encoding and decoding filter matrices \({\mathbf {D}} \in \mathbb {R}^{R\times K}\) and \({\mathbf {E}} \in \mathbb {R}^{R\times K}\) are formed by grouping K filters as \({\mathbf {D}} : = [{\mathbf {d}}_1, \ldots , {\mathbf {d}}_K]\) and \({\mathbf {E}} : = [{\mathbf {e}}_1, \ldots , {\mathbf {e}}_K]\), respectively, \(\varvec{\alpha }\in \mathbb {R}^{K}\) is a vector containing K thresholding values, \(\tilde{P}\) is the number of patches extracted from all training images, and \(\widetilde{{\mathbf {X}}} \in \mathbb {R}^{R \times \tilde{P}}\) and \(\widetilde{{\mathbf {X}}}^{(l)} \in \mathbb {R}^{R \times \tilde{P}}\) are paired training matrices whose columns are vectorized patches extracted from \(\{ {\mathbf {x}}_i : \forall i \}\) and \(\{ {\mathbf {x}}_i^{(l)} : \forall i \}\), respectively. The soft thresholding operator \(\mathcal {T}_{{\mathbf {a}}}({\mathbf {u}}) : \mathbb {R}^{K} \rightarrow \mathbb {R}^{K}\) is defined as follows: \(( \mathcal {T}_{{\mathbf {a}}} ({\mathbf {u}}) )_{k}\) equals to \(u_{k} - a_{k} \mathrm {sign}(u_{k})\) for \( | u_{k} | > a_{k}\), and 0 otherwise, \(\forall k\). We optimize (P2) via a mini-batch stochastic gradient method.

2.3 Convergence Analysis

There exist two key challenges in understanding the convergence behavior of BCD-Net in Algorithm 1: (1) (general) denoising NNs \(\{ \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} \}\) change across layers; (2) even if they are identical across layers, they are not necessarily nonexpansive operators [10] in practice. To moderate these issues, we introduce a new definition:

Definition 1

(Asymptotically nonexpansive paired operators [6]). Paired operators \(\{ \mathcal {D}_{\varvec{{ \uptheta }}^{(l)}}, \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} \}\) are asymptotically nonexpansive if there exist a summable sequence \(\{ \epsilon ^{(l+1)} \in [0,\infty ) : \sum _{l=0}^\infty \epsilon ^{(l+1)} < \infty \}\) such that

$$\begin{aligned} \left\| \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} ({\mathbf {u}}) - \mathcal {D}_{\varvec{{ \uptheta }}^{(l)}} ({\mathbf {v}}) \right\| _2^2 \le \Vert {\mathbf {u}} - {\mathbf {v}} \Vert _2^2 + \epsilon ^{(l+1)}, \qquad {\forall {\mathbf {u}},{\mathbf {v}}} \ \textit{and} \ \forall l. \end{aligned}$$

Based on Definition 1, we obtain the following convergence result for Algorithm 1:

Theorem 2

(Sequence convergence). Assume that paired denoising neural networks \(\{ \mathcal {D}_{\varvec{{ \uptheta }}^{(l)}}, \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} \}\) are asymptotically nonexpansive with the summable sequence \(\{ \epsilon ^{(l+1)} \in [0,\infty ) : \sum _{l=1}^\infty \epsilon ^{(l+1)} < \infty \}\) and \({\mathbf {A}}^T {\mathbf {W}} {\mathbf {A}} \succ 0\). Then the sequence \(\{ {\mathbf {x}}^{(l+1)} : l \ge 0 \}\) generated by Algorithm 1 (disregarding the non-negativity constraints in the MBIR optimization problems (P1)) is convergent.

Theorem 2 implies that if denoising neural networks \(\{ \mathcal {D}_{\varvec{{ \uptheta }}^{(l)}} : l \ge 1 \}\) converge to a nonexpansive one, BCD-Net guarantees the sequence convergence. Figure S.1 shows the convergence behaviors of \(\{ \mathcal {D}_{\varvec{{ \uptheta }}^{(l+1)}} \}\) and their Lipschitz constants.

2.4 Computational Complexity

The computational cost of the proposed BCD-Net is \(O((MJ+RK)NL)\). Since \(MJ \gg RK\), the computational complexity of BCD-Net is dominated by forward and back projections performed in the MBIR modules. To reduce the MJ factor, one can investigate faster optimization methods (e.g., proximal optimized gradient method (POGM) [13]) with ordered subsets. Applying these techniques can reduce the MJ factor to \((M/G) J'\), where G is the number of subsets and the number of POGM iterations \(J' < J\) (e.g., \(J' = (1/\sqrt{2}) J\)) due to faster convergence rates of POGM over APG.

3 Experimental Results and Discussion

3.1 Experimental Setup

Imaging. For XCAT phantom images [12] and reconstructed clinical images in [15], we simulated sinograms of size \(888 \,\times \,984\) (detectors\(\,\times \,\)projection views) with GE LightSpeed fan-beam geometry corresponding to a monoenergetic source with \(\rho _0 \,=\, 10^4\) incident photons per ray and electronic noise variance \(\sigma ^2 \,=\, 5^2\) [15] (while avoiding inverse crimes). We reconstructed \(420 \,\times \, 420\) images with pixel-size \(\varDelta _x \,=\, \varDelta _y \,=\, 0.9766\) mm. For the clinical data collected from the GE scanner using the CT geometry above, and tube voltage 120 kVp and current 160 mA, we reconstructed a \(716 \,\times \, 716\) image (shown in the third row of Fig. 2) with \(\varDelta _x \,=\, \varDelta _y \,=\, 0.9777\) mm.

Training BCD-Net, ADMM-Net, and FBPConvNet. Based on the proposed framework in Sect. 2.2, we trained 100-layer BCD-Nets with the two parameter sets, \(\{ K \,=\, R \,=\, 8^2 \}\) and \(\{ K \, = \, 10^2, R \,=\, 8^2 \}\), and the regularization parameter \(\beta \,=\, 4 \,\times \, 10^6\). In particular, we solved (P2) with Adam [9] and \(\tilde{P} \,\approx \, 1.7 \,\times \, 10^6\) training patches that were extracted from ten training images of the XCAT phantom [12]. We used the mini-batch size 512, 200 epochs, initial learning rates \(10^{-3}\), and \(10^{-2}\) for \(\{ {\mathbf {D}}^{(l)}, {\mathbf {E}}^{(l)} : \forall l \}\) and \(\{ \varvec{\alpha }^{(l)} : \forall l \}\), and random i.i.d. Gaussian filter initialization. We decayed the learning rates by a factor of 0.9 every 10 epochs. We trained a 100-layer ADMM-Net that uses the layer-wise denoising NNs (1) with \( K \,=\, R \,=\, 8^2\), with the identical training setup above. We chose the ADMM penalty parameter as \(1 \,\times \,10^6\), by matching the spatial resolution in the heart region of test sample \(\#1\) to that reconstructed by BCD-Net. We trained FBPConvNet with 500 2D XCAT phantom images and the similar parameters suggested in [8].

Image Reconstruction. We compared trained BCD-Nets with the conventional MBIR method using an edge-preserving (EP) regularizer, the state-of-the-art MBIR method using \(\ell _2\) prior with a learned square transform [15], a state-of-the-art iterative NN architecture, ADMM-Net [14] (i.e., plug-and-play ADMM [3] using denoising NNs), and/or a state-of-the-art (non-iterative) deep regression NN, FBPConvNet [8]. For the first two MBIR methods, we finely tuned their parameters to give the lowest root-mean-square-error (RMSE) values [5]. (See their parameter details in Section S.2). We tested the aforementioned methods to two sets of three representative chest CT images that are selected from the XCAT phantom and clinical data provided by GE. (Note that the testing phantom images are sufficiently different from training phantom images; specifically, they are \({\approx }\)2 cm away from training images.) We quantitatively evaluated the quality of phantom reconstructions by RMSE (in Hounsfield units, HU) in a region of interest [15].

Fig. 1.
figure 1

RMSE convergence of BCD-Nets using different MBIR modules for low-dose CT reconstruction (for the first testing image in Table 1).

3.2 Results and Discussion

Convergence of BCD-Net with Different MBIR Modules. Applying faster iterative solvers to MBIR modules leads to faster and more accurate BCD-Net. This assertion is supported by comparing the APG-M and PG-M results in Fig. 1 (given the identical iteration numbers), and noting that APG-M is faster than PG-M (i.e., APG-M using no “momentum”, \(\bar{{\mathbf {x}}}^{(j+1)} - \bar{{\mathbf {x}}}^{(j)}\) in (3)). In addition, Fig. 1 shows that increasing the number of iterations in numerical MBIR solvers leads more accurate BCD-Net, given the identical numbers of BCD-Net layers. This implies that numerical MBIR solvers using insufficient number of iterations do not fully extract “desired” information from CT data-fidelity (i.e., the first term in (P1)). The importance of using rapidly converging MBIR solvers is underestimated in existing literature: existing literature often considers some applications that have practical and closed-form MBIR solution [4].

Reconstruction Quality Comparisons. For all the testing phantom and clinical images, the proposed BCD-Nets significantly improve the low-dose CT reconstruction accuracy, compared to the conventional MBIR method using EP and/or the state-of-the-art MBIR method using \(\ell _2\) prior with a learned transform [15]. For all the testing phantom images, BCD-Net consistently achieves better reconstruction quality than ADMM-Net. See Table 1, Fig. 2 & S.2, and Section S.3. In particular, BCD-Net accomplishes the both benefits of EP and image denoising (see Fig. S.2); and increasing the number of filters and thresholding parameters improves its reconstruction performance (see Table 1).

Table 1. RMSE (HU) of three reconstructed XCAT phantom images with different MBIR methods for low-dose CT\(^\mathrm{a}\) (\(\rho _0 = 10^4\) incident photons)
Fig. 2.
figure 2

Comparison of three reconstructed clinical images from different reconstruction methods for low-dose CT (images are magnified to better show differences; display window [800, 1200] HU).

Generalization Capability Comparisons. The proposed BCD-Net has significantly better generalization capability than a state-of-the-art (non-iterative) deep regression NN, FBPConvNet [8]. Clinical scan experiments in Fig. 2 indicate that deep regression NNs, e.g., FBPConvNet, can have high overfitting risks, while our proposed BCD-Net has low overfitting risks, and gives more stable reconstruction. These show that MBIR modules benefit regularizing overfitting artifacts of regression NNs.

The BCD-Net result in the second row of Fig. 2 shows non-uniform spatial resolution or noise; see blurry artifacts particularly around the center of the reconstructed image. One can reduce such blurs by including the technique of controlling local spatial resolution or noise in the reconstructed images [7] to MBIR modules.

4 Conclusions

The proposed BCD-Net uses layer-wise autoencoding CNNs and achieves significantly more accurate low-dose CT reconstruction, compared to the state-of-the-art MBIR method using a learned transform [15]. BCD-Net provides better reconstruction quality, compared to a state-of-the-art iterative NN, ADMM-Net [14]. Taking both benefits of MBIR and low-complexity CNN (i.e., convolutional autoencoder), BCD-Net significantly improves the generalization capability, compared to a state-of-the-art (non-iterative) deep regression NN, FBPConvNet [8]. In addition, applying faster numerical solvers, e.g., APG-M, to MBIR modules leads to faster and more accurate BCD-Net, and those with sufficient iterations can lead to the sequence convergence. Future work will explore BCD-Net with local spatial resolution controls [7], to reduce blur around the center of reconstructed images.