1 Introduction

With the emergence of artificial intelligence (AI) methodologies in medical image processing, especially from 2015 onward, machine learning models have become increasingly capable of supporting medical personnel not only in their daily work to recognize and categorize tissue and body parts according to specific needs such as semantic segmentation and gleason score (GS) prediction for the prostate, but also for training of less experienced radiologists for educational purposes that study prostate anatomy, cancer aggressiveness and its characteristic features [1]. Current machine learning (ML)-based methods [2,3,4,5,6,7,8,9,10,11,12,13,14,15] for fully automated segmentation of the prostate achieve dice score (DSC) values as accurate as a radiologist of about 92% for the whole prostate gland [2]. However, the results are hard to compare due to the usage of different performance metrics such as DSC (individual slices or 3D-volumes) and different datasets or training- and augmentation techniques. For example, [3] used 19 \(\textrm{T}_{2}\)-weighted image (\(\textrm{T}_{2}\)WI) exams while [2] used data from 188 patients of the PROSTATEx challenge [16].

Prostate segmentation on magnetic resonance imaging (MRI) scans can be done manually or semi-automatically utilizing computer vision tools. This labor-intensive process has led to efforts to fully automate it using AI algorithms. Manual segmentation, typically performed slice-by-slice on axial \(\textrm{T}_{2}\)WI, is the common approach but is time-consuming and is affected by the radiologist’s segmentation ability [17,18,19].

Fig. 1
figure 1

Cascaded-UNetSlim (Casc-UNetS). Multi-label semantic segmentation of a single MRI slice into the classes whole prostate gland (WG) (yellow), prostate central gland (CG) (blue), peripheral zone (PZ) (green) and prostate cancer (PCa) (red). The WG is labeled but is superimposed by the PZ and the CG, hence, it is almost hidden

In this study, we introduce:

  1. 1.

    A new fully automated cascaded network configuration called Casc-UNetS based on a parameter-reduced version of UNet (convolutional neural network (CNN) for biomedical image segmentation) [20] and \(\textrm{T}_{2}\)WI only that is inspired by existing methods [6, 8, 9] and computer vision algorithms (i.e., image moments, contours, contrast limited adaptive histogram equalisation (CLAHE), etc.) that is capable of multi-label semantic segmentation as shown in Fig. 1. Trained on two tiny datasets, the first stage contours the WG. The second stage segments the PZ and CG and the third stage outlines PCa which most frequently appears in the PZ [21] as shown in Figs. 1 and 2. To incorporate expert knowledge (e.g. the prostate WG and the other associated regions) the model uses a filtering layer and a-priori feature-map concatenations between subsequent stages to focus while training to the individual region of interest (ROI).

  2. 2.

    A new multiparametric MRI (mp-MRI) prostate dataset called UDE-prostate (UDEp) whose raw data was collected and segmented by our partner at the university Duisburg-Essen. It contains exams in the DICOMFootnote 1 format [22] and segmentation labels in the MITKFootnote 2 format [23] for apparent diffusion coeffient (ADC), \(\textrm{T}_{2}\)WI and dynamic contrast enhanced (DCE) with 37, 29 and 30 exams respectively. For 22 of the exams all the aforementioned modalities are available.

  3. 3.

    In order to provide our network with a larger dataset we combine our data with the freely available I2CVBFootnote 3 dataset with only 29 and 19 exams and produce a third one with 48 exams. Initially both datasets appear to be incompatible since the I2CVB contains the CG class which is missing in the UDEp dataset. Therefore, we present a sophisticated denoising and approximation method to automatically generate this class for the UDEp dataset which produces good CG contours even on imprecise ground truth labelling. Based on our knowledge, this is the first research to combine two small prostate datasets.

  4. 4.

    Casc-UNetS is (statistically) evaluated using the DSC on the three datasets with different configurations (e.g. image augmentation, filtering layers, etc.) to find out how the model changes due to additional data and training variations using the DSC for the prostate regions base, midgland, apex as well as the complete prostate.

  5. 5.

    We open-source the UDEp dataset and Casc-UNetS.

2 Related works

Comprehensive meta analysis on semantic segmentation of the prostate and PCa detection are given in [1, 4, 24]. While [4] focuses on algorithms drawn from standard ML-techniques for computer-aided detection and diagnosis (CAD), [24] review ML- and DL-approaches on MRI and [1] outlines also both but with the main perspective on DL-based solutions. Semantic segmentation of the prostate and PCa detection can be divided two general pillars. The first one is the imaging technique which is most frequently either ultrasound (US) or MRI with its different modalities (e.g. ADC, DCE, diffusion weighted imaging (DWI), magnetic resonance spectroscopy imaging (MRSI), \(\textrm{T}_{2}\)WI, etc.). While mono parametric approaches only use a single modality, biparametric MRI (bp-MRI) and mp-MRI uses multiple MRI techniques (e.g. ADC, DCE and \(\textrm{T}_{2}\)WI). Hence, a dataset has to be at hand that matches the foreseen approach. The second pillar is the ML-method which ranges from standard algorithms such as k-means-clustering (k-means), support vector machine (SVM), ensemble classifier (EC), multi layer perceptron (MLP) and many others to DL-based CNNs with modifications that originate in transformers [25]. The CNNs used in [2, 3, 6,7,8, 26] are encoder-decoder networks, some with skip connections between the two and residual connections in the encoder. After that the standard ML-pipeline for validation, performance evaluation and visualization follows.

Alkadi et al. [3, 9] use the VGG16 architecture and propose a 2.5D method realized by a 2D multi channel input to incorporate volumetric information for the semantic segmentation of the prostate into the four different classes PCa, non-prostate, PZ and CG. The authors showed that their pseudo-volume method is superior to a single slice input in the I2CVB dataset. Zhu et al. [6] propose a method to segment the prostate into the WG and the PZ. Their dataset holds bp-MRI consisting of 163 exams on which they evaluate a three-stage method. Firstly, they select the WG as the ROI by utilizing k-means and other methods on a DWI slice to obtain an approximate segmentation of the prostate. This segmentation is then transferred to the corresponding \(\textrm{T}_{2}\)WI to preselect the prostate area by cropping it which serves as an input to the first UNet that extracts the WG. Thirdly, the WG is fed into the second UNet to identify the PZ region. Aldoj et al. [2] used a CNN that is inspired by DenseNet and UNet to automatically segment the prostate into WG, the CZ and the PZ and use data from 188 patients from the ProstateX challenge with in-house segmented masks. The authors found an increase in performance which they claim results from the UNet architecture in combination with dense blocks. Nai et al. [7] compared and evaluated the usage of mono- and multimodal algorithms for the segmentation of mp-MRI data of 160 patients. While their multimodal network created better boundary segmentations of the CG and PZ, the WG segmentations did not improve. Duran et al. [8] introduced a multi-class attention network to segment the PZ and PZ lesions with additional GS group grading of the same using a bp-MRI approach utilizing \(\textrm{T}_{2}\)WI and ADC of their in-house dataset with 98 subjects. The network is inspired by UNet and consists of a single encoding path and two decoding paths that share feature-maps. The first one segments the PZ and the second one segments PZ lesions and grades them according to five different GS groups. Hung et al. [10] investigate a transformer inspired cross-slice attention module called (CAT) to learn inter-slice dependencies at different levels. This mechanism is designed to be used in networks with shortcut connections between the encoding and decoding modules. The authors use their CAT module inside a nnUNet and nnUNet++ to segment the TZ and PZ and experimentally showed on an in-house and the ProstateX dataset that meaningful data can be learned from adjacent \(\textrm{T}_{2}\)WI slices. Isaksson et al. [25] compare two proprietary segmentation software systems against four DL-architectures including V-net which is a 3D version of standard UNet and two different 2D UNets to segment the prostate WG on a dataset with 100 patients. While their EfficientDet reached the highest DSC they argue that UNet still performs very well on small datasets. Additionally, they investigated whether clinical medical parameters have an effect on the segmentation, however, they did not find impactful correlations. All models were trained on two dataset splits on \(\textrm{T}_{2}\)WI and evaluated. Compared to the established medical software tools the DL-networks achieved better results.

Cascaded architectures are used in [6, 8, 27, 28], however, [6] uses DWI and \(\textrm{T}_{2}\)WI to segment the WG and the PZ only, [27] and [28] use it for image registration and localization in ADC and \(\textrm{T}_{2}\)WI and [8] uses it for PZ segmentation for PZ GS group grading.

3 Methods

This section introduces the pipeline used to process the datasets for this study. The CNN architecture is based on UNet [20], which was originally developed for biomedical image segmentation. We use the \(\textrm{T}_{2}\)WI for training and evaluation from our new open-source dataset UDEp (university Duisburg-Essen review board approval, 19-TEMP579281-BO) and from the openly available I2CVB dataset [4]. This modality is optimal to study the anatomy of the prostate and distinguish between the PZ with high-signal intensities and the CG with rather low signal-intensities [4, 29].

The pipeline consists of standard computer vision algorithms and DL-based methods. The \(\textrm{T}_{2}\)WI are preprocessed using histogram equalization. In order to obtain more data we join both datasets. Before merging the datasets the CG-class is created in the UDEp dataset via our generation algorithm to obtain the same amount of classes in both datasets and remove artifacts. After that the data is fed into our cascaded network architecture. Finally, the output is evaluated.

3.1 Model

The cascaded architecture Casc-UNetS is a concatenation of individual UNetSlim (UNetS) (c.f. Fig. 2) and standard computer vision processing. UNetS1 is used to predict the WG, UNetS2 the PZ and CG and finally UNetS3 the PCa class. We also integrate volumetric information to the input of the network by using the sliding window of [3, 9] and omitting the need to resample the MRI to coherent volumes.

Following the initial stage, the expert knowledge of the prostate WG location in \(\textrm{T}_{2}\)WI is employed to eliminate artifacts that may be erroneously generated by UNetS1. This optimization is achieved through the incorporation of a filter layer between UNetS1 and UNetS2, as follows: Calculate the MRI center point. Find the contours of all predicted WGs and calculate their area and center points using the contour moments as described in Sect. 3.4. Calculate the distance of each contour’s center point to the MRI center. If the contour has at least 5% (which was found to generate highest metric results on the trainset) of the possible maximum WG contour area, then select it as the correct WG, otherwise, choose the contour with largest area. Sometimes the selected prostate contour is very erratic, thus, the convex hull can optionally be calculated to smoothen the proposed outer contour. This option is only used to calculate the inputs to UNetS2 for the selected contour since this information is solely used to preselect the ROI.

Fig. 2
figure 2

Casc-UNetS. The prediction of the individual classes happen in stages. UNetS1 predicts the WG class. UNetS2 predicts the PZ and CG and UNetS3 the PCa class. The input to the following UNetS stage is always a concatenation of the original input In and the previous prediction results (O1 or O2) without the background (BG) class. The number of layers are indicated below the boxes

Feature-map concatenation is used to preselect the ROI which is based on the expert knowledge of the location of the prostate zones PZ and CG which both lie within the WG. This is done by providing UNetS2 with the WG segmentation and UNetS3 with the WG, PZ and CG from the previous stages. The idea of providing additional information to the network is tested with the I2CVB-dataset to predict the PZ and CG classes without prior hyperparameter tuning.

Fig. 3
figure 3

UNetS architecture. The input channels and the channels \(a\) of OutConv differ depending on the stage of Casc-UNetS. Image height and width \(I={320} \,\text{pixel}\)

The general UNet architecture is modified and in the following referred to as UNetS—a parameter reduced UNet (c.f. Fig. 3). The model can be divided into two basic blocks, namely a down- and an up-block. The down-block consists of a 2D-convolution layer followed by batch normalization [30] and the rectified linear unit (ReLU). This sequence is repeated twice and followed by a 2D max-pooling layer except for the bottleneck. The up-block is composed by 2D up-scaling using the bilinear transformation, followed by feature-map concatenation of the corresponding down-block and again two sequences of a 2D-convolution layer followed by batch normalization and ReLU.

Compared to standard UNet as described by [20] UNetS uses the following modifications: a) Batch normalization is used after each 2D convolution layer. b) The deepest stage with 1024 feature map channels is omitted to reduce the amount of parameters by more than 86% from about \({{\mathrm{31.04}}\times {10}^{6}}\) parameter to approximately \(\mathrm{4.28}\times \textrm{10}^{6}\) parameter which reduces the chance of overfitting due to the small amount of image data and vanishing gradient [31]. The final output convolution maps the channels of the last up-block to the desired number of prediction class channels. c) As an activation layer softmax is used instead of a combination with the same and cross-entropy loss. d) Up-sampling is achieved using the bilinear transform as opposed to transpose convolution in classical UNet. e) In contrast to a single image input we utilize the three consecutive MRI slice input to make use of inter-slice spacial information and mimic the behaviour of a radiologist when analyzing images. This method was shown to be superior in contrast to a single slice input, c.f. [3, 9, 10]. f) To incorporate expert knowledge of the location of the prostate into our network we equip UNetS1 with a filtering layer to remove artifacts. The input to UNetS2 and 3 is extended by feature-map concatenations over the previous segmentation result(s) on the current image.

3.2 Training

Hyperparameter tuning is performed for each dataset and stage using grid search [31]. The data is split into training, validation, and testing sets with ratios of 50 %, 20 % and 30 % respectively. We train our model on the training set and use the validation set for evaluation. The best hyperparameters are then tested on unseen testing data. For Casc-UNetS, tuning is done separately for each stage. For UNetS1, hyperparameters are tuned with and without CLAHE. For UNetS2 and 3, only CLAHE MRI are used due to improved validation losses.

Ma et al. [32] found in their statistical evaluation of different loss functions for medical imaging datasets that dice loss (DLoss) performs well with only a diminishing margin to the top performers, hence, the model is trained using \(\text {DLoss} = 1 - {\text {DSC}}\) which is calculated for a batch of inputs over the prediction classes. For optimization, adaptive moment estimation (Adam) [33] is used with a step-size decay scheduler with \(\gamma = 0.1\) and a step-size of 30 epochs.

3.3 Datasets

High quality prostate datasets with precise contouring are scarce. Hence, to create more training data, we combine the two small datasets UDEp and I2CVB. The UDEp dataset contains 29 \(\textrm{T}_{2}\)WI exams with 861 slices. The segmentations were created by a medicine PhD-student KS under the supervision of the expert urologist JPRFootnote 4 using the MITK [23] software. PCa was verified using systematic and guided needle biopsy to the lesion foci.

The UDEp dataset is segmented into WG, PZ and the multiple prostate lesion (PLES) classes that are joined to a single PCa class.

Additionally, we use the 19 \(\textrm{T}_{2}\)WI prostate exams from the I2CVB dataset with 468 slices in total that contain at least one of the following classes: PCa, PZ, CG and WG. In comparison to the UDEp dataset a segmentation for the CG is provided. The PCa class was proven through biopsy and found in 17 patients while 2 patients had negative biopsy results.

The third dataset called combined (COMB) dataset holding prostate exams of the UDEp and I2CVB with \(29 + 19 = 48\) patients. Since the UDEp does not contain a contouring for the CG class a method is presented on how to approximate it to have all classes present in both datasets.

3.4 CG generation

The CG (consisting of central zone, transitional zone, and anterior fibromuscular stroma) can be estimated by finding the set difference of the binary segmentation maps of the WG and the PZ, as \(\text {CG} = \text {WG} {\setminus } \text {PZ}\). The contourings in the UDEp dataset are sometimes less precise compared to the labels of the I2CVB dataset. As a result, the difference between the WG and PZ classes produces various artifacts, such as shapes around the original PZ and more than one CG, etc. The procedure to approximate the CG described in Fig. 4 is applied to remove these artifacts. It follows the these steps: a) The first approximation for the \(\widetilde{\text {CG}}\) is calculated as \(\text {WG} {\setminus } \text {PZ}\). b) A 2D median filter is applied to remove almost all unwanted artifacts. c) Since the median filter slightly blurs the contours, the set difference of the current approximation of the \(\widetilde{\text {CG}}\) with the intersection of itself and the PZ is computed, to achieve separability of the contours for the next step. Furthermore, contour moments are computed to determine the distances of all possible CGs to the PZ centroid. d) Finally, the CG candidate with shortest distance and relative area of 40% of the maximum area is selected as shown in Fig. 4-e. Further remarks are given in Appendix A.

Fig. 4
figure 4

The CG generation process in steps a to f. The procedure is shown on an example with segmentation errors to demonstrate the strength of the algorithm. For the UDEp-dataset the parameters \(\alpha = 0.8\) and \(\text {A}_m \ge 0.4\) are chosen. The former is used to scale the kernel filter size and the latter specifies the minimum percentage of the contour area to be selected from the contour moment \(M_{00}\)

Furthermore, the prostate \(\textrm{T}_{2}\)WI for each patient are grouped into the complete volume and three different regions or sub-volumes apex, midgland and base which comprise the 1st to 15th, the 16th to 84th and the 85th to 100th percentile of the MRI for which the targets include at least one of the classes WG, CG or PZ respectively [7]. This grouping helps to better determine the strengths and weaknesses of the DL-model in the above mentioned regions. The sub-volume names, however, are not to confuse with the exact prostate biopsy regions [34].

The distribution of the segmentation classes is shown in Table 1. The PCa class is underrepresented since only a few slices per patient contain PCa regions. Thus, this class is the hardest to detect. Furthermore, the UDEp dataset contains fewer PCa samples which makes it more difficult to train. The generated CG class is slightly larger in the UDEp dataset compared to the I2CVB set due to the automatic generation approach.

Table 1 The class pixel distribution is shown for each class over the complete prostate region of the three datasets. The background class is disregarded

3.5 Image augmentation and histogram equalization

Alkadi et al. [3] and Duran et al. [8] claim that image augmentation reduces overfitting while [2] could not find any improvements. The augmentation pipeline used for this work consists of the following linear transformations: The patient’s MRI height and width differ. Thus, all \(\textrm{T}_{2}\)WI are resized to \(320 \times 320\) pixels which bisects the height and width of most UDEp’s MRI and lies approximately in the middle of the I2CVB sizes. Additionally, the images are horizontally flipped with a probability of 50%. Moreover, the border of the images is cropped in a random manner. The cropped area is replaced by a reflection pad. Cropping varies slightly the location of the prostate gland. Furthermore, the MRI are rotated up to \(\pm 2^{\circ }\). The empty parts that are created by the rotation are padded by the parts that were rotated out of the image. For training, the targets and input \(\textrm{T}_{2}\)WI are augmented with the above mentioned transformations, however, the inputs are also modified by adding Gaussian noise.

To enhance image contrast and make shapes and textures mainly in the boundary zones more visible (c.f. [35, 36]), we employ CLAHE [37] which is shown in Fig. 5.

Fig. 5
figure 5

Different types of histogram equalization. The original MRI slice is very obscured (a) and has low contrast. After applying histogram equalisation (HE) the contrast improves by a large margin (b). Basic HE flattens the histogram to the available dynamic range based on the brightness values of the complete image. A downside of HE is that it also magnifies noise in darker areas. This can be omitted by applying the CLAHE method (c) which operates on the image by subdividing it using a kernel of size \(M \times M\) and limiting the amplification using the so-called clip limit which is usually set between 2 and 4. \(M\) is set to 8 which subdivides the image into 64 tiles each having 1/8 of the image width and 1/8 of the image height [38]. This ensures that the proportions of tile to image size are constant. The method computes the histograms locally for each of these \(M \times M\) squares around each pixel [39]. It can be easily seen in (c) that CLAHE was able to uncover structures in the bottom-left part of image (a) that are too bright in image (b)

3.6 Evaluation

In order to comprehend the interconnections between the target (TGT) and the prediction (PRED) the metrics \(\text {PPV} = \frac{TP}{TP + FP}\), \(\text {TPR} = \frac{TP}{TP+FN}\) and \(\text {DSC} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\) are introduced. The positive predictive value (PPV) is the fraction of properly rated positives over the predicted positive cases (PPC). The true positive rate (TPR) is the proportion of properly categorized positives over the reference positive cases (RPC). The DSC is the harmonic mean between PPV and TPR and compares the intersecting area of the TGT and the PRED to the sum of both areas [40]. The reader is referred to Appendix B for further information on the calculation of the metrics.

3.7 Cross-validation

The authors of [4] identified k-fold cross-validation (k-cv) with \(k = 3\) or \(5\) as common numbers in literature. While [3] use leave-one-patient-out cross-validation (LOPO CV) for the exiguous I2CVB dataset, [2, 8, 28] use k-cv with either \(k=4\) or \(k=5\). While searching the hyperparameter space, the dataset is split randomly into the three sets train-, validation- and testset. For the final performance evaluation 5-fold cross validation (5-CV) is applied. Hence, the dataset is divided into the training-set that is used for training and the test-set on which the model is tested. After training and evaluation the highest performances for the DSC of each slice and each patient’s volume are averaged and the standard deviation is calculated. For hyperparameter tuning and performance evaluation, each split always contains all the images of a single patient, which ensures that the network model does not learn any features which are later validated and tested on.

4 Results

In Table 2 the slice- and 3D volume based DSC results are shown. Figure 6 shows the examples for Casc-UNetS and Fig. 7 shows PCa samples only. The findings are validated using the one-sided paired t-test with \(\alpha = 0.05\). Using CLAHE fewer WG artifacts were predicted in stage 1 which results in preciser ROIs for the WG which serve as additional input to stages 2 and 3. The effect of feature-map concatenation between individual stages results in an increase in 3D volume based TPR (15%, \(p < 0.001\)), PPV (5%, \(p < 0.05\)) and DSC (12%, \(p < 0.001\)) for the PZ and is highly critical. For the CG the increase in TPR (9%, \(p < 0.001\)), PPV (8%, \(p < 0.01\)) and DSC (6%, \(p < 0.01\)) is also highly critical. The effects of different training configurations of UNetS1 to 3 are discussed in the following subsections.

Table 2 Average and standard deviation for 3D-volume- and slice-based DSC results for Casc-UNetS in the different prostate regions and zones
Fig. 6
figure 6

Slice (sl.) segmentation examples (blue: target, green: prediction) with DSC in % (first DSC: training on either I2CVB or UDEp, second DSC: training on COMB dataset). depicted for training on either I2CVB or UDEp dataset (a shows the WG, b the PZ, c the CG and d the PCa class): 1. (midgland) a 95.6 / 92.7, b 76.1 / 71.5, c 77.2 / 74.8, d 0.0 / 0.0; 2. (base) a 84.2 / 79.6, b 61.9 / 52.5, c 83.7 / 84.0, d 0.0 / 0.0; 3. (apex) a 88.9 / 91.8, b The model detected the PZ, however, it is missing in the target. This falsely resulted in a DSC of 0.0%, c 64.2 / 66.1, d 0.0 / 0.0. 4. (midgland) a 86.5 / 87.6, b 69.6 / 75.1, c 89.5 / 91.9, d PCa not present and not detected in both

Fig. 7
figure 7

PCa slice (sl.) segmentation examples (blue: target, green: prediction) depicted for training on either I2CVB or UDEp dataset with DSC in % (first DSC: training on either I2CVB or UDEp, second DSC: training on COMB dataset). 1. a 0.0 / 0.0, b 74.4 / 58.3, c 60.1 / 44.4, d 61.0 / 67.0; 2. a 39.6 / 0.0, b 77.7 / 48.3, c 85.2 / 44.4, d 77.0 / 40.2; 3. a 0.0 / 0.0, b 80.9 / 0.0, c 62.0 / 0.0, d 34.2 / 0.0; 4. a no detection / 0.0, b 4.7 / 37.9, c 43.4 / 10.0, d 0.0 / 0.0

4.1 WG prediction with UNetS1

A DL-model has been trained based on each dataset. Since the COMB dataset comprises the I2CVB and the UDEp dataset, it has to be evaluated whether the model trained on the COMB dataset outperforms the models trained on the individual ones. Hence, for each MRI the metric differences are compared.

On the complete prostate region the COMB model improves the average DSC performance by 0.1% on the I2CVB dataset and worsens it by 1.1% on the UDEp dataset. [7] also report that their network did not always benefit from an increase in training samples. In the prostate apex region it improves by 1.4% for the I2CVB dataset while it deteriorates by 2.2% for the UDEp set. In the midgland region it diminishes by 0.2% and 0.4% for the I2CVB and UDEp set respectively. In the base region there is an increase of 0.2% for the I2CVB and 3.2% decrease for the UDEp dataset. There might be several reasons which lead to these findings: As mentioned in Sect. 3.4 the UDEp dataset contains segmentation errors. In many slices the WG extends the PZ which normally limits it. Thus, there is only a small improvement of 0.1% for the I2CVB set. Furthermore, the results for the WG are already very high (around 89% for I2CVB) which indicate that the model was able to learn the correct features based on the individual datasets.

Since the findings using the COMB dataset are in terms of the one sided paired-t-test with \(\alpha = 0.05\), \(p = 0.961\) for the I2CVB and \(p = 0.350\) for the UDEp not critical, the models trained on the individual datasets are used in the following stages of Casc-UNetS.

By using the filter between UNetS1 and 2 the slice-based DSC mean is increased from 88.9% to 89.2% (\(p < 0.001\)), 81.6% to 81.8% (\(p < 0.05\)) and 84.3% to 84.5% (\(p < 0.01\)) on the I2CVB, UDEp and COMB dataset respectively. The DSC increase is found in an increase in PPV of 0.4%, 0.4%, 0.5% for the three datasets while TPR remains constant, hence, the number of false positives (FP) is reduced, thus, the number of PPC decreases which in turn increases PPV and DSC (c.f. Sect. 3.6). Sometimes the model predicts within the target WG multiple smaller WGs. Obviously, the filter will remove all but one. The filter is used to reject unwanted artifacts outside the prostate area to provide an improved ROI selection to stages 2 and 3.

4.2 PZ and CG prediction with UnetS2

UNetS2 was trained and evaluated in two different ways using the additional WG contouring which is intended to focus the model to the WG to predict the PZ and CG. For training it is either possible to use the target- or the prediction WGs. The targets are used since they resulted in 1% higher DSC. The WG segmentation is either concatenated with the MRI slices (concatenated input) or the Hadamard product is formed between the WG and the input MRI (attention gate). The Hadamard-product of the prediction of the WG with the MRI [8] is used to focus UNetS2 solely to the ROI. The results are almost similar to those of the concatenated input, however, the base PZ region of the COMB dataset lies about 10% lower compared to the concatenated input. The attention gate acts like a perforator and reduces the MRI slices to only the predicted WG region before the model evaluates them. Since the base prostate contains mostly small areas, which are hard to detect for UNetS1, it will be impossible for the model to detect anything outside these predicted WG. Thus, Casc-UNetS uses the concatenated input.

4.3 PCa prediction with UNetS3

The last stage is formed by UNetS3 to predict the PCa class. Again two variants are tested. A filtered and unfiltered one since previous experiments revealed multiple PCa predictions scattered around the MRI slices. Hence, the filter is intended to remove all false predictions outside the segmentation contours of UNetS1 and 2. It is implemented as a element-wise multiplication of the predicted PCa class with a mask which is created by merging the classes WG, PZ, CG to obtain the complete predicted prostate region and finding the convex hull for smoothening. The experimental results, show that PCa outside the prostate region was not predicted anymore by providing the predicted segmentations of UNetS1 and -2 as an input, hence, filtering is redundant and thus, is rejected for UNetS3.

The highest DSC values are achieved for the I2CVB dataset. As expected the model learns the PCa class best in the I2CVB due to a higher PCa frequency (c.f. Table 1) followed by the COMB and finally the UDEp dataset. Although the percentages are rather low, the PCa class was not detected using a single stage UNetS.

5 Discussion

Looking closer at the cascaded network architecture of Casc-UNetS many advantages become evident. The architecture allows additional post-processing after each stage to incorporate the location of the prostate in \(\textrm{T}_{2}\)WI or the location of prostate lesions since they most frequently appear in the PZ [21]. Moreover, the output of each stage can be imagined as an a-priori residual connection of the last stage to the input of the next stage. The increase in DSC emphasizes that the network model can learn more precisely by limiting the model to the ROI as shown in Sect. 3.1. Each stage can be further optimized based on its individual segmentation task. Furthermore, Casc-UNetS solves the multi-label semantic segmentation task in stages using the softmax activation layer, i.e., one class (WG) per stage or if possible multiple classes (PZ and CG) if they do not superimpose anatomically. In order to solve the multi-label semantic segmentation problem multiple classes can be correct for a single pixel. If two classes are present in a ground truth pixel it is highly unlikely for softmax that the prediction probabilities split up in the same portions and sum up to one (even if both are correct). Additionally, the binary confusion matrix is calculated to compare the results which forces one single predicted class to one and all others to zero. Thus, for any pixel holding two or more ground truth classes the result would only be partially correct. Hence, the staged architecture has the advantage to utilize existing means and still realize multi-label classification in semantic segmentation. Although the performance results using image augmentation are slightly improved for the WG, CG and PCa (for the PZ they are 0.1% degraded), the findings are in terms of the paired-t-test with \(\alpha =0.05\) in the four classes WG, PZ, CG and PCa with p-values of 0.027, 0.439, 0.014 and 0.135 respectively only for the WG and CG with its lower signal intensities compared to the PZ [4, 29] critical. Comparing the extensive implementation effort of image augmentation with its effect this is a remarkable finding and in accordance with [2, 41].

On the other hand, there are also challenges that have to be addressed. While the targets of the I2CVB dataset [4] are mostly accurate the ones of the UDEp dataset contain segmentation errors which required the development of the novel CG generation and noise-reduction algorithm (c.f. Sect. 3.4). While this algorithm works very well to generate the CG, the existing targets of WG, PZ and PCa are not affected by the noise-reduction. Especially, the WG still contains a lot of errors that present itself by a WG that extends the PZ which usually limits it. The network trained on this data will learn the tissue features of the partially incorrect labeled MRI as correct which results in difficulties. We cannot compare a rather accurate model that is trained on the I2CVB dataset with incorrect targets from the UDEp and vice-versa. Since some slices within the UDEp-dataset contain segmentation errors, a combination of both datasets results in misleading information for training, hence, the additional data did not result in better or more accurate results (c.f. Sect. 4.1). Nevertheless, we conjecture that the model trained on the additional data from the UDEp dataset will result in a more robust model that generalizes better. Additionally, each stage has to be trained and evaluated separately which is time consuming and it is especially challenging to apply image augmentation for training UNetS2 and -3. Currently the prostate filter selects the correct contour based on the location of the predicted contours. This assumption is made since the prostate lies in the center of the male pelvic cavity. In some cases multiple contours will be correct if they lie within the ground truth prostate contour. However, the prostate filter is only based on distance and area. Thus, the prostate filter could be further improved to recognize more true positives while removing false positives.

To delineate the distinctions with classical UNet, we examined our staged architecture, termed Casc-UNetS, which essentially constitutes a concatenation of a modified UNet [c.f. 42]. The performance of traditional UNet architectures on the I2CVB dataset is documented in the literature (i.e., in the study “Attention Guided Deep Supervision Model for Prostate Segmentation” [43]) and found to be inferior compared to adapted versions of the same. The authors analyzed the performance disparities of different UNet versions (3D, attention guided) with classical UNet. For instance, [43, 44] provide extensive insights into the performance of various UNet variants, directly correlating with the findings of [3, 9], which form the foundation of our work.

In contrast to most other works that incorporate CNNs for prostate segmentation we only use two tiny datasets of 19, 29 and finally the combination of both with 48 exams with 5-CV which is at the very bottom in terms of dataset size when comparing to the studies that utilize CNNs as outlined in the meta analysis of Wildeboer et al. [c.f. [1], Table 4] with a mean of 202 patient exams to achieve similar results.

The following limitations are to mention: In general, if a network modification or training option did not improve the evaluation results in one stage it was disregarded, i.e., the effect of image augmentation was only evaluated for the model trained on the I2CVB data, the effect of training the model on the COMB dataset and evaluating it on the individual datasets I2CVB and UDEp was only evaluated for UNetS1 (c.f. Sect. 4.1).

For further investigations on how to properly join two datasets a standardized data acquisition and labeling protocol must be established in collaboration with the medical imaging professionals. After that, the data must be relabeled to generate a highly accurate UDEp dataset that can be compared to the performance of the dataset used in this work.

6 Conclusions

In this manuscript, we proposed an innovative cascaded CNN architecture called Casc-UNetS, which is able to perform multi-label semantic segmentation to precisely contour the prostate in \(\textrm{T}_{2}\)WI. The model is inspired by current research methods [3, 6,7,8, 20] and was evaluated on three different datasets. The network architecture was optimized in an iterative process which allows each stage to be tweaked to its prediction task. Featuremap concatenation improved the overall prediction performance by a large margin and PCa is detected based on \(\textrm{T}_{2}\)WI only.

Given that the CG contours in the UDEp-dataset are missing we employ our algorithm that uses median filtering and contour moments to approximate and simultaneously denoise the CG for the UDEp \(\textrm{T}_{2}\)WI in order to join both the I2CVB and UDEp datasets to obtain a dataset with more variety. The contrast of the MRI was improved using CLAHE which created preciser WG predictions. For each network stage hyperparameter tuning was conducted and the performance evaluated using 5-CV. The results are compared and analysed as shown in Table 2. Image augmentation techniques—except for CLAHE—were found to be almost irrelevant (c.f. Sect. 5).