Abstract
This study presents a novel approach to the automatic detection and segmentation of the Crown Rump Length (CRL) and Nuchal Translucency (NT), two essential measurements in the first-trimester US scan. The proposed method automatically localises a standard plane within a video clip, as defined by the UK Fetal Anomaly Screening Programme. A Nested Hourglass (NHG) based network performs semantic pixel-wise segmentation to extract the NT and CRL structures. Our results show that the NHG network is faster (19.52% fewer GFLOPs than FCN-32) and offers high pixel-level agreement (mean IoU = 80.74) with expert manual annotations.
Keywords: First trimester, video segmentation, crown rump length (CRL), ultrasound, nuchal translucency (NT)
1. Introduction
Fetal ultrasound (US) is a non-invasive imaging method for assessing fetal growth and development. The first-trimester US scan is carried out at 11+0 to 13+6 weeks of gestation to evaluate fetal viability, establish pregnancy dating, and assess the risk of chromosomal anomalies [1]. To accomplish these tasks, current clinical practice relies on manual selection of the mid-sagittal plane followed by measurement of the fetal Nuchal Translucency (NT) and Crown-Rump Length (CRL), which is subjective and requires extensive training and years of experience [2, 3].
Contribution
We present a two-stage deep learning architecture that automatically detects the mid-sagittal plane (MSP) and segments the key CRL and NT structures, as shown in Fig. 1. As a pre-processing step, a real-time detection CNN predicts the class probabilities of key anatomical structures (nose, head, horizontal sagittal section, diencephalon and rump) to detect the best MSP view. Stage two is a novel nested encoder-decoder semantic segmentation architecture designed to segment the CRL and NT structures. The proposed design ensures that the various levels of US image features extracted by the encoder are delivered to the decoder so that subtle anatomical structures can be discriminated, while requiring fewer trainable parameters (32.5% fewer than U-Net [4]). A class-balancing weighted loss function was employed to further improve the segmentation, reflected in a 4.27% increase in the mean intersection-over-union (IoU) score.
Related Work
There are a limited number of studies on automated fetal biometry for the first-trimester US. Zhao et al. [5] presented a linear support vector machine (SVM)-based study to detect physical characteristics that ultimately help to detect Down Syndrome. Nirmala et al. [6] proposed an NT detection method based on segmentation and edge detection; the authors used a shift-based procedure that iteratively clusters pixel features to obtain the NT segmentation mask. More recently, Sobhaninia et al. [7] proposed a neural network-based multi-task fetal head circumference segmentation method for fetal biometry. However, none of the aforementioned methods considered the combined task of detecting a standard plane and segmenting key anatomical structures from US video.
2. Methods
2.1. Data Acquisition
The dataset consists of 250 full-length routine first-trimester free-hand fetal US scans containing the midline sagittal view of the fetus, acquired under the large-scale PULSE (Perception Ultrasound by Learning Sonographer Experience) study at the Fetal Medicine Unit, Oxford University Hospitals National Health Service (NHS) Foundation Trust. The scans were performed on a commercial Voluson E8 version BT18 (General Electric Healthcare, Zipf, Austria) US machine. The setup was equipped with customized video-recording software that captured full-length video scans via the secondary video output of the US machine using screen grab [8]. The video data were anonymized to remove patient details before storage. The full-length US scans were recorded at HD resolution (1920 × 1080 pixels), 30 frames per second, with lossless compression. The average duration of the acquired first-trimester US scans is 13.73 ± 4.18 minutes (24720 ± 7534 frames). Figure 2 shows an illustrative example of how a US scan was partitioned into video clips by an expert.
2.2. The Proposed Architecture
Figure 1 presents an overview of the proposed architecture. US frames are input to a pre-processing CNN to detect the best MSP. Next, the selected keyframes are fed into the proposed NHG with a weighted loss function for the segmentation of CRL and NT. The final predictions are refined using a dense Conditional Random Field (dCRF) model.
2.2.1. Sagittal Plane Detection (SPD)
For the sagittal plane detection (SPD) task, Table 1 summarises the CRL dataset, manually annotated by an engineering researcher and a clinical fellow for five anatomical structures: a) head [Hd], b) horizontal sagittal section of the fetus [HS], c) echogenic tip of the nose [EN], d) rump [Ru], and e) translucent diencephalon [TD]. We applied YOLOv5 [9] for high-speed (more than 30 frames per second (fps)) US anatomical object detection, posed as a joint regression and classification problem; it returns class labels and associated probabilities. The best MSP is detected when all anatomical classes are detected with a probability higher than 70%; a minimal sketch of this selection rule is given after Table 1. This threshold was selected after several experiments to ensure that all key anatomical structures are present, as required by the NHS Fetal Anomaly Screening Programme (FASP) guidelines [1].
Table 1. Details of datasets and tasks used in this study.
| Anatomy | Task | Dataset | Video Segments | Frames |
| --- | --- | --- | --- | --- |
| CRL | SPD and SPS | Training | 100 | 12534 (77.9%) |
| | | Validation | 18 | 2385 (14.8%) |
| | | Test | 10 | 1174 (7.2%) |
| NT | SPS | Training | 110 | 10174 (79.3%) |
| | | Validation | 27 | 2083 (16.2%) |
| | | Test | 9 | 564 (4.4%) |
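The frame-selection rule described above can be illustrated with a short, self-contained sketch. The detector interface and the ranking of candidate frames by mean confidence are assumptions made for illustration; the trained YOLOv5 model and any tie-breaking criterion used in the paper are not reproduced here.

```python
# Hypothetical sketch of the MSP selection rule: a frame is accepted only when
# every key anatomical class (Hd, HS, EN, Ru, TD) is detected with confidence
# above 0.70.  The detector interface is a stand-in for the YOLOv5 model.
from typing import Dict, List, Optional

KEY_CLASSES = {"Hd", "HS", "EN", "Ru", "TD"}
CONF_THRESHOLD = 0.70


def best_msp_frame(per_frame_detections: List[Dict[str, float]]) -> Optional[int]:
    """Return the index of the best mid-sagittal plane frame, or None.

    `per_frame_detections[i]` maps each class name detected in frame i to its
    highest confidence score, e.g. {"Hd": 0.91, "HS": 0.84, ...}.
    """
    best_idx, best_score = None, -1.0
    for idx, dets in enumerate(per_frame_detections):
        # All five key structures must be present above the threshold.
        if all(dets.get(c, 0.0) > CONF_THRESHOLD for c in KEY_CLASSES):
            # Rank candidate frames by mean confidence (an assumption; the
            # paper does not specify how ties between valid frames are broken).
            score = sum(dets[c] for c in KEY_CLASSES) / len(KEY_CLASSES)
            if score > best_score:
                best_idx, best_score = idx, score
    return best_idx
```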
2.2.2. Sagittal Plane Segmentation (SPS)
For sagittal plane segmentation (SPS), we designed an NHG network architecture that sandwiches a single Hourglass (HG) [10] between residual blocks [11], as shown in Fig. 3. The proposed architecture arranges residual, pooling, and HG blocks appropriately during the encoder stage, and likewise during the decoder stage, so that various levels of feature maps are produced within the same block. The final segmentation mask is then extracted with the help of the encoder pooling indices. During NHG network training, extreme foreground-background class imbalance, especially for small classes such as NT, was found to be problematic. To address this, we introduced a weighted-loss (WL) function that assigns each class a weight inversely proportional to the frequency with which that class appears throughout the training set, relative to the median class frequency [12]. This offers a more customised loss-calculation strategy than the general focal-loss approach [13]. This simple heuristic improves segmentation performance by focusing the optimisation on foreground pixels, without adding trainable parameters.
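As a rough illustration of this nested arrangement, a minimal PyTorch sketch is given below: residual blocks sandwich a small hourglass, and the decoder reuses the encoder pooling indices. The block counts, channel widths, and internal hourglass depth are assumptions for illustration only and do not reproduce the exact NHG configuration of Fig. 3.

```python
# Minimal, hypothetical sketch of a nested encoder-decoder in the spirit of
# the NHG: residual blocks around a small hourglass, with SegNet-style
# unpooling from encoder pooling indices.  Sizes and depths are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.skip(x))


class MiniHourglass(nn.Module):
    """One bottom-up/top-down pass with a residual skip branch."""
    def __init__(self, c):
        super().__init__()
        self.down, self.bottom, self.up, self.skip = (
            ResBlock(c, c), ResBlock(c, c), ResBlock(c, c), ResBlock(c, c))

    def forward(self, x):
        d = self.bottom(self.down(F.max_pool2d(x, 2)))
        d = F.interpolate(self.up(d), size=x.shape[-2:], mode="nearest")
        return d + self.skip(x)


class NestedHourglassSketch(nn.Module):
    def __init__(self, in_ch=1, n_classes=3, c=32):
        super().__init__()
        self.enc = ResBlock(in_ch, c)
        self.pool = nn.MaxPool2d(2, return_indices=True)
        self.hg = MiniHourglass(c)
        self.unpool = nn.MaxUnpool2d(2)
        self.dec = ResBlock(c, c)
        self.head = nn.Conv2d(c, n_classes, 1)

    def forward(self, x):
        e = self.enc(x)
        p, idx = self.pool(e)           # keep pooling indices for the decoder
        h = self.hg(p)                  # nested hourglass on pooled features
        u = self.unpool(h, idx)         # decode with encoder pooling indices
        return self.head(self.dec(u))   # per-pixel class scores
```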
The proposed weighted loss WL is defined as:

$$ WL = -\frac{1}{N}\sum_{n=1}^{N}\sum_{x,y} \alpha_{c}\, g_{nxy} \log\!\left(p_{nxy}\right) $$

where $N$ is the number of feature maps, $p_{nxy}$ is the predicted class probability at pixel $(x,y)$ of map $n$, and $g_{nxy}$ is the ground truth. The weight of each class $\alpha_c$ is scaled by its frequency relative to the median frequency of all classes, calculated as:

$$ \alpha_c = \frac{\mathrm{median\_freq}}{\mathrm{freq}(c)} $$

where $\mathrm{freq}(c)$ is the number of class-$c$ pixel occurrences divided by the total number of pixels in the images containing that class, and $\mathrm{median\_freq}$ is the median of these frequencies over all classes [12]. A dCRF is used as a post-processing step at inference time to smooth the predicted segmentation masks and maximise agreement between similar neighbouring pixels.
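A minimal sketch of the median-frequency class balancing defined above is given below, assuming integer-valued label masks with class ids 0..C-1; the resulting weights can be passed to a standard weighted cross-entropy loss. This mirrors the equations above but is not the authors' implementation.

```python
# Sketch of median-frequency class balancing: alpha_c = median_freq / freq(c).
# `masks` is an iterable of 2-D integer label maps from the training set.
import numpy as np
import torch
import torch.nn as nn


def median_frequency_weights(masks, n_classes: int) -> torch.Tensor:
    pixel_count = np.zeros(n_classes)   # pixels of class c over the training set
    image_count = np.zeros(n_classes)   # pixels of all images that contain class c
    for m in masks:
        for c in np.unique(m):
            pixel_count[c] += (m == c).sum()
            image_count[c] += m.size
    freq = pixel_count / np.maximum(image_count, 1)
    median_freq = np.median(freq[freq > 0])
    weights = np.where(freq > 0, median_freq / np.maximum(freq, 1e-12), 0.0)
    return torch.tensor(weights, dtype=torch.float32)


# Usage (assumed shapes): logits are N x C x H x W, targets are N x H x W.
# alpha = median_frequency_weights(train_masks, n_classes=3)
# criterion = nn.CrossEntropyLoss(weight=alpha)
# loss = criterion(logits, targets)
```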
3. Experiments
3.1. Settings and Metrics
The pre-processing CNN (YOLOv5 [9]) and the NHG architecture were trained for 200 epochs to detect the sagittal plane and segment the CRL and NT, respectively. Training started with a learning rate (lr) of 0.1, decreased by a factor of 10 every 30 epochs. The data augmentation policy included rotation in [−30°, 30°] and horizontal flipping. For evaluation of the SPD model, Recall (R), Precision (P), F1-score (F1), and Top-1 accuracy (Top-1) are reported. For evaluation of SPS, Global Average Accuracy (GAA), Mean Accuracy (MA), and Mean Intersection over Union (mIoU) are reported; a sketch of how these metrics can be computed is given after Table 2.
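As an illustration, the training schedule and augmentation policy described above could be configured as follows in PyTorch/torchvision; the optimizer type (SGD) and momentum value are assumptions not stated in the paper.

```python
# Illustrative training configuration matching the stated schedule:
# lr = 0.1, decayed by a factor of 10 every 30 epochs, for 200 epochs,
# with rotation in [-30, 30] degrees and horizontal flipping.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms


def make_training_setup(model: torch.nn.Module):
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)  # assumed optimizer
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)     # x0.1 every 30 epochs
    augment = transforms.Compose([
        transforms.RandomRotation(degrees=30),        # rotation in [-30, 30] degrees
        transforms.RandomHorizontalFlip(p=0.5),       # horizontal flipping
    ])
    # Note: for segmentation, the same geometric transform must also be applied
    # to the label masks; that plumbing is omitted from this sketch.
    return optimizer, scheduler, augment

# for epoch in range(200):
#     ... train one epoch ...
#     scheduler.step()
```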
3.2. Evaluation of Sagittal Plane Detection
For the SPD task, we evaluated YOLOv5 on the test set. The trained YOLOv5 model achieved P=0.88 ± 0.05, R=0.85 ± 0.03, F1=0.85 ± 0.10 and Top-1=0.87 ± 0.06. To further understand detection performance, we report the confusion matrix in Fig. 5. ‘Hd’ and ‘HS’ show little class confusion, whereas ‘EN’, ‘Ru’ and ‘TD’ show some inter-class confusion.
3.3. Evaluation of Sagittal Plane Segmentation
For the SPS task, we trained and tested benchmark CNNs (FCN [14], U-Net [4], SegNet [15] and Hourglass [10]), selected for their strong segmentation performance on public computer vision benchmarks. Experimental results are reported in Table 2. The results show that the proposed low-compute NHG network outperforms the other benchmark CNN architectures. NHG offers a 3.07% higher GAA score than the standard HG (block=1, stack=2). The effectiveness of NHG-based segmentation can be attributed to its layer arrangement, which offers repeated bottom-up, top-down processing with intermediate supervision.
Table 2. Quantitative analysis of trained models on test dataset.
| Methods | Para. (M) | CRL GAA (%) | CRL MA (%) | CRL mIoU (%) | NT GAA (%) | NT MA (%) | NT mIoU (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FCN-16 [14] | 134.27 | 79.05 ± 0.05 | 66.23 ± 0.10 | 54.48 ± 0.20 | 82.64 ± 0.21 | 55.11 ± 0.03 | 51.60 ± 0.15 |
| FCN-32 [14] | 144 | 81.68 ± 0.01 | 76.56 ± 0.02 | 63.87 ± 0.18 | 85.02 ± 0.14 | 56.97 ± 0.01 | 51.80 ± 0.11 |
| U-Net [4] | 30.72 | 83.64 ± 0.08 | 79.80 ± 0.07 | 67.41 ± 0.25 | 90.17 ± 0.02 | 60.41 ± 0.01 | 58.39 ± 0.01 |
| SegNet [15] | 15.27 | 85.08 ± 0.10 | 83.82 ± 0.10 | 70.05 ± 0.33 | 89.66 ± 0.31 | 56.61 ± 0.24 | 48.18 ± 0.24 |
| HG (B=1, S=2) [10] | 35.08 | 89.05 ± 0.09 | 82.70 ± 0.20 | 70.83 ± 0.05 | 94.22 ± 0.01 | 64.10 ± 0.01 | 63.10 ± 0.05 |
| NHG (ours) | 11.46 | 92.32 ± 0.03 | 85.01 ± 0.01 | 74.42 ± 0.04 | 92.49 ± 0.05 | 66.37 ± 0.11 | 67.37 ± 0.01 |
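For reference, the GAA, MA and mIoU values reported in Tables 2 and 3 can be computed from a per-pixel confusion matrix as sketched below. The definitions follow common usage of these metrics; the authors' exact implementation may differ.

```python
# Hedged sketch of the reported segmentation metrics, computed from a per-pixel
# confusion matrix `conf`, where conf[i, j] counts pixels of true class i that
# were predicted as class j.
import numpy as np


def segmentation_metrics(conf: np.ndarray):
    tp = np.diag(conf).astype(float)
    gt_total = conf.sum(axis=1).astype(float)       # ground-truth pixels per class
    union = gt_total + conf.sum(axis=0) - tp        # |prediction OR ground truth| per class

    gaa = tp.sum() / conf.sum()                     # Global Average Accuracy
    ma = np.mean(tp / np.maximum(gt_total, 1))      # Mean (per-class) Accuracy
    miou = np.mean(tp / np.maximum(union, 1))       # mean Intersection over Union
    return gaa, ma, miou
```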
Adding the weighted loss yields a 0.87% increase in the mIoU score, as shown in Table 3. These empirical results show that the proposed NHG network with weighted loss performs consistently better than a class-balancing (‘focal loss’) strategy based on standard cross-entropy. The quantitative metrics indicate that the majority of pixels are classified correctly, as depicted in Fig. 4. Figure 4-f shows that class balancing and the dCRF yield considerable improvements by maximising agreement and smoothing between similar neighbouring pixels; these steps help resolve conflicting pixel regions where the image is cluttered. However, for the ‘NT’ class the GAA score is higher while the mIoU score is lower than for the ‘CRL’ class, which is likely due to the imbalance between foreground and background classes. The dCRF also offers a well-defined separation between foreground and background pixels, specifically for the NT class, reflected in a 1.07% increase in mIoU for each class; a sketch of this refinement step is given after Table 3. The automated semantic pixel-wise segmentation on the test set showed a high Pearson correlation coefficient (PCC) with the manually segmented video (ρ = 0.93, p = 0.0003).
Table 3. Quantitative results of NHG for NT and CRL segmentation on test dataset.
| Architecture | CRL mean IoU (%) | NT mean IoU (%) | Mean (%) |
| --- | --- | --- | --- |
| NHG-Focal-Loss | 76.14 ± 0.01 | 69.51 ± 0.05 | 72.82 ± 0.05 |
| NHG-Focal-Loss+dCRF | 76.92 ± 0.21 | 71.01 ± 0.22 | 73.96 ± 0.13 |
| NHG-Weighted-Loss | 78.69 ± 0.27 | 72.89 ± 0.20 | 75.79 ± 0.01 |
| NHG-Weighted-Loss+dCRF | 80.02 ± 0.19 | 73.71 ± 0.02 | 76.86 ± 0.02 |
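The dCRF refinement applied to the predicted masks can be sketched with the pydensecrf package as shown below. The pairwise kernel parameters (sxy, srgb, compat) and the number of inference iterations are illustrative assumptions rather than the values used in this study, and the greyscale US frames are assumed to have been converted to three channels for the bilateral term.

```python
# Hedged sketch of dCRF post-processing over the network's softmax output.
# Kernel parameters and iteration count are illustrative assumptions.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax


def refine_with_dcrf(image: np.ndarray, probs: np.ndarray, n_iters: int = 5) -> np.ndarray:
    """image: H x W x 3 uint8 frame; probs: C x H x W softmax probabilities."""
    h, w = image.shape[:2]
    n_classes = probs.shape[0]

    d = dcrf.DenseCRF2D(w, h, n_classes)
    unary = np.ascontiguousarray(unary_from_softmax(probs))  # (C, H*W) negative log-probs
    d.setUnaryEnergy(unary)

    # Spatial smoothness term and appearance (bilateral) term.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=np.ascontiguousarray(image), compat=5)

    q = d.inference(n_iters)
    return np.argmax(np.array(q).reshape(n_classes, h, w), axis=0)  # refined label map
```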
4. Conclusion
We have presented a deep-learning-based architecture that takes an ultrasound video as input and, in one step, outputs segmentations of the key structures used for fetal biometry in the first-trimester US scan. At the segmentation stage, our NHG-based network outperformed all benchmark architectures in terms of accuracy, speed, and parameter efficiency. A good correlation was found between manually labelled and automatically segmented anatomical structures. Future work will examine downstream automated biometry and translational issues in terms of algorithm evaluation and application in the clinical setting.
5. Compliance with Ethical Standards
This study was approved by the UK Research Ethics Committee (Reference 18/WS/0051) and the ERC ethics committee.
Acknowledgments
This work is supported by the ERC (ERC-ADG-2015694581, project PULSE), EPSRC (EP/R013853/1 and EP/T028572/1) and the NIHR Oxford Biomedical Research Centre.
References
- [1] Kirwan D. NHS Fetal Anomaly Screening Programme: National Standards and Guidance for England. 2010.
- [2] Taipale P, et al. Learning curve in ultrasonographic screening for selected fetal structural anomalies in early pregnancy. Obstetrics & Gynecology. 2003;101(2):273–278. doi: 10.1016/s0029-7844(02)02590-5.
- [3] Drukker L, et al. VP18.07: First trimester scans: how much time does it take to acquire the CRL and NT? Ultrasound in Obstetrics & Gynecology. 2021;58:174.
- [4] Ronneberger O, et al. U-Net: Convolutional networks for biomedical image segmentation. Proc MICCAI; 2015. pp. 234–241.
- [5] Zhao Q, et al. Automated Down syndrome detection using facial photographs. 2013. pp. 3670–3673.
- [6] Nirmala S, et al. Measurement of nuchal translucency thickness in first trimester ultrasound fetal images for detection of chromosomal abnormalities. Proc INCACEC; 2009. pp. 1–5.
- [7] Sobhaninia Z, et al. Fetal ultrasound image segmentation for measuring biometric parameters using multitask deep learning. Proc IEEE EMBC; 2019. pp. 6545–6548.
- [8] Drukker L, et al. Transforming obstetric ultrasound into data science using eye tracking, voice recording, transducer motion and ultrasound video. Scientific Reports. 2021;11(1):1–12. doi: 10.1038/s41598-021-92829-1.
- [9] Jocher G, et al. YOLOv5. Code repository. 2020. https://github.com/ultralytics/yolov5
- [10] Newell A, et al. Stacked hourglass networks for human pose estimation. Proc ECCV; 2016. pp. 483–499.
- [11] He K, et al. Deep residual learning for image recognition. Proc IEEE CVPR; 2016. pp. 770–778.
- [12] Yasrab R, et al. RootNav 2.0: Deep learning for automatic navigation of complex plant root architectures. GigaScience. 2019;8(11):giz123. doi: 10.1093/gigascience/giz123.
- [13] Lin T, et al. Focal loss for dense object detection. Proc IEEE ICCV; 2017. pp. 2980–2988.
- [14] Long J, et al. Fully convolutional networks for semantic segmentation. Proc IEEE CVPR; 2015. pp. 3431–3440.
- [15] Badrinarayanan V, et al. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(12):2481–2495. doi: 10.1109/TPAMI.2016.2644615.