Author manuscript; available in PMC: 2024 Mar 1.
Published in final edited form as: IEEE Trans Biomed Eng. 2023 Feb 17;70(3):970–979. doi: 10.1109/TBME.2022.3206596

Ultrasound Volume Reconstruction from Freehand Scans without Tracking

Hengtao Guo 1, Hanqing Chao 2, Sheng Xu 3, Bradford J Wood 4, Jing Wang 5, Pingkun Yan 6
PMCID: PMC10011008  NIHMSID: NIHMS1875942  PMID: 36103448

Abstract

Transrectal ultrasound is commonly used for guiding prostate cancer biopsy, where 3D ultrasound volume reconstruction is often desired. Current methods for 3D reconstruction from freehand ultrasound scans require external tracking devices to provide spatial information of an ultrasound transducer. This paper presents a novel deep learning approach for sensorless ultrasound volume reconstruction, which efficiently exploits content correspondence between ultrasound frames to reconstruct 3D volumes without external tracking. The underlying deep learning model, deep contextual-contrastive network (DC2-Net), utilizes self-attention to focus on the speckle-rich areas to estimate spatial movement and then minimizes a margin ranking loss for contrastive feature learning. A case-wise correlation loss over the entire input video helps further smooth the estimated trajectory. We train and validate DC2-Net on two independent datasets, one containing 619 transrectal scans and the other having 100 transperineal scans. Our proposed approach attained superior performance compared with other methods, with a drift rate of 9.64% and a prostate Dice of 0.89. The promising results demonstrate the capability of deep neural networks for universal ultrasound volume reconstruction from freehand 2D ultrasound scans without tracking information.

Keywords: Ultrasound imaging, volume reconstruction, deep learning, self-attention, contrastive learning

I. Introduction

Ultrasound (US) imaging is widely used in interventional prostate applications to monitor and guide the procedure [1], [2]. US possesses many advantages, such as low cost, portable setup, and the capability of providing real-time anatomic and functional information for navigation. However, due to the limitation of 2D views, a reconstructed 3D US image volume may be desired, which can bring multiple benefits. First, it facilitates image fusion with other imaging modalities such as magnetic resonance imaging (MRI) or computed tomography (CT) for image-guided intervention [3]–[5]. Second, a 3D volume helps measure the size of the prostate and tumor for treatment planning. Last but not least, it enables recording the biopsy needle locations for repeat biopsy in active surveillance or focal ablation in “super-active surveillance”.

In order to acquire a 3D volume for guiding the prostate biopsy, the existing techniques typically use a 2D US transducer to scan a region-of-interest (ROI) in patients and a tracking device attached to the US transducer can record the spatial location of each 2D US frame [6]. Afterwards, the 3D ultrasound volume is reconstructed from the 2D frames by using the spatial information acquired with the tracking system [6]–[8]. A number of software packages are available for volume reconstruction and visualization such as [7], [9], [10]. However, all these methods rely on tracking the US probe to reconstruct a 3D ultrasound volume, which adds hardware complexity and setup requirements.

A new category of methods, sensorless US volume reconstruction, aims to remove the need for tracking devices. Such methods are promising for significantly reducing hardware costs and allow clinicians to move the probe with fewer constraints, without worrying about blocking tracking signals. Prior research on this topic was mainly based on speckle decorrelation [11], [12]. Recently, researchers have proposed several methods [13], [14] that utilize deep learning (DL) techniques, especially convolutional neural networks (CNNs), to extract US structural features and predict the inter-frame motion. However, these works focus on neighboring frames, which limits their use of the context information in ultrasound scans. An ultrasound video clip, composed of a sequence of 2D US frames, contains rich context information and can provide a more general representation of the motion trajectory of the US probe. Using only two neighboring frames [13] may lose temporal information and thus results in less accurate reconstruction. Previous research on speckle decorrelation [11], [12], [15] indicates that focusing on speckle-rich regions can benefit the volume reconstruction performance. However, this useful information has not been exploited by recent deep learning based methods. Thus, addressing these research gaps to fully leverage the information in a limited dataset is the key to further improving the volume reconstruction accuracy.

In this paper, we propose a deep contextual-contrastive network (DC2-Net) for sensorless freehand 3D ultrasound volume reconstruction. The proposed network takes a video subsequence, containing multiple consecutive frames, as input for US transducer trajectory estimation. To leverage the contextual information within a video subsequence, we introduce an attention module into the DC2-Net to focus the network on US speckle information extraction and use a case-wise correlation loss to stabilize the training process. To further improve the trajectory estimation accuracy, we develop a novel margin ranking loss for this regression problem. The proposed loss pulls similar embeddings together and pushes dissimilar embeddings further apart, enabling the model to discriminate samples with dissimilar transformation parameters in the feature space. Our results demonstrate that the proposed method can reconstruct 3D US volumes without tracking information while preserving high correlation to the motion variations in the video scan. By applying the deep learning-based ultrasound volume reconstruction method, we can eliminate the need for any additional tracking hardware and lower the cost of such clinical operations.

This study is based on our previous conference presentations [16], [17] with the following major extensions. (1) We have acquired additional datasets, consisting of newly added transperineal US scans, and systematically evaluate the proposed US volume reconstruction method on multiple datasets. (2) We introduce a contrastive learning strategy, using a margin ranking loss to leverage the label information more efficiently for probe trajectory estimation on two datasets. (3) We extend our parameter analysis and study the impact of various experimental settings, e.g., the network architecture, new evaluation metrics, and the tuning of the contrastive margin and the number of input frames, on the performance of our system in a detailed quantitative evaluation.

II. Related Works

This section first reviews the traditional ultrasound volume reconstruction methods, followed by an overview of the contemporary deep learning techniques. After that, we review the contrastive learning strategies used for our network training to improve the feature extraction.

A. 3D Ultrasound Imaging

A 3D ultrasound volume reconstruction system visualizes a region-of-interest (ROI) in 3D by combining a set of 2D ultrasound frames [18]. Based on how the 2D frames are acquired, existing 3D ultrasound reconstruction methods can be divided into three categories [19]: 2D array scanning [20], mechanical scanning [21], and freehand scanning [22]. 2D array scanning systems use 3D ultrasound probes with 2D imaging arrays, which can directly create pyramidal volume scans [21]. However, such systems are more expensive than 1D array transducers and also bulkier in size. 3D mechanical scanning uses a stepper motor inside a compact casing, which moves the internal imaging array in a tilting, rotating, or linear motion around the ROI. Such a mechanical system acquires regularly spaced 2D ultrasound frames for image volume reconstruction [23]. However, systems falling in the above two categories require specialized US imaging devices to enable the reconstruction, which significantly limits their applications.

Classical freehand US imaging methods use an external tracking device [6], either an optical or electromagnetic tracker, attached to the ultrasound probe to record the position and orientation of the US transducer in 3D space. The tracking system increases the complexity of the navigation devices and limits the clinician’s operational space, as they need to avoid blocking the tracking signals during the operations.

B. Sensorless Volume Reconstruction

Sensorless freehand scanning takes a step further by removing such tracking devices. Early methods were based on speckle decorrelation algorithms [12], which estimate the elevational distance between neighboring US images from the correlation of US speckle patterns. Using a Gaussian model function, Gee et al. [24] proposed a speckle correlation function to approximate the orthogonal distance between two B-mode scans and achieved much improved results. Rivaz et al. [15] designed a speckle detector to classify the irregularly shaped/located regions and found that the detected fully developed speckles can largely improve the elevational distance measurement. Afsham et al. [25] applied a statistical model based on a Rician-Inverse Gaussian stochastic process of the ultrasound speckle formation to estimate the out-of-plane motion. A more recent work by Tetrel et al. [26] proposed to reduce reconstruction error by filtering out unreliable estimations and reported a drift error of 5 mm on sweeps with an average length of 35 mm in phantom studies. The above works mainly conducted experiments on ex vivo tissues (such as beef and turkey) and phantoms with freehand scans and demonstrated the possibility of image-based reconstruction without tracking.

Recent advances in deep learning (DL) methods have shown superior performance in automatic feature extraction. Prevost et al. [13] first proposed to use a convolutional neural network (CNN) to directly estimate the inter-frame motion between two 2D US frames for sensorless volume reconstruction. However, the rich contextual information along the entire ultrasound video was not used. Wein et al. [14] co-registered two DL-reconstructed volumes from transversal and sagittal views, respectively, for a better reconstruction result. Yet, it requires multiple scans in different directions for the same patient, which is typically hard to implement in clinical settings. Our previous work [16] applies a 3D CNN to a US video subsequence to better utilize the temporal context information, which showed promising performance. Recently, Luo et al. [27] proposed a network with a convolutional long short-term memory (LSTM) module and applied a differentiable reconstruction loss to extract the sequential information for US volume reconstruction. This work further demonstrates that using context information from neighboring frames is beneficial in ultrasound volume reconstruction.

C. Contrastive Learning

For supervised learning of deep classification models, the commonly used cross-entropy (CE) loss has been reported to have several major shortcomings, including its lack of robustness to noisy labels and the possibility of poor margins [28], leading to reduced generalization performance. To tackle this problem, the contrastive learning strategy was proposed to enhance the discriminative representation across different categories [29]–[31]. The core idea of contrastive learning is to pull positive pairs closer while pushing negative pairs apart in the latent space [32]. The contrastive losses were inspired by noise contrastive estimation [33] and N-pair losses [34], [35]. Intuitively, the contrastive loss forces the deep feature extractor to produce similar features for images in the same category and distinct feature representations for images from different categories. The losses are often calculated between paired training cases.

III. Deep Ultrasound Volume Reconstruction

A. Problem Definition and Proposed Solution

All the US scanning videos in this work were collected from patients undergoing prostate cancer biopsy using one of two ultrasound probes, a Philips C9-5 and a Philips mC7-2. During a prostate biopsy procedure, the doctors used an electromagnetic (EM) tracking device to record the spatial location and orientation of the ultrasound probe, which provides the transformation between frames within the scan. This spatial transformation serves as the ground truth for our network training. Thus, each US frame has an associated EM-tracked 3D homogeneous transformation matrix $M = \begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}$, where $R$ is a 3×3 rotation matrix and $t$ is a 3D translation vector. The EM tracking data is used only as the ground truth and is not involved in the reconstruction process of our algorithm.

The task of sensorless 3D US volume reconstruction is to estimate the relative spatial position between two or more consecutive US frames purely from the imaging content: a deep convolutional neural network takes a US video segment composed of N frames as input and outputs its trajectory estimation in the form of 6 degrees-of-freedom (6DOF) transformation parameters. Without loss of generality, here we use two neighboring frames as an example for illustration. Let $I_i$ and $I_{i+1}$ denote two consecutive US frames with corresponding transformation matrices $M_i$ and $M_{i+1}$, respectively. The relative transformation matrix can be computed as $M_i' = M_{i+1} M_i^{-1}$. We decompose $M_i'$ into six transformation parameters $\theta_i = \{t_x, t_y, t_z, \alpha_x, \alpha_y, \alpha_z\}_i$, which contain the translations in millimeters and the rotations in degrees. During training, a video segment of N consecutive frames with height H and width W is passed to the network as input, either separately or stacked together. Let $\{\theta_i \mid i = 1, \ldots, N-1\}$ denote the relative transformation parameters between the neighboring frames within this video segment. Instead of directly using all these 6 × (N − 1) transformation parameters as ground-truth labels, we compute the mean parameters $\bar{\theta}$ as the corresponding label of a video segment:

$\bar{\theta} = \frac{1}{N-1}\sum_{i=1}^{N-1} \theta_i.$  (1)

There are two reasons for using the mean vector as the training label: (1) since the magnitude of motion between two frames is small, using the mean can effectively smooth the noise in probe motion; (2) using a label representation with a fixed length (6 transformation parameters) does not require us to modify the output layer every time we change the number of input frames N. Since we compute the motion vector relative to the previous neighboring frame, such inter-frame motion has a small magnitude for all six degrees-of-freedom. The direction of each motion is signed, making it feasible to average the motions for label computation. During the inference stage, we compose the network predictions from all the video segments into a full volume reconstruction for one entire video.
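For clarity, the label construction can be summarized in a few lines of Python. The sketch below assumes 4×4 homogeneous EM-tracking matrices and an "xyz" Euler-angle convention for the rotation decomposition, neither of which is specified in the text; the function names are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_params(M_i, M_next):
    """6-DOF parameters (tx, ty, tz in mm; ax, ay, az in degrees) of the
    relative transform M' = M_{i+1} M_i^{-1} between two tracked frames."""
    M_rel = M_next @ np.linalg.inv(M_i)
    t = M_rel[:3, 3]
    # The Euler-angle convention is an assumption; the paper does not specify it.
    angles = Rotation.from_matrix(M_rel[:3, :3]).as_euler("xyz", degrees=True)
    return np.concatenate([t, angles])

def segment_label(matrices):
    """Mean 6-DOF label over a segment of N tracked frames, as in Eq. (1)."""
    params = [relative_params(matrices[i], matrices[i + 1])
              for i in range(len(matrices) - 1)]
    return np.mean(params, axis=0)

# Example: a 5-frame segment of 4x4 EM-tracking matrices yields one 6-DOF label.
frames = [np.eye(4) for _ in range(5)]
print(segment_label(frames))  # all zeros for identity poses
```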

B. Deep Contextual Learning

This subsection introduces the Deep contextual-contrastive Network (DC2-Net) structure proposed for the 3D ultrasound volume reconstruction task. The structure of DC2-Net, shown in Fig. 2, is designed to make full use of the context information within a US video segment. The ultrasound video sequence contains rich temporal information, which can provide more reference for the robust US probe trajectory estimation. Thus, we propose three designs to fully utilize the US video’s context information: video segment input, self-attention mechanism, and a case-wise correlation loss.

Fig. 2: An illustration of the proposed DC2-Net. The underlying framework is based on a 3D ResNeXt structure with an attention module, and is trained with a combination of three loss terms: MSE loss, correlation loss, and contrastive margin ranking loss.

Video Segment Input:

The proposed DC2-Net utilizes the 3D ResNeXt [36] as the backbone structure. Instead of using only two consecutive US frames, a small video segment containing N frames serves as the input to the network. 3D convolutions can better extract the feature mappings along the axis of the channel, which is the temporal direction in our case. Such properties enable the network to focus on the slight displacement of image features between consecutive frames. Thus, the network can be trained to connect these speckle correlated features to estimate the relative position and orientation.
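The following sketch illustrates the input-output contract of such a video-segment backbone in PyTorch. It substitutes torchvision's r3d_18 for the 3D ResNeXt used in the paper (which is not bundled with torchvision) and repeats the grayscale channel to satisfy the stock video stem; the class name and these choices are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18  # stand-in backbone; the paper uses 3D ResNeXt

class TrajectoryRegressor(nn.Module):
    """Regresses the mean 6-DOF inter-frame motion of an N-frame US segment."""
    def __init__(self):
        super().__init__()
        self.backbone = r3d_18(weights=None)
        # Replace the classification head with a 6-DOF regression head.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 6)

    def forward(self, x):
        # x: (B, 1, N, H, W) grayscale segment; repeat to 3 channels for the video stem.
        return self.backbone(x.repeat(1, 3, 1, 1, 1))

segments = torch.rand(2, 1, 5, 224, 224)      # two 5-frame segments
print(TrajectoryRegressor()(segments).shape)  # torch.Size([2, 6])
```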

Attention Mechanism:

An attention mechanism in a deep learning model focuses the CNN on a specific region of an image that carries salient information for the targeted task [37]. It has led to significant improvements in various computer vision tasks such as object classification [38] and segmentation [39]. In our 3D US volume reconstruction task, regions with strong speckle patterns for correlation are of high importance in estimating the transformations. Thus, we introduce a self-attention block, as shown in Fig. 2, which takes the feature maps produced by the last residual block as input and outputs an attention map. This helps assign more weights to the highly informative regions.
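To make the idea concrete, the sketch below shows one generic way such a self-attention block can be realized in PyTorch: a small convolutional head maps the last residual block's feature maps to a single-channel attention map that reweights the features. The exact block used in DC2-Net is not reproduced here, so the layer choices (1×1×1 convolutions, a 1/8 bottleneck, sigmoid gating) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Minimal self-attention block: takes the last residual block's feature maps,
    produces a [0, 1] attention map, and reweights the features with it."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv3d(channels, channels // 8, kernel_size=1),  # bottleneck ratio is arbitrary
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // 8, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat):          # feat: (B, C, T, H, W)
        attn = self.score(feat)       # (B, 1, T, H, W) attention map
        return feat * attn, attn      # reweighted features and the map itself

feat = torch.rand(2, 2048, 1, 7, 7)   # e.g., features after the last residual block
out, attn = SpatialAttention(2048)(feat)
```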

Case-wise Correlation Loss:

The mean squared error (MSE) loss is commonly used in deep regression problems. We use the MSE loss as the primary loss function to train the network

$\mathcal{L}_{mse} = \frac{1}{6}\sum_{d=1}^{6}\left(\theta_d - \hat{\theta}_d\right)^2$  (2)

where $\theta_d$ and $\hat{\theta}_d$ represent the d-th transformation parameter of the ground-truth label and of the network’s prediction, respectively. However, the use of the MSE loss alone can lead to a smoothed estimation of the motion, and thus the trained network tends to memorize the general style of how the clinicians move the probe, i.e., the mean trajectory of the ultrasound probes. This shortcoming of the MSE loss for network training has been reported before [40], [41]. To deal with this problem, we introduce a case-wise correlation loss based on the Pearson correlation coefficient to emphasize the specific motion pattern of a scan. The correlation coefficients between the estimated motion and the ground-truth mean motion are computed for every degree-of-freedom, and the loss is defined as

$\mathcal{L}_{corr} = 1 - \frac{1}{6}\sum_{d=1}^{6}\frac{\mathrm{Cov}(\theta_d, \hat{\theta}_d)}{\sigma(\theta_d)\,\sigma(\hat{\theta}_d)},$  (3)

where $\mathrm{Cov}(\cdot)$ gives the covariance and $\sigma(\cdot)$ the standard deviation.
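A minimal PyTorch sketch of the case-wise correlation loss in Eq. (3) is given below, assuming the per-segment predictions and labels of one video are stacked into tensors of shape (S, 6); the variable names are illustrative and not taken from the paper's code.

```python
import torch

def case_correlation_loss(pred, target, eps=1e-8):
    """Case-wise correlation loss of Eq. (3): one minus the mean Pearson
    correlation over the six DOFs, computed across the segment predictions
    of one video. pred, target: tensors of shape (S, 6)."""
    pred_c = pred - pred.mean(dim=0, keepdim=True)
    targ_c = target - target.mean(dim=0, keepdim=True)
    cov = (pred_c * targ_c).mean(dim=0)            # per-DOF covariance
    std_p = pred_c.pow(2).mean(dim=0).sqrt()
    std_t = targ_c.pow(2).mean(dim=0).sqrt()
    corr = cov / (std_p * std_t + eps)             # per-DOF Pearson coefficient
    return 1.0 - corr.mean()

# Perfectly correlated predictions give a loss near zero.
gt = torch.rand(20, 6)
print(case_correlation_loss(2.0 * gt + 1.0, gt))   # ~0
```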

C. Margin Ranking Loss

In our work, estimating the spatial location and orientation of each US frame for volume reconstruction is essentially a regression task. While contrastive learning is a well-established strategy for image classification, it is much more complex to adapt the idea to regression problems: the label values in classification are discrete, but they become continuous in regression-related tasks. We therefore enhance the trajectory-estimation feature learning by adopting a novel margin ranking loss [42], which correlates the distance between two samples’ feature vectors with the discrepancy between their transformation parameters. For a given triplet, we regard one sample as the anchor. Among the other two samples, the one whose 6-DOF label is closer to the anchor’s is considered the near sample, and the other the far sample. Intuitively, the feature vector of the anchor should be closer to that of the near sample than to that of the far sample in the latent space by a margin M. The margin ranking loss is defined as:

$\mathcal{L}_{margin} = \max\left(0,\ \alpha\left(\lVert f_a - f_f\rVert_2 - \lVert f_a - f_n\rVert_2\right) + M\right),$  (4)

where fa, fn, and ff indicate the feature vectors of the anchor, near and far samples, respectively. The margin M defines the minimum tolerance that separates the anchor-far pair from the anchor-near pair. The network continues to optimize as long as the paired difference is below the margin. To enforce that anchor-far pairs have larger distances than the anchor-near pairs, we compute the adaptive coefficient α using sample labels:

$\alpha = \mathrm{sign}\left(\lVert \theta_a - \theta_n\rVert_2 - \lVert \theta_a - \theta_f\rVert_2\right),$  (5)

where $\theta_a$, $\theta_n$, and $\theta_f$ represent the ground-truth 6DOF transformation parameters of the anchor, near, and far samples, respectively. The function $\mathrm{sign}(\cdot)$ returns the positive/negative sign of the input value. Of note, during training we compute the margin ranking loss within each batch by ranking every pair of samples through matrix manipulation. In summary, the full contextual-contrastive loss is formulated as:

$\mathcal{L} = \lambda_1 \mathcal{L}_{mse} + \lambda_2 \mathcal{L}_{corr} + \lambda_3 \mathcal{L}_{margin},$  (6)

where λ1, λ2, and λ3 are the positive weighting parameters. Empirically, we set λ1, λ2, and λ3 to 5.0, 1.0, and 1.0, respectively, by considering their magnitudes.
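As a concrete illustration of Eqs. (4)-(5), the following PyTorch sketch evaluates the margin ranking loss with an explicit loop over batch triplets; the paper instead ranks every pair through matrix manipulation, so this is an illustrative rather than the original implementation, and the tensor names are placeholders.

```python
import itertools
import torch

def margin_ranking_loss(features, labels, margin=0.25):
    """Margin ranking loss of Eqs. (4)-(5), looped over triplets for clarity.
    features: (B, D) embeddings; labels: (B, 6) ground-truth 6-DOF parameters."""
    losses = []
    for a, p, q in itertools.permutations(range(features.size(0)), 3):
        # alpha decides which of p, q is the "near" sample in label space.
        alpha = torch.sign(torch.norm(labels[a] - labels[p])
                           - torch.norm(labels[a] - labels[q]))
        gap = (torch.norm(features[a] - features[q])
               - torch.norm(features[a] - features[p]))
        losses.append(torch.clamp(alpha * gap + margin, min=0.0))
    return torch.stack(losses).mean()

# The total loss of Eq. (6) would then be 5.0 * L_mse + 1.0 * L_corr + 1.0 * L_margin.
feats, labels = torch.rand(8, 128), torch.rand(8, 6)
print(margin_ranking_loss(feats, labels))
```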

IV. Experiments

A. Materials

Our study was conducted retrospectively using two independent sets of human subject data acquired through an IRB-approved clinical study, following the ethical standards of the institutional and national research committee.

Dataset A contains 618 transrectal US video sequences, with each frame corresponding to a positioning matrix captured by an EM-tracking device. An end-firing Philips C9-5 transrectal US transducer captures axial images by steadily sweeping through the prostate from base to apex. We split dataset A into 488, 66, and 64 cases for training, validation, and testing, respectively. Every video in this dataset is from a different patient, so there are no overlapping patients between the training, validation, and test sets. In dataset A, we have prostate segmentation annotations for each video frame, making it possible to reconstruct 3D prostate segmentation volumes for anatomical evaluation.

Dataset B contains 100 transabdominal/transperineal US video sequences acquired by a Philips mC7-2 US probe. Due to the relatively small size of dataset B, we split it into 80, 10, and 10 cases for training, validation, and testing in a rotating fashion to conduct 10-fold cross-validation. The US video sequences described above were captured from different subjects at varying lengths and resolutions. Some patients have multiple US scans. During the cross-validation, we split the training, validation, and test sets according to patient IDs, so that all the videos from the same patient stay in the same set.

A 6-DOF EM tracking sensor is attached to the US probe using a probe-specific fitting to ensure that the spatial relationship, i.e., the calibration, between the sensor and the ultrasound imaging plane is fixed. The manufacturer of the tracking sensor is Philips Healthcare, which also provides the calibration between the sensor and the ultrasound probe. The positioning information given by this EM-tracking device serves as the ground-truth label in our training phase. An EM tracker may have errors of several millimeters over its entire workspace. However, the motion range of the EM tracking sensor during a 2D US sweep covers only a small part of the full workspace. As long as the linearity of EM tracking within that small range is good, the EM-based volume reconstruction performs well. All our reported errors are relative to the EM-tracking data and the reconstruction based on it.

B. Evaluation Metrics and Implementation Details

In this work, we adopted the following six evaluation metrics for measuring the volume reconstruction quality: distance error, frame error, final drift, drift rate, prostate Dice, and prostate error:

  • Distance Error is the average distance between all the corresponding corner points of the input patches (shown by the white bounding box in Fig. 1) throughout a video scan, which reveals the overall difference in speed and orientation across the entire video.

  • Frame Error computes the difference between the ground-truth and a predicted frame location. This metric helps evaluate the relative error between a pair of neighboring frames regardless of the cumulative error along the sequence.

  • Final Drift [13] denotes the Euclidean distance between the two positions of the US video’s last frame, posed by the ground-truth transformation and by the estimated transformation, respectively.

  • Drift Rate [27] measures the ratio between the final drift and the ground-truth sequence length. (A minimal computational sketch of these two drift metrics follows this list.)

  • Prostate Dice computes the Dice coefficient between the prostate’s ground-truth segmentation Vseg and the predicted segmentation V^seg. Given the 2D prostate segmentation for each frame in dataset A, Vseg can be reconstructed using the ground-truth positioning information and V^seg from the network’s predicted trajectory.

  • Prostate Error measures the absolute volume difference between the ground-truth segmentation Vseg and the predicted segmentation V^seg in cubic centimeters (cc).
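The sketch below illustrates how final drift and drift rate relate to the chained frame poses. It is hedged by two assumptions: it tracks a single reference point at the frame origin rather than the patch corner points used in the paper, and it defines the sweep length as the summed ground-truth displacement of that point; `gt_mats` and `pred_mats` are hypothetical lists of 4×4 absolute frame poses.

```python
import numpy as np

ORIGIN = np.array([0.0, 0.0, 0.0, 1.0])  # single reference point, for illustration only

def final_drift(gt_mats, pred_mats, point=ORIGIN):
    """Euclidean distance (mm) between the last frame's position under the
    ground-truth and the predicted transforms."""
    return float(np.linalg.norm((gt_mats[-1] @ point - pred_mats[-1] @ point)[:3]))

def drift_rate(gt_mats, pred_mats):
    """Final drift as a percentage of the ground-truth sweep length."""
    origins = np.stack([M @ ORIGIN for M in gt_mats])[:, :3]
    sweep_length = np.linalg.norm(np.diff(origins, axis=0), axis=1).sum()
    return 100.0 * final_drift(gt_mats, pred_mats) / sweep_length
```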

Fig. 1: Sampled frames of (a) transrectal and (b) transperineal scans acquired by different ultrasound transducers along different motion trajectories. The cylinders represent a targeted anatomy, and “RAS” indicates right, anterior, and superior, respectively.

We use a paired t-test with a significance level α = 0.05 to compare different methods for the statistical tests carried out in this section.

Our network is trained for 100 epochs with a batch size of 48 using the Adam optimizer [44] with an initial learning rate of 1×10−4, which decays by a factor of 0.9 after five epochs. We iterate through all the possible samples in our dataset in each training epoch. Since the prostate US image only occupies a relatively small part of each frame, each frame is cropped without exceeding the imaging field and then resized to 224 × 224 to fit the design of ResNeXt [36]. The cropping is automatically defined to find the maximum ROI square within the fan-shaped receptive field. During training, we mixed the ultrasound videos with varying imaging depths and applied image intensity normalization, aiming to make the network invariant to the different imaging settings. We implemented DC2-Net using the publicly available PyTorch library [45]. The entire training phase of DC2-Net takes about twenty hours with five frames as input. It takes about 2.58 s to produce all the transformation matrices of a US video with 100 frames during testing.
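The optimization schedule described above corresponds to a standard PyTorch setup, sketched below with dummy data. It reuses the hypothetical TrajectoryRegressor from the earlier backbone sketch, shows only the MSE term of the loss for brevity, and assumes the 0.9 learning-rate decay is applied every five epochs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for cropped, resized 5-frame segments and 6-DOF labels.
data = TensorDataset(torch.rand(96, 1, 5, 224, 224), torch.rand(96, 6))
loader = DataLoader(data, batch_size=48, shuffle=True)

model = TrajectoryRegressor()  # hypothetical regressor from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Assumes the stated 0.9 decay is applied every five epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.9)

for epoch in range(100):
    for segments, labels in loader:
        optimizer.zero_grad()
        loss = torch.mean((model(segments) - labels) ** 2)  # MSE term only, for brevity
        loss.backward()
        optimizer.step()
    scheduler.step()
```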

C. Results Comparison

Tables I and II summarize the overall comparison of the proposed DC2-Net against other existing methods on datasets A and B, respectively. We include the prostate Dice only in Table I because the 2D prostate segmentation is available only in dataset A. The “Linear Motion” baseline assumes that a clinician moves the ultrasound probe at a constant speed along a certain direction and thus applies a constant velocity to reconstruct the ultrasound volume [13]. Specifically, we compute a mean motion vector (6D) on the training set and then directly apply this fixed motion vector to the test set for volume reconstruction [13]. The “Decorrelation” approach is based on the speckle decorrelation algorithm presented in [43]. For the network structure comparison, we used five baseline designs as shown in Fig. 3: “2D CNN” refers to the method presented by Prevost et al. [13]; “Shared Siamese” refers to the Siamese network with shared parameters in Fig. 3(b); “Multi-branch” is a 2D network where the two branches share the same structure but have separate gradient back-propagation, as in Fig. 3(c); “3D CNN (N = 2)” is the vanilla ResNeXt [36] architecture taking only two frames as input; and “ConvLSTM” [27] is a recent work that combines convolutional layers with a long short-term memory module for US volume reconstruction. We also include our previous work, DCL-Net [16], which did not apply the margin ranking loss for contrastive learning.

TABLE I:

Comparison of different methods on Dataset A. The evaluation metrics are shown as mean ± standard deviation. The results with the best mean value are highlighted in bold.

Methods Distance Error (mm) Frame Error (mm) Final Drift (mm) Drift Rate (%) Prostate Dice Prostate Error (cc)
Linear Motion 22.53±12.10 2.32±0.64 42.62±10.75 21.59±7.24 0.30±0.13 5.12±3.61
Decorrelation [43] 18.89±9.22 2.43±0.62 38.26±13.34 19.32±6.44 0.62±0.09 4.71±3.25
2D CNN [13] 10.53±5.65 1.87±0.49 23.42±10.19 17.98±4.21 0.68±0.08 4.58±3.12
Shared Siamese 8.27±4.10 1.45±0.35 14.32±9.88 14.53±10.30 0.78±0.11 4.35±3.06
Multi-branch 8.45±3.98 1.45±0.31 13.95±10.74 13.26±9.55 0.82±0.09 4.17±2.83
3D CNN(N = 2) 8.10±4.34 1.13±0.24 14.55±11.24 12.88±9.63 0.81±0.09 4.21±2.47
ConvLSTM [27] 8.79±4.88 0.92±0.27 15.21±10.47 17.09±11.75 0.83±0.09 4.08±2.55
DCL-Net [16] 7.02±3.72 0.93±0.30 11.92±8.89 11.59±9.22 0.86±0.05 3.95±2.24
DC2-Net 5.52±2.86 0.90±0.26 10.20±8.47 9.64±8.14 0.89±0.06 3.21±1.93

TABLE II:

Comparison of different methods on Dataset B.

Methods Distance Error (mm) Frame Error (mm) Final Drift (mm) Drift Rate (%)
Linear Motion 23.87±10.42 2.53±1.01 45.44±10.45 37.71±8.68
Decorrelation [43] 19.54±8.72 2.24±0.95 40.05±10.81 36.33±9.65
2D CNN [13] 13.86±5.65 1.94±0.63 25.53±9.74 24.34±6.78
Shared Siamese 12.32±6.40 1.64±0.57 19.64±7.90 21.83±8.92
Multi-branch 10.88±5.09 1.55±0.53 17.23±6.94 23.56±13.16
3D CNN (N = 2) 8.54±4.33 1.35±0.32 14.98±6.72 22.44±12.78
ConvLSTM [27] 6.96±3.23 1.18±0.27 10.86±5.12 19.55±12.54
DCL-Net [16] 6.78±2.36 1.18±0.26 11.29±3.71 16.06±10.98
DC2-Net 5.87±2.68 1.12±0.26 9.85±5.74 14.58±12.76

Fig. 3: Network structures of all the baseline methods in this study.

Through paired t-tests, our proposed DC2-Net is found to provide significant improvements over all the baseline methods (p-value < 0.05). The 2D CNN reproduced in our experiments has performance consistent with the accuracy reported in the original paper [13]. The Shared Siamese and Multi-branch structures have comparable performance on both datasets. However, due to the nature of 2D convolutions, they both have limited capacity to learn the inter-frame motion, especially the out-of-plane movement. The ConvLSTM uses a recurrent strategy and consistently achieves better performance than the previously described methods.

The results of DCL-Net reported in this paper show a considerable performance leap compared to our previous work [16] because of the updated training strategy. By applying the newly added margin ranking loss and using seven frames as an input video segment, our DC2-Net achieves a significant performance boost on all six evaluation metrics. It is worth noting that the prostate Dice of 0.89 achieved by DC2-Net indicates a good reconstruction performance, especially for the shape of the prostate. With an average prostate volume of 26.7 cc in our test set, DC2-Net achieves an average prostate reconstruction error of 3.21 cc. During a prostate biopsy, the prostate volume is a clinical parameter that doctors use as a reference to make a diagnosis. It does not have to be very precise for interventional guidance, as long as it provides useful 3D imaging information, which is otherwise unavailable with 2D ultrasound probes. Although the average final drift of 10.20 mm achieved by DC2-Net is not a negligible error, this is the best performance on real clinical data rather than phantom studies. Reconstructing 3D US volumes from these freehand US scans is challenging, and we have been making significant progress in this important area.

V. Discussion

A. Model Analysis and Hyper-parameter Sensitivity

In this section, we systematically perform an ablation study to evaluate the effectiveness of each component (attention module, correlation loss, and margin ranking loss) in DC2-Net. We conduct the ablation study on dataset A, which has more training cases and therefore allows a solid comparison, shown in Table III. The first row refers to the vanilla ResNeXt-50 structure, which takes five consecutive frames as input. With both the attention module and the correlation loss implemented (third row, equivalent to DCL-Net [16]), a significant performance improvement is achieved, demonstrating these two components’ effectiveness. With the additional margin ranking loss, the complete DC2-Net beats the baselines by a considerable margin.

TABLE III:

The ablation study of each component proposed in DC2-Net on Dataset A.

Attention Lcor Lmargin Distance Error (mm) Frame Error (mm) Final Drift (mm) Drift Rate (%) Prostate Dice
7.35±4.28 0.93±0.32 12.88±10.20 11.80±9.12 0.81±0.10
7.11±4.51 0.93±0.28 12.20±10.07 11.66±9.77 0.83±0.08
7.02±3.72 0.93±0.30 11.92±8.89 11.59±9.22 0.86±0.05
5.52±2.86 0.90±0.26 10.20±8.47 9.64±8.14 0.89±0.06

We perform hyper-parameter tuning on the contrastive margin M in Fig. 4a, which determines the minimum distance separating an anchor-near pair from an anchor-far pair. With N = 5 and N = 7 (which we found to be practical settings based on experience), the drift rate exhibits a downward then upward trend as we increase the margin M. By incorporating the margin ranking loss, DC2-Net performs better than DCL-Net (black horizontal line) over a wide range of margin choices. Additionally, using seven frames as input achieves almost consistently lower drift rates than using only five frames across the explored margins. When we search for an optimal number of frames in Fig. 4b (with M = 0.25), we also observe a downward trend followed by an upward trend, which matches the intuition: with relatively few frames in a video segment, the input sample provides limited contextual information, which hinders the network from extracting sufficient features; on the other hand, since we use an average motion vector over a video segment as the training signal (Sec. III-A), with too many frames the sudden changes in the probe’s speed and orientation are smoothed away by the averaging, which makes it difficult for the network to capture the trajectory variations.

Fig. 4: Hyper-parameter analysis of the margin M in the margin ranking loss and the number of frames N in each input video segment.

B. Motion Estimation Analysis

Given that deep learning-based US volume reconstruction is a challenging task, one may question whether the network is simply memorizing the scanning protocols and fitting to a global average. To show the learning capacity of the proposed DC2-Net, Fig. 5 visualizes the correlation between the ground truth and the network’s prediction on each of the six degrees-of-freedom. For every test video in Dataset A, we computed the mean motion vector for the ground truth and for the network’s prediction. We plot the mean motion of each video as a single dot, with the ground-truth magnitude as the x-axis coordinate and the prediction as the y-axis coordinate. More points distributed along the diagonal line indicate a more accurate prediction. Fig. 5 shows that DC2-Net’s predictions (green dots) on the test set exhibit a stronger correlation with the ground-truth labels, as more dots lie along the diagonal line than for DCL-Net. This indicates the superior performance of the proposed DC2-Net over the baseline method.

Fig. 5: Correlation between the ground-truth label and the network’s prediction on the test portion of Dataset A.

In Fig. 6, we plot the motion along the six degrees-of-freedom from the first to the last frame of the entire video of one case. By fixing the plotting scales for translations and rotations, respectively, we can see that the motions along tY, αX, and αZ are more dominant than the others because of the constraint imposed by the rectum during the ultrasound sweep. DC2-Net’s prediction not only has a strong correlation with the actual ground-truth motion but also matches the magnitude of each degree-of-freedom. The prediction of DC2-Net has several minor mismatches with the extreme highs and lows in the ground truth. The extreme values could be the outcome of unsteady scans, which may not be predictable by a well-trained network. On the other hand, since we use the mean motion vector of a sub-sequence (varying from 2 to 10 frames) as the ground truth for a training sample, the trained network might have learned such a smoothing effect, which could have contributed to some mismatches with the extreme values. However, since these extreme values are relatively sparse in the datasets, they have only a minor impact on the overall performance.

Fig. 6: To show that the proposed DC2-Net learns more than a global average, we show one full video sequence from the first to the last frame. The ground-truth motion (green) has varied speed and trajectory, and our DC2-Net (red) shows a closely matching pattern in both trend and magnitude.

Table IV shows the results on datasets A and B, where our DC2-Net achieved mean correlations of 0.62 and 0.46, respectively. We observed an interesting pattern when looking at the correlation of each degree-of-freedom: the prediction of the translations (tX, tY, and tZ) has a higher correlation than that of the rotations (αX, αY, and αZ). This is because the rotations typically involve “out-of-plane” motion, which is more challenging to regress from the image content than the in-plane translations. In summary, the matching pattern in Fig. 6 and the high Pearson coefficients in Table IV both indicate that DC2-Net is capable of capturing the sudden changes in the US scans.

TABLE IV:

The mean correlation coefficients of 6 degrees-of-freedom, between the groundtruth label and the network’s prediction on the test sets.

DOF Dataset A Dataset B

DCL-Net DC2-Net DCL-Net DC2-Net
tX 0.57±0.22 0.67±0.20 0.32±0.39 0.62±0.32
tY 0.56±0.26 0.64±0.28 0.41±0.39 0.60±0.50
tZ 0.58±0.21 0.69±0.20 0.10±0.23 0.22±0.44
αX 0.50±0.28 0.56±0.30 0.25±0.37 0.47±0.55
αY 0.53±0.20 0.61±0.25 0.25±0.34 0.49±0.29
αZ 0.49±0.29 0.56±0.28 0.17±0.49 0.39±0.59

Mean 0.54±0.25 0.62±0.26 0.25±0.31 0.46±0.17

C. Quality of Volume Reconstruction

In addition to quantitative evaluation metrics such as final drift and distance error, we also visualize the reconstructed volumes for qualitative evaluation, as shown in Figs. 7 and 8. For the C9-5 cases in dataset A, since the prostate segmentation is available, we superimpose the reconstructed 3D prostate segmentation onto the US volume for anatomical shape evaluation. The 3D prostate segmentation is reconstructed by posing the 2D prostate segmentation of each frame in 3D space, followed by spatial interpolation and hole filling. The volumes reconstructed by DC2-Net all look similar to the ground-truth reconstructions, and the prostate segmentation largely preserves its shape and size, as shown in Table I and Fig. 7. In contrast, the previous DCL-Net produces overly smoothed trajectory estimations (second rows in Figs. 7 and 8). By incorporating the margin ranking loss and properly adjusting the number of input frames N, the motion trajectories predicted by DC2-Net have a much higher correlation with the ground-truth positions. That is to say, the predicted motion is more sensitive to the variations in speed and orientation introduced during scanning. For example, there are some sudden orientation changes in C9-5 cases 4 and 5 during the scan, which can be observed in the rugged bottom lines of the volumes. While DCL-Net’s reconstruction is overly smooth and eliminates such details, our DC2-Net captures these sudden changes and produces matching positioning information.

Fig. 7: Comparison of the US volume reconstruction results of six examples from Dataset A.

Fig. 8: Comparison of the US volume reconstruction results of six examples from Dataset B.

In our work, we measure the final drift for the full ultrasound volume, using the corner points of the input patches to compute the errors. The entire ultrasound volume is much larger than the prostate gland, leading to the seemingly large accumulated errors reported in the paper. In other words, these drift errors do not directly correspond to prostate tracking errors. To compensate for this effect, we also used anatomical evaluation metrics, such as the Dice coefficient of the prostate segmentation, to measure the quality of volume reconstruction. While it is difficult to certify that the current performance is sufficient for clinical application, the presented work can provide useful 3D imaging information of the prostate regardless of the drift. More importantly, DC2-Net allows a regular 2D ultrasound probe to acquire a 3D scan without any hardware change.

VI. Conclusion

This paper introduced a sensorless freehand 3D US volume reconstruction method based on deep learning. The proposed DC2-Net extracts spatial information from multiple US frames to improve the US probe trajectory estimation. Experiments on two EM-tracked ultrasound datasets demonstrated the superior performance of the proposed DC2-Net. The ablation studies indicate that multi-frame input can significantly boost the performance compared with methods using only two frames as input. The proposed attention module, combined with the other designs, makes the network focus on speckle-correlated regions for more accurate trajectory estimation. Furthermore, the contrastive margin ranking loss enhances the feature similarity between US clips with similar motion trajectories, making trajectory predictions more sensitive to sudden changes in the probe’s speed and orientation.

The proposed method can capture the speed and orientation of the ultrasound probe for sensorless freehand ultrasound volume reconstruction. Currently, we choose to study transrectal ultrasound mainly to benefit our downstream applications, such as prostate biopsy navigation. Since our method does not require any manual data annotation for network training and is not specific to prostate imaging, it is possible to generalize the proposed DC2-Net to reconstruct volumes for other ultrasound applications, such as fetal exams [27] and vessel exams [13], with proper domain-specific adjustments. The proposed method enables 3D US imaging with a large field of view without the limits of hardware trackers. Without the need for cumbersome tracking devices attached to the US probe, our freehand sensorless 3D US reconstruction approach has vast potential for broader clinical use.

Acknowledgments

This work was partially supported by National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under awards R21EB028001 and R01EB027898, and through an NIH Bench-to-Bedside award made possible by the National Cancer Institute.

Contributor Information

Hengtao Guo, Department of Biomedical Engineering and the Center for Biotechnology and Interdisciplinary Studies at Rensselaer Polytechnic Institute, Troy, NY, USA 12180.

Hanqing Chao, Department of Biomedical Engineering and the Center for Biotechnology and Interdisciplinary Studies at Rensselaer Polytechnic Institute, Troy, NY, USA 12180.

Sheng Xu, Center for Interventional Oncology, Radiology & Imaging Sciences at National Institutes of Health, Bethesda, MD, USA, 20892.

Bradford J. Wood, Center for Interventional Oncology, Radiology & Imaging Sciences at National Institutes of Health, Bethesda, MD, USA, 20892

Jing Wang, Advanced Imaging and Informatics for Radiation Therapy (AIRT) lab and Medical Artificial Intelligence and Automation (MAIA) lab, Department of Radiation Oncology, UT Southwestern Medical Center, Dallas TX USA, 75235.

Pingkun Yan, Department of Biomedical Engineering and the Center for Biotechnology and Interdisciplinary Studies at Rensselaer Polytechnic Institute, Troy, NY, USA 12180.

References

  • [1].Azizi S, Bayat S, Yan P, Tahmasebi A, Kwak JT, Xu S, Turkbey B, Choyke P, Pinto P, Wood B, Mousavi P, and Abolmaesumi P, “Deep recurrent neural networks for prostate cancer detection: Analysis of temporal enhanced ultrasound,” IEEE Transactions on Medical Imaging, vol. 37, no. 12, pp. 2695–2703, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Pourtaherian A, Scholten HJ, Kusters L, Zinger S, Mihajlovic N, Kolen AF, Zuo F, Ng GC, Korsten HHM, and de With PHN, “Medical instrument detection in 3-dimensional ultrasound data volumes,” IEEE Transactions on Medical Imaging, vol. 36, no. 8, pp. 1664–1675, 2017. [DOI] [PubMed] [Google Scholar]
  • [3].Khallaghi S, Sánchez CA, Nouranian S, Sojoudi S, Chang S, Abdi H, Machan L, Harris A, Black P, Gleave M et al. , “A 2D-3D registration framework for freehand TRUS-guided prostate biopsy,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 272–279. [Google Scholar]
  • [4].Wegelin O, van Melick HH, Hooft L, Bosch JR, Reitsma HB, Barentsz JO, and Somford DM, “Comparing three different techniques for magnetic resonance imaging-targeted prostate biopsies: a systematic review of in-bore versus magnetic resonance imaging-transrectal ultrasound fusion versus cognitive registration. is there a preferred technique?” European urology, vol. 71, no. 4, pp. 517–531, 2017. [DOI] [PubMed] [Google Scholar]
  • [5].Siddiqui MM, Rais-Bahrami S, Turkbey B, George AK, Rothwax J, Shakir N, Okoro C, Raskolnikov D, Parnes HL, Linehan WM et al. , “Comparison of MR/ultrasound fusion–guided biopsy with ultrasound-guided biopsy for the diagnosis of prostate cancer,” Jama, vol. 313, no. 4, pp. 390–397, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Wen T, Zhu Q, Qin W, Li L, Yang F, Xie Y, and Gu J, “An accurate and effective FMM-based approach for freehand 3D ultrasound reconstruction,” Biomedical Signal Processing and Control, vol. 8, no. 6, pp. 645–656, 2013. [Google Scholar]
  • [7].Daoud MI, Alshalalfah A-L, Awwad F, and Al-Najar M, “Freehand 3D ultrasound imaging system using electromagnetic tracking,” in 2015 International Conference on Open Source Software Computing (OSSCOM). IEEE, 2015, pp. 1–5. [Google Scholar]
  • [8].Rohling R, Gee A, and Berman L, “A comparison of freehand three-dimensional ultrasound reconstruction techniques,” Medical image analysis, vol. 3, no. 4, pp. 339–359, 1999. [DOI] [PubMed] [Google Scholar]
  • [9].Lasso A, Heffter T, Rankin A, Pinter C, Ungi T, and Fichtinger G, “PLUS: open-source toolkit for ultrasound-guided intervention systems,” IEEE transactions on biomedical engineering, vol. 61, no. 10, pp. 2527–2537, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Hafizah M, Kok T, and Supriyanto E, “Development of 3D image reconstruction based on untracked 2d fetal phantom ultrasound images using vtk,” WSEAS transactions on signal processing, vol. 6, no. 4, pp. 145–154, 2010. [Google Scholar]
  • [11].Chen J-F, Fowlkes JB, Carson PL, and Rubin JM, “Determination of scan-plane motion using speckle decorrelation: Theoretical considerations and initial test,” International Journal of Imaging Systems and Technology, vol. 8, no. 1, pp. 38–44, 1997. [Google Scholar]
  • [12].Tuthill TA, Krücker J, Fowlkes JB, and Carson PL, “Automated three-dimensional US frame positioning computed from elevational speckle decorrelation.” Radiology, vol. 209, no. 2, pp. 575–582, 1998. [DOI] [PubMed] [Google Scholar]
  • [13].Prevost R, Salehi M, Jagoda S, Kumar N, Sprung J, Ladikos A, Bauer R, Zettinig O, and Wein W, “3D freehand ultrasound without external tracking using deep learning,” Medical image analysis, vol. 48, pp. 187–202, 2018. [DOI] [PubMed] [Google Scholar]
  • [14].Wein W, Lupetti M, Zettinig O, Jagoda S, Salehi M, Markova V, Zonoobi D, and Prevost R, “Three-dimensional thyroid assessment from untracked 2D ultrasound clips,” in International Conference on MICCAI. Springer, 2020, pp. 514–523. [Google Scholar]
  • [15].Rivaz H, Boctor E, and Fichtinger G, “A robust meshing and calibration approach for sensorless freehand 3D ultrasound,” in Medical Imaging 2007: Ultrasonic Imaging and Signal Processing, vol. 6513. SPIE, 2007, pp. 378–385. [Google Scholar]
  • [16].Guo H, Xu S, Wood B, and Yan P, “Sensorless freehand 3D ultrasound reconstruction via deep contextual learning,” in International Conference on MICCAI. Springer, 2020, pp. 463–472. [Google Scholar]
  • [17].Guo H, Xu S, Wood BJ, and Yan P, “Transducer adaptive ultrasound volume reconstruction,” 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021. [Google Scholar]
  • [18].Mohamed F and Siang CV, “A survey on 3D ultrasound reconstruction techniques,” in Artificial Intelligence-Applications in Medicine and Biology. IntechOpen, 2019. [Google Scholar]
  • [19].Huang Q and Zeng Z, “A review on real-time 3D ultrasound imaging technology,” BioMed research international, vol. 2017, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Yen JT and Smith SW, “Real-time rectilinear 3-D ultrasound using receive mode multiplexing,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 51, no. 2, pp. 216–226, 2004. [PubMed] [Google Scholar]
  • [21].Fenster A, Downey DB, and Cardinal HN, “Three-dimensional ultrasound imaging,” Physics in medicine & biology, vol. 46, no. 5, p. R67, 2001. [DOI] [PubMed] [Google Scholar]
  • [22].Chen X, Wen T, Li X, Qin W, Lan D, Pan W, and Gu J, “Reconstruction of freehand 3D ultrasound based on kernel regression,” Biomedical engineering online, vol. 13, no. 1, pp. 1–15, 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Moon H, Ju G, Park S, and Shin H, “3D freehand ultrasound reconstruction using a piecewise smooth markov random field,” Computer Vision and Image Understanding, vol. 151, pp. 101–113, 2016. [Google Scholar]
  • [24].Gee AH, Housden RJ, Hassenpflug P, Treece GM, and Prager RW, “Sensorless freehand 3D ultrasound in real tissue: speckle decorrelation without fully developed speckle,” Medical image analysis, vol. 10, no. 2, pp. 137–149, 2006. [DOI] [PubMed] [Google Scholar]
  • [25].Afsham N, Najafi M, Abolmaesumi P, and Rohling R, “A generalized correlation-based model for out-of-plane motion estimation in freehand ultrasound,” IEEE transactions on medical imaging, vol. 33, no. 1, pp. 186–199, 2013. [DOI] [PubMed] [Google Scholar]
  • [26].Tetrel L, Chebrek H, and Laporte C, “Learning for graph-based sensorless freehand 3D ultrasound,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2016, pp. 205–212. [Google Scholar]
  • [27].Luo M, Yang X, Huang X, Huang Y, Zou Y, Hu X, Ravikumar N, Frangi AF, and Ni D, “Self context and shape prior for sensorless freehand 3D ultrasound reconstruction,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2021, pp. 201–210. [Google Scholar]
  • [28].Elsayed GF, Krishnan D, Mobahi H, Regan K, and Bengio S, “Large margin deep networks for classification,” arXiv preprint arXiv:1803.05598, 2018. [Google Scholar]
  • [29].Henaff O, “Data-efficient image recognition with contrastive predictive coding,” in International Conference on Machine Learning. PMLR, 2020, pp. 4182–4192. [Google Scholar]
  • [30].Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, and Bengio Y, “Learning deep representations by mutual information estimation and maximization,” arXiv preprint arXiv:1808.06670, 2018. [Google Scholar]
  • [31].Wu Z, Xiong Y, Yu SX, and Lin D, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742. [Google Scholar]
  • [32].Peng X, Wang K, Zhu Z, and You Y, “Crafting better contrastive views for siamese representation learning,” arXiv preprint arXiv:2202.03278, 2022. [Google Scholar]
  • [33].Gutmann M and Hyvärinen A, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 297–304. [Google Scholar]
  • [34].Sohn K, “Improved deep metric learning with multi-class n-pair loss objective,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1857–1865. [Google Scholar]
  • [35].Chopra S, Hadsell R, and LeCun Y, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 539–546. [Google Scholar]
  • [36].Xie S, Girshick R, Dollár P, Tu Z, and He K, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500. [Google Scholar]
  • [37].Bahdanau D, Cho K, and Bengio Y, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014. [Google Scholar]
  • [38].Fukui H, Hirakawa T, Yamashita T, and Fujiyoshi H, “Attention branch network: Learning of attention mechanism for visual explanation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10705–10714. [Google Scholar]
  • [39].Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B et al. , “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018. [Google Scholar]
  • [40].Yang Q, Yan P, Zhang Y, Yu H, Shi Y, Mou X, Kalra MK, Zhang Y, Sun L, and Wang G, “Low-dose CT image denoising using a generative adversarial network with wasserstein distance and perceptual loss,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1348–1357, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [41].Johnson J, Alahi A, and Fei-Fei L, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision. Springer, 2016, pp. 694–711. [Google Scholar]
  • [42].Zheng K, Wang Y, Zhou X-Y, Wang F, Lu L, Lin C, Huang L, Xie G, Xiao J, Kuo C-F et al. , “Semi-supervised learning for bone mineral density estimation in hip x-ray images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2021, pp. 33–42. [Google Scholar]
  • [43].Chang R-F, Wu W-J, Chen D-R, Chen W-M, Shu W, Lee J-H, and Jeng L-B, “3-D us frame positioning using speckle decorrelation and image registration,” Ultrasound in medicine & biology, vol. 29, no. 6, pp. 801–812, 2003. [DOI] [PubMed] [Google Scholar]
  • [44].Kingma DP and Ba J, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [Google Scholar]
  • [45].Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, and Lerer A, “Automatic differentiation in PyTorch,” in NIPS 2017 Workshop Autodiff, 2017. [Google Scholar]
