Abstract
Purpose
Semantic segmentation plays a pivotal role in many applications related to medical image and video analysis. However, designing a neural network architecture for medical image and surgical video segmentation is challenging due to the diverse features of relevant classes, including heterogeneity, deformability, transparency, blunt boundaries, and various distortions. We propose a network architecture, DeepPyramid+, which addresses diverse challenges encountered in medical image and surgical video segmentation.
Methods
The proposed DeepPyramid+ incorporates two major modules, namely “Pyramid View Fusion” (PVF) and “Deformable Pyramid Reception” (DPR), to address the outlined challenges. PVF replicates a deduction process within the neural network, aligning with the human visual system, thereby enhancing the representation of relative information at each pixel position. Complementarily, DPR introduces shape- and scale-adaptive feature extraction techniques using dilated deformable convolutions, enhancing accuracy and robustness in handling heterogeneous classes and deformable shapes.
Results
Extensive experiments conducted on diverse datasets, including endometriosis videos, MRI images, OCT scans, and cataract and laparoscopy videos, demonstrate the effectiveness of DeepPyramid+ in handling various challenges such as shape and scale variation, reflection, and blur degradation. DeepPyramid+ demonstrates significant improvements in segmentation performance, achieving up to a 3.65% increase in Dice coefficient for intra-domain segmentation and up to a 17% increase in Dice coefficient for cross-domain segmentation.
Conclusions
DeepPyramid+ consistently outperforms state-of-the-art networks across diverse modalities considering different backbone networks, showcasing its versatility. Accordingly, DeepPyramid+ emerges as a robust and effective solution, successfully overcoming the intricate challenges associated with relevant content segmentation in medical images and surgical videos. Its consistent performance and adaptability indicate its potential to enhance precision in computerized medical image and surgical video analysis applications.
Introduction
Semantic segmentation has emerged as a critical tool in computerized medical image and surgical video analysis, empowering numerous applications across various domains. In surgical videos, semantic segmentation is a prerequisite for several applications, ranging from phase and action recognition, irregularity detection, surgical training, and objective skill assessment to relevance-based compression, surgical planning, and operating room organization [1,2,3,4]. In the case of volumetric medical images, semantic segmentation can considerably aid diagnosis, treatment planning, and monitoring [5]. Automatic segmentation of medical images and videos can also reduce subjective errors caused by time constraints and workloads while enhancing treatment and surgical efficiency.
Designing a neural network architecture for medical image and surgical video segmentation presents a challenge due to the diverse features exhibited by different relevant labels. Specifically, many object classes relevant to medical image and surgical video analysis are heterogeneous, featuring deformable or amorphous instances as well as color, texture, and scale variation. Besides, in surgical videos, motion blur degradation becomes more critical due to the camera's proximity to the surgical scene. Unlike general images, medical images and surgical videos may contain transparent relevant content (such as the intraocular lens) or exhibit blunt boundaries, further complicating the task of semantic segmentation. Accordingly, an effective network for medical image and surgical video segmentation should be able to simultaneously deal with (I) heterogeneity and deformability in relevant objects and (II) transparency, blunt edges, and distortions such as motion and defocus blur.
This paper introduces a U-Net-based CNN for semantic segmentation that effectively addresses the challenges associated with segmenting relevant content in medical images and surgical videos by adaptively capturing semantic information (Note 1). The proposed network, called DeepPyramid+, comprises two key modules: (i) the Pyramid View Fusion (PVF) module, which offers a narrow-to-wide-angle global view of the feature map centered at each pixel position, and (ii) the Deformable Pyramid Reception (DPR) module, responsible for performing shape-adaptive feature extraction on the input convolutional feature map (Note 2). We provide comprehensive experiments comparing the performance of DeepPyramid+ with state-of-the-art baselines on five intra-domain and two cross-domain datasets. Experimental results reveal the superiority of DeepPyramid+ over the baselines, and ablation studies confirm the effectiveness of each proposed module in boosting semantic segmentation performance. To support reproducibility and further investigation, we will release the PyTorch implementation of DeepPyramid+ and all dataset splits upon acceptance of this paper.
Related work
U-Net [7] was initially proposed for medical image segmentation and achieved remarkable performance, largely attributed to its skip connections. Many U-Net-based architectures have been proposed over the past years to improve segmentation accuracy and address the flaws and restrictions of previous architectures [8,9,10,11,12,13,14].
Attention modules
Attention mechanisms can be broadly described as techniques that guide the network's computational resources (i.e., the convolutional operations) toward the most determinative features in the input feature map [9, 15, 16]. Such mechanisms have proven especially beneficial for semantic segmentation. The scSE blocks [15] aim to recalibrate the feature maps based on pixel-wise and channel-wise global features. BARNet [12] adopts a bilinear-attention module to extract cross-dependencies between the different channels of a convolutional feature map. PAANET [11] uses a double-attention module to model semantic dependencies between channels and spatial positions in the convolutional feature map.
Fusion modules
Fusion modules can be characterized as modules designed to improve semantic representation by combining several feature maps. The input feature maps may range from features at different semantic levels to features produced by parallel operations. PSPNet [17] adopts a pyramid pooling module (PPM) containing parallel sub-region average pooling layers followed by upsampling to fuse multi-scale sub-region representations. Atrous spatial pyramid pooling (ASPP) [18, 19] was proposed to deal with objects' scale variance by aggregating multi-scale features extracted with parallel dilated convolutions of varying rates. CPFNet [13] uses another fusion approach for scale-aware feature extraction.
Methodology
We present a segmentation network that focuses on (I) modeling heterogeneous classes featuring deformations and shape, scale, color, and context variation, (II) dealing with content distortion due to motion blur and reflection, and (III) handling objects' transparency and blunt boundaries (Fig. 1). At its core, our network adopts the U-Net architecture, with VGG16 as the encoder. We develop two decoder modules specifically tailored to tackle the mentioned challenges: (1) Pyramid View Fusion (PVF), which replicates a deduction process within the neural network analogous to the functioning of the human visual system by enhancing the representation of relative information at each individual pixel position, and (2) Deformable Pyramid Reception (DPR), which addresses the limitations of regular convolutional layers by introducing deformable dilated convolutions and shape- and scale-adaptive feature extraction techniques. The DPR module can handle the complexities of heterogeneous classes and deformable shapes, resulting in improved accuracy and robustness in segmentation performance.
We specify the functionality of each module in the following subsections. Additional discussions regarding the effectiveness of each module and an analysis of the complexity for each module are available in the supplementary material.
Notations. Throughout this paper, we represent convolutional layers with a kernel size of \((k\times k)\), dilation of d, m output channels, and g groups as \(\circledast _{k,d}^{m,g}\). For deformable convolutions, we use the symbol \({\tilde{\circledast }}_{k,d}^{m,g}\). Average-pooling layers with a kernel size of \((k\times k)\) and a stride of s pixels, as well as global average pooling layers, are denoted by dedicated pooling symbols. The symbol \(+\!\!\!\!+\,_{D}\) denotes feature map concatenation over dimension D. Furthermore, we employ \(\Uparrow ^{(W_{out}, H_{out})}\) and \(\Downarrow ^{(W_{out}, H_{out})}\) for upsampling and downsampling operations with a scale factor of \((W_{out}, H_{out})\), respectively. We use \(\sigma (\cdot )\) to represent the Softmax operation, \(\Vert \cdot \Vert _{n}\) for layer normalization over the last n dimensions, \(\mathcal {R}(\cdot )\) for the ReLU nonlinearity function, and \(\tau (\cdot )\) for the hard tangent hyperbolic function.
Pyramid View Fusion (PVF)
To optimize computational complexity, the initial step involves creating a bottleneck by employing a convolutional layer with a kernel size of one, as illustrated in Fig. 2. Following this dimensionality reduction stage, the resulting convolutional feature map is fed into four parallel branches. The first branch features a global average pooling layer, which is subsequently followed by upsampling. The other three branches employ average pooling layers with progressively increasing filter sizes while maintaining a stride of one pixel. The use of a one-pixel stride is specifically important to achieve a pixel-wise centralized pyramid view, as opposed to the region-wise pyramid attention approach employed in PSPNet [17]. The output feature maps from all branches are then concatenated and fed into a convolutional layer with four groups, for extracting inter-channel dependencies during dimensionality reduction. Subsequently, a regular convolutional layer is applied to extract joint intra-channel and inter-channel dependencies. The resulting feature map is then passed through a layer-normalization function, which helps normalize the activations for improved stability and performance.
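To make the PVF data flow concrete, the following PyTorch sketch mirrors the description above. The bottleneck width, the pooling kernel sizes (3, 5, 7), the ReLU after the grouped convolution, and the use of a single-group GroupNorm as a layer-normalization surrogate are illustrative assumptions rather than the exact DeepPyramid+ configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidViewFusion(nn.Module):
    """Sketch of PVF: 1x1 bottleneck, four parallel pooling branches (global +
    three stride-1 local average poolings), concatenation, grouped conv,
    regular conv, and layer normalization."""

    def __init__(self, in_ch, bottleneck_ch=64, pool_sizes=(3, 5, 7)):
        super().__init__()
        self.bottleneck = nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1)
        # Stride-1 local average poolings keep the spatial size ('same' padding),
        # yielding a pixel-wise centralized pyramid view.
        self.pools = nn.ModuleList(
            [nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes]
        )
        # Grouped conv (one group per branch) captures inter-channel dependencies
        # during dimensionality reduction.
        self.grouped = nn.Conv2d(4 * bottleneck_ch, bottleneck_ch,
                                 kernel_size=3, padding=1, groups=4)
        self.conv = nn.Conv2d(bottleneck_ch, bottleneck_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(1, bottleneck_ch)  # layer-norm-like over (C, H, W)

    def forward(self, x):
        x = self.bottleneck(x)
        h, w = x.shape[-2:]
        # Branch 1: global average pooling followed by upsampling.
        g = F.adaptive_avg_pool2d(x, 1)
        g = F.interpolate(g, size=(h, w), mode="bilinear", align_corners=False)
        branches = [g] + [pool(x) for pool in self.pools]
        y = torch.cat(branches, dim=1)
        y = F.relu(self.grouped(y))
        y = self.conv(y)
        return self.norm(y)
```

Concatenating the branches in a fixed order means each of the four groups of the grouped convolution processes exactly one pyramid view before the regular convolution mixes them.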
Deformable Pyramid Reception (DPR)
The architecture of the Deformable Pyramid Reception (DPR) module, depicted in Fig. 2, can be described as follows. Initially, the upsampled coarse-grained semantic feature map from the preceding layer is concatenated with its symmetric fine-grained feature map from the encoder. Subsequently, the concatenated features are passed through three parallel branches. The first branch employs a regular convolution, while the other two branches use deformable convolutions with dilation rates of three and six. The regular (structured) convolution covers the immediate neighborhood, up to one pixel away from the central pixel. The deformable convolutions with dilation rates of three and six cover areas ranging from two to four and from five to seven pixels away from each central pixel, respectively. Accordingly, by combining these layers, the DPR module forms a learnable sparse receptive field of size \(15\times 15\) pixels. The three branches share their weights to avoid imposing a large number of additional trainable parameters.
To compute the feature-map-adaptive offset field for each deformable convolution, a regular convolution is employed. Considering the target areas of the two deformable convolutions, the offset fields are computed based on the content within four and seven pixels of each central pixel (i.e., with kernels of size \(9\times 9\) and \(15\times 15\), respectively). The computed offset values are passed through a tangent hyperbolic function, which clips them to the range \([-1, 1]\), ensuring that each sampling location of a deformable convolution with dilation d adaptively stays within \([d-1, d+1]\) pixels of the central pixel. The offset field provides two values (horizontal and vertical offsets) per element of the deformable convolutional kernel. Accordingly, the offset field of a deformable convolution with a \(3\times 3\) kernel has 18 output channels. This enables the deformable convolution to spatially adjust its receptive field based on the learned offset values, improving its ability to capture contextually relevant information.
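The following PyTorch sketch illustrates one deformable branch under the assumptions stated above: a plain convolution predicts 18 tanh-clipped offset channels, and torchvision's deform_conv2d applies a \(3\times 3\) deformable convolution with the given dilation. The channel counts, the offset-kernel sizes, and keeping the \(3\times 3\) weights local to the branch (rather than tying them across the three DPR branches) are simplifications for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformableBranch(nn.Module):
    """Sketch of one deformable DPR branch (dilation 3 or 6)."""

    def __init__(self, in_ch, out_ch, dilation, offset_kernel):
        super().__init__()
        # Offset field: 2 values (x, y) per element of the 3x3 kernel -> 18 channels.
        # Per the text, offset_kernel would be 9 for dilation 3 and 15 for dilation 6.
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=offset_kernel,
                                     padding=offset_kernel // 2)
        self.dilation = dilation
        # In DPR the 3x3 weights are shared across branches; here they are local.
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        # tanh clips every offset to [-1, 1], so each sampling point stays within
        # one pixel of its dilated-grid position: the branch covers [d-1, d+1].
        offsets = torch.tanh(self.offset_conv(x))
        return deform_conv2d(x, offsets, self.weight, self.bias,
                             padding=self.dilation, dilation=self.dilation)


# Hypothetical usage: branch = DeformableBranch(128, 128, dilation=3, offset_kernel=9)
```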
The output feature maps of the parallel structured and deformable convolutions are then passed through a feature fusion decision (FFD) module [4]. This module determines the significance of each input feature map based on the spatial descriptors using pixel-wise convolutions. These descriptors are concatenated and subjected to a Softmax operation, resulting in normalized descriptors. The normalized descriptors determine the pixel-wise contribution or weight of each input convolutional feature map in the final fused feature map. The output feature map of the FFD module is obtained as a weighted sum of the input feature maps, where the normalized descriptors serve as pixel-wise weights. The resulting feature map from the FFD module goes through a series of additional operations for deeper feature extraction and normalization.
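A minimal sketch of this fusion step follows. It assumes one single-channel descriptor per branch and three input branches (the regular and two deformable convolutions), which matches the description above but not necessarily every implementation detail of the FFD module [4].

```python
import torch
import torch.nn as nn


class FeatureFusionDecision(nn.Module):
    """Sketch of FFD: pixel-wise descriptors -> Softmax weights -> weighted sum."""

    def __init__(self, channels, num_branches=3):
        super().__init__()
        # One pixel-wise (1x1) convolution per branch produces a spatial descriptor.
        self.descriptors = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_branches)]
        )

    def forward(self, branches):
        # branches: list of (N, C, H, W) feature maps from the parallel convolutions.
        scores = torch.cat([d(b) for d, b in zip(self.descriptors, branches)], dim=1)
        weights = torch.softmax(scores, dim=1)   # (N, num_branches, H, W)
        stacked = torch.stack(branches, dim=1)   # (N, num_branches, C, H, W)
        fused = (weights.unsqueeze(2) * stacked).sum(dim=1)
        return fused                             # (N, C, H, W)
```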
Experimental settings
Datasets
We evaluate the performance of our proposed network on five intra-domain datasets from three different modalities (video, MRI, and OCT) and two cross-domain datasets from two different modalities. Table 1 details the specifications of the adopted datasets, and Fig. 3 presents exemplary images together with the ground-truth segmentations from each dataset. These datasets cover a wide range of object classes with distinct characteristics. For example, endometriosis videos contain amorphous endometrial implants with color and texture variations, OCT scans involve amorphous intraretinal fluid, and prostate MRI images include deformations and variations in scale, contrast, and brightness. In addition, instrument segmentation in cataract and laparoscopy surgeries presents various challenges, such as scale variation, reflection, motion blur, and defocus blur degradation. The diversity of these datasets ensures realistic conditions for evaluating the proposed network's effectiveness in addressing challenges in medical image and surgical video segmentation (Note 3). For result reproducibility, we provide all train/test splits as CSV files in the paper's GitHub repository.
Alternative methods
We compare the effectiveness of our proposed network architecture with eleven state-of-the-art neural networks using different backbones. Table 2 lists the specifications of the baselines and the proposed network. Note that UNet+ is an improved version of UNet, in which we use VGG16 as the backbone network and double convolutional blocks (two consecutive convolutions, each followed by batch normalization and ReLU layers) as decoder modules. To enable fair comparisons with alternative methods, we report the performance of DeepPyramid+ with three different backbones (VGG16, ResNet34, and ResNet50).
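For reference, a minimal sketch of such a double convolutional decoder block is given below; the \(3\times 3\) kernel size and padding are assumptions, since only the conv–batch-norm–ReLU composition is specified above.

```python
import torch.nn as nn


def double_conv_block(in_ch, out_ch):
    """Decoder block of the UNet+ baseline: two consecutive convolutions,
    each followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```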
Training settings
All backbones are initialized with ImageNet pre-trained parameters. We use a batch size of four for all datasets, set the initial learning rate to 0.001, and decrease it during training using polynomial decay, \(lr = lr_{\text {init}}\times (1-\frac{\text {iter}}{\text {total iter}})^{0.9}\). The input size of the networks is \(512\times 512\) pixels for all datasets. During training, we apply cropping, random rotation (up to \(30^{\circ }\)), color jittering (brightness = 0.7, contrast = 0.7, saturation = 0.7), Gaussian blurring, and random sharpening as augmentations, and we use the cross-entropy log-Dice loss [6]. All experiments are conducted on NVIDIA RTX 3090 GPUs.
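As a sketch, the polynomial decay above can be applied per iteration as follows; the choice of optimizer (SGD in the usage comment) is an assumption, since the optimizer is not specified in the text.

```python
import torch


def poly_lr(optimizer, lr_init, curr_iter, total_iter, power=0.9):
    """Polynomial learning-rate decay: lr = lr_init * (1 - iter / total_iter) ** power."""
    lr = lr_init * (1 - curr_iter / total_iter) ** power
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr


# Hypothetical usage with an assumed SGD optimizer:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# for it in range(total_iter):
#     poly_lr(optimizer, lr_init=0.001, curr_iter=it, total_iter=total_iter)
#     ...  # forward pass, cross-entropy log-Dice loss, backward, optimizer.step()
```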
Ablation study settings
To evaluate the effectiveness of the different modules, we use the improved version of UNet (UNet+) with the same backbone (VGG16) as our baseline. This network does not include any PVF modules, and the DPR module is replaced with a sequence of two convolutional layers, each followed by a batch normalization layer and a ReLU activation.
Experimental results
Table 3 reports the segmentation performance of the proposed and state-of-the-art networks across three different modalities. DeepPyramid+ consistently achieves the highest average performance across all datasets with the various backbones, whereas competing methods, such as CPFNet, exhibit backbone-dependent performance and trail DeepPyramid+ by 2.22% on average in the closest case. Moreover, DeepPyramid+ achieves the best results with all three backbones for endometrial implant and prostate segmentation, and the best results with the ResNet34 and ResNet50 backbones for IRF segmentation in OCT. Regarding instrument segmentation (Table 4), DeepPyramid+ with the VGG16 backbone shows a gain of more than 5.6% in Dice compared to CPFNet, its main alternative (58.93% vs. 53.29%), and more than 2.7% higher performance than all other methods across all backbones. Furthermore, the best results for both datasets correspond to DeepPyramid+ with the VGG16 backbone. Overall, DeepPyramid+ with our suggested backbone (VGG16) achieves the best segmentation performance for both instrument and organ/disease segmentation.
Table 5 compares the cross-domain segmentation performance of DeepPyramid+ and its best two alternatives for three backbones (considering single-domain results in Table 3 and Table 4). Overall, DeepPyramid+ consistently outperforms other methods across all backbones. Considering the MRI dataset, DeepPyramid+ with VGG16 backbone shows more than 4.8% gain in Dice compared to alternatives. For instrument segmentation in cataract surgery, DeepPyramid+ with the VGG16 backbone exhibits an impressive improvement of approximately 19.5% in Dice score compared to CPFNet with the same backbone (55.10% vs. 35.59%), and a 17% improvement compared to the best alternative across all backbones (55.10% vs. 38.10% achieved by UPerNet). This exceptional performance in dealing with cross-domain distribution gaps [28] can be attributed to the effectiveness of the proposed modules in incorporating multi-scale local and global features.
Table 6 provides an ablation study of DeepPyramid+ components. The results suggest that both the PVF and DPR modules contribute significantly to improvements in segmentation performance across all datasets. This impact is most prominent in the case of cataract surgery, where adding the PVF and DPR modules leads to a 4.95% and 4.72% increase in the Dice coefficient, respectively.
Conclusion
In recent years, considerable attention has been devoted to computerized medical image and surgical video analysis. A reliable relevant-instance segmentation approach is a prerequisite for the majority of these applications. In this paper, we introduce a novel network architecture for semantic segmentation that addresses the challenges encountered in medical image and surgical video segmentation. Our proposed architecture, DeepPyramid+, incorporates two innovative modules, namely “Pyramid View Fusion” and “Deformable Pyramid Reception.” Experimental results demonstrate the effectiveness of DeepPyramid+ in capturing object features in challenging scenarios, including shape and scale variation, reflection and blur degradation, blunt edges, and deformability, resulting in competitive cross-domain segmentation performance compared to state-of-the-art networks. The ablation study validates the efficacy of the proposed modules, showcasing their contribution across diverse datasets. These promising results indicate the potential of DeepPyramid+ to enhance precision in various computerized medical imaging and surgical video analysis applications.
Notes
1. This paper is an extended version of DeepPyramid [6], featuring minor enhancements in the DPR module.
2. The PyTorch implementation of DeepPyramid+ is publicly available at https://github.com/Negin-Ghamsarian/DeepPyramid_Plus.
3. This paper aims to design a dedicated network tailored to address medical image and video segmentation challenges, emphasizing various modalities but not within a multi-modal training framework. We substantiate the efficacy of our model through distinct validations across diverse medical image and video datasets.
References
Ghamsarian N, Taschwer M, Putzgruber-Adamitsch D, Sarny S, Schoeffmann K (2021) Relevance detection in cataract surgery videos by spatio-temporal action localization. In: 2020 25th International conference on pattern recognition (ICPR), pp 10720–10727
Ghamsarian N (2020) Enabling relevance-based exploration of cataract videos. In: Proceedings of the 2020 international conference on multimedia retrieval, pp 378–382
Ghamsarian N, Amirpourazarian H, Timmerer C, Taschwer M, Schöffmann K (2020) Relevance-based compression of cataract surgery videos using convolutional neural networks. In: Proceedings of the 28th ACM international conference on multimedia, pp 3577–3585
Ghamsarian N, Taschwer M, Putzgruber-Adamitsch D, Sarny S, El-Shabrawi Y, Schoeffmann K (2021) LensID: a CNN-RNN-based framework towards lens irregularity detection in cataract surgery videos. In: Medical image computing and computer assisted intervention—MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24. Springer, pp 76–86
Huang X, Wang H, She C, Feng J, Liu X, Hu X, Chen L, Tao Y (2022) Artificial intelligence promotes the diagnosis and screening of diabetic retinopathy. Front Endocrinol 13:946915
Ghamsarian N, Taschwer M, Sznitman R, Schoeffmann K (2022) DeepPyramid: Enabling pyramid view and deformable pyramid reception for semantic segmentation in cataract surgery videos. In: Medical image computing and computer assisted intervention—MICCAI 2022: 25th international conference, Singapore, September 18–22, 2022, Proceedings, Part V. Springer, pp 276–286
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF (eds) Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015. Springer, Cham, pp 234–241
Chen X, Zhang R, Yan P (2019) Feature fusion encoder decoder network for automatic liver lesion segmentation. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019), pp 430–433
Ni Z-L, Bian G-B, Zhou X-H, Hou Z-G, Xie X-L, Wang C, Zhou Y-J, Li R-Q, Li Z (2019) Raunet: Residual attention u-net for semantic segmentation of cataract surgical instruments. In: Gedeon T, Wong KW, Lee M (eds) Neural Information Processing. Springer, Cham, pp 139–149
Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, Zhang T, Gao S, Liu J (2019) CE-Net: Context encoder network for 2D medical image segmentation. IEEE Trans Med Imaging 38(10):2281–2292
Ni Z-L, Bian G-B, Wang G-A, Zhou X-H, Hou Z-G, Chen H-B, Xie X-L (2020) Pyramid attention aggregation network for semantic segmentation of surgical instruments. Proc AAAI Conf Artif Intell 34(07):11782–11790
Ni Z-L, Bian G-B, Wang G-A, Zhou X-H, Hou Z-G, Xie X-L, Li Z, Wang Y-H (2021) Barnet: bilinear attention network with adaptive receptive fields for surgical instrument segmentation. In: Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pp 832–838
Feng S, Zhao H, Shi F, Cheng X, Wang M, Ma Y, Xiang D, Zhu W, Chen X (2020) CPFNet: Context pyramid fusion network for medical image segmentation. IEEE Trans Med Imaging 39(10):3008–3018
Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J (2020) Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging 39(6):1856–1867
Roy AG, Navab N, Wachinger C (2019) Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Trans Med Imaging 38(2):540–549
Ghamsarian N, Taschwer M, Putzgruber-Adamitsch D, Sarny S, El-Shabrawi Y, Schöffmann K (2021) Recal-net: Joint region-channel-wise calibrated network for semantic segmentation in cataract surgery videos. In: Neural information processing: 28th international conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings, Part III 28. Springer, pp 391–402
Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV)
Ghamsarian N, El-Shabrawi Y, Nasirihaghighi S, Putzgruber-Adamitsch D, Zinkernagel M, Wolf S, Schoeffmann K, Sznitman R (2023) Cataract-1K: cataract surgery dataset for scene segmentation, phase recognition, and irregularity detection. arXiv preprint https://arxiv.org/abs/2312.06295
Bodenstedt S, Speidel S, Allan M, Stoyanov D, Maier-Hein L, Kenngott H, Wagner M (2015) Multi-instrument EndoVis challenge dataset. https://endovissub-instrument.grand-challenge.org/
Leibetseder A, Schoeffmann K, Keckstein J, Keckstein S (2022) Endometriosis detection and localization in laparoscopic gynecology. Multimed Tools Appl 81(5):6191–6215
Liu Q, Dou Q, Yu L, Heng PA (2020) MS-Net: multi-site network for improving prostate segmentation with heterogeneous MRI data. IEEE Trans Med Imaging
Bogunovic H, Venhuizen F, Klimscha S, Apostolopoulos S, Bab-Hadiashar A, Bagci U, Beg MF, Bekalo L, Chen Q, Ciller C, Gopinath K, Gostar AK, Jeon K, Ji Z, Kang SH, Koozekanani DD, Lu D, Morley D, Parhi KK, Park HS, Rashno A, Sarunic M, Shaikh S, Sivaswamy J, Tennakoon R, Yadav S, De Zanet S, Waldstein SM, Gerendas BS, Klaver C, Sánchez CI, Schmidt-Erfurth U (2019) Retouch: the retinal oct fluid detection and segmentation benchmark and challenge. IEEE Trans Med Imaging 38(8):1858–1874
Grammatikopoulou M, Flouty E, Kadkhodamohammadi A, Quellec G, Chow A, Nehme J, Luengo I, Stoyanov D (2021) CaDIS: Cataract dataset for surgical RGB-image segmentation. Med Image Anal 71:102053
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 801–818
Xiao T, Liu Y, Zhou B, Jiang Y, Sun J (2018) Unified perceptual parsing for scene understanding. In: Proceedings of the European conference on computer vision (ECCV), pp 418–434
Ghamsarian N, Gamazo Tejero J, Márquez-Neila P, Wolf S, Zinkernagel M, Schoeffmann K, Sznitman R (2023) Domain adaptation for medical image segmentation using transformation-invariant self-training. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 331–341
Funding
Open access funding provided by University of Bern
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
For this type of study, formal consent is not required.
Informed consent
This article uses patient data from publicly available datasets.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was funded by Haag-Streit Foundation, Switzerland.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.