Enhancing the Ground Truth Disparity by MAP Estimation for Developing a Neural-Net Based Stereoscopic Camera
Figure 1. The proposed framework for enhancing the SGM disparity map.
Figure 2. Example of left (a) and RGB (b) images.
Figure 3. (a) Disparity map generated by SGM for the images in Figure 1. (b) Enlarged view of (a). Dark blue pixels indicate "invalid" regions; the numbers represent the disparity values of each grouped region.
Figure 4. (a) Prior probability, (b) likelihood, and (c) posterior distribution of an invalid pixel from Figure 3.
Figure 5. Preprocessing steps for the proposed method: (a) original cropped patch, (b) standardized patch, (c) mask, and (d) masked patch.
Figure 6. (a) Left masked cropped patch. (b) Right cropped candidate patches.
Figure 7. Disparity map comparisons on the synthetic Driving dataset across different scenes. (a) Ground truth, (b) SGM ($U_{th}=10$), (c) SGM ($U_{th}=0$), (d) linear interpolation, (e) nearest interpolation, (f) PDE, (g) ShCNN [56], (h) GMCNN [55], (i) MADF [58], (j) Chen [57], and (k) the proposed method. Invalid regions are shown in dark blue.
Figure 8. Example captured images of real-world indoor scenes.
Figure 9. Disparity map comparisons across different real-world scenes. (a) Input left images, (b) SGM ($U_{th}=10$), (c) linear interpolation, (d) nearest interpolation, (e) PDE, (f) Shepard inpainting [56], (g) GMCNN [55], (h) MADF [58], (i) Chen [57], and (j) the proposed method. The insets highlight areas with significant differences, particularly in challenging regions with occlusions and textureless surfaces. Invalid regions are shown in dark blue.
Figure 10. Basic CNN-based model for disparity estimation.
Figure 11. ResNet-based model for disparity estimation. Additional low-scale intermediate feature maps are used to capture the structural information of the disparity at the original size.
Figure 12. Vision Transformer-based model for disparity estimation. This model replaces the encoder of the baseline model with a Vision Transformer (ViT) and modifies the decoder from Figure 10 accordingly.
Figure 13. Disparity map comparisons across various scenes using different models. (a) GT, (b) CNN-based, (c) ResNet-based, (d) ViT-based, (e) PSMNet [50]. For each scene, the first row is trained with the original SGM GT and the second row with the proposed GT. Invalid regions are shown in dark red.
Figure 14. Relationship between patch size $\mathbf{S}_p$ and both the error and the invalid pixel ratio for various prior window sizes $\mathbf{S}_w$ at a fixed intensity threshold $I_{th}=0.1$. (a) Smaller values of $\mathbf{S}_w$ and larger values of $\mathbf{S}_p$ tend to minimize error, with an optimal configuration around $\mathbf{S}_w = 17 \times 17$ and $\mathbf{S}_p = 24 \times 4$. (b) Smaller values of $\mathbf{S}_w$ generally lead to higher invalid pixel ratios, while smaller values of $\mathbf{S}_p$ reduce them, indicating a trade-off in parameter selection between minimizing error and reducing the invalid pixel ratio.
Figure 15. 3D visualization of the error as a function of prior window size $\mathbf{S}_w$ and patch size $\mathbf{S}_p$, with point size indicating the reciprocal of the invalid pixel ratio.
Abstract
1. Introduction
2. Related Work
- MAP-Based Interpolation with Enhanced Likelihood Calculation for Invalid Disparity Pixels: Our MAP estimation approach leverages surrounding disparity information as the prior and uses cosine similarity as the likelihood, complemented by a preprocessing step that standardizes pixel intensity and applies masking. This posterior estimation identifies the most plausible disparity values for invalid regions caused by occlusions, textureless surfaces, and inconsistencies between the left and right images in stereo vision (a sketch of this estimation step is given after this list).
- Practical Validation in Real-World Scenarios Where Ground Truth Is Difficult to Obtain: The proposed method demonstrates robustness and applicability in both ground truth (GT)-available and GT-unavailable scenarios. Where GT is unavailable, learning-based methods trained on other datasets often struggle with indistinct boundaries and propagate incorrect predictions when applied to unseen images. Our approach overcomes these limitations by employing MAP-based interpolation to generate sharper and more reliable disparity maps. Evaluations on a dataset of over 4000 real-world stereo images confirm that the proposed method generalizes effectively across diverse environments, underscoring its practical usability in applications where GT is unavailable.
- Improved Ground Truth for Lightweight Neural Network Training: We evaluate the impact of our enhanced ground truth data on various lightweight neural network architectures optimized for mobile applications. Specifically, we compare network performance using original SGM-based ground truth data against ground truth data enhanced by our proposed method. Results indicate that the proposed enhancement significantly improves output quality and generalization of these networks, demonstrating its value for real-time stereoscopic applications with limited computational resources.
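The sketch below illustrates this MAP-based interpolation for a single invalid pixel, assuming a frequency-based prior over the valid disparities in the surrounding window and a cosine-similarity likelihood over standardized, masked patches. The function names, patch geometry, border handling, and the exact form of the prior are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def standardize(patch, eps=1e-8):
    """Zero-mean, unit-variance standardization of an image patch."""
    patch = patch.astype(float)
    return (patch - patch.mean()) / (patch.std() + eps)

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two patches, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def map_disparity(left, right, disp, y, x,
                  window=17, patch_h=24, patch_w=4, i_th=0.1):
    """MAP estimate of the disparity at an invalid pixel (y, x).

    Prior:      relative frequency of the valid disparities inside a
                window x window neighborhood centred on (y, x).
    Likelihood: cosine similarity between the standardized, masked left
                patch and the right patch shifted by each candidate disparity.
    Assumes invalid disparities are stored as -1 and that (y, x) lies far
    enough from the image borders for all crops to be in range.
    """
    r = window // 2
    neighborhood = disp[y - r:y + r + 1, x - r:x + r + 1]
    valid = neighborhood[neighborhood >= 0]
    if valid.size == 0:
        return -1                                     # nothing to infer from
    candidates, counts = np.unique(valid.astype(int), return_counts=True)
    prior = counts / counts.sum()

    ph, pw = patch_h // 2, patch_w // 2
    left_patch = standardize(left[y - ph:y + ph, x - pw:x + pw])
    mask = (np.abs(left_patch) > i_th).astype(float)  # drop low-intensity pixels
    left_masked = left_patch * mask

    posterior = np.zeros(len(candidates))
    for i, d in enumerate(candidates):
        xr = x - d                                    # matching column in the right image
        if xr - pw < 0:
            continue
        right_patch = standardize(right[y - ph:y + ph, xr - pw:xr + pw]) * mask
        likelihood = max(cosine_similarity(left_masked, right_patch), 0.0)
        posterior[i] = likelihood * prior[i]

    return int(candidates[np.argmax(posterior)]) if posterior.any() else -1
```

Restricting the candidate set to disparities already observed in the surrounding window keeps the search small and biases the estimate toward locally plausible values, which is the role the prior plays in the proposed formulation; in practice such a routine would be called once per invalid pixel and the result written back into the disparity map.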
3. Proposed MAP Estimation
3.1. Invalid Regions of SGM
3.2. Prior Probability of Invalid Pixels
3.3. Likelihood Probability
- The original captured image is standardized.
- The standardized image is then masked so that pixels with low intensity are not regarded as distinct features; both steps are sketched below.
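A plausible reconstruction of these two steps is given below; the z-score form of the standardization, the absolute-value threshold, and the symbols $\hat{I}$, $M$, and $\tilde{I}$ are illustrative assumptions rather than the paper's exact notation, with $I_{th}$ the intensity threshold (0.1 in the parameter study):

$$
\hat{I}(x,y) = \frac{I(x,y) - \mu_I}{\sigma_I}, \qquad
M(x,y) =
\begin{cases}
1, & \left|\hat{I}(x,y)\right| > I_{th}, \\
0, & \text{otherwise},
\end{cases}
\qquad
\tilde{I}(x,y) = M(x,y)\,\hat{I}(x,y),
$$

where $\mu_I$ and $\sigma_I$ are the mean and standard deviation of the intensities in the cropped patch, and $\tilde{I}$ is the masked patch used in the likelihood computation.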
3.4. Posterior Probability
4. Experimental Results
4.1. Performance Evaluation on Synthetic Datasets with Ground Truth
4.2. Self-Generated Real-World Dataset
4.2.1. Ground Truth Disparity Map Comparison for Training Neural-Nets
4.2.2. Neural Network Output Map Comparison
5. Discussion
Parameter Selection for the Proposed Method
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Peng, R.; Wang, R.; Wang, Z.; Lai, Y.; Wang, R. Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8645–8654. [Google Scholar]
- Fan, R.; Wang, L.; Bocus, M.J.; Pitas, I. Computer stereo vision for autonomous driving. arXiv 2020, arXiv:2012.03194. [Google Scholar]
- Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. 2021, 23, 722–739. [Google Scholar] [CrossRef]
- Rajpal, A.; Cheema, N.; Illgner-Fehns, K.; Slusallek, P.; Jaiswal, S. High-Resolution Synthetic rgb-d Datasets for Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1188–1198. [Google Scholar]
- Toschi, M.; De Matteo, R.; Spezialetti, R.; De Gregorio, D.; Di Stefano, L.; Salti, S. Relight my Nerf: A Dataset for Novel View Synthesis and Relighting of Real World Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20762–20772. [Google Scholar]
- Ummenhofer, B.; Agrawal, S.; Sepulveda, R.; Lao, Y.; Zhang, K.; Cheng, T.; Richter, S.; Wang, S.; Ros, G. Objects With Lighting: A Real-World Dataset for Evaluating Reconstruction and Rendering for Object Relighting. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024; pp. 137–147. [Google Scholar]
- Kallwies, J.; Engler, T.; Forkel, B.; Wuensche, H.J. Triple-SGM: Stereo Processing Using Semi-Global Matching with Cost Fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 192–200. [Google Scholar]
- Hirschmuller, H. Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 807–814. [Google Scholar]
- Jeong, J.C.; Shin, H.; Chang, J.; Lim, E.G.; Choi, S.M.; Yoon, K.J.; Cho, J.i. High-quality stereo depth map generation using infrared pattern projection. ETRI J. 2013, 35, 1011–1020. [Google Scholar] [CrossRef]
- Xu, Y.; Yang, X.; Yu, Y.; Jia, W.; Chu, Z.; Guo, Y. Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1746–1755. [Google Scholar]
- Li, H.; Chan, T.N.; Qi, X.; Xie, W. Detail-preserving multi-exposure fusion with edge-preserving structural patch decomposition. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4293–4304. [Google Scholar] [CrossRef]
- Ma, F.; Cavalheiro, G.V.; Karaman, S. Self-Supervised Sparse-to-Dense: Self-Supervised Depth Completion from Lidar and Monocular Camera. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3288–3295. [Google Scholar]
- Zhao, Y.; Krähenbühl, P. Real-Time Online Video Detection with Temporal Smoothing Transformers. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 485–502. [Google Scholar]
- Yao, P.; Zhang, H.; Xue, Y.; Zhou, M.; Xu, G.; Gao, Z. Iterative Color-Depth MST Cost Aggregation for Stereo Matching. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar]
- Chai, Y.; Cao, X. Stereo Matching Algorithm Based on Joint Matching Cost and Adaptive Window. In Proceedings of the 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 October 2018; pp. 442–446. [Google Scholar]
- Yang, S.; Lei, X.; Liu, Z.; Sui, G. An efficient local stereo matching method based on an adaptive exponentially weighted moving average filter in SLIC space. IET Image Process. 2021, 15, 1722–1732. [Google Scholar] [CrossRef]
- Yang, J.; Wang, H.; Ding, Z.; Lv, Z.; Wei, W.; Song, H. Local stereo matching based on support weight with motion flow for dynamic scene. IEEE Access 2016, 4, 4840–4847. [Google Scholar] [CrossRef]
- Bleyer, M.; Rother, C.; Kohli, P. Surface Stereo with Soft Segmentation. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1570–1577. [Google Scholar]
- Yamaguchi, K.; McAllester, D.; Urtasun, R. Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 756–771. [Google Scholar]
- Ulusoy, A.O.; Black, M.J.; Geiger, A. Semantic Multi-View Stereo: Jointly Estimating Objects and Voxels. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4531–4540. [Google Scholar]
- Ye, X.; Gu, Y.; Chen, L.; Li, J.; Wang, H.; Zhang, X. Order-based disparity refinement including occlusion handling for stereo matching. IEEE Signal Process. Lett. 2017, 24, 1483–1487. [Google Scholar] [CrossRef]
- Li, A.; Yuan, Z.; Ling, Y.; Chi, W.; Zhang, S.; Zhang, C. Unsupervised occlusion-aware stereo matching with directed disparity smoothing. IEEE Trans. Intell. Transp. Syst. 2021, 23, 7457–7468. [Google Scholar] [CrossRef]
- Xie, Y.; Zeng, S.; Chen, L. A Novel Disparity Refinement Method Based on Semi-Global Matching Algorithm. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshop, Shenzhen, China, 14 December 2014; pp. 1135–1142. [Google Scholar]
- Yang, Q. A Non-Local Cost Aggregation Method for Stereo Matching. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1402–1409. [Google Scholar]
- Sun, X.; Mei, X.; Jiao, S.; Zhou, M.; Liu, Z.; Wang, H. Real-time local stereo via edge-aware disparity propagation. Pattern Recognit. Lett. 2014, 49, 201–206. [Google Scholar] [CrossRef]
- Hosni, A.; Bleyer, M.; Gelautz, M.; Rhemann, C. Local Stereo Matching Using Geodesic Support Weights. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 2093–2096. [Google Scholar]
- Yang, Q.; Wang, L.; Yang, R.; Stewénius, H.; Nistér, D. Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 492–504. [Google Scholar] [CrossRef]
- Sun, J.; Li, Y.; Kang, S.B.; Shum, H.Y. Symmetric Stereo Matching for Occlusion Handling. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 399–406. [Google Scholar]
- Mozerov, M.G.; Van De Weijer, J. Accurate stereo matching by two-step energy minimization. IEEE Trans. Image Process. 2015, 24, 1153–1163. [Google Scholar] [CrossRef] [PubMed]
- Heitz, F.; Bouthemy, P. Multimodal estimation of discontinuous optical flow using Markov random fields. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 1217–1232. [Google Scholar] [CrossRef]
- Yamaguchi, K.; Hazan, T.; McAllester, D.; Urtasun, R. Continuous Markov Random Fields for Robust Stereo Estimation. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part V 12. Springer: Cham, Switzerland, 2012; pp. 45–58. [Google Scholar]
- Hosni, A.; Rhemann, C.; Bleyer, M.; Rother, C.; Gelautz, M. Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 504–511. [Google Scholar] [CrossRef] [PubMed]
- Wu, W.; Li, L.; Jin, W. Disparity Refinement Based on Segment-Tree and Fast Weighted Median Filter. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3449–3453. [Google Scholar]
- Sun, X.; Mei, X.; Jiao, S.; Zhou, M.; Wang, H. Stereo Matching with Reliable Disparity Propagation. In Proceedings of the 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, Hangzhou, China, 16–19 May 2011; pp. 132–139. [Google Scholar]
- Zhan, Y.; Gu, Y.; Huang, K.; Zhang, C.; Hu, K. Accurate image-guided stereo matching with efficient matching cost and disparity refinement. IEEE Trans. Circuits Syst. Video Technol. 2015, 26, 1632–1645. [Google Scholar] [CrossRef]
- Mei, X.; Sun, X.; Zhou, M.; Jiao, S.; Wang, H.; Zhang, X. On Building an Accurate Stereo Matching System on Graphics Hardware. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 467–474. [Google Scholar]
- Jiao, J.; Wang, R.; Wang, W.; Dong, S.; Wang, Z.; Gao, W. Local stereo matching with improved matching cost and disparity refinement. IEEE Multimed. 2014, 21, 16–27. [Google Scholar] [CrossRef]
- Zhang, K.; Lu, J.; Yang, Q.; Lafruit, G.; Lauwereins, R.; Van Gool, L. Real-time and accurate stereo: A scalable approach with bitwise fast voting on CUDA. IEEE Trans. Circuits Syst. Video Technol. 2011, 21, 867–878. [Google Scholar] [CrossRef]
- Stentoumis, C.; Grammatikopoulos, L.; Kalisperakis, I.; Karras, G. On accurate dense stereo-matching using a local adaptive multi-cost approach. ISPRS J. Photogramm. Remote Sens. 2014, 91, 29–49. [Google Scholar] [CrossRef]
- Miclea, V.C.; Nedevschi, S. Real-time semantic segmentation-based stereo reconstruction. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1514–1524. [Google Scholar] [CrossRef]
- Chen, Z.; Dong, P.; Li, Z.; Yao, R.; Ma, Y.; Fang, X.; Deng, H.; Zhang, W.; Chen, L.; An, F. Real-Time FPGA-Based Binocular Stereo Vision System with Semi-Global Matching Algorithm. In Proceedings of the 2021 IEEE 34th International System-on-Chip Conference (SOCC), Las Vegas, NV, USA, 14–17 September 2021; pp. 158–163. [Google Scholar]
- Jin, S.; Cho, J.; Dai Pham, X.; Lee, K.M.; Park, S.K.; Kim, M.; Jeon, J.W. FPGA design and implementation of a real-time stereo vision system. IEEE Trans. Circuits Syst. Video Technol. 2009, 20, 15–26. [Google Scholar]
- Cambuim, L.F.; Oliveira, L.A., Jr.; Barros, E.N.; Ferreira, A.P. An FPGA-based real-time occlusion robust stereo vision system using semi-global matching. J. Real-Time Image Process. 2020, 17, 1447–1468. [Google Scholar] [CrossRef]
- Zhang, Q.; Xu, L.; Jia, J. 100+ Times Faster Weighted Median Filter (WMF). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2830–2837. [Google Scholar]
- Kim, S.; Min, D.; Kim, S.; Sohn, K. Unified confidence estimation networks for robust stereo matching. IEEE Trans. Image Process. 2018, 28, 1299–1313. [Google Scholar] [CrossRef] [PubMed]
- Chao, W.; Wang, X.; Wang, Y.; Wang, G.; Duan, F. Learning sub-pixel disparity distribution for light field depth estimation. IEEE Trans. Comput. Imaging 2023, 9, 1126–1138. [Google Scholar] [CrossRef]
- Žbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 1–32. [Google Scholar]
- Wang, Y.; Yang, Y.; Yang, Z.; Zhao, L.; Wang, P.; Xu, W. Occlusion Aware Unsupervised Learning of Optical Flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4884–4893. [Google Scholar]
- Guo, X.; Yang, K.; Yang, W.; Wang, X.; Li, H. Group-Wise Correlation Stereo Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3273–3282. [Google Scholar]
- Chang, J.R.; Chen, Y.S. Pyramid Stereo Matching Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar]
- Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2495–2504. [Google Scholar]
- Bangunharcana, A.; Cho, J.W.; Lee, S.; Kweon, I.S.; Kim, K.S.; Kim, S. Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3542–3548. [Google Scholar]
- Wang, F.; Galliani, S.; Vogel, C.; Pollefeys, M. Itermvs: Iterative Probability Estimation for Efficient Multi-View Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8606–8615. [Google Scholar]
- Xu, G.; Wang, X.; Ding, X.; Yang, X. Iterative Geometry Encoding Volume for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21919–21928. [Google Scholar]
- Wang, Y.; Tao, X.; Qi, X.; Shen, X.; Jia, J. Image Inpainting via Generative Multi-Column Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Ren, J.S.; Xu, L.; Yan, Q.; Sun, W. Shepard Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montréal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Chen, H.; Zhao, Y. Don’t Look into the Dark: Latent Codes for Pluralistic Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7591–7600. [Google Scholar]
- Zhu, M.; He, D.; Li, X.; Li, C.; Li, F.; Liu, X.; Ding, E.; Zhang, Z. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Trans. Image Process. 2021, 30, 4855–4866. [Google Scholar] [CrossRef]
- Li, W.; Lin, Z.; Zhou, K.; Qi, L.; Wang, Y.; Jia, J. Mat: Mask-Aware Transformer for Large Hole Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10758–10768. [Google Scholar]
- Zhang, Z.; Wu, B.; Wang, X.; Luo, Y.; Zhang, L.; Zhao, Y.; Vajda, P.; Metaxas, D.; Yu, L. AVID: Any-Length Video Inpainting with Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7162–7172. [Google Scholar]
- Liu, H.; Wang, Y.; Qian, B.; Wang, M.; Rui, Y. Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 8038–8047. [Google Scholar]
- Wei, C.; Mangalam, K.; Huang, P.Y.; Li, Y.; Fan, H.; Xu, H.; Wang, H.; Xie, C.; Yuille, A.; Feichtenhofer, C. Diffusion Models as Masked Autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16284–16294. [Google Scholar]
- Li, X.; Guo, Q.; Abdelfattah, R.; Lin, D.; Feng, W.; Tsang, I.; Wang, S. Leveraging Inpainting for Single-Image Shadow Removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 13055–13064. [Google Scholar]
- Sargsyan, A.; Navasardyan, S.; Xu, X.; Shi, H. Mi-gan: A Simple Baseline for Image Inpainting on Mobile Devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7335–7345. [Google Scholar]
- Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
| Metric | SGM ($U_{th}=10$) | SGM ($U_{th}=0$) | Linear | Nearest | PDE | ShCNN [56] | GMCNN [55] | MADF [58] | Chen [57] | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| EPE (Overall) | 4.87 | 7.77 | 7.30 | 7.48 | 7.37 | 8.87 | 8.51 | 6.36 | 4.08 | 6.68 |
| EPE (Upper) | 2.69 | 3.38 | 3.31 | 3.34 | 3.31 | 3.53 | 3.37 | 3.29 | 3.38 | 3.25 |
| EPE (Lower) | 10.15 | 12.19 | 11.33 | 11.65 | 11.46 | 14.06 | 13.69 | 9.45 | 4.79 | 10.15 |
| Invalid (%) | 37.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.15 |
| | CNN-Based | ResNet-Based | ViT-Based | PSMNet [50] |
|---|---|---|---|---|
| # parameters | 220.93 K | 202.87 K | 1.79 M | 5.22 M |
| FLOPs | 21.8 G | 21.7 G | 14.9 G | 893.8 G |
| | CNN-Based | ResNet-Based | ViT-Based | PSMNet [50] |
|---|---|---|---|---|
| SGM GT | 1.83 | 2.13 | 1.74 | 1.26 |
| Proposed GT | 1.91 | 2.03 | 1.87 | 1.35 |
Citation: Gil, H.; Ryu, S.; Woo, S. Enhancing the Ground Truth Disparity by MAP Estimation for Developing a Neural-Net Based Stereoscopic Camera. Sensors 2024, 24, 7761. https://doi.org/10.3390/s24237761