Industrial object detection with multi-modal SSD: closing the gap between synthetic and real images

Published in Multimedia Tools and Applications

Abstract

Object detection for industrial applications faces challenges that state-of-the-art deep learning models have yet to solve. Training data is usually scarce, and the common workaround of using a synthetic dataset introduces a domain gap when the model is presented with real images. Moreover, few architectures fit in the small memory of a mobile device and run in real time under limited computational capabilities. The models that do fulfill these requirements generally have low learning capacity, and the domain gap degrades their performance further. In this work, we propose multiple strategies to reduce the domain gap when using RGB-D images and to increase the overall performance of a Convolutional Neural Network (CNN) for object detection, with a reasonable increase in model size. First, we propose a new architecture based on the Single Shot Detector (SSD), and we compare different fusion methods that increase performance with few or no additional parameters. We applied the proposed method to three synthetic datasets with different visual characteristics, and we show that classical image processing significantly reduces the domain gap for depth maps. Our experiments show an improvement when fusing RGB and depth images on two benchmark datasets, even when the depth maps contain little discriminative information. Our RGB-D SSD Lite model performs on par with or better than a ResNet-FPN RetinaNet model on the LINEMOD and T-LESS datasets while requiring 20 times less computation. Finally, we provide some insights into training a robust model that maintains performance when one of the modalities is missing.
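To make the fusion and robustness ideas above concrete, the following is a minimal PyTorch-style sketch rather than the authors' implementation: the FusionBlock module, the modality_dropout helper, and all shapes, channel counts, and drop rates are illustrative assumptions.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuses RGB and depth feature maps of identical shape.

    "add" introduces no extra parameters; "concat" adds a single 1x1
    projection so the fused tensor keeps the channel count expected by
    the SSD detection heads.
    """
    def __init__(self, channels: int, mode: str = "add"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        if self.mode == "add":
            return rgb_feat + depth_feat  # element-wise, parameter-free
        return self.project(torch.cat([rgb_feat, depth_feat], dim=1))

def modality_dropout(rgb, depth, p_drop=0.2, training=True):
    """Randomly zeroes one modality during training so the detector
    remains usable when RGB or depth is missing at inference time."""
    if training and torch.rand(1).item() < p_drop:
        if torch.rand(1).item() < 0.5:
            rgb = torch.zeros_like(rgb)
        else:
            depth = torch.zeros_like(depth)
    return rgb, depth

# Usage: fuse same-stage backbone features before the detection heads.
rgb_feat = torch.randn(1, 96, 20, 20)    # e.g. one MobileNet stage output
depth_feat = torch.randn(1, 96, 20, 20)  # same shape, from the depth branch
rgb_feat, depth_feat = modality_dropout(rgb_feat, depth_feat)
fused = FusionBlock(96, mode="add")(rgb_feat, depth_feat)

Likewise, the "classical image processing" applied to depth maps can be pictured as hole filling plus smoothing, making rendered and captured depth maps look more alike. The kernel sizes and the zero-means-invalid convention below are assumptions, not the paper's exact pipeline.

import cv2
import numpy as np

def preprocess_depth(depth: np.ndarray) -> np.ndarray:
    """depth: float32 map where 0 marks pixels the sensor failed to measure."""
    kernel = np.ones((5, 5), np.uint8)
    # Morphological closing propagates valid neighbours into small zero-holes.
    filled = cv2.morphologyEx(depth, cv2.MORPH_CLOSE, kernel)
    # A light blur suppresses high-frequency sensor noise absent from renders.
    return cv2.GaussianBlur(filled, (5, 5), 0)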



Data Availability

The datasets used in the present work are partly public and partly private.

1. The T-LESS and LINEMOD datasets are both public and are available from the respective authors’ websites.

2. The BusSeat dataset is private. It cannot be shared openly with the community, in order to protect the intellectual property of our industrial partners’ product.


Acknowledgements

This work was supported by CIFRE grant no. 2018/0872 from the ANRT.

Author information

Corresponding author

Correspondence to Carlos Crispim-Junior.

Ethics declarations

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Cohen, J., Crispim-Junior, C., Chiappa, JM. et al. Industrial object detection with multi-modal SSD: closing the gap between synthetic and real images. Multimed Tools Appl 83, 12111–12138 (2024). https://doi.org/10.1007/s11042-023-15367-0
