Abstract
Object detection for industrial applications faces challenges that state-of-the-art deep learning models have yet to solve. Training data are usually scarce, and the common workaround of using a synthetic dataset introduces a domain gap when the model is presented with real images. Moreover, few architectures fit in the small memory of a mobile device and run in real time with limited computational resources. The models that do fulfill these requirements generally have low learning capacity, and the domain gap reduces their performance even further. In this work, we propose multiple strategies to reduce the domain gap when using RGB-D images and to increase the overall performance of a Convolutional Neural Network (CNN) for object detection with a reasonable increase in model size. First, we propose a new architecture based on the Single Shot Detector (SSD) architecture, and we compare different fusion methods that increase performance with few or no additional parameters. We apply the proposed method to three synthetic datasets with different visual characteristics, and we show that classical image processing significantly reduces the domain gap for depth maps. Our experiments show an improvement when fusing RGB and depth images on two benchmark datasets, even when the depth maps contain little discriminative information. Our RGB-D SSD Lite model performs on par with or better than a ResNet-FPN RetinaNet model on the LINEMOD and T-LESS datasets, while requiring 20 times less computation. Finally, we provide insights on training a robust model that maintains performance when one of the modalities is missing.
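The abstract notes that classical image processing can significantly reduce the depth-map domain gap. As a minimal illustrative sketch only, not the paper's actual pipeline, the idea can be conveyed by filling invalid (zero) depth readings and normalizing the value range so that synthetic and real depth maps look alike to the network; the `max_range_m` parameter and the naive row-wise fill are assumptions made for this example.

```python
import numpy as np

def preprocess_depth(depth, max_range_m=4.0):
    """Hypothetical classical preprocessing for a raw depth map (in meters).

    Fills invalid pixels (zeros) with the nearest valid value to their left
    in each row, then normalizes to [0, 1] so synthetic and real depth maps
    share a common value range.
    """
    out = depth.astype(np.float32).copy()
    # Clip implausible readings to an assumed sensor working range.
    out = np.clip(out, 0.0, max_range_m)
    # Naive hole filling: forward-fill invalid (zero) pixels row by row.
    for row in out:
        valid = row > 0
        if not valid.any():
            continue
        idx = np.where(valid, np.arange(row.size), 0)
        np.maximum.accumulate(idx, out=idx)  # last valid index at or before i
        row[:] = row[idx]
    # Normalize to [0, 1] for a domain-agnostic input range.
    return out / max_range_m
```

Real systems would use a more careful completion method (e.g. morphological operations, as in fast CPU depth completion), but even this crude version shows how cheap, learning-free processing can align the two domains.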
Data Availability
The present work uses both public and private datasets.
1. The T-LESS and LINEMOD datasets are both public and are available from the respective authors’ websites.
2. The BusSeat dataset is private. It cannot be shared openly with the community, in order to protect the intellectual property of our industrial partners' product.
Acknowledgements
This work was supported by grant CIFRE n.2018/0872 from ANRT.
Ethics declarations
Competing interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Cohen, J., Crispim-Junior, C., Chiappa, JM. et al. Industrial object detection with multi-modal SSD: closing the gap between synthetic and real images. Multimed Tools Appl 83, 12111–12138 (2024). https://doi.org/10.1007/s11042-023-15367-0