Abstract
The success of deep learning in computer vision is based on the availability of large annotated datasets. To lower the need for hand-labeled images, virtually rendered 3D worlds have recently gained popularity. Unfortunately, creating realistic 3D content is challenging on its own and requires significant human effort. In this work, we propose an alternative paradigm which combines real and synthetic data for learning semantic instance segmentation and object detection models. Exploiting the fact that not all aspects of the scene are equally important for this task, we propose to augment real-world imagery with virtual objects of the target category. Capturing real-world images at large scale is easy and cheap, and directly provides real background appearances without the need for creating complex 3D models of the environment. We present an efficient procedure to augment these images with virtual objects. In contrast to modeling complete 3D environments, our data augmentation approach requires only a few user interactions in combination with 3D models of the target object category. Leveraging our approach, we introduce a novel dataset of augmented urban driving scenes with 360 degree images that are used as environment maps to create realistic lighting and reflections on rendered objects. We analyze the significance of realistic object placement by comparing manual placement by humans to automatic methods based on semantic scene analysis. This allows us to create composite images which exhibit both realistic background appearance and a large number of complex object arrangements. Through an extensive set of experiments, we determine the set of augmentation parameters that maximally enhances the performance of instance segmentation models. Further, we demonstrate the utility of the proposed approach for training standard deep models for semantic instance segmentation and object detection of cars in outdoor driving scenarios. We test the models trained on our augmented data on the KITTI 2015 dataset, which we have annotated with pixel-accurate ground truth, and on the Cityscapes dataset. Our experiments demonstrate that models trained on augmented imagery generalize better than those trained on fully synthetic data or on limited amounts of annotated real data.
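To make the augmentation idea concrete, the following is a minimal sketch (in Python, using NumPy and Pillow) of compositing a pre-rendered virtual car with an alpha channel onto a real background photograph, and of deriving a pixel-accurate instance mask from that alpha channel. The file names, the placement offset, and the libraries are illustrative assumptions, not the authors' implementation; in the paper, objects are rendered with environment-map lighting from the 360 degree captures and placed either manually or via semantic scene analysis.

    # A minimal sketch of the augmentation step: alpha-composite a pre-rendered
    # virtual car onto a real background photograph and derive a pixel-accurate
    # instance mask from the object's alpha channel. File names and the placement
    # offset are hypothetical.
    import numpy as np
    from PIL import Image

    background = np.asarray(Image.open("background.jpg").convert("RGB"), dtype=np.float32)
    rendered = np.asarray(Image.open("rendered_car.png").convert("RGBA"), dtype=np.float32)

    rgb, alpha = rendered[..., :3], rendered[..., 3:] / 255.0  # alpha in [0, 1]
    h, w = rgb.shape[:2]

    # Hypothetical placement (top-left corner of the rendered patch); assumes the
    # patch fits entirely inside the background image.
    y0, x0 = 400, 900
    region = background[y0:y0 + h, x0:x0 + w]
    background[y0:y0 + h, x0:x0 + w] = alpha * rgb + (1.0 - alpha) * region

    # The alpha channel directly provides the instance label of the inserted car
    # (thresholded to drop soft anti-aliased edges).
    instance_mask = np.zeros(background.shape[:2], dtype=np.uint8)
    instance_mask[y0:y0 + h, x0:x0 + w] = (alpha[..., 0] > 0.5).astype(np.uint8)

    Image.fromarray(background.astype(np.uint8)).save("augmented.jpg")
    Image.fromarray(instance_mask * 255).save("augmented_instance_mask.png")

Because the alpha mask of each rendered object is known exactly, instance segmentation and detection labels for the inserted cars come essentially for free, which is what makes this form of augmentation cheap compared to modeling complete 3D environments.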
References
Blender Online Community. (2006). Blender: a 3D modelling and rendering package. Amsterdam: Blender Foundation, Blender Institute. http://www.blender.org. Accessed 01 May 2017.
Brostow, G. J., Fauqueur, J., & Cipolla, R. (2009). Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2), 88–97.
Chen, W., Wang, H., Li, Y., Su, H., Wang, Z., Tu, C., et al. (2016). Synthesizing training images for boosting human 3D pose estimation. In Proceedings of the international conference on 3D vision (3DV).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., et al. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
de Souza, C. R., Gaidon, A., Cabon, Y., & Peña, A. M. L. (2016). Procedural generation of videos to train deep action recognition networks. arXiv:1612.00881.
Dosovitskiy, A., Fischer, P., Ilg, E., Haeusser, P., Hazirbas, C., Golkov, V., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision (ICCV).
Gaidon, A., Wang, Q., Cabon, Y., & Vig, E. (2016). Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 32(11), 1231–1237.
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., & Cipolla, R. (2016). Understanding real world indoor scenes with synthetic data. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Hattori, H., Boddeti, V. N., Kitani, K. M., & Kanade, T. (2015). Learning scene-specific pedestrian detectors without real data. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Jakob, W. (2010). Mitsuba renderer. http://www.mitsuba-renderer.org. Accessed 01 May 2017.
Kronander, J., Banterle, F., Gardner, A., Miandji, E., & Unger, J. (2015). Photorealistic rendering of mixed reality scenes. Computer Graphics Forum, 34(2), 643–665. https://doi.org/10.1111/cgf.12591.
Menze, M., & Geiger, A. (2015). Object scene flow for autonomous vehicles. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Movshovitz-Attias, Y., Kanade, T., & Sheikh, Y. (2016). How useful is photo-realistic rendering for visual learning? In Proceedings of the European conference on computer vision (ECCV) workshops.
Peng, X., Sun, B., Ali, K., & Saenko, K. (2015). Learning deep object detectors from 3D models. In Proceedings of the IEEE international conference on computer vision (ICCV).
Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Pishchulin, L., Jain, A., Wojek, C., Andriluka, M., Thormählen, T., & Schiele, B. (2011). Learning people detection models from few training samples. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 28, pp. 91–99). Red Hook, NY: Curran Associates Inc.
Richter, S. R., Vineet, V., Roth, S., & Koltun, V. (2016). Playing for data: Ground truth from computer games. In Proceedings of the European conference on computer vision (ECCV).
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., & Lopez, A. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Rozantsev, A., Lepetit, V., & Fua, P. (2015). On rendering synthetic images for training an object detector. Computer Vision and Image Understanding (CVIU), 137, 24–37.
Shafaei, A., Little, J. J., & Schmidt, M. (2016). Play and learn: Using video games to train computer vision models. arXiv:1608.01745.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the international conference on learning representations (ICLR).
Stark, M., Goesele, M., & Schiele, B. (2010). Back to the future: Learning shape models from 3D CAD data. In Proceedings of the British machine vision conference (BMVC).
Su, H., Qi, C. R., Li, Y., & Guibas, L. J. (2015). Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In Proceedings of the IEEE international conference on computer vision (ICCV).
Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Teichmann, M., Weber, M., Zöllner, J. M., Cipolla, R., & Urtasun, R. (2016). MultiNet: Real-time joint semantic reasoning for autonomous driving. arXiv:1612.07695.
Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I. et al. (2017). Learning from synthetic humans. arXiv:1701.01370.
Xie, J., Kiefel, M., Sun, M. T., & Geiger, A. (2016). Semantic instance annotation of street scenes by 3D to 2D label transfer. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Zhang, Y., Qiu, W., Chen, Q., Hu, X., & Yuille, A. L. (2016a). UnrealStereo: A synthetic dataset for analyzing stereo vision. arXiv:1612.04647.
Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J., Jin, H., et al. (2016b). Physically-based rendering for indoor scene understanding using convolutional neural networks. arXiv:1612.07429.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., et al. (2016). Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv:1609.05143.
Additional information
Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.
Cite this article
Abu Alhaija, H., Mustikovela, S.K., Mescheder, L. et al. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. Int J Comput Vis 126, 961–972 (2018). https://doi.org/10.1007/s11263-018-1070-x