Abstract
The success of deep learning in computer vision is based on the availability of large annotated datasets. To lower the need for hand-labeled images, virtually rendered 3D worlds have recently gained popularity. Unfortunately, creating realistic 3D content is challenging on its own and requires significant human effort. In this work, we propose an alternative paradigm which combines real and synthetic data for learning semantic instance segmentation and object detection models. Exploiting the fact that not all aspects of the scene are equally important for this task, we propose to augment real-world imagery with virtual objects of the target category. Capturing real-world images at large scale is easy and cheap, and directly provides real background appearances without the need for creating complex 3D models of the environment. We present an efficient procedure to augment these images with virtual objects. In contrast to modeling complete 3D environments, our data augmentation approach requires only a few user interactions in combination with 3D models of the target object category. Leveraging our approach, we introduce a novel dataset of augmented urban driving scenes with 360 degree images that are used as environment maps to create realistic lighting and reflections on rendered objects. We analyze the significance of realistic object placement by comparing manual placement by humans to automatic methods based on semantic scene analysis. This allows us to create composite images which exhibit both realistic background appearance and a large number of complex object arrangements. Through an extensive set of experiments, we determine the set of augmentation parameters that maximally enhances the performance of instance segmentation models. Further, we demonstrate the utility of the proposed approach for training standard deep models for semantic instance segmentation and object detection of cars in outdoor driving scenarios. We test the models trained on our augmented data on the KITTI 2015 dataset, which we have annotated with pixel-accurate ground truth, and on the Cityscapes dataset. Our experiments demonstrate that models trained on augmented imagery generalize better than those trained on fully synthetic data or on limited amounts of annotated real data.
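To make the augmentation idea concrete, the following is a minimal sketch (in Python, using NumPy and Pillow) of compositing a pre-rendered virtual car with an alpha channel onto a real background photograph, and of deriving a pixel-accurate instance mask from that alpha channel. The file names, the placement offset, and the libraries are illustrative assumptions, not the authors' implementation; in the paper, objects are rendered with environment-map lighting from the 360 degree captures and placed either manually or via semantic scene analysis.

    # A minimal sketch of the augmentation step: alpha-composite a pre-rendered
    # virtual car onto a real background photograph and derive a pixel-accurate
    # instance mask from the object's alpha channel. File names and the placement
    # offset are hypothetical.
    import numpy as np
    from PIL import Image

    background = np.asarray(Image.open("background.jpg").convert("RGB"), dtype=np.float32)
    rendered = np.asarray(Image.open("rendered_car.png").convert("RGBA"), dtype=np.float32)

    rgb, alpha = rendered[..., :3], rendered[..., 3:] / 255.0  # alpha in [0, 1]
    h, w = rgb.shape[:2]

    # Hypothetical placement (top-left corner of the rendered patch); assumes the
    # patch fits entirely inside the background image.
    y0, x0 = 400, 900
    region = background[y0:y0 + h, x0:x0 + w]
    background[y0:y0 + h, x0:x0 + w] = alpha * rgb + (1.0 - alpha) * region

    # The alpha channel directly provides the instance label of the inserted car
    # (thresholded to drop soft anti-aliased edges).
    instance_mask = np.zeros(background.shape[:2], dtype=np.uint8)
    instance_mask[y0:y0 + h, x0:x0 + w] = (alpha[..., 0] > 0.5).astype(np.uint8)

    Image.fromarray(background.astype(np.uint8)).save("augmented.jpg")
    Image.fromarray(instance_mask * 255).save("augmented_instance_mask.png")

Because the alpha mask of each rendered object is known exactly, instance segmentation and detection labels for the inserted cars come essentially for free, which is what makes this form of augmentation cheap compared to modeling complete 3D environments.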
References
Blender Online Community. (2006). Blender: a 3D modelling and rendering package. Amsterdam: Blender Foundation, Blender Institute. http://www.blender.org. Accessed 01 May 2017.
Brostow, G. J., Fauqueur, J., & Cipolla, R. (2009). Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2), 88–97.
Chen, W., Wang, H., Li, Y., Su, H., Wang, Z., Tu, C., et al. (2016). Synthesizing training images for boosting human 3D pose estimation. In Proceedings of the international conference on 3D vision (3DV).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., et al. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
de Souza, C. R., Gaidon, A., Cabon, Y., & Peña, A. M. L. (2016). Procedural generation of videos to train deep action recognition networks. arXiv:1612.00881.
Dosovitskiy, A., Fischer, P., Ilg, E., Haeusser, P., Hazirbas, C., Golkov, V., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision (ICCV).
Gaidon, A., Wang, Q., Cabon, Y., & Vig, E. (2016). Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 32(11), 1231–1237.
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., & Cipolla, R. (2016). Understanding real world indoor scenes with synthetic data. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Hattori, H., Boddeti, V. N., Kitani, K. M., & Kanade, T. (2015). Learning scene-specific pedestrian detectors without real data. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Jakob, W. (2010). Mitsuba renderer. http://www.mitsuba-renderer.org. Accessed 01 May 2017.
Kronander, J., Banterle, F., Gardner, A., Miandji, E., & Unger, J. (2015). Photorealistic rendering of mixed reality scenes. Computer Graphics Forum, 34(2), 643–665. https://doi.org/10.1111/cgf.12591.
Menze, M., & Geiger, A. (2015). Object scene flow for autonomous vehicles. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Movshovitz-Attias, Y., Kanade, T., & Sheikh, Y. (2016). How useful is photo-realistic rendering for visual learning? In Proceedings of the European conference on computer vision (ECCV) workshops.
Peng, X., Sun, B., Ali, K., & Saenko, K. (2015). Learning deep object detectors from 3D models. In Proceedings of the IEEE international conference on computer vision (ICCV).
Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Pishchulin, L., Jain, A., Wojek, C., Andriluka, M., Thormählen, T., & Schiele, B. (2011). Learning people detection models from few training samples. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 28, pp. 91–99). Red Hook, NY: Curran Associates Inc.
Richter, S. R., Vineet, V., Roth, S., & Koltun, V. (2016). Playing for data: Ground truth from computer games. In Proceedings of the European conference on computer vision (ECCV).
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., & Lopez, A. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Rozantsev, A., Lepetit, V., & Fua, P. (2015). On rendering synthetic images for training an object detector. Computer Vision and Image Understanding (CVIU), 137, 24–37.
Shafaei, A., Little, J. J., & Schmidt, M. (2016). Play and learn: Using video games to train computer vision models. arXiv:1608.01745.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the international conference on learning representations (ICLR).
Stark, M., Goesele, M., & Schiele, B. (2010). Back to the future: Learning shape models from 3D CAD data. In Proceedings of the British machine vision conference (BMVC).
Su, H., Qi, C. R., Li, Y., & Guibas, L. J. (2015). Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In Proceedings of the IEEE international conference on computer vision (ICCV).
Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Teichmann, M., Weber, M., Zöllner, J. M., Cipolla, R., & Urtasun, R. (2016). MultiNet: Real-time joint semantic reasoning for autonomous driving. arXiv:1612.07695.
Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I. et al. (2017). Learning from synthetic humans. arXiv:1701.01370.
Xie, J., Kiefel, M., Sun, M. T., & Geiger, A. (2016). Semantic instance annotation of street scenes by 3D to 2D label transfer. In Proceedings of IEEE conference on computer vision and pattern recognition (CVPR).
Zhang, Y., Qiu, W., Chen, Q., Hu, X., & Yuille, A. L. (2016a). UnrealStereo: A synthetic dataset for analyzing stereo vision. arXiv:1612.04647.
Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J., Jin, H., et al. (2016b). Physically-based rendering for indoor scene understanding using convolutional neural networks. arXiv:1612.07429.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., et al. (2016). Target-driven visual navigation in indoor scenes using deep reinforcement learning. arXiv:1609.05143.
Additional information
Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.
Cite this article
Abu Alhaija, H., Mustikovela, S.K., Mescheder, L. et al. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. Int J Comput Vis 126, 961–972 (2018). https://doi.org/10.1007/s11263-018-1070-x