Abstract
Current approaches to semantic image and scene understanding typically employ rather simple object representations such as 2D or 3D bounding boxes. While such coarse models are robust and allow for reliable object detection, they discard much of the information about objects’ 3D shape and pose, and thus do not lend themselves well to higher-level reasoning. Here, we propose to base scene understanding on a high-resolution object representation. An object class—in our case cars—is modeled as a deformable 3D wireframe, which enables fine-grained modeling at the level of individual vertices and faces. We augment that model to explicitly include vertex-level occlusion, and embed all instances in a common coordinate frame, in order to infer and exploit object-object interactions. Specifically, from a single view we jointly estimate the shapes and poses of multiple objects in a common 3D frame. A ground plane in that frame is estimated by consensus among different objects, which significantly stabilizes monocular 3D pose estimation. The fine-grained model, in conjunction with the explicit 3D scene model, further allows one to infer part-level occlusions between the modeled objects, as well as occlusions by other, unmodeled scene elements. To demonstrate the benefits of such detailed object class models in the context of scene understanding we systematically evaluate our approach on the challenging KITTI street scene dataset. The experiments show that the model’s ability to utilize image evidence at the level of individual parts improves monocular 3D pose estimation w.r.t. both location and (continuous) viewpoint.
Notes
While in the earlier work they were scaled to the same size, so as to keep the deformations from the mean shape small.
In practice this amounts to a look-up in the precomputed response maps.
Note, there is no 3D counterpart to this part-level evaluation, since we see no way to obtain sufficiently accurate 3D part annotations.
Acknowledgments
This work has been supported by the Max Planck Center for Visual Computing & Communication.
Additional information
Communicated by Derek Hoiem, James Hays, Jianxiong Xiao and Aditya Khosla.
Cite this article
Zia, M. Z., Stark, M., & Schindler, K. Towards Scene Understanding with Detailed 3D Object Representations. Int J Comput Vis 112, 188–203 (2015). https://doi.org/10.1007/s11263-014-0780-y