Abstract
Current approaches to semantic image and scene understanding typically employ rather simple object representations such as 2D or 3D bounding boxes. While such coarse models are robust and allow for reliable object detection, they discard much of the information about objects’ 3D shape and pose, and thus do not lend themselves well to higher-level reasoning. Here, we propose to base scene understanding on a high-resolution object representation. An object class—in our case cars—is modeled as a deformable 3D wireframe, which enables fine-grained modeling at the level of individual vertices and faces. We augment that model to explicitly include vertex-level occlusion, and embed all instances in a common coordinate frame, in order to infer and exploit object-object interactions. Specifically, from a single view we jointly estimate the shapes and poses of multiple objects in a common 3D frame. A ground plane in that frame is estimated by consensus among different objects, which significantly stabilizes monocular 3D pose estimation. The fine-grained model, in conjunction with the explicit 3D scene model, further allows one to infer part-level occlusions between the modeled objects, as well as occlusions by other, unmodeled scene elements. To demonstrate the benefits of such detailed object class models in the context of scene understanding we systematically evaluate our approach on the challenging KITTI street scene dataset. The experiments show that the model’s ability to utilize image evidence at the level of individual parts improves monocular 3D pose estimation w.r.t. both location and (continuous) viewpoint.
Notes
While in the earlier work they were scaled to the same size, so as to keep the deformations from the mean shape small.
In practice this amounts to a look-up in the precomputed response maps.
Note, there is no 3D counterpart to this part-level evaluation, since we see no way to obtain sufficiently accurate 3D part annotations.
Acknowledgments
This work has been supported by the Max Planck Center for Visual Computing & Communication.
Additional information
Communicated by Derek Hoiem, James Hays, Jianxiong Xiao and Aditya Khosla.
Cite this article
Zia, M. Z., Stark, M., & Schindler, K. Towards Scene Understanding with Detailed 3D Object Representations. Int J Comput Vis 112, 188–203 (2015). https://doi.org/10.1007/s11263-014-0780-y