
Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors

Published in: International Journal of Computer Vision

Abstract

This paper describes a system for interpreting a scene by assigning a semantic label at every pixel and inferring the spatial extent of individual object instances together with their occlusion relationships. First we present a method for labeling each pixel aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. This method combines region-level features with per-exemplar sliding window detectors. Unlike traditional bounding box detectors, per-exemplar detectors perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. Next, we use per-exemplar detections to generate a set of candidate object masks for a given test image. We then select a subset of objects that explain the image well and have valid overlap relationships and occlusion ordering. This is done by minimizing an integer quadratic program either using a greedy method or a standard solver. We alternate between using the object predictions to refine the pixel labels and using the pixel labels to improve the object predictions. The proposed system obtains promising results on two challenging subsets of the LabelMe dataset, the largest of which contains 45,676 images and 232 classes.
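
As a concrete illustration of the selection step, here is a minimal sketch, not the paper's exact energy: assume each candidate object mask carries a support score in a vector scores, and a symmetric matrix Q with zero diagonal penalizes selecting pairs of candidates whose overlap or occlusion relationships are invalid. The greedy method then repeatedly switches on whichever candidate most decreases the quadratic objective. The helper names in the trailing comment (initial_pixel_labeling, candidate_objects, relabel_pixels) are hypothetical stand-ins for the parsing and relabeling stages described above.

    import numpy as np

    def greedy_instance_selection(scores, Q, max_iter=None):
        """Greedily minimize E(x) = -scores . x + x^T Q x over binary x.

        scores : (n,) per-candidate support (higher = better candidate)
        Q      : (n, n) symmetric penalty for selecting pairs of candidates
                 with invalid overlap/occlusion relationships (zero diagonal)
        Returns a boolean mask over the n candidates.
        """
        scores = np.asarray(scores, dtype=float)
        n = scores.size
        x = np.zeros(n, dtype=bool)
        for _ in range(max_iter or n):
            # Energy change from switching each unselected candidate on:
            #   delta_i = -scores[i] + 2 * sum_j Q[i, j] * x[j]
            delta = -scores + 2.0 * (Q @ x)
            delta[x] = np.inf            # skip already-selected candidates
            best = int(np.argmin(delta))
            if delta[best] >= 0:         # no candidate lowers the energy
                break
            x[best] = True
        return x

    # Alternation sketch with hypothetical helper names:
    #   labels = initial_pixel_labeling(image)
    #   for _ in range(num_rounds):
    #       masks, scores, Q = candidate_objects(image, labels)
    #       chosen = greedy_instance_selection(scores, Q)
    #       labels = relabel_pixels(image, masks, chosen)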


Notes

  1. Technically, headlights are attached to cars, but we do not make a distinction between attachment and occlusion in this work.

    Fig. 2: Overview of our instance inference approach (see Sect. 4 for details). We use our region- and detector-based image parser (Fig. 1) to generate semantic labels for each pixel (a) and a set of candidate object masks (not shown). Next, we select a subset of these masks to cover the image (b). We alternate between refining the pixel labels and the object predictions until we obtain the final pixel labeling (c) and object predictions (d). On this image, our initial pixel labeling contains several “car” blobs, some of them representing multiple cars, but the object predictions separate these blobs into individual car instances. We also infer an occlusion ordering (e), which places the road behind the cars and puts the three nearly overlapping cars on the left side in the correct depth order. Note that our instance-level inference formulation does not require the image to be completely covered: while our pixel labeling erroneously infers two large “building” areas in the mid-section of the image, these labels do not have enough confidence, so no corresponding “building” object instances are selected.

  2. To determine depth ordering from polygon annotations, we use the LMsortlayers function from the LabelMe toolbox, which takes a collection of polygons and returns their depth ordering.
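
LMsortlayers itself is a MATLAB function shipped with the LabelMe toolbox; purely as an illustration of the kind of pairwise reasoning such a routine performs, here is a small Python stand-in. The rule used below (of two overlapping polygons, the one with the larger fraction of its own area inside the overlap is treated as the occluder) is an assumption made for this sketch, not the toolbox's actual algorithm.

    from shapely.geometry import Polygon

    def depth_order(polygon_coords):
        """Toy stand-in for LabelMe's LMsortlayers: return a candidate
        front-to-back ordering of the input polygons (as indices).

        Assumed pairwise rule (NOT the toolbox's actual algorithm): of two
        overlapping polygons, the one with the larger fraction of its own
        area inside the overlap is the occluder, since a fully visible
        object (e.g. a car on the road) often lies entirely inside the
        region it occludes.
        """
        polys = [Polygon(c) for c in polygon_coords]
        n = len(polys)
        wins = [0] * n  # overlapping neighbors each polygon occludes
        for i in range(n):
            for j in range(i + 1, n):
                inter = polys[i].intersection(polys[j]).area
                if inter <= 0:
                    continue  # disjoint polygons give no ordering signal
                if inter / polys[i].area >= inter / polys[j].area:
                    wins[i] += 1
                else:
                    wins[j] += 1
        # More wins = occludes more neighbors = nearer the camera.
        return sorted(range(n), key=lambda i: -wins[i])

    # Example: a car polygon drawn entirely on top of a road polygon.
    road = [(0, 50), (200, 50), (200, 100), (0, 100)]
    car = [(40, 60), (90, 60), (90, 95), (40, 95)]
    print(depth_order([road, car]))  # -> [1, 0]: the car occludes the road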

References

  • Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In Human Vision and Electronic Imaging, pp. 1–12.

  • Boykov, Y., & Kolmogorov, V. (2003). Computing geodesics and minimal surfaces via graph cuts. In International Conference on Computer Vision (ICCV), Nice, France.

  • Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI, 26(9), 1124–1137.

  • Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision (ECCV), Marseille, France.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA.

  • Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., & Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR.

  • Eigen, D., & Fergus, R. (2012). Nonparametric image parsing using adaptive neighbor sets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html

  • Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. In International Conference on Machine Learning (ICML), Edinburgh, Scotland.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. PAMI, 32(9), 1627–1645.

  • Floros, G., Rematas, K., & Leibe, B. (2011). Multi-class image labeling with top-down segmentation and generalized robust P^N potentials. In British Machine Vision Conference (BMVC), Dundee, UK.

  • Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In International Conference on Computer Vision (ICCV), Kyoto, Japan.

  • Guo, R., & Hoiem, D. (2012). Beyond the line of sight: Labeling the underlying surfaces. In European Conference on Computer Vision (ECCV), Florence, Italy.

  • Hariharan, B., Malik, J., & Ramanan, D. (2012). Discriminative decorrelation for clustering and classification. In European Conference on Computer Vision (ECCV), Florence, Italy.

  • Heitz, G., & Koller, D. (2008). Learning spatial context: Using stuff to find things. In European Conference on Computer Vision (ECCV), Marseille, France, pp. 30–43.

  • IBM. (2013). CPLEX Optimizer. http://www.ibm.com/software/commerce/optimization/cplex-optimizer/.

  • Isola, P., & Liu, C. (2013). Scene collaging: Analysis and synthesis of natural images with semantic layers. In IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.

  • Kim, B., Sun, M., Kohli, P., & Savarese, S. (2012). Relating things and stuff by high-order potential modeling. In ECCV Workshop on Higher-Order Models and Global Constraints in Computer Vision.

  • Kim, J., & Grauman, K. (2012). Shape sharing for object segmentation. In European Conference on Computer Vision (ECCV), Florence, Italy.

  • Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? PAMI, 26(2), 147–159.

  • Krahenbuhl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In Annual Conference on Neural Information Processing Systems (NIPS).

  • Ladický, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. H. (2010). What, where & how many? Combining object detectors and CRFs. In European Conference on Computer Vision (ECCV), Heraklion, Greece.

  • Liu, C., Yuen, J., & Torralba, A. (2011). Nonparametric scene parsing via label transfer. PAMI, 33(12), 2368–2382.

  • Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In International Conference on Computer Vision (ICCV), Barcelona, Spain.

  • Myeong, H., Chang, J. Y., & Lee, K. M. (2012). Learning object relationships via graph-based context model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In Annual Conference on Neural Information Processing Systems (NIPS), Vancouver.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). “GrabCut”: Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH, Los Angeles, CA.

  • Russell, B. C., & Torralba, A. (2009). Building a database of 3D scenes from user annotations. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL.

  • Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. IJCV, 77(1–3), 157–173.

  • Shotton, J., Winn, J. M., Rother, C., & Criminisi, A. (2009). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2–23.

  • Sturgess, P., Alahari, K., Ladický, L., & Torr, P. H. S. (2009). Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conference (BMVC), London, UK.

  • Tighe, J., & Lazebnik, S. (2013). Finding things: Image parsing with regions and per-exemplar detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR.

  • Tighe, J., & Lazebnik, S. (2013). SuperParsing: Scalable nonparametric image parsing with superpixels. IJCV, 101(2), 329–349.

  • Tighe, J., Niethammer, M., & Lazebnik, S. (2014). Scene parsing with object instances and occlusion ordering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH.

  • Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. C. (2005). Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2), 113–140.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA.

  • Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. (2012). Layered object models for image segmentation. PAMI, 34(9), 1731–1743.

  • Yao, J., Fidler, S., & Urtasun, R. (2012). Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.

  • Zhang, C., Wang, L., & Yang, R. (2010). Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision (ECCV), Heraklion, Greece.

Acknowledgments

This research was supported in part by NSF grants IIS 1228082 and CIF 1302438, DARPA Computer Science Study Group (D12AP00305), Microsoft Research Faculty Fellowship, Sloan Foundation, and Xerox. We thank Arun Mallya for helping to adapt the LDA detector code of Hariharan et al. (2012).

Author information

Corresponding author

Correspondence to Joseph Tighe.

Additional information

Communicated by Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla.

About this article

Cite this article

Tighe, J., Niethammer, M. & Lazebnik, S. Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors. Int J Comput Vis 112, 150–171 (2015). https://doi.org/10.1007/s11263-014-0778-5
