
Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors

Published in: International Journal of Computer Vision

Abstract

This paper describes a system for interpreting a scene by assigning a semantic label at every pixel and inferring the spatial extent of individual object instances together with their occlusion relationships. First we present a method for labeling each pixel aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. This method combines region-level features with per-exemplar sliding window detectors. Unlike traditional bounding box detectors, per-exemplar detectors perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. Next, we use per-exemplar detections to generate a set of candidate object masks for a given test image. We then select a subset of objects that explain the image well and have valid overlap relationships and occlusion ordering. This is done by minimizing an integer quadratic program either using a greedy method or a standard solver. We alternate between using the object predictions to refine the pixel labels and using the pixel labels to improve the object predictions. The proposed system obtains promising results on two challenging subsets of the LabelMe dataset, the largest of which contains 45,676 images and 232 classes.
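
As a concrete illustration of the selection step, here is a minimal sketch, not the paper's exact energy: assume each candidate object mask carries a support score in a vector scores, and a symmetric matrix Q with zero diagonal penalizes selecting pairs of candidates whose overlap or occlusion relationships are invalid. The greedy method then repeatedly switches on whichever candidate most decreases the quadratic objective. The helper names in the trailing comment (initial_pixel_labeling, candidate_objects, relabel_pixels) are hypothetical stand-ins for the parsing and relabeling stages described above.

    import numpy as np

    def greedy_instance_selection(scores, Q, max_iter=None):
        """Greedily minimize E(x) = -scores . x + x^T Q x over binary x.

        scores : (n,) per-candidate support (higher = better candidate)
        Q      : (n, n) symmetric penalty for selecting pairs of candidates
                 with invalid overlap/occlusion relationships (zero diagonal)
        Returns a boolean mask over the n candidates.
        """
        scores = np.asarray(scores, dtype=float)
        n = scores.size
        x = np.zeros(n, dtype=bool)
        for _ in range(max_iter or n):
            # Energy change from switching each unselected candidate on:
            #   delta_i = -scores[i] + 2 * sum_j Q[i, j] * x[j]
            delta = -scores + 2.0 * (Q @ x)
            delta[x] = np.inf            # skip already-selected candidates
            best = int(np.argmin(delta))
            if delta[best] >= 0:         # no candidate lowers the energy
                break
            x[best] = True
        return x

    # Alternation sketch with hypothetical helper names:
    #   labels = initial_pixel_labeling(image)
    #   for _ in range(num_rounds):
    #       masks, scores, Q = candidate_objects(image, labels)
    #       chosen = greedy_instance_selection(scores, Q)
    #       labels = relabel_pixels(image, masks, chosen)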


Notes

  1. Technically, headlights are attached to cars, but we do not make a distinction between attachment and occlusion in this work.

    Fig. 2: Overview of our instance inference approach (see Sect. 4 for details). We use our region- and detector-based image parser (Fig. 1) to generate semantic labels for each pixel (a) and a set of candidate object masks (not shown). Next, we select a subset of these masks to cover the image (b). We alternate between refining the pixel labels and the object predictions until we obtain the final pixel labeling (c) and object predictions (d). On this image, our initial pixel labeling contains several “car” blobs, some of them representing multiple cars, but the object predictions separate these blobs into individual car instances. We also infer an occlusion ordering (e), which places the road behind the cars and puts the three nearly overlapping cars on the left side in the correct depth order. Note that our instance-level inference formulation does not require the image to be completely covered: while our pixel labeling erroneously infers two large “building” areas in the mid-section of the image, these labels do not have enough confidence, so no corresponding “building” object instances are selected.

  2. To determine depth ordering from polygon annotations, we use the LMsortlayers function from the LabelMe toolbox, which takes a collection of polygons and returns their depth ordering.
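
LMsortlayers itself is a MATLAB function shipped with the LabelMe toolbox; purely as an illustration of the kind of pairwise reasoning such a routine performs, here is a small Python stand-in. The rule used below (of two overlapping polygons, the one with the larger fraction of its own area inside the overlap is treated as the occluder) is an assumption made for this sketch, not the toolbox's actual algorithm.

    from shapely.geometry import Polygon

    def depth_order(polygon_coords):
        """Toy stand-in for LabelMe's LMsortlayers: return a candidate
        front-to-back ordering of the input polygons (as indices).

        Assumed pairwise rule (NOT the toolbox's actual algorithm): of two
        overlapping polygons, the one with the larger fraction of its own
        area inside the overlap is the occluder, since a fully visible
        object (e.g. a car on the road) often lies entirely inside the
        region it occludes.
        """
        polys = [Polygon(c) for c in polygon_coords]
        n = len(polys)
        wins = [0] * n  # overlapping neighbors each polygon occludes
        for i in range(n):
            for j in range(i + 1, n):
                inter = polys[i].intersection(polys[j]).area
                if inter <= 0:
                    continue  # disjoint polygons give no ordering signal
                if inter / polys[i].area >= inter / polys[j].area:
                    wins[i] += 1
                else:
                    wins[j] += 1
        # More wins = occludes more neighbors = nearer the camera.
        return sorted(range(n), key=lambda i: -wins[i])

    # Example: a car polygon drawn entirely on top of a road polygon.
    road = [(0, 50), (200, 50), (200, 100), (0, 100)]
    car = [(40, 60), (90, 60), (90, 95), (40, 95)]
    print(depth_order([road, car]))  # -> [1, 0]: the car occludes the road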

References

  • Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In Human Vision and Electronic Imaging, pp. 1–12.

  • Boykov, Y., & Kolmogorov, V. (2003). Computing geodesics and minimal surfaces via graph cuts. In International Conference on Computer Vision (ICCV), Nice, France.

  • Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI, 26(9), 1124–1137.

  • Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision (ECCV), Marseille, France.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA.

  • Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., & Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR.

  • Eigen, D., & Fergus, R. (2012). Nonparametric image parsing using adaptive neighbor sets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html

  • Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. In International Conference on Machine Learning (ICML), Edinburgh, Scotland.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. PAMI, 32(9), 1627–1645.

  • Floros, G., Rematas, K., & Leibe, B. (2011). Multi-class image labeling with top-down segmentation and generalized robust P^N potentials. In British Machine Vision Conference (BMVC), Dundee, UK.

  • Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In International Conference on Computer Vision (ICCV), Kyoto, Japan.

  • Guo, R., & Hoiem, D. (2012). Beyond the line of sight: Labeling the underlying surfaces. In European Conference on Computer Vision (ECCV), Florence, Italy.

  • Hariharan, B., Malik, J., & Ramanan, D. (2012). Discriminative decorrelation for clustering and classification. In European Conference on Computer Vision (ECCV), Florence, Italy.

  • Heitz, G., & Koller, D. (2008). Learning spatial context: Using stuff to find things. In European Conference on Computer Vision (ECCV), Marseille, France, pp. 30–43.

  • IBM. (2013). CPLEX Optimizer. http://www.ibm.com/software/commerce/optimization/cplex-optimizer/.

  • Isola, P., & Liu, C. (2013). Scene collaging: Analysis and synthesis of natural images with semantic layers. In IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.

  • Kim, B., Sun, M., Kohli, P., & Savarese, S. (2012). Relating things and stuff by high-order potential modeling. In ECCV Workshop on Higher-Order Models and Global Constraints in Computer Vision.

  • Kim, J., & Grauman, K. (2012). Shape sharing for object segmentation. In European Conference on Computer Vision (ECCV), Florence, Italy.

  • Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? PAMI, 26(2), 147–159.

  • Krahenbuhl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In Annual Conference on Neural Information Processing Systems (NIPS).

  • Ladický, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. H. (2010). What, where & how many? Combining object detectors and CRFs. In European Conference on Computer Vision (ECCV), Heraklion, Greece.

  • Liu, C., Yuen, J., & Torralba, A. (2011). Nonparametric scene parsing via label transfer. PAMI, 33(12), 2368–2382.

  • Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In International Conference on Computer Vision (ICCV), Barcelona, Spain.

  • Myeong, H., Chang, J. Y., & Lee, K. M. (2012). Learning object relationships via graph-based context model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In Annual Conference on Neural Information Processing Systems (NIPS), Vancouver.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). “GrabCut”: Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH, Los Angeles, CA.

  • Russell, B. C., & Torralba, A. (2009). Building a database of 3D scenes from user annotations. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL.

  • Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. IJCV, 77(1–3), 157–173.

  • Shotton, J., Winn, J. M., Rother, C., & Criminisi, A. (2009). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2–23.

  • Sturgess, P., Alahari, K., Ladický, L., & Torr, P. H. S. (2009). Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conference (BMVC), London, UK.

  • Tighe, J., & Lazebnik, S. (2013). Finding things: Image parsing with regions and per-exemplar detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR.

  • Tighe, J., & Lazebnik, S. (2013). SuperParsing: Scalable nonparametric image parsing with superpixels. IJCV, 101(2), 329–349.

  • Tighe, J., Niethammer, M., & Lazebnik, S. (2014). Scene parsing with object instances and occlusion ordering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH.

  • Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. C. (2005). Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2), 113–140.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA.

  • Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. (2012). Layered object models for image segmentation. PAMI, 34(9), 1731–1743.

  • Yao, J., Fidler, S., & Urtasun, R. (2012). Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.

  • Zhang, C., Wang, L., & Yang, R. (2010). Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision (ECCV), Heraklion, Greece.

Acknowledgments

This research was supported in part by NSF grants IIS 1228082 and CIF 1302438, DARPA Computer Science Study Group (D12AP00305), Microsoft Research Faculty Fellowship, Sloan Foundation, and Xerox. We thank Arun Mallya for helping to adapt the LDA detector code of Hariharan et al. (2012).

Author information

Corresponding author

Correspondence to Joseph Tighe.

Additional information

Communicated by Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla.

About this article

Cite this article

Tighe, J., Niethammer, M. & Lazebnik, S. Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors. Int J Comput Vis 112, 150–171 (2015). https://doi.org/10.1007/s11263-014-0778-5
