Modeling the World from Internet Photo Collections

Noah Snavely¹,
Steven M. Seitz¹ &
Richard Szeliski²

Abstract

There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like “Notre Dame” or “Trevi Fountain.” This approach, which we call Photo Tourism, has enabled reconstructions of numerous well-known world sites. This paper presents these algorithms and results as a first step towards 3D modeling of the world’s well-photographed sites, cities, and landscapes from Internet imagery, and discusses key open problems and challenges for the research community.

Author information

Authors and Affiliations

University of Washington, Seattle, WA, USA
Noah Snavely & Steven M. Seitz
Microsoft Research, Redmond, WA, USA
Richard Szeliski

Corresponding author

Correspondence to Noah Snavely.

About this article

Cite this article

Snavely, N., Seitz, S.M. & Szeliski, R. Modeling the World from Internet Photo Collections. Int J Comput Vis 80, 189–210 (2008). https://doi.org/10.1007/s11263-007-0107-3

Received: 30 January 2007
Accepted: 31 October 2007
Published: 11 December 2007
Issue Date: November 2008
DOI: https://doi.org/10.1007/s11263-007-0107-3
