Abstract
There are billions of photographs on the Internet, comprising the largest and most diverse photo collection ever assembled. How can computer vision researchers exploit this imagery? This paper explores this question from the standpoint of 3D scene modeling and visualization. We present structure-from-motion and image-based rendering algorithms that operate on hundreds of images downloaded as a result of keyword-based image search queries like “Notre Dame” or “Trevi Fountain.” This approach, which we call Photo Tourism, has enabled reconstructions of numerous well-known world sites. This paper presents these algorithms and results as a first step towards 3D modeling of the world’s well-photographed sites, cities, and landscapes from Internet imagery, and discusses key open problems and challenges for the research community.
Cite this article
Snavely, N., Seitz, S.M. & Szeliski, R. Modeling the World from Internet Photo Collections. Int J Comput Vis 80, 189–210 (2008). https://doi.org/10.1007/s11263-007-0107-3