
3D Congealing: 3D-Aware Image Alignment in the Wild

Published: 30 September 2024
DOI: 10.1007/978-3-031-73232-4_22

Abstract

We propose 3D Congealing, the novel task of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts across the inputs and aggregate the knowledge from 2D images into a shared 3D canonical space. We introduce a general framework that tackles this task without assuming shape templates, poses, or any camera parameters. At its core is a canonical 3D representation that encapsulates geometric and semantic information. The framework jointly optimizes this canonical representation, the pose of each input image, and a per-image coordinate map that warps 2D pixel coordinates to the 3D canonical frame to account for shape matching. The optimization fuses prior knowledge from a pre-trained image generative model with semantic information from the input images: the former provides strong guidance for this under-constrained task, while the latter mitigates the training-data bias of the pre-trained model. Our framework supports tasks such as pose estimation and image editing, achieving strong results on real-world image datasets under challenging illumination conditions and on in-the-wild online image collections. Project page: https://ai.stanford.edu/~yzzhang/projects/3d-congealing/.
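The abstract describes a joint optimization over three sets of variables: the canonical 3D representation, per-image poses, and per-image coordinate maps. Below is a minimal PyTorch sketch of that structure, not the authors' implementation. CanonicalField, WarpMLP, the toy render, and generative_prior_loss are hypothetical placeholders; a real system would use a NeRF/NeuS-style field with proper volume rendering, score-distillation guidance from a pre-trained diffusion model, and DINO-like semantic features. The sketch only shows which quantities are optimized together and where each loss term enters.

```python
# Hedged sketch of the joint optimization; all components are simplified
# stand-ins for illustration, not the paper's actual modules.
import torch
import torch.nn as nn

N, FEAT = 8, 64                               # number of images, feature dim

class CanonicalField(nn.Module):
    """Stand-in for the canonical 3D representation (geometry + semantics)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, 1 + 3 + FEAT))
    def forward(self, xyz):                   # xyz: (..., 3) canonical coords
        out = self.mlp(xyz)
        return out[..., :1], out[..., 1:4], out[..., 4:]  # density, rgb, feat

class WarpMLP(nn.Module):
    """Per-image map from 2D pixel coordinates to the 3D canonical frame."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
    def forward(self, uv):                    # uv: (..., 2) pixel coords
        return self.mlp(uv)

def se3_rotate(points, pose):
    # Toy pose application: small-angle rotation plus translation.
    w = pose[:3]
    return points + torch.cross(w.expand_as(points), points, dim=-1) + pose[3:]

def render(field, pose):
    # Toy "renderer": query the field at posed sample points and composite
    # colors weighted by density (a crude stand-in for volume rendering).
    pts = se3_rotate(torch.rand(1024, 3) - 0.5, pose)
    density, rgb, _ = field(pts)
    return (torch.sigmoid(density) * rgb).mean(dim=0)

def generative_prior_loss(rendered):
    # Placeholder for score-distillation-style guidance (in the spirit of
    # DreamFusion); a real implementation queries a frozen diffusion model.
    return rendered.pow(2).mean()

canonical = CanonicalField()
poses = nn.Parameter(torch.zeros(N, 6))       # per-image se(3) pose params
warps = nn.ModuleList(WarpMLP() for _ in range(N))
images_feat = torch.randn(N, 1024, FEAT)      # placeholder DINO-like features
pixels = torch.rand(1024, 2)                  # pixel grid in [0, 1]^2

opt = torch.optim.Adam([poses, *canonical.parameters(), *warps.parameters()],
                       lr=1e-3)

for step in range(1000):
    i = step % N
    # Generative prior on renderings of the canonical shape under pose i.
    loss = generative_prior_loss(render(canonical, poses[i]))
    # Semantic term: canonical features sampled through the per-image warp
    # should match pre-trained features of input image i.
    _, _, feat = canonical(warps[i](pixels))
    loss = loss + (feat - images_feat[i]).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The two loss terms mirror the two information sources in the abstract: the generative prior constrains the otherwise under-constrained 3D reconstruction, while the per-image semantic matching grounds the canonical space in the actual input collection.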

    Published In

Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part I
September 2024, 580 pages
ISBN: 978-3-031-73231-7
DOI: 10.1007/978-3-031-73232-4
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

    Publisher

Springer-Verlag, Berlin, Heidelberg
