Watch Your Steps: Local Image and Scene Editing by Text Instructions

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The success of denoising diffusion models in generating and editing images has sparked interest in using diffusion models to edit 3D scenes represented via neural radiance fields (NeRFs). However, current 3D editing methods lack a way to both pinpoint the edit location and limit changes to the desired volumetric region. Consequently, they often over-edit, altering irrelevant parts of the scene. We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. To achieve this goal, we first tackle 2D edit localization and then lift it to multiple views to address the 3D localization challenge. For 2D localization, we leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. We refer to this discrepancy as the relevance map. The relevance map conveys the importance of changing each pixel to achieve an edit, and it guides downstream modifications, ensuring that pixels irrelevant to the edit remain unchanged. From the relevance maps of posed multiview images, we define the relevance field, which delimits the 3D region within which modifications should be made. This enables us to improve the quality of text-guided 3D NeRF scene editing by performing iterative updates on the training views, guided by renders from the relevance field. Our method achieves state-of-the-art performance on both NeRF and image editing tasks. We will make the code available.
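
To make the core idea concrete, the sketch below shows one plausible way to compute a relevance map and use it to localize a 2D edit. It is a minimal illustration, not the paper's implementation: `eps_cond` and `eps_uncond` stand for the denoiser's noise predictions for the same noised image with and without the edit instruction (e.g., from an IP2P-style model), and the channel aggregation, normalization, and `threshold` value are all assumptions.

```python
import torch

def relevance_map(eps_cond: torch.Tensor, eps_uncond: torch.Tensor) -> torch.Tensor:
    """Per-pixel relevance as the discrepancy between the noise predicted
    with the instruction (eps_cond) and without it (eps_uncond).
    Both inputs have shape (C, H, W); the output is (H, W), scaled to [0, 1]."""
    diff = (eps_cond - eps_uncond).abs().mean(dim=0)  # aggregate over channels
    return diff / diff.max().clamp(min=1e-8)          # normalize to [0, 1]

def localized_edit(original: torch.Tensor, edited: torch.Tensor,
                   relevance: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Keep the edit only where the relevance map says it matters; pixels
    below the (hypothetical) threshold are copied back from the original.
    original/edited: (C, H, W); relevance: (H, W)."""
    mask = (relevance > threshold).to(original.dtype)  # binary mask, broadcast over C
    return mask * edited + (1.0 - mask) * original
```

In the 3D setting described above, per-view maps of this kind would be lifted into a relevance field, and renders of that field would gate the iterative updates to each training view in the same way the mask gates the 2D edit here.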


References

  1. Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: CVPR (2022)

  2. Balaji, Y., et al.: eDiff-I: text-to-image diffusion models with ensemble of expert denoisers. arXiv (2022)

  3. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-NeRF: a multiscale representation for anti-aliasing neural radiance fields. In: ICCV (2021)

  4. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-NeRF 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022)

  5. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Zip-NeRF: anti-aliased grid-based neural radiance fields. In: ICCV (2023)

  6. Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)

  7. Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2Video: video editing using image diffusion. In: ICCV (2023)

  8. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: TensoRF: tensorial radiance fields. In: ECCV (2022)

  9. Chen, Z., Funkhouser, T., Hedman, P., Tagliasacchi, A.: MobileNeRF: exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. In: CVPR (2023)

  10. Cheng, Y., et al.: Segment and track anything. arXiv preprint arXiv:2305.06558 (2023)

  11. Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: DiffEdit: diffusion-based semantic image editing with mask guidance. In: ICLR (2023)

  12. Dai, A., Siddiqui, Y., Thies, J., Valentin, J., Niessner, M.: SPSG: self-supervised photometric scene generation from RGB-D scans. In: CVPR (2021)

  13. Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: fewer views and faster training for free. In: CVPR (2022)

  14. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)

  15. Dong, J., Wang, Y.X.: ViCA-NeRF: view-consistency-aware 3D editing of neural radiance fields. In: NeurIPS (2023)

  16. Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. TOG (2022)

  17. Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: TokenFlow: consistent diffusion features for consistent video editing. In: ICLR (2024)

  18. Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: ICCV (2023)

  19. Hedman, P., Srinivasan, P.P., Mildenhall, B., Barron, J.T., Debevec, P.: Baking neural radiance fields for real-time view synthesis. In: ICCV (2021)

  20. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. In: ICLR (2023)

  21. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)

  22. Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. JMLR (2022)

  23. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  24. Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: CVPR (2022)

  25. Jheng, R.F., Wu, T.H., Yeh, J.F., Hsu, W.H.: Free-form 3D scene inpainting with dual-stream GAN. In: BMVC (2022)

  26. Kania, K., Yi, K.M., Kowalski, M., Trzciński, T., Tagliasacchi, A.: CoNeRF: controllable neural radiance fields. In: CVPR (2022)

  27. Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: CVPR (2023)

  28. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. TOG (2023)

  29. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv (2022)

  30. Kuang, Z., Luan, F., Bi, S., Shu, Z., Wetzstein, G., Sunkavalli, K.: PaletteNeRF: palette-based appearance editing of neural radiance fields. arXiv (2022)

  31. Kurz, A., Neff, T., Lv, Z., Zollhöfer, M., Steinberger, M.: AdaNeRF: adaptive sampling for real-time rendering of neural radiance fields. In: ECCV (2022)

  32. Lazova, V., Guzov, V., Olszewski, K., Tulyakov, S., Pons-Moll, G.: Control-NeRF: editable feature volumes for scene rendering and manipulation. In: WACV (2023)

  33. Li, Z., et al.: CompNVS: novel view synthesis with scene completion. In: ECCV (2022)

  34. Lin, C.H., et al.: Magic3D: high-resolution text-to-3D content creation. In: CVPR (2023)

  35. Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural radiance fields. In: ICCV (2021)

  36. Lindell, D.B., Van Veen, D., Park, J.J., Wetzstein, G.: BACON: band-limited coordinate networks for multiscale scene representation. In: CVPR (2022)

  37. Liu, H.K., Shen, I.C., Chen, B.Y.: NeRF-In: free-form NeRF inpainting with RGB-D priors. arXiv (2022)

  38. Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual generation with composable diffusion models. In: ECCV (2023)

  39. Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: zero-shot one image to 3D object. arXiv (2023)

  40. Liu, S., Zhang, X., Zhang, Z., Zhang, R., Zhu, J.Y., Russell, B.: Editing conditional radiance fields. In: ICCV (2021)

  41. Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: inpainting using denoising diffusion probabilistic models. In: CVPR (2022)

  42. Max, N., Chen, M.: Local and global illumination in the volume rendering integral. Technical report, Lawrence Livermore National Lab (LLNL), Livermore, CA (United States) (2005)

  43. Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: ICLR (2022)

  44. Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: SKED: sketch-guided text-based 3D editing. arXiv (2023)

  45. Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. TOG (2019)

  46. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)

  47. Mirzaei, A., et al.: Reference-guided controllable inpainting of neural radiance fields. In: ICCV (2023)

  48. Mirzaei, A., et al.: SPIn-NeRF: multiview segmentation and perceptual inpainting with neural radiance fields. In: CVPR (2023)

  49. Mirzaei, A., Kant, Y., Kelly, J., Gilitschenski, I.: LaTeRF: label and text driven object radiance fields. In: ECCV (2022)

  50. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. (2013)

  51. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. TOG (2022)

  52. Nichol, A.Q., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)

  53. Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: SIGGRAPH (2023)

  54. von Platen, P., et al.: Diffusers: state-of-the-art diffusion models (2022). https://github.com/huggingface/diffusers

  55. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: text-to-3D using 2D diffusion. In: ICLR (2023)

  56. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  57. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv (2022)

  58. Reiser, C., et al.: MERF: memory-efficient radiance fields for real-time view synthesis in unbounded scenes. arXiv (2023)

  59. Roeder, G., Wu, Y., Duvenaud, D.: Sticking the landing: simple, lower-variance gradient estimators for variational inference. arXiv (2017)

  60. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  61. Saharia, C., et al.: Palette: image-to-image diffusion models. In: SIGGRAPH (2022)

  62. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)

  63. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. TPAMI (2022)

  64. Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks. In: CVPR (2022)

  65. Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

  66. Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV (2016)

  67. Sohl-Dickstein, J., Weiss, E.A., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)

  68. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)

  69. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR (2017)

  70. Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: NeurIPS (2019)

  71. Tancik, M., et al.: Nerfstudio: a modular framework for neural radiance field development. In: SIGGRAPH (2023)

  72. Tewari, A., et al.: Advances in neural rendering. In: SIGGRAPH (2021)

  73. Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-NeRF: structured view-dependent appearance for neural radiance fields. In: CVPR (2022)

  74. Wang, C., Chai, M., He, M., Chen, D., Liao, J.: CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In: CVPR (2022)

  75. Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: NeRF-Art: text-driven neural radiance fields stylization. TVCG (2023)

  76. Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation. In: CVPR (2023)

  77. Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. In: CVPR (2021)

  78. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: high-fidelity and diverse text-to-3D generation with variational score distillation. arXiv (2023)

  79. Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A.: NeRF–: neural radiance fields without known camera parameters. arXiv (2021)

  80. Weder, S., et al.: Removing objects from neural radiance fields. In: CVPR (2023)

  81. Yang, B., et al.: Learning object-compositional neural radiance field for editable scene rendering. In: ICCV (2021)

  82. Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A.: PlenOctrees for real-time rendering of neural radiance fields. In: ICCV (2021)

  83. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)

  84. Yuan, Y.J., Sun, Y.T., Lai, Y.K., Ma, Y., Jia, R., Gao, L.: NeRF-editing: geometry editing of neural radiance fields. In: CVPR (2022)

  85. Zhang, Z., Li, B., Nie, X., Han, C., Guo, T., Liu, L.: Towards consistent video editing with text-to-image diffusion models. In: NeurIPS (2024)


Author information


Corresponding author

Correspondence to Ashkan Mirzaei.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7995 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mirzaei, A. et al. (2025). Watch Your Steps: Local Image and Scene Editing by Text Instructions. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_7

  • DOI: https://doi.org/10.1007/978-3-031-72920-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72919-5

  • Online ISBN: 978-3-031-72920-1

  • eBook Packages: Computer Science, Computer Science (R0)
