Abstract
The success of denoising diffusion models in generating and editing images has sparked interest in using diffusion models to edit 3D scenes represented via neural radiance fields (NeRFs). However, current 3D editing methods lack a way to both pinpoint the edit location and limit changes to the desired volumetric region. Consequently, these methods often over-edit, altering irrelevant parts of the scene. We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. To achieve this, we first tackle 2D edit localization and then lift it to multiple views to address the 3D localization challenge. For 2D localization, we leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. We refer to this discrepancy as the relevance map. The relevance map conveys the importance of changing each pixel to achieve an edit and guides downstream modifications, ensuring that pixels irrelevant to the edit remain unchanged. From the relevance maps of multiview posed images, we define the relevance field, which delimits the 3D region within which modifications should be made. This enables us to improve the quality of text-guided 3D NeRF scene editing by performing iterative updates on the training views, guided by renders from the relevance field. Our method achieves state-of-the-art performance on both NeRF and image editing tasks. We will make the code available.
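To make the idea of the relevance map concrete, the sketch below shows one plausible way to turn two IP2P noise predictions, one made with the edit instruction and one without it, into a per-pixel relevance map. The channel averaging, min-max normalization, and the `threshold` value are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def relevance_map(eps_cond: torch.Tensor,
                  eps_uncond: torch.Tensor,
                  threshold: float = 0.5):
    """Sketch: relevance map from two IP2P noise predictions.

    eps_cond:   noise prediction conditioned on the image AND the edit
                instruction, shape (C, H, W).
    eps_uncond: noise prediction conditioned on the image only, same shape.
    """
    # Per-pixel discrepancy between the two predictions, averaged over channels.
    diff = (eps_cond - eps_uncond).abs().mean(dim=0)
    # Min-max normalize to [0, 1] so maps are comparable across images and steps.
    relevance = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    # Binarize: pixels above the threshold are allowed to change during editing.
    mask = (relevance > threshold).float()
    return relevance, mask
```

In a full pipeline, the resulting mask would gate the edit so that unmasked pixels keep their original values, and per-view maps would be aggregated into the relevance field described above.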
References
Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: CVPR (2022)
Balaji, Y., et al.: eDiff-I: text-to-image diffusion models with ensemble of expert denoisers. arXiv (2022)
Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-NeRF: a multiscale representation for anti-aliasing neural radiance fields. In: ICCV (2021)
Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-NeRF 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022)
Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Zip-NeRF: anti-aliased grid-based neural radiance fields. In: ICCV (2023)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2Video: video editing using image diffusion. In: ICCV (2023)
Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: TensoRF: tensorial radiance fields. In: ECCV (2022)
Chen, Z., Funkhouser, T., Hedman, P., Tagliasacchi, A.: MobileNeRF: exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. In: CVPR (2023)
Cheng, Y., et al.: Segment and track anything. arXiv preprint arXiv:2305.06558 (2023)
Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: DiffEdit: diffusion-based semantic image editing with mask guidance. In: ICLR (2023)
Dai, A., Siddiqui, Y., Thies, J., Valentin, J., Niessner, M.: SPSG: self-supervised photometric scene generation from RGB-D scans. In: CVPR (2021)
Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: fewer views and faster training for free. In: CVPR (2022)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
Dong, J., Wang, Y.X.: ViCA-NeRF: view-consistency-aware 3D editing of neural radiance fields. In: NeurIPS (2023)
Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. TOG (2022)
Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: TokenFlow: consistent diffusion features for consistent video editing. In: ICLR (2024)
Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: ICCV (2023)
Hedman, P., Srinivasan, P.P., Mildenhall, B., Barron, J.T., Debevec, P.: Baking neural radiance fields for real-time view synthesis. In: ICCV (2021)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. In: ICLR (2023)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. JMLR (2022)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: CVPR (2022)
Jheng, R.F., Wu, T.H., Yeh, J.F., Hsu, W.H.: Free-form 3D scene inpainting with dual-stream GAN. In: BMVC (2022)
Kania, K., Yi, K.M., Kowalski, M., Trzciński, T., Tagliasacchi, A.: CoNeRF: controllable neural radiance fields. In: CVPR (2022)
Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: CVPR (2023)
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. TOG (2023)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv (2022)
Kuang, Z., Luan, F., Bi, S., Shu, Z., Wetzstein, G., Sunkavalli, K.: PaletteNeRF: palette-based appearance editing of neural radiance fields. arXiv (2022)
Kurz, A., Neff, T., Lv, Z., Zollhöfer, M., Steinberger, M.: AdaNeRF: adaptive sampling for real-time rendering of neural radiance fields. In: ECCV (2022)
Lazova, V., Guzov, V., Olszewski, K., Tulyakov, S., Pons-Moll, G.: Control-NeRF: editable feature volumes for scene rendering and manipulation. In: WACV (2023)
Li, Z., et al.: CompNVS: novel view synthesis with scene completion. In: ECCV (2022)
Lin, C.H., et al.: Magic3D: high-resolution text-to-3D content creation. In: CVPR (2023)
Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural radiance fields. In: ICCV (2021)
Lindell, D.B., Van Veen, D., Park, J.J., Wetzstein, G.: BACON: band-limited coordinate networks for multiscale scene representation. In: CVPR (2022)
Liu, H.K., Shen, I.C., Chen, B.Y.: NeRF-In: free-form NeRF inpainting with RGB-D priors. arXiv (2022)
Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual generation with composable diffusion models. In: ECCV (2023)
Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: zero-shot one image to 3D object. arXiv (2023)
Liu, S., Zhang, X., Zhang, Z., Zhang, R., Zhu, J.Y., Russell, B.: Editing conditional radiance fields. In: ICCV (2021)
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: inpainting using denoising diffusion probabilistic models. In: CVPR (2022)
Max, N., Chen, M.: Local and global illumination in the volume rendering integral. Technical report, Lawrence Livermore National Lab (LLNL), Livermore, CA (United States) (2005)
Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: ICLR (2021)
Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: SKED: sketch-guided text-based 3D editing. arXiv (2023)
Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. TOG (2019)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
Mirzaei, A., et al.: Reference-guided controllable inpainting of neural radiance fields. In: ICCV (2023)
Mirzaei, A., et al.: SPIn-NeRF: multiview segmentation and perceptual inpainting with neural radiance fields. In: CVPR (2023)
Mirzaei, A., Kant, Y., Kelly, J., Gilitschenski, I.: LaTeRF: label and text driven object radiance fields. In: ECCV (2022)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. (2013)
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. TOG (2022)
Nichol, A.Q., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: SIGGRAPH (2023)
von Platen, P., et al.: Diffusers: state-of-the-art diffusion models (2022). https://github.com/huggingface/diffusers
Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: text-to-3D using 2D diffusion. In: ICLR (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv (2022)
Reiser, C., et al.: MERF: memory-efficient radiance fields for real-time view synthesis in unbounded scenes. arXiv (2023)
Roeder, G., Wu, Y., Duvenaud, D.: Sticking the landing: simple, lower-variance gradient estimators for variational inference. arXiv (2017)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Saharia, C., et al.: Palette: image-to-image diffusion models. In: SIGGRAPH (2022)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. TPAMI (2022)
Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks. In: CVPR (2022)
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV (2016)
Sohl-Dickstein, J., Weiss, E.A., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR (2017)
Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: NeurIPS (2019)
Tancik, M., et al.: Nerfstudio: a modular framework for neural radiance field development. In: SIGGRAPH (2023)
Tewari, A., et al.: Advances in neural rendering. In: SIGGRAPH (2021)
Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-NeRF: structured view-dependent appearance for neural radiance fields. In: CVPR (2022)
Wang, C., Chai, M., He, M., Chen, D., Liao, J.: CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In: CVPR (2022)
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: NeRF-Art: text-driven neural radiance fields stylization. TVCG (2023)
Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation. In: CVPR (2023)
Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. In: CVPR (2021)
Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: high-fidelity and diverse text-to-3D generation with variational score distillation. arXiv (2023)
Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A.: NeRF--: neural radiance fields without known camera parameters. arXiv (2021)
Weder, S., et al.: Removing objects from neural radiance fields. In: CVPR (2023)
Yang, B., et al.: Learning object-compositional neural radiance field for editable scene rendering. In: ICCV (2021)
Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A.: PlenOctrees for real-time rendering of neural radiance fields. In: ICCV (2021)
Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)
Yuan, Y.J., Sun, Y.T., Lai, Y.K., Ma, Y., Jia, R., Gao, L.: NeRF-editing: geometry editing of neural radiance fields. In: CVPR (2022)
Zhang, Z., Li, B., Nie, X., Han, C., Guo, T., Liu, L.: Towards consistent video editing with text-to-image diffusion models. In: NeurIPS (2024)
Cite this paper
Mirzaei, A., et al. (2025). Watch Your Steps: Local Image and Scene Editing by Text Instructions. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_7