Abstract
The success of denoising diffusion models in generating and editing images has sparked interest in using diffusion models to edit 3D scenes represented via neural radiance fields (NeRFs). However, current 3D editing methods lack a way to both pinpoint the edit location and limit changes to the desired volumetric region. Consequently, these methods often over-edit, altering irrelevant parts of the scene. We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. To achieve this, we first tackle 2D edit localization and then lift it to multiple views to address the 3D localization challenge. For 2D localization, we leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. We refer to this discrepancy as the relevance map. The relevance map conveys the importance of changing each pixel to achieve an edit and guides downstream modifications, ensuring that pixels irrelevant to the edit remain unchanged. From the relevance maps of multiview posed images, we define the relevance field, which delimits the 3D region within which modifications should be made. This enables us to improve the quality of text-guided 3D NeRF scene editing by performing iterative updates on the training views, guided by renders from the relevance field. Our method achieves state-of-the-art performance on both NeRF and image editing tasks. We will make the code available.
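To make the idea of the relevance map concrete, the sketch below shows one plausible way to turn two IP2P noise predictions, one made with the edit instruction and one without it, into a per-pixel relevance map. The channel averaging, min-max normalization, and the `threshold` value are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def relevance_map(eps_cond: torch.Tensor,
                  eps_uncond: torch.Tensor,
                  threshold: float = 0.5):
    """Sketch: relevance map from two IP2P noise predictions.

    eps_cond:   noise prediction conditioned on the image AND the edit
                instruction, shape (C, H, W).
    eps_uncond: noise prediction conditioned on the image only, same shape.
    """
    # Per-pixel discrepancy between the two predictions, averaged over channels.
    diff = (eps_cond - eps_uncond).abs().mean(dim=0)
    # Min-max normalize to [0, 1] so maps are comparable across images and steps.
    relevance = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    # Binarize: pixels above the threshold are allowed to change during editing.
    mask = (relevance > threshold).float()
    return relevance, mask
```

In a full pipeline, the resulting mask would gate the edit so that unmasked pixels keep their original values, and per-view maps would be aggregated into the relevance field described above.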
References
Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: CVPR (2022)
Balaji, Y., et al.: eDiff-I: text-to-image diffusion models with ensemble of expert denoisers. arXiv (2022)
Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-NeRF: a multiscale representation for anti-aliasing neural radiance fields. In: ICCV (2021)
Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-NeRF 360: unbounded anti-aliased neural radiance fields. In: CVPR (2022)
Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Zip-NeRF: anti-aliased grid-based neural radiance fields. In: ICCV (2023)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2Video: video editing using image diffusion. In: ICCV (2023)
Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: TensoRF: tensorial radiance fields. In: ECCV (2022)
Chen, Z., Funkhouser, T., Hedman, P., Tagliasacchi, A.: MobileNeRF: exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. In: CVPR (2023)
Cheng, Y., et al.: Segment and track anything. arXiv preprint arXiv:2305.06558 (2023)
Couairon, G., Verbeek, J., Schwenk, H., Cord, M.: DiffEdit: diffusion-based semantic image editing with mask guidance. In: ICLR (2023)
Dai, A., Siddiqui, Y., Thies, J., Valentin, J., Niessner, M.: SPSG: self-supervised photometric scene generation from RGB-D scans. In: CVPR (2021)
Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: fewer views and faster training for free. In: CVPR (2022)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
Dong, J., Wang, Y.X.: ViCA-NeRF: view-consistency-aware 3D editing of neural radiance fields. In: NeurIPS (2023)
Gal, R., Patashnik, O., Maron, H., Chechik, G., Cohen-Or, D.: StyleGAN-NADA: CLIP-guided domain adaptation of image generators. TOG (2022)
Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: TokenFlow: consistent diffusion features for consistent video editing. In: ICLR (2024)
Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: ICCV (2023)
Hedman, P., Srinivasan, P.P., Mildenhall, B., Barron, J.T., Debevec, P.: Baking neural radiance fields for real-time view synthesis. In: ICCV (2021)
Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. In: ICLR (2023)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. JMLR (2022)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: CVPR (2022)
Jheng, R.F., Wu, T.H., Yeh, J.F., Hsu, W.H.: Free-form 3D scene inpainting with dual-stream GAN. In: BMVC (2022)
Kania, K., Yi, K.M., Kowalski, M., Trzciński, T., Tagliasacchi, A.: CoNeRF: controllable neural radiance fields. In: CVPR (2022)
Kawar, B., et al.: Imagic: text-based real image editing with diffusion models. In: CVPR (2023)
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. TOG (2023)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv (2022)
Kuang, Z., Luan, F., Bi, S., Shu, Z., Wetzstein, G., Sunkavalli, K.: PaletteNeRF: palette-based appearance editing of neural radiance fields. arXiv (2022)
Kurz, A., Neff, T., Lv, Z., Zollhöfer, M., Steinberger, M.: AdaNeRF: adaptive sampling for real-time rendering of neural radiance fields. In: ECCV (2022)
Lazova, V., Guzov, V., Olszewski, K., Tulyakov, S., Pons-Moll, G.: Control-NeRF: editable feature volumes for scene rendering and manipulation. In: WACV (2023)
Li, Z., et al.: CompNVS: novel view synthesis with scene completion. In: ECCV (2022)
Lin, C.H., et al.: Magic3D: high-resolution text-to-3D content creation. In: CVPR (2023)
Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural radiance fields. In: ICCV (2021)
Lindell, D.B., Van Veen, D., Park, J.J., Wetzstein, G.: BACON: band-limited coordinate networks for multiscale scene representation. In: CVPR (2022)
Liu, H.K., Shen, I.C., Chen, B.Y.: NeRF-In: free-form NeRF inpainting with RGB-D priors. arXiv (2022)
Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual generation with composable diffusion models. In: ECCV (2023)
Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: zero-shot one image to 3D object. arXiv (2023)
Liu, S., Zhang, X., Zhang, Z., Zhang, R., Zhu, J.Y., Russell, B.: Editing conditional radiance fields. In: ICCV (2021)
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: inpainting using denoising diffusion probabilistic models. In: CVPR (2022)
Max, N., Chen, M.: Local and global illumination in the volume rendering integral. Technical report, Lawrence Livermore National Lab (LLNL), Livermore, CA (United States) (2005)
Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: ICLR (2021)
Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: SKED: sketch-guided text-based 3D editing. arXiv (2023)
Mildenhall, B., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. TOG (2019)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
Mirzaei, A., et al.: Reference-guided controllable inpainting of neural radiance fields. In: ICCV (2023)
Mirzaei, A., et al.: SPIn-NeRF: multiview segmentation and perceptual inpainting with neural radiance fields. In: CVPR (2023)
Mirzaei, A., Kant, Y., Kelly, J., Gilitschenski, I.: LaTeRF: label and text driven object radiance fields. In: ECCV (2022)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. (2013)
Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. TOG (2022)
Nichol, A.Q., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to-image translation. In: SIGGRAPH (2023)
von Platen, P., et al.: Diffusers: state-of-the-art diffusion models (2022). https://github.com/huggingface/diffusers
Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: text-to-3D using 2D diffusion. In: ICLR (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv (2022)
Reiser, C., et al.: MERF: memory-efficient radiance fields for real-time view synthesis in unbounded scenes. arXiv (2023)
Roeder, G., Wu, Y., Duvenaud, D.: Sticking the landing: simple, lower-variance gradient estimators for variational inference. arXiv (2017)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Saharia, C., et al.: Palette: image-to-image diffusion models. In: SIGGRAPH (2022)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image super-resolution via iterative refinement. TPAMI (2022)
Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: radiance fields without neural networks. In: CVPR (2022)
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV (2016)
Sohl-Dickstein, J., Weiss, E.A., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: ICML (2015)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR (2017)
Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: NeurIPS (2019)
Tancik, M., et al.: Nerfstudio: a modular framework for neural radiance field development. In: SIGGRAPH (2023)
Tewari, A., et al.: Advances in neural rendering. In: SIGGRAPH (2021)
Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.: Ref-NeRF: structured view-dependent appearance for neural radiance fields. In: CVPR (2022)
Wang, C., Chai, M., He, M., Chen, D., Liao, J.: CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In: CVPR (2022)
Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: NeRF-Art: text-driven neural radiance fields stylization. TVCG (2023)
Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation. In: CVPR (2023)
Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. In: CVPR (2021)
Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: high-fidelity and diverse text-to-3D generation with variational score distillation. arXiv (2023)
Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A.: NeRF--: neural radiance fields without known camera parameters. arXiv (2021)
Weder, S., et al.: Removing objects from neural radiance fields. In: CVPR (2023)
Yang, B., et al.: Learning object-compositional neural radiance field for editable scene rendering. In: ICCV (2021)
Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A.: PlenOctrees for real-time rendering of neural radiance fields. In: ICCV (2021)
Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)
Yuan, Y.J., Sun, Y.T., Lai, Y.K., Ma, Y., Jia, R., Gao, L.: NeRF-editing: geometry editing of neural radiance fields. In: CVPR (2022)
Zhang, Z., Li, B., Nie, X., Han, C., Guo, T., Liu, L.: Towards consistent video editing with text-to-image diffusion models. In: NeurIPS (2024)
Cite this paper
Mirzaei, A., et al. (2025). Watch Your Steps: Local Image and Scene Editing by Text Instructions. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_7