LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

Published: 22 October 2024
DOI: 10.1007/978-3-031-72673-6_8

Abstract

Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem – egocentric action frame generation. The goal is to synthesize an image depicting an action in the user’s context (i.e., an action frame) by conditioning on a user prompt and an input egocentric image. Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal at controlling the state transition of an action in egocentric image pixel space because of the domain gap. To this end, we propose to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, we introduce a prompt enhancement scheme that generates enriched action descriptions from a visual large language model (VLLM) through visual instruction tuning. Then we propose a novel method that leverages image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model. We validate our model on two egocentric datasets – Ego4D and EPIC-Kitchens. Our experiments show substantial improvements over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method. The dataset and code are available on the project website (https://bolinlai.github.io/Lego_EgoActGen/).
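The abstract couples an instruction-tuned VLLM with a diffusion model: the VLLM produces an enriched action description, and its image and text embeddings serve as extra conditioning signals for the image generator. As a concrete illustration of that conditioning idea only, the following is a minimal, hypothetical PyTorch sketch, not the authors' LEGO implementation. All module names, dimensions (a 4096-d VLLM embedding, a 768-d conditioning space, a 4-channel latent), and the additive fusion are assumptions chosen to keep the example short.

import torch
import torch.nn as nn

class VLLMConditionedDenoiser(nn.Module):
    """Hypothetical denoiser that fuses VLLM text and image embeddings
    into the conditioning pathway of a toy diffusion denoiser."""

    def __init__(self, latent_ch=4, cond_dim=768, vllm_dim=4096):
        super().__init__()
        # Project the VLLM-derived embeddings into a shared conditioning space.
        self.text_proj = nn.Linear(vllm_dim, cond_dim)
        self.image_proj = nn.Linear(vllm_dim, cond_dim)
        # Map the fused condition to a per-channel bias on the noisy latent.
        self.cond_to_bias = nn.Linear(cond_dim, latent_ch)
        # Stand-in for a U-Net: predicts the noise residual from the latent.
        self.denoiser = nn.Sequential(
            nn.Conv2d(latent_ch, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latent, vllm_text_emb, vllm_image_emb):
        # Fuse the two conditions (addition is an arbitrary choice for this sketch).
        cond = self.text_proj(vllm_text_emb) + self.image_proj(vllm_image_emb)
        bias = self.cond_to_bias(cond)[:, :, None, None]  # broadcast over H x W
        return self.denoiser(noisy_latent + bias)

# Dummy usage: batch of 2, 4-channel 32x32 latents, 4096-d VLLM embeddings.
model = VLLMConditionedDenoiser()
noise_pred = model(torch.randn(2, 4, 32, 32),
                   torch.randn(2, 4096),
                   torch.randn(2, 4096))
print(noise_pred.shape)  # torch.Size([2, 4, 32, 32])

In a full latent diffusion model, the projected embeddings would more plausibly enter through the denoising U-Net's cross-attention layers; the additive bias above only stands in for that injection point.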

        Published In

        Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX
        Sep 2024
        597 pages
        ISBN: 978-3-031-72672-9
        DOI: 10.1007/978-3-031-72673-6
        Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol

        Publisher

        Springer-Verlag, Berlin, Heidelberg

        Author Tags

        1. Egocentric Vision
        2. Instruction Tuning
        3. Diffusion Model
