CVPR 2024 Workshop
Tuesday June 18th
13:30 - 18:00
Room: Summit 427
Videos demonstrating people performing procedural activities, i.e., sequences of steps towards a goal, have gained popularity as effective tools for skill acquisition, spanning various domains like cooking, home improvement, and crafting. In addition to being useful teaching materials for humans, procedural videos paired with language are also a promising medium for multimodal learning by machines, as they combine visual demonstrations with detailed verbal descriptions. Despite the recent introduction of multiple datasets, such as HT100M, HT-Step, and Ego4D Goal-Step, models in the procedural video domain still lag behind their image-based counterparts.
This workshop aims to foster discussion on the future of language-based procedural video understanding. We will explore ways to integrate diverse language sources, harness LLMs for structured task knowledge, and combine language with other information streams (visual, audio, IMU, etc.) to enhance procedural video understanding (recognizing key steps, detecting mistakes, understanding hand-object interactions, etc.).
13:30 - 13:40: Welcome and introduction
13:40 - 14:10: Antoine Miech
14:10 - 14:40: Hilde Kuehne
14:40 - 15:10: Ivan Laptev
15:10 - 16:00: Poster Session / Coffee Break (Poster IDs #381-405)
16:00 - 16:30: Cordelia Schmid
16:30 - 17:00: Juho Kim
17:00 - 17:30: Dima Damen
17:30 - 17:50: Roundtable Discussion
17:55 - 18:00: Closing Remarks
FAIR, Meta
FAIR, Meta
FAIR, Meta
VGG, University of Oxford
Max Planck Institute for Informatics
Apple
Northeastern University
FAIR, Meta & UT Austin
FAIR, Meta