Towards Real-time Multimodal Contextual Assistants
Advancing egocentric vision, proactive AI, and long-context multimodal understanding for next-generation wearable devices.
About the Workshop
🏆 Your chance to work with a newly collected, large-scale real-world wearables dataset and compete for $21K in total cash prizes across three grand challenges that define the frontier of Wearable AI.
Imagine an AI that watches the world through your eyes, remembers your entire day, and proactively helps before you even ask. That future is being built right now — and this workshop is where the key breakthroughs will be discussed. If you work on egocentric video, multimodal LLMs, conversational AI, or on-device inference, this is your workshop.
Three things make this workshop stand out: (1) it brings together researchers from egocentric vision, long-form video Q&A, and conversational AI in one place; (2) it introduces high-impact benchmark tasks — proactive AI and streaming multi-turn dialog — that push the frontier of Wearable AI; (3) every participant gets a ready-to-use Participant Toolkit and baseline models to hit the ground running. Come ready to collaborate at one of the most vibrant sessions at ECCV.
Research Areas
The workshop welcomes submissions on the following topics, among others:
All submissions will be handled electronically via the OpenReview workshop submission website. Submissions must follow the ECCV main conference format and abide by ECCV policies. More details on the submission deadline will be shared soon.
Grand Challenge
Wearable AI is one of the most exciting and challenging frontiers in computer vision and AI today. Yet progress has been hampered by a lack of large-scale, realistic benchmarks that reflect the complexity of real-world egocentric experiences — long-duration video, proactive AI interactions, and multi-turn conversations. This grand challenge is designed to close that gap: to catalyze the community around concrete, measurable tasks and to reward reproducible, open-sourced breakthroughs.
In addition to general research paper submissions, our workshop features three specific grand challenge tasks. A total of $21,000 in cash prizes will be awarded to winners with reproducible, open-sourced results. The dev set is released as part of the challenge so that participants can iterate on their systems. A test set of equal size is kept hidden for evaluation.
Be sure to review the Official Challenge Rules for complete details and to verify your eligibility to participate.
Given a long egocentric video and user requests, predict proactive engagements (e.g., step-by-step instructions, object finding) at appropriate moments to maximize utility while minimizing disruption.
🧑: Help me make an espresso shot.
👓: Sure! First, grind the coffee beans ...
🧑: [silent: grinds the coffee]
👓: [Proactive Interruption] Great job with the grinding! Now, place the ground coffee into the portafilter.
🧑: [silent: fills up the portafilter]
👓: [Proactive Interruption] Nice! Now, use a tamper to gently press the coffee.
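As a rough illustration of the trade-off between utility and disruption, the sketch below scores predicted intervention times against reference times. It is not the official challenge metric: the `Intervention` class, the `match_interventions` function, the 5-second tolerance, and all timestamps are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    """A proactive message and the video time (in seconds) at which it is spoken."""
    time_s: float
    text: str


def match_interventions(predicted, reference, tolerance_s=5.0):
    """Greedily match each reference intervention to the nearest unused
    prediction within +/- tolerance_s seconds.

    Returns (precision, recall). Low precision means the assistant spoke
    when it should have stayed silent (disruption); low recall means it
    missed moments where help was expected (lost utility).
    """
    used = set()
    hits = 0
    for ref in sorted(reference, key=lambda r: r.time_s):
        best, best_gap = None, tolerance_s
        for i, pred in enumerate(predicted):
            gap = abs(pred.time_s - ref.time_s)
            if i not in used and gap <= best_gap:
                best, best_gap = i, gap
        if best is not None:
            used.add(best)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical timestamps for the espresso example above.
reference = [
    Intervention(42.0, "Now, place the ground coffee into the portafilter."),
    Intervention(75.0, "Now, use a tamper to gently press the coffee."),
]
predicted = [
    Intervention(44.5, "Place the coffee into the portafilter."),
    Intervention(60.0, "Are you done yet?"),  # unnecessary interruption
    Intervention(77.0, "Press the coffee with the tamper."),
]

precision, recall = match_interventions(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=1.00
```

In this toy case both reference moments are covered, while the extra message at 60 seconds lowers precision, which is one simple way to make "disruption" measurable.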
Given a long egocentric video interleaved with user–assistant interactions, predict answers to user questions in a streaming fashion, maintaining coherence with past context.
🧑: Which option should I pick on this kiosk?
👓: Hit "Get New Card", then "Single Ride" for $2.90 ...
🧑: Done. Which train do I take from here?
👓: Take the uptown N, Q, or R. Times Square is three stops away.
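The streaming setting can be pictured as a loop over time-ordered events in which each answer may only use frames and dialog turns seen so far. The following is a minimal sketch under that assumption, not the Participant Toolkit API: `Event`, `run_streaming_session`, the `answer_fn` callback, and the 64-frame window are all hypothetical names and values.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    time_s: float
    kind: str     # "frame" or "user" (hypothetical event types)
    payload: str  # a frame identifier, or the user's utterance


def run_streaming_session(
    events: Iterable[Event],
    answer_fn: Callable[[List[str], List[Tuple[str, str]], str], str],
    max_frames: int = 64,
) -> List[Tuple[float, str]]:
    """Replay events in time order and answer each user turn causally.

    Only frames that arrived before the question are visible, and the
    frame buffer is bounded so memory stays constant on long videos.
    """
    frames: Deque[str] = deque(maxlen=max_frames)
    dialog: List[Tuple[str, str]] = []  # (user question, assistant reply)
    answers: List[Tuple[float, str]] = []
    for ev in sorted(events, key=lambda e: e.time_s):
        if ev.kind == "frame":
            frames.append(ev.payload)
        elif ev.kind == "user":
            reply = answer_fn(list(frames), list(dialog), ev.payload)
            dialog.append((ev.payload, reply))
            answers.append((ev.time_s, reply))
    return answers


def stub_model(frames, dialog, question):
    """Stand-in for a real multimodal model; it only reports its context size."""
    return f"turn {len(dialog) + 1}: saw {len(frames)} frames"


events = [
    Event(0.0, "frame", "f0"),
    Event(1.0, "frame", "f1"),
    Event(2.0, "frame", "f2"),
    Event(2.5, "user", "Which option should I pick on this kiosk?"),
    Event(3.0, "frame", "f3"),
    Event(4.0, "user", "Which train do I take from here?"),
]
for time_s, reply in run_streaming_session(events, stub_model):
    print(time_s, reply)
# 2.5 turn 1: saw 3 frames
# 4.0 turn 2: saw 4 frames
```

Passing the accumulated `dialog` to every call is what lets a second question such as "Which train do I take from here?" be resolved against the earlier kiosk exchange.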
Given 10+ minute egocentric videos, answer questions posed at the end, via temporal grounding and recall of events spanning the full video duration.
🧑: How many sets of workouts have I done so far?
👓: Ten sets total: three each of push-ups, bicep curls, and lateral raises, plus one set of front raises.
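Temporal grounding is commonly assessed by comparing a predicted time interval with an annotated one using temporal intersection-over-union (tIoU). The sketch below computes that quantity; whether this challenge uses tIoU is an assumption, and `temporal_iou` and the example intervals are illustrative only.

```python
def temporal_iou(pred, ref):
    """Temporal IoU between two (start_s, end_s) intervals, in seconds."""
    (pred_start, pred_end), (ref_start, ref_end) = pred, ref
    if pred_end < pred_start or ref_end < ref_start:
        raise ValueError("each interval must satisfy start <= end")
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: the model grounds an event at 30-90 s,
# while the annotation places it at 60-120 s.
print(round(temporal_iou((30.0, 90.0), (60.0, 120.0)), 3))
# 0.333
```

A counting question like the workout example would need one such grounded interval per set, which is why recall over the full video duration matters.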
To facilitate this challenge, we are releasing three newly annotated, large-scale Wearable AI collections that are the first to combine long-duration egocentric video with real user–assistant conversations and proactive AI annotations.
| Dataset | Egocentric | Long QA | User-AI Conv. | Proactive |
|---|---|---|---|---|
| EPIC-KITCHENS | ✓ | ✗ | ✗ | ✗ |
| EgoSchema | ✓ | ✓ | ✗ | ✗ |
| Ego-Exo4D | ✓ | ✓ | ✗ | ✗ |
| Ego4D | ✓ | ✓ | ✗ | ✗ |
| Wearable AI | ✓ | ✓ | ✓ | ✓ |
Register below to participate in the workshop. Individual registrations are open to all researchers and students. If you are participating in the Grand Challenge with a team, provide your team name in the registration form.
Every registered participant will receive our Participant Toolkit: standardized data loaders, plus example inference and evaluation scripts for a suite of MLLM baseline models.
To use the dataset, please use the following citation:
Invited Talks

Leading researcher in computer vision, renowned for her pivotal role in the development and leadership of the Ego4D and Ego-Exo4D projects.

Distinguished professor in computer vision, renowned for founding and leading the EPIC-KITCHENS dataset. Expert in egocentric video analysis and action recognition.

Distinguished Engineer at Meta Reality Labs, leading Generative AI and Wearable AI. Frequently invited keynote speaker at ECCV, CVPR, and ICCV. Work featured in BBC, TIME, and MIT Technology Review.
Additional keynote speakers to be announced.
Workshop Committee
The organizing committee brings together world-class researchers from Meta Reality Labs, the University of Edinburgh, HKUST, Georgia Tech, and UCF — spanning multimodal AI, computer vision, robotics, and conversational systems.
📧 Contact: wearable.ai.workshop at gmail.com

Research Scientist/Technical Lead at Meta Reality Labs. Work focuses on large vision encoders and multimodal foundation models for real-time perception and conversational AI on wearables.

Tech Lead at Meta Reality Labs, driving AI modeling and agentic systems for multi-modal proactive assistants. His current work focuses on self-improvement methods in multi-agent systems for wearables.

Lead AI Research Scientist at Meta Reality Labs. Ph.D. in ML and Language Technologies from CMU. Organized workshops at AAAI, KDD, ICASSP, and NeurIPS.

Distinguished Engineer at Meta Reality Labs, leading Generative AI and Wearable AI. Frequently invited keynote speaker at ECCV, CVPR, and ICCV. Work featured in BBC, TIME, and MIT Technology Review.

Assistant Professor, ELLIS member and GAIL Fellow. Research focuses on multimodal generative AI for robotics. Collaborated with Amazon Alexa AI and the European Space Agency.

Founding professor of the Robotics Perception and Learning lab at Georgia Tech. Research includes robust finetuning of VLMs, open-world generalization, and Vision-Language-Action models.

Professor at HKUST directing research in conversational AI, NLP, and human-robot interaction. IEEE Fellow. Pioneering work in multilingual speech and language technologies.

Founding director of the Center for Research in Computer Vision and Trustee Chair Professor of CS at UCF. Fellow of NAI, IEEE, AAAS, IAPR and SPIE.

Tech Lead at Meta Reality Labs | Multimodal, Agentic & Proactive AI | Ex-Cisco AI, Columbia University

Tech Lead at Meta Reality Labs | Ex-AWS AI | PhD at University of Toronto
Schedule
Full-day workshop with 4 invited talks, 2 oral presentations, 2 grand challenge talks, 1 poster session, and a panel discussion. The schedule below is tentative and subject to change.
Timeline
All deadlines are at 11:59 PM Anywhere on Earth (AoE, UTC-12).
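Anywhere on Earth corresponds to a fixed offset of UTC-12, so an 11:59 PM AoE deadline falls at 11:59 AM UTC on the following calendar day. The snippet below is a small sketch of that conversion; the helper name `aoe_deadline_to_utc` and the date used are placeholders, not actual workshop deadlines.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth is a fixed UTC-12 offset with no daylight saving time.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_to_utc(year, month, day):
    """Return the UTC instant of 11:59 PM AoE on the given calendar date."""
    local = datetime(year, month, day, 23, 59, tzinfo=AOE)
    return local.astimezone(timezone.utc)


# Placeholder date for illustration only, not a real deadline.
print(aoe_deadline_to_utc(2030, 1, 15).isoformat())
# 2030-01-16T11:59:00+00:00
```

Calling `.astimezone()` with your own time zone on the returned value gives the local cutoff.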