Towards Real-time Multimodal Contextual Assistants
Advancing egocentric vision, proactive AI, and long-context multimodal understanding for next-generation wearable devices.
About the Workshop
🏆 Your chance to work with the largest wearables dataset ever released — 5,000+ hours of real-world egocentric video — and compete for $21K in total cash prizes across three grand challenges that define the frontier of Wearable AI.
Imagine an AI that watches the world through your eyes, remembers your entire day, and proactively helps before you even ask. That future is being built right now — and this workshop is where the key breakthroughs will be discussed. If you work on egocentric video, multimodal LLMs, conversational AI, or on-device inference, this is your workshop.
Three things make this workshop stand out: (1) it brings together researchers from egocentric vision, long-form video Q&A, and conversational AI in one place; (2) it introduces high-impact benchmark tasks — proactive AI and streaming multi-turn dialog — that push the frontier of Wearable AI; (3) every participant gets a ready-to-use Participant Toolkit and baseline models to hit the ground running. Come ready to collaborate at one of the most vibrant sessions at ECCV.
Research Areas
The workshop welcomes submissions on the following topics, among others:
Grand Challenge
Wearable AI is one of the most exciting and challenging frontiers in computer vision and AI today. Yet progress has been hampered by a lack of large-scale, realistic benchmarks that reflect the complexity of real-world egocentric experiences — long-duration video, proactive AI interactions, and multi-turn conversations. This grand challenge is designed to close that gap: to catalyze the community around concrete, measurable tasks and to reward reproducible, open-sourced breakthroughs.
In addition to general research paper submissions, our workshop features three specific grand challenge tasks. A total of $21,000 in cash prizes will be awarded to winners with reproducible, open-sourced results. 90% of the data is released as train+dev, with 10% kept hidden for evaluation.
Given a long egocentric video and user requests, predict proactive engagements (e.g., step-by-step instructions, object finding) at appropriate moments to maximize utility while minimizing disruption.
Given a long egocentric video interleaved with user–assistant interactions, predict answers to user questions in a streaming fashion, maintaining coherence with past context.
Given 30+ minute egocentric videos, answer questions posed at the end, which requires temporal grounding and recall of events spanning the full video.
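The three tasks above share a common streaming structure: timestamped events (video frames and user turns) arrive in order, and at each step the assistant must stay silent, answer a user turn, or engage proactively. The sketch below is a minimal illustration of that framing; the `Event` and `StreamingAssistant` names and the trivial cooldown policy are our own assumptions for exposition, not part of the challenge API.

```python
# Illustrative only: a toy streaming-assistant loop matching the task
# descriptions above. All names here are hypothetical, not a released API.
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Event:
    """One timestamped item from the egocentric stream."""
    timestamp_s: float
    frame: object = None                   # decoded video frame (placeholder type)
    user_utterance: Optional[str] = None   # present only on dialog turns


class StreamingAssistant:
    """Toy policy: answer user turns, and engage proactively at most once
    per `cooldown_s` seconds to limit disruption."""

    def __init__(self, cooldown_s: float = 60.0):
        self.cooldown_s = cooldown_s
        self.last_engagement_s = float("-inf")

    def step(self, event: Event) -> Optional[str]:
        if event.user_utterance is not None:
            # Reactive path (Tasks 2 and 3): answer using accumulated context.
            return f"[answer to: {event.user_utterance!r}]"
        if event.timestamp_s - self.last_engagement_s >= self.cooldown_s:
            # Proactive path (Task 1): decide whether an interjection is useful.
            self.last_engagement_s = event.timestamp_s
            return "[proactive suggestion]"
        return None  # stay silent


def run(assistant: StreamingAssistant, stream: Iterable[Event]) -> None:
    """Consume events in timestamp order and print any assistant responses."""
    for event in stream:
        response = assistant.step(event)
        if response is not None:
            print(f"{event.timestamp_s:8.1f}s  {response}")
```

In a real system, a learned policy would replace the cooldown heuristic, scoring candidate engagements against the user's ongoing activity to balance utility against disruption.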
To facilitate this challenge, we are releasing the largest wearable AI dataset ever collected — three newly annotated, large-scale collections that are the first to combine long-duration egocentric video with real user–assistant conversations and proactive AI annotations.
| Dataset | Egocentric | Long QA | User-AI Conv. | Proactive | # Hours |
|---|---|---|---|---|---|
| EPIC-KITCHENS | ✓ | ✗ | ✗ | ✗ | 0.1K |
| EgoSchema | ✓ | ✓ | ✗ | ✗ | 0.3K |
| Ego-Exo4D | ✓ | ✓ | ✗ | ✗ | 1.2K |
| Ego4D | ✓ | ✓ | ✗ | ✗ | 3.5K |
| Wearable AI | ✓ | ✓ | ✓ | ✓ | 5K |
Every registered participant will receive our Participant Toolkit: standardized data loaders, evaluation scripts, and a suite of MLLM baseline models covering diverse architectures.
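As a rough picture of how the toolkit might be used, the sketch below shows a generic predict-and-submit loop over a released split. Everything named here (the dummy model, field names like `video_path` and `dialog`, and the commented loader and evaluation calls) is an assumption for illustration; consult the toolkit documentation for the actual interfaces.

```python
# Hypothetical usage sketch: the module, function, and field names below are
# placeholders invented for illustration, not the toolkit's actual API.
import json


def dummy_model(video_path: str, dialog: list) -> str:
    """Stand-in for one of the provided MLLM baselines; returns a fixed answer."""
    return "I don't know."


def predict_all(examples: list) -> dict:
    """Run a model over every example in a split and collect its predictions."""
    predictions = {}
    for ex in examples:
        # Each example is assumed to pair a long egocentric video with its
        # interleaved user-assistant turns; these field names are placeholders.
        predictions[ex["id"]] = dummy_model(ex["video_path"], ex["dialog"])
    return predictions


if __name__ == "__main__":
    # With the released toolkit, one would presumably load a split and score
    # predictions with the provided evaluation scripts, e.g.
    #   examples = load_split("dev")               # placeholder loader name
    #   print(evaluate(predict_all(examples)))     # placeholder metric call
    examples = [{"id": "ex0", "video_path": "day1.mp4", "dialog": []}]
    with open("submission.json", "w") as f:
        json.dump(predict_all(examples), f)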
Register below to participate in the workshop. Individual registrations are open to all researchers and students. If you are participating in the Grand Challenge with a team, provide your team name in the registration form.
Invited Talks

Leading researcher in computer vision, renowned for her pivotal role in developing and leading the Ego4D and Ego-Exo4D projects.

Distinguished professor in computer vision, renowned for creating and leading the EPIC-KITCHENS dataset. Expert in egocentric video analysis and action recognition.

Distinguished Engineer at Meta Reality Labs, leading Generative AI and Wearable AI. Frequently invited keynote speaker at ECCV, CVPR, and ICCV. Work featured in BBC, TIME, and MIT Technology Review.
Additional keynote speakers to be announced.
Workshop Committee
The organizing committee brings together world-class researchers from Meta Reality Labs, the University of Edinburgh, HKUST, Georgia Tech, and UCF — spanning multimodal AI, computer vision, robotics, and conversational systems.

Research Scientist/Technical Lead at Meta Reality Labs. Work focuses on large vision encoders and multimodal foundation models for real-time perception and conversational AI on wearables.

Tech Lead at Meta Reality Labs, driving AI modeling and agentic systems for multimodal proactive assistants. His current work focuses on self-improvement methods in multi-agent systems for wearables.

Lead AI Research Scientist at Meta Reality Labs. Ph.D. in ML and Language Technologies from CMU. Organized workshops at AAAI, KDD, ICASSP, and NeurIPS.

Distinguished Engineer at Meta Reality Labs, leading Generative AI and Wearable AI. Frequently invited keynote speaker at ECCV, CVPR, and ICCV. Work featured in BBC, TIME, and MIT Technology Review.

Assistant Professor, ELLIS member, and GAIL Fellow. Research focuses on multimodal generative AI for robotics. Has collaborated with Amazon Alexa AI and the European Space Agency.

Professor at Georgia Tech and founder of the Robotics Perception and Learning lab. Research includes robust finetuning of VLMs, open-world generalization, and Vision-Language-Action models.

Professor at HKUST directing research in conversational AI, NLP, and human-robot interaction. IEEE Fellow. Pioneering work in multilingual speech and language technologies.

Founding director of the Center for Research in Computer Vision and Trustee Chair Professor of CS at UCF. Fellow of NAI, IEEE, AAAS, IAPR and SPIE.
Schedule
Full-day workshop with 4 invited talks, 2 oral presentations, 2 grand challenge talks, 1 poster session, and a panel discussion. The schedule below is tentative and subject to change.
Timeline