🕑 ECCV 2026 Workshop

Wearable AI Workshop

Towards Real-time Multimodal Contextual Assistants

Advancing egocentric vision, proactive AI, and long-context multimodal understanding for next-generation wearable devices.

📅 September 2026
📍 Malmö, Sweden
🏆 Cash Prizes
🏆 Grand Challenge with a 35K+ Video Dataset
Register Now 🏆 View Challenges 💾 Explore Dataset

Building the Future of Wearable AI

🏆 Your chance to work with the largest wearables dataset ever released — 5,000+ hours of real-world egocentric video — and compete for $21K in total cash prizes across three grand challenges that define the frontier of Wearable AI.

What You Will Experience

Imagine an AI that watches the world through your eyes, remembers your entire day, and proactively helps before you even ask. That future is being built right now — and this workshop is where the key breakthroughs will be discussed. If you work on egocentric video, multimodal LLMs, conversational AI, or on-device inference, this is your workshop.

Why You Should Participate

Three things make this workshop stand out: (1) it brings together researchers from egocentric vision, long-form video Q&A, and conversational AI in one place; (2) it introduces high-impact benchmark tasks — proactive AI and streaming multi-turn dialog — that push the frontier of Wearable AI; (3) every participant gets a ready-to-use Participant Toolkit and baseline models to hit the ground running. Come ready to collaborate at one of the most vibrant sessions at ECCV.

Topics of Interest

The workshop welcomes submissions on the following topics, among others:

Long-Context & Real-Time Interactions: Addressing memory bottlenecks (e.g., the KV cache) for sustained dialog about past visual content; enabling low-latency, streaming responsiveness and continual answer revision as new visual frames arrive (a minimal KV-cache sketch follows this list).
Proactive AI Systems: Building agents that autonomously anticipate and act on user needs based on evolving visual context, moving beyond traditional user-initiated requests.
Persistent Scene and Object Memory from Video: Learning unified, time-aware representations of the user's environment from egocentric video to support spatio-temporal reasoning, object memory, and retrieval.
Efficient AI/Edge Computing for Wearable Devices: Building small models that operate within the strict power and compute budgets of wearable devices, using techniques such as model compression, token compression, and quantization.
Multi-Modal Foundation Architectures: Developing state-of-the-art architectures specialized for egocentric video understanding with grounded conversation, multi-modal reasoning, contextual object recognition, real-time action prediction, and personalized user assistance.
Auxiliary Sensor Signals from Wearables: Learning contextualized representations from diverse wearable sensors (e.g., EMG, IMU motion sensors) that augment the primary vision signal across assistant use cases.
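To make the memory-bottleneck point in the long-context topic concrete, the sketch below shows how per-frame keys and values accumulate during streaming video dialog and how a simple sliding window keeps the cache within a fixed budget. It is an illustration only: the window policy, token counts, and tensor shapes are assumptions made for the example, not part of any workshop baseline.

```python
# Minimal sketch: bound KV-cache growth during streaming video dialog by
# evicting the oldest frames. All constants and shapes are illustrative.
from collections import deque

import torch

TOKENS_PER_FRAME = 64      # visual tokens produced per frame (assumed)
MAX_CACHED_TOKENS = 8192   # memory budget for cached keys/values (assumed)
NUM_HEADS, HEAD_DIM = 16, 64


class SlidingWindowKVCache:
    """Keep only the most recent frames' keys/values within a token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.keys: deque[torch.Tensor] = deque()    # (heads, tokens, head_dim) per frame
        self.values: deque[torch.Tensor] = deque()

    def append_frame(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest frames once the token budget is exceeded.
        while sum(t.shape[1] for t in self.keys) > self.max_tokens:
            self.keys.popleft()
            self.values.popleft()

    def as_tensors(self) -> tuple[torch.Tensor, torch.Tensor]:
        return torch.cat(list(self.keys), dim=1), torch.cat(list(self.values), dim=1)


cache = SlidingWindowKVCache(MAX_CACHED_TOKENS)
for _ in range(1000):  # stream 1,000 frames; the cache stays bounded
    k = torch.randn(NUM_HEADS, TOKENS_PER_FRAME, HEAD_DIM)
    v = torch.randn(NUM_HEADS, TOKENS_PER_FRAME, HEAD_DIM)
    cache.append_frame(k, v)

keys, values = cache.as_tensors()
print(keys.shape)  # torch.Size([16, 8192, 64]) instead of growing to 64,000 tokens
```

A fixed window is only the simplest policy; submissions exploring token compression, summarization, or learned eviction of older visual context are equally in scope.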

Wearable AI Grand Challenge

Wearable AI is one of the most exciting and challenging frontiers in computer vision and AI today. Yet progress has been hampered by a lack of large-scale, realistic benchmarks that reflect the complexity of real-world egocentric experiences — long-duration video, proactive AI interactions, and multi-turn conversations. This grand challenge is designed to close that gap: to catalyze the community around concrete, measurable tasks and to reward reproducible, open-sourced breakthroughs.

In addition to general research paper submissions, our workshop features three specific grand challenge tasks. A total of $21,000 in cash prizes will be awarded to winners with reproducible, open-sourced results. 90% of the data is released as train+dev, with 10% kept hidden for evaluation.

Challenge 1

Proactive AI

Given a long egocentric video and user requests, predict proactive engagements (e.g., step-by-step instructions, object finding) at appropriate moments to maximize utility while minimizing disruption (an illustrative scoring sketch follows the challenge cards).

3.5K videos · Interruption P/R
🎬 Sample Clips
Proactive sample 1 Proactive sample 2 Proactive sample 3
Challenge 2

Multi-turn Conversations

Given a long egocentric video interleaved with user–assistant interactions, predict answers to user questions in a streaming fashion, maintaining coherence with past context.

12K videos · Response Accuracy
🎬 Sample Clips
Multi-turn sample 1 Multi-turn sample 2 Multi-turn sample 3
Challenge 3

Long Video Q&A

Given egocentric videos of 30+ minutes, answer questions posed at the end by temporally grounding and recalling events that span the full video duration.

20K videos · Response Accuracy
🎬 Sample Clips
Long Video Q&A sample 1 Long Video Q&A sample 2 Long Video Q&A sample 3
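The exact evaluation protocol for each challenge will be released together with the dataset and toolkit. Purely as an illustration of how Challenge 1's interruption precision/recall could be computed, the sketch below greedily matches predicted engagement timestamps to annotated ones within a tolerance window; the tolerance value, the matching rule, and the function name are assumptions for the example, not the official metric.

```python
# Illustrative scoring sketch for proactive engagements: a prediction counts as
# a true positive if it falls within `tolerance` seconds of an unmatched
# annotated moment. The 5-second tolerance and greedy matching are assumptions.
def interruption_precision_recall(
    predicted: list[float],   # predicted engagement times, in seconds
    reference: list[float],   # annotated engagement times, in seconds
    tolerance: float = 5.0,
) -> tuple[float, float]:
    matched_refs: set[int] = set()
    true_positives = 0
    for p in sorted(predicted):
        # Greedily match each prediction to the nearest unmatched annotation.
        candidates = [
            (abs(p - r), i)
            for i, r in enumerate(reference)
            if i not in matched_refs and abs(p - r) <= tolerance
        ]
        if candidates:
            _, best = min(candidates)
            matched_refs.add(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Three predictions against four annotated moments: every prediction lands near
# an annotation (precision 1.0), but one annotated moment is missed (recall 0.75).
print(interruption_precision_recall([12.0, 45.5, 301.0], [10.0, 48.0, 120.0, 300.0]))
```

Greedy nearest-match is used here for brevity; an optimal one-to-one assignment (e.g., Hungarian matching) is another common choice when predictions and annotations are dense.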

💾 Challenge Dataset

To facilitate this challenge, we are releasing the largest wearable AI dataset ever collected — three newly annotated, large-scale collections that are the first to combine long-duration egocentric video with real user–assistant conversations and proactive AI annotations.

Dataset Egocentric Long QA User-AI Conv. Proactive # Hours
EPIC-KITCHENS 0.1K
EgoSchema 0.3K
EgoExo 1.2K
Ego4D 3.5K
Wearable AI 5K

🔧 Participant Toolkit

Every registered participant will receive our Participant Toolkit: standardized data loaders, evaluation scripts, and a suite of MLLM baseline models covering diverse architectures.

Coming Soon — Dataset and Toolkit will be released in May 2026. Register below to get notified.

🏁 Participate

Register below to participate in the workshop. Individual registrations are open to all researchers and students. If you are participating in the Grand Challenge with a team, provide your team name in the registration form.

Register Now

Confirmed Keynote Speakers

Kristen Grauman
Prof. Kristen Grauman
UT Austin
Professor, Computer Vision Group

Leading researcher in computer vision, renowned for her pivotal role in the development and leadership of the Ego4D and EgoExo projects.

Dima Damen
Prof. Dima Damen
University of Bristol, Google DeepMind
Professor of Computer Vision

Distinguished professor in computer vision, renowned for creating and leading the EPIC-KITCHENS dataset. Expert in egocentric video analysis and action recognition.

Raffay Hamid
Dr. Raffay Hamid
Meta Reality Labs
Distinguished Engineer, GenAI

Distinguished Engineer at Meta Reality Labs, leading Generative AI and Wearable AI. Frequently invited keynote speaker at ECCV, CVPR, and ICCV. Work featured in BBC, TIME, and MIT Technology Review.

Additional keynote speakers to be announced.

Organizers

The organizing committee brings together world-class researchers from Meta Reality Labs, the University of Edinburgh, HKUST, Georgia Tech, and UCF — spanning multimodal AI, computer vision, robotics, and conversational systems.

Tuyen (Harry) Tran
Dr. Tuyen (Harry) Tran
Meta Reality Labs
Grand Challenge Chair

Research Scientist/Technical Lead at Meta Reality Labs. Work focuses on large vision encoders and multimodal foundation models for real-time perception and conversational AI on wearables.

Maxim Arap
Dr. Maxim Arap
Meta Reality Labs
Publicity Chair

Tech Lead at Meta Reality Labs, driving AI modeling and agentic systems for multi-modal proactive assistants. His current work focuses on self-improvement methods in multi-agent systems for wearables.

Seungwhan Moon
Dr. Seungwhan Moon
Meta Reality Labs
Primary Contact

Lead AI Research Scientist at Meta Reality Labs. Ph.D. in ML and Language Technologies from CMU. Organized workshops at AAAI, KDD, ICASSP, and NeurIPS.

Raffay Hamid
Dr. Raffay Hamid
Meta Reality Labs
General Chair

Distinguished Engineer at Meta Reality Labs, leading Generative AI and Wearable AI. Frequently invited keynote speaker at ECCV, CVPR, and ICCV. Work featured in BBC, TIME, and MIT Technology Review.

Alessandro Suglia
Dr. Alessandro Suglia
University of Edinburgh
Publication Chair

Assistant Professor, ELLIS member and GAIL Fellow. Research focuses on multimodal generative AI for robotics. Collaborated with Amazon Alexa AI and European Space Agency.

Zsolt Kira
Prof. Zsolt Kira
Georgia Institute of Technology
Organizer

Founding professor of the Robotics Perception and Learning lab at Georgia Tech. Research includes robust finetuning of VLMs, open-world generalization, and Vision-Language-Action models.

Pascale Fung
Prof. Pascale Fung
HKUST, AMI Labs
Organizer

Professor at HKUST directing research in conversational AI, NLP, and human-robot interaction. IEEE Fellow. Pioneering work in multilingual speech and language technologies.

Mubarak Shah
Prof. Mubarak Shah
University of Central Florida
Organizer

Founding director of the Center for Research in Computer Vision and Trustee Chair Professor of CS at UCF. Fellow of NAI, IEEE, AAAS, IAPR and SPIE.

Workshop Program (Tentative)

Full-day workshop with 4 invited talks, 2 oral presentations, 2 grand challenge talks, 1 poster session, and a panel discussion. The schedule below is tentative and subject to change.

09:00 – 09:15
Opening Remarks
Workshop overview and introduction
09:15 – 09:55
Invited Talk 1
Keynote Speaker TBA (40 min)
09:55 – 10:35
Invited Talk 2
Keynote Speaker TBA (40 min)
10:35 – 11:00
Coffee Break
11:00 – 11:40
Invited Talk 3
Keynote Speaker TBA (40 min)
11:40 – 12:20
Invited Talk 4
Keynote Speaker TBA (40 min)
12:20 – 13:30
Lunch Break
13:30 – 13:50
Oral Presentation 1
Accepted paper (20 min)
13:50 – 14:10
Oral Presentation 2
Accepted paper (20 min)
14:10 – 14:30
Grand Challenge Talk 1
Proactive AI challenge results (20 min)
14:30 – 14:50
Grand Challenge Talk 2
Multi-turn & LongQA results (20 min)
14:50 – 15:20
Coffee Break
15:20 – 16:20
Poster Session
Accepted papers and challenge submissions (60 min)
16:20 – 17:20
Panel Discussion
Live Q&A with all speakers and organizers (60 min)
17:20 – 17:30
Closing Remarks & Awards
Prize announcements

Important Dates