Transform egocentric factory video into robot-ready training data
Ego2Robot is an open-source pipeline that converts egocentric human demonstrations into LeRobot-compatible datasets for robot foundation model training.
- Real manufacturing data from 10,000 hours of factory work
- Intelligent curation with motion + hand visibility filtering
- Unsupervised skill discovery via VideoMAE embeddings + clustering
- LeRobot v3 format with observations + pseudo-actions
- Rich annotations including zero-shot labels and quality scores
- Reusable pipeline for any egocentric video dataset
git clone https://github.com/msunbot/ego2robot.git
cd ego2robot
pip install -r requirements.txt

from ego2robot.data.sampler import EgocentricSampler
from ego2robot.data.clips import ClipExtractor
# Load and process video
sampler = EgocentricSampler(config)
extractor = ClipExtractor(config)
for video in sampler.filter_videos():
    clips = extractor.extract_clips(video['video_bytes'], video['metadata'])
    # Process clips...

from datasets import load_dataset
ds = load_dataset("msunbot1/ego2robot-factory-episodes")
for episode in ds:
    images = episode['observation.images.top']
    actions = episode['action']
    # Your code here

50 curated episodes of factory manipulation tasks:
- Quality Inspection: 50% (25 episodes)
- Assembly: 18% (9 episodes)
- Fastening: 16% (8 episodes)
- Machine Operation: 8% (4 episodes)
- Mixed: 8% (4 episodes)
Format: LeRobot v3 with:
- Observations: RGB (360x640@6fps) + hand bounding boxes
- Actions: 2D hand motion vectors (pseudo-actions)
- Metadata: Skill clusters, quality scores, zero-shot labels
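As a quick sanity check, the sketch below shows one way to browse these fields after loading from the Hub. The split name and the metadata column name used for filtering (skill_cluster) are assumptions for illustration; check ds.column_names for the actual schema.

from datasets import load_dataset

# Minimal sketch for browsing the released episodes. The split name and the
# metadata column name used below are assumptions, not the guaranteed schema.
ds = load_dataset("msunbot1/ego2robot-factory-episodes", split="train")
print(ds.column_names)                      # inspect the actual field names

sample = ds[0]
image = sample["observation.images.top"]    # RGB observation, 360x640 @ 6 fps
action = sample["action"]                   # 2D hand-motion pseudo-action

# Example: keep only episodes from one skill cluster (hypothetical column name).
if "skill_cluster" in ds.column_names:
    cluster0 = ds.filter(lambda ep: ep["skill_cluster"] == 0)
    print(len(cluster0), "episodes in cluster 0")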
View Dataset on Hugging Face
┌───────────────────────────────┐
│ Egocentric-10K (10,000 hours) │
└───────────────────────────────┘
               ↓
┌──────────────────────┐
│ Quality Filtering    │
│ - Motion scoring     │
│ - Hand detection     │
└──────────────────────┘
               ↓
┌──────────────────────┐
│ Feature Extraction   │
│ - VideoMAE (768-dim) │
│ - CLIP labels        │
└──────────────────────┘
               ↓
┌──────────────────────┐
│ Skill Clustering     │
│ - K-means (k=10)     │
│ - t-SNE viz          │
└──────────────────────┘
               ↓
┌──────────────────────┐
│ LeRobot Export       │
│ - Hand tracking      │
│ - Pseudo-actions     │
└──────────────────────┘
               ↓
50 Robot-Ready Episodes
ego2robot/
├── data/
│   ├── sampler.py             # Stream videos from HF
│   ├── clips.py               # Extract 6s clips
│   ├── quality.py             # Motion + hand filtering
│   └── storage.py             # Save curated clips
├── vision/
│   ├── motion.py              # Motion scoring
│   ├── hands.py               # Hand detection
│   ├── videomae.py            # Video embeddings
│   ├── clip_text.py           # Zero-shot labeling
│   └── hand_tracker.py        # Trajectory extraction
├── skills/
│   └── cluster.py             # K-means clustering
├── export/
│   └── lerobot_builder.py     # LeRobot format
└── examples/
    ├── day5_build_dataset.py           # Full pipeline
    ├── day12_build_lerobot_dataset.py
    └── day17_training_demo.py          # Validation
python examples/day5_build_dataset.py

Outputs: 50-100 high-quality clips in data/ego2robot_dataset/
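The quality filter behind this step scores each clip for motion and hand visibility. Below is a minimal sketch of that idea, assuming OpenCV frame differencing and MediaPipe hand detection, and reusing the 0.15 / 0.30 thresholds reported under the validation numbers further down; the actual data/quality.py implementation may differ.

import cv2
import numpy as np
import mediapipe as mp

# Hedged sketch of clip-level quality filtering: mean frame-difference as a
# motion score, and the fraction of frames with a detected hand as visibility.
# Thresholds mirror the reported 0.15 / 0.30 cutoffs.
MOTION_THRESHOLD = 0.15
HAND_VIS_THRESHOLD = 0.30

def motion_score(frames):
    """Average normalized absolute difference between consecutive gray frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    diffs = [np.mean(cv2.absdiff(a, b)) / 255.0 for a, b in zip(grays, grays[1:])]
    return float(np.mean(diffs)) if diffs else 0.0

def hand_visibility(frames):
    """Fraction of frames in which MediaPipe detects at least one hand."""
    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        hits = sum(
            1 for f in frames
            if hands.process(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)).multi_hand_landmarks
        )
    return hits / max(len(frames), 1)

def keep_clip(frames):
    return (motion_score(frames) > MOTION_THRESHOLD
            and hand_visibility(frames) > HAND_VIS_THRESHOLD)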
python examples/day9_extract_all_embeddings.py
python examples/day10_add_all_labels.py
python examples/day11_cluster_skills.py

Outputs: Embeddings, labels, and cluster IDs
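Under the hood, these steps pool VideoMAE features into a 768-dim embedding per clip and run K-means with k=10. The sketch below assumes the MCG-NJU/videomae-base checkpoint and 16-frame clips, which may not match the exact settings in vision/videomae.py and skills/cluster.py.

import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel
from sklearn.cluster import KMeans

# Hedged sketch: embed each clip with VideoMAE, mean-pool to a 768-dim vector,
# then cluster. Checkpoint name and frame count are assumptions.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base").eval()

def embed_clip(frames):
    """frames: 16 RGB frames as (H, W, 3) uint8 arrays -> (768,) embedding."""
    inputs = processor(list(frames), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()      # mean-pool over tokens

# Dummy clips just to show shapes; real clips come from the curated dataset.
clips = [np.random.randint(0, 255, (16, 224, 224, 3), dtype=np.uint8) for _ in range(20)]
embeddings = np.stack([embed_clip(c) for c in clips])
cluster_ids = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)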
python examples/day12_build_lerobot_dataset.py

Outputs: 50 episodes in data/lerobot_dataset/
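The exported pseudo-actions are 2D hand motion vectors. A minimal sketch of how such vectors can be derived from per-frame hand bounding boxes (frame-to-frame displacement of the box center, normalized by image size) is shown below; the function and field names are illustrative, not the exact lerobot_builder.py API.

import numpy as np

def pseudo_actions_from_boxes(hand_boxes, img_w=640, img_h=360):
    """
    hand_boxes: (T, 4) per-frame (x_min, y_min, x_max, y_max) in pixels.
    Returns (T-1, 2) normalized 2D motion vectors of the box center, which
    serve as pseudo-actions for each timestep.
    """
    boxes = np.asarray(hand_boxes, dtype=np.float32)
    centers = np.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2.0, (boxes[:, 1] + boxes[:, 3]) / 2.0],
        axis=1,
    )
    deltas = np.diff(centers, axis=0)            # frame-to-frame displacement
    return deltas / np.array([img_w, img_h])     # normalize by image size

# Example: a hand drifting right and slightly down across three frames.
boxes = [(100, 150, 160, 210), (110, 152, 170, 212), (122, 156, 182, 216)]
print(pseudo_actions_from_boxes(boxes))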
python examples/day14_upload_to_hf.py

- Motion score: 0.168 avg (>0.15 threshold)
- Hand visibility: 0.421 avg (>0.30 threshold)
- Cluster separation: Clear in t-SNE visualization
- Training demo: Converged MSE loss
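The training demo referenced above regresses the 2D pseudo-action from the RGB observation with an MSE objective. Below is a hedged approximation of that setup using a toy CNN and a dummy batch at the dataset's resolution; the actual day17_training_demo.py architecture and data loading may differ.

import torch
import torch.nn as nn

# Hedged sketch of the validation-style training demo: a tiny CNN regressing
# the 2D hand-motion pseudo-action from an RGB observation with MSE loss.
# Architecture and batching are assumptions, not the actual day17 script.
policy = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),                 # 2D pseudo-action output
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_step(images, actions):
    """images: (B, 3, H, W) float tensor; actions: (B, 2) pseudo-action targets."""
    loss = nn.functional.mse_loss(policy(images), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Dummy batch at the dataset's 360x640 resolution, just to show the shapes.
print(train_step(torch.rand(4, 3, 360, 640), torch.rand(4, 2)))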
10 fine-grained clusters mapping to 5 high-level actions:
- Quality Inspection (6 variants) - 30 clips
- Assembly (2 variants) - 10 clips
- Fastening - 10 clips
- Machine Operation - 5 clips
- Mixed - 5 clips
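In practice, the fine-grained cluster IDs above can be collapsed into the five high-level actions with a simple lookup, as sketched below. The specific ID-to-action assignments are illustrative placeholders; the real assignments depend on the K-means run.

# Illustrative mapping from K-means cluster IDs (0-9) to high-level actions.
# The actual assignments come from the clustering run; these indices are
# placeholders, not the released labels.
CLUSTER_TO_ACTION = {
    0: "quality_inspection", 1: "quality_inspection", 2: "quality_inspection",
    3: "quality_inspection", 4: "quality_inspection", 5: "quality_inspection",
    6: "assembly", 7: "assembly",
    8: "fastening",
    9: "machine_operation",
}

def high_level_action(cluster_id):
    """Collapse a fine-grained cluster ID to one of the five high-level actions."""
    return CLUSTER_TO_ACTION.get(cluster_id, "mixed")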
- VLA pretraining: Diverse visual data for models like π0
- Representation learning: Learn manipulation primitives
- Skill discovery: Study unsupervised clustering approaches
- Domain adaptation: Manufacturing → other domains
- Custom datasets: Process your factory video
- Robot training: Fine-tune policies on domain-specific data
- Quality control: Automated task recognition
We welcome contributions! Areas of interest:
- Additional domains (warehouses, kitchens, etc.)
- Depth estimation integration
- Improved action generation (3D trajectories)
- Evaluation benchmarks
- Documentation improvements
See CONTRIBUTING.md for guidelines.
If you use this dataset or code, please cite:
@software{ego2robot2025,
author = {Michelle Sun},
title = {Ego2Robot: Egocentric Factory Episodes for Robot Learning},
year = {2025},
url = {https://github.com/msunbot/ego2robot}
}

- Code: MIT License
- Dataset: Apache 2.0 (inherits from Egocentric-10K)
- BuildAI for Egocentric-10K dataset
- Hugging Face LeRobot for format standards
- Physical Intelligence for π0 inspiration
- Open-source community for VideoMAE, CLIP, MediaPipe
Michelle Sun
- LinkedIn: linkedin.com/in/sunmichelle
- Twitter: @michellelsun
- Email: michelle@aetherone.xyz
Interested in:
- Collaborations on Physical AI data & ecosystem
- Advisory & angel investing opportunities in robotics/AI
- Dataset on Hugging Face
- Blog Post
- Project Roadmap
Built with ❤️ for the robotics community