Web-served, SAM3-driven semi-automatic segmentation pipeline for RoboCup vision. Captures clips from a RealSense (or a folder of frames), seeds masks from text prompts via SAM3, propagates them through the clip with the SAM3 video tracker, lets you review/edit in a browser, then exports a YOLO-seg dataset and trains/tests a YOLO11 model — all without leaving the SPA.
The legacy CLI scripts (`yolo_tuning/`) still work for the original cv2-window flow; new work happens through the web UI under `web/` + `server/`.
- Python 3.10
- A CUDA GPU for SAM3 (bfloat16 recommended; tested on RTX 5070 Ti / sm_120, PyTorch 2.11+cu128)
- Intel RealSense (optional — folder import works without one)
- Node 20+ if rebuilding the SPA
Install Python deps via `pip install -e ./server` (preferred — pulls FastAPI, ultralytics, transformers, albumentations, etc. from `pyproject.toml`) or `pip install -r requirements.txt` for the legacy scripts.
The default checkpoint at `sam3_checkpoint_hf/` was converted from `sam3.pt` and has trained tracker_neck weights stored under `tracker_model.tracker_neck.*`; `Sam3Engine.load` aliases them to top-level `tracker_neck.*` at startup. If `tk_vision serve` errors with `tracker_neck patch loaded N/22 weights`, install a clean checkpoint:
```shell
tk_vision fetch-weights --source local                          # verify the on-disk copy
tk_vision fetch-weights --source hf --repo facebook/sam3 --yes  # pull from Hugging Face
tk_vision fetch-weights --source url --url https://… --sha256 … --yes
```
`--source hf|url` requires `--yes` (or `TK_VISION_ALLOW_DOWNLOAD=1`); the canary refuses to install unless all 22 tracker_neck weights load.
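The startup aliasing mentioned above is a plain state-dict key copy. A minimal sketch of the idea — the real `Sam3Engine.load` may differ in details:

```python
def alias_tracker_neck(state_dict: dict) -> dict:
    """Copy tracker_model.tracker_neck.* weights to top-level tracker_neck.* keys.

    Sketch of the aliasing Sam3Engine.load performs at startup; the actual
    implementation in the server may differ.
    """
    prefix = "tracker_model.tracker_neck."
    out = dict(state_dict)
    for key, value in state_dict.items():
        if key.startswith(prefix):
            # "tracker_model.tracker_neck.X" -> "tracker_neck.X"
            out[key[len("tracker_model."):]] = value
    return out


sd = {"tracker_model.tracker_neck.conv.weight": 1, "backbone.stem": 2}
aliased = alias_tracker_neck(sd)
# aliased now contains "tracker_neck.conv.weight" alongside the original key
```

The original keys are kept so nothing else that reads the checkpoint breaks.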
```shell
tk_vision serve   # FastAPI on :28000, SPA at http://localhost:28000
```
Then in the browser:
- Clips page — record a live RealSense clip, import a folder of images, or pick an existing clip.
- Label page — Seed first frame runs SAM3 against the ontology; click on a track to refine with positive/negative points; press Propagate to run the SAM3 video tracker forward through the clip in chunks. Scrub frames, mark bad frames (Delete/Restore/Prune from here), re-seed problem frames.
- Export YOLO-seg writes `data/runs/<run_id>/{images,labels}/{train,val}/` + `data.yaml`. Per-clip split is the default to avoid temporal leakage; multi-contour polygons are preserved.
- Augment applies the Albumentations ops in `configs/default.yaml > augment.ops` plus optional copy-paste, materializing `<stem>_aug{0..N-1}.jpg/.txt` in place.
- Train spawns `python -m tk_vision._train_runner` as a subprocess running `model.train(...)`; stdout streams to the SPA via WebSocket. Cancel via SIGTERM; `metrics.json` is written from `results.csv` on success.
- Test → `/test/<run_id>/<clip_id>` loads a `.pt` weights path and runs `model.predict` on every non-deleted frame, persisting predictions as JSON. The frame scrubber overlays predicted polygons + class scores against the original frame.
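Per-clip splitting assigns whole clips to either train or val, so frames from the same recording never land in both splits. A minimal sketch of the idea — the function name and signature are hypothetical, not the exporter's actual code:

```python
import random


def per_clip_split(frames_by_clip: dict[str, list[str]], val_frac: float = 0.2,
                   seed: int = 0) -> tuple[list[str], list[str]]:
    """Assign whole clips to train or val so no clip spans both splits.

    Splitting at the clip level avoids temporal leakage: adjacent frames of
    one clip are near-duplicates, so a per-frame split would let val frames
    leak information from train.
    """
    clips = sorted(frames_by_clip)
    rng = random.Random(seed)          # deterministic for a given seed
    rng.shuffle(clips)
    n_val = max(1, round(len(clips) * val_frac))
    val_clips = set(clips[:n_val])
    train, val = [], []
    for clip, frames in frames_by_clip.items():
        (val if clip in val_clips else train).extend(frames)
    return train, val
```

Per-frame random splits are the usual default in exporters, which is exactly why the README calls out that this one splits by clip instead.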
`resource/ontology.json` is `{prompt: label}`:

```json
{"<text prompt>": "<label>"}
```

The keys feed SAM3's open-vocab text head, which was trained on short noun phrases ("cat", "remote"). Keep keys to ≤6 tokens, simple visual descriptors, no task jargon. Long phrases collapse SAM3's presence_logits and seed returns nothing. Values become YOLO class names — keep them stable.
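A quick way to keep prompts inside the budget is a whitespace token check. The ontology entries below are hypothetical examples — the real ones live in `resource/ontology.json` — and word count is only an approximation of the tokenizer's count:

```python
ontology = {
    # hypothetical {prompt: label} entries for illustration
    "orange ball": "ball",
    "white goal post": "goalpost",
}


def check_prompts(ontology: dict[str, str], max_tokens: int = 6) -> list[str]:
    """Return prompts that exceed the recommended token budget.

    Whitespace words approximate tokens; a real tokenizer may count more.
    """
    return [p for p in ontology if len(p.split()) > max_tokens]


assert check_prompts(ontology) == []
```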
`configs/default.yaml > sam3.score_mode` (or the SPA toggle on the Label page) selects:

- `native` — Meta default: `score = sigmoid(pred_logits) * sigmoid(presence_logits)`. Use with short prompts.
- `per_query` — drops the presence multiplier. Use when you keep verbose / domain-specific prompts and `native` returns nothing.
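The difference between the two modes is just whether the presence sigmoid multiplies in. A minimal numeric sketch (scalar logits for illustration; the model produces these per query):

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mask_score(pred_logit: float, presence_logit: float, mode: str = "native") -> float:
    """Combine per-query and presence logits the way the two score modes do."""
    if mode == "native":
        return sigmoid(pred_logit) * sigmoid(presence_logit)
    if mode == "per_query":
        return sigmoid(pred_logit)
    raise ValueError(f"unknown score_mode: {mode}")


# A verbose prompt can drive presence_logit strongly negative: the native
# score collapses even though the per-query logit is confident.
print(mask_score(3.0, -6.0, "native"))     # ≈ 0.002 — filtered out
print(mask_score(3.0, -6.0, "per_query"))  # ≈ 0.95  — survives thresholding
```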
`configs/default.yaml` controls server bind, capture device, SAM3 checkpoint path/dtype, propagation chunk size, augmentation ops, and training hyperparameters. Field-level docs live in `server/tk_vision/config.py`. Override via `tk_vision serve --config <path>` or by editing the YAML in place.
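A hedged sketch of the YAML shape — only `sam3.score_mode` and `augment.ops` are key paths named in this README; every other field name below is a guess, so check `server/tk_vision/config.py` for the real schema:

```yaml
sam3:
  score_mode: native        # or per_query (see the score-mode section above)
  # checkpoint path/dtype fields exist per the README, but their exact
  # names live in config.py
augment:
  ops:                      # Albumentations op list; these entries are examples
    - HorizontalFlip
    - RandomBrightnessContrast
```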
Long operations (propagate, train, infer) follow a uniform pattern:
- `POST /api/.../{op}` — start. Returns `{job_id, ...}` and starts an asyncio task.
- `GET /api/.../{op}/{job_id}` — poll status.
- `DELETE /api/.../{op}/{job_id}` — cancel (SIGTERM for subprocess jobs, `cancel.set()` for in-process loops).
- `WS /ws/{op}/.../{job_id}` — streams `log`/`frame`/`done`/`error`/`cancelled` events.
Job status is one of `pending | running | done | error | cancelled` (`JobStatus` in `web/src/api/rest.ts`).
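The start/poll half of the pattern, from a client's point of view, can be sketched transport-agnostically (so the elided endpoint paths stay elided). `post` and `get` are hypothetical helpers standing in for real HTTP calls — in practice you would wrap an HTTP client such as httpx:

```python
import time
from typing import Callable


def run_job(post: Callable[[str], dict], get: Callable[[str], dict],
            start_url: str, poll: float = 0.5) -> dict:
    """Start a long-running job and poll until it reaches a terminal status.

    post(start_url) hits the op's POST endpoint and returns its JSON;
    get(f"{start_url}/{job_id}") polls status. Cancellation (DELETE) and
    the WebSocket event stream are not covered by this sketch.
    """
    job_id = post(start_url)["job_id"]
    status_url = f"{start_url}/{job_id}"
    while True:
        status = get(status_url)["status"]
        if status in ("done", "error", "cancelled"):
            return {"job_id": job_id, "status": status}
        time.sleep(poll)
```

For live progress you would subscribe to the WebSocket instead of polling; the GET loop is the fallback for simple scripts.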
```shell
cd server
PYTHONPATH=. python -m pytest tests/ -q --override-ini "addopts="
```
The GPU smoke test (`test_propagate_smoke.py`) and the tracker_neck patch test require a working CUDA + SAM3 checkpoint and skip otherwise.
Still functional for the original cv2-window flow; tagged `legacy/v0` in git. See git history for the pre-web README.