feat(datagen): schema-grounded synthetic training data for an Argent-native model by latekvo · Pull Request #375 · software-mansion/argent

latekvo · 2026-06-19T10:43:47Z

What

datagen/ — a deterministic pipeline that generates high-quality, high-volume tool-use trajectories for fine-tuning a model to drive the Argent toolkit (iOS sim / Android emu / Chromium). The primary deliverable is the data; a ~$500 4B proof model comes next, then the $50k run.

Why not just prompt an LLM to write the conversations

That hallucinates tool schemas (the skills even reference a paste tool that isn't in the registry), invents ungrounded tap coordinates (the one thing Argent forbids), and approximates the observation formats. Training on that teaches the model to guess.

Approach: simulate Argent, roll out the expert, gate hard

Ground truth — all 67 tool schemas extracted from the live registry (spec/tools.json); real output serializers (describe, component-tree, network logs, profiler reports) ported in format.ts.
Gym (gym.ts + archetypes/) — a text-simulator of devices + apps + screen graphs; each tool has a real transition function returning the exact observation.
Expert policy (expert.ts) — the argent.md rules as executable behavior: list-devices first, discovery-before-tap, tap the centre of a discovered element, re-discover after navigation, component-tree for RN, run-sequence batching, profiling/flow/network procedures, and recovery from injected failures (tap miss, describe error, boot timeout, debugger drop).
Hard gates (validate.ts) — schema (strict, unknown args rejected) + structure + device-order + policy + coordinate grounding (every tap must resolve to an element in the most recent discovery output). The grounding gate makes "non-hallucinated" a checkable property and also enforces discovery-before-tap.
NL layer (narrate.ts) — varies only phrasing, never the grounded action/observation backbone (the safe place for later LLM paraphrase).
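The coordinate-grounding gate above can be pictured as a point-in-frame check. The following is a minimal sketch, not the actual validate.ts code: the `Frame`, `DiscoveredElement`, `centreOf`, and `isGrounded` names are hypothetical, and it assumes discovery output exposes a rectangular frame per element.

```typescript
// Hypothetical shapes; the real discovery output is defined by format.ts.
interface Frame { x: number; y: number; width: number; height: number }
interface DiscoveredElement { label: string; frame: Frame }
interface Point { x: number; y: number }

// Expert side: always tap the centre of a discovered element.
function centreOf(el: DiscoveredElement): Point {
  return { x: el.frame.x + el.frame.width / 2, y: el.frame.y + el.frame.height / 2 };
}

// Validator side: a tap is grounded only if it falls inside some element
// of the most recent discovery output (null means no discovery yet).
function isGrounded(tap: Point, lastDiscovery: DiscoveredElement[] | null): boolean {
  if (lastDiscovery === null) return false;
  return lastDiscovery.some(
    ({ frame }) =>
      tap.x >= frame.x && tap.x <= frame.x + frame.width &&
      tap.y >= frame.y && tap.y <= frame.y + frame.height,
  );
}

const loginButton: DiscoveredElement = {
  label: "Log in",
  frame: { x: 100, y: 400, width: 200, height: 50 },
};
const groundedTap = centreOf(loginButton);

console.log(isGrounded(groundedTap, [loginButton])); // true: centre of a discovered element
console.log(isGrounded({ x: 10, y: 10 }, [loginButton])); // false: invented coordinate
console.log(isGrounded(groundedTap, null)); // false: tap without prior discovery
```

Because a tap with no preceding discovery has nothing to resolve against, the same check also enforces discovery-before-tap.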

Proof it works

Validator self-check (node src/selfcheck.ts): corrupts a known-good trajectory 9 ways (among them ungrounded tap, missing discovery, schema break, unknown arg, out-of-range coord, wrong device order, unknown tool, un-offered tool) and asserts that each is rejected: 9/9. A gate that accepts everything is worthless; this keeps it honest.
Pilot (node src/generate.ts --n 800 --evalN 100): 800 train + 100 eval, 0 rejected, 40/67 tools (60%) exercised across 16 task families and iOS/Android/Chromium, ~9 tool calls/trajectory, ~10% with recovery. Deterministic (seeded) → byte-identical reruns; train/eval seed-disjoint.
Speed: ~1,300 validated trajectories/sec single-threaded → ~1M in ~13 min, embarrassingly parallel beyond.
OpenAI + ShareGPT exporters; committed samples/ shows real output without running.
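The validator self-check idea (corrupt a known-good trajectory, assert every corruption is rejected) can be sketched as a small harness. This is an illustration under assumptions: `toyValidate` is a toy stand-in for validate.ts, and the argument names in `ALLOWED_ARGS` are hypothetical rather than taken from spec/tools.json.

```typescript
interface ToolCall { tool: string; args: Record<string, unknown> }
type Trajectory = ToolCall[];
type Validator = (t: Trajectory) => string[]; // violations; empty means accepted
interface Corruption { name: string; apply: (t: Trajectory) => Trajectory }

// Toy stand-in for validate.ts: allowed argument names per tool (hypothetical).
const ALLOWED_ARGS = new Map<string, string[]>([
  ["list-devices", []],
  ["describe", ["udid"]],
  ["gesture-tap", ["udid", "x", "y"]],
]);

const toyValidate: Validator = (t) => {
  const violations: string[] = [];
  if (t[0]?.tool !== "list-devices") violations.push("list-devices must come first");
  for (const call of t) {
    const allowed = ALLOWED_ARGS.get(call.tool);
    if (allowed === undefined) {
      violations.push(`unknown tool: ${call.tool}`);
      continue;
    }
    for (const key of Object.keys(call.args)) {
      if (!allowed.includes(key)) violations.push(`unknown arg ${key} on ${call.tool}`);
    }
  }
  return violations;
};

// Applies each corruption to a fresh copy of the known-good trajectory and
// returns the names of corruptions the validator failed to reject.
function selfcheck(good: Trajectory, corruptions: Corruption[], validate: Validator): string[] {
  if (validate(good).length > 0) throw new Error("known-good trajectory was rejected");
  return corruptions
    .filter((c) => validate(c.apply(JSON.parse(JSON.stringify(good)) as Trajectory)).length === 0)
    .map((c) => c.name);
}

const good: Trajectory = [
  { tool: "list-devices", args: {} },
  { tool: "describe", args: { udid: "SIM-1" } },
  { tool: "gesture-tap", args: { udid: "SIM-1", x: 0.5, y: 0.5 } },
];

const corruptions: Corruption[] = [
  { name: "unknown tool", apply: (t) => [...t, { tool: "paste", args: {} }] },
  { name: "unknown arg", apply: (t) => t.map((c) => ({ ...c, args: { ...c.args, bogus: 1 } })) },
  { name: "wrong device order", apply: (t) => t.slice(1) },
];

const escaped = selfcheck(good, corruptions, toyValidate);
console.log(`${corruptions.length - escaped.length}/${corruptions.length} rejected`); // 3/3 rejected
```

Returning the names of corruptions that slipped through, rather than a boolean, makes a regression in any single gate easy to spot.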

Review pointers

datagen/DESIGN.md — full rationale + path to the proof model and scale-up.
datagen/README.md — quickstart, gates, extension pattern.
datagen/samples/ — rendered trajectories + pilot stats.json.

Not included / next

27 niche tools still uncovered (listed in stats.json#tools_never_used) — extension is mechanical (adding the last 5 families took ~150 lines).
Screenshots are textual placeholders (text SFT; reinforces "read structure, not pixels"). A vision variant would layer real images.
No model trained yet — this PR is the data pipeline + validated pilot.

Runs on Node 24+ with no build step (native TS type-stripping). Depends only on ajv.

…Argent-native model

Adds datagen/: a deterministic pipeline that produces high-quality, high-volume tool-use training data for fine-tuning a model to drive the Argent toolkit.

Approach: instead of asking an LLM to write tool-use conversations (which hallucinates schemas and ungrounded coordinates), simulate Argent as a deterministic "gym" and roll out an expert policy (the argent.md workflow rules encoded as code) inside it. Every observation is produced by the simulator, every tap coordinate is the centre of a discovered element, and every trajectory must pass schema + structural + policy + coordinate-grounding gates before it ships.

- spec/tools.json: real 67-tool catalog extracted from the live registry (ground truth)
- gym + 4 app archetypes (native iOS, RN, Chromium) across iOS/Android/Chromium
- expert policy + 16 task families (navigate, login, scroll, run-sequence, visual-regression, profiling+drilldown, flows, network, cold-boot, deep-link, pinch-zoom, chromium-tabs, native/debug-inspect, console-check)
- strict validator with an adversarial self-check (9/9) proving the gates reject ungrounded taps, missing discovery, schema breaks, wrong device order, etc.
- OpenAI + ShareGPT exporters; deterministic (seeded), reproducible, shardable

Pilot: 800 train + 100 eval, 0 rejected, 40/67 tools (60%) exercised, ~1,300 validated trajectories/sec single-threaded. See datagen/DESIGN.md for the quality argument and the path to the $500 proof model and $50k scale-up.
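The "deterministic (seeded), reproducible, shardable" property can be illustrated with a seed-per-trajectory scheme. This is a sketch under assumptions: `seededRng`, `generateTrajectory`, `seedsFor`, and the `EVAL_SEED_BASE` offset are hypothetical names and values, and the toy generator only stands in for the real gym rollout.

```typescript
// Small seeded PRNG (mulberry32-style); any deterministic generator would do.
function seededRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const TASK_KINDS = ["navigate", "login", "scroll", "deep-link"] as const;
const EVAL_SEED_BASE = 1_000_000; // assumption: eval seeds live in their own range

// Toy generator: everything random is derived from the seed alone,
// so the same seed always serializes to the same bytes.
function generateTrajectory(seed: number): string {
  const rng = seededRng(seed);
  const kind = TASK_KINDS[Math.floor(rng() * TASK_KINDS.length)];
  const toolCalls = 5 + Math.floor(rng() * 8);
  return JSON.stringify({ seed, kind, toolCalls });
}

// Train and eval draw from non-overlapping seed ranges.
function seedsFor(split: "train" | "eval", count: number): number[] {
  const base = split === "train" ? 0 : EVAL_SEED_BASE;
  return Array.from({ length: count }, (_, i) => base + i);
}

const trainSeeds = seedsFor("train", 800);
const evalSeeds = seedsFor("eval", 100);
const trainSet = new Set(trainSeeds);

console.log(generateTrajectory(7) === generateTrajectory(7)); // true: byte-identical rerun
console.log(evalSeeds.some((s) => trainSet.has(s))); // false: splits are seed-disjoint
```

Since each trajectory depends only on its own seed, a shard is simply a slice of the seed range, which is what makes the generation embarrassingly parallel.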

datagen runs on Node-native TS type-stripping and is not part of the typechecked package graph, so its files are in no tsconfig project the type-aware parser knows about. Ignore the dir (same pattern as the other non-package ignores), mirroring its exclusion from tsc and .prettierignore.

…isual nav)

The post-action screenshot slot now carries the actual screen content (title + key visible elements + tabs) instead of a placeholder, so a text model reads where it landed after each action. This is the navigation signal the user flagged as most valuable. Grounding gate is unaffected: captions have no coordinates, so taps must still be grounded in a describe/component-tree result.

… / seeker)

Adds a persona dimension to user prompts so the dataset covers the three asker types requested: technical users digging into profiling/inspection, non-technical app builders describing UI in natural language, and 'seeker' find-it framing for navigation. Persona is weighted per task kind, recorded in meta, and reported in stats (by_persona).
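Per-task-kind persona weighting can be sketched as a weighted pick driven by a random number in [0, 1). The weights and the names `Persona`, `Weights`, and `pickPersona` below are hypothetical; only the three persona types come from the commit message.

```typescript
type Persona = "technical" | "builder" | "seeker";
type Weights = Record<Persona, number>;

// Hypothetical weights: profiling skews technical, hide-and-seek is seeker-only.
const profilingWeights: Weights = { technical: 3, builder: 1, seeker: 0 };
const hideAndSeekWeights: Weights = { technical: 0, builder: 0, seeker: 1 };

// Picks a persona proportionally to its weight; r is a random number in [0, 1),
// typically drawn from the trajectory's seeded generator.
function pickPersona(weights: Weights, r: number): Persona {
  const entries = Object.entries(weights) as [Persona, number][];
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  if (total <= 0) throw new Error("at least one persona weight must be positive");
  let threshold = r * total;
  let lastPositive: Persona | undefined;
  for (const [persona, w] of entries) {
    if (w <= 0) continue; // zero-weight personas are never chosen
    lastPositive = persona;
    if (threshold < w) return persona;
    threshold -= w;
  }
  return lastPositive as Persona; // only reached through floating-point rounding
}

console.log(pickPersona(profilingWeights, 0.1)); // technical
console.log(pickPersona(profilingWeights, 0.8)); // builder
console.log(pickPersona(hideAndSeekWeights, 0.5)); // seeker
```

Passing `r` in explicitly keeps the pick reproducible when it is fed from a seeded generator.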

The model must find a target with no given path, demonstrating exploratory navigation: read the screen (via discovery + scene caption), try a plausible wrong screen, recognize the target isn't there, backtrack with the back button, and continue until found. Directly trains the weak app-navigation capability and exercises the screenshot-as-navigation-signal. Weighted high (16) and routed to the seeker persona.

toGemmaMessages folds system+tools+task into the first user turn, renders tool calls as <tool_call> text in model turns and tool results into following user turns (clean user/model alternation for Gemma's template). Strips the constant describe coordinate-header to shrink sequences; offered-tool count is now configurable so training can use a lean 16-tool list. training/prepare.ts writes seed-disjoint train/valid/test JSONL for mlx-lm.
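The folding described above can be sketched as follows. This is not the actual toGemmaMessages implementation: the `SourceMessage` shape, the `toGemmaTurns` name, and the JSON payload with a closing `</tool_call>` tag are assumptions made for illustration.

```typescript
type SourceMessage =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string; toolCalls?: { name: string; arguments: Record<string, unknown> }[] }
  | { role: "tool"; content: string };

interface GemmaTurn { role: "user" | "model"; content: string }

function toGemmaTurns(messages: SourceMessage[], toolList: string): GemmaTurn[] {
  const turns: GemmaTurn[] = [];
  const systemParts: string[] = [];
  // Merge into the previous turn when the role repeats, so roles strictly alternate.
  const push = (role: GemmaTurn["role"], content: string): void => {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.role === role) last.content += "\n" + content;
    else turns.push({ role, content });
  };
  for (const m of messages) {
    if (m.role === "system") {
      systemParts.push(m.content); // held back until the first user turn
    } else if (m.role === "user") {
      // First user turn carries system + tools + task; later ones only the text.
      const parts = turns.length === 0 ? [...systemParts, toolList, m.content] : [m.content];
      push("user", parts.join("\n\n"));
    } else if (m.role === "assistant") {
      const calls = (m.toolCalls ?? []).map((c) => `<tool_call>${JSON.stringify(c)}</tool_call>`);
      push("model", [m.content, ...calls].filter((s) => s.length > 0).join("\n"));
    } else {
      push("user", m.content); // tool results come back as user turns
    }
  }
  return turns;
}

const turns = toGemmaTurns(
  [
    { role: "system", content: "Follow the Argent policy." },
    { role: "user", content: "Open Settings." },
    { role: "assistant", content: "", toolCalls: [{ name: "list-devices", arguments: {} }] },
    { role: "tool", content: "[device list]" },
    { role: "assistant", content: "Done." },
  ],
  "Tools: list-devices, describe",
);
console.log(turns.map((t) => t.role).join(" > ")); // user > model > user > model
```

Merging consecutive same-role turns is one way to keep the user/model alternation intact when several tool results arrive in a row.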

- serve.py: persistent mlx-lm generation server (loads model+adapter once, one generation per stdin line, greedy).
- eval.ts: drives a model through the gym on held-out seeds (5M+, disjoint from train/valid/test) and scores with the same validators: schema-valid %, grounded-tap %, policy violations, and navigation task-success (tapped the target element on the target screen). Gym/validators stay the source of truth in TS; Python only generates text.
- Validator.checkCall (single-call schema/coord check), buildGemmaFirstUser + compactObservation exports shared with the trainer.
- train.sh: LoRA fine-tune Gemma 2 2B 4-bit with --mask-prompt.

Documents the prepare -> train -> eval(base) -> eval(tuned) -> report flow and what the gym-replay eval measures. train.sh trains on the full sequence (every tool-call turn) rather than masking to the last turn. report.ts emits the comparison table. prepare.ts uses a lean 16-tool offered list.

First run OOM'd on Metal (batch 2, seq 3400). Fixes: a ~120-token compact policy preamble for the Gemma export/eval (the full ~600-token policy was repeated in every example), and batch-size 1 / max-seq-length 2600 / 8 LoRA layers. Regenerated data now medians ~2173 tokens (was ~2772), fitting the sequence cap with no truncation. Train and eval share the compact preamble.

…; faster eval

First stable-memory run diverged: LoRA LR was ~15x too high. Drop to 5e-5. Also cut eval generation budget to 180 tokens (model turns are short).

…grounded / 44% nav

Base gemma-2-2b-it makes 0 tool calls (chats only). LoRA fine-tuned on 2,500 gym trajectories (val loss 2.32->0.064), evaluated on 120 held-out tasks through the gym: schema-valid 99.2%, grounded taps 97.2%, navigation success 44.1%, 7.3 calls/ep.

Per-kind: login 6/6, toggle 6/9, deep-link 4/8, hide-and-seek 5/12, navigate-tap 6/19. Honest weakness: scroll-find 0/9 (didn't learn scroll-to-reveal).

Proves the gym data teaches Argent tool-use to a tiny model. Results in training/results/.

fuse LoRA into base (dequantize) -> copy SentencePiece tokenizer.model in (MLX omits it) -> make-modelfile.ts bakes the Argent policy + 16-tool list and a literal <bos> into a multi-turn Gemma template -> ollama create silver:2b. Two gotchas documented: missing tokenizer.model (UNK garbage) and missing <bos> (model chats instead of tool-calling). Verified: emits policy-correct, schema-valid tool calls and continues across turns from pasted tool responses.

… refresh samples

The grounding gate tracked the most-recent discovery purely by message order and never invalidated it on a screen change, so a gesture-tap reusing coordinates from a discovery taken *before* a navigation (launch-app / open-url / restart-app / button back|home) was accepted even though those coordinates no longer describe the visible screen. This contradicted the gate's own promise ("a tap on a freshly-navigated screen can only be grounded by a discovery taken on that screen").

Clear lastDiscovery on navigation so taps must be re-grounded by a fresh discovery after every transition. The expert always re-discovers after navigating, so this rejects zero existing trajectories (verified: 0/1697 taps flagged across the 800+100 pilot). Adds a 10th selfcheck case (stale discovery across a navigation) as the regression test; selfcheck is now 10/10.

Also regenerate the committed samples/ (stats.json, samples.md, sample.*.jsonl) which were stale: they predated the phase-2 src changes (scene captions, personas, hide-and-seek, exporter updates), so a seeded rerun no longer reproduced them and stats misreported coverage/recovery/persona mix. Now byte-reproducible from `generate.ts --n 800 --evalN 100 --samples 8` + prettier. Docs updated to the true current numbers (41/67 tools, ~15% recovery, 17 task families, selfcheck 10/10).
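The grounding-gate fix in this commit (clear lastDiscovery on navigation) can be sketched as a small state machine over the trajectory. The `Step` shape and the `findUngroundedTaps` name are hypothetical simplifications; the tool names are the ones listed in the commit message.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

// Hypothetical, simplified trajectory steps.
type Step =
  | { tool: "describe" | "component-tree"; elements: Rect[] }
  | { tool: "launch-app" | "open-url" | "restart-app" }
  | { tool: "button"; button: "back" | "home" }
  | { tool: "gesture-tap"; x: number; y: number };

const contains = (r: Rect, x: number, y: number): boolean =>
  x >= r.x && x <= r.x + r.width && y >= r.y && y <= r.y + r.height;

// Returns the indices of taps that are not grounded in a discovery
// taken on the currently visible screen.
function findUngroundedTaps(steps: Step[]): number[] {
  let lastDiscovery: Rect[] | null = null;
  const flagged: number[] = [];
  for (const [index, step] of steps.entries()) {
    switch (step.tool) {
      case "describe":
      case "component-tree":
        lastDiscovery = step.elements; // a fresh discovery grounds later taps
        break;
      case "gesture-tap": {
        const elements: Rect[] = lastDiscovery ?? [];
        if (!elements.some((r) => contains(r, step.x, step.y))) flagged.push(index);
        break;
      }
      default:
        lastDiscovery = null; // navigation: earlier coordinates are stale
    }
  }
  return flagged;
}

const screen: Rect[] = [{ x: 0, y: 0, width: 100, height: 40 }];
const staleCase: Step[] = [
  { tool: "describe", elements: screen },
  { tool: "gesture-tap", x: 50, y: 20 }, // grounded
  { tool: "button", button: "back" }, // navigation invalidates the discovery
  { tool: "gesture-tap", x: 50, y: 20 }, // stale: must be flagged
  { tool: "describe", elements: screen },
  { tool: "gesture-tap", x: 50, y: 20 }, // re-grounded
];
console.log(findUngroundedTaps(staleCase)); // [ 3 ]
```

The `staleCase` trajectory mirrors the 10th selfcheck case: the same coordinates are accepted before the navigation, rejected right after it, and accepted again once a fresh discovery has run.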

- clean-base.py: strip 126 redundant shared-KV tensors mlx_lm's strict load rejects
- train.sh: MAXSEQ/NUMLAYERS env overrides (gemma4 template needs seq 3500)
- to-causal.py: rewrite fused checkpoint as text-only Gemma4ForCausalLM
- make-modelfile.ts: FLAVOR=gemma4 emits a SYSTEM-preamble Modelfile (native renderer)
- test-ollama.ts: raw/chat Ollama probes for tool-call verification
- README: e4b pipeline + note that Ollama's gemma4 converter is broken (use llama.cpp)

Ollama defaults gemma4 to num_ctx=4096; agent harnesses (OpenCode, hermes, Claude) send long system+tools that overflow it, so Ollama truncates the input and the model emits degenerate 'TheThe...' output until max tokens (hits official gemma4:e4b too). Raise the baked context to 32768. Also document the global OLLAMA_CONTEXT_LENGTH fix and the Argent text-tool-call vs native-tool-call caveat.
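Baking the larger context into the Modelfile could look like the sketch below. `renderModelfile` is a hypothetical helper, not the actual make-modelfile.ts code, and the `FROM` path and system text are placeholders; `FROM`, `PARAMETER num_ctx`, and `SYSTEM` are standard Ollama Modelfile instructions.

```typescript
// Renders a minimal Ollama Modelfile with an explicit context window.
// Assumes the system text contains no triple quotes.
function renderModelfile(from: string, system: string, numCtx: number = 32768): string {
  return [
    `FROM ${from}`,
    `PARAMETER num_ctx ${numCtx}`, // overrides the default context length
    `SYSTEM """${system}"""`,
  ].join("\n") + "\n";
}

const modelfile = renderModelfile("./model-dir", "Follow the Argent policy.");
console.log(modelfile);
```

The global alternative mentioned above is to set the `OLLAMA_CONTEXT_LENGTH` environment variable on the Ollama server instead of baking the value into each model.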

- serve-vllm.sh: vLLM serve (Gemma team's dev/test rec) for Linux+CUDA; gemma4 is in vLLM's registry (Gemma4ForCausalLM, v0.23+). No Apple-Silicon build, so it runs on a GPU box against LatekVo/silver, not this Mac.
- serve-mlx.sh: local OpenAI-compatible MLX server (4-bit base + adapter, ~4GB), the practical dev/test path on Apple Silicon.
- README + argent-system.txt: drive either as an Argent agent (system prompt + <tool_call> text parsing).

MLX server runtime-verification pending eval finish.
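On the client side, the "<tool_call> text parsing" step could be sketched like this. The payload format (a JSON object with `name` and `arguments` between `<tool_call>` and `</tool_call>`) is an assumption for illustration, and `parseToolCalls` is a hypothetical name.

```typescript
interface ParsedCall { name: string; arguments: Record<string, unknown> }

// Extracts every well-formed <tool_call>…</tool_call> block from model text.
// Malformed JSON or blocks without a string "name" are skipped, not thrown.
function parseToolCalls(text: string): ParsedCall[] {
  const calls: ParsedCall[] = [];
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  for (const match of text.matchAll(pattern)) {
    try {
      const parsed: unknown = JSON.parse(match[1] ?? "");
      if (typeof parsed !== "object" || parsed === null) continue;
      const { name, arguments: args } = parsed as { name?: unknown; arguments?: unknown };
      if (typeof name !== "string") continue;
      const safeArgs = typeof args === "object" && args !== null ? args : {};
      calls.push({ name, arguments: safeArgs as Record<string, unknown> });
    } catch {
      // Malformed JSON inside the tags: ignore this block.
    }
  }
  return calls;
}

const reply = 'Let me look.\n<tool_call>{"name":"describe","arguments":{"udid":"SIM-1"}}</tool_call>';
console.log(parseToolCalls(reply)); // one call to "describe"
console.log(parseToolCalls("<tool_call>{broken</tool_call>").length); // 0
```

An agent loop would validate each parsed call against the tool schema before executing it, then feed the tool result back to the model as the next user turn.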

Compare {Argent, agent-device} × {silver:e4b, gemma4:e4b, Claude Haiku} on a suite of toolkit-agnostic iOS-Settings tasks. Uses Hermes (established agent runtime) per the 'no custom harness' constraint; adds a thin LLM-judge scoring pass (heuristic proxy until the Anthropic key is supplied).

- tasks.jsonl: 8 deterministic, zero-install Settings tasks + judge goals
- run-cell.sh: one (model × toolkit) cell via headless 'hermes chat' + transcript export
- run-all.sh: the 4 local cells, serialized (memory); Haiku cells deferred for the key
- judge.py: LLM-as-judge (Claude) or labeled heuristic fallback + comparison table
- README: design, Hermes config, prereqs, memory caveats

…(beats 2B's 44%)

120 held-out tasks. silver:e4b: 60.3% nav / 96.2% schema / 87.9% grounded; base just chats (0% nav). The tuned model was evaluated as a 4-bit MERGED checkpoint because mlx deadlocks on the runtime-adapter path for gemma4 (documented in RESULTS-e4b.md).

…xecute silver's tool calls

Smoke-tested both installed harnesses against a real iOS sim + argent:

- Hermes: loads MCP tools but never executes local ollama models' tool calls; silver:e4b AND gemma4:e4b base both hallucinate the loop (real sim untouched, argent log shows only list_tools, fake udid/2025 timestamps).
- OpenCode (ollama provider added): silver emits text only, zero executed tools.

Root cause: silver is specialized for Argent's <tool_call> text + preamble-in-SYSTEM; generic harnesses inject their own system prompt + tool format. Documented options for Monday (recommend: adapt eval.ts to a real-device Argent loop for an apples-to-apples run).

latekvo added 25 commits June 19, 2026 12:43

style(datagen): apply prettier formatting; ignore generated datagen/out

65dbed3

Merge remote-tracking branch 'origin/main' into argent-finetune-data

a88e342

docs(datagen): document Phase 2 (personas, hide-and-seek, captions, Gemma benchmark)

24f8889

fix(training): LR 5e-5 (1.5e-4 diverged: train loss climbed 4.4->9.9); faster eval

7421a66

docs(training): align pipeline example with the config that ran (500 iters, maxTokens 2500)

5158c9b

style(datagen): prettier formatting for phase-2 files; ignore generated training dirs

19d20b2

style(training): prettier markdown italics in e4b README; ignore base…

24d9249

…/ model dir

latekvo wants to merge 25 commits into main from argent-finetune-data
