feat(datagen): schema-grounded synthetic training data for an Argent-native model#375

Draft
latekvo wants to merge 25 commits into main from argent-finetune-data

@latekvo latekvo commented Jun 19, 2026

What

datagen/ — a deterministic pipeline that generates high-quality, high-volume tool-use trajectories for fine-tuning a model to drive the Argent toolkit (iOS sim / Android emu / Chromium). The primary deliverable is the data; a ~$500 4B proof model comes next, then the $50k run.

Why not just prompt an LLM to write the conversations

Prompting an LLM to write the conversations hallucinates tool schemas (the skills even reference a paste tool that isn't in the registry), invents ungrounded tap coordinates (the one thing Argent forbids), and only approximates the observation formats. Training on that output teaches the model to guess.

Approach: simulate Argent, roll out the expert, gate hard

  • Ground truth — all 67 tool schemas extracted from the live registry (spec/tools.json); real output serializers (describe, component-tree, network logs, profiler reports) ported in format.ts.
  • Gym (gym.ts + archetypes/) — a text-simulator of devices + apps + screen graphs; each tool has a real transition function returning the exact observation.
  • Expert policy (expert.ts) — the argent.md rules as executable behavior: list-devices first, discovery-before-tap, tap the centre of a discovered element, re-discover after navigation, component-tree for RN, run-sequence batching, profiling/flow/network procedures, and recovery from injected failures (tap miss, describe error, boot timeout, debugger drop).
  • Hard gates (validate.ts) — schema (strict, unknown args rejected) + structure + device-order + policy + coordinate grounding (every tap must resolve to an element in the most recent discovery output). The grounding gate makes "non-hallucinated" a checkable property and also enforces discovery-before-tap.
  • NL layer (narrate.ts) — varies only phrasing, never the grounded action/observation backbone (the safe place for later LLM paraphrase).
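
The coordinate-grounding gate can be illustrated with a minimal sketch. This is not the code in validate.ts: the `Element` and `Tap` shapes and the `isGrounded` helper are hypothetical names, and it assumes a discovery result lists elements with pixel-space bounding boxes.

```typescript
// Hypothetical shapes; the real discovery output format is defined by the pipeline.
interface Element {
  label: string;
  x: number; // left edge
  y: number; // top edge
  width: number;
  height: number;
}

interface Tap {
  x: number;
  y: number;
}

// Centre of an element: the coordinate the expert policy is described as tapping.
function centre(el: Element): Tap {
  return { x: el.x + el.width / 2, y: el.y + el.height / 2 };
}

// A tap is grounded if it falls inside some element of the most recent discovery.
// With no discovery at all (null) nothing can be grounded, which is how the same
// check also enforces discovery-before-tap.
function isGrounded(tap: Tap, lastDiscovery: Element[] | null): boolean {
  if (lastDiscovery === null) return false;
  return lastDiscovery.some(
    (el) =>
      tap.x >= el.x &&
      tap.x <= el.x + el.width &&
      tap.y >= el.y &&
      tap.y <= el.y + el.height,
  );
}

const signIn: Element = { label: "Sign in", x: 40, y: 600, width: 300, height: 48 };
const discovery: Element[] = [signIn];
const groundedTap: Tap = centre(signIn); // centre of a discovered element
const guessedTap: Tap = { x: 10, y: 10 }; // coordinate with no element behind it
```

Here `isGrounded(groundedTap, discovery)` holds, while `guessedTap` is rejected, and so is any tap issued before a discovery exists.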

Proof it works

  • Validator self-check (node src/selfcheck.ts): corrupts a known-good trajectory 9 ways (including ungrounded tap, missing discovery, schema break, unknown arg, out-of-range coord, wrong device order, unknown tool, and un-offered tool) and asserts each is rejected: 9/9. A gate that accepts everything is worthless; this keeps it honest.
  • Pilot (node src/generate.ts --n 800 --evalN 100): 800 train + 100 eval, 0 rejected, 40/67 tools (60%) exercised across 16 task families and iOS/Android/Chromium, ~9 tool calls/trajectory, ~10% with recovery. Deterministic (seeded) → byte-identical reruns; train/eval seed-disjoint.
  • Speed: ~1,300 validated trajectories/sec single-threaded → ~1M in ~13 min, embarrassingly parallel beyond.
  • OpenAI + ShareGPT exporters; committed samples/ shows real output without running.
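
The self-check idea (corrupt a known-good trajectory, assert that the gate rejects it) can be sketched in a few lines. This is a toy stand-in, not src/selfcheck.ts: the `Step` type, the simplified `validate` gate, and the two corruptions shown are assumptions for illustration.

```typescript
// Hypothetical, simplified trajectory; the real format carries full messages.
type Step =
  | { tool: "describe"; elements: { x: number; y: number }[] }
  | { tool: "tap"; x: number; y: number };

// Toy gate: every tap must match a point from the most recent describe.
function validate(steps: Step[]): boolean {
  let last: { x: number; y: number }[] | null = null;
  for (const step of steps) {
    if (step.tool === "describe") {
      last = step.elements;
    } else {
      const { x, y } = step;
      const found = last !== null && last.some((p) => p.x === x && p.y === y);
      if (!found) return false;
    }
  }
  return true;
}

const good: Step[] = [
  { tool: "describe", elements: [{ x: 190, y: 624 }] },
  { tool: "tap", x: 190, y: 624 },
];

// Each corruption takes the good trajectory and breaks exactly one property.
const corruptions: Record<string, (t: Step[]) => Step[]> = {
  "ungrounded tap": (t) =>
    t.map((s) => (s.tool === "tap" ? { ...s, x: s.x + 500 } : s)),
  "missing discovery": (t) => t.filter((s) => s.tool !== "describe"),
};

function selfCheck(): { passed: number; total: number } {
  if (!validate(good)) throw new Error("baseline trajectory must be accepted");
  const entries = Object.entries(corruptions);
  let passed = 0;
  for (const [name, corrupt] of entries) {
    if (validate(corrupt(good))) {
      throw new Error(`gate accepted corrupted trajectory: ${name}`);
    }
    passed += 1;
  }
  return { passed, total: entries.length };
}
```

The baseline assertion matters as much as the rejections: a gate that rejects everything would otherwise also score full marks.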

Review pointers

  • datagen/DESIGN.md — full rationale + path to the proof model and scale-up.
  • datagen/README.md — quickstart, gates, extension pattern.
  • datagen/samples/ — rendered trajectories + pilot stats.json.

Not included / next

  • 27 niche tools still uncovered (listed in stats.json#tools_never_used) — extension is mechanical (adding the last 5 families took ~150 lines).
  • Screenshots are textual placeholders (text SFT; reinforces "read structure, not pixels"). A vision variant would layer real images.
  • No model trained yet — this PR is the data pipeline + validated pilot.

Runs on Node 24+ with no build step (native TS type-stripping). Depends only on ajv.

latekvo added 25 commits June 19, 2026 12:43
…Argent-native model

Adds datagen/: a deterministic pipeline that produces high-quality, high-volume
tool-use training data for fine-tuning a model to drive the Argent toolkit.

Approach: instead of asking an LLM to write tool-use conversations (which
hallucinates schemas and ungrounded coordinates), simulate Argent as a
deterministic "gym" and roll out an expert policy (the argent.md workflow rules
encoded as code) inside it. Every observation is produced by the simulator,
every tap coordinate is the centre of a discovered element, and every trajectory
must pass schema + structural + policy + coordinate-grounding gates before it
ships.

- spec/tools.json: real 67-tool catalog extracted from the live registry (ground truth)
- gym + 4 app archetypes (native iOS, RN, Chromium) across iOS/Android/Chromium
- expert policy + 16 task families (navigate, login, scroll, run-sequence,
  visual-regression, profiling+drilldown, flows, network, cold-boot, deep-link,
  pinch-zoom, chromium-tabs, native/debug-inspect, console-check)
- strict validator with an adversarial self-check (9/9) proving the gates reject
  ungrounded taps, missing discovery, schema breaks, wrong device order, etc.
- OpenAI + ShareGPT exporters; deterministic (seeded), reproducible, shardable
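
A minimal sketch of why seeding gives reproducible, shardable output, assuming every random choice in a trajectory is derived from one integer seed. The mulberry32-style PRNG, the `generate` and `shard` helpers, the family list, and the eval seed offset are illustrative assumptions, not the pipeline's actual code.

```typescript
// Small mulberry32-style PRNG: the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical subsets, for illustration only.
const FAMILIES = ["navigate", "login", "scroll", "deep-link"] as const;
const PLATFORMS = ["ios", "android", "chromium"] as const;

// All randomness flows from the seed, so a trajectory is a pure function of it.
function generate(seed: number): string {
  const rand = mulberry32(seed);
  const family = FAMILIES[Math.floor(rand() * FAMILIES.length)];
  const platform = PLATFORMS[Math.floor(rand() * PLATFORMS.length)];
  return JSON.stringify({ seed, family, platform });
}

function seedRange(start: number, count: number): number[] {
  return Array.from({ length: count }, (_, i) => start + i);
}

// Sharding by seed: each worker takes the seeds congruent to its index.
function shard(seeds: number[], index: number, shards: number): number[] {
  return seeds.filter((s) => s % shards === index);
}

const trainSeeds = seedRange(0, 800);
const evalSeeds = seedRange(1_000_000, 100); // hypothetical offset keeps splits disjoint
const rerunA = trainSeeds.map(generate).join("\n");
const rerunB = trainSeeds.map(generate).join("\n");
```

Because no state is shared between seeds, shards can be generated in any order or on separate machines and concatenated to the same result.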

Pilot: 800 train + 100 eval, 0 rejected, 40/67 tools (60%) exercised,
~1,300 validated trajectories/sec single-threaded. See datagen/DESIGN.md for the
quality argument and the path to the $500 proof model and $50k scale-up.
datagen runs on Node-native TS type-stripping and is not part of the
typechecked package graph, so its files are in no tsconfig project the
type-aware parser knows about. Ignore the dir (same pattern as the other
non-package ignores), mirroring its exclusion from tsc and .prettierignore.
…isual nav)

The post-action screenshot slot now carries the actual screen content (title +
key visible elements + tabs) instead of a placeholder, so a text model reads
where it landed after each action — the navigation signal the user flagged as
most valuable. Grounding gate is unaffected: captions have no coordinates, so
taps must still be grounded in a describe/component-tree result.
… / seeker)

Adds a persona dimension to user prompts so the dataset covers the three asker
types requested: technical users digging into profiling/inspection, non-technical
app builders describing UI in natural language, and 'seeker' find-it framing for
navigation. Persona is weighted per task kind, recorded in meta, and reported in
stats (by_persona).
The model must find a target with no given path, demonstrating exploratory
navigation: read the screen (via discovery + scene caption), try a plausible
wrong screen, recognize the target isn't there, backtrack with the back button,
and continue until found. Directly trains the weak app-navigation capability and
exercises the screenshot-as-navigation-signal. Weighted high (16) and routed to
the seeker persona.
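
The explore-and-backtrack behaviour can be sketched as a depth-first search over a screen graph. This is an illustration under assumptions: the `Screen` and `Action` shapes and the `seek` function are hypothetical, and the real expert works through the gym's tools rather than on a plain object.

```typescript
interface Screen {
  elements: string[]; // labels visible on this screen
  links: Record<string, string>; // tappable label -> destination screen
}

type Action =
  | { kind: "discover"; screen: string }
  | { kind: "tap"; label: string }
  | { kind: "back" };

// Depth-first search that records the actions an explorer would take:
// discover, try a plausible screen, back out if the target is absent, continue.
function seek(
  graph: Record<string, Screen>,
  start: string,
  target: string,
): Action[] | null {
  const actions: Action[] = [];
  const visited = new Set<string>();

  function visit(name: string): boolean {
    visited.add(name);
    const screen = graph[name];
    if (screen === undefined) return false;
    actions.push({ kind: "discover", screen: name });
    if (screen.elements.includes(target)) {
      actions.push({ kind: "tap", label: target });
      return true;
    }
    for (const [label, dest] of Object.entries(screen.links)) {
      if (visited.has(dest)) continue;
      actions.push({ kind: "tap", label });
      if (visit(dest)) return true;
      actions.push({ kind: "back" });
      // Re-discover after navigating back so later taps stay grounded.
      actions.push({ kind: "discover", screen: name });
    }
    return false;
  }

  return visit(start) ? actions : null;
}

const graph: Record<string, Screen> = {
  Home: {
    elements: ["Profile", "Settings"],
    links: { Profile: "Profile", Settings: "Settings" },
  },
  Profile: { elements: ["Edit name"], links: {} },
  Settings: { elements: ["Dark mode"], links: {} },
};
const trace = seek(graph, "Home", "Dark mode");
```

In this example the explorer tries Profile first, finds no "Dark mode" there, backs out, re-discovers Home, and then succeeds via Settings.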
toGemmaMessages folds system+tools+task into the first user turn, renders tool
calls as <tool_call> text in model turns and tool results into following user
turns (clean user/model alternation for Gemma's template). Strips the constant
describe coordinate-header to shrink sequences; offered-tool count is now
configurable so training can use a lean 16-tool list. training/prepare.ts writes
seed-disjoint train/valid/test JSONL for mlx-lm.
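
The folding described above can be sketched as follows. The `Message` and `GemmaTurn` types, the `toGemmaTurns` name, and the `<tool_response>` wrapper are assumptions for illustration; only the `<tool_call>` tag and the user/model alternation come from the commit message.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

type Message =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls?: ToolCall[] }
  | { role: "tool"; content: string };

interface GemmaTurn {
  role: "user" | "model";
  content: string;
}

function toGemmaTurns(messages: Message[], toolNames: string[]): GemmaTurn[] {
  const turns: GemmaTurn[] = [];
  let pendingPreamble = "";

  // Merge consecutive same-role content so user/model strictly alternate.
  const push = (role: GemmaTurn["role"], content: string): void => {
    const last = turns[turns.length - 1];
    if (last !== undefined && last.role === role) {
      last.content += "\n" + content;
    } else {
      turns.push({ role, content });
    }
  };

  for (const m of messages) {
    if (m.role === "system") {
      // No system role in the target format: hold the policy and tool list
      // until the first user turn and prepend them there.
      pendingPreamble = `${m.content}\nTools: ${toolNames.join(", ")}\n\n`;
    } else if (m.role === "user") {
      push("user", pendingPreamble + m.content);
      pendingPreamble = "";
    } else if (m.role === "assistant") {
      const calls = (m.toolCalls ?? []).map(
        (c) => `<tool_call>${JSON.stringify(c)}</tool_call>`,
      );
      push("model", [m.content, ...calls].filter((s) => s.length > 0).join("\n"));
    } else {
      // Tool results become user turns (wrapper tag is an assumption).
      push("user", `<tool_response>${m.content}</tool_response>`);
    }
  }
  return turns;
}
```

Merging same-role neighbours is one way to keep the alternation clean when a single model turn is followed by several tool results.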
- serve.py: persistent mlx-lm generation server (loads model+adapter once, one
  generation per stdin line, greedy).
- eval.ts: drives a model through the gym on held-out seeds (5M+, disjoint from
  train/valid/test) and scores with the same validators — schema-valid %,
  grounded-tap %, policy violations, and navigation task-success (tapped the
  target element on the target screen). Gym/validators stay the source of truth
  in TS; Python only generates text.
- Validator.checkCall (single-call schema/coord check), buildGemmaFirstUser +
  compactObservation exports shared with the trainer.
- train.sh: LoRA fine-tune Gemma 2 2B 4-bit with --mask-prompt.
Documents the prepare -> train -> eval(base) -> eval(tuned) -> report flow and
what the gym-replay eval measures. train.sh trains on the full sequence (every
tool-call turn) rather than masking to the last turn. report.ts emits the
comparison table. prepare.ts uses a lean 16-tool offered list.
First run OOM'd on Metal (batch 2, seq 3400). Fixes: a ~120-token compact
policy preamble for the Gemma export/eval (the full ~600-token policy was
repeated in every example), and batch-size 1 / max-seq-length 2600 / 8 LoRA
layers. Regenerated data now has a median length of ~2173 tokens (was ~2772), fitting the
sequence cap with no truncation. Train and eval share the compact preamble.
…; faster eval

First stable-memory run diverged — LoRA LR was ~15x too high. Drop to 5e-5.
Also cut eval generation budget to 180 tokens (model turns are short).
…grounded / 44% nav

Base gemma-2-2b-it makes 0 tool calls (chats only). LoRA fine-tuned on 2,500 gym
trajectories (val loss 2.32->0.064), evaluated on 120 held-out tasks through the
gym: schema-valid 99.2%, grounded taps 97.2%, navigation success 44.1%, 7.3 calls/ep.
Per-kind: login 6/6, toggle 6/9, deep-link 4/8, hide-and-seek 5/12, navigate-tap 6/19.
Honest weakness: scroll-find 0/9 (didn't learn scroll-to-reveal). Proves the gym
data teaches Argent tool-use to a tiny model. Results in training/results/.
fuse LoRA into base (dequantize) -> copy SentencePiece tokenizer.model in (MLX
omits it) -> make-modelfile.ts bakes the Argent policy + 16-tool list and a
literal <bos> into a multi-turn Gemma template -> ollama create silver:2b.
Two gotchas documented: missing tokenizer.model (UNK garbage) and missing <bos>
(model chats instead of tool-calling). Verified: emits policy-correct, schema-
valid tool calls and continues across turns from pasted tool responses.
… refresh samples

The grounding gate tracked the most-recent discovery purely by message order
and never invalidated it on a screen change, so a gesture-tap reusing
coordinates from a discovery taken *before* a navigation (launch-app /
open-url / restart-app / button back|home) was accepted even though those
coordinates no longer describe the visible screen. This contradicted the gate's
own promise ("a tap on a freshly-navigated screen can only be grounded by a
discovery taken on that screen"). Clear lastDiscovery on navigation so taps must
be re-grounded by a fresh discovery after every transition.

The expert always re-discovers after navigating, so this rejects zero existing
trajectories (verified: 0/1697 taps flagged across the 800+100 pilot). Adds a
10th selfcheck case (stale discovery across a navigation) as the regression
test; selfcheck is now 10/10.
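
The invalidation rule can be sketched as a small state machine. Tool names follow the commit message (gesture-tap, launch-app, open-url, restart-app, button back|home); the `Call` shape, the `discovered` field, the `button` argument name, and exact-coordinate matching are assumptions for illustration, not the validate.ts implementation.

```typescript
interface Point {
  x: number;
  y: number;
}

interface Call {
  tool: string;
  args?: { x?: number; y?: number; button?: string };
  discovered?: Point[]; // hypothetical: tappable points returned by a discovery call
}

const DISCOVERY = new Set(["describe", "component-tree"]);
const NAVIGATION = new Set(["launch-app", "open-url", "restart-app"]);

// A call changes the visible screen if it is a navigation tool or a back/home press.
function navigates(call: Call): boolean {
  if (NAVIGATION.has(call.tool)) return true;
  const button = call.args?.button;
  return call.tool === "button" && (button === "back" || button === "home");
}

// Returns the index of the first ungrounded tap, or -1 if every tap is grounded.
function firstUngroundedTap(calls: Call[]): number {
  let lastDiscovery: Point[] | null = null;
  for (const [i, call] of calls.entries()) {
    if (DISCOVERY.has(call.tool)) {
      lastDiscovery = call.discovered ?? [];
    } else if (navigates(call)) {
      lastDiscovery = null; // the old coordinates no longer describe the screen
    } else if (call.tool === "gesture-tap") {
      const x = call.args?.x;
      const y = call.args?.y;
      const ok =
        lastDiscovery !== null &&
        lastDiscovery.some((p) => p.x === x && p.y === y);
      if (!ok) return i;
    }
  }
  return -1;
}

const stale: Call[] = [
  { tool: "describe", discovered: [{ x: 100, y: 200 }] },
  { tool: "open-url" },
  { tool: "gesture-tap", args: { x: 100, y: 200 } }, // reuses pre-navigation coordinates
];
const fresh: Call[] = [
  { tool: "describe", discovered: [{ x: 100, y: 200 }] },
  { tool: "open-url" },
  { tool: "describe", discovered: [{ x: 50, y: 60 }] },
  { tool: "gesture-tap", args: { x: 50, y: 60 } },
];
```

Without the reset on navigation, the `stale` sequence would pass, since its tap matches the earlier discovery by message order alone.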

Also regenerate the committed samples/ (stats.json, samples.md, sample.*.jsonl)
which were stale: they predated the phase-2 src changes (scene captions,
personas, hide-and-seek, exporter updates), so a seeded rerun no longer
reproduced them and stats misreported coverage/recovery/persona mix. Now byte-
reproducible from `generate.ts --n 800 --evalN 100 --samples 8` + prettier.
Docs updated to the true current numbers (41/67 tools, ~15% recovery, 17 task
families, selfcheck 10/10).
- clean-base.py: strip 126 redundant shared-KV tensors mlx_lm's strict load rejects
- train.sh: MAXSEQ/NUMLAYERS env overrides (gemma4 template needs seq 3500)
- to-causal.py: rewrite fused checkpoint as text-only Gemma4ForCausalLM
- make-modelfile.ts: FLAVOR=gemma4 emits a SYSTEM-preamble Modelfile (native renderer)
- test-ollama.ts: raw/chat Ollama probes for tool-call verification
- README: e4b pipeline + note that Ollama's gemma4 converter is broken (use llama.cpp)
Ollama defaults gemma4 to num_ctx=4096; agent harnesses (OpenCode, hermes, Claude)
send long system+tools that overflow it, so Ollama truncates the input and the model
emits degenerate 'TheThe...' output until max tokens (hits official gemma4:e4b too).
Raise the baked context to 32768. Also document the global OLLAMA_CONTEXT_LENGTH fix
and the Argent text-tool-call vs native-tool-call caveat.
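
For the text-tool-call side of that caveat, a harness has to extract calls from plain model output. A minimal sketch, assuming each call is a JSON object with `name` and `arguments` wrapped in `<tool_call>` tags; the `parseToolCalls` helper and the exact payload shape are illustrative assumptions.

```typescript
interface ParsedCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Extracts every well-formed <tool_call>{...}</tool_call> block from model text.
// Malformed JSON or payloads without a string name are skipped, not thrown.
function parseToolCalls(text: string): ParsedCall[] {
  const calls: ParsedCall[] = [];
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  for (const match of text.matchAll(pattern)) {
    const body = match[1];
    if (body === undefined) continue;
    let raw: unknown;
    try {
      raw = JSON.parse(body);
    } catch {
      continue; // not JSON: treat as ordinary chat text
    }
    if (typeof raw !== "object" || raw === null) continue;
    const obj = raw as Record<string, unknown>;
    const name = obj["name"];
    const args = obj["arguments"];
    if (typeof name !== "string") continue;
    const isPlainObject =
      typeof args === "object" && args !== null && !Array.isArray(args);
    calls.push({
      name,
      arguments: isPlainObject ? (args as Record<string, unknown>) : {},
    });
  }
  return calls;
}
```

A harness that expects native tool-call messages would never see these calls, which is one way the two formats can fail to meet.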
- serve-vllm.sh: vLLM serve (Gemma team's dev/test rec) for Linux+CUDA; gemma4 is
  in vLLM's registry (Gemma4ForCausalLM, v0.23+). No Apple-Silicon build, so it
  runs on a GPU box against LatekVo/silver, not this Mac.
- serve-mlx.sh: local OpenAI-compatible MLX server (4-bit base + adapter, ~4GB) —
  the practical dev/test path on Apple Silicon.
- README + argent-system.txt: drive either as an Argent agent (system prompt +
  <tool_call> text parsing). MLX server runtime-verification pending eval finish.
Compare {Argent, agent-device} × {silver:e4b, gemma4:e4b, Claude Haiku} on a suite
of toolkit-agnostic iOS-Settings tasks. Uses Hermes (established agent runtime) per
the 'no custom harness' constraint; adds a thin LLM-judge scoring pass (heuristic
proxy until the Anthropic key is supplied).

- tasks.jsonl: 8 deterministic, zero-install Settings tasks + judge goals
- run-cell.sh: one (model × toolkit) cell via headless 'hermes chat' + transcript export
- run-all.sh: the 4 local cells, serialized (memory); Haiku cells deferred for the key
- judge.py: LLM-as-judge (Claude) or labeled heuristic fallback + comparison table
- README: design, Hermes config, prereqs, memory caveats
…(beats 2B's 44%)

120 held-out tasks. silver:e4b: 60.3% nav / 96.2% schema / 87.9% grounded; base just
chats (0% nav). The tuned model was evaluated as a 4-bit merged checkpoint because mlx deadlocks on the
runtime-adapter path for gemma4 (documented in RESULTS-e4b.md).
…xecute silver's tool calls

Smoke-tested both installed harnesses against a real iOS sim + argent:
- Hermes: loads MCP tools but never executes local ollama models' tool calls;
  silver:e4b AND gemma4:e4b base both hallucinate the loop (real sim untouched,
  argent log shows only list_tools, fake udid/2025 timestamps).
- OpenCode (ollama provider added): silver emits text only, zero executed tools.
Root cause: silver is specialized for Argent's <tool_call> text + preamble-in-SYSTEM;
generic harnesses inject their own system prompt + tool format. Documented options for
Monday (recommend: adapt eval.ts to a real-device Argent loop for an apples-to-apples run).