Skip to content

Incremental re-parse: O(damage) edits, a 9MB keystroke in ~0.05ms, proven ≡ fresh#38

Merged
johnsoncodehk merged 15 commits into
masterfrom
incremental
Jun 10, 2026
Merged

Incremental re-parse: O(damage) edits, a 9MB keystroke in ~0.05ms, proven ≡ fresh#38
johnsoncodehk merged 15 commits into
masterfrom
incremental

Conversation

@johnsoncodehk

@johnsoncodehk johnsoncodehk commented Jun 10, 2026

Copy link
Copy Markdown
Owner

parseEdited(source, entryRule?, edits?) re-parses after an edit reusing the previous parse, byte-identical to a from-scratch parse (gate-enforced). Per-edit work is O(damage) in steady state — a keystroke on a 9MB file re-parses in ~0.05ms median (~750× a full parse); batch parsing is unaffected (all mechanisms are sign/branch-dead outside incremental sessions).

Numbers (M2 Max)

shape fresh parse keystroke median p90 speedup
9MB flat body (180k stmts) 569ms 0.04ms 0.07ms ~750×
8MB nested real code 253ms 0.13ms 0.16ms ~700×
81KB editor-typical 2.5ms 0.10ms 0.20ms ~25×

First edit of a session pays one-time costs (suffix flip, spare-buffer allocation, paren-stack scan); every keystroke after is sub-millisecond. Mixed random sessions (the verify gate, far jumps + rejects): 1.6-1.9×.

How

Layered so each mechanism only fires when the one above rejects:

  1. Node surgery (the steady-state path): descend the old tree to the deepest pure list container (a seq of literals/refs around exactly one repetition of a memoized rule — shapes with alts/lookahead at the container's own level are excluded, since a losing arm's probes are owned by no kid row). Re-parse only the affected elements with the real rule functions, require exact rejoin at an old kid boundary, then splice the kid range and patch lengths up the path — all checks happen before any row is mutated. Kept prefix kids are bounded by the same lookahead-watermark rule adoption uses, made transitive by a lazy per-row containment bit.
  2. Old-tree adoption + run-extension (the fallback): every rule entry consults the previous tree (stateless, backtracking-safe); repetition loops bulk-adopt runs of old siblings under a (pos, rule, generation) signature.
  3. Windowed re-lex with depth-tolerant restart/resync; the live paren stack rolls forward from the previous edit's anchor instead of a backward scan.
  4. Sign-relative spans kill the remaining O(n) add-loops: suffix token spans store EOF-relative negatives (updating one length variable is the shift), and a spliced parent's suffix kid rels store parent-end-relative negatives (the parent's length update is the shift). Values self-describe by sign; flip bands are cursor-locality sized.

Soundness

  • The equivalence gate (`test/incremental-verify.ts`) runs seeded random edit sessions (inserts, deletes, statement inserts, syntax-breaking edits) across real files, with explicit edit ranges throughout; every accepted re-parse must be byte-identical (toObject) to a fresh parse, rejects must reject on both sides.
  • Watermark discipline: memo/adoption/surgery all reuse only nodes whose recorded lookahead extent clears the damage (+2 read slack at use sites); reused subtree hits bump the enclosing watermark.
  • The seeded sessions caught three real holes during development (a trivia-boundary length leak — token-inside is not char-inside; an empty-range insertion stitched into a closed node; an Int32 overflow in a boundary remap). All three fixes are in, and the gate is green: 120 steps, 0 mismatches.
  • Full suite: 30/30 gates, emit≡interp parity 0 mismatches, lexer streams byte-identical, batch bench unchanged (11.4-11.6× aggregate).

API

The change protocol is LSP/VS Code's native shape — the engine owns the document text and builds it from the changes, so "the ranges don't match the text" is unrepresentable:

const p = createParser();
const cst = p.parse(text);
p.edit(cst, [{ start, end, text }]);     // each change in the coordinates of the
p.visit(cst, fns); p.tree.ruleNameOf(id);//   document after the preceding ones

The document lives as PIECES (flat fragments; a change splices O(1) string views, never a V8 cons-flatten — which costs ~1.2ms per keystroke on 9MB and was previously paid, unmeasured, by whoever materialized the text). The lexer re-lexes through a small materialized window with retry-on-truncation. The quoted keystroke numbers are therefore END-TO-END: text application included, no hidden caller cost.

There is no toObject: the engine hands out a tree to traverse (visit + accessors), never a materialized object tree — full materialization is a consumer choice (the equivalence gates build their comparison objects from visit in test code).

Contract (gated): a rejected edit throws and leaves the handle on the previous tree, readable (the reject path restores the token columns to the live tree's source); re-opening the document via parse() — even a rejecting one, since its arena reset precedes its outcome — invalidates prior handles; foreign handles throw. Documents are isolated per parser instance (gated: interleaved edit sessions stay byte-identical to fresh parses). Module-level parse/parseEdited keep working on a default document.

… reuse

parseEdited(newSource) re-parses after an edit reusing everything the edit
provably did not touch. No edit protocol: the damage window is DERIVED by
diffing the old and new token columns (longest identical prefix; longest
suffix identical modulo the char delta) — the caller just hands the new text.

Reuse flows through the carried memo. Soundness rests on three pieces:

- Every memo entry records its lookahead WATERMARK (memoExt): the farthest
  token the stored parse may have read — a PEG parse probes beyond its end
  (failed longer arms, not() lookaheads, SECOND-token dispatch). It comes for
  free from the global advance watermark at frame exit; the fixed read slack
  (stop token + SECOND probe, +2) is applied at invalidation time, so the
  stored value stays the true watermark.
- A memo HIT bumps the watermark to the entry's own: the jump semantically
  reads everything the stored parse read, or an enclosing rule completing
  right after a reused subtree would record a watermark smaller than what its
  result depends on (including the child's over-probing failed arms), and a
  later edit in the gap would keep the stale entry alive. Guaranteed batch
  no-op by monotonicity — the 18,805-file byte-identical gate and the exact
  reject-message gate both stay green.
- Prefix entries survive when watermark+slack stays inside the prefix; suffix
  entries shift by the token delta; the damage window drops. The old arena is
  re-based in place (suffix rows by charDelta, reused leaf entries by
  tokenDelta; damage-spanning rows are unreachable garbage), and new rows
  append after the old — a full parse() compacts.

The ENTRY rule's repetition units (Stmt/Decl for TS — derived from the grammar
shape, no language names) now memoize through parseRuleEntry like pratt/
left-rec rules, so whole untouched statements reuse, not just expression
subtrees. Token columns double-buffer across edits (ping-pong, zero
steady-state allocation; the in-place memo variant for token-count changes
measured SLOWER than sparse rebuild — undefined writes materialize holes —
and was reverted).

Gate (test/incremental-verify.ts, #30 in the chain): scripted edit sessions
over the bench files — inserts, deletes, statement insertions, syntax-breaking
edits — every accepted re-parse must be byte-identical (toObject) to a fresh
parse; rejects must reject on both sides. 120 steps, 0 mismatches. Measured:
mixed-session 1.4-1.5x, single-keystroke ~3x, pure-reuse floor ~5.6x; the
remaining cost is full-file relex + diff bookkeeping (windowed relex and the
green {rel,len} re-base are the named follow-ups), the reused parse itself is
~1% of the profile.
…(M1)

The lexer core is parameterized (lexCore): start anywhere with the previous
token's (k, t) as the regex-context seed and empty template/paren stacks.
Every token records its two stack depths (tkDp/tkPd columns); the restart
anchor is the last token before the damage with both records zero and no
live cross-token flag (a control-head ')' or postfix-ambiguous operator),
walking back to the file head in the worst case — always sound.

The window lexes into the spare buffer set (the old stream stays live), and
RESYNC fires at the first token at/past the damage end that aligns with an
old token (same k/t, spans shifted by the char delta) at EQUAL stack depths
where every still-open bracket was opened BEFORE the damage — the byte-equal
prefix guarantees those stack entries agree, while anything opened inside
the damage may differ in control-head-ness and must not span the join. The
depth-tolerant condition matters: an all-wrapping IIFE (typescript.js) keeps
paren depth >= 1 everywhere, and a depth-0-only resync degraded 9MB edits to
~1.2x; with it they reach 2.6x. The splice is copyWithin + a suffix span
shift; the damage window is derived from a char-level prefix/suffix compare
of the two sources (no edit protocol needed).

The true token prefix is recovered by comparing the window's leading tokens
against the old stream before the splice, so the memo carry keeps everything
the re-lex merely re-derived. Fallback-lexer grammars keep the full-relex
path; tokenize() is unchanged for batch (the lexer-equality gate runs the
full streams).

Numbers: 81KB keystroke 3.5x -> 3.3x parse-side with lex now O(damage);
mixed sessions ~1.5-1.65x; 9MB keystroke 2.6x. Remaining per-edit O(n) is
the M3/M4 bookkeeping (memo prefix scans, arena re-base loops, suffix span
shift) — the green {rel,len} re-base and old-tree cursor adoption kill
those next. 30/30 gates; emit≡interp 18,802 byte-identical; reject messages
and token streams exact.
… anchor

The restart anchor no longer requires paren depth zero — inside an
all-wrapping IIFE (typescript.js's 8.9MB bundle) no interior token has it,
so the anchor fell back to the file head and the window re-lexed half the
file. A '(' token now records its control-head-ness as tkFl bit 8, and
reconstructParens rebuilds the live stack enclosing the anchor by a backward
scan: the first '(' recording exactly depth d is the live opener of level d
(closed openers at that depth are re-opened later, and the re-opener comes
first backward). The anchor still requires template depth zero (interp brace
counters are not reconstructable) and additionally must not be a control
KEYWORD — a '(' lexed first in the window would mis-derive its head-ness
from a missing predecessor.

Two boundary bugs found by measurement on the way: lastDp/lastPd (the
"depth before the damage" baseline that resync compares against) must
initialize from the ANCHOR's depths, not zero — with the anchor adjacent to
the edit there are no pre-damage pushes to set them, the baseline froze at
zero, resync never fired inside the IIFE and the window ran to EOF (306ms
edits, worse than the depth-0 restart it replaced); and tokenize() must keep
returning tokN (the lexer-equality gate consumes it).

incremental ≡ fresh 0/120 mismatches; lexer streams, reject messages, and
the 18,805-file byte-identical gate all green; 30/30. 9MB keystroke edits
land at ~120-160ms (machine-thermal band), now dominated by the named
O(n) bookkeeping (memo prefix scans ~9M iterations/edit, arena re-base,
suffix span shift, char-diff scans) — the green {rel,len} + cursor-adoption
milestones' targets.
…vanish

Nodes no longer store absolute positions. A row owns only its LENGTHS (rowLen
chars, rowTokLen tokens); a node child's relative coordinates (kidRel chars,
kidTokRel tokens, both against the parent's start) live in the PARENT's kids
stream, parallel to the entries — NOT on the child row: a memo-reused subtree
can be a child of several longest-match CANDIDATES, and a losing candidate
completing after the winner would clobber child-side rel fields (928 corpus
mismatches before the edge-ownership fix). Leaf entries are node-relative
token indices. The red layer is a descent: visit/toObject thread (charBase,
tokBase); leaf spans come from the token columns at tokBase + rel.

Build stays absolute in TRANSIENT per-row coordinates (absChar/absTok),
written at finishNode/finishWrap, read by the enclosing parent, never part of
the green tree. A memo HIT refreshes the reused root's transients to the
current stream in O(1) — which is the whole point:

- the arena re-base loops (rowOff O(nodes), kids O(kids) per edit) are GONE;
- the '>'-splice kids/scratch fixup is GONE (completed rows lie wholly before
  the splice point; the carried memo is cleared);
- a reused subtree needs zero rewriting at any depth.

Matchers thread one tokBase parameter (leaf spans come from the token
columns, so they never need charBase); the totality gate's visit supplies it.
The ts-ast lowering moves to the INTERPRETER oracle through a new object-tree
TreeAccess adapter (test/obj-tree.ts, absolute coordinates, tokBase ignored)
— the grammar↔tsc structure contract is engine-independent, and the lowering
needed zero semantic changes.

18,802/18,805 emit ≡ interp byte-identical (toObject reproduces absolute
objects exactly); reject messages exact; incremental ≡ fresh 0/120 with the
mixed session at 1.69× (best yet); totality 32,167 nodes / 0 misses; 30/30.
9MB keystrokes ~148ms — now dominated by the memo prefix scans, the suffix
span shift, and the char-diff scans (the cursor-adoption and chunked-column
milestones' targets).
An incremental rule entry now asks the PREVIOUS tree first: adoptSeek walks
the old root toward the mapped old position (cached containment path +
binary search over each node's monotone child starts) and adopts a node when
the rule matches, its lookahead gap stays clear of the damage (rowExt — the
ext-minus-start LENGTH, position-independent like everything green), and the
old parse MEMOIZED it (rowOK): a row built under a suppress (no-'in') or
parseLimit-capped context is a context-dependent parse, and adoption must not
widen the contract the memo carry never offered — skipping that bit produced
real divergences (an incremental reject of text the fresh parse accepts).

Adoption is STATELESS: nothing is consumed, so PEG backtracking needs no
cursor rollback, a node refused under one longest-match candidate can be
adopted by the next, and exploratory descent through same-start chains never
commits to the cache. On adoption: pos jumps by rowTokLen, the watermark
bumps by the gap, the transients refresh — all O(1).

The memo becomes purely intra-parse: parseEdited's whole O(rules × n)
carry/invalidate machinery (the prefix watermark scans, the sparse rebuilds)
is deleted; fresh memo arrays per parse.

incremental ≡ fresh 0/120 with the mixed session at 1.91× (best yet);
9MB keystrokes ~121ms; 18,802/18,805 emit ≡ interp byte-identical; reject
messages exact; 30/30 gates.
The intra-parse memo arrays persist across parses; an entry is live iff its
stamp (a new memoGen Int32Array per rule) equals the current generation, and
bumping the generation counter IS the whole reset — parse(), parseEdited()
and the '>'-splice all just increment it. Allocating fresh multi-million-slot
arrays per edit was ~30% of a large-file edit in GC alone (and pushed V8
toward dictionary elements); now steady-state edits allocate nothing.

9MB keystroke edits: ~121ms -> ~50ms (5.4x vs a full parse); mixed sessions
2.27x. incremental ≡ fresh 0/120; 18,802 byte-identical; reject messages
exact; 30/30 gates.
…scans

An editor knows its edit ranges, so the damage envelope can come from the
caller ([{start, oldEnd, newEnd}], merged over multiple edits) instead of the
char-level prefix/suffix compare — which was the largest remaining O(file)
scan (two charCodeAt sweeps over a 9MB source per keystroke). The compare
stays as the no-protocol fallback.

9MB keystroke edits: ~50ms -> 7.8ms (34.6x vs a full parse), equivalence
verified. What remains per edit is memcpy-grade: the suffix span shift and
the token-column splice (the chunked-columns endgame), the window lex, and
the adoption walk. incremental ≡ fresh 0/120 (the gate exercises the
fallback path); 30/30 gates.
…ack cache

A 9MB flat-body keystroke spent 79% of its 61ms re-entering
parseRuleEntry/adoptSeek once per undamaged statement, and another 7ms
re-deriving the live paren stack by backward scan (the IIFE worst case).

- adoptSeek publishes the hit site (old parent row / kid index / base)
  when the adopted node is the parent's direct kid; parseRuleEntry arms
  a (pos, rid, generation)-signed run signal on such adoptions.
- '*'/'+' loops whose element is a parseRuleEntry-routed rule (pratt /
  left-rec / spine) consume the signal via runExtend: following old
  siblings are adopted in one tight loop under exactly the single-adopt
  eligibility (same-rule row, rowOK, contiguous, damage-clear, non-zero
  width). A member's existence proves the loop's FIRST guard true at its
  position; the signature triple keeps an inner rule's adoption from
  feeding elements into an outer loop. Members skip memo stores - a
  backtracking re-probe just re-adopts.
- reconstructParensCached rolls the previous anchor's stack FORWARD over
  the tokens between the anchors (tokens at/before the cached anchor are
  splice-stable); backward jumps fall back to the full scan. Invalidated
  by full lexes and the '>' splice.
- The spine-rule set moved to Emitter.spineSet(), shared by emitRuleFns
  and the quantifier hook.

9MB IIFE keystroke: 61ms -> 10.4ms (parse 50.5 -> 6.9, parens 7.2 -> 0.2).
Gates: 30/30, incremental 0/120, emit-parser-verify 0 mismatch,
emit-lexer-verify streams equal, batch bench unchanged (11.4x aggregate).
After run-adoption, three O(n)-per-edit costs remained on a 9MB flat
body: the damage-path list parent re-collected all 180k kids through
scratch (and the arena grew by that much per edit), the suffix token
spans took a char-delta add-loop, and the spliced parent's suffix kids
took a rel add-loop.

- Node SURGERY patches the damage path in place. Descend the old tree
  along single-affected-row kids; at the deepest PURE container
  (SURG_ELEM: a seq of literals/refs around exactly one '*'/'+' rep of a
  parseRuleEntry-routed rule - no alt/sep/opt/not at the container's own
  level, so every probe is owned by a kid row), re-parse only the
  affected elements with the real rule fn (adoption reuses their
  undamaged subtrees), require exact rejoin at an old kid start, then
  splice the kid range and patch lengths up the path. Every check runs
  before any row is mutated; any failure falls back to the full adoption
  re-parse. Prefix kids are kept under the adoption watermark rule, made
  transitive by rowKC (lazy kid-containment bit). Pure insertions at a
  kid boundary must touch the rep zone (a neighbour element), or the
  splice would stitch the element into a CLOSED node. Char lengths are
  re-DERIVED from the token columns, not patched by the char delta: a
  pure-trivia edit can sit token-inside but char-outside a node (the gap
  belongs to no node).
- EOF-relative token spans: tkOff/tkEnd at/after the damage store
  value - (srcLen + 1); decode adds the current length back, so updating
  srcLenP1 IS the suffix shift. Values self-describe by sign; negFrom
  bounds the flip band (cursor-locality sized). The '>' splice writes
  its pair sign-consistently with the zone it lands in.
- END-relative kid rels: a row kid's kidRel/kidTokRel may be stored
  relative to the parent's END (strictly negative, decoded with the
  parent's current lengths), so a surgical splice shifts the whole kid
  suffix by updating the parent's lengths. Stable across edits while the
  parent row is untouched; rowNF bounds the per-row band. Leaf kids stay
  start-relative (packed) - a pure container's trailing leaves get an
  O(1) backward walk.
- incremental-verify now alternates the edits-protocol and char-diff
  envelopes, and its seeded sessions caught three real holes during
  development (trivia-boundary length leak, closed-node stitching,
  Int32 overflow in the relocated-range boundary remap).

9MB keystroke: 10.4ms -> median 0.04ms / p90 0.07ms (~750x vs fresh,
steady state; the first edit of a session pays the one-time flip +
buffer allocation). 8MB nested real-code shape: median 0.13ms. 81KB:
median 0.10ms. Batch is sign-clean: emitted aggregate 11.4-11.6x
(unchanged band), 30/30 gates, emit-parser-verify 0 mismatches,
emit-lexer-verify byte-identical streams.
@johnsoncodehk johnsoncodehk changed the title Incremental re-parse: parseEdited() with watermarked memo carry-over, proven ≡ fresh Incremental re-parse: O(damage) edits, a 9MB keystroke in ~0.05ms, proven ≡ fresh Jun 10, 2026
A deep edit under a giant flat list paid an O(suffix-kids) eager rel
walk per keystroke on every ANCESTOR with a large suffix - the band so
far existed only on the surgical container itself. Measured on the 9MB
flat body as ancestor: 0.60ms median / 1.85ms p90 per keystroke.

Ancestors whose rule is a pure container (SURG_ELEM: interior = element
rows only, leaves only as a trailing run) now maintain the same
end-relative band as D: rows entering the suffix flip once (old-length
bias cancels), rows beyond the boundary ride the parent length update,
trailing leaves get the O(1) backward re-pack. Mixed-content ancestors
(interleaved leaves cannot sign-encode inside the packed kid entry)
keep the eager walk - those are the grammar's non-list shapes with
small kid counts.

Nested edit on the 9MB flat body: 0.60ms -> 0.031ms median / 0.047ms
p90. List-level keystroke unchanged (0.04ms median), 8MB nested real
shape 0.13ms, batch aggregate in band (11.2x), 30/30 gates,
incremental-verify 0/120, emit-parser-verify 0 mismatches.
…estart anchor

API rework (the session model made the edit base IMPLICIT - parseEdited
acted on whatever was parsed last, and two interleaved documents shared
one module state):

  const p = createParser();
  const cst = p.parse(text);
  const cst2 = p.edit(cst, next[, edits]);

- Each parser instance owns a DOCUMENT: the 51 per-document fields
  (token columns, arena, kids, memo, session, paren cache, spare
  buffers) live in a doc object; the module-level variables stay the
  ACTIVE REGISTER SET and activate() lazily swaps on instance switch -
  the hot paths never indirect through an object (batch unchanged,
  11.0-11.3x band; handle-API keystroke median 0.06ms).
- Handles are generation-stamped: trees are edited IN PLACE (node
  surgery), so an edit invalidates earlier handles of that parser -
  using one throws instead of silently reading a mutated tree. A
  REJECTED edit leaves the previous handle valid and the next edit
  falls back to a full re-parse internally.
- Module-level parse/parseEdited/visit/toObject keep working on a
  default document (gates/back-compat); the interpreter's createParser
  gains edit() (full re-parse - immutable object trees) for API parity.
- NEW gate test/multi-doc.ts: two instances over two sources, edits
  interleaved with the default doc mixed in - every edited tree must
  equal a fresh parse (a missed swap field = cross-document corruption),
  plus the stale/foreign/reject handle contract.

SOUNDNESS FIX the new smoke test exposed (predates this branch, M1-era):
findRestart anchored at tokens ENDING exactly at the damage start, but
maximal munch lets the edit EXTEND such a token ('b' + inserted 'x'
lexes as 'bx', '=' + '=' as '==', deleting a gap glues neighbours) and
the anchor itself is never re-lexed - the spliced stream then carried
'b','x' as two tokens where a batch lex has one, parsing a DIFFERENT
(sometimes still valid) program. The fixed-seed gate sessions never hit
the abutment in 120 steps. Anchor comparison is now STRICT (<), so the
abutting token falls inside the window and the merge is re-derived;
incremental-verify gains a deterministic boundary-glue session (ident
glue, operator glue, gap deletion, '>>' split sites) so the class stays
pinned.

31/31 gates (multi-doc included), incremental-verify 128 steps 0
mismatch, emit-parser-verify 0 mismatches, agnostic 9/9, batch in band.
…ontract

Returning a new handle from edit() read like value semantics - as if the
old cst survived and edit produced a clone. There is no clone: surgery
mutates the tree in place. The handle is now the STABLE IDENTITY of the
document's tree: p.edit(cst, next) updates cst.root and returns void;
the same reference always reads the current tree.

Making that honest exposed two reject holes the contract tests pinned:

- A rejected edit had already spliced the token columns to the rejected
  text (the splice precedes the parse attempt), so the kept tree's leaf
  spans read corrupted data. The reject path now restores the columns by
  re-lexing the LIVE tree's source (treeSrc, which unlike lastSrc
  survives rejects) - O(n) on the reject path only; #39's recovery mode
  is what makes rejects rare.
- The full-parse fallback inside edit (after a previous reject) went
  through parseCore, whose arena reset destroys the live tree BEFORE
  knowing whether the new text parses. edit now falls back in APPEND
  mode; parse() is the only compaction point - and since its reset
  happens before its outcome is known, parse() bumps the generation on
  entry: old handles die when a document is re-opened, success or not.

Handle contract (gated, 5/5): in-place edit updates the same handle; a
rejected edit throws and keeps the handle on the previous tree
(readable); foreign handles throw; re-opening via parse() - including a
REJECTING parse() - invalidates prior handles. The interpreter's edit()
mirrors the in-place semantics by replacing the tree object's fields.

31/31 gates, incremental-verify 128 steps 0 mismatch, parity 0
mismatches, handle-API keystroke median 0.028ms.
…surface

The arena design's premise (PR #36) is that parse() hands out a tree to
TRAVERSE, not an object tree to materialize - toObject was the
materialization back door left on both the module API and the handle
API, and its only real consumer was the gate layer's byte-identical
JSON comparison. That is a test concern: gates now build the comparison
object through visit + tree accessors (test/emitted-obj.ts, mirroring
the interpreter's object shape and key order exactly, so the emit ≡
interp and incremental ≡ fresh comparisons are unchanged). The unused
emitted getText went with it; the interpreter keeps returning its
native object trees (that IS its representation, not a conversion).

31/31 gates, emit-parser-verify 0 mismatches, multi-doc contract 5/5,
handle-API keystroke median 0.028ms (9MB) / 0.089ms (8MB nested).
The char-diff envelope was the protocol's predecessor left in as a
convenience default - but it silently spends O(file) prefix/suffix
scans, defeating the O(damage) contract exactly for the callers who
reached for the incremental API. Callers that track edits (editors) all
have the ranges; a caller without them passes the whole-file range and
gets an honest full re-parse instead of hidden scans.

The ranges MUST cover every change: over-claiming shrinks via the true
token-prefix compare; under-claiming is the caller's bug (the same
garbage-in contract as tree-sitter's tree.edit, now documented at the
envelope). edit() without ranges throws (gated, contract 6/6); the
seeded sessions and glue cases all pass explicit ranges (the gate keeps
a small test-side diff helper for its constructed pairs).

31/31 gates, parity 0 mismatches, keystroke median 0.028ms.
…O(damage)

Why next had to go: it was the only carrier of the inserted content
(the ranges carry positions, not text), which meant the API could
express 'the text and the ranges disagree' - the garbage-in hazard. The
change protocol [{ start, end, text }] is LSP/VS Code's native shape
(each edit in the coordinates of the document after the preceding ones)
and makes the inconsistency unrepresentable: the engine BUILDS the new
text from the changes.

Building it as a string exposed a cost that was always there, hidden in
the caller: slicing the previous edit's cons string flattens it in V8 -
measured 1.18ms per keystroke on 9MB, paid by whoever materializes the
text. The engine now owns the document as PIECES (flat fragments;
applying a change splits via O(1) SlicedString views and never
flattens), with:

- window-materialized relexing: lexCore reads a small flat slice with
  an absolute srcBase bias (biased once per token in tkPush - batch
  cost is two adds per token); running off the window end - including a
  matcher failing at the EDGE (a truncated string literal is not a lex
  error) - signals a retry with a larger window via LEX_RETRY. A cut
  token cannot fake a resync: suffix-zone equality makes its end
  mismatch the old token's.
- doc reads route through docChar/docText (flat fast path, cursor-
  cached piece lookup otherwise); cold paths (errors, debug) flatten
  lazily; pieces consolidate past 256 fragments (amortized join).
- the reject restore re-lexes the live tree's pieces but preserves the
  DOCUMENT pieces (the editor's buffer holds the rejected text; the
  gates' editor model now advances on reject and verifies an UNDO
  revert edit against a fresh parse every time).

End-to-end keystroke (engine builds the text, nothing hidden): 9MB
median 0.024ms / p90 0.047ms; 8MB nested 0.072ms / p90 0.094ms. 31/31
gates, parity 0 mismatches, lexer streams byte-identical, batch in band
(11.5x).
@johnsoncodehk johnsoncodehk merged commit 70a8019 into master Jun 10, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant