Incremental re-parse: O(damage) edits, a 9MB keystroke in ~0.05ms, proven ≡ fresh#38
Merged
Conversation
… reuse parseEdited(newSource) re-parses after an edit reusing everything the edit provably did not touch. No edit protocol: the damage window is DERIVED by diffing the old and new token columns (longest identical prefix; longest suffix identical modulo the char delta) — the caller just hands the new text. Reuse flows through the carried memo. Soundness rests on three pieces: - Every memo entry records its lookahead WATERMARK (memoExt): the farthest token the stored parse may have read — a PEG parse probes beyond its end (failed longer arms, not() lookaheads, SECOND-token dispatch). It comes for free from the global advance watermark at frame exit; the fixed read slack (stop token + SECOND probe, +2) is applied at invalidation time, so the stored value stays the true watermark. - A memo HIT bumps the watermark to the entry's own: the jump semantically reads everything the stored parse read, or an enclosing rule completing right after a reused subtree would record a watermark smaller than what its result depends on (including the child's over-probing failed arms), and a later edit in the gap would keep the stale entry alive. Guaranteed batch no-op by monotonicity — the 18,805-file byte-identical gate and the exact reject-message gate both stay green. - Prefix entries survive when watermark+slack stays inside the prefix; suffix entries shift by the token delta; the damage window drops. The old arena is re-based in place (suffix rows by charDelta, reused leaf entries by tokenDelta; damage-spanning rows are unreachable garbage), and new rows append after the old — a full parse() compacts. The ENTRY rule's repetition units (Stmt/Decl for TS — derived from the grammar shape, no language names) now memoize through parseRuleEntry like pratt/ left-rec rules, so whole untouched statements reuse, not just expression subtrees. Token columns double-buffer across edits (ping-pong, zero steady-state allocation; the in-place memo variant for token-count changes measured SLOWER than sparse rebuild — undefined writes materialize holes — and was reverted). Gate (test/incremental-verify.ts, #30 in the chain): scripted edit sessions over the bench files — inserts, deletes, statement insertions, syntax-breaking edits — every accepted re-parse must be byte-identical (toObject) to a fresh parse; rejects must reject on both sides. 120 steps, 0 mismatches. Measured: mixed-session 1.4-1.5x, single-keystroke ~3x, pure-reuse floor ~5.6x; the remaining cost is full-file relex + diff bookkeeping (windowed relex and the green {rel,len} re-base are the named follow-ups), the reused parse itself is ~1% of the profile.
…(M1)
The lexer core is parameterized (lexCore): start anywhere with the previous
token's (k, t) as the regex-context seed and empty template/paren stacks.
Every token records its two stack depths (tkDp/tkPd columns); the restart
anchor is the last token before the damage with both records zero and no
live cross-token flag (a control-head ')' or postfix-ambiguous operator),
walking back to the file head in the worst case — always sound.
The window lexes into the spare buffer set (the old stream stays live), and
RESYNC fires at the first token at/past the damage end that aligns with an
old token (same k/t, spans shifted by the char delta) at EQUAL stack depths
where every still-open bracket was opened BEFORE the damage — the byte-equal
prefix guarantees those stack entries agree, while anything opened inside
the damage may differ in control-head-ness and must not span the join. The
depth-tolerant condition matters: an all-wrapping IIFE (typescript.js) keeps
paren depth >= 1 everywhere, and a depth-0-only resync degraded 9MB edits to
~1.2x; with it they reach 2.6x. The splice is copyWithin + a suffix span
shift; the damage window is derived from a char-level prefix/suffix compare
of the two sources (no edit protocol needed).
The true token prefix is recovered by comparing the window's leading tokens
against the old stream before the splice, so the memo carry keeps everything
the re-lex merely re-derived. Fallback-lexer grammars keep the full-relex
path; tokenize() is unchanged for batch (the lexer-equality gate runs the
full streams).
Numbers: 81KB keystroke 3.5x -> 3.3x parse-side with lex now O(damage);
mixed sessions ~1.5-1.65x; 9MB keystroke 2.6x. Remaining per-edit O(n) is
the M3/M4 bookkeeping (memo prefix scans, arena re-base loops, suffix span
shift) — the green {rel,len} re-base and old-tree cursor adoption kill
those next. 30/30 gates; emit≡interp 18,802 byte-identical; reject messages
and token streams exact.
… anchor
The restart anchor no longer requires paren depth zero — inside an
all-wrapping IIFE (typescript.js's 8.9MB bundle) no interior token has it,
so the anchor fell back to the file head and the window re-lexed half the
file. A '(' token now records its control-head-ness as tkFl bit 8, and
reconstructParens rebuilds the live stack enclosing the anchor by a backward
scan: the first '(' recording exactly depth d is the live opener of level d
(closed openers at that depth are re-opened later, and the re-opener comes
first backward). The anchor still requires template depth zero (interp brace
counters are not reconstructable) and additionally must not be a control
KEYWORD — a '(' lexed first in the window would mis-derive its head-ness
from a missing predecessor.
Two boundary bugs found by measurement on the way: lastDp/lastPd (the
"depth before the damage" baseline that resync compares against) must
initialize from the ANCHOR's depths, not zero — with the anchor adjacent to
the edit there are no pre-damage pushes to set them, the baseline froze at
zero, resync never fired inside the IIFE and the window ran to EOF (306ms
edits, worse than the depth-0 restart it replaced); and tokenize() must keep
returning tokN (the lexer-equality gate consumes it).
incremental ≡ fresh 0/120 mismatches; lexer streams, reject messages, and
the 18,805-file byte-identical gate all green; 30/30. 9MB keystroke edits
land at ~120-160ms (machine-thermal band), now dominated by the named
O(n) bookkeeping (memo prefix scans ~9M iterations/edit, arena re-base,
suffix span shift, char-diff scans) — the green {rel,len} + cursor-adoption
milestones' targets.
…vanish Nodes no longer store absolute positions. A row owns only its LENGTHS (rowLen chars, rowTokLen tokens); a node child's relative coordinates (kidRel chars, kidTokRel tokens, both against the parent's start) live in the PARENT's kids stream, parallel to the entries — NOT on the child row: a memo-reused subtree can be a child of several longest-match CANDIDATES, and a losing candidate completing after the winner would clobber child-side rel fields (928 corpus mismatches before the edge-ownership fix). Leaf entries are node-relative token indices. The red layer is a descent: visit/toObject thread (charBase, tokBase); leaf spans come from the token columns at tokBase + rel. Build stays absolute in TRANSIENT per-row coordinates (absChar/absTok), written at finishNode/finishWrap, read by the enclosing parent, never part of the green tree. A memo HIT refreshes the reused root's transients to the current stream in O(1) — which is the whole point: - the arena re-base loops (rowOff O(nodes), kids O(kids) per edit) are GONE; - the '>'-splice kids/scratch fixup is GONE (completed rows lie wholly before the splice point; the carried memo is cleared); - a reused subtree needs zero rewriting at any depth. Matchers thread one tokBase parameter (leaf spans come from the token columns, so they never need charBase); the totality gate's visit supplies it. The ts-ast lowering moves to the INTERPRETER oracle through a new object-tree TreeAccess adapter (test/obj-tree.ts, absolute coordinates, tokBase ignored) — the grammar↔tsc structure contract is engine-independent, and the lowering needed zero semantic changes. 18,802/18,805 emit ≡ interp byte-identical (toObject reproduces absolute objects exactly); reject messages exact; incremental ≡ fresh 0/120 with the mixed session at 1.69× (best yet); totality 32,167 nodes / 0 misses; 30/30. 9MB keystrokes ~148ms — now dominated by the memo prefix scans, the suffix span shift, and the char-diff scans (the cursor-adoption and chunked-column milestones' targets).
An incremental rule entry now asks the PREVIOUS tree first: adoptSeek walks the old root toward the mapped old position (cached containment path + binary search over each node's monotone child starts) and adopts a node when the rule matches, its lookahead gap stays clear of the damage (rowExt — the ext-minus-start LENGTH, position-independent like everything green), and the old parse MEMOIZED it (rowOK): a row built under a suppress (no-'in') or parseLimit-capped context is a context-dependent parse, and adoption must not widen the contract the memo carry never offered — skipping that bit produced real divergences (an incremental reject of text the fresh parse accepts). Adoption is STATELESS: nothing is consumed, so PEG backtracking needs no cursor rollback, a node refused under one longest-match candidate can be adopted by the next, and exploratory descent through same-start chains never commits to the cache. On adoption: pos jumps by rowTokLen, the watermark bumps by the gap, the transients refresh — all O(1). The memo becomes purely intra-parse: parseEdited's whole O(rules × n) carry/invalidate machinery (the prefix watermark scans, the sparse rebuilds) is deleted; fresh memo arrays per parse. incremental ≡ fresh 0/120 with the mixed session at 1.91× (best yet); 9MB keystrokes ~121ms; 18,802/18,805 emit ≡ interp byte-identical; reject messages exact; 30/30 gates.
The intra-parse memo arrays persist across parses; an entry is live iff its stamp (a new memoGen Int32Array per rule) equals the current generation, and bumping the generation counter IS the whole reset — parse(), parseEdited() and the '>'-splice all just increment it. Allocating fresh multi-million-slot arrays per edit was ~30% of a large-file edit in GC alone (and pushed V8 toward dictionary elements); now steady-state edits allocate nothing. 9MB keystroke edits: ~121ms -> ~50ms (5.4x vs a full parse); mixed sessions 2.27x. incremental ≡ fresh 0/120; 18,802 byte-identical; reject messages exact; 30/30 gates.
…scans
An editor knows its edit ranges, so the damage envelope can come from the
caller ([{start, oldEnd, newEnd}], merged over multiple edits) instead of the
char-level prefix/suffix compare — which was the largest remaining O(file)
scan (two charCodeAt sweeps over a 9MB source per keystroke). The compare
stays as the no-protocol fallback.
9MB keystroke edits: ~50ms -> 7.8ms (34.6x vs a full parse), equivalence
verified. What remains per edit is memcpy-grade: the suffix span shift and
the token-column splice (the chunked-columns endgame), the window lex, and
the adoption walk. incremental ≡ fresh 0/120 (the gate exercises the
fallback path); 30/30 gates.
…ack cache A 9MB flat-body keystroke spent 79% of its 61ms re-entering parseRuleEntry/adoptSeek once per undamaged statement, and another 7ms re-deriving the live paren stack by backward scan (the IIFE worst case). - adoptSeek publishes the hit site (old parent row / kid index / base) when the adopted node is the parent's direct kid; parseRuleEntry arms a (pos, rid, generation)-signed run signal on such adoptions. - '*'/'+' loops whose element is a parseRuleEntry-routed rule (pratt / left-rec / spine) consume the signal via runExtend: following old siblings are adopted in one tight loop under exactly the single-adopt eligibility (same-rule row, rowOK, contiguous, damage-clear, non-zero width). A member's existence proves the loop's FIRST guard true at its position; the signature triple keeps an inner rule's adoption from feeding elements into an outer loop. Members skip memo stores - a backtracking re-probe just re-adopts. - reconstructParensCached rolls the previous anchor's stack FORWARD over the tokens between the anchors (tokens at/before the cached anchor are splice-stable); backward jumps fall back to the full scan. Invalidated by full lexes and the '>' splice. - The spine-rule set moved to Emitter.spineSet(), shared by emitRuleFns and the quantifier hook. 9MB IIFE keystroke: 61ms -> 10.4ms (parse 50.5 -> 6.9, parens 7.2 -> 0.2). Gates: 30/30, incremental 0/120, emit-parser-verify 0 mismatch, emit-lexer-verify streams equal, batch bench unchanged (11.4x aggregate).
After run-adoption, three O(n)-per-edit costs remained on a 9MB flat body: the damage-path list parent re-collected all 180k kids through scratch (and the arena grew by that much per edit), the suffix token spans took a char-delta add-loop, and the spliced parent's suffix kids took a rel add-loop. - Node SURGERY patches the damage path in place. Descend the old tree along single-affected-row kids; at the deepest PURE container (SURG_ELEM: a seq of literals/refs around exactly one '*'/'+' rep of a parseRuleEntry-routed rule - no alt/sep/opt/not at the container's own level, so every probe is owned by a kid row), re-parse only the affected elements with the real rule fn (adoption reuses their undamaged subtrees), require exact rejoin at an old kid start, then splice the kid range and patch lengths up the path. Every check runs before any row is mutated; any failure falls back to the full adoption re-parse. Prefix kids are kept under the adoption watermark rule, made transitive by rowKC (lazy kid-containment bit). Pure insertions at a kid boundary must touch the rep zone (a neighbour element), or the splice would stitch the element into a CLOSED node. Char lengths are re-DERIVED from the token columns, not patched by the char delta: a pure-trivia edit can sit token-inside but char-outside a node (the gap belongs to no node). - EOF-relative token spans: tkOff/tkEnd at/after the damage store value - (srcLen + 1); decode adds the current length back, so updating srcLenP1 IS the suffix shift. Values self-describe by sign; negFrom bounds the flip band (cursor-locality sized). The '>' splice writes its pair sign-consistently with the zone it lands in. - END-relative kid rels: a row kid's kidRel/kidTokRel may be stored relative to the parent's END (strictly negative, decoded with the parent's current lengths), so a surgical splice shifts the whole kid suffix by updating the parent's lengths. Stable across edits while the parent row is untouched; rowNF bounds the per-row band. Leaf kids stay start-relative (packed) - a pure container's trailing leaves get an O(1) backward walk. - incremental-verify now alternates the edits-protocol and char-diff envelopes, and its seeded sessions caught three real holes during development (trivia-boundary length leak, closed-node stitching, Int32 overflow in the relocated-range boundary remap). 9MB keystroke: 10.4ms -> median 0.04ms / p90 0.07ms (~750x vs fresh, steady state; the first edit of a session pays the one-time flip + buffer allocation). 8MB nested real-code shape: median 0.13ms. 81KB: median 0.10ms. Batch is sign-clean: emitted aggregate 11.4-11.6x (unchanged band), 30/30 gates, emit-parser-verify 0 mismatches, emit-lexer-verify byte-identical streams.
A deep edit under a giant flat list paid an O(suffix-kids) eager rel walk per keystroke on every ANCESTOR with a large suffix - the band so far existed only on the surgical container itself. Measured on the 9MB flat body as ancestor: 0.60ms median / 1.85ms p90 per keystroke. Ancestors whose rule is a pure container (SURG_ELEM: interior = element rows only, leaves only as a trailing run) now maintain the same end-relative band as D: rows entering the suffix flip once (old-length bias cancels), rows beyond the boundary ride the parent length update, trailing leaves get the O(1) backward re-pack. Mixed-content ancestors (interleaved leaves cannot sign-encode inside the packed kid entry) keep the eager walk - those are the grammar's non-list shapes with small kid counts. Nested edit on the 9MB flat body: 0.60ms -> 0.031ms median / 0.047ms p90. List-level keystroke unchanged (0.04ms median), 8MB nested real shape 0.13ms, batch aggregate in band (11.2x), 30/30 gates, incremental-verify 0/120, emit-parser-verify 0 mismatches.
…estart anchor
API rework (the session model made the edit base IMPLICIT - parseEdited
acted on whatever was parsed last, and two interleaved documents shared
one module state):
const p = createParser();
const cst = p.parse(text);
const cst2 = p.edit(cst, next[, edits]);
- Each parser instance owns a DOCUMENT: the 51 per-document fields
(token columns, arena, kids, memo, session, paren cache, spare
buffers) live in a doc object; the module-level variables stay the
ACTIVE REGISTER SET and activate() lazily swaps on instance switch -
the hot paths never indirect through an object (batch unchanged,
11.0-11.3x band; handle-API keystroke median 0.06ms).
- Handles are generation-stamped: trees are edited IN PLACE (node
surgery), so an edit invalidates earlier handles of that parser -
using one throws instead of silently reading a mutated tree. A
REJECTED edit leaves the previous handle valid and the next edit
falls back to a full re-parse internally.
- Module-level parse/parseEdited/visit/toObject keep working on a
default document (gates/back-compat); the interpreter's createParser
gains edit() (full re-parse - immutable object trees) for API parity.
- NEW gate test/multi-doc.ts: two instances over two sources, edits
interleaved with the default doc mixed in - every edited tree must
equal a fresh parse (a missed swap field = cross-document corruption),
plus the stale/foreign/reject handle contract.
SOUNDNESS FIX the new smoke test exposed (predates this branch, M1-era):
findRestart anchored at tokens ENDING exactly at the damage start, but
maximal munch lets the edit EXTEND such a token ('b' + inserted 'x'
lexes as 'bx', '=' + '=' as '==', deleting a gap glues neighbours) and
the anchor itself is never re-lexed - the spliced stream then carried
'b','x' as two tokens where a batch lex has one, parsing a DIFFERENT
(sometimes still valid) program. The fixed-seed gate sessions never hit
the abutment in 120 steps. Anchor comparison is now STRICT (<), so the
abutting token falls inside the window and the merge is re-derived;
incremental-verify gains a deterministic boundary-glue session (ident
glue, operator glue, gap deletion, '>>' split sites) so the class stays
pinned.
31/31 gates (multi-doc included), incremental-verify 128 steps 0
mismatch, emit-parser-verify 0 mismatches, agnostic 9/9, batch in band.
…ontract Returning a new handle from edit() read like value semantics - as if the old cst survived and edit produced a clone. There is no clone: surgery mutates the tree in place. The handle is now the STABLE IDENTITY of the document's tree: p.edit(cst, next) updates cst.root and returns void; the same reference always reads the current tree. Making that honest exposed two reject holes the contract tests pinned: - A rejected edit had already spliced the token columns to the rejected text (the splice precedes the parse attempt), so the kept tree's leaf spans read corrupted data. The reject path now restores the columns by re-lexing the LIVE tree's source (treeSrc, which unlike lastSrc survives rejects) - O(n) on the reject path only; #39's recovery mode is what makes rejects rare. - The full-parse fallback inside edit (after a previous reject) went through parseCore, whose arena reset destroys the live tree BEFORE knowing whether the new text parses. edit now falls back in APPEND mode; parse() is the only compaction point - and since its reset happens before its outcome is known, parse() bumps the generation on entry: old handles die when a document is re-opened, success or not. Handle contract (gated, 5/5): in-place edit updates the same handle; a rejected edit throws and keeps the handle on the previous tree (readable); foreign handles throw; re-opening via parse() - including a REJECTING parse() - invalidates prior handles. The interpreter's edit() mirrors the in-place semantics by replacing the tree object's fields. 31/31 gates, incremental-verify 128 steps 0 mismatch, parity 0 mismatches, handle-API keystroke median 0.028ms.
…surface The arena design's premise (PR #36) is that parse() hands out a tree to TRAVERSE, not an object tree to materialize - toObject was the materialization back door left on both the module API and the handle API, and its only real consumer was the gate layer's byte-identical JSON comparison. That is a test concern: gates now build the comparison object through visit + tree accessors (test/emitted-obj.ts, mirroring the interpreter's object shape and key order exactly, so the emit ≡ interp and incremental ≡ fresh comparisons are unchanged). The unused emitted getText went with it; the interpreter keeps returning its native object trees (that IS its representation, not a conversion). 31/31 gates, emit-parser-verify 0 mismatches, multi-doc contract 5/5, handle-API keystroke median 0.028ms (9MB) / 0.089ms (8MB nested).
The char-diff envelope was the protocol's predecessor left in as a convenience default - but it silently spends O(file) prefix/suffix scans, defeating the O(damage) contract exactly for the callers who reached for the incremental API. Callers that track edits (editors) all have the ranges; a caller without them passes the whole-file range and gets an honest full re-parse instead of hidden scans. The ranges MUST cover every change: over-claiming shrinks via the true token-prefix compare; under-claiming is the caller's bug (the same garbage-in contract as tree-sitter's tree.edit, now documented at the envelope). edit() without ranges throws (gated, contract 6/6); the seeded sessions and glue cases all pass explicit ranges (the gate keeps a small test-side diff helper for its constructed pairs). 31/31 gates, parity 0 mismatches, keystroke median 0.028ms.
…O(damage)
Why next had to go: it was the only carrier of the inserted content
(the ranges carry positions, not text), which meant the API could
express 'the text and the ranges disagree' - the garbage-in hazard. The
change protocol [{ start, end, text }] is LSP/VS Code's native shape
(each edit in the coordinates of the document after the preceding ones)
and makes the inconsistency unrepresentable: the engine BUILDS the new
text from the changes.
Building it as a string exposed a cost that was always there, hidden in
the caller: slicing the previous edit's cons string flattens it in V8 -
measured 1.18ms per keystroke on 9MB, paid by whoever materializes the
text. The engine now owns the document as PIECES (flat fragments;
applying a change splits via O(1) SlicedString views and never
flattens), with:
- window-materialized relexing: lexCore reads a small flat slice with
an absolute srcBase bias (biased once per token in tkPush - batch
cost is two adds per token); running off the window end - including a
matcher failing at the EDGE (a truncated string literal is not a lex
error) - signals a retry with a larger window via LEX_RETRY. A cut
token cannot fake a resync: suffix-zone equality makes its end
mismatch the old token's.
- doc reads route through docChar/docText (flat fast path, cursor-
cached piece lookup otherwise); cold paths (errors, debug) flatten
lazily; pieces consolidate past 256 fragments (amortized join).
- the reject restore re-lexes the live tree's pieces but preserves the
DOCUMENT pieces (the editor's buffer holds the rejected text; the
gates' editor model now advances on reject and verifies an UNDO
revert edit against a fresh parse every time).
End-to-end keystroke (engine builds the text, nothing hidden): 9MB
median 0.024ms / p90 0.047ms; 8MB nested 0.072ms / p90 0.094ms. 31/31
gates, parity 0 mismatches, lexer streams byte-identical, batch in band
(11.5x).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
parseEdited(source, entryRule?, edits?)re-parses after an edit reusing the previous parse, byte-identical to a from-scratch parse (gate-enforced). Per-edit work is O(damage) in steady state — a keystroke on a 9MB file re-parses in ~0.05ms median (~750× a full parse); batch parsing is unaffected (all mechanisms are sign/branch-dead outside incremental sessions).Numbers (M2 Max)
First edit of a session pays one-time costs (suffix flip, spare-buffer allocation, paren-stack scan); every keystroke after is sub-millisecond. Mixed random sessions (the verify gate, far jumps + rejects): 1.6-1.9×.
How
Layered so each mechanism only fires when the one above rejects:
Soundness
API
The change protocol is LSP/VS Code's native shape — the engine owns the document text and builds it from the changes, so "the ranges don't match the text" is unrepresentable:
The document lives as PIECES (flat fragments; a change splices O(1) string views, never a V8 cons-flatten — which costs ~1.2ms per keystroke on 9MB and was previously paid, unmeasured, by whoever materialized the text). The lexer re-lexes through a small materialized window with retry-on-truncation. The quoted keystroke numbers are therefore END-TO-END: text application included, no hidden caller cost.
There is no
toObject: the engine hands out a tree to traverse (visit + accessors), never a materialized object tree — full materialization is a consumer choice (the equivalence gates build their comparison objects from visit in test code).Contract (gated): a rejected edit throws and leaves the handle on the previous tree, readable (the reject path restores the token columns to the live tree's source); re-opening the document via
parse()— even a rejecting one, since its arena reset precedes its outcome — invalidates prior handles; foreign handles throw. Documents are isolated per parser instance (gated: interleaved edit sessions stay byte-identical to fresh parses). Module-levelparse/parseEditedkeep working on a default document.