Skip to content

Error-tolerant parsing mode: a live tree through transiently-invalid edit states #39

@johnsoncodehk

Description

@johnsoncodehk

Motivation

The incremental parser (#38) keeps a keystroke at ~0.05ms on multi-MB files, but only for inputs the grammar accepts. A typing session spends most of its time in transiently invalid states (const is already invalid), and the current session policy — reject on both sides, keep the previous tree — leaves the tree stale through every burst of typing until the text parses again.

Error tolerance is what turns the incremental parser into an editor engine (and what a parser-as-product IDE adopter needs). It is also the one capability tree-sitter/Lezer have that this parser deliberately lacks today.

Hard constraint: recovery is a MODE

Conformance is a product guarantee (bidirectional vs tsc, byte-exact reject messages, emit ≡ interp parity). Therefore:

  • Strict mode (default) does not change at all. Every existing gate keeps running against strict mode and must stay byte-identical.
  • Recovery mode is opt-in (e.g. parse(src, entry, { recover: true }) / a recovering session) and only does extra work on inputs strict mode rejects: a valid input must produce the IDENTICAL tree in both modes.

Design sketch

  • Recover at the spine/rep sync points (statement granularity first). When an element fails inside a repetition, skip tokens to a sync set, emit an error node spanning the skipped region, and continue the loop. The sync set must be grammar-derived (rep-element FIRST sets + enclosing closers) — no language names (the agnostic gate applies).
  • Error nodes are ordinary rows (a reserved rule id): the green tree, visit/toObject, matchers, adoption and node surgery all generalize without new cases — an error row adopts and splices like any row.
  • The equivalence invariant survives: incremental-with-recovery ≡ fresh-with-recovery, byte-identical, gated by seeded sessions whose syntax-breaking edits must now produce identical trees instead of falling into the both-reject bucket. (Tree-sitter does not make this guarantee — its edited trees can diverge from fresh ones in error scenarios.)
  • Diagnostics instead of throws: recovery-mode parse never throws; it returns the root plus a diagnostics list (offset, skipped span, the strict-mode expectation info).

Open questions

  • Error-node shape: span-only (skip-to-sync) vs keeping the longest partial parse of the broken element (tsc-style incomplete nodes give better trees for IDE features, at a much larger surface). Start span-only?
  • Expression-internal recovery (Pratt/left-rec arms) — out of scope for the first round; statement-granularity recovery already bounds the damage to one element.
  • Watermark semantics for error rows: rowExt must cover the skipped region plus every probe of the failed arms, or a later edit inside the skipped span could keep a stale error row alive (the memo/adoption soundness argument extends, but needs its own gate-witnessed proof).
  • ASI interaction: a skip-to-sync that crosses a newline must not change how the following statement's ASI behaves (sync sets vs sameLine connectors).

Acceptance criteria

  • Strict mode: all existing gates pass unchanged (30/30, conformance matrix, parity, reject-message equality).
  • Recovery mode totality: never throws across the generative/fuzz corpus + the edit-session corpus.
  • Valid inputs: recovery tree ≡ strict tree, byte-identical.
  • Invalid inputs: error-node spans tile the source (CST text invariant holds); diagnostics non-empty exactly when strict rejects.
  • Incremental: recovery sessions stay O(damage) steady-state (surgery/adoption around error rows), and incremental-with-recovery ≡ fresh-with-recovery over seeded sessions including syntax-breaking edits.

Context

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions