Open
23 commits
b397356
WIP issue #39: total parse/edit with cst.errors - one residual equiva…
johnsoncodehk Jun 11, 2026
dc10568
Total parse/edit complete: a latent Pratt watermark hole closed, equi…
johnsoncodehk Jun 11, 2026
e4fc2f3
Gate the expression-splitting ';' injection class
johnsoncodehk Jun 11, 2026
05c6284
Cross-grammar incremental gate: all 7 grammars, edit ≡ fresh + self-c…
johnsoncodehk Jun 11, 2026
3e7f1d6
Missing-token synthesis: tsc-style "expected 'x'" with structure pres…
johnsoncodehk Jun 11, 2026
bf771a1
Missing-nonterminal synthesis: the tsc "Expression expected" analog
johnsoncodehk Jun 11, 2026
2245f0b
Broken-state edits go incremental: recovering adoption under bar purity
johnsoncodehk Jun 11, 2026
ee1890d
Cross-attempt memo survival: bar-free windows are context-free
johnsoncodehk Jun 11, 2026
b37e1cc
Conditional lexer resync: depth-shift adoption kills the transition c…
johnsoncodehk Jun 11, 2026
4248105
Recovering surgery: bar-clear splices keep the error tree incremental
johnsoncodehk Jun 11, 2026
668f8f5
Diagnostics: viable-set messages + paired-opener related info
johnsoncodehk Jun 11, 2026
2c6e593
Head-to-head bench: Monogram vs tsc updateSourceFile vs tree-sitter
johnsoncodehk Jun 11, 2026
71e14a7
Error-recovery conformance metric: bidirectional agreement vs tsc
johnsoncodehk Jun 11, 2026
f0d2c75
Reject unterminated templates and colon-less case clauses
johnsoncodehk Jun 11, 2026
25b78ba
Formal write-up + bounded-exhaustive edit gate
johnsoncodehk Jun 11, 2026
397a76d
Attribute the transition-edit cost to what profiling actually shows
johnsoncodehk Jun 11, 2026
476ab69
Row-level taint + reject body-less class expressions
johnsoncodehk Jun 11, 2026
d61726b
O(1) shifted-resync check at depth 0 via a pop-on-empty index list
johnsoncodehk Jun 11, 2026
3d8f494
Block bare statement keywords as expressions; for-in takes comma objects
johnsoncodehk Jun 11, 2026
f8a5742
Roadmap: enumerate the parser-acceptance long tail vs tsc
johnsoncodehk Jun 11, 2026
d37332b
Decorators prefix class members; orphan and post-modifier decorators …
johnsoncodehk Jun 11, 2026
d77b803
A ';'-less class field rejects a same-line decorator after it
johnsoncodehk Jun 11, 2026
777fe21
Lexer resync also validates the candidate's leading-trivia flags
johnsoncodehk Jun 11, 2026
17 changes: 17 additions & 0 deletions README.md
@@ -227,6 +227,23 @@ The **only-Monogram** wins above are all disambiguations that are *TextMate-expr…

"TextMate can't express X" is not a guess or an assertion; it is a claim to be **proven from the model**. TextMate is a line-oriented matcher whose only cross-line memory is a finite stack of scope contexts, so a proof exhibits an X whose correct highlighting provably needs memory that model lacks — unbounded lookback to a token that is not an enclosing context. A failed *attempt* to derive a pattern is not such a proof: a cleverer pattern may exist, and most "impossible for TextMate" folklore is exactly this error — the multiline / nested-generic cases turn out TM-expressible once a parser supplies the pattern, which is why the derived grammar gets them right. Where a construct provably exceeds the model, Monogram's **tree-sitter** target — a real parser over the whole tree — resolves it.

### Total parsing under edits — measured against tsc and tree-sitter

The handle API (`createParser()`) is **total**: every text yields a tree plus `cst.errors`, with tsc-grade diagnostics (`expected ',' or ']'` where every listed token is *provably* still accepted at that position, `to match this '('` related info, zero-width `$missing` nodes that keep a call's shape when its `)` is missing). Two structural guarantees back it:

- **The valid path is byte-identical to the strict parser** — recovery runs only after a strict pass has rejected, so error tolerance costs valid input nothing, by construction.
- **Every edited re-parse is byte-identical to a fresh parse** of the same text — tree *and* errors, broken states included, held exact by generative edit scripts across all seven grammars in CI (`test/incremental-grammars.ts`).

One 9 MB TypeScript document, identical single-character edit scripts (`test/head-to-head.ts`, node v24, Apple silicon; ✎ = per keystroke, median):

| engine | fresh parse | valid ✎ | breaking ✎ | while-broken ✎ | fixing ✎ |
|---|---:|---:|---:|---:|---:|
| **Monogram** | **167 ms** | 0.37 ms | 12 ms | **0.22 ms** | 2.2 ms |
| tsc `updateSourceFile` | 207 ms | 35 ms | 12.0 ms | 11.9 ms | 11.9 ms |
| tree-sitter (official) | 430 ms | **0.18 ms** | **0.29 ms** | 0.30 ms | **0.22 ms** |

Monogram matches or beats tsc on every phase (valid typing ~100×, while-broken ~50×; the breaking edit ties at 12 ms). Against tree-sitter it wins the fresh parse and the while-broken keystroke, trails by about 2× on the valid keystroke (0.37 ms vs 0.18 ms), and loses the two **transition** edits (break/fix). Profiling attributes the transition cost almost entirely to the bench's 4.5 MB cursor jump: token-column offsets are EOF-relative-biased so that local typing never rewrites the suffix (that is what makes the valid keystroke 0.37 ms), and the bias boundary moves with the cursor. A far jump pays once, proportional to the jump distance; repeated break/fix transitions at that position then settle to **~1.6–2 ms** (the parser passes measure under 1 ms of that).

## What you get

From one grammar definition (a small TypeScript combinator API), five outputs are **fully functional**:
3 changes: 3 additions & 0 deletions ROADMAP.md
@@ -26,6 +26,9 @@ Three parser-grounded layers (in `test/`), each comparing against the language's…

## What's next

- **Parser-acceptance long tail vs tsc** (measured by `test/recovery-conformance.ts`: recall 61.2%, 108 conformance files we parse-accept that tsc's parser rejects). The remainder is fully enumerated, two buckets:
  - **`[Await]`/`[Yield]` parameter contexts** (31 files): `await`/`yield` must be reserved *inside* async/generator bodies and parameter lists, identifiers elsewhere. Needs a context-threading mechanism in the engine — the same shape as `exclude('in', …)` for the no-`in` context, but suppressing identifier *texts* over a subtree. Designed direction, not yet built.
  - **Per-shape strictness** (77 files, each class small and named): declaration-modifier ordering (`public @dec method`), private names outside classes (`const #foo`), strict-mode octal literals (`001`), member declarations with `var` (`class C { var x }`), paren-less `new` arguments (`new C0 32`), reserved words in dotted namespace tails, template-literal module names, `extends void`, `super<T>` tagged templates. Each wants the same treatment that landed for `case`/`class`/statement keywords: fix, then prove FN=0 with the accept/reject flip-scan against the corpus.
- **More vscode#203212 bundles** — low-effort first (ini, diff, git config, xml); the large ones (ruby, perl, c/c++, groovy) each need an instrumentable official parser (WASM / native-coverage) + a corpus.
- **Field labels** in the grammar DSL → richer named-field AST types.
- **Highlighter long tail** — the few remaining per-language divergences are documented (in the PR) as either the shared TextMate-vs-parser ceiling or proven architectural floors; where a construct provably exceeds the TextMate model, the derived **tree-sitter** target (a real whole-tree parser) resolves it.
232 changes: 232 additions & 0 deletions TOTAL-PARSING.md
@@ -0,0 +1,232 @@
# Total parsing: the formal spine

How the handle API (`createParser()`) parses *every* text into a tree plus
`cst.errors` while keeping two byte-identity guarantees no mainstream engine
makes, and why each piece is sound. The implementation lives in
`src/emit-parser.ts` (emitted runtime) and is held exact by the gates listed at
the end.

## The contract

For every input text and every edit sequence:

1. **Totality** — `parse`/`edit` never throw on input. Every text yields a root
and a (possibly empty) `errors` list. Only API misuse throws.
2. **Strict-path identity** — a text the strict grammar accepts parses
byte-identically to the strict module-level parser, with `errors = []`.
Error tolerance costs valid input *nothing*, by construction (below), not by
testing.
3. **Edit/fresh identity** — after any edit, tree *and* errors are
byte-identical to a fresh parse of the same text — broken states included.

## Two passes, strict first

`parse`/`edit` run the **strict** parser first. Only when it rejects does the
text re-run with `recovering = true`. Guarantee 2 is therefore structural: the
valid path never executes a single recovery branch. The recovering run is where
everything below lives.
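
The strict-first ordering can be sketched as a small driver. This is a minimal illustration, not the emitted runtime: `makeTotalParse`, the `StrictPass`/`RecoveringPass` types, and the toy digit grammar are all hypothetical names introduced here. It only shows why valid input cannot reach a recovery branch.

```typescript
// Hypothetical shapes, for illustration only (not the emitted runtime's types).
type ParseError = { pos: number; message: string };
type Tree = { kind: string; text: string };
type Result = { root: Tree; errors: ParseError[] };

// A strict pass either accepts (returns a tree) or rejects (returns null).
type StrictPass = (text: string) => Tree | null;
// A recovering pass always returns a tree plus its diagnostics.
type RecoveringPass = (text: string) => Result;

function makeTotalParse(strict: StrictPass, recovering: RecoveringPass) {
  return (text: string): Result => {
    const tree = strict(text);
    // Valid input returns here: no recovery branch has executed.
    if (tree !== null) return { root: tree, errors: [] };
    // Only a strict rejection reaches the recovering pass.
    return recovering(text);
  };
}

// Toy grammar: a text is valid when it is a non-empty run of digits.
let recoveringCalls = 0;
const getRecoveringCalls = (): number => recoveringCalls;

const toyStrict: StrictPass = (t) =>
  /^[0-9]+$/.test(t) ? { kind: "Num", text: t } : null;

const toyRecovering: RecoveringPass = (t) => {
  recoveringCalls++;
  return {
    root: { kind: "$error", text: t },
    errors: [{ pos: 0, message: "expected digits" }],
  };
};

const totalParse = makeTotalParse(toyStrict, toyRecovering);
```

Because the recovering pass is only reachable through the `null` branch, the valid-path guarantee holds by control flow rather than by testing.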

## The bar discipline

A naive "recover at any failure" breaks both identities: PEG longest-match
exploration *fails constantly* on valid arms, so an always-on recovery rescues
losing arms and perturbs valid shapes; and an incremental run that reuses old
rows explores *less* than a fresh run, so any failure-count-dependent decision
desynchronizes the two.

Recovery instead fires only at positions a strict pass has *proven* to fail:

- Each recovering **attempt** runs strictly except at an ordered list of
**bars** (token indices). A recovery action is allowed only inside a bar's
window (below).
- An attempt that fails *past* its bars aborts and appends a new bar at the
attempt's farthest-fail watermark (`maxPos`), monotonically increasing.
- Attempt k runs under the first k bars. The loop is capped at 32 attempts;
past the cap it degrades to a deterministic free-fire pass (`recoverFree`)
and, past even that, to a zero-width `$error` root. Never a crash.
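
The attempt loop can be sketched as follows. This is an assumption-laden toy, not the real engine: `deriveBars`, `RunAttempt`, and `toyRun` are hypothetical names, and the "stop when the watermark does not advance" guard is an added assumption that keeps the sketch well-defined. It shows the one property that matters: the bar list depends only on the token stream.

```typescript
// Hypothetical attempt outcome: success, or the farthest-fail watermark (maxPos).
type Attempt = { ok: true } | { ok: false; maxPos: number };
type RunAttempt = (tokens: readonly string[], bars: readonly number[]) => Attempt;

const MAX_ATTEMPTS = 32; // the cap quoted in the text

function deriveBars(
  tokens: readonly string[],
  run: RunAttempt,
): { bars: number[]; settled: boolean } {
  const bars: number[] = [];
  for (let k = 0; k < MAX_ATTEMPTS; k++) {
    const r = run(tokens, bars); // runs under the bars collected so far
    if (r.ok) return { bars, settled: true };
    // Assumption of this sketch: bars must strictly increase, so a
    // non-advancing watermark ends the loop instead of repeating a bar.
    const last = bars.length > 0 ? bars[bars.length - 1]! : -1;
    if (r.maxPos <= last) break;
    bars.push(r.maxPos);
  }
  // A real caller would now degrade to a free-fire pass.
  return { bars, settled: false };
}

// Toy attempt: "?" tokens are unparseable unless a bar sits exactly on them.
const toyRun: RunAttempt = (tokens, bars) => {
  for (let i = 0; i < tokens.length; i++) {
    if (tokens[i] === "?" && !bars.includes(i)) return { ok: false, maxPos: i };
  }
  return { ok: true };
};
```

Running `deriveBars` twice on the same tokens yields the same list, which is the property the determinism argument relies on.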

**Determinism theorem.** The bar list is a pure function of the token stream:
bar k+1 is the strict-modulo-bars farthest-fail of a deterministic parse under
bars 1..k. Hence fresh and incremental recovering parses derive byte-identical
bar lists, which is the keystone of guarantee 3. This forces every ingredient
below to be *adoption-invariant*: nothing about reuse may change any watermark
or any fire decision.

## Recovery actions, all position-pure

Every action's fire condition is a pure function of `(position, bar list)` —
no counters, no budgets, no global parse state. (A budgeted design was tried
and failed exactly here: bar₂'s decisions depended on bar₁'s spending, which an
adopted region replays differently.)

- **Skip absorption**: at a repetition whose element fails while
`recoverArmed(from, reach)` holds, absorb tokens up to the loop's FIRST set /
threaded closer / EOF into an `$error` row. The predicate holds when ∃ bar in
`[from, reach]` with `reach ≤ bar+2`, where `reach` is the *failing element's
frame-local* probe watermark, not the global one: a frontier parked on a far
bar must not arm unrelated loops. Leaves keep text-tiling; the diagnostic
quotes the first absorbed token.
- **Missing-token synthesis** (`missTok`) — a *required* literal/token matcher
failing at `missAt(pos)` (∃ bar in `[pos, pos+2]`) materializes a zero-width
`$missing` row instead of failing: the construct completes (a call keeps its
Call shape with `)` marked missing) and the diagnostic reads `expected ')'`.
- **Missing-nonterminal synthesis** (`missRule`) — the same at a required rule
reference's fail exit: `expected Expr`.
- **Commitment semantics** — synthesis is suppressed inside *uncommitted*
probes: `not()` and separator probes (`probing`), and optional groups that
have not consumed past their entry (`probeBase`). Once an optional consumes a
real token it is committed and synthesizes like required content (`const a =
;` synthesizes the initializer; a bare `const a` does not invent one). This
is tsc's required-only semantics, derived rather than hand-coded.
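
The two fire conditions quoted above are small enough to write out directly. In this sketch the bar list is passed as an explicit argument so that purity is visible in the signature; that parameter is an assumption of the illustration, since the text only gives `recoverArmed(from, reach)` and `missAt(pos)`.

```typescript
// Skip absorption arms when ∃ bar in [from, reach] with reach ≤ bar + 2.
function recoverArmed(bars: readonly number[], from: number, reach: number): boolean {
  return bars.some((bar) => from <= bar && bar <= reach && reach <= bar + 2);
}

// Missing-token synthesis fires when ∃ bar in [pos, pos + 2].
function missAt(bars: readonly number[], pos: number): boolean {
  return bars.some((bar) => pos <= bar && bar <= pos + 2);
}

// Both predicates read only their arguments: no counters, no budgets.
const exampleBars: readonly number[] = [10];
const armedNearBar = recoverArmed(exampleBars, 8, 10);  // bar 10 in [8, 10], 10 ≤ 12
const armedFarReach = recoverArmed(exampleBars, 8, 13); // reach 13 > bar + 2
const missNearBar = missAt(exampleBars, 8);             // bar 10 in [8, 10]
const missFarFromBar = missAt(exampleBars, 7);          // [7, 9] holds no bar
```

Since neither function reads anything but its arguments, an adopted region that replays at the same positions under the same bars reaches the same decisions.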

## Three structural theorems the gates forced

Each of these was surfaced as an `edit ≠ fresh` divergence by the generative
cross-grammar gate, then closed structurally — not patched per-case.

**T1 — Zero-width success is a synthesis-only artifact.** A strict parser can
never succeed at width zero inside a loop (it would not terminate), so *every*
loop must discard zero-width elements: plain repetitions break on
`pos === before`, hooked repetitions discard and re-arm, left-recursion
continuations and Pratt LEDs refuse zero-width wraps. Without this, synthesis
inside a loop spins unboundedly.
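
A minimal sketch of the plain-repetition case of T1, using a hypothetical `Elem` matcher type and a toy element that "synthesizes" a zero-width success whenever it cannot consume. Only the `pos === before` check comes from the text; everything else is illustrative.

```typescript
// A matcher returns the new position on success, or null on failure.
type Elem = (pos: number) => number | null;

function repeatElem(elem: Elem, start: number): { end: number; count: number } {
  let pos = start;
  let count = 0;
  for (;;) {
    const before = pos;
    const next = elem(pos);
    if (next === null) break;   // ordinary failure ends the loop
    if (next === before) break; // zero-width success: discard it, or spin forever
    pos = next;
    count++;
  }
  return { end: pos, count };
}

// Toy element: consumes one "a"; anywhere else it succeeds at width zero,
// imitating a synthesized `$missing` row.
const toyTokens = ["a", "a", ";"];
const synthesizingElem: Elem = (pos) => (toyTokens[pos] === "a" ? pos + 1 : pos);

const repeated = repeatElem(synthesizingElem, 0);
```

Without the second `break`, the loop would accept the zero-width result at index 2 on every iteration and never terminate.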

**T2 — Same-position re-entry is a real cycle class.** Zero-width synthesis
(and, under recovering, the opened dispatch guards) lets a rule re-enter
itself at the same position through paths no grammar check can rule out.
`recRunning` maps each in-flight `(rule, position)` frame to an entry serial;
re-entry fails with PEG cycle semantics. The refinement that matters for reuse:
a cycle refusal that leans on a frame entered *before* the current one makes
the current frame's result a function of its **ancestor stack**, not of the
text — such results are *tainted* (memo-stamped own-generation-only, taint
propagating to whoever reuses them). Internal cycles (both ends inside the
frame) replay from the window text alone and do not taint.
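
The distinction between external and internal cycles can be sketched with a small guard. `CycleGuard` and its bookkeeping are hypothetical; only the idea of mapping each in-flight `(rule, position)` to an entry serial, and tainting a frame that leans on an earlier-entered frame, comes from the text. Memo stamping and taint propagation to later reusers are left out.

```typescript
type Frame = { serial: number; minLeanedOn: number };

class CycleGuard {
  private recRunning = new Map<string, number>(); // "(rule, position)" -> entry serial
  private stack: Frame[] = [];
  private nextSerial = 0;

  // Runs `body` as the frame (rule, pos); returns null on a refused re-entry.
  run<T>(rule: string, pos: number, body: () => T): { value: T; tainted: boolean } | null {
    const key = `${rule}@${pos}`;
    const hit = this.recRunning.get(key);
    if (hit !== undefined) {
      // Cycle refusal: the innermost open frame records whom it leaned on.
      const top = this.stack[this.stack.length - 1];
      if (top) top.minLeanedOn = Math.min(top.minLeanedOn, hit);
      return null;
    }
    const frame: Frame = { serial: this.nextSerial++, minLeanedOn: Infinity };
    this.recRunning.set(key, frame.serial);
    this.stack.push(frame);
    try {
      const value = body();
      // Tainted only if some refusal leaned on a frame entered before this one.
      return { value, tainted: frame.minLeanedOn < frame.serial };
    } finally {
      this.stack.pop();
      this.recRunning.delete(key);
      // A refusal inside a child is also a dependence of the parent.
      const parent = this.stack[this.stack.length - 1];
      if (parent) parent.minLeanedOn = Math.min(parent.minLeanedOn, frame.minLeanedOn);
    }
  }
}

// A@0 calls B@0, which tries to re-enter A@0 and is refused.
const guard = new CycleGuard();
const outerFrame = guard.run("A", 0, () =>
  guard.run("B", 0, () => guard.run("A", 0, () => "unreachable") === null),
);
```

Here `B` leaned on `A`, which was entered earlier, so `B`'s result depends on its ancestor stack and is tainted. For `A` itself both ends of the cycle lie inside the frame, so it is not.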

**T3 — The bar protocol's inputs must be adoption-invariant.** Bar k+1 is
derived from a watermark, so watermarks must be *exact* and *reuse-stable*:
`frameMax` is a frame-local advance watermark (reset at rule entry, folded to
the parent at exit) that makes every stored extent the frame's true probe
reach; memo jumps and adoptions re-raise it to the stored extent, so a reused
subtree contributes the same watermark the parse that built it did.
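
A toy model of the frame-local watermark in T3, assuming a hypothetical `Watermark` class with `frame`, `advance`, and `reuse` methods (none of these names are from the implementation; only `frameMax` is). It shows a build run and a reuse run ending with the same extent for the enclosing frame.

```typescript
class Watermark {
  frameMax = 0;

  // Probing a token raises the current frame's watermark.
  advance(pos: number): void {
    if (pos > this.frameMax) this.frameMax = pos;
  }

  // A memo jump or adoption re-raises the watermark to the stored extent.
  reuse(start: number, ext: number): void {
    this.advance(start + ext);
  }

  // Reset at rule entry, fold into the parent at exit.
  frame<T>(start: number, body: () => T): { value: T; ext: number } {
    const parentMax = this.frameMax;
    this.frameMax = start;
    const value = body();
    const reach = this.frameMax; // this frame's true probe reach
    this.frameMax = Math.max(parentMax, reach);
    return { value, ext: reach - start };
  }
}

// Build run: the inner frame at 2 probes as far as token 7.
const buildRun = new Watermark();
const built = buildRun.frame(0, () => {
  const inner = buildRun.frame(2, () => buildRun.advance(7));
  buildRun.advance(4);
  return inner.ext; // stored alongside the inner row
});

// Reuse run: the inner frame is adopted instead of re-parsed.
const reuseRun = new Watermark();
const reused = reuseRun.frame(0, () => {
  reuseRun.reuse(2, built.value);
  reuseRun.advance(4);
});
```

Both runs report the same extent for the outer frame, so a bar derived from it would be the same whether or not the inner subtree was reused.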

## The window-replay theorem

Define a frame's **window** as `[start, start + ext + 2]` over token indices,
where `ext` is its exact probe extent (T3) and `+2` covers the stop-token and
SECOND-token dispatch reads.

**Theorem.** Every recovery decision being position-pure, a frame's behavior —
result, probe extent, internal fires and synthesis included — is completely
determined by its window's *text* and its window's *bars*, modulo the
external-cycle dependence of T2.

Corollaries, each carrying one optimization:

- **Recovering adoption** (`barsWindowEq`): an old-tree row whose window sees
the same (shifted) bars the build run saw there replays identically — even
rows *containing* `$error`/`$missing` (an error region is exactly what stays
stable across far edits). Broken-state keystrokes go incremental.
- **Cross-attempt memo survival**: attempts within one sequence parse the same
stream under a monotonically growing bar list, so a memo entry whose window
is **bar-free** behaved strictly (no synthesis, no arming; opened dispatch
guards add only non-consuming probes) and is a pure function of window text —
valid in every later attempt. Tainted entries (T2) are excluded; this
exclusion is precisely what the first survival attempt missed and the gates
rejected. Survival is edit-side only: the fresh path's attempt loop resets
the arena per attempt, so earlier attempts' rows are clobbered there.
- **Recovering surgery**: a splice whose damage and re-parsed span sit clear of
every bar window *commutes with every recovery decision* — kept rows replay
at shifted positions, and the fresh parse behaves strictly across the span,
exactly like the strict re-parse the surgery runs. Attempt k's bars are a
prefix of the final list, so one check against the final list covers every
attempt. The spliced tree keeps its bar list, suffix bars shifted.
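
The adoption check can be sketched as a comparison of window-relative bar positions. The signature below is an assumption made for illustration (the text names `barsWindowEq` but does not give its parameters); the window is `[start, start + ext + 2]` as defined above.

```typescript
// Bars that fall inside a frame's window, expressed relative to its start.
function barsInWindow(bars: readonly number[], start: number, ext: number): number[] {
  const end = start + ext + 2; // +2: stop-token and SECOND-token dispatch reads
  return bars.filter((b) => b >= start && b <= end).map((b) => b - start);
}

// True when the old row's window saw the same (shifted) bars as the new position does.
function barsWindowEq(
  oldBars: readonly number[],
  oldStart: number,
  newBars: readonly number[],
  newStart: number,
  ext: number,
): boolean {
  const a = barsInWindow(oldBars, oldStart, ext);
  const b = barsInWindow(newBars, newStart, ext);
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// An edit inserted 10 tokens before the row, shifting both the row and its bar.
const sameAfterShift = barsWindowEq([5, 40], 3, [15, 50], 13, 4);
// Here the bar moved relative to the row, so the row must be re-parsed.
const movedBar = barsWindowEq([5, 40], 3, [16, 50], 13, 4);
// A bar-free window on both sides trivially matches.
const barFree = barsWindowEq([5, 40], 20, [15, 50], 30, 3);
```

A bar-free window is the special case where both filtered lists are empty, which is the condition the memo-survival corollary relies on.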

Taint is tracked on rows as well as memo entries: a tainted frame's row
carries `rowRM` bit 2, propagated structurally like error containment, and
recovering adoption / run extension refuse it — a context-dependent result is
never reused outside the parse that computed it.

## Lexer resync under depth shifts

The windowed re-lex adopts the old token suffix at the first aligned token
where the old suffix's lexing is reproducible from observable state. Either of
two conditions is sufficient. Both additionally require that the template
stacks are empty on both sides (an interpolation entry's brace counter is
mutable state that no record captures) and that the candidate token carries no
cross-token lexer flag that its adopted successor reads:

- **Equal-depth**: neither lex dipped below the candidate's paren depth since
the divergence point (damage start; before it, identical bytes from an
identical anchor state give identical stacks). Every open entry is then
common to both lexes: the stacks are content-equal, and every future pop
behaves identically. O(1), the common case.
- **Shifted-depth**: the old suffix never pops an entry open at the candidate
(its recorded depth column never dips below the candidate's depth;
pop-on-empty counts as −1). No open entry's head-ness is ever read again, so
stack *contents* are irrelevant and the depths may differ by an arbitrary
shift δ — the splice re-bases the adopted depth records by δ, restoring true
absolute depths (`(`-head bits are local facts of their own neighbors and
stay valid). This is what makes a paren-balance-changing edit O(window)
instead of a relex-to-EOF. The dominant candidate depth is 0 (statement
boundaries), where the condition collapses to "no pop-on-empty beyond the
candidate" — answered O(1) from an ascending doc-level list of pop-on-empty
token indices (almost always empty) instead of an O(suffix) min-build; only
depth > 0 candidates build the suffix minimum, lazily once per edit.
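
The shifted-depth condition can be sketched as below. All names (`shiftedResyncOk`, `depthCol`, `popOnEmpty`) are hypothetical, the depth column is modelled as one number per token with pop-on-empty recorded as −1, and "beyond the candidate" is read here as a strictly greater token index, which is an assumption. The sketch also scans the suffix directly rather than building a lazy per-edit minimum.

```typescript
function shiftedResyncOk(
  candidate: number,               // token index of the resync candidate
  candidateDepth: number,          // its recorded paren depth
  depthCol: readonly number[],     // old suffix depth per token, -1 = pop-on-empty
  popOnEmpty: readonly number[],   // ascending token indices of pop-on-empty
): boolean {
  if (candidateDepth === 0) {
    // O(1): only the largest pop-on-empty index matters.
    const last = popOnEmpty.length > 0 ? popOnEmpty[popOnEmpty.length - 1]! : -1;
    return last <= candidate;
  }
  // depth > 0: the old suffix must never dip below the candidate's depth.
  let min = Infinity;
  for (let i = candidate + 1; i < depthCol.length; i++) {
    min = Math.min(min, depthCol[i]!);
  }
  return min >= candidateDepth;
}

// Depth 0 with a stray ")" at token 6: candidates before it cannot resync.
const strayCloser = [0, 1, 1, 0, 0, 0, -1];
const beforeStray = shiftedResyncOk(4, 0, strayCloser, [6]);
const atStray = shiftedResyncOk(6, 0, strayCloser, [6]);

// Depth 1 candidate: fine while the suffix stays at depth >= 1.
const staysOpen = shiftedResyncOk(1, 1, [0, 1, 2, 2, 1, 1, 1], []);
const closesLater = shiftedResyncOk(1, 1, [0, 1, 2, 2, 1, 1, 0], []);
```

In the common case the pop-on-empty list is empty, so the depth-0 branch returns `true` without touching the suffix at all.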

## Diagnostics are data, derived from the tree

`cst.errors` is rebuilt at settle from structured lexer entries plus the
`$error`/`$missing` rows found by descending the structurally-propagated
`rowRM` spine — never collected during parsing. That is what makes adoption
safe for diagnostics: an adopted error region re-derives byte-identical
messages from the current token columns. Two derived enrichments:

- **Viable sets** — for a required literal in a seq, the companion literals
*provably still accepted* when it fails: repetitions before it are always
re-enterable (their nullable-prefix-reachable literals stay viable);
nullable one-shot items are crossed but contribute nothing, since they may
already have consumed. `expected ',' or ']'` never names an impossible
continuation — a static FIRST union would (after `[1, 2` an expression is
not viable), and tsc under-reports the same position as `')' expected`.
- **Paired openers** — for each literal, intersect the sets of preceding
literals across all its seq occurrences; a unique survivor is its structural
opener (`)`←`(`, `]`←`[`, `while`←`do` — derived, no bracket list), attached
as `related` info pointing at the opener leaf among the `$missing`'s earlier
siblings.
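
The paired-opener derivation can be sketched as a set intersection over a toy grammar. `pairedOpeners` is a hypothetical name, and each seq is reduced to its literals in order, which simplifies away rule references. The toy grammar deliberately has no standalone `while` statement; how the real derivation treats an occurrence with no preceding literal is not stated in the text and is not modelled here.

```typescript
// Each seq is listed as its literals, in order.
function pairedOpeners(seqs: readonly (readonly string[])[]): Map<string, string> {
  const survivors = new Map<string, Set<string>>();
  for (const seq of seqs) {
    seq.forEach((lit, i) => {
      const before = new Set<string>(seq.slice(0, i));
      const prev = survivors.get(lit);
      // Intersect with what survived the literal's earlier occurrences.
      survivors.set(
        lit,
        prev === undefined
          ? before
          : new Set<string>(Array.from(prev).filter((x) => before.has(x))),
      );
    });
  }
  const openers = new Map<string, string>();
  survivors.forEach((set, lit) => {
    const only = Array.from(set);
    if (only.length === 1) openers.set(lit, only[0]!); // unique survivor
  });
  return openers;
}

const toyGrammar = [
  ["(", ")"],                // call arguments
  ["[", "]"],                // index / array
  ["if", "(", ")"],          // if statement
  ["do", "while", "(", ")"], // do-while statement
];
const openers = pairedOpeners(toyGrammar);
```

No bracket table appears anywhere: `)` pairs with `(` only because `(` is the one literal that precedes it in every seq where it occurs.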

## Measured (9 MB TypeScript, single-character edits, median)

| phase | Monogram | tsc `updateSourceFile` | tree-sitter |
|---|---:|---:|---:|
| fresh parse | **167 ms** | 207 ms | 430 ms |
| valid keystroke | 0.37 ms | 35 ms | **0.18 ms** |
| breaking edit | 12 ms | 12.0 ms | **0.29 ms** |
| while-broken keystroke | **0.22 ms** | 11.9 ms | 0.30 ms |
| fixing edit | 2.2 ms | 11.9 ms | **0.22 ms** |

(`test/head-to-head.ts`.) The transition rows measure a first-touch 4.5 MB
cursor jump: token offsets are EOF-relative-biased so local typing never
rewrites the suffix (the 0.37 ms valid keystroke), and the bias boundary
moves with the cursor — a far jump pays once, proportional to the distance.
Repeated break/fix transitions at one position settle to ~1.6–2 ms, of
which the strict-fail pass is 0.23 ms and the recovery attempts 0.46 ms;
the raw 7-column suffix memmove measures 0.07 ms, so the residual is spread
bookkeeping, not a storage floor.

Error-report agreement with tsc's parser on the conformance files it rejects
(`test/recovery-conformance.ts`, ±8 chars): recall 59.1%, precision 82.4%,
first-error agreement 57.5%.
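
For readers unfamiliar with position-tolerant agreement metrics, one plausible reading of the three numbers is sketched below. The exact definitions used by `test/recovery-conformance.ts` are not given in this document, so the function, its name, and the choice to compare error start offsets are all assumptions.

```typescript
type Agreement = { recall: number; precision: number; firstErrorAgrees: boolean };

// `ours` / `theirs`: error start offsets reported for one file by each parser.
function agreement(ours: readonly number[], theirs: readonly number[], tol = 8): Agreement {
  const near = (p: number, list: readonly number[]): boolean =>
    list.some((q) => Math.abs(p - q) <= tol);
  const ratio = (hits: number, total: number): number => (total === 0 ? 0 : hits / total);
  return {
    // Share of their errors that we also report nearby.
    recall: ratio(theirs.filter((p) => near(p, ours)).length, theirs.length),
    // Share of our errors that they also report nearby.
    precision: ratio(ours.filter((p) => near(p, theirs)).length, ours.length),
    // Whether the first error of each side lands within tolerance.
    firstErrorAgrees:
      ours.length > 0 && theirs.length > 0 && Math.abs(ours[0]! - theirs[0]!) <= tol,
  };
}

const sample = agreement([10, 100, 300], [14, 200]);
```

In this sample only the first error of each side falls within 8 characters of the other, so recall is 1 of 2 and precision is 1 of 3.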

## The gates that hold all of this exact

- `test/incremental-grammars.ts` — generative inputs × seeded edits × all 7
grammars: every step's tree+errors byte-equal to fresh, self-consistent
spans, no throws (672 steps).
- `test/incremental-verify.ts`, `test/multi-doc.ts` — real-file edit scripts
and interleaved documents under the same byte-equality.
- `test/recovery.ts` — strict-path identity on valid texts, totality and
determinism on an invalid corpus, a char-by-char typing session, and
exact-match diagnostic pins (synthesis quality must not silently regress to
absorption).
- `test/emit-parser-verify.ts` / `test/emit-lexer-verify.ts` — emitted runtime
≡ interpreter on the corpus, token streams and error messages included.
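
The shape of an edit ≡ fresh gate can be sketched independently of any real parser. Everything here is hypothetical: the `Handle` interface (which returns a serialized tree-plus-errors string), the seeded generator, and both toy handles. It is not the code in `test/incremental-grammars.ts`; it only illustrates the check that each edited state must serialize identically to a fresh parse of the same text.

```typescript
interface Handle {
  parse(text: string): string; // serialized tree + errors
  edit(start: number, deleted: number, inserted: string): string;
}

// Small seeded generator so a failing script can be replayed.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Returns the first diverging step, or null if every step matched.
function gate(make: () => Handle, initial: string, steps: number, seed: number): number | null {
  const rand = lcg(seed);
  const alphabet = "()a; ";
  const handle = make();
  let text = initial;
  handle.parse(text);
  for (let step = 0; step < steps; step++) {
    const start = Math.floor(rand() * (text.length + 1));
    const deleted = Math.min(text.length - start, Math.floor(rand() * 2));
    const inserted = alphabet.charAt(Math.floor(rand() * alphabet.length));
    text = text.slice(0, start) + inserted + text.slice(start + deleted);
    const edited = handle.edit(start, deleted, inserted);
    const fresh = make().parse(text);
    if (edited !== fresh) return step;
  }
  return null;
}

// Toy "parser": the text plus a paren-balance diagnostic.
function serialize(t: string): string {
  let depth = 0;
  let unbalanced = false;
  for (let i = 0; i < t.length; i++) {
    const ch = t.charAt(i);
    if (ch === "(") depth++;
    else if (ch === ")" && --depth < 0) { unbalanced = true; depth = 0; }
  }
  return JSON.stringify({ text: t, errors: unbalanced || depth > 0 ? ["unbalanced parens"] : [] });
}

const makeReparsing = (): Handle => {
  let text = "";
  return {
    parse: (t) => serialize((text = t)),
    edit: (start, deleted, inserted) =>
      serialize((text = text.slice(0, start) + inserted + text.slice(start + deleted))),
  };
};

// Deliberately wrong: ignores edits and keeps returning the first result.
const makeStale = (): Handle => {
  let first = "";
  return { parse: (t) => (first = serialize(t)), edit: () => first };
};
```

The stale handle fails on its first step because the initial text shares no character with the edit alphabet, so any edit changes the serialized text; a gate of this shape is only as strong as the variety of edits it generates.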