feat(indent): opt-in extensions for non-YAML indentation languages (commentExcept, rawBlock, flowColonSeparator) #41

Open
theoephraim wants to merge 1 commit into johnsoncodehk:master from dmno-dev:feat/indent-mode-extensions


@theoephraim theoephraim commented Jun 12, 2026

Contributor

Testing out monogram on another parser project I've had kicking around for a while... Ran into a few more small things. Overall it's working great!!


This PR adds three opt-in IndentConfig extensions that the indent mode needs in order to host indentation languages that aren't YAML, specifically ones that nest tag lines (Pug-shaped) rather than key/value scalars. All three default off; a grammar declaring none tokenizes byte-identically (all 32 gates pass unchanged, including every YAML gate and the generative scope≡role check).

Context: we're building NMBL (WIP... site a bit outdated) — an indentation shorthand for HTML that compiles to Vue/Svelte/Astro/JSX templates — as a monogram grammar, on the "adding a language is one grammar file" promise. It very nearly is: the grammar, parser, TextMate/tree-sitter/Monarch outputs and language-config all work on the unmodified engine. These three behaviors were the only places the indent mode had YAML baked in as the sole client.

The tests are the contract. Everything is specified in test/indent-extensions.ts (21 checks, registered as a core gate) over toy grammars whose token names and introducer characters deliberately match no real language — including a custom introducer char to prove the config is data, not convention. If you'd rather implement any of these differently, please throw the implementation away and keep the tests.


1. commentExcept — two-tier comments

NMBL has // comments that are stripped (dev notes) and //! comments that are rendered into the output (<!-- … -->). The strip-tier wants exactly YAML's comment treatment — comment-only lines invisible to the indent stack. But the introducer check is a plain startsWith, so //! lines (which must lex as real, structural tokens) get swallowed too: they share the // prefix.

indent: { comment: '//', commentExcept: '!' }

A line whose comment introducer is immediately followed by the exception string falls through to ordinary tokenization. // note lines vanish; //! shipped note lexes as a declared token and participates in Newline/Indent structure. // ! (space between) is still a comment.
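
The rule above can be sketched as a small predicate. This is a minimal illustration, not monogram's actual implementation: the function name `isStrippedCommentLine` and the `CommentConfig` type are hypothetical, and only the `comment` / `commentExcept` field names come from the config shown above.

```typescript
// Hypothetical helper (not monogram's code): decides whether a line is a
// strip-tier comment that should be invisible to the indent stack.
interface CommentConfig {
  comment: string;        // comment introducer, e.g. '//'
  commentExcept?: string; // exception string, e.g. '!'
}

function isStrippedCommentLine(line: string, cfg: CommentConfig): boolean {
  const body = line.trimStart();
  if (!body.startsWith(cfg.comment)) return false;
  // The exception must follow the introducer immediately, so '// !' stays a comment.
  if (cfg.commentExcept && body.startsWith(cfg.commentExcept, cfg.comment.length)) {
    return false; // falls through to ordinary tokenization
  }
  return true;
}

const cfg: CommentConfig = { comment: '//', commentExcept: '!' };
console.log(isStrippedCommentLine('  // note', cfg));          // true
console.log(isStrippedCommentLine('  //! shipped note', cfg)); // false
console.log(isStrippedCommentLine('  // ! spaced', cfg));      // true
```

With `commentExcept` left undefined the predicate reduces to the plain startsWith check, which is the default-off behavior described above.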

2. rawBlock — verbatim capture introduced from the END of a line

blockScalar captures more-indented lines verbatim, but its trigger is a leading introducer char (the signature regex hardcodes [|>]). Pug-style languages introduce raw regions from the line's end — NMBL's content modes:

script:              ← bare ':' at end of tag line
  const x = 1 < 2;   ← captured verbatim (would otherwise lex as NMBL)
article:md           ← named mode
  ## markdown here

indent: { rawBlock: { token: 'RawContent' } }   // signature/introChar configurable

rawBlock has the same capture semantics as blockScalar (bounded by indentation, blank lines included, one token from introducer through body). The introducer only counts when it is glued to the line's content, meaning no top-level whitespace precedes it (whitespace inside balanced parens/quotes is fine, so div(a="1" b): triggers), or when it sits at the line lead (:md). That guard matters in practice: label Size: is inline text ending in a colon, not a raw block; we hit exactly this in a real template.
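
The "glued" guard can be illustrated with a short scanner. This is a sketch under stated assumptions rather than the PR's implementation: the function name `triggersRawBlock` is hypothetical, the trailing signature is assumed to be the introducer plus an optional mode name of word characters and hyphens, and escaped quotes are not handled.

```typescript
// Hypothetical sketch (not monogram's code) of the end-of-line introducer guard.
// Assumption: the signature is introChar followed by an optional mode name.
function triggersRawBlock(line: string, introChar = ':'): boolean {
  const content = line.trim();
  let depth = 0;                   // balanced-paren nesting
  let quote: string | null = null; // currently open quote character, if any
  let sawTopLevelSpace = false;
  for (let i = 0; i < content.length; i++) {
    const ch = content.charAt(i);
    if (quote !== null) { if (ch === quote) quote = null; continue; }
    if (ch === '"' || ch === "'") { quote = ch; continue; }
    if (ch === '(') { depth++; continue; }
    if (ch === ')') { depth = Math.max(0, depth - 1); continue; }
    if (depth > 0) continue; // whitespace inside parens does not count
    if (ch === ' ' || ch === '\t') { sawTopLevelSpace = true; continue; }
    // Introducer plus optional mode name must run to the end of the line.
    if (ch === introChar && /^[\w-]*$/.test(content.slice(i + 1))) {
      return !sawTopLevelSpace; // glued to content, or at the line lead
    }
  }
  return false;
}

console.log(triggersRawBlock('script:'));        // true
console.log(triggersRawBlock('article:md'));     // true
console.log(triggersRawBlock('div(a="1" b):'));  // true
console.log(triggersRawBlock('  :md'));          // true
console.log(triggersRawBlock('label Size:'));    // false
```

The sketch only decides whether a line opens a raw region; capturing the more-indented body that follows is left out.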

3. flowColonSeparator: false — opt out of the flow : separator carve-out

In flow context the lexer force-emits a : glued after a quoted scalar or flow-close as the YAML key: value separator (the 5T43/C2DT cohort). Correct for YAML — but NMBL has :name-shaped tokens (Vue-style bound-attribute shorthand) that legally follow values and closes inside its (…) attribute lists:

button(@click.stop="go" :disabled)     ← ':disabled' split into ':' + 'disabled'
@each(items as item (item.id) :key="…")  ← ':key' split after ')'

indent: { flowColonSeparator: false }

Default true preserves YAML behavior exactly (asserted in the tests both ways).
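
As a rough model of the opt-out, the carve-out decision can be written as a pure function. Everything here is a hypothetical simplification: the `PrevToken` kinds, the `IndentFlags` type and the function name are illustrative, and the real lexer's conditions are likely more involved.

```typescript
// Hypothetical model (not monogram's lexer): should a ':' be force-emitted
// as a key/value separator at this point in the token stream?
type PrevToken = 'quotedScalar' | 'flowClose' | 'other';

interface IndentFlags {
  flowColonSeparator?: boolean; // undefined means the default (true)
}

function forcesColonSeparator(prev: PrevToken, inFlow: boolean, flags: IndentFlags): boolean {
  if (flags.flowColonSeparator === false) return false; // grammar opted out
  // YAML behavior: ':' glued after a quoted scalar or a flow close, in flow context.
  return inFlow && (prev === 'quotedScalar' || prev === 'flowClose');
}

console.log(forcesColonSeparator('quotedScalar', true, {}));                          // true
console.log(forcesColonSeparator('flowClose', true, { flowColonSeparator: false })); // false
```

With the flag off, a `:disabled` or `:key` following a quoted value or `)` is left for the grammar's own tokens to match.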


Field notes (no action needed)

We avoided two other YAML-isms via grammar choices rather than config. They are mentioned here as data points on what string: true and blockPattern opt into. First, flagging our string tokens string: true pulled them into the same flow-: carve-out (we dropped the flag and lost auto-close delimiter derivation). Second, any token with blockPattern participates in plain-scalar continuation folding (rest-of-line capture), which is surprising outside YAML. If useful, happy to file these as separate issues.

🤖 Generated with Claude Code

Indentation languages that nest tag lines (Pug-like) rather than key/value
scalars need three behaviors the indent mode currently hardcodes for YAML.
Each is an opt-in IndentConfig field, default off — a grammar declaring none
tokenizes byte-identically (all existing gates unchanged).

- commentExcept: an exception string after the comment introducer makes the
  line fall through to tokenization ('//' lines vanish, '//!' doc-comment
  lines lex as real structural tokens).
- rawBlock: verbatim capture introduced from the END of a line (tag:mode
  filters / content modes) — the mirror image of blockScalar's leading | / >.
  The introducer must be glued to the line content (no top-level whitespace)
  or sit at the line lead.
- flowColonSeparator: false disables the YAML flow ':' key-separator
  carve-out, for grammars with ':name'-shaped tokens (bound-attribute
  shorthand) that legally follow quoted values / flow closes.

Specified as engine behavior over toy grammars in test/indent-extensions.ts
(21 checks, registered as a core gate).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
theoephraim added a commit to theoephraim/nmbl that referenced this pull request Jun 12, 2026
The NMBL grammar is now defined once (src/nmbl-grammar.ts) using monogram
(github:johnsoncodehk/monogram, pinned + bundled into dist); the runtime
lexer/parser execute it directly and all editor artifacts derive from it
(scripts/gen-artifacts.ts → TextMate, language-config, tree-sitter, Monarch,
CST types). Replaces the hand-written lexer.ts/parser.ts/tokens.ts; the
battle-tested compiler codegen survives via a CST→AST adapter (cst-to-ast.ts).

Engine extensions live in patches/monogram.patch (commentExcept, rawBlock,
flowColonSeparator — proposed upstream in johnsoncodehk/monogram#41).

Language/compiler changes:
- host-native @-blocks: framework option ('html'|'vue'|'svelte'|'astro'|'jsx'),
  default 'html'; vue compiles @if/@each to <template v-if/v-for> wrappers;
  astro/jsx emit JSX expressions; unsupported blocks are hard errors
- unified @each: accepts 'item of items' AND 'items as item (key)' forms in
  every mode, parsed to structured {collection, bindings, key}; :key wrapper
  attribute unifies keying across hosts
- jsx target: attribute aliasing (class→className), self-closing voids,
  {/* */} comments, key injection on the iteration root
- comments: '//' silent (recoverable via recoverComments()), '//!' rendered;
  works at line level and inside attribute lists
- content blocks (script:, article:md), component-name token, escape scopes

178 tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@johnsoncodehk

Owner

Thank you for your PR. I'll check it tomorrow. Have you also tested the behavior for src/emit-lexer.ts?

@theoephraim

Contributor Author

All three behaviors I added are gated on indent.* (indent.rawBlock, indent.commentExcept, indent.flowColonSeparator), and emitLexer returns null for any grammar with markup/indent/newline (emit-lexer.ts:33), so those grammars are always lexed by the interpreted createLexer; there's no emitted code path that can reach the new branches. The flags are also opt-in and default to the existing YAML behavior, so for the grammars emit-lexer does handle, the emitted stream is unchanged: I reproduced emit-lexer-verify's token-stream comparison for the TypeScript grammar and got 0 diffs (183 tokens).
