Discussion: unifying inline markdown parsers (tables + buffer text) #19

@ewilderj

While working on PR #17 (pretty markdown tables), I introduced a completely separate inline markdown parsing system from the one in markdown-overlays.el. This meant two independent implementations of bold, italic, code, link, and strikethrough parsing — with different regex patterns, different processing orders, and different bugs.

I've done an experimental spike to unify these into a single shared parser. The branch is here for reference:

ewilderj/shell-maker@feature/unify-inline-parsing

This is built on top of the PR #17 branch and is not intended as a PR itself yet — just sharing the analysis and approach for discussion.

What changed

The old architecture has two independent inline markdown parsers:

  • Buffer scanner in markdown-overlays.el — 10 functions (5 --markdown-* finders + 5 --fontify-* renderers). Searches the buffer directly with re-search-forward, creates overlays at match positions.
  • String processor in markdown-overlays-tables.el — 3 functions (--apply-face-to-unpropertized, --replace-markup, --process-cell-content). Builds propertized strings for table cells.

The new architecture has one shared parser in a new markdown-overlays-parser.el (3 functions, 127 LOC), consumed by:

  • Tables: call --propertize-inline-markdown directly (same as before)
  • Buffer text: call it + a 65-line adapter (--compute-position-map + --apply-inline-overlays) that maps the propertized string back to buffer positions via a two-pointer walk

Text in the buffer remains fully navigable — point can move through all visible characters. Delimiters are hidden with invisible overlays, content is styled with face overlays.

Lines of code

Excluding comments and blank lines:

| File / metric | Old (PR17) | New (unified) | Delta |
|---|---|---|---|
| markdown-overlays.el | 657 | 430 | −227 |
| markdown-overlays-tables.el | 617 | 519 | −98 |
| markdown-overlays-parser.el | | 127 | +127 |
| Total | 1,274 | 1,076 | −198 (−16%) |
| Functions (total) | 45 | 38 | −7 |
| Inline-parsing functions | 13 | 6 | −7 |

Correctness

| Test case | Old | New |
|---|---|---|
| **bold *italic* bold** (nesting) | ✓ bold only, italic lost | ✓ both |
| ***text*** (bold-italic) | ✗ nothing rendered | |
| (see **bold**) (after paren) | ✗ nothing rendered | |
| **a** **b** **c** (consecutive) | ✗ drops middle b | ✓ all three |
| un**bold**ed (mid-word) | ✗ nothing rendered | |
| ~~struck **bold** struck~~ | | |
| **bold [link](url) bold** | | |
The new parser has 32 ERT tests. The old buffer parser has 0 tests for inline formatting.

Performance

Benchmarked on a pathological 36,800 char buffer (100 copies of markup-dense text — far larger than any real LLM response). Old and new run back-to-back in the same Emacs process, 3 rounds of 20 iterations each, averaged:

| Environment | Old (6 buffer regex passes) | New (shared parser + position map + overlays) | Ratio |
|---|---|---|---|
| Batch Emacs | 53ms/call | 63ms/call | 1.20× |
| GUI Emacs | 72ms/call | 90ms/call | 1.25× |

On realistic LLM responses (1–5K chars), both are single-digit ms — imperceptible. A benchmark tool (markdown-overlays-bench.el) is included in the branch.

Pros

  1. Single source of truth — one parser, one set of regexes, one processing order. Fix a bug once, both paths benefit.
  2. 198 fewer lines of code — net reduction even after adding the new file.
  3. Correct nesting — italic inside bold, strikethrough inside bold, bold-italic *** all work.
  4. 32 tests covering edge cases, protecting against regressions.
  5. Navigability preserved — buffer text remains point-navigable (invisible overlays on delimiters, face overlays on content). Tables still use before-string.
  6. Clean dependency chain: parser.el is required by both tables.el and overlays.el, with no circular dependencies.

Cons

  1. ~1.2× slower on extreme inputs — the string→position-map→overlay pipeline has inherent overhead vs direct buffer regex. Negligible on real-world sizes.
  2. New file — adds markdown-overlays-parser.el (127 LOC). More files vs less duplication.
  3. Slightly more complex mental model — position map concept requires understanding the two-pointer walk. But it's 19 lines and well-documented.

How the position map works

The adapter creates a vector mapping each character position in the propertized (delimiters-removed) string back to its position in the original buffer text. A two-pointer walk builds this in O(n). Then:

  • Characters in the original that aren't in the map → delimiter ranges → invisible overlays
  • Text properties from the propertized string → face/keymap overlays at the mapped buffer positions
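
To make the walk concrete, here is a minimal, language-neutral sketch in Python. It is not the Emacs Lisp from the branch. The function names (compute_position_map, delimiter_ranges, map_span) are hypothetical, and it assumes the stripped string is a subsequence of the original, meaning delimiters were removed and nothing was added or reordered.

```python
from __future__ import annotations


def compute_position_map(original: str, stripped: str) -> list[int]:
    """Map each index in `stripped` to its index in `original`.

    Two-pointer walk: one pointer per string, each advancing only
    forward, so the whole pass is O(len(original)).
    Assumption: `stripped` is a subsequence of `original`.
    """
    pos_map = []
    i = 0  # pointer into the original text
    for ch in stripped:
        # Skip original characters that were dropped (delimiters).
        while i < len(original) and original[i] != ch:
            i += 1
        if i == len(original):
            raise ValueError("stripped text is not a subsequence of original")
        pos_map.append(i)
        i += 1
    return pos_map


def delimiter_ranges(original_len: int, pos_map: list[int]) -> list[tuple[int, int]]:
    """Return half-open (start, end) ranges of original positions that are
    absent from the map. These are the spans to hide."""
    ranges = []
    prev = -1
    for p in pos_map + [original_len]:  # sentinel catches trailing delimiters
        if p > prev + 1:
            ranges.append((prev + 1, p))
        prev = p
    return ranges


def map_span(pos_map: list[int], start: int, end: int) -> tuple[int, int]:
    """Translate a half-open span in the stripped string (for example a run
    of bold text) to the matching half-open span in the original."""
    return (pos_map[start], pos_map[end - 1] + 1)


original = "a **bold** c"
stripped = "a bold c"
pos_map = compute_position_map(original, stripped)
print(pos_map)                                   # [0, 1, 4, 5, 6, 7, 10, 11]
print(delimiter_ranges(len(original), pos_map))  # [(2, 4), (8, 10)]
print(map_span(pos_map, 2, 6))                   # (4, 8), i.e. "bold"
```

One limitation of this greedy matching: when a content character is identical to a delimiter character (a literal * inside a code span nested in bold, say), the earliest match can land on a delimiter position. A real adapter would need extra information from the parser to resolve that case, and this sketch does not try to.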

Happy to discuss any aspect of this. Not proposing this as a change right now — just wanted to share the exploration since it came up naturally while working on the tables PR.

I could see a world where landing the tables is a big enough change that going the extra mile to unify markdown parsing is worth the churn.
