While working on PR #17 (pretty markdown tables), I introduced a completely separate inline markdown parsing system from the one in markdown-overlays.el. This meant two independent implementations of bold, italic, code, link, and strikethrough parsing — with different regex patterns, different processing orders, and different bugs.
I've done an experimental spike to unify these into a single shared parser. The branch is here for reference:
ewilderj/shell-maker@feature/unify-inline-parsing
This is built on top of the PR #17 branch and is not intended as a PR itself yet — just sharing the analysis and approach for discussion.
What changed
The old architecture has two independent inline markdown parsers:
- Buffer scanner in markdown-overlays.el: 10 functions (5 --markdown-* finders + 5 --fontify-* renderers). Searches the buffer directly with re-search-forward and creates overlays at match positions.
- String processor in markdown-overlays-tables.el: 3 functions (--apply-face-to-unpropertized, --replace-markup, --process-cell-content). Builds propertized strings for table cells.
The new architecture has one shared parser in a new markdown-overlays-parser.el (3 functions, 127 LOC), consumed by:
- Tables: call --propertize-inline-markdown directly (same as before)
- Buffer text: call it, plus a 65-line adapter (--compute-position-map + --apply-inline-overlays) that maps the propertized string back to buffer positions via a two-pointer walk
Text in the buffer remains fully navigable — point can move through all visible characters. Delimiters are hidden with invisible overlays, content is styled with face overlays.
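To make the overlay approach concrete, here is a minimal sketch of hiding delimiters and styling content without touching buffer text. The function name example-hide-and-style and its argument layout are hypothetical and are not taken from the branch; only make-overlay and overlay-put are standard Emacs functions.

```elisp
(defun example-hide-and-style (delimiter-ranges content-start content-end face)
  "Hide DELIMITER-RANGES and apply FACE to CONTENT-START..CONTENT-END.
DELIMITER-RANGES is a list of (START . END) buffer positions.
Hypothetical helper for illustration, not the branch's code."
  ;; Delimiters stay in the buffer but are not displayed.
  (dolist (range delimiter-ranges)
    (overlay-put (make-overlay (car range) (cdr range)) 'invisible t))
  ;; Content keeps its text; only its appearance changes.
  (overlay-put (make-overlay content-start content-end) 'face face))

;; Usage: buffer positions are 1-based and overlay ends are exclusive.
(with-temp-buffer
  (insert "see **bold** here")
  ;; "**" sits at 5-6 and 11-12, "bold" at 7-10.
  (example-hide-and-style '((5 . 7) (11 . 13)) 7 11 'bold)
  (buffer-string))
```

Because nothing is deleted or replaced, buffer-string still returns the original markup, which is what keeps the text navigable and copyable.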
Lines of code
Excluding comments and blank lines:
| File / metric | Old (PR17) | New (unified) | Delta |
|---|---|---|---|
| markdown-overlays.el | 657 | 430 | −227 |
| markdown-overlays-tables.el | 617 | 519 | −98 |
| markdown-overlays-parser.el | — | 127 | +127 |
| Total | 1,274 | 1,076 | −198 (−16%) |
| Functions (total) | 45 | 38 | −7 |
| Inline-parsing functions | 13 | 6 | −7 |
Correctness
| Test case | Old | New |
|---|---|---|
| `**bold *italic* bold**` (nesting) | ✓ bold only, italic lost | ✓ both |
| `***text***` (bold-italic) | ✗ nothing rendered | ✓ |
| `(see **bold**)` (after paren) | ✗ nothing rendered | ✓ |
| `**a** **b** **c**` (consecutive) | ✗ drops middle `b` | ✓ all three |
| `un**bold**ed` (mid-word) | ✗ nothing rendered | ✓ |
| `~~struck **bold** struck~~` | ✓ | ✓ |
| `**bold [link](url) bold**` | ✓ | ✓ |
The new parser has 32 ERT tests. The old buffer parser has 0 tests for inline formatting.
Performance
Benchmarked on a pathological 36,800 char buffer (100 copies of markup-dense text — far larger than any real LLM response). Old and new run back-to-back in the same Emacs process, 3 rounds of 20 iterations each, averaged:
| Environment | Old (6 buffer regex passes) | New (shared parser + position map + overlays) | Ratio |
|---|---|---|---|
| Batch Emacs | 53 ms/call | 63 ms/call | 1.20× |
| GUI Emacs | 72 ms/call | 90 ms/call | 1.25× |
On realistic LLM responses (1–5K chars), both are single-digit ms — imperceptible. A benchmark tool (markdown-overlays-bench.el) is included in the branch.
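For anyone who wants to reproduce the shape of this measurement without the branch, the following is a rough sketch of a "rounds of iterations, averaged" timing loop built on the standard benchmark-run macro. The helper name example-bench-ms-per-call and the sample text are assumptions made for illustration; this is not the code in markdown-overlays-bench.el.

```elisp
(require 'benchmark)

(defun example-bench-ms-per-call (fn &optional rounds iterations)
  "Return average milliseconds per call of FN.
Runs ROUNDS (default 3) rounds of ITERATIONS (default 20) calls each.
Hypothetical helper for illustration."
  (let ((rounds (or rounds 3))
        (iterations (or iterations 20))
        (total 0.0))
    (dotimes (_ rounds)
      ;; `benchmark-run' returns (ELAPSED-SECONDS GC-COUNT GC-SECONDS).
      (setq total (+ total (car (benchmark-run iterations (funcall fn))))))
    (* 1000.0 (/ total (* rounds iterations)))))

;; Usage: time a plain regex pass over 100 copies of a made-up sample.
(let ((text (apply #'concat
                   (make-list 100 "Some **bold** and *italic* `code`.\n"))))
  (example-bench-ms-per-call
   (lambda ()
     (with-temp-buffer
       (insert text)
       (goto-char (point-min))
       (while (re-search-forward "\\*\\*[^*]+\\*\\*" nil t))))))
```

Running old and new implementations through the same helper in one Emacs session keeps GC state and load comparable, which is the point of the back-to-back setup described above.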
Pros
- Single source of truth: one parser, one set of regexes, one processing order. Fix a bug once, both paths benefit.
- 198 fewer lines of code: a net reduction even after adding the new file.
- Correct nesting: italic inside bold, strikethrough inside bold, and bold-italic *** all work.
- 32 tests covering edge cases, protecting against regressions.
- Navigability preserved: buffer text remains point-navigable (invisible overlays on delimiters, face overlays on content). Tables still use before-string.
- Clean dependency chain: parser.el is required by both tables.el and overlays.el, with no circular dependencies.
Cons
- ~1.2× slower on extreme inputs: the string→position-map→overlay pipeline has inherent overhead vs direct buffer regex. Negligible on real-world sizes.
- New file: adds markdown-overlays-parser.el (127 LOC). More files vs less duplication.
- Slightly more complex mental model: the position map concept requires understanding the two-pointer walk. But it's 19 lines and well-documented.
How the position map works
The adapter creates a vector mapping each character position in the propertized (delimiters-removed) string back to its position in the original buffer text. A two-pointer walk builds this in O(n). Then:
- Characters in the original that aren't in the map → delimiter ranges → invisible overlays
- Text properties from the propertized string → face/keymap overlays at the mapped buffer positions
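The walk described above can be sketched roughly as follows. This is a simplified illustration working on plain strings with 0-based indices, not the branch's --compute-position-map; the names example-position-map and example-delimiter-ranges are hypothetical. It assumes the stripped string is the original with delimiter characters removed in order, and its greedy character matching could misalign if a skipped delimiter happens to equal the next content character, which a real implementation would need to handle.

```elisp
(defun example-position-map (original stripped)
  "Return a vector mapping each index of STRIPPED to an index in ORIGINAL.
Assumes STRIPPED is ORIGINAL with delimiter characters removed, in order.
Hypothetical sketch; matching is greedy."
  (let ((map (make-vector (length stripped) nil))
        (i 0)   ; pointer into STRIPPED
        (j 0))  ; pointer into ORIGINAL
    (while (and (< i (length stripped)) (< j (length original)))
      ;; Same character: record the mapping and advance both pointers.
      ;; Otherwise ORIGINAL[j] is treated as a delimiter and skipped.
      (when (= (aref stripped i) (aref original j))
        (aset map i j)
        (setq i (1+ i)))
      (setq j (1+ j)))
    map))

(defun example-delimiter-ranges (map original-length)
  "Return (START . END) ranges of the original that MAP does not cover.
Ranges are half-open; MAP must be fully filled with increasing indices."
  (let ((ranges nil)
        (expected 0))
    (dotimes (i (length map))
      (let ((pos (aref map i)))
        ;; A gap between EXPECTED and POS is a run of delimiters.
        (when (> pos expected)
          (push (cons expected pos) ranges))
        (setq expected (1+ pos))))
    (when (< expected original-length)
      (push (cons expected original-length) ranges))
    (nreverse ranges)))

(example-position-map "a **b** c" "a b c")      ; => [0 1 4 7 8]
(example-delimiter-ranges [0 1 4 7 8] 9)        ; => ((2 . 4) (5 . 7))
```

Each pointer only moves forward, so the walk is linear in the length of the original text. The resulting ranges are what would receive invisible overlays, and the map is what translates text-property spans on the stripped string into buffer positions for face overlays.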
Happy to discuss any aspect of this. Not proposing this as a change right now — just wanted to share the exploration since it came up naturally while working on the tables PR.
I could see a world where landing the tables is a big enough change that going the extra mile to unify markdown parsing is worth the churn.