Discussion: unifying inline markdown parsers (tables + buffer text)

While working on [PR #17](https://github.com/xenodium/shell-maker/pull/17) (pretty markdown tables), I introduced an inline markdown parsing system that is completely separate from the one in `markdown-overlays.el`. The result is two independent implementations of bold, italic, code, link, and strikethrough parsing, with different regex patterns, different processing orders, and different bugs.

I've done an experimental spike to unify these into a single shared parser. The branch is here for reference:

**[`ewilderj/shell-maker@feature/unify-inline-parsing`](https://github.com/ewilderj/shell-maker/compare/main...ewilderj:shell-maker:feature/unify-inline-parsing)**

The branch is built on top of the PR #17 branch and is not intended as a PR itself yet; I'm sharing the analysis and approach for discussion.

## What changed

The old architecture has **two independent inline markdown parsers**:
- **Buffer scanner** in `markdown-overlays.el` — 10 functions (5 `--markdown-*` finders + 5 `--fontify-*` renderers). Searches the buffer directly with `re-search-forward`, creates overlays at match positions.
- **String processor** in `markdown-overlays-tables.el` — 3 functions (`--apply-face-to-unpropertized`, `--replace-markup`, `--process-cell-content`). Builds propertized strings for table cells.

The new architecture has **one shared parser** in a new `markdown-overlays-parser.el` (3 functions, 127 LOC), consumed by:
- **Tables**: call `--propertize-inline-markdown` directly (same as before)
- **Buffer text**: call it + a 65-line adapter (`--compute-position-map` + `--apply-inline-overlays`) that maps the propertized string back to buffer positions via a two-pointer walk

Text in the buffer remains fully navigable — point can move through all visible characters. Delimiters are hidden with invisible overlays, content is styled with face overlays.

## Lines of code

Excluding comments and blank lines:

| | Old (PR17) | New (unified) | Delta |
|---|---|---|---|
| `markdown-overlays.el` | 657 | 430 | **−227** |
| `markdown-overlays-tables.el` | 617 | 519 | **−98** |
| `markdown-overlays-parser.el` | — | 127 | +127 |
| **Total** | **1,274** | **1,076** | **−198 (−16%)** |
| Functions (total) | 45 | 38 | **−7** |
| Inline-parsing functions | 13 | 6 | **−7** |

## Correctness

| Test case | Old | New |
|---|---|---|
| `**bold *italic* bold**` (nesting) | ✗ bold only, italic lost | ✓ both |
| `***text***` (bold-italic) | ✗ nothing rendered | ✓ |
| `(see **bold**)` (after paren) | ✗ nothing rendered | ✓ |
| `**a** **b** **c**` (consecutive) | ✗ drops middle `b` | ✓ all three |
| `un**bold**ed` (mid-word) | ✗ nothing rendered | ✓ |
| `~~struck **bold** struck~~` | ✓ | ✓ |
| `**bold [link](url) bold**` | ✓ | ✓ |

The new parser has **32 ERT tests**. The old buffer parser has **0 tests** for inline formatting.

## Performance

Benchmarked on a pathological 36,800-character buffer (100 copies of markup-dense text, far larger than any real LLM response). Old and new were run back-to-back in the same Emacs process, 3 rounds of 20 iterations each, and the results averaged:

| Environment | Old (6 buffer regex passes) | New (shared parser + position map + overlays) | Ratio |
|---|---|---|---|
| **Batch Emacs** | 53ms/call | 63ms/call | **1.20×** |
| **GUI Emacs** | 72ms/call | 90ms/call | **1.25×** |

On realistic LLM responses (1–5K chars), both are single-digit ms — imperceptible. A benchmark tool (`markdown-overlays-bench.el`) is included in the branch.

## Pros

1. **Single source of truth** — one parser, one set of regexes, one processing order. Fix a bug once, both paths benefit.
2. **198 fewer lines of code** — net reduction even after adding the new file.
3. **Correct nesting** — italic inside bold, strikethrough inside bold, bold-italic `***` all work.
4. **32 tests** covering edge cases, protecting against regressions.
5. **Navigability preserved** — buffer text remains point-navigable (invisible overlays on delimiters, face overlays on content). Tables still use `before-string`.
6. **Clean dependency chain** — `parser.el` is required by both `tables.el` and `overlays.el`, no circular dependencies.

## Cons

1. **~1.2× slower on extreme inputs** — the string→position-map→overlay pipeline has inherent overhead vs direct buffer regex. Negligible on real-world sizes.
2. **New file** — adds `markdown-overlays-parser.el` (127 LOC). More files vs less duplication.
3. **Slightly more complex mental model** — position map concept requires understanding the two-pointer walk. But it's 19 lines and well-documented.

## How the position map works

The adapter creates a vector mapping each character position in the propertized (delimiters-removed) string back to its position in the original buffer text. A two-pointer walk builds this in O(n). Then:
- Characters in the original that aren't in the map → delimiter ranges → invisible overlays
- Text properties from the propertized string → face/keymap overlays at the mapped buffer positions
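
To make the two-pointer walk concrete, here is a minimal, language-agnostic sketch in Python rather than Emacs Lisp. It is not the code from the branch: the function names (`compute_position_map`, `delimiter_ranges`, `map_span`) are hypothetical, indices are 0-based string offsets rather than buffer positions, and it assumes the stripped text is a subsequence of the original (that is, the parser only removes characters and never rewrites them).

```python
from typing import List, Tuple


def compute_position_map(original: str, stripped: str) -> List[int]:
    """Map each index in `stripped` to the index of that character in `original`.

    Assumes `stripped` is a subsequence of `original`; the walk greedily
    takes the earliest matching character in `original`.
    """
    position_map = []
    j = 0  # pointer into `original`
    for ch in stripped:
        # Skip characters the parser removed (delimiters such as ** or ~~).
        while j < len(original) and original[j] != ch:
            j += 1
        if j == len(original):
            raise ValueError("stripped text is not a subsequence of original")
        position_map.append(j)
        j += 1
    return position_map


def delimiter_ranges(original_length: int, position_map: List[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges of original indices missing from the map.

    In the real adapter these would be the ranges covered by invisible overlays.
    """
    ranges = []
    expected = 0
    # The sentinel closes a delimiter range that runs to the end of the text.
    for pos in position_map + [original_length]:
        if pos > expected:
            ranges.append((expected, pos))
        expected = pos + 1
    return ranges


def map_span(position_map: List[int], start: int, end: int) -> Tuple[int, int]:
    """Translate a non-empty half-open span in stripped coordinates to original ones.

    In the real adapter this is where a face or keymap overlay would be placed.
    """
    return position_map[start], position_map[end - 1] + 1


original = "a **bold** z"
stripped = "a bold z"  # what the parser would return, with bold on stripped[2:6]

pm = compute_position_map(original, stripped)
print(pm)                                   # [0, 1, 4, 5, 6, 7, 10, 11]
print(delimiter_ranges(len(original), pm))  # [(2, 4), (8, 10)]
print(map_span(pm, 2, 6))                   # (4, 8)
```

Both pointers only move forward, so the walk is linear in the length of the original text. When a styled span contains nested delimiters, the mapped range covers them too, which is harmless because those characters are hidden by the delimiter ranges.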

Happy to discuss any aspect of this. Not proposing this as a change right now — just wanted to share the exploration since it came up naturally while working on the tables PR.

I could see a world where landing the tables is a big enough change that going the extra mile to unify markdown parsing is worth the churn.

