WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes by joshday · Pull Request #54 · JuliaComputing/XML.jl

joshday · 2026-03-06T21:54:41Z

Summary of Changes

I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!

Major rewrite of XML.jl's internals that addresses many open issues
Self-contained src/XMLTokenizer.jl module for speedy tokenization
Node{T} now parameterized by the string storage type, enabling quick reads via SubString or StringViews.jl
StringViews extension — XML.mmap("file.xml", LazyNode) for memory-mapped parsing of very large files
XPath support — xpath(node, path) with a practical subset of XPath 1.0
Greatly expanded test suite — 243 libxml2 test cases, pugixml and libexpat compatibility tests, W3C conformance tests

Downstream

@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in the XLSX.jl tests related to Raw no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Closes #17 (XML character references are not unescaped/escaped): character references are now unescaped/escaped
Closes #30 (XPath support): XPath support
Closes #33: Inconsistent type for attributes where nodes have no attributes
Closes #35 (Simple XML.write followed by XML.parse fails): a simple XML.write followed by XML.parse no longer fails
Closes #50 (get not defined to match getindex): get is now defined to match getindex
Closes #52 (Question: Why the choice not to escape & to &amp; ?): escape now unconditionally escapes '&'
Closes #53 (Incorrect unescape result.): fixes the incorrect unescape result (double-unescaping)
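
As a rough illustration of the escaping behavior described for #52 and #53, the sketch below shows one way to get a lossless round trip. The names `escape_sketch` and `unescape_sketch` are hypothetical stand-ins, not XML.jl's actual functions, and only the five predefined entities are handled (numeric character references are left out). It assumes Julia 1.7 or later, where multi-pair `replace` scans the input in a single pass, so text that was just substituted is never rescanned.

```julia
# Hypothetical sketch, not XML.jl's implementation.
# Every '&' is escaped unconditionally, even if it already looks like an entity.
escape_sketch(s::AbstractString) =
    replace(s, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
               "\"" => "&quot;", "'" => "&apos;")

# Single-pass decoding: "&amp;lt;" becomes "&lt;", not "<",
# because the '&' produced by "&amp;" is not examined again.
unescape_sketch(s::AbstractString) =
    replace(s, "&amp;" => "&", "&lt;" => "<", "&gt;" => ">",
               "&quot;" => "\"", "&apos;" => "'")

original = "x=\"1\" & 'y' &lt;"
roundtrip = unescape_sketch(escape_sketch(original))
println(roundtrip == original)
```

Decoding with a chain of separate `replace` calls instead would rescan the output of earlier steps, which is the kind of double-unescaping reported in #53.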

Benchmarks: See `benchmarks/compare.jl`

Here, (SS) refers to using SubString{String} as the storage type.

julia --project=. benchmarks/compare.jl
============================================================
  XML.jl Benchmark Comparison
  Current (dev) vs v0.3.8
============================================================

Running dev benchmarks... done
Setting up v0.3.8 worktree... done
Running v0.3.8 benchmarks... done

------------------------------------------------------------

  Parse (small)
          v0.3.8      0.114 ms
             dev     0.0335 ms  (70.6% faster)

  Parse (small, SS)
          v0.3.8           n/a
             dev     0.0285 ms

  Parse (medium)
          v0.3.8   634.7153 ms
             dev   161.0888 ms  (74.6% faster)

  Parse (medium, SS)
          v0.3.8           n/a
             dev   151.3025 ms

  Write (small)
          v0.3.8     0.0227 ms
             dev     0.0176 ms  (22.4% faster)

  Write (medium)
          v0.3.8   118.1504 ms
             dev     77.619 ms  (34.3% faster)

  Read file (medium)
          v0.3.8   645.5785 ms
             dev   170.8398 ms  (73.5% faster)

  Collect tags (small)
          v0.3.8     0.0005 ms
             dev     0.0006 ms  (10.3% slower)

  Collect tags (medium)
          v0.3.8    21.0988 ms
             dev    11.1532 ms  (47.1% faster)

============================================================

TimG1964 · 2026-03-08T12:25:20Z

Hey @joshday. I've only had a very superficial look so far but it looks great. Thanks!

In terms of impact on XLSX.jl, I think it looks significant. It isn't just Raw. Since @nhz2 first suggested using Raw, I've known it was internal and therefore subject to change. On first inspection, I think the rework involved should be manageable.

More of a challenge will be the removal of prev and next, which are currently exported functions. I rely on these for fundamental elements of XLSX.jl like the sheetrow and tablerow iterators, and for reading and writing the XML files from/to the zip archive .xlsx file.

These obviously aren't insuperable, but will likely need a bit of time while I get to grips with xpath and the tokenizer. Optimistic me thinks the new functionality will simplify the code of XLSX.jl, but I usually find things are considerably harder than I first imagine! I'll give more feedback when I've had a bit more of a go at getting XLSX.jl working.

Thanks,

Tim

TimG1964 · 2026-04-02T10:31:26Z

> I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade.

Thanks!

Drops the underscore prefixes from internal names (the module is unexported; the prefixes were only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table. Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection was guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is index 0 (out of bounds) when pos == 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
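
The two techniques in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: the names `is_name_byte_slow`, `NAME_BYTE_TABLE`, `is_name_byte`, and `preceded_by_markup_open` are hypothetical, the set of bytes accepted as name bytes is an assumption (ASCII letters, digits, `_`, `:`, `-`, `.`, and every byte >= 0x80), and the `pos >= 3` guard is just one way to keep `pos - 2` inside 1-based bounds.

```julia
# Hypothetical branchy predicate (assumed byte set, not taken from the PR).
is_name_byte_slow(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) || b == UInt8('_') || b == UInt8(':') ||
    b == UInt8('-') || b == UInt8('.') || b >= 0x80

# 256-entry table computed once at load time, one entry per possible byte value.
const NAME_BYTE_TABLE = Bool[is_name_byte_slow(UInt8(i)) for i in 0:255]

# Table lookup; the +1 maps byte values 0..255 onto Julia's 1-based indices 1..256.
is_name_byte(b::UInt8) = @inbounds NAME_BYTE_TABLE[Int(b) + 1]

# Guarded look-behind: reading codeunit(data, pos - 2) is only valid when pos >= 3.
function preceded_by_markup_open(data::String, pos::Int)
    pos >= 3 || return false
    return codeunit(data, pos - 2) == UInt8('<') &&
           codeunit(data, pos - 1) == UInt8('!')
end

println(is_name_byte(UInt8('a')), " ", is_name_byte(UInt8('<')))
println(preceded_by_markup_open("<!--", 2), " ", preceded_by_markup_open("<!--", 3))
```

With a guard of `pos >= 2`, the call at `pos == 2` would reach `codeunit(data, 0)` and throw a `BoundsError`, which is the off-by-one described above.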

tag, value, keys, and attributes on LazyNode now return SubString{String} views into the source rather than allocating fresh Strings, so traversing a large document lazily does not duplicate its text data. Introduces a small _as_substring helper to promote the String that `unescape` can return into a SubString so Attributes stays homogeneous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

_write_xml now inspects children before reformatting: if any Text child has non-whitespace content (or any CData child exists), the element is treated as mixed content and its whitespace is preserved verbatim. Otherwise the writer drops the whitespace-only Text nodes the parser emits for round-tripping source formatting and generates fresh indentation. Same filter is applied at the Document level. Also adds an unescape(::SubString{String}) specialization that returns the input unchanged when it contains no '&', avoiding an allocation on the lazy scanning path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
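
The mixed-content rule described in this commit message can be sketched with a toy node model. Everything here is hypothetical and independent of XML.jl's real types: `SketchNode`, `TextNode`, `CDataNode`, `ElementNode`, `is_mixed_content`, and `writable_children` exist only for illustration.

```julia
# Toy node types for illustration only; XML.jl's actual Node type differs.
abstract type SketchNode end
struct TextNode <: SketchNode; value::String; end
struct CDataNode <: SketchNode; value::String; end
struct ElementNode <: SketchNode; tag::String; children::Vector{SketchNode}; end

# Mixed content: any CData child, or any Text child with non-whitespace characters.
is_mixed_content(children) = any(children) do c
    c isa CDataNode || (c isa TextNode && !all(isspace, c.value))
end

# Mixed content is kept verbatim. Otherwise every Text child is whitespace-only,
# so all Text children are dropped and the writer can generate fresh indentation.
writable_children(children) =
    is_mixed_content(children) ? children : filter(c -> !(c isa TextNode), children)

formatted = SketchNode[TextNode("\n  "), ElementNode("a", SketchNode[]), TextNode("\n")]
mixed = SketchNode[TextNode("hello "), ElementNode("b", SketchNode[])]
println(length(writable_children(formatted)), " ", length(writable_children(mixed)))
```

Deciding per element, rather than per document, lets pretty-printed wrapper elements be re-indented while inline text such as `hello <b/>` keeps its exact spacing.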

The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joshday added 14 commits March 5, 2026 09:34

Rewrite XML parser with tokenizer and XPath

6dacef3

remove dead code

97384c3

more test files

1844b16

Add validation tests and remove legacy DTD/raw code

b6f4d47

Update CI actions and add validation tests

21f647d

update ci

c673427

Add XMark benchmark generator and expand benchmarks

46c5a31

Add LazyNode type and StringViews extension

33bcf35

Refactor simple_value checks and use direct attrs iteration

d011424

Refactor tokenizer into XMLTokenizer and add LazyNode

754f8fa

Add benchmarks, StringViews tests, simplify XML module

8483fed

Add GC.gc before tmpfile cleanup for Windows

eb5caeb

Bump version to v0.4.0

b914bfe

Use mktempdir for temp file cleanup in StringViews tests

d76c484

nhz2 reviewed Mar 8, 2026

Comment thread ext/XMLStringViewsExt.jl Outdated

joshday added 3 commits March 8, 2026 15:05

Remove StringViews extension and simplify tokenizer

41836ae

Replace printstyled with print in show methods

b670267

Revamp benchmarks and expand test suite

4a728ee

joshday and others added 6 commits April 2, 2026 16:49

Add Attributes type and performance optimizations

2f71f9a

Add sourcetext, write, eachchildnode for LazyNode

6c4e8f3

joshday mentioned this pull request Apr 30, 2026

perf: avoid per-call ctx allocation in next_no_xml_space #58

Open

joshday wants to merge 23 commits into JuliaComputing:main from joshday:main
