
WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54

Open
joshday wants to merge 23 commits into JuliaComputing:main from joshday:main



@joshday
Contributor

joshday commented Mar 6, 2026

Summary of Changes

I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!

  • Major rewrite of XML.jl's internals that addresses many open issues
  • Self-contained src/XMLTokenizer.jl module for speedy tokenization
  • Node{T} now parameterized by the string storage type, enabling quick reads via SubString or StringViews.jl
  • StringViews extension — XML.mmap("file.xml", LazyNode) for memory-mapped parsing of very large files
  • XPath support — xpath(node, path) with a practical subset of XPath 1.0
  • Greatly expanded test suite — 243 libxml2 test cases, pugixml and libexpat compatibility tests, W3C conformance tests
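The storage-type point in the list above can be illustrated without XML.jl at all. The following is a minimal sketch in plain Julia (Base only); `source` and the index arithmetic are hypothetical and do not use any XML.jl API. It shows why a `SubString{String}` view avoids duplicating text that already lives in the source buffer.

```julia
# Plain-Julia illustration; nothing here is XML.jl API.
source = "<greeting>hello</greeting>"

# Locate the text between the first '>' and the next '<'.
start = findfirst('>', source) + 1
stop  = findnext('<', source, start) - 1

copy_text = source[start:stop]              # String: copies the bytes into a new buffer
view_text = SubString(source, start, stop)  # SubString{String}: shares `source`'s bytes

println(copy_text, " ", typeof(copy_text))
println(view_text, " ", typeof(view_text))
```

Both values compare equal to `"hello"`, but only the indexing expression allocates a fresh `String`; the view keeps a reference to `source` plus an offset and a length.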

Downstream

@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in XLSX.jl tests related to Raw no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Benchmarks: See benchmarks/compare.jl

In the output below, (SS) marks runs that use SubString{String} as the storage type.

julia --project=. benchmarks/compare.jl
============================================================
  XML.jl Benchmark Comparison
  Current (dev) vs v0.3.8
============================================================

Running dev benchmarks... done
Setting up v0.3.8 worktree... done
Running v0.3.8 benchmarks... done

------------------------------------------------------------

  Parse (small)
          v0.3.8      0.114 ms
             dev     0.0335 ms  (70.6% faster)

  Parse (small, SS)
          v0.3.8           n/a
             dev     0.0285 ms

  Parse (medium)
          v0.3.8   634.7153 ms
             dev   161.0888 ms  (74.6% faster)

  Parse (medium, SS)
          v0.3.8           n/a
             dev   151.3025 ms

  Write (small)
          v0.3.8     0.0227 ms
             dev     0.0176 ms  (22.4% faster)

  Write (medium)
          v0.3.8   118.1504 ms
             dev     77.619 ms  (34.3% faster)

  Read file (medium)
          v0.3.8   645.5785 ms
             dev   170.8398 ms  (73.5% faster)

  Collect tags (small)
          v0.3.8     0.0005 ms
             dev     0.0006 ms  (10.3% slower)

  Collect tags (medium)
          v0.3.8    21.0988 ms
             dev    11.1532 ms  (47.1% faster)

============================================================

@TimG1964
Contributor

TimG1964 commented Mar 8, 2026

Hey @joshday. I've only had a very superficial look so far, but it looks great. Thanks!

In terms of impact on XLSX.jl, I think it looks significant. It isn't just Raw. Since @nhz2 first suggested using Raw, I've known it was internal and therefore subject to change. On first inspection, I think the rework involved should be manageable.

More of a challenge will be the removal of prev and next, which are currently exported functions. I rely on these for fundamental elements of XLSX.jl like the sheetrow and tablerow iterators, and for reading and writing the XML files from/to the zip archive .xlsx file.

These obviously aren't insuperable, but will likely need a bit of time while I get to grips with xpath and the tokenizer. Optimistic me thinks the new functionality will simplify the code of XLSX.jl, but I usually find things are considerably harder than I first imagine! I'll give more feedback once I've had more of a go at getting XLSX.jl working.

Thanks,

Tim

Comment thread ext/XMLStringViewsExt.jl Outdated
@TimG1964
Contributor

TimG1964 commented Apr 2, 2026

> I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade.

Thanks!

joshday and others added 6 commits April 2, 2026 16:49
Drops the underscore prefixes from internal names (module is unexported,
the clutter was only needed back when these names leaked into XML.jl).
Replaces the name-byte predicate with a 256-entry const lookup table.
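As a rough illustration of the lookup-table change described above, here is a minimal sketch. The names `is_name_byte_slow`, `NAME_BYTE_TABLE`, and `is_name_byte` are hypothetical, and the set of bytes treated as name bytes (ASCII letters, digits, `_`, `:`, `-`, `.`, and every byte >= 0x80) is an assumption, not necessarily what XMLTokenizer.jl uses.

```julia
# Hypothetical branchy predicate (assumed byte classes, for illustration only).
is_name_byte_slow(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    b == UInt8('_') || b == UInt8(':') || b == UInt8('-') || b == UInt8('.') ||
    b >= 0x80   # assumption: bytes of multi-byte UTF-8 sequences count as name bytes

# 256-entry table computed once at load time; entry i+1 answers for byte i.
const NAME_BYTE_TABLE = Bool[is_name_byte_slow(UInt8(i)) for i in 0:255]

# Table-driven predicate: one load instead of a chain of comparisons.
# `b + 1` is always in 1:256, so skipping the bounds check is safe.
@inline is_name_byte(b::UInt8) = @inbounds NAME_BYTE_TABLE[b + 1]

println(is_name_byte(UInt8('a')), " ", is_name_byte(UInt8('<')))
```

The table trades 256 bytes of constant data for a branch-free check in the tokenizer's inner loop.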

Also fixes a 1-based indexing off-by-one in read_doctype_body: the
'<!--' detection guarded with `pos >= 2` while reading
`codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.
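The off-by-one described above can be reproduced in isolation. This is a sketch only: `buggy_guard` and `fixed_guard` are hypothetical names, the comparison against `'<'` is illustrative, and `pos >= 3` is an assumed form of the fix rather than the exact code in the commit.

```julia
data = "<!--x"

# Guard as described in the commit message: it passes for pos == 2,
# and then codeunit(data, 0) is out of bounds under 1-based indexing.
buggy_guard(data, pos) = pos >= 2 && codeunit(data, pos - 2) == UInt8('<')

# Assumed correction: require pos - 2 >= 1 before reading.
fixed_guard(data, pos) = pos >= 3 && codeunit(data, pos - 2) == UInt8('<')

println(fixed_guard(data, 2))   # guard short-circuits, no read happens
println(fixed_guard(data, 3))   # reads codeunit 1
```

Calling `buggy_guard(data, 2)` throws a `BoundsError`, which is the failure mode the commit removes.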

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.

Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.
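A helper of this kind might look like the sketch below. The name `as_substring` (without the underscore) and the `Pair`-vector layout for attributes are assumptions made for illustration; only the idea of promoting a `String` to a `SubString{String}` comes from the commit message.

```julia
# Pass views through untouched; wrap an owned String in a view spanning all of it.
as_substring(s::SubString{String}) = s
as_substring(s::String) = SubString(s)

# With the promotion, a container of attribute pairs keeps one concrete element type.
attrs = Pair{SubString{String},SubString{String}}[]
source = "id=\"42\""
push!(attrs, SubString(source, 1, 2) => SubString(source, 5, 6))  # views into the source
push!(attrs, as_substring("title") => as_substring("a & b"))       # freshly allocated Strings

println(eltype(attrs))
```

Keeping the element type concrete lets the compiler avoid dynamic dispatch when iterating attributes.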

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
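The mixed-content rule described above can be sketched with a few stand-in types. `Child`, `TextNode`, `CDataNode`, `ElemNode`, `is_mixed`, and `formattable_children` are all hypothetical names; the real `_write_xml` works on XML.jl's own node type and may differ in detail.

```julia
# Stand-in node types, for illustration only.
abstract type Child end
struct TextNode  <: Child; value::String; end
struct CDataNode <: Child; value::String; end
struct ElemNode  <: Child; tag::String; end

is_whitespace_text(c) = c isa TextNode && all(isspace, c.value)

# Mixed content: any CData child, or any Text child with non-whitespace characters.
is_mixed(children) = any(c -> c isa CDataNode || (c isa TextNode && !all(isspace, c.value)), children)

# Mixed content is kept verbatim; otherwise whitespace-only Text nodes are dropped
# so the writer can generate fresh indentation.
formattable_children(children) =
    is_mixed(children) ? children : filter(!is_whitespace_text, children)

pretty = Child[TextNode("\n  "), ElemNode("a"), TextNode("\n")]
mixed  = Child[TextNode("Hello "), ElemNode("b"), TextNode("\n")]

println(length(formattable_children(pretty)))
println(length(formattable_children(mixed)))
```

In the first case only the element survives and can be re-indented; in the second all three children are preserved, including the trailing newline.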

Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.
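The fast path described above can be sketched as follows. `unescape_slow` is a hypothetical stand-in that handles only the five predefined entities, and `unescape_sketch` is likewise an illustrative name; neither is XML.jl's actual implementation. The multi-pair `replace` call assumes Julia 1.7 or later.

```julia
# Hypothetical general path: single-pass replacement of the predefined entities.
unescape_slow(s::AbstractString) =
    replace(String(s), "&lt;" => "<", "&gt;" => ">", "&amp;" => "&",
            "&quot;" => "\"", "&apos;" => "'")

# Fast path: without an '&' there is nothing to unescape, so hand back the
# same view and allocate nothing. Otherwise fall through to the general path,
# which returns a newly allocated String.
function unescape_sketch(s::SubString{String})
    '&' in s || return s
    return unescape_slow(s)
end

src   = "plain text &amp; more"
plain = SubString(src, 1, 10)   # "plain text", contains no '&'

println(unescape_sketch(plain) === plain)
println(unescape_sketch(SubString(src)))
```

Because the return type is a union of `SubString{String}` and `String`, callers that need a homogeneous type still have to promote the result.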

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The medium-file workloads show a ~10–25% regression vs the numbers
captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains
a 70–80% improvement, so this is a post-release follow-up, not a
release blocker. Suspected culprit is the eager Pair{S,S}[] alloc
per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 participants