Skip to content

Fix O(N²) duplicate-attribute check in Attributes iterator#971

Open
qifan-sailboat wants to merge 1 commit into
tafia:masterfrom
qifan-sailboat:fix/969-attributes-dup-check-quadratic
Open

Fix O(N²) duplicate-attribute check in Attributes iterator#971
qifan-sailboat wants to merge 1 commit into
tafia:masterfrom
qifan-sailboat:fix/969-attributes-dup-check-quadratic

Conversation

@qifan-sailboat

Copy link
Copy Markdown

Fixes #969.

IterState::check_for_duplicates did a linear Vec scan of all already-seen attribute name ranges for every new attribute, so a start tag with N distinct attribute names cost O(N²/2) byte comparisons. On untrusted XML this is a CPU-exhaustion DoS — see #969 for measurements (N=80,000 ≈ 6.1 s release; N=800,000 ≈ 10 min) and the demonstrated downstream impact on NLnet Labs Routinator.

This adds a HashSet<u64> of DefaultHasher hashes of the key bytes as an O(1) pre-filter:

  • a fresh hash means the key cannot be a duplicate → push and return Ok (the no-duplicate path is now amortised O(1) per attribute / O(N) per start tag);
  • on a hash hit (a real duplicate, or an astronomically rare 64-bit collision) fall back to the existing linear scan to recover the exact previous position for AttrError::Duplicated(new, prev) — error semantics are unchanged.

The set is lazily allocated (Option<HashSet<u64>>) so IterState::new, Attributes::new and Attributes::html stay const fn.

N (distinct attrs) before after
10,000 75 ms 0 ms
20,000 303 ms 1 ms
40,000 1,251 ms 2 ms
80,000 6,109 ms 4 ms

A timing regression test is included in events::attributes::xml::duplicated::with_check.

`IterState::check_for_duplicates` previously did a linear `Vec` scan of
all already-seen attribute name ranges for every new attribute, so a
start tag with N distinct attribute names cost O(N²/2) byte comparisons.
On untrusted XML this is a CPU-exhaustion DoS: N=80,000 took ~6.1 s in
release; N=800,000 takes ~10 minutes and cannot be interrupted by an
I/O-gated timeout because the loop performs no reads.

Add a `HashSet<u64>` of `DefaultHasher` hashes of the key bytes as an
O(1) pre-filter. A fresh hash means the key cannot be a duplicate. On a
hash hit (a real duplicate, or an astronomically rare 64-bit collision)
we fall back to the existing linear scan to recover the exact previous
position for `AttrError::Duplicated(new, prev)`, so error semantics are
unchanged. The set is lazily allocated so `IterState::new` (and the
public `Attributes::new`/`Attributes::html`) stay `const fn`.

N=80,000 distinct attributes: 6109 ms → 4 ms.

Fixes tafia#969.

Co-Authored-By: Claude <noreply@anthropic.com>
@dralley

dralley commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

And the distinct attrs column, I assume that represents N attributes within one element, as opposed to N attributes interspersed across many elements?

@qifan-sailboat

Copy link
Copy Markdown
Author

Yes. It means N distinct attributes on a single start tag.

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 57.37%. Comparing base (e00ae5c) to head (703c47b).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/events/attributes.rs 95.83% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #971      +/-   ##
==========================================
+ Coverage   57.31%   57.37%   +0.06%     
==========================================
  Files          46       46              
  Lines       18197    18207      +10     
==========================================
+ Hits        10429    10447      +18     
+ Misses       7768     7760       -8     
Flag Coverage Δ
unittests 57.37% <95.83%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/events/attributes.rs
.find(|r| slice[(*r).clone()] == slice[key.clone()])
{
return Err(AttrError::Duplicated(key.start, prev.start));
let mut h = DefaultHasher::new();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SipHash (the Rust default) is fairly expensive in computational terms, though I suppose if the goal is DoS-resistance is the most "safe" option.

Nonetheless would you mind running some benchmarks (e.g. https://github.com/tafia/quick-xml/blob/master/benches/microbenches.rs#L177) to compare how the existing implementation compares to this implementation w/ SipHash and aHash for normal inputs? Or at least the former two.

I would just like an idea of what impact this will have for the standard case.

@dralley

dralley commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Please run cargo fmt

Comment thread src/events/attributes.rs
/// so a start tag with many distinct attribute names cost O(N²) byte
/// comparisons. With the hash pre-filter the same input is O(N).
#[test]
fn many_distinct_attributes_is_linear() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the idea of having something like this in a unit test. I suppose if it's reliable enough, it could be OK. On the other hand maybe it's best to just have a criterion microbenchmark, those tend to get run for any important implementation changes.

@Mingun , thoughts

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, this should be placed in ./benches directory. I suggest to place it into ./benches/issue971.rs. Benchmarks runs as tests on CI, so we will get the same results as for unit test:

- name: Run tests + benchmarks
run: cargo test --all-features --benches --tests

@Mingun Mingun left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the test into the benchmark directory and could you investigate the possibility to address other comments. You may force-push the changes, because that PR is small enough (and anyway, we prefer to have clean history).

Comment thread src/events/attributes.rs
/// so a start tag with many distinct attribute names cost O(N²) byte
/// comparisons. With the hash pre-filter the same input is O(N).
#[test]
fn many_distinct_attributes_is_linear() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, this should be placed in ./benches directory. I suggest to place it into ./benches/issue971.rs. Benchmarks runs as tests on CI, so we will get the same results as for unit test:

- name: Run tests + benchmarks
run: cargo test --all-features --benches --tests

Comment thread src/events/attributes.rs
/// the duplicate check is amortised O(N) over the whole start tag instead of
/// O(N²). Lazily allocated on first use so that [`IterState::new`] can stay
/// `const`.
key_hashes: Option<HashSet<u64>>,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems, we should use NoHasher here because we already put the hash to the set.

Comment thread Changelog.md
pre-filter so the no-duplicate path is amortised O(1) per attribute; the
exact `AttrError::Duplicated(new, prev)` positions are unchanged.

[#969]: https://github.com/tafia/quick-xml/issues/969

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We put links at the end of version section (here you put it at the end of Bug Fixes section). Could you please move it below (keep the 2 blank lines before the next ## section)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IterState::check_for_duplicates — O(N²) attribute-key linear scan → CPU DoS on untrusted XML

4 participants