Fix O(N²) duplicate-attribute check in Attributes iterator#971
Fix O(N²) duplicate-attribute check in Attributes iterator#971qifan-sailboat wants to merge 1 commit into
Conversation
`IterState::check_for_duplicates` previously did a linear `Vec` scan of all already-seen attribute name ranges for every new attribute, so a start tag with N distinct attribute names cost O(N²/2) byte comparisons. On untrusted XML this is a CPU-exhaustion DoS: N=80,000 took ~6.1 s in release; N=800,000 takes ~10 minutes and cannot be interrupted by an I/O-gated timeout because the loop performs no reads. Add a `HashSet<u64>` of `DefaultHasher` hashes of the key bytes as an O(1) pre-filter. A fresh hash means the key cannot be a duplicate. On a hash hit (a real duplicate, or an astronomically rare 64-bit collision) we fall back to the existing linear scan to recover the exact previous position for `AttrError::Duplicated(new, prev)`, so error semantics are unchanged. The set is lazily allocated so `IterState::new` (and the public `Attributes::new`/`Attributes::html`) stay `const fn`. N=80,000 distinct attributes: 6109 ms → 4 ms. Fixes tafia#969. Co-Authored-By: Claude <noreply@anthropic.com>
|
And the distinct attrs column, I assume that represents N attributes within one element, as opposed to N attributes interspersed across many elements? |
|
Yes. It means N distinct attributes on a single start tag. |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #971 +/- ##
==========================================
+ Coverage 57.31% 57.37% +0.06%
==========================================
Files 46 46
Lines 18197 18207 +10
==========================================
+ Hits 10429 10447 +18
+ Misses 7768 7760 -8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| .find(|r| slice[(*r).clone()] == slice[key.clone()]) | ||
| { | ||
| return Err(AttrError::Duplicated(key.start, prev.start)); | ||
| let mut h = DefaultHasher::new(); |
There was a problem hiding this comment.
SipHash (the Rust default) is fairly expensive in computational terms, though I suppose if the goal is DoS-resistance is the most "safe" option.
Nonetheless would you mind running some benchmarks (e.g. https://github.com/tafia/quick-xml/blob/master/benches/microbenches.rs#L177) to compare how the existing implementation compares to this implementation w/ SipHash and aHash for normal inputs? Or at least the former two.
I would just like an idea of what impact this will have for the standard case.
|
Please run |
| /// so a start tag with many distinct attribute names cost O(N²) byte | ||
| /// comparisons. With the hash pre-filter the same input is O(N). | ||
| #[test] | ||
| fn many_distinct_attributes_is_linear() { |
There was a problem hiding this comment.
I don't love the idea of having something like this in a unit test. I suppose if it's reliable enough, it could be OK. On the other hand maybe it's best to just have a criterion microbenchmark, those tend to get run for any important implementation changes.
@Mingun , thoughts
There was a problem hiding this comment.
Yes, I agree, this should be placed in ./benches directory. I suggest to place it into ./benches/issue971.rs. Benchmarks runs as tests on CI, so we will get the same results as for unit test:
quick-xml/.github/workflows/rust.yml
Lines 64 to 65 in 9aaea92
Mingun
left a comment
There was a problem hiding this comment.
Please move the test into the benchmark directory and could you investigate the possibility to address other comments. You may force-push the changes, because that PR is small enough (and anyway, we prefer to have clean history).
| /// so a start tag with many distinct attribute names cost O(N²) byte | ||
| /// comparisons. With the hash pre-filter the same input is O(N). | ||
| #[test] | ||
| fn many_distinct_attributes_is_linear() { |
There was a problem hiding this comment.
Yes, I agree, this should be placed in ./benches directory. I suggest to place it into ./benches/issue971.rs. Benchmarks runs as tests on CI, so we will get the same results as for unit test:
quick-xml/.github/workflows/rust.yml
Lines 64 to 65 in 9aaea92
| /// the duplicate check is amortised O(N) over the whole start tag instead of | ||
| /// O(N²). Lazily allocated on first use so that [`IterState::new`] can stay | ||
| /// `const`. | ||
| key_hashes: Option<HashSet<u64>>, |
There was a problem hiding this comment.
It seems, we should use NoHasher here because we already put the hash to the set.
| pre-filter so the no-duplicate path is amortised O(1) per attribute; the | ||
| exact `AttrError::Duplicated(new, prev)` positions are unchanged. | ||
|
|
||
| [#969]: https://github.com/tafia/quick-xml/issues/969 |
There was a problem hiding this comment.
We put links at the end of version section (here you put it at the end of Bug Fixes section). Could you please move it below (keep the 2 blank lines before the next ## section)
Fixes #969.
IterState::check_for_duplicatesdid a linearVecscan of all already-seen attribute name ranges for every new attribute, so a start tag with N distinct attribute names cost O(N²/2) byte comparisons. On untrusted XML this is a CPU-exhaustion DoS — see #969 for measurements (N=80,000 ≈ 6.1 s release; N=800,000 ≈ 10 min) and the demonstrated downstream impact on NLnet Labs Routinator.This adds a
HashSet<u64>ofDefaultHasherhashes of the key bytes as an O(1) pre-filter:Ok(the no-duplicate path is now amortised O(1) per attribute / O(N) per start tag);AttrError::Duplicated(new, prev)— error semantics are unchanged.The set is lazily allocated (
Option<HashSet<u64>>) soIterState::new,Attributes::newandAttributes::htmlstayconst fn.A timing regression test is included in
events::attributes::xml::duplicated::with_check.