diff --git a/crates/casefold/BLOG.md b/crates/casefold/BLOG.md new file mode 100644 index 0000000..d5353c0 --- /dev/null +++ b/crates/casefold/BLOG.md @@ -0,0 +1,582 @@ +# Case-folding source code at >40 GiB/s on a single core + +This started as a cleanup, not a project. Coming back to some AVX-optimized +case-folding code we'd written a couple of years earlier, I mostly wanted to know +whether all that hand-vectorized machinery was still worth maintaining — or +whether the problem had quietly solved itself in the meantime. The first surprise +was that the answer looked like "solved": Rust's `str::to_lowercase` in the +standard library was already *fast*, fast enough that the question seemed settled. + +The second surprise is the reason this post exists. I had Copilot write the most +boring version of the loop imaginable, then strip out one conditional at a time — +and somewhere along the way the plain, SIMD-free code pulled **~15× ahead** of the +naive version and comfortably past `to_lowercase`, without a single intrinsic. The +biggest win came from *removing* a common optimization — the "stop as soon as you +see a non-ASCII byte" early-exit — because that one branch is what stops the +compiler from vectorizing the loop at all. The end result case-folds pure-ASCII +text — the kind source code, logs, and URLs are made of — at over **40 GiB/s on a +single M4 core**: memory-bandwidth territory, where the CPU is essentially just +waiting for the bytes to arrive. + +This post is about how the common case got that fast — and how, once the ASCII +engine was running at memory speed, the comparatively sluggish throughput on the +*non*-ASCII path started to bug me too, and I pulled out all the stops there as +well: a 1.7 KB table that folds without ever decoding a character and still beats +the decode-first folders on most workloads. + +## Wait — why case-fold at all? + +Suppose a user searches for `straße` and your corpus contains `STRASSE`, or they +type `İstanbul` and you stored `istanbul`. 
To make these match you need a +canonical form that erases case distinctions, letting two strings that "differ only +in case" compare equal. That form is **case folding**, and it shows up wherever +text is *matched* rather than *displayed*: + +- **Search engines** push every indexed term and every query token through the + same fold, collapsing `Café`, `café`, and `CAFÉ` into one posting list. It runs on + every token at index time and query time — it has to be fast and + allocation-light. +- **Regex engines** implement the case-insensitive flag `(?i)` by folding the + pattern's character classes (and comparing against folded input). A hot inner + loop can't afford a hash lookup per character. +- **Identifiers and protocols** — case-insensitive comparison of usernames, + hostnames, file paths, HTTP headers, and so on. + +### Folding is not lowercasing + +It is tempting to reach for `str::to_lowercase`, but lowercasing and folding are +different operations with different goals: + +- **Lowercasing** is for *display* and is locale- and context-sensitive: Greek + final sigma `Σ` lowercases to `ς` at the end of a word and `σ` elsewhere; + Turkish `I` lowercases differently than English `I`. +- **Case folding** is for *comparison* and is deliberately context-free and + locale-independent, keeping the relation stable and symmetric. The Unicode + Character Database ships an explicit + [`CaseFolding.txt`](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt) + for exactly this. + +These diverge on real characters — `ß`, `İ`, final sigma — and lowercasing as a +stand-in silently produces wrong matches. This crate implements the **simple** +(1-to-1) folds — statuses `C` and `S` in `CaseFolding.txt` — and deliberately +*not* the multi-character "full" folds (`ß` → `ss`) or Turkic locale folds. 
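A small standard-library sketch of the lowercasing behaviour described above. It does not use this crate, and `std_lower` is a helper name made up for the illustration. It shows two ways lowercasing departs from a context-free, one-to-one fold: the result depends on position, and one character can turn into two.

```rust
/// Lowercases with the standard library, which applies the final-sigma rule.
/// (Hypothetical helper for this illustration, not part of the crate.)
fn std_lower(s: &str) -> String {
    s.to_lowercase()
}

fn main() {
    // A lone capital sigma lowercases to the medial form 'σ' (U+03C3)...
    assert_eq!(std_lower("Σ"), "σ");
    // ...but the same letter at the end of a word becomes 'ς' (U+03C2).
    assert!(std_lower("ΟΔΟΣ").ends_with('\u{3c2}'));
    // 'İ' (U+0130) lowercases to two code points: 'i' plus U+0307.
    assert_eq!(std_lower("İ"), "i\u{307}");
    println!("lowercasing is context-sensitive and not always 1-to-1");
}
```

A simple fold, by contrast, maps one code point to exactly one code point and never looks at its neighbours, which is what keeps the comparison stable.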
+ +That restriction is a feature, not a shortcut: regex engines like +[ripgrep](https://github.com/BurntSushi/ripgrep) (via the Rust `regex` crate) +match case-insensitively with the *same* simple-fold table, expanding each +character class over its simple folds. Sticking to simple folds keeps this +crate's results **consistent with the tools people already search with**, and +text normalized here matches the way a `(?i)` ripgrep query would. + +## The workload is mostly ASCII + +Now the observation that shapes the whole design: in practice, the text you fold +is overwhelmingly ASCII. + +Source code is the extreme case — keywords, identifiers, punctuation, and digits +are virtually all in `0x00..0x7F`, with the occasional non-ASCII byte confined to +a string literal or comment. The same is true of logs, URLs, HTTP headers, +machine-generated text, and most English-language content. A tool like ripgrep +grepping a codebase case-insensitively spends almost all of its time on ASCII. + +The ASCII path, then, isn't a corner case to tolerate on the way to the "real" +Unicode logic — it *is* the common case, and making it run at memory speed is +the single most important thing the crate does. Everything else exists to keep +the rare non-ASCII path from spoiling that. + +## The counterintuitive core: don't stop early + +The fold of an ASCII letter is trivial — `A..=Z` map to `a..=z`, everything else +is unchanged — making the ASCII pass really just "sweep the buffer, lowercase in +place." 
Ask any LLM for it and you might get something like this: + +```rust +let bytes = s.as_bytes_mut(); +for (i, b) in bytes.iter_mut().enumerate() { + if *b >= 0x80 { + break; // non-ASCII at index i: hand the rest to the Unicode path + } + if b.is_ascii_uppercase() { + *b += 32; // 'A'..='Z' → 'a'..='z' + } +} +``` + +It looks ideal: do the cheap byte work, and the instant you hit a non-ASCII byte, +`break` and let the "real" Unicode path take over — "only do the cheap work until +you have to." On an Apple M4 this runs at about **3 GiB/s**. That sounds +fine in isolation, but it is more than **15× short** of what the same work can +do — and the reason is every one of those `if`s. + +Two lines carry a **data-dependent branch**: the `b >= 0x80` early-exit and the +`is_ascii_uppercase` test. Data-dependent +branches in the loop body are exactly what stop the compiler from +auto-vectorizing, leaving this loop to run one byte at a time. Crucially, vectorization +is **all-or-nothing**: a single branch in the body is enough to disable it +entirely. Removing just `if b >= 0x80 { break }` doesn't shave a few percent off +— it's the difference between scalar and SIMD. + +Let's delete every branch, line by line: + +- **`if b >= 0x80 { break }`** → don't stop at all. OR every byte into an + accumulator and test it *once*, after the loop: `high_bit_acc |= *b`. Same + information (was there any non-ASCII byte?), zero branches in the body. +- **The `A..=Z` range test** → make it arithmetic. `b.wrapping_sub(b'A') < 26` is + true exactly for `A..=Z` (any other byte wraps to `≥ 26`), yielding a 0/1 mask + with no branch. +- **The conditional write** → fold the mask into the store. `| (is_upper << 5)` + sets bit 5 — turning an upper-case letter lower-case and being a no-op on + everything else — the byte is always written, never branched on. 
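As a quick sanity check on the arithmetic in the list above, the sketch below compares the branchless per-byte fold against a branchy reference over all 256 byte values. The function names are made up for this illustration; only the two expressions inside `fold_branchless` come from the text.

```rust
/// Branchy reference: lowercase only `A..=Z`, leave every other byte alone.
fn fold_branchy(b: u8) -> u8 {
    if b.is_ascii_uppercase() { b + 32 } else { b }
}

/// Branchless version built from the range test and the masked OR.
fn fold_branchless(b: u8) -> u8 {
    // Any byte outside A..=Z wraps to a value >= 26, so this is a 0/1 mask.
    let is_upper = b.wrapping_sub(b'A') < 26;
    // Bit 5 is clear in every upper-case letter, so OR-ing it in adds 32.
    b | (u8::from(is_upper) << 5)
}

fn main() {
    for b in 0..=u8::MAX {
        assert_eq!(fold_branchless(b), fold_branchy(b), "byte {b:#04x}");
    }
    println!("branchless fold matches the reference for all 256 bytes");
}
```

Because the check is exhaustive, it also covers the bytes `0x80..=0xFF`, which pass through untouched and are left for the non-ASCII path.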
+ +What's left has no branch in its body and no early exit: + +```rust +let mut high_bit_acc: u8 = 0; +for b in &mut bytes { + high_bit_acc |= *b; // detect any non-ASCII byte + let is_upper = b.wrapping_sub(b'A') < 26; // branchless A..=Z test + *b |= u8::from(is_upper) << 5; // set bit 5 → lowercase, else no-op +} +if high_bit_acc & 0x80 == 0 { + return bytes; // pure ASCII: already folded in place, no second buffer +} +``` + +A loop with no data-dependent control flow is trivially vectorizable: LLVM +emits 16-byte-at-a-time NEON and the whole thing runs at >**45 GiB/s** — +essentially memory bandwidth. And we come out of the pass already knowing, from +`high_bit_acc`, whether there's any non-ASCII work left to do. + +How much did each step matter? Measuring the cumulative ladder on pure ASCII +(Apple M4, 5.7 KB buffer): + +| version | throughput | vectorized? | +|---|--:|---| +| naive (break + branch test) | 3.1 GiB/s | no (0 vector instrs) | +| → branchless test/write, *keep* break | 2.6 GiB/s | no (0 vector instrs) | +| → drop the early-exit `break` | 7.6 GiB/s | **partially** (25 vector instrs) | +| → branchless test + write (the loop) | **46.9 GiB/s** | fully (41 vector instrs) | + +The early-exit is what gates vectorization at all: keep the `break` but make the +body perfectly branch-free and you still get **zero** vector instructions +(~2.6 GiB/s); a data-dependent loop exit is enough on its own to keep the loop +scalar. Only once the `break` is gone can the compiler vectorize. The final step — +making the upper-case fold branchless — then turns a *partially* vectorized loop +(which still compiles the conditional store to a compare-blend-masked-store, +~7.6 GiB/s) into the straight-line arithmetic that hits memory bandwidth. + +> **NOTE — branchless is a *pessimization* in scalar code.** Look again at the +> table: making the body branchless while *keeping* the `break` (2.6 GiB/s) is +> actually **slower** than the naive branchy loop (3.1 GiB/s). 
The asm explains +> why. The branchy version only stores a byte when it actually changes one — its +> conditional `strb` is skipped for every lowercase letter, digit and space (the +> vast majority of real text), and the well-predicted branch that guards it is +> nearly free. The branchless version replaces that rarely-taken store with an +> **unconditional `strb` every iteration**, writing back all ~5,700 bytes +> instead of just the handful of upper-case ones. Extra write traffic for no +> benefit. Branchless-write only *wins* once the loop vectorizes, because then the +> store becomes a single 16-byte vector write regardless of content and the +> per-byte cost disappears. The lesson: a branchless body is worth it **only** as +> the enabler for vectorization — on its own, in scalar code, it can cost you. + +There's also a middle ground, and it's the one the standard library takes. +Instead of testing one byte at a time, `[u8]::is_ascii` scans a **machine word at +a time** — on a 64-bit target it tests 16 bytes per iteration by OR-ing two +`u64` lanes and checking all their high bits with a single +`& 0x8080_8080_8080_8080` mask. You can build the ASCII fast path on top of that: +chunk-scan to find the ASCII prefix, then run the branchless (vectorizable) +convert over it. That keeps the early-exit ability — it still bails on the first +non-ASCII block — while letting both halves go fast. The catch is that it reads +the data **twice** (once to scan, once to convert), landing at about +**23 GiB/s** — roughly half of the single-pass branchless sweep, and ~7× the +naive break loop. A solid, general-purpose default; just not the absolute ceiling +when you control the whole loop and can fold detection and conversion into one +branch-free pass. + +> **NOTE — wouldn't *fusing* the two passes be faster?** It's the obvious next +> thought: keep the chunked early-exit but convert each 16-byte block right after +> you've confirmed it's ASCII, reading the data only *once*. 
Measured, it's +> **~2.6× slower** — 8.7 GiB/s versus the two-pass 23. The inner block convert +> still vectorizes to a single 16-byte op, but now there's a data-dependent +> early-exit branch *every 16 bytes*, and that branch pins the loop to one block +> at a time: the compiler can't unroll or software-pipeline across blocks, and each +> iteration pays the full load→test→branch→convert→store latency with nothing to +> hide it behind. Split into two passes, each one is clean: the scan is a +> branch-light, **store-free** word scan that races through memory, and the +> convert is the fully-vectorized branch-free sweep at ~47 GiB/s. Two fast, +> branch-free passes beat one branchy fused pass — even though the fused version +> touches the data half as many times. It's the same lesson one more time: in the +> hot loop, the branch is the enemy. + +It is genuinely faster to *unconditionally* sweep the entire buffer once, +branch-free, and decide what to do afterwards than to try to stop early. Stripped +of every branch, the loop becomes almost insultingly simple — a flat sequence of +loads, an OR, an add, and stores over a contiguous buffer — and a loop that +simple is a piece of cake for the compiler to vectorize into a 45 GiB/s racing +car. + +## Touch the heap only when you must + +40 GiB/s also means doing zero unnecessary allocation. `simple_fold` takes the +input `String` *by value*, so it owns the heap buffer and can mutate and return +it. If the OR-accumulator's high bit is clear, the input was pure ASCII and is +already folded in place, so we hand the **same allocation** straight back, with no +second buffer and no copy. Otherwise we `memchr` to the first non-ASCII byte and +scan the tail from there, leaving the output buffer *unallocated* (a null write +cursor) until we hit a character that folds to **different bytes**. 
Text whose +multibyte content never folds — CJK, Hangul, Kana, Arabic, Hebrew, symbols — also +returns the original allocation untouched, never copying a byte. + +Why a *second* buffer rather than rewriting in place like the ASCII pass? +Because folding can make the string **longer**: almost every fold preserves the +UTF-8 length or shrinks it, but two outliers grow — U+023A (`Ⱥ`) and U+023E +(`Ⱦ`) are 2 bytes each yet fold to 3-byte characters (`ⱥ`, `ⱦ`). Once one +appears, the output no longer fits in the input's bytes and we need somewhere +new to write. + +We allocate that buffer **once**, sized for the worst case, rather than growing +it as more folds appear. Incremental `reserve` calls would mean re-checking +capacity, occasionally reallocating, copying everything written so far, and +juggling extra length/capacity bookkeeping; a single up-front allocation lets a +raw write cursor run straight to the end with none of that. (And since the +cursor is null until that first growing/changing fold, it doubles as the "have +we started building yet?" flag — the decision to allocate costs no extra +state either.) + +Sizing it needs a bound on growth, and those same two outliers give it: every 2 +input bytes yield at most 3 output bytes, capping the output at **1.5× the +input** — exactly the capacity we reserve: + +```rust +out = Vec::with_capacity(bytes.len() + bytes.len() / 2 + 4); +``` + +After that the loop writes through a raw pointer with no capacity checks and +calls `set_len` exactly once at the end. Two more details keep it branch-light. +The run of unchanged bytes between two folds is moved with a single +`copy_nonoverlapping` rather than byte by byte. And each fold unconditionally +writes all 4 bytes of a little-endian word before bumping the cursor by only the +*folded* length (1–4) — dropping a branch on the output length from the hot path, +with the `+ 4` in the reservation as the headroom that makes the final +character's over-store safe. 
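To make the 1.5× bound concrete, here is a minimal sketch that checks the two growing characters and the reservation formula. It is not the crate's code: `worst_case_capacity` is a name invented for the illustration, and `str::to_lowercase` is used only as a stand-in, on the assumption that it agrees with the simple fold for these two code points.

```rust
/// Capacity reserved for the output: 1.5x the input plus 4 bytes of headroom
/// for the final 4-byte over-store (same formula as the snippet above).
fn worst_case_capacity(input_len: usize) -> usize {
    input_len + input_len / 2 + 4
}

fn main() {
    // U+023A and U+023E take 2 bytes in UTF-8; their folds take 3 bytes.
    for (upper, lower) in [('\u{023A}', '\u{2C65}'), ('\u{023E}', '\u{2C66}')] {
        assert_eq!(upper.len_utf8(), 2);
        assert_eq!(lower.len_utf8(), 3);
        // Stand-in for the fold: std lowercasing maps these the same way.
        assert_eq!(upper.to_lowercase().next(), Some(lower));
    }
    // Worst case: an input made only of growing characters.
    let input = "\u{023A}".repeat(1000);
    let folded = input.to_lowercase();
    assert_eq!(folded.len(), input.len() * 3 / 2);
    assert!(folded.len() <= worst_case_capacity(input.len()));
    println!("{} bytes in, {} bytes out", input.len(), folded.len());
}
```

Any other mix of characters grows less than this all-outlier input, so a buffer of that size never needs a second allocation.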
+ +## Making the rare path cheap too + +When a character *does* fold, we still don't want to fall off a cliff — decode +UTF-8, hash, re-encode. Unicode 16.0 has 1484 simple-fold mappings, but they're a +*very* sparse and *very* structured relation. Four observations shrink them to +**1776 bytes** and let the fold run **without ever decoding a full character**. + +But before any of that, the most important ingredient: even on the non-ASCII +path, the overwhelming majority of characters **do not fold**. CJK, Hangul, Kana, +Arabic, Hebrew, Indic scripts, emoji, punctuation, symbols — none of it folds. The +hot operation, then, isn't really "fold this character," it's "*does* this character +fold? — almost always no." The table has to make that **negative test** as close +to free as possible; the actual folding is the rare sub-case of an already-rare +path. That priority is what shapes the layout below — the page bitmap +([idea 1](#idea-1-foldable-code-points-cluster-into-64-code-point-pages)) exists precisely so a +non-folding character is rejected in a single bit test, straight from its leading +UTF-8 bytes, without decoding or scanning anything. + +This is exactly why a `HashMap` is the *wrong* shape for the job, not +just a bigger one. A hash map is optimized for the **hit**: it finds a present +key in roughly one probe, and only spends extra work (more probes, full key +comparison) when load factor or collisions bite. But our workload is dominated by +**misses** — characters that aren't in the table at all — and a miss is a hash +map's *least* favourite query: it still has to hash the key, jump to a bucket, +and walk the probe sequence far enough to *prove absence*. We'd be +paying the map's slow path on virtually every character and its fast path almost +never. The bitmap inverts that: the common case (no fold) is a single bit test, +and only the rare hit does any further work — the exact opposite of the hash +map's bias, and the right one for this data. 
+ +### Idea 1: foldable code points cluster into 64-code-point "pages" + +Foldable code points bunch together. Slice the code space into 64-code-point +"pages" and the ~1484 folds touch just **59** of ~1960 possible pages. A +one-bit-per-page **presence bitmap** answers the negative test on its own: a +clear bit is a *definitive* "no fold" — copy through, done — which is what makes +fold-free scripts cheap. Only on a set bit do we consult a second structure, a +**cumulative-popcount side table** that ranks the page (how many populated pages +precede it) to find its slice of entries, storing nothing for the ~1900 empty +pages. + +Why **64**? Six bits is exactly what makes the probe fall out of the UTF-8 bytes. +A continuation byte carries 6 payload bits, which makes the within-page offset `cp & 0x3F` +*literally the low 6 bits of the last byte*. Indexing the bitmap as 64-bit +words, the bit position is another 6 bits — straight from the second-to-last +byte — leaving only the higher bits as the word index. So the bit index is +always just the second-to-last byte masked with `0x3F`, and the word index is +`0`, a nibble, or (only for four-byte sequences) two merged bytes — a tiny +branch on the lead byte, no full code-point reconstruction: + +```rust +let (word_idx, bit_idx, c_len) = if lead < 0xE0 { + (0usize, lead & 0x1F, 2usize) // 2-byte: word 0 +} else if lead < 0xF0 { + ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3) // 3-byte: word = nibble +} else { + ( // 4-byte: merge 2 bytes + (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize, + bytes[read + 2] & 0x3F, + 4usize, + ) +}; +// reject without decoding: clear bit ⇒ no fold +if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 { + read += c_len; + continue; +} +``` + +Because `word_idx` depends only on the lead byte (and, for four-byte sequences, +the first continuation byte), the bitmap load can be issued early. 
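The claim that the bitmap coordinates fall straight out of the UTF-8 bytes can be checked exhaustively. The sketch below is a standalone illustration, not the crate's code: `page_coords` mirrors the three-way probe above, and `page_coords_decoded` is a reference, invented for this example, that decodes the code point first and then splits `cp >> 6` into word and bit.

```rust
/// Bitmap word index and bit index of the 64-code-point page for the
/// multibyte character encoded at the start of `bytes`, without decoding.
fn page_coords(bytes: &[u8]) -> (usize, u8) {
    let lead = bytes[0];
    if lead < 0xE0 {
        (0, lead & 0x1F) // 2-byte: word 0
    } else if lead < 0xF0 {
        ((lead & 0x0F) as usize, bytes[1] & 0x3F) // 3-byte: word = nibble
    } else {
        (
            (((lead & 0x07) as usize) << 6) | (bytes[1] & 0x3F) as usize,
            bytes[2] & 0x3F, // 4-byte: word merges two bytes
        )
    }
}

/// Reference: decode first, then split the page number `cp >> 6`.
fn page_coords_decoded(c: char) -> (usize, u8) {
    let page = (c as u32) >> 6;
    ((page >> 6) as usize, (page & 63) as u8)
}

fn main() {
    let mut buf = [0u8; 4];
    // Every non-ASCII scalar value; `char::from_u32` skips the surrogates.
    for c in (0x80..=0x10FFFF).filter_map(char::from_u32) {
        let encoded = c.encode_utf8(&mut buf).as_bytes();
        assert_eq!(page_coords(encoded), page_coords_decoded(c));
    }
    println!("byte-level page coordinates match the decoded ones");
}
```

The bounds check against the bitmap length is left out here; in the real probe it is what rejects pages beyond the last populated word.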
+ +### Idea 2: within a page, folds come in runs + +A set page bit tells us *something* on this page folds, but not which code points +or to what. The obvious encoding is one entry per foldable code point — but that +is both bulky and slow to search: a page can hold dozens of folds, and we'd have +to scan them all to find the one matching the current code point. The structure +of the data rescues us again. Adjacent code points overwhelmingly share the same +delta to their fold: `A`–`Z` all map `+32`, and Latin Extended is full of +*alternating* runs like `0x0100, 0x0102, 0x0104, …` where every second code point +folds. Instead of per-code-point entries we store **runs** — start, end, +stride, delta — and a 1-bit `stride` flag covers both the contiguous and the +every-other case. This interval compression collapses the ~1484 individual folds +into just **238** runs across the 59 pages (≈4 per page), leaving the within-page +search only a handful of entries to look at instead of dozens. This +range-with-delta encoding (including the stride trick) is borrowed from Go's +`unicode` package, whose +[`CaseRange`](https://github.com/golang/go/blob/master/src/unicode/tables.go) +records store a `Lo`/`Hi` range plus per-case deltas, with an `UpperLower` +sentinel marking the alternating blocks. Runs are clipped at the 64-cp page +boundaries so a run never straddles two pages — which is exactly what lets the +page bitmap above treat a clear bit as a *definitive* "no fold". + +### Idea 3: a run record is two clean bytes + +With both endpoints inside one page they +fit in 6 bits, split across two arrays: `RUN_END_LOW[i] = end & 0x3F` (the scan +key) and `RUN_START_STRIDE[i] = (start & 0x3F) | ((stride − 1) << 6)` (read only +on a hit). 
Because each key is one clean byte, the within-page search can go +**wide**: rather than comparing `cp & 0x3F` against the runs one at a time, we +load **8 `end_low` bytes into a single `u64` and test all of them at once** with +one branchless SWAR step — `(chunk | 0x80…80) − broadcast(low) & 0x80…80` sets +the top bit of every lane whose key is `≥ cp & 0x3F`. A single bit-scan of that +mask (the keys are sorted, so the first set lane is the run we want) finds the +slot. A page holds ~4 runs on average; that one 8-wide compare almost always +resolves the entire search in a single step. One unlucky page does hold 30 runs, +which puts the compare inside a short loop that strides 8 keys at a time — but that +loop trips at most a handful of times on exactly one page in all of Unicode, and +never on the common ones. Either way: no per-run branch, and no code-point +reconstruction anywhere. + +```rust +/// Offset of the first run with `end_low >= low_v` in a page of `n` runs, +/// or `n` if none. Scans 8 `end_low` bytes at a time via SWAR. +#[inline] +fn scan_end_low(lo: usize, n: usize, low_v: u8) -> usize { + const HIGH: u64 = 0x8080_8080_8080_8080; + const ONES: u64 = 0x0101_0101_0101_0101; + let bcast = (low_v as u64).wrapping_mul(ONES); + let mut base = 0; + while base < n { + // RUN_END_LOW is padded by 8 bytes so this read is always in bounds. + let chunk = u64::from_le_bytes( + RUN_END_LOW[lo + base..lo + base + 8] + .try_into() + .expect("8-byte slice"), + ); + // `(b | 0x80) - low_v` keeps its high bit iff `b >= low_v` (no + // cross-lane borrow). The first set lane is the first run `>= low_v`. 
+ let ge = (chunk | HIGH).wrapping_sub(bcast) & HIGH; + if ge != 0 { + let j = base + (ge.trailing_zeros() / 8) as usize; + return if j < n { j } else { n }; + } + base += 8; + } + n +} +``` + +### Idea 4: folding is a little-endian byte addition + +On a little-endian machine the +folded character's UTF-8 bytes, read as a `u32`, equal the source bytes (as a +`u32`) plus a **per-run constant**. A parallel `BYTE_DELTA[i]` table then turns the +whole fold into a masked load, one `wrapping_add`, and a 4-byte store: + +```rust +let word = u32::from_le_bytes(next_four_bytes) & length_mask; // keep this char's bytes +let folded = word.wrapping_add(BYTE_DELTA[i]); // the fold, as one byte add +write_u32_le(dst, folded); // store all 4 bytes... +dst += utf8_len(folded as u8); // ...advance by the folded length (low byte = lead byte) +``` + +Both lengths in that snippet — the `length_mask` for the source character and +the *advance by the folded length* for the destination — come from one more tiny +trick. A UTF-8 sequence's length is fixed by the top four bits of its lead byte, +so the lengths for all 16 possible top nibbles pack one nibble each into a single +64-bit constant (`0x4322_1111_1111_1111`); the length is then a shift and a mask, +`(UTF8_LEN_BY_LEAD >> (4 * (lead >> 4))) & 0xF` — no `if` chain, no table memory, +nothing for the predictor to get wrong. (A *count leading ones* — +`(!lead).leading_zeros()` — would also work for multibyte sequences, since their +lead byte carries one leading 1-bit per byte of the sequence, but ASCII bytes have +none and would need a special case, and the nibble shift also avoids the +bit-complement.) + +```rust +/// Number of bytes in the UTF-8 sequence whose lead byte is `lead`. 
+#[inline] +pub fn utf8_len(lead: u8) -> usize { + const UTF8_LEN_BY_LEAD: u64 = 0x4322_1111_1111_1111; + ((UTF8_LEN_BY_LEAD >> (4 * (lead >> 4))) & 0xF) as usize +} +``` + +Because we advance by the *folded* length, this even handles length-changing +folds — U+212A KELVIN SIGN (3 bytes) → `k` (1 byte), or U+023A `Ⱥ` (2 bytes) → +U+2C65 `ⱥ` (3 bytes) — by writing fewer or more bytes than were read.[^overlong] +And it's the part I believe is genuinely new: every other folder I looked at — +ICU, Go's `unicode`, Rust's `regex`, CPython, glibc — decodes UTF-8 to a code +point, applies the fold there, and re-encodes (even SIMD folders decode first). +Doing the arithmetic in byte space skips both the decode and the encode, which is +exactly why this path can outrun a hash map that already has the answer +tabulated — the hash map still has to decode its key and encode its result. + +[^overlong]: The byte-space arithmetic assumes the input is **well-formed, + shortest-form UTF-8** — every code point encoded with the minimal number of + bytes. Reading the source bytes as a `u32` and adding a per-run delta only + lands on the correct folded encoding when the source is in canonical form; + an *overlong* encoding (a code point padded into more bytes than necessary, + e.g. `/` as `0xC0 0xAF`) has a different byte pattern and would break the + `length_mask` and the delta arithmetic. This is not a real restriction in + Rust — `&str`/`String` are guaranteed to hold valid UTF-8, which by + definition rejects overlong sequences — but a caller feeding raw bytes from + elsewhere must validate (or otherwise normalize) them first. + +### The ASCII shortcut in the tail loop + +One more shortcut rounds out the tail loop. Remember the first pass already +lowercased every ASCII byte, so when the scan meets an ASCII byte in the tail it +advances a single byte and moves on — no page probe, no table touch at all. 
And +it doesn't copy that byte either: unmodified bytes (ASCII and non-folding +multibyte alike) aren't moved one at a time. The scan just keeps walking until it +reaches a character that actually folds, then flushes the whole unchanged run +between the last fold and this one with a single `copy_nonoverlapping`. Mixed +text — CJK with ASCII spaces and punctuation, or code with the occasional +accented identifier — therefore races through the ASCII filler and only consults +the bitmap for genuine multibyte characters, copying in bulk rather than byte by +byte. + +### Putting it together: the whole table + +| Component | Bytes | +|----------------------------------------------------|-------:| +| `PAGE_BITMAP` (1 bit per 64-cp page) | 248 | +| `POPCNT_SAMPLES` (cumulative popcount) | 32 | +| `PAGE_OFFSET` (per populated page) | 60 | +| `RUN_END_LOW` (scan key, `end & 0x3F`, +8 pad) | 246 | +| `RUN_START_STRIDE` (`start & 0x3F` \| stride) | 238 | +| `BYTE_DELTA` (little-endian fold delta per run) | 952 | +| **Total** | **1776** | + +That's **9.6 bits per fold entry** — over half of it the `BYTE_DELTA` side table +we trade for the decode-free path; the index + run records alone are ~4.4 +bits/entry. + +Next to the obvious alternatives, those 1776 bytes are roughly 4× to 40× +smaller, and unlike most of them the table never requires decoding a character: + +| Representation | Size | +|-----------------------------------------------------|-----------:| +| Naïve `[(u32, u32); 1484]` | ~11.6 KB | +| `regex-syntax`'s `case_folding_simple` table | ~70 KB | +| Go's `unicode.SimpleFold` (orbit + ASCII + ranges) | ~7.3 KB | +| A runtime `HashMap` | ~17 KB | +| **This crate (paged bitmap + packed runs)** | **1776 B** | + +## How fast is it? + +Criterion medians on an Apple M4 (single core, `target-cpu=native`). 
Treat the +absolute figures as illustrative, not portable: the whole design leans on +auto-vectorization, SWAR, and little-endian byte arithmetic, so the numbers — and +even the *ratios* between rows — can shift substantially on a different +microarchitecture (a wider or narrower vector unit, different memory bandwidth, a +big-endian target, x86 vs ARM). The qualitative story holds; the exact GiB/s do +not. + +The other **true case-folders** — +`simd-normalizer` and the same byte path backed by a simple `HashMap` — produce +identical output. `str::to_lowercase` does *not* in general, but on pure ASCII it +coincides with the fold exactly, earning a spot on that row as the correct +std-library baseline. The final column is **not** a folder at all: it is +[`simdutf`](https://github.com/simdutf/simdutf)'s UTF-8 → UTF-32 → UTF-8 round +trip — decoding to code points with a state-of-the-art SIMD decoder and +re-encoding them — included as the *transcoding tax* any folder that reconstructs +code points must pay around its lookup (both buffer lengths assumed known, so only +the two transcodes are timed; no folding happens in between): + +| Workload (input size) | `simple_fold` | `simd_normalizer` | `HashMap` (byte path) | `str::to_lowercase` | `simdutf` round-trip | +|----------------------------------------|---------------:|------------------:|----------------------:|--------------------:|---------------------:| +| Pure ASCII (5.7 KB) | **40.8 GiB/s** | 1.21 GiB/s | 213 MiB/s | 27.7 GiB/s | 9.33 GiB/s | +| CJK, no folds (8.1 KB) | **2.95 GiB/s**| 1.97 GiB/s | 558 MiB/s | — | 2.57 GiB/s | +| Symbols / Myanmar, no folds (9.0 KB) | **2.96 GiB/s**| 1.56 GiB/s | 410 MiB/s | — | 2.00 GiB/s | +| Mixed BMP, all folding (8.8 KB) | 869 MiB/s | **922 MiB/s**| 334 MiB/s | — | 1.99 GiB/s | +| Length-changing folds (1.7 KB) | **1.26 GiB/s**| 716 MiB/s | 233 MiB/s | — | 1.77 GiB/s | + +The headline ASCII row is the workload that dominates real text, and it runs an +order of magnitude 
faster than the SIMD-dispatching `simd-normalizer` and ~200× +faster than a `HashMap` — purely because the common path is one branch-free +vectorized sweep. The no-fold rows (CJK, symbols) run at GiB/s for the same +reason: the page-bitmap probe rejects whole characters from their lead bytes and +the original buffer is returned without a single byte copied. Even on the +identical byte-level fold, the compact table beats a `HashMap` by roughly 2.6–7× +on the multibyte rows at ~10× less memory; `simple_fold` only trails on +all-folding mixed-BMP text, where +`simd-normalizer` edges ahead by a hair (922 vs 869 MiB/s). + +The `simdutf` column is really about the **multibyte** rows. The ASCII figure +(9.33 GiB/s) is almost meaningless: nobody would transcode pure ASCII to 4-byte +code units and straight back — there is nothing to gain and a 4× blow-up in +memory traffic to pay, so the round trip there measures a step no sane folder +takes. It is the non-ASCII path where a code-point-based folder is genuinely +*forced* to transcode, and that's where this number bites. Decode-*and*-re-encode +is the unavoidable envelope of every such design — ICU, Go's `unicode`, the +`regex` crate, CPython all decode UTF-8 to code points and re-encode the result — +and even a world-class SIMD transcoder caps out at **1.8–2.6 GiB/s** on multibyte +input. That round trip is a *floor on the competition*: any folder that decodes +first has already spent this much transcoding before it looks a single character +up. Yet `simple_fold` beats it outright on the no-fold rows (2.95 vs 2.57, 2.96 +vs 2.00 GiB/s) — the very rows where a decode-then-fold design would be paying the +full transcoding tax — because it answers the real question, *does this character +fold?*, straight from the raw bytes without ever decoding. Folding in byte space +doesn't just beat the hash map's lookup; on multibyte text it beats the +decode-and-re-encode that every code-point-based folder pays around the lookup. 
+ +The pure-ASCII row is the fairest fight of all: there `str::to_lowercase` +produces the **exact same bytes** we do — a correct std-library baseline +rather than a different operation — and even then the branch-free sweep is ~1.5× +faster (40.8 vs 27.7 GiB/s), because `to_lowercase` still scans for the first +non-ASCII byte and allocates a fresh `String` instead of folding in place. On +multibyte inputs `to_lowercase` both diverges from the fold *and* slows to +roughly 290–500 MiB/s. + +## Takeaways + +Case folding sounds solved — uppercase to lowercase, how hard can it be? Yet a +task this basic hid two *surprising* wins, in opposite directions. + +On the common path, the win was doing **more** work: deleting the "obvious" +early-exit so the loop sweeps the whole buffer ran ~15× faster, because the +branch we added to *save* work was the very thing blocking vectorization. + +On the rare path, the win was looking at the *shape* of the data instead of +reaching for a hash map: case folding is sparse, run-heavy, and page-clustered, +and UTF-8's little-endian layout turns a code-point delta into a plain integer +add — so a 1.7 KB table beats a hash map on both size and speed. + +The meta-lesson: "basic" rarely means "fully explored." Measuring instead of +guessing — and questioning the optimization everyone reaches for first — can +still find an order of magnitude. That's the fun part. + +The crate is [`casefold`](./README.md); the generated table and full design notes +live alongside the source. diff --git a/crates/casefold/src/lib.rs b/crates/casefold/src/lib.rs index f826fe1..ab6ef41 100644 --- a/crates/casefold/src/lib.rs +++ b/crates/casefold/src/lib.rs @@ -167,29 +167,28 @@ fn fold_non_ascii_tail(bytes: Vec, start: usize) -> Vec { read += 1; continue; } - // Page-precision reject probe (see the module docs). - let (page, c_len) = if lead < 0xE0 { - ((lead & 0x1F) as u32, 2usize) + // Page-precision reject probe. 
Compute the bitmap word index and the + // within-word bit directly per length, instead of forming a combined + // `page` and re-splitting it. `word_idx` stays a function of `lead` + // (plus, for 4-byte, the first continuation byte) so the bitmap load + // issues early. + let (word_idx, bit_idx, c_len) = if lead < 0xE0 { + (0usize, lead & 0x1F, 2usize) } else if lead < 0xF0 { - ( - (((lead & 0x0F) as u32) << 6) | (bytes[read + 1] & 0x3F) as u32, - 3, - ) + ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3usize) } else { ( - (((lead & 0x07) as u32) << 12) - | (((bytes[read + 1] & 0x3F) as u32) << 6) - | (bytes[read + 2] & 0x3F) as u32, - 4, + (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize, + bytes[read + 2] & 0x3F, + 4usize, ) }; - let word_idx = (page >> 6) as usize; - if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> (page & 63)) & 1 == 0 { + if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 { read += c_len; continue; } let low_v = bytes[read + c_len - 1] & 0x3F; - let dense = popcount_up_to(page) as usize; + let dense = popcount_up_to(word_idx, bit_idx) as usize; let lo = PAGE_OFFSET[dense] as usize; let n = PAGE_OFFSET[dense + 1] as usize - lo; let off = scan_end_low(lo, n, low_v); @@ -310,13 +309,11 @@ const fn table_size_bytes() -> usize { // RUN_START_STRIDE[i] = (start & PAGE_MASK) | ((stride - 1) << 6) // (membership, vs `cp & 0x3F`) -/// Number of populated pages strictly before `page`. +/// Number of populated pages strictly before the page at `word_idx`/`bit_idx`. #[inline] -fn popcount_up_to(page: u32) -> u32 { - let word_idx = (page / 64) as usize; - let bit_in_word = page % 64; +fn popcount_up_to(word_idx: usize, bit_idx: u8) -> u32 { let base = POPCNT_SAMPLES[word_idx] as u32; - let partial = PAGE_BITMAP[word_idx] & ((1u64 << bit_in_word).wrapping_sub(1)); + let partial = PAGE_BITMAP[word_idx] & ((1u64 << bit_idx).wrapping_sub(1)); base + partial.count_ones() }