Concurrent in-memory cache benchmarks for Rust. Compares throughput, hit rate, and latency across 14 caching libraries under a realistic Zipfian workload.
| Component | Details |
|---|---|
| CPU | Intel Xeon Platinum 8581C @ 2.10GHz |
| Cores/Threads | 4 cores / 8 threads (1 socket, SMT enabled) |
| Architecture | x86_64 |
| Memory | 62 GB (61 GB available) |
| Swap | None |
| Disk | 9.7 GB (3.3 GB free, 65% used) |
| OS | Debian 12 (Bookworm) |
| Kernel | 6.1.0-43-cloud-amd64 |
| Machine Type | n4-highmem-8 |
The Concurrency column is a critical fairness disclosure: throughput numbers
are dominated by lock topology as much as by eviction policy. Caches marked
"sharded" serialise only within their shard; caches marked "single Mutex" are
wrapped by this benchmark in a `parking_lot::Mutex` because the underlying
crate is not `Sync`-safe for concurrent writers. Compare like-for-like before
drawing conclusions; a sketch of the wrapping follows the table below.
| Crate | Strategy | Concurrency |
|---|---|---|
| moka | TinyLFU / W-TinyLFU | sharded (built-in) |
| mini-moka | TinyLFU (lighter weight) | sharded (built-in) |
| quick_cache | LRU / CLOCK-Pro | sharded (built-in) |
| schnellru | LRU | single Mutex (this harness) |
| stretto | Ristretto (TinyLFU) | sharded (built-in) - see footnote † |
| lru | LRU | single Mutex (this harness) |
| TinyUFO | TinyUFO | sharded (built-in) |
| foyer | S3-FIFO | sharded (built-in) |
| cached | LRU (proc-macro) | single Mutex (this harness) |
| clru | Count-Min LRU | single Mutex (this harness) |
| lru-mem | Memory-bounded LRU | single Mutex (this harness) - see footnote ‡ |
| sieve-cache | SIEVE | sharded (ShardedSieveCache) |
| caches | Two-Queue | single Mutex (this harness) - see footnote § |
| neocache | S3-FIFO + DashMap | sharded (built-in) |
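As a concrete illustration of the single-mutex wrapping, here is a minimal sketch assuming the `lru` crate; the real adapters live in the `src/caches/<name>.rs` files and may differ in detail:

```rust
// Hypothetical sketch of how a non-concurrent cache is wrapped; the
// actual adapter code is in src/caches/<name>.rs.
use std::num::NonZeroUsize;
use parking_lot::Mutex;

/// The whole cache serialises behind one lock, so every writer (and,
/// for `lru`, every reader) contends on the same mutex.
struct MutexLru {
    inner: Mutex<lru::LruCache<u64, Vec<u8>>>,
}

impl MutexLru {
    fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(lru::LruCache::new(
                NonZeroUsize::new(capacity).expect("capacity must be > 0"),
            )),
        }
    }

    // Even reads take the exclusive lock: LRU bookkeeping mutates the
    // recency list, so `get` needs `&mut` access to the inner cache.
    fn get(&self, key: &u64) -> Option<Vec<u8>> {
        self.inner.lock().get(key).cloned()
    }

    fn insert(&self, key: u64, value: Vec<u8>) {
        self.inner.lock().put(key, value);
    }
}
```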
Every cache is constructed with default parameters unless listed here. Anything
on this list is a deliberate choice made on behalf of the cache; the rationale
lives in the matching `src/caches/<name>.rs` file. If you believe one of these
tunings biases the comparison, please file a methodology issue with a
before/after diff.
- † stretto - constructed with `num_counters = 10 × cache_size`, `set_ignore_internal_cost(true)`, and `set_buffer_size(num_tasks × 4 KiB)`. Stretto additionally receives a unique 3-phase warmup (insert → read pass to populate TinyLFU frequency counts → re-insert with TinyLFU-informed admission) because its frequency sketch is updated only from reads, never from inserts; a single-phase warmup leaves admission decisions effectively random and collapses the hit rate to ~45% (see the sketch after this list).
- ‡ lru-mem - capacity is byte-budgeted (`cache_size × value_size`). The budget covers value bytes only; `String` headers and key-string storage are not accounted for, so the effective entry count slightly exceeds `cache_size`. Documented for transparency; correcting it would require bespoke `HeapSize` accounting that other caches do not model.
- § caches (Two-Queue) - the `caches` crate has not seen a release since 2022 and pulls in `rand 0.8.5`. Included for historical comparison; a future PR may retire it once a maintained alternative exists.
- mini-moka - unlike moka, mini-moka 0.10 does not expose a public flush analogous to `run_pending_tasks()` (the equivalent `sync()` is private), so warmup completion is not synchronous on this cache. The first measurement iteration absorbs the residual eviction work; later iterations measure steady state.
- lru, schnellru, clru, cached, caches (two-queue), lru-mem - wrapped in a single `parking_lot::Mutex` because none of them is `Sync` for concurrent writers. This makes them strictly worse than sharded caches under the default 8-thread workload; the comparison is still meaningful as a deployment signal ("do not use these from multiple writer threads"), not as a pure algorithmic ranking.
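To make the stretto warmup footnote concrete, here is a minimal sketch of the 3-phase sequence, written against a hypothetical generic cache trait rather than stretto's actual API; it only illustrates the phase ordering:

```rust
// Sketch of the 3-phase warmup from the stretto footnote, against a
// hypothetical minimal cache interface (not stretto's real API).
trait WarmableCache {
    fn insert(&self, key: u64, value: Vec<u8>);
    fn get(&self, key: &u64) -> Option<Vec<u8>>;
}

fn three_phase_warmup<C: WarmableCache>(cache: &C, keys: &[u64], value: &[u8]) {
    // Phase 1: blind inserts populate the cache body.
    for &k in keys {
        cache.insert(k, value.to_vec());
    }
    // Phase 2: a read pass trains the TinyLFU frequency sketch - in
    // Ristretto-style caches only reads bump the counters, never inserts.
    for &k in keys {
        let _ = cache.get(&k);
    }
    // Phase 3: re-insert so the now-informed admission policy keeps hot
    // keys and rejects cold ones, instead of admitting at random.
    for &k in keys {
        cache.insert(k, value.to_vec());
    }
}
```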
- Rust toolchain (MSRV 1.94, pinned in `rust-toolchain.toml`, `Cargo.toml`'s `rust-version`, and the CI workflow). Installed automatically via `rustup`.
- For best results, run on a quiet machine (no other heavy processes).
```bash
# Run all caches (release build is required for meaningful numbers)
cargo run --release

# Run a subset
cargo run --release -- --caches moka,quick_cache,neocache

# Override config parameters at runtime (no recompile needed)
cargo run --release -- --size 10000 --zipf 0.9
cargo run --release -- --pattern uniform --ratio 0.5
cargo run --release -- --size 50000 --keys 800000 --iters 20

# Print brief flag summary
cargo run --release -- --help

# Print full configuration reference, CLI flags, and benchmark methodology
cargo run --release -- --info
```

Note: `.cargo/config.toml` sets `-C target-cpu=native`, so the binary is optimised for the machine it is built on. Do not compare binaries built on different machines.
| Column | Meaning |
|---|---|
| `ops/sec` | Median throughput across all measurement iterations - total operations (reads + writes) per wall-clock second |
| `±ci95` | 95% CI half-width as a percentage of the mean throughput (uses the actual n for this cache). Two caches whose CI ranges don't overlap are distinguishable at the 95% level |
| `cv%` | Coefficient of variation (stddev / mean × 100). < 3% = stable; > 10% = noisy, flagged with `!` |
| `n` | Actual number of measurement iterations run. Adaptive early stopping may converge before the 30-iteration maximum |
| `p50μs` | Median of per-iteration p50 latencies in μs - the typical latency of a representative single run |
| `p99μs` | Median of per-iteration p99 latencies in μs - avoids the pooling bias of aggregating all iterations before computing percentiles |
| `tailμs` | Tail amplification = p99μs - p50μs (both medians) in μs. Lower = more consistent per-op cost |
| `p99cv%` | CV% of per-iteration p99 latencies - measures tail latency stability; high = occasional spikes invisible in the median |
| `wp99μs` | Median of per-iteration write-only p99 latencies in μs. Isolates write cost from the read-dominated p99μs; especially diagnostic for mutex-backed caches where writes hold an exclusive lock. Shows `--` if no write samples exist |
| `hit%` | Cache hit rate - fraction of reads that found the key; purely a function of eviction policy and access pattern |
| `±hitci` | 95% CI half-width for hit rate in absolute percentage points (e.g. ±0.020 means ±0.020 pp). Three decimal places avoids the misleading ±0.00 display when the CI is very tight but nonzero |
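As a sketch of how the `±ci95` half-width can be derived from per-iteration throughput samples, assuming a normal approximation with z ≈ 1.96 (the harness may use an exact t-quantile for small n):

```rust
// Sketch of the ±ci95 column: 95% CI half-width as a percentage of the
// mean throughput. Assumes n ≥ 2 samples and a normal approximation
// (z ≈ 1.96); illustrative only, not the harness's exact code.
fn ci95_half_width_pct(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let sem = (var / n).sqrt(); // standard error of the mean
    100.0 * 1.96 * sem / mean   // half-width as % of the mean
}
```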
Note on latency floor: each sampled latency is bracketed by two TSC reads (~10-20 ns each on Linux via the vDSO). The combined ~20-40 ns overhead inflates `p50μs`/`p99μs` for the fastest caches. Use latency numbers for relative comparison, not absolute wall-clock cost.
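A sketch of what the bracketing looks like, using `std::time::Instant` (which reaches `clock_gettime` through the vDSO on Linux) in place of raw TSC reads; names are illustrative:

```rust
use std::time::Instant;

// Each sampled op pays for two timer reads on top of the op itself.
fn timed_op<T>(op: impl FnOnce() -> T) -> (T, u64) {
    let start = Instant::now();                    // first timer read
    let result = op();
    let nanos = start.elapsed().as_nanos() as u64; // second timer read
    // For ops that finish in tens of nanoseconds, the two reads are a
    // visible fraction of the measurement: compare latencies relatively.
    (result, nanos)
}
```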
All parameters have sensible defaults and can be overridden via CLI flags without recompiling. Full reference: `cargo run --release -- --info`.
| Parameter | Flag | Default | Notes |
|---|---|---|---|
| `cache_size` | `--size` | 30,000 | Max entries per cache |
| `num_tasks` | `--tasks` | CPU count | Parallel workers (one per logical CPU) |
| `num_distinct_keys` | `--keys` | 480,000 | Zipf keyspace (16× cache_size) |
| `read_write_ratio` | `--ratio` | 0.80 | 80% reads, 20% writes |
| `value_size` | `--value-size` | 5,120 B | 5 KB cached values |
| `num_iterations` | `--iters` | 30 (max) | Hard cap; adaptive early stopping may converge sooner |
| `warmup_iterations` | `--warmup` | 1 | One ~3 s pass primes branch predictors and engages Turbo Boost |
| `zipf_exponent` | `--zipf` | 1.07 | ≈ 80/20 hotspot skew; sweep 0.8-1.5 |
| `access_pattern` | `--pattern` | zipfian | zipfian, uniform, or sequential |
| `cold_start` | `--cold-start` | false | Start with empty cache |
| `latency_sample_every` | `--sample-every` | 10 | Bernoulli sampling rate (~10% of ops) |
| `rng_seed` | `--seed` | 42 | Base seed for reproducible runs |
`num_keys_per_task` is calibrated automatically per cache (max 2 M) to target ~3 s/iter; it is not CLI-configurable.
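For illustration, a small sketch of how a Zipfian key stream with these defaults could be generated, assuming the `rand`/`rand_distr` crates (0.8/0.4 APIs); the harness's own generator may differ:

```rust
use rand::prelude::*;
use rand_distr::Zipf;

fn main() {
    let num_distinct_keys = 480_000u64; // --keys default
    let zipf_exponent = 1.07;           // --zipf default
    let mut rng = StdRng::seed_from_u64(42); // --seed default

    // Zipf::new(n, s): ranks 1..=n, P(rank k) proportional to 1/k^s.
    let dist = Zipf::new(num_distinct_keys, zipf_exponent).unwrap();
    for _ in 0..5 {
        let key = dist.sample(&mut rng) as u64 - 1; // 0-based key id
        println!("key {key}");
    }
}
```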
- Task-parallel, OS-thread model - each task runs in a `tokio::task::spawn_blocking` thread for true parallelism
- Barrier-synchronized start - all tasks start simultaneously; throughput = `total_ops / wall_time_of_slowest_task`
- Per-cache ops calibration - a 200 K-op/task calibration pass before warmup measures each cache's throughput and sets `num_keys_per_task = clamp(3 s × throughput / num_tasks, 500 K, 2 M)`, targeting ~3 s wall time per iteration; this prevents slow global-mutex caches from dominating total runtime (see the sketch after this list)
- Per-iteration randomised ordering - the remaining active caches are shuffled each iteration to eliminate thermal/position bias
- Adaptive early stopping - after each iteration, any cache with ≥ 15 samples and a 95% CI < 0.75% of the mean is retired; the `n` column shows the actual iterations run
- Median throughput - robust to single-iteration outliers from thermal throttling or OS scheduling
- 95% confidence intervals - reported on both throughput (`±ci95`) and hit rate (`±hitci`)
- Per-iteration latency percentiles - p50/p99 computed per iteration, then summarised with the median; avoids the pooling bias of aggregating all samples before computing percentiles
- Bernoulli latency sampling - each op is sampled with probability 1/N using the task's existing RNG; avoids phase-locking with periodic cache slow paths that deterministic modulo sampling can miss
- Noisy-run flagging - throughput CV% > 10 is marked with `!` in the output table; a footnote is printed below the table when triggered
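Putting those bullets together, a compressed sketch of one measurement iteration; it uses plain `std::thread` in place of `tokio::task::spawn_blocking` for brevity, and all names are illustrative rather than the harness's real API:

```rust
use std::sync::{Arc, Barrier};
use std::time::Instant;

use rand::prelude::*;

// Calibration: num_keys_per_task = clamp(3 s × throughput / num_tasks, 500 K, 2 M)
fn keys_per_task(calibrated_ops_per_sec: f64, num_tasks: usize) -> u64 {
    (3.0 * calibrated_ops_per_sec / num_tasks as f64).clamp(500_000.0, 2_000_000.0) as u64
}

fn run_iteration(num_tasks: usize, ops_per_task: u64, sample_every: u32) -> f64 {
    let barrier = Arc::new(Barrier::new(num_tasks));
    let handles: Vec<_> = (0..num_tasks)
        .map(|task_id| {
            let barrier = Arc::clone(&barrier);
            std::thread::spawn(move || {
                let mut rng = StdRng::seed_from_u64(42 + task_id as u64);
                barrier.wait(); // all tasks start simultaneously
                let start = Instant::now();
                for _ in 0..ops_per_task {
                    // ... perform one cache read or write here ...
                    if rng.gen_range(0..sample_every) == 0 {
                        // Bernoulli-sampled op (probability 1/sample_every):
                        // bracket just this op with two timer reads.
                    }
                }
                start.elapsed().as_secs_f64()
            })
        })
        .collect();

    // throughput = total_ops / wall_time_of_slowest_task
    let slowest = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .fold(0.0_f64, f64::max);
    (num_tasks as u64 * ops_per_task) as f64 / slowest
}
```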
Please do not open public issues for security reports. See SECURITY.md for private disclosure channels.
Licensed under the MIT License.