rust-cache-benchmarks

Concurrent in-memory cache benchmarks for Rust. Compares throughput, hit rate, and latency across 14 caching libraries under a realistic Zipfian workload.

Benchmark Output

*(Screenshot: example benchmark results table.)*

Benchmark machine specs:

| Component | Details |
| --- | --- |
| CPU | Intel Xeon Platinum 8581C @ 2.10GHz |
| Cores/Threads | 4 cores / 8 threads (1 socket, SMT enabled) |
| Architecture | x86_64 |
| Memory | 62 GB (61 GB available) |
| Swap | None |
| Disk | 9.7 GB (3.3 GB free, 65% used) |
| OS | Debian 12 (Bookworm) |
| Kernel | 6.1.0-43-cloud-amd64 |
| Machine Type | n4-highmem-8 |

Caches compared

The Concurrency column is a critical fairness disclosure: throughput numbers are dominated by lock topology as much as by eviction policy. Caches marked sharded serialise only within their shard; caches marked single Mutex are wrapped by this benchmark in a parking_lot::Mutex because the underlying crate is not internally synchronised and cannot be shared across concurrent writers on its own (a minimal sketch of this wrapping follows the table). Compare like-for-like before drawing conclusions.

| Crate | Strategy | Concurrency |
| --- | --- | --- |
| moka | TinyLFU / W-TinyLFU | sharded (built-in) |
| mini-moka | TinyLFU (lighter weight) | sharded (built-in) |
| quick_cache | LRU / CLOCK-Pro | sharded (built-in) |
| schnellru | LRU | single Mutex (this harness) |
| stretto | Ristretto (TinyLFU) | sharded (built-in); see footnote † |
| lru | LRU | single Mutex (this harness) |
| TinyUFO | TinyUFO | sharded (built-in) |
| foyer | S3-FIFO | sharded (built-in) |
| cached | LRU (proc-macro) | single Mutex (this harness) |
| clru | Count-Min LRU | single Mutex (this harness) |
| lru-mem | Memory-bounded LRU | single Mutex (this harness); see footnote ‡ |
| sieve-cache | SIEVE | sharded (ShardedSieveCache) |
| caches | Two-Queue | single Mutex (this harness); see footnote § |
| neocache | S3-FIFO + DashMap | sharded (built-in) |
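
For the single-Mutex rows the wrapping is intentionally naive. Below is a minimal sketch of the pattern using the real `lru` and `parking_lot` crates but a hypothetical adapter type (the harness keeps its own versions in `src/caches/`):

```rust
use std::num::NonZeroUsize;
use parking_lot::Mutex;

/// Hypothetical adapter for caches that are not internally concurrent:
/// every get and insert takes the same global lock, so all writer threads
/// (and, for LRU, even readers) serialise on it.
struct MutexLru {
    inner: Mutex<lru::LruCache<u64, Vec<u8>>>,
}

impl MutexLru {
    fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(lru::LruCache::new(NonZeroUsize::new(capacity).unwrap())),
        }
    }

    fn get(&self, key: &u64) -> Option<Vec<u8>> {
        // Even a read needs the exclusive lock: LruCache::get takes &mut self
        // because it updates recency metadata.
        self.inner.lock().get(key).cloned()
    }

    fn insert(&self, key: u64, value: Vec<u8>) {
        self.inner.lock().put(key, value);
    }
}
```

Because `LruCache::get` takes `&mut self`, even the read-dominated default workload funnels through one exclusive lock, which is why these rows trail the sharded caches at 8 threads.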

Per-cache configuration deviations

Every cache is constructed with default parameters unless listed here. Anything on this list is a deliberate choice we have made on behalf of the cache; the rationale is in the matching src/caches/<name>.rs file. If you believe one of these tunings biases the comparison, please file a methodology issue with a before/after diff.

  • † stretto - constructed with num_counters = 10× cache_size, set_ignore_internal_cost(true), and set_buffer_size(num_tasks × 4 KiB). Stretto additionally receives a unique 3-phase warmup (insert → read pass to populate TinyLFU frequency counts → re-insert with TinyLFU-informed admission) because its frequency sketch is updated only from reads, never from inserts; a single-phase warmup leaves admission decisions effectively random and collapses the hit rate to ~45%. A sketch of the three phases follows this list.
  • ‡ lru-mem - capacity is byte-budgeted (cache_size × value_size). The budget covers value bytes only; String headers and key-string storage are not accounted for, so the effective entry count slightly exceeds cache_size. Documented for transparency; correcting it would require bespoke HeapSize accounting that other caches do not model.
  • § caches (Two-Queue) - the caches crate has not seen a release since 2022 and pulls in rand 0.8.5. Included for historical comparison; a future PR may retire it once a maintained alternative exists.
  • mini-moka - unlike moka, mini-moka 0.10 does not expose a public flush analogous to run_pending_tasks() (the equivalent sync() is private), so warmup completion is not synchronous on this cache. The first measurement iteration absorbs the residual eviction work; later iterations measure steady state.
  • lrucache, schnellru, clru, cached, two-queue, lru-mem - wrapped in a single parking_lot::Mutex because none of them supports concurrent writers without external locking. This makes them strictly worse than sharded caches under the default 8-thread workload; the comparison is still meaningful as a deployment signal ("do not use these from multiple writer threads"), not as a pure algorithmic ranking.
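
To make the † warmup concrete, here is the shape of the three phases written against a hypothetical `CacheAdapter` trait. It is an illustration only; the real harness drives stretto's own API from its `src/caches/` module:

```rust
/// Hypothetical minimal adapter; the real harness has one module per crate.
trait CacheAdapter {
    fn insert(&self, key: u64, value: Vec<u8>);
    fn get(&self, key: &u64) -> Option<Vec<u8>>;
}

/// Three-phase warmup used only for stretto: its TinyLFU frequency sketch is
/// fed by reads, not inserts, so a read pass runs before the final inserts so
/// that admission decisions are frequency-informed rather than random.
fn stretto_warmup<C: CacheAdapter>(cache: &C, hot_keys: &[u64], value: &[u8]) {
    // Phase 1: initial inserts (admission is still blind at this point).
    for &k in hot_keys {
        cache.insert(k, value.to_vec());
    }
    // Phase 2: a read pass populates the TinyLFU frequency counters.
    for &k in hot_keys {
        let _ = cache.get(&k);
    }
    // Phase 3: re-insert; the admission policy now sees non-zero frequencies.
    for &k in hot_keys {
        cache.insert(k, value.to_vec());
    }
}
```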

Requirements

  • Rust toolchain (MSRV 1.94, pinned in rust-toolchain.toml, Cargo.toml's rust-version, and the CI workflow). Installed automatically via rustup.
  • For best results, run on a quiet machine (no other heavy processes)

Build and run

```bash
# Run all caches (release build is required for meaningful numbers)
cargo run --release

# Run a subset
cargo run --release -- --caches moka,quick_cache,neocache

# Override config parameters at runtime (no recompile needed)
cargo run --release -- --size 10000 --zipf 0.9
cargo run --release -- --pattern uniform --ratio 0.5
cargo run --release -- --size 50000 --keys 800000 --iters 20

# Print brief flag summary
cargo run --release -- --help

# Print full configuration reference, CLI flags, and benchmark methodology
cargo run --release -- --info
```

Note: .cargo/config.toml sets -C target-cpu=native so the binary is optimised for the machine it is built on. Do not compare binaries built on different machines.
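
For reference, the standard Cargo syntax for that setting is shown below; check the repository's .cargo/config.toml for the exact contents, which may carry additional flags:

```toml
[build]
rustflags = ["-C", "target-cpu=native"]
```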

Output column definitions

| Column | Meaning |
| --- | --- |
| ops/sec | Median throughput across all measurement iterations: total operations (reads + writes) per wall-clock second |
| ±ci95 | 95% CI half-width as a percentage of mean throughput (uses the actual n for this cache). Two caches whose CI ranges don't overlap are distinguishable at the 95% level |
| cv% | Coefficient of variation (stddev / mean × 100). < 3% = stable; > 10% = noisy, flagged with ! |
| n | Actual number of measurement iterations run. Adaptive early stopping may converge before the 30-iteration maximum |
| p50μs | Median of per-iteration p50 latencies in μs: the typical latency of a representative single run |
| p99μs | Median of per-iteration p99 latencies in μs; avoids the pooling bias of aggregating all iterations before computing percentiles |
| tailμs | Tail amplification = p99μs − p50μs (both medians) in μs. Lower = more consistent per-op cost |
| p99cv% | CV% of per-iteration p99 latencies; measures tail latency stability. High values indicate occasional spikes invisible in the median |
| wp99μs | Median of per-iteration write-only p99 latencies in μs. Isolates write cost from the read-dominated p99μs; especially diagnostic for mutex-backed caches where writes hold an exclusive lock. Shows -- if no write samples exist |
| hit% | Cache hit rate: fraction of reads that found the key; purely a function of eviction policy and access pattern |
| ±hitci | 95% CI half-width for hit rate in absolute percentage points (e.g. ±0.020 = ±0.020 pp). Three decimal places avoids the misleading ±0.00 display when CI is very tight but nonzero |
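
For concreteness, the throughput and latency summary columns reduce to a few lines of arithmetic over the per-iteration samples. The sketch below is illustrative, not the harness's exact code (for example, it uses a fixed 1.96 critical value, whereas the harness may adjust for small n):

```rust
/// Simplified per-cache summary statistics over per-iteration samples.
/// Returns (ops/sec, ±ci95 as % of mean, cv%, tailμs).
fn summarise(throughput: &[f64], p50_us: &[f64], p99_us: &[f64]) -> (f64, f64, f64, f64) {
    let median = |xs: &[f64]| {
        let mut v = xs.to_vec();
        v.sort_by(|a, b| a.partial_cmp(b).unwrap());
        v[v.len() / 2]
    };

    let n = throughput.len() as f64;
    let mean = throughput.iter().sum::<f64>() / n;
    let stddev =
        (throughput.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)).sqrt();

    let ops_sec = median(throughput);                       // ops/sec column
    let ci95_pct = 1.96 * stddev / n.sqrt() / mean * 100.0; // ±ci95 column
    let cv_pct = stddev / mean * 100.0;                     // cv% column
    let tail_us = median(p99_us) - median(p50_us);          // tailμs column

    (ops_sec, ci95_pct, cv_pct, tail_us)
}
```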

Note on latency floor: Each sampled latency is bracketed by two TSC reads (~10-20 ns each on Linux via vDSO). The combined ~20-40 ns overhead inflates p50μs/p99μs for the fastest caches. Use latency numbers for relative comparison, not absolute wall-clock cost.

Configuration

All parameters have sensible defaults and can be overridden via CLI flags without recompiling. Full reference: cargo run --release -- --info.

| Parameter | Flag | Default | Notes |
| --- | --- | --- | --- |
| cache_size | --size | 30,000 | Max entries per cache |
| num_tasks | --tasks | CPU count | Parallel workers (one per logical CPU) |
| num_distinct_keys | --keys | 480,000 | Zipf keyspace (16× cache_size) |
| read_write_ratio | --ratio | 0.80 | 80% reads, 20% writes |
| value_size | --value-size | 5,120 B | 5 KB cached values |
| num_iterations | --iters | 30 (max) | Hard cap; adaptive early stopping may converge sooner |
| warmup_iterations | --warmup | 1 | One ~3 s pass primes branch predictors and engages Turbo Boost |
| zipf_exponent | --zipf | 1.07 | ≈ 80/20 hotspot skew; sweep 0.8-1.5 |
| access_pattern | --pattern | zipfian | zipfian, uniform, or sequential |
| cold_start | --cold-start | false | Start with empty cache |
| latency_sample_every | --sample-every | 10 | Bernoulli sampling rate (~10% of ops) |
| rng_seed | --seed | 42 | Base seed for reproducible runs |

num_keys_per_task is calibrated automatically per cache (max 2M) to target ~3 s/iter; it is not CLI-configurable.
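
As an illustration of the default workload shape, the key stream can be approximated offline with the rand_distr crate. This sketch assumes rand_distr 0.4's `Zipf` signature and is not the harness's own generator:

```rust
use rand::prelude::*;
use rand_distr::Zipf;

/// Illustrative key generator matching the default config
/// (480,000 distinct keys, exponent 1.07, base seed 42).
fn hot_keys(count: usize) -> Vec<u64> {
    let zipf = Zipf::new(480_000, 1.07f64).expect("valid Zipf parameters");
    let mut rng = StdRng::seed_from_u64(42);
    (0..count)
        .map(|_| zipf.sample(&mut rng) as u64 - 1) // rank 1..=n mapped to key 0..n-1
        .collect()
}
```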

Benchmark methodology

  • Task-parallel, OS-thread model - each task runs in a tokio::task::spawn_blocking thread for true parallelism
  • Barrier-synchronized start - all tasks start simultaneously; throughput = total_ops / wall_time_of_slowest_task
  • Per-cache ops calibration - a 200 K-op/task calibration pass before warmup measures each cache's throughput and sets num_keys_per_task = clamp(3 s × throughput / num_tasks, 500 K, 2 M), targeting ~3 s wall time per iteration; prevents slow global-mutex caches from dominating total runtime
  • Per-iteration randomised ordering - the remaining active caches are shuffled each iteration to eliminate thermal/position bias
  • Adaptive early stopping - after each iteration, any cache with ≥ 15 samples and 95% CI < 0.75% of mean is retired; the n column shows actual iterations run
  • Median throughput - robust to single-iteration outliers from thermal throttling or OS scheduling
  • 95% confidence intervals - reported on both throughput (±ci95) and hit rate (±hitci)
  • Per-iteration latency percentiles - p50/p99 computed per iteration then summarised with the median; avoids the pooling bias of aggregating all samples before computing percentiles
  • Bernoulli latency sampling - each op is sampled with probability 1/N from the task's existing RNG; avoids phase-locking with periodic cache slow paths that deterministic modulo sampling can miss
  • Noisy-run flagging - throughput CV% > 10 is marked with ! in the output table; a footnote is printed below the table when triggered
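
The sketch below ties the barrier-synchronised start, Bernoulli latency sampling, and slowest-task throughput definition together. It is a simplified illustration (hypothetical function, cache operation elided), not the harness's actual worker loop:

```rust
use std::sync::{Arc, Barrier};
use std::time::Instant;

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Minimal sketch of one measurement iteration: barrier-synchronised start,
/// Bernoulli latency sampling, throughput over the slowest task's wall time.
async fn run_iteration(num_tasks: usize, ops_per_task: u64, sample_every: u64) -> f64 {
    let barrier = Arc::new(Barrier::new(num_tasks));
    let mut handles = Vec::with_capacity(num_tasks);

    for task_id in 0..num_tasks {
        let barrier = Arc::clone(&barrier);
        handles.push(tokio::task::spawn_blocking(move || {
            let mut rng = StdRng::seed_from_u64(42 + task_id as u64);
            let mut latencies_us = Vec::new();
            barrier.wait(); // all tasks start together
            let start = Instant::now();
            for _ in 0..ops_per_task {
                // Bernoulli sampling from the task's own RNG (~1/sample_every of
                // ops are timed), avoiding phase-locking with periodic slow paths.
                let t0 = (rng.gen_range(0..sample_every) == 0).then(Instant::now);
                // ... one cache get or insert would go here ...
                if let Some(t0) = t0 {
                    latencies_us.push(t0.elapsed().as_secs_f64() * 1e6);
                }
            }
            (start.elapsed().as_secs_f64(), latencies_us)
        }));
    }

    // Throughput = total ops / wall time of the slowest task.
    let mut slowest = f64::MIN_POSITIVE;
    for h in handles {
        let (wall_secs, _latencies) = h.await.expect("worker panicked");
        slowest = slowest.max(wall_secs);
    }
    (num_tasks as u64 * ops_per_task) as f64 / slowest
}
```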

Reporting security issues

Please do not open public issues for security reports. See SECURITY.md for private disclosure channels.

License

Licensed under the MIT License.