rust-cache-benchmarks

Concurrent in-memory cache benchmarks for Rust. Compares throughput, hit rate, and latency across 14 caching libraries under a realistic Zipfian workload.

Benchmark Output

*(Screenshot: example benchmark results table.)*

Benchmark machine specs:

| Component | Details |
| --- | --- |
| CPU | Intel Xeon Platinum 8581C @ 2.10GHz |
| Cores/Threads | 4 cores / 8 threads (1 socket, SMT enabled) |
| Architecture | x86_64 |
| Memory | 62 GB (61 GB available) |
| Swap | None |
| Disk | 9.7 GB (3.3 GB free, 65% used) |
| OS | Debian 12 (Bookworm) |
| Kernel | 6.1.0-43-cloud-amd64 |
| Machine Type | n4-highmem-8 |

Caches compared

The Concurrency column is a critical fairness disclosure: throughput numbers are dominated by lock topology as much as by eviction policy. Caches marked sharded serialise only within their shard; caches marked single Mutex are wrapped by this benchmark in a parking_lot::Mutex because the underlying crate is not internally synchronised and cannot be shared across concurrent writers on its own (a minimal sketch of this wrapping follows the table). Compare like-for-like before drawing conclusions.

| Crate | Strategy | Concurrency |
| --- | --- | --- |
| moka | TinyLFU / W-TinyLFU | sharded (built-in) |
| mini-moka | TinyLFU (lighter weight) | sharded (built-in) |
| quick_cache | LRU / CLOCK-Pro | sharded (built-in) |
| schnellru | LRU | single Mutex (this harness) |
| stretto | Ristretto (TinyLFU) | sharded (built-in); see footnote † |
| lru | LRU | single Mutex (this harness) |
| TinyUFO | TinyUFO | sharded (built-in) |
| foyer | S3-FIFO | sharded (built-in) |
| cached | LRU (proc-macro) | single Mutex (this harness) |
| clru | Count-Min LRU | single Mutex (this harness) |
| lru-mem | Memory-bounded LRU | single Mutex (this harness); see footnote ‡ |
| sieve-cache | SIEVE | sharded (ShardedSieveCache) |
| caches | Two-Queue | single Mutex (this harness); see footnote § |
| neocache | S3-FIFO + DashMap | sharded (built-in) |
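
For the single-Mutex rows the wrapping is intentionally naive. Below is a minimal sketch of the pattern using the real `lru` and `parking_lot` crates but a hypothetical adapter type (the harness keeps its own versions in `src/caches/`):

```rust
use std::num::NonZeroUsize;
use parking_lot::Mutex;

/// Hypothetical adapter for caches that are not internally concurrent:
/// every get and insert takes the same global lock, so all writer threads
/// (and, for LRU, even readers) serialise on it.
struct MutexLru {
    inner: Mutex<lru::LruCache<u64, Vec<u8>>>,
}

impl MutexLru {
    fn new(capacity: usize) -> Self {
        Self {
            inner: Mutex::new(lru::LruCache::new(NonZeroUsize::new(capacity).unwrap())),
        }
    }

    fn get(&self, key: &u64) -> Option<Vec<u8>> {
        // Even a read needs the exclusive lock: LruCache::get takes &mut self
        // because it updates recency metadata.
        self.inner.lock().get(key).cloned()
    }

    fn insert(&self, key: u64, value: Vec<u8>) {
        self.inner.lock().put(key, value);
    }
}
```

Because `LruCache::get` takes `&mut self`, even the read-dominated default workload funnels through one exclusive lock, which is why these rows trail the sharded caches at 8 threads.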

Per-cache configuration deviations

Every cache is constructed with default parameters unless listed here. Anything on this list is a deliberate choice we have made on behalf of the cache; the rationale is in the matching src/caches/<name>.rs file. If you believe one of these tunings biases the comparison, please file a methodology issue with a before/after diff.

  • † stretto - constructed with num_counters = 10× cache_size, set_ignore_internal_cost(true), and set_buffer_size(num_tasks × 4 KiB). Stretto additionally receives a unique 3-phase warmup (insert → read pass to populate TinyLFU frequency counts → re-insert with TinyLFU-informed admission) because its frequency sketch is updated only from reads, never from inserts; a single-phase warmup leaves admission decisions effectively random and collapses the hit rate to ~45%. A sketch of the three phases follows this list.
  • ‡ lru-mem - capacity is byte-budgeted (cache_size × value_size). The budget covers value bytes only; String headers and key-string storage are not accounted for, so the effective entry count slightly exceeds cache_size. Documented for transparency; correcting it would require bespoke HeapSize accounting that other caches do not model.
  • § caches (Two-Queue) - the caches crate has not seen a release since 2022 and pulls in rand 0.8.5. Included for historical comparison; a future PR may retire it once a maintained alternative exists.
  • mini-moka - unlike moka, mini-moka 0.10 does not expose a public flush analogous to run_pending_tasks() (the equivalent sync() is private), so warmup completion is not synchronous on this cache. The first measurement iteration absorbs the residual eviction work; later iterations measure steady state.
  • lrucache, schnellru, clru, cached, two-queue, lru-mem - wrapped in a single parking_lot::Mutex because none of them supports concurrent writers without external locking. This makes them strictly worse than sharded caches under the default 8-thread workload; the comparison is still meaningful as a deployment signal ("do not use these from multiple writer threads"), not as a pure algorithmic ranking.
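
To make the † warmup concrete, here is the shape of the three phases written against a hypothetical `CacheAdapter` trait. It is an illustration only; the real harness drives stretto's own API from its `src/caches/` module:

```rust
/// Hypothetical minimal adapter; the real harness has one module per crate.
trait CacheAdapter {
    fn insert(&self, key: u64, value: Vec<u8>);
    fn get(&self, key: &u64) -> Option<Vec<u8>>;
}

/// Three-phase warmup used only for stretto: its TinyLFU frequency sketch is
/// fed by reads, not inserts, so a read pass runs before the final inserts so
/// that admission decisions are frequency-informed rather than random.
fn stretto_warmup<C: CacheAdapter>(cache: &C, hot_keys: &[u64], value: &[u8]) {
    // Phase 1: initial inserts (admission is still blind at this point).
    for &k in hot_keys {
        cache.insert(k, value.to_vec());
    }
    // Phase 2: a read pass populates the TinyLFU frequency counters.
    for &k in hot_keys {
        let _ = cache.get(&k);
    }
    // Phase 3: re-insert; the admission policy now sees non-zero frequencies.
    for &k in hot_keys {
        cache.insert(k, value.to_vec());
    }
}
```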

Requirements

  • Rust toolchain (MSRV 1.94, pinned in rust-toolchain.toml, Cargo.toml's rust-version, and the CI workflow). Installed automatically via rustup.
  • For best results, run on a quiet machine (no other heavy processes)

Build and run

```bash
# Run all caches (release build is required for meaningful numbers)
cargo run --release

# Run a subset
cargo run --release -- --caches moka,quick_cache,neocache

# Override config parameters at runtime (no recompile needed)
cargo run --release -- --size 10000 --zipf 0.9
cargo run --release -- --pattern uniform --ratio 0.5
cargo run --release -- --size 50000 --keys 800000 --iters 20

# Print brief flag summary
cargo run --release -- --help

# Print full configuration reference, CLI flags, and benchmark methodology
cargo run --release -- --info
```

Note: .cargo/config.toml sets -C target-cpu=native so the binary is optimised for the machine it is built on. Do not compare binaries built on different machines.
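
For reference, the standard Cargo syntax for that setting is shown below; check the repository's .cargo/config.toml for the exact contents, which may carry additional flags:

```toml
[build]
rustflags = ["-C", "target-cpu=native"]
```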

Output column definitions

| Column | Meaning |
| --- | --- |
| ops/sec | Median throughput across all measurement iterations: total operations (reads + writes) per wall-clock second |
| ±ci95 | 95% CI half-width as a percentage of mean throughput (uses the actual n for this cache). Two caches whose CI ranges don't overlap are distinguishable at the 95% level |
| cv% | Coefficient of variation (stddev / mean × 100). < 3% = stable; > 10% = noisy, flagged with ! |
| n | Actual number of measurement iterations run. Adaptive early stopping may converge before the 30-iteration maximum |
| p50μs | Median of per-iteration p50 latencies in μs: the typical latency of a representative single run |
| p99μs | Median of per-iteration p99 latencies in μs; avoids the pooling bias of aggregating all iterations before computing percentiles |
| tailμs | Tail amplification = p99μs − p50μs (both medians) in μs. Lower = more consistent per-op cost |
| p99cv% | CV% of per-iteration p99 latencies; measures tail latency stability. High values indicate occasional spikes invisible in the median |
| wp99μs | Median of per-iteration write-only p99 latencies in μs. Isolates write cost from the read-dominated p99μs; especially diagnostic for mutex-backed caches where writes hold an exclusive lock. Shows -- if no write samples exist |
| hit% | Cache hit rate: fraction of reads that found the key; purely a function of eviction policy and access pattern |
| ±hitci | 95% CI half-width for hit rate in absolute percentage points (e.g. ±0.020 = ±0.020 pp). Three decimal places avoids the misleading ±0.00 display when CI is very tight but nonzero |
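
For concreteness, the throughput and latency summary columns reduce to a few lines of arithmetic over the per-iteration samples. The sketch below is illustrative, not the harness's exact code (for example, it uses a fixed 1.96 critical value, whereas the harness may adjust for small n):

```rust
/// Simplified per-cache summary statistics over per-iteration samples.
/// Returns (ops/sec, ±ci95 as % of mean, cv%, tailμs).
fn summarise(throughput: &[f64], p50_us: &[f64], p99_us: &[f64]) -> (f64, f64, f64, f64) {
    let median = |xs: &[f64]| {
        let mut v = xs.to_vec();
        v.sort_by(|a, b| a.partial_cmp(b).unwrap());
        v[v.len() / 2]
    };

    let n = throughput.len() as f64;
    let mean = throughput.iter().sum::<f64>() / n;
    let stddev =
        (throughput.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)).sqrt();

    let ops_sec = median(throughput);                       // ops/sec column
    let ci95_pct = 1.96 * stddev / n.sqrt() / mean * 100.0; // ±ci95 column
    let cv_pct = stddev / mean * 100.0;                     // cv% column
    let tail_us = median(p99_us) - median(p50_us);          // tailμs column

    (ops_sec, ci95_pct, cv_pct, tail_us)
}
```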

Note on latency floor: Each sampled latency is bracketed by two TSC reads (~10-20 ns each on Linux via vDSO). The combined ~20-40 ns overhead inflates p50μs/p99μs for the fastest caches. Use latency numbers for relative comparison, not absolute wall-clock cost.

Configuration

All parameters have sensible defaults and can be overridden via CLI flags without recompiling. Full reference: cargo run --release -- --info.

| Parameter | Flag | Default | Notes |
| --- | --- | --- | --- |
| cache_size | --size | 30,000 | Max entries per cache |
| num_tasks | --tasks | CPU count | Parallel workers (one per logical CPU) |
| num_distinct_keys | --keys | 480,000 | Zipf keyspace (16× cache_size) |
| read_write_ratio | --ratio | 0.80 | 80% reads, 20% writes |
| value_size | --value-size | 5,120 B | 5 KB cached values |
| num_iterations | --iters | 30 (max) | Hard cap; adaptive early stopping may converge sooner |
| warmup_iterations | --warmup | 1 | One ~3 s pass primes branch predictors and engages Turbo Boost |
| zipf_exponent | --zipf | 1.07 | ≈ 80/20 hotspot skew; sweep 0.8-1.5 |
| access_pattern | --pattern | zipfian | zipfian, uniform, or sequential |
| cold_start | --cold-start | false | Start with empty cache |
| latency_sample_every | --sample-every | 10 | Bernoulli sampling rate (~10% of ops) |
| rng_seed | --seed | 42 | Base seed for reproducible runs |

num_keys_per_task is calibrated automatically per cache (max 2M) to target ~3 s/iter; it is not CLI-configurable.
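
As an illustration of the default workload shape, the key stream can be approximated offline with the rand_distr crate. This sketch assumes rand_distr 0.4's `Zipf` signature and is not the harness's own generator:

```rust
use rand::prelude::*;
use rand_distr::Zipf;

/// Illustrative key generator matching the default config
/// (480,000 distinct keys, exponent 1.07, base seed 42).
fn hot_keys(count: usize) -> Vec<u64> {
    let zipf = Zipf::new(480_000, 1.07f64).expect("valid Zipf parameters");
    let mut rng = StdRng::seed_from_u64(42);
    (0..count)
        .map(|_| zipf.sample(&mut rng) as u64 - 1) // rank 1..=n mapped to key 0..n-1
        .collect()
}
```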

Benchmark methodology

  • Task-parallel, OS-thread model - each task runs in a tokio::task::spawn_blocking thread for true parallelism
  • Barrier-synchronized start - all tasks start simultaneously; throughput = total_ops / wall_time_of_slowest_task
  • Per-cache ops calibration - a 200 K-op/task calibration pass before warmup measures each cache's throughput and sets num_keys_per_task = clamp(3 s × throughput / num_tasks, 500 K, 2 M), targeting ~3 s wall time per iteration; prevents slow global-mutex caches from dominating total runtime
  • Per-iteration randomised ordering - the remaining active caches are shuffled each iteration to eliminate thermal/position bias
  • Adaptive early stopping - after each iteration, any cache with ≥ 15 samples and 95% CI < 0.75% of mean is retired; the n column shows actual iterations run
  • Median throughput - robust to single-iteration outliers from thermal throttling or OS scheduling
  • 95% confidence intervals - reported on both throughput (±ci95) and hit rate (±hitci)
  • Per-iteration latency percentiles - p50/p99 computed per iteration then summarised with the median; avoids the pooling bias of aggregating all samples before computing percentiles
  • Bernoulli latency sampling - each op is sampled with probability 1/N from the task's existing RNG; avoids phase-locking with periodic cache slow paths that deterministic modulo sampling can miss
  • Noisy-run flagging - throughput CV% > 10 is marked with ! in the output table; a footnote is printed below the table when triggered
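
The sketch below ties the barrier-synchronised start, Bernoulli latency sampling, and slowest-task throughput definition together. It is a simplified illustration (hypothetical function, cache operation elided), not the harness's actual worker loop:

```rust
use std::sync::{Arc, Barrier};
use std::time::Instant;

use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

/// Minimal sketch of one measurement iteration: barrier-synchronised start,
/// Bernoulli latency sampling, throughput over the slowest task's wall time.
async fn run_iteration(num_tasks: usize, ops_per_task: u64, sample_every: u64) -> f64 {
    let barrier = Arc::new(Barrier::new(num_tasks));
    let mut handles = Vec::with_capacity(num_tasks);

    for task_id in 0..num_tasks {
        let barrier = Arc::clone(&barrier);
        handles.push(tokio::task::spawn_blocking(move || {
            let mut rng = StdRng::seed_from_u64(42 + task_id as u64);
            let mut latencies_us = Vec::new();
            barrier.wait(); // all tasks start together
            let start = Instant::now();
            for _ in 0..ops_per_task {
                // Bernoulli sampling from the task's own RNG (~1/sample_every of
                // ops are timed), avoiding phase-locking with periodic slow paths.
                let t0 = (rng.gen_range(0..sample_every) == 0).then(Instant::now);
                // ... one cache get or insert would go here ...
                if let Some(t0) = t0 {
                    latencies_us.push(t0.elapsed().as_secs_f64() * 1e6);
                }
            }
            (start.elapsed().as_secs_f64(), latencies_us)
        }));
    }

    // Throughput = total ops / wall time of the slowest task.
    let mut slowest = f64::MIN_POSITIVE;
    for h in handles {
        let (wall_secs, _latencies) = h.await.expect("worker panicked");
        slowest = slowest.max(wall_secs);
    }
    (num_tasks as u64 * ops_per_task) as f64 / slowest
}
```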

Reporting security issues

Please do not open public issues for security reports. See SECURITY.md for private disclosure channels.

License

Licensed under the MIT License.