Analyze LLM request logs to find KV cache optimization opportunities.
Tells you which prefixes to cache, how much VRAM it costs, what hit rate to expect, and whether your multi-tenant sessions benefit from stateful KV persistence — before you touch a single config file.
┌─────────────────────────────────────────────────────────┐
│ Your request logs → kv-cache-analyzer → Report │
│ │
│ "Your system prompt (113 tokens) appears in 94% of │
│ requests. Enabling prefix caching would skip 87.6% │
│ of prefill tokens. Cost: 14 MB VRAM on Llama 3 8B." │
│ │
│ "8 sessions detected. Within-session hit rate: 67%. │
│ Stateful KV persistence would save 5,100 additional │
│ tokens beyond prefix caching alone." │
└─────────────────────────────────────────────────────────┘
KV cache prefix caching (vLLM APC, SGLang RadixAttention) can eliminate 50–90% of prefill compute — but only if your prompts are structured to take advantage of it. Most teams enable it and hope for the best. This tool tells you exactly what to expect, where the gains are, and how to restructure prompts to maximize hit rate.
It also tells you something prefix caching alone cannot: whether your multi-turn user sessions would benefit from stateful KV persistence — the pattern where the server keeps each user's KV state alive between turns rather than recomputing the full history from scratch on every request.
What it analyzes:
- Which prefixes are shared across requests (and how frequently)
- Simulated cache hit rate under LRU, oracle, and no-cache policies
- Cache size sensitivity — the elbow where more VRAM stops helping
- VRAM cost to cache each hot prefix, per model architecture
- Session-aware analysis: cross-session vs within-session prefix reuse
- Per-session breakdown sorted by stateful-serving benefit
- Actionable prompt structure recommendations
Requires Python 3.11+.
pip install kv-cache-analyzer

Optional extras:
pip install "kv-cache-analyzer[tiktoken]" # precise BPE token counts (OpenAI vocab)
pip install "kv-cache-analyzer[hf]" # Hugging Face tokenizers for exact model vocab

# Realistic multi-surface prod simulation
kv-cache-analyzer generate-logs sample.jsonl --count 500 --scenario prod
# Multi-turn sessions with session_id (picks up within-session analysis automatically)
kv-cache-analyzer generate-logs sample.jsonl --count 100 --scenario customer-support
kv-cache-analyzer generate-logs sample.jsonl --count 100 --scenario code-assistant
# Full scenario list
kv-cache-analyzer generate-logs --help

Always scrub production logs before analyzing them:
kv-cache-analyzer scrub prod_logs.jsonl scrubbed.jsonl --dry-run # preview first
kv-cache-analyzer scrub prod_logs.jsonl scrubbed.jsonl # write file

kv-cache-analyzer analyze scrubbed.jsonl \
--model llama-3-8b \
--cache-size 100000

kv-cache-analyzer analyze logs.jsonl \
--format openai \ # openai | vllm | jsonl | plain
--model llama-3-70b \ # model for VRAM estimates (see list-models)
--cache-size 200000 \ # KV cache capacity in tokens
--block-size 16 \ # cache block size (match your vLLM config)
--min-freq 0.03 \ # minimum prefix frequency to report
--min-length 5 \ # minimum prefix length (tokens) to report
--tokenizer word \ # word | char | tiktoken | hf
--limit 50000 \ # analyze only first N requests
--session-field session_id \ # JSON key for session ID (auto-detected for openai/vllm)
--output-json report.json # also write full report as JSON

Supported log formats:
| Format | Description | Example |
|---|---|---|
| openai (default) | OpenAI chat API | {"messages": [{"role": "system", "content": "..."}]} |
| vllm | vLLM request logs | {"prompt": "...", "request_id": "..."} |
| jsonl | Generic JSONL | {"prompt": "..."} (use --prompt-field to change key) |
| plain | One prompt per line | plaintext |
Session ID is auto-detected from session_id, conversation_id, or thread_id fields in openai/vllm formats.
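As a rough illustration of the openai format with a session field, the sketch below writes a two-turn conversation to a JSONL file. Only the keys `messages` and `session_id` come from the text above; the helper name `write_sample_log`, the session value, and the message contents are made up for the example.

```python
import json

# Hypothetical two-turn conversation. Each log entry carries the full message
# history for that turn, so turn 2 starts with everything from turn 1.
SYSTEM = {"role": "system", "content": "You are a support assistant."}
TURNS = [
    [SYSTEM, {"role": "user", "content": "My order is late."}],
    [
        SYSTEM,
        {"role": "user", "content": "My order is late."},
        {"role": "assistant", "content": "Sorry to hear that. What is the order number?"},
        {"role": "user", "content": "It is 1042."},
    ],
]


def write_sample_log(path: str) -> str:
    """Write one JSON object per line, one line per turn."""
    with open(path, "w", encoding="utf-8") as f:
        for messages in TURNS:
            entry = {"session_id": "sess-001", "messages": messages}
            f.write(json.dumps(entry) + "\n")
    return path
```

A file written this way should be usable with `--format openai`, with the session picked up through the `session_id` key.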
Generate realistic synthetic logs for testing across different production scenarios.
kv-cache-analyzer generate-logs output.jsonl \
--scenario <name> \ # see table below
--count 500 \ # number of log entries (turns, not sessions)
--seed 42 # reproducibility

Scenarios and expected KV cache profiles:
| Scenario | Type | LRU Hit Rate | Within-Session | Best for testing |
|---|---|---|---|---|
| prod | single-turn | ~75–85% | n/a | General mixed workload |
| chatbot | single-turn | ~85–90% | n/a | High-cacheability baseline |
| diverse | single-turn | ~5–10% | n/a | Low-cacheability baseline |
| no-system | single-turn | ~5% | n/a | No system prompt workload |
| customer-support | multi-turn | ~92% | ~67% | Support ticket sessions |
| ecommerce | multi-turn | ~94% | ~69% | Shopping journey sessions |
| code-assistant | multi-turn | ~92% | ~60% | IDE debugging/refactor sessions |
| rag-qa | multi-turn | ~93%* | ~30% | RAG pipelines (shows cache limits) |
| legal-review | multi-turn | ~96% | ~64% | Very long system prompts |
| healthcare | multi-turn | ~95% | ~45% | Mixed single/multi-turn |
| content-moderation | single-turn | ~96% | ~0% | Pure prefix caching pattern |
| fintech | multi-turn | ~96% | ~56% | Portfolio analysis + quick queries |
Note on rag-qa: The high cross-session LRU rate is an artifact of the word tokenizer seeing the static system header as a long shared prefix. With --tokenizer tiktoken, the rate drops significantly because retrieved context chunks vary per request. The lower rate is the one that reflects real RAG deployments, so use this scenario to demonstrate cache limits before assuming RAG will benefit from prefix caching.
kv-cache-analyzer scrub input.jsonl output.jsonl \
--format openai \
--dry-run # preview without writing

Replaces PII automatically:
| Pattern | Placeholder |
|---|---|
| Email addresses | [SCRUBBED_EMAIL] |
| Phone numbers | [SCRUBBED_PHONE] |
| IPv4 / IPv6 | [SCRUBBED_IP] |
| Credit card numbers | [SCRUBBED_CARD] |
| SSNs | [SCRUBBED_SSN] |
| UUIDs | [SCRUBBED_ID] |
| Long hex/base64 tokens | [SCRUBBED_TOKEN] |
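To make the placeholder mapping concrete, here is a minimal regex-based sketch of this kind of scrubbing. The patterns are illustrative assumptions, not the tool's actual rules, and they cover only four of the categories in the table; the placeholder strings are the ones listed above.

```python
import re

# Illustrative patterns only; the real scrubber may be stricter or broader.
# Order matters: UUIDs are replaced before the shorter digit-based patterns.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[SCRUBBED_EMAIL]"),
    (
        re.compile(
            r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
            r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
        ),
        "[SCRUBBED_ID]",
    ),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SCRUBBED_SSN]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[SCRUBBED_IP]"),  # IPv4 only
]


def scrub_text(text: str) -> str:
    """Apply each pattern in turn and return the scrubbed string."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(scrub_text("Mail bob@example.com from 10.0.0.1"))
# Mail [SCRUBBED_EMAIL] from [SCRUBBED_IP]
```

Because every match of a category collapses to the same placeholder, two prompts that differ only in scrubbed values become identical after scrubbing, which can raise measured prefix sharing slightly compared with the raw logs.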
kv-cache-analyzer list-models

Model          Layers   KV Heads   Head Dim   KB/token (BF16)
llama-3-8b     32       8          128        128
llama-3-70b    80       8          128        320
llama-3-405b   126      8          128        512
mistral-7b     32       8          128        128
qwen2-7b       28       4          128        56
gemma-2-9b     42       8          256        344
...
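The KB/token column can be related to the usual KV cache sizing rule: one key vector and one value vector per layer and per KV head, at two bytes per element for BF16. The sketch below applies that rule. That the tool computes its estimates exactly this way is an assumption, although the formula does reproduce the llama-3-8b, llama-3-70b, and qwen2-7b rows. The function name is hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes needed to hold one token (dtype_bytes=2 for BF16)."""
    # Factor 2: each layer stores both a key and a value per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


llama_3_8b = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(llama_3_8b // 1024)  # 128 (KiB per token)

# Cost of pinning the 113-token system prompt from the example report
prefix_tokens = 113
print(f"{prefix_tokens * llama_3_8b / 2**20:.1f} MiB")  # 14.1 MiB
```

The second figure lines up with the "14 MB VRAM on Llama 3 8B" quoted in the sample report at the top of this README.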
from kv_cache_analyzer import analyze

report = analyze(
    log_path="scrubbed_requests.jsonl",
    model_name="llama-3-8b",
    cache_size_tokens=100_000,
)
print(f"Requests: {report.total_requests:,}")
print(f"LRU rate: {next(r for r in report.simulation_results if r.policy == 'lru_prefix').cache_hit_rate:.1%}")
print(f"Hot prefixes: {len(report.hot_prefixes)}")
for rec in report.recommendations:
    print(f"\n• {rec}")

When logs include session_id fields, the report automatically includes a per-session breakdown:
from kv_cache_analyzer import analyze
report = analyze("session_logs.jsonl")
# Cross-session (prefix caching) vs within-session (stateful KV)
print(f"Sessions detected: {report.sessions_detected}")
print(f"Cross-session hit rate: {next(r for r in report.simulation_results if r.policy == 'lru_prefix').cache_hit_rate:.1%}")
print(f"Within-session hit rate: {report.within_session_hit_rate:.1%}")
# Per-session breakdown (sorted by within-session hit rate descending)
for session in report.session_breakdown[:5]:
    print(f"\n Session {session.session_id}")
    print(f" Turns: {session.request_count}")
    print(f" Avg turn length: {session.avg_turn_length:.0f} tokens")
    print(f" Within-sess rate: {session.within_session_hit_rate:.1%}")
    print(f" Reusable tokens: {session.within_session_cached_tokens:,}")

Interpreting the two hit rates:
| Metric | What it measures | Served by |
|---|---|---|
| Cross-session (LRU) | System prompt + few-shot sharing across all users | vLLM APC, SGLang RadixAttention |
| Within-session | Turn-level prefix reuse within one conversation | Stateful KV persistence (e.g. LayerScale) |
High within-session rate → your longest sessions would benefit from a server that keeps each user's KV state alive between turns, not just caches the common system prefix.
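The within-session number can be approximated by hand: for each pair of consecutive turns in a session, count how many leading tokens they share. The sketch below does that on plain token lists. Dividing reused tokens by all tokens in the session is an assumption about how the rate is normalized, and both function names are hypothetical.

```python
def common_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def within_session_rate(turns: list) -> float:
    """turns: token lists for one session, in request order."""
    reused = sum(common_prefix_len(prev, cur) for prev, cur in zip(turns, turns[1:]))
    total = sum(len(t) for t in turns)
    return reused / total if total else 0.0


session = [
    "sys hi".split(),
    "sys hi ok next".split(),
    "sys hi ok next fine more".split(),
]
print(within_session_rate(session))  # 0.5 (6 reused tokens out of 12)
```

In a chat-style log where every turn resends the full history, each turn is a prefix of the next, so the rate is high for long conversations and zero for single-turn sessions.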
report = analyze("requests.jsonl", min_frequency=0.05, min_prefix_length=10)
for prefix in report.hot_prefixes:
    print(f"Frequency : {prefix.frequency:.1%}")
    print(f"Length : {prefix.length} tokens")
    print(f"Stable : {prefix.is_stable}")  # True = good pinning candidate
    print(f"Score : {prefix.savings_potential:.1f}")  # frequency × length
    print(f"Preview : {prefix.text_preview}")

report = analyze("requests.jsonl")
for r in report.sweep_results:
    bar = "█" * int(r.cache_hit_rate * 30)
    print(f"{r.cache_size_tokens:>12,} {bar} {r.cache_hit_rate:.1%}")

# Find where adding more cache stops helping: the first cache size whose next
# step up gains less than 1 percentage point (the 0.01 threshold is arbitrary)
deltas = [
    report.sweep_results[i + 1].cache_hit_rate - report.sweep_results[i].cache_hit_rate
    for i in range(len(report.sweep_results) - 1)
]
elbow = next(
    (report.sweep_results[i] for i, d in enumerate(deltas) if d < 0.01),
    report.sweep_results[-1],  # no plateau found: fall back to the largest size
)
print(f"Elbow: {elbow.cache_size_tokens:,} tokens → {elbow.cache_hit_rate:.1%}")

pip install "kv-cache-analyzer[tiktoken]"

# OpenAI BPE (GPT-4 / text-embedding vocab)
report = analyze("requests.jsonl", tokenizer_name="tiktoken")
# Exact model vocab (e.g. Llama 3)
report = analyze(
    "requests.jsonl",
    tokenizer_name="hf",
    hf_model="meta-llama/Meta-Llama-3-8B-Instruct",
)

from kv_cache_analyzer import analyze

def check_cache_regression(log_path: str, min_hit_rate: float = 0.60) -> None:
    """Fail CI if a prompt change would drop the predicted cache hit rate."""
    report = analyze(log_path, cache_size_tokens=100_000)
    lru = next(r for r in report.simulation_results if r.policy == "lru_prefix")
    if lru.cache_hit_rate < min_hit_rate:
        raise SystemExit(
            f"Cache regression: {lru.cache_hit_rate:.1%} < threshold {min_hit_rate:.1%}\n"
            f"Likely cause: dynamic content in system prompt.\n"
            f"Fix: {report.recommendations[0]}"
        )
    print(f"✓ Cache hit rate {lru.cache_hit_rate:.1%} — OK")

check_cache_regression("staging_requests.jsonl", min_hit_rate=0.70)

from kv_cache_analyzer import PrefixTrie, CacheSimulator
from kv_cache_analyzer.tokenizer import get_tokenizer
tokenizer = get_tokenizer("word")
trie = PrefixTrie()
prompts = ["You are helpful. User: Hello", "You are helpful. User: Goodbye"]
for prompt in prompts:
    trie.insert(tokenizer.encode(prompt))
hot = trie.find_hot_prefixes(min_frequency=0.5, min_length=3)
for p in hot:
    print(p.text_preview, "→", p.frequency)
sim = CacheSimulator()
sequences = [tokenizer.encode(p) for p in prompts]
result = sim.simulate(sequences, cache_size_tokens=10_000, policy="lru_prefix")
print(f"Hit rate: {result.cache_hit_rate:.1%}")

import json
from kv_cache_analyzer import analyze
report = analyze("requests.jsonl", model_name="llama-3-8b")
output = {
    "summary": {
        "total_requests": report.total_requests,
        "total_tokens": report.total_tokens,
        "avg_request_length": report.avg_request_length,
    },
    "simulations": [
        {"policy": r.policy, "hit_rate": r.cache_hit_rate, "tokens_saved": r.cached_tokens}
        for r in report.simulation_results
    ],
    "hot_prefixes": [
        {"frequency": p.frequency, "length": p.length, "preview": p.text_preview}
        for p in report.hot_prefixes
    ],
    "session_analysis": {
        "sessions_detected": report.sessions_detected,
        "within_session_hit_rate": report.within_session_hit_rate,
        "breakdown": [
            {"session_id": s.session_id, "turns": s.request_count,
             "within_session_hit_rate": s.within_session_hit_rate}
            for s in report.session_breakdown
        ],
    },
    "recommendations": report.recommendations,
}
with open("report.json", "w") as f:
    json.dump(output, f, indent=2)

Or use the CLI flag directly:
kv-cache-analyzer analyze logs.jsonl --output-json report.json

Request logs
       │
       ▼
┌─────────────┐   validate+scrub   ┌──────────────┐
│ Parser      │ ─────────────────▶ │ Prefix Trie  │
│ (JSONL /    │                    │              │
│ OpenAI /    │                    │ Counts how   │
│ vLLM /      │                    │ many requests│
│ plain)      │                    │ share each   │
│             │                    │ prefix path  │
└─────────────┘                    └──────┬───────┘
       │                                  │ hot prefixes
       │ session_id                       ▼
       ▼                           ┌──────────────┐
┌─────────────┐                    │ Simulator    │
│ Session     │                    │              │
│ Analysis    │                    │ no_cache     │
│             │                    │ lru_prefix   │ ──▶ SimulationResult
│ within-     │                    │ oracle       │
│ session     │                    │ sweep        │
│ hit rate    │                    └──────┬───────┘
└─────────────┘                           │
                                          ▼
                                   ┌──────────────┐
                                   │ Recommender  │ ──▶ Actionable text
                                   └──────────────┘
Key concepts:
- Prefix trie: every tokenized request is inserted into a trie. Each node tracks how many requests pass through it. High-count nodes at sufficient depth are "hot prefixes."
- LRU simulation: replays the request trace through an LRU block cache (default 16 tokens/block, matching vLLM). Reports a realistic hit rate that accounts for cold starts and evictions.
- Oracle simulation: an infinite-cache upper bound. Shows the maximum achievable hit rate if memory were free.
- Within-session analysis: for each session with multiple turns, measures the length of the longest common prefix between consecutive turns. This shows how much additional caching is possible if the server maintains per-user KV state (stateful serving).
- Savings potential: frequency × length. Longer prefixes shared by more requests have the highest cache ROI.
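The LRU and oracle concepts above can be sketched in a few lines. This is a simplified stand-in for illustration, not the tool's `CacheSimulator`: the function name is hypothetical, only full blocks are cacheable, and each block is keyed by the entire token prefix that ends at that block.

```python
from collections import OrderedDict


def simulate_lru_blocks(sequences, cache_size_tokens, block_size=16):
    """Fraction of tokens served from a block-level LRU prefix cache."""
    capacity = cache_size_tokens // block_size  # number of blocks that fit
    cache = OrderedDict()  # key: token prefix ending at a block boundary
    hit_tokens = total_tokens = 0
    for tokens in sequences:
        total_tokens += len(tokens)
        matching = True  # a block can only hit if every earlier block hit
        for end in range(block_size, len(tokens) + 1, block_size):
            key = tuple(tokens[:end])
            if matching and key in cache:
                hit_tokens += block_size
            else:
                matching = False
                cache[key] = True
            cache.move_to_end(key)  # mark as most recently used
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hit_tokens / total_tokens if total_tokens else 0.0


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, 2, 3, 4, 9, 9, 9, 9]
print(simulate_lru_blocks([a, b], cache_size_tokens=100, block_size=4))  # 0.25
print(simulate_lru_blocks([a, b], cache_size_tokens=0, block_size=4))    # 0.0
```

Passing a capacity of zero reproduces the no-cache baseline, and a capacity larger than the whole trace behaves like the oracle bound. The first request always misses, which is the cold-start effect mentioned above.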
The tool is designed to be safe to run against production logs:
Input hardening:
- File size limit: Rejects files >512 MB by default (override: KV_CACHE_MAX_FILE_MB)
- Per-line size limit: Rejects individual lines >1 MB to prevent memory bombs (override: KV_CACHE_MAX_LINE_BYTES)
- JSON depth limit: Rejects deeply nested JSON objects to prevent stack overflow attacks (override: KV_CACHE_MAX_JSON_DEPTH)
- Prompt length limit: Caps prompts at ~32K tokens to prevent quadratic trie memory usage (override: KV_CACHE_MAX_PROMPT_TOKENS)
- Symlink protection: Refuses to open symlinks pointing to /dev/, /proc/, /sys/
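For readers who want the same guards in their own pre-processing scripts, the first two limits could be approximated as follows. The environment variable names and default sizes are taken from the list above; how the tool parses them (integer values, 1 MB read as 1,048,576 bytes) is an assumption, and `check_input_limits` is a hypothetical helper rather than part of the package.

```python
import os


def check_input_limits(path: str) -> None:
    """Raise ValueError if the file or any single line exceeds the limits."""
    max_file_mb = int(os.environ.get("KV_CACHE_MAX_FILE_MB", "512"))
    max_line_bytes = int(os.environ.get("KV_CACHE_MAX_LINE_BYTES", str(1024 * 1024)))

    if os.path.getsize(path) > max_file_mb * 1024 * 1024:
        raise ValueError(f"{path}: file larger than {max_file_mb} MB")

    with open(path, "rb") as f:
        lineno = 0
        while True:
            # Read at most one byte past the limit so an oversized line is
            # detected without loading all of it into memory.
            chunk = f.readline(max_line_bytes + 1)
            if not chunk:
                break
            lineno += 1
            if len(chunk) > max_line_bytes:
                raise ValueError(f"{path}: line {lineno} exceeds {max_line_bytes} bytes")
```

Checking the file size before opening the file keeps the cheap test first; the bounded `readline` call is what stops a single enormous line from exhausting memory.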
Output hardening:
- Path traversal prevention: Rejects output paths resolving to /etc/, /sys/, /proc/, and macOS system directories
Supply chain:
- Weekly pip-audit scan for known CVEs in all dependencies
- Bandit static analysis on every push
- Trivy filesystem scan pinned to a commit SHA (not a mutable tag — addresses the Trivy supply-chain attack vector)
- CycloneDX SBOM generated on every build
All security limits are overridable via environment variables for power users. See src/kv_cache_analyzer/security.py for the full list.
| Framework | Prefix Caching | How to enable |
|---|---|---|
| vLLM | ✅ APC (block-level LRU) | --enable-prefix-caching |
| SGLang | ✅ RadixAttention (default ON) | Nothing — already on |
| TensorRT-LLM | ✅ Block reuse | KVCacheConfig(enable_block_reuse=True) |
| TGI | ✅ Prefix caching (v3 and later) | Nothing, on by default in v3 |
| Triton | ✅ Via TensorRT-LLM backend | See TRT-LLM docs |
git clone https://github.com/bs258q/kv-cache-analyzer
cd kv-cache-analyzer
pip install -e ".[dev]"
pytest

Good first issues:
- Add support for new log formats (LiteLLM, Triton, OpenLLM)
- Build a Grafana dashboard JSON template for the JSON report output
- Add per-surface breakdown when logs contain a surface field
- Benchmark simulated hit rates against real vLLM gpu_prefix_cache_hit_rate metrics
- Add support for multi-modal content (image token counting)
To add a new log format, implement a parser in src/kv_cache_analyzer/parser.py and add it to the LogFormat enum in src/kv_cache_analyzer/models.py.
To add a new scenario generator, add conversation templates and a gen_* function to src/kv_cache_analyzer/generators.py and register it in SESSION_GENERATORS.
Apache 2.0