kv-cache-analyzer

Analyze LLM request logs to find KV cache optimization opportunities.

Tells you which prefixes to cache, how much VRAM it costs, what hit rate to expect, and whether your multi-tenant sessions benefit from stateful KV persistence — before you touch a single config file.

┌─────────────────────────────────────────────────────────┐
│  Your request logs  →  kv-cache-analyzer  →  Report     │
│                                                         │
│  "Your system prompt (113 tokens) appears in 94% of     │
│   requests. Enabling prefix caching would skip 87.6%    │
│   of prefill tokens. Cost: 14 MB VRAM on Llama 3 8B."   │
│                                                         │
│  "8 sessions detected. Within-session hit rate: 67%.    │
│   Stateful KV persistence would save 5,100 additional   │
│   tokens beyond prefix caching alone."                  │
└─────────────────────────────────────────────────────────┘

Why this exists

Prefix caching of the KV cache (vLLM APC, SGLang RadixAttention) can eliminate 50–90% of prefill compute, but only if your prompts are structured to take advantage of it. Most teams enable it and hope for the best. This tool tells you what hit rate to expect, where the gains are, and how to restructure prompts to maximize it.

It also tells you something prefix caching alone cannot: whether your multi-turn user sessions would benefit from stateful KV persistence — the pattern where the server keeps each user's KV state alive between turns rather than recomputing the full history from scratch on every request.

What it analyzes:

Which prefixes are shared across requests (and how frequently)
Simulated cache hit rate under LRU, oracle, and no-cache policies
Cache size sensitivity — the elbow where more VRAM stops helping
VRAM cost to cache each hot prefix, per model architecture
Session-aware analysis: cross-session vs within-session prefix reuse
Per-session breakdown sorted by stateful-serving benefit
Actionable prompt structure recommendations

Installation

Requires Python 3.11+.

pip install kv-cache-analyzer

Optional extras:

pip install "kv-cache-analyzer[tiktoken]"    # precise BPE token counts (OpenAI vocab)
pip install "kv-cache-analyzer[hf]"          # HuggingFace tokenizers for exact model vocab

Quick Start

1. Generate sample logs (no real logs needed)

# Realistic multi-surface prod simulation
kv-cache-analyzer generate-logs sample.jsonl --count 500 --scenario prod

# Multi-turn sessions with session_id (picks up within-session analysis automatically)
kv-cache-analyzer generate-logs sample.jsonl --count 100 --scenario customer-support
kv-cache-analyzer generate-logs sample.jsonl --count 100 --scenario code-assistant

# Full scenario list
kv-cache-analyzer generate-logs --help

2. Scrub PII from real logs

Always run this before analyzing production logs.

kv-cache-analyzer scrub prod_logs.jsonl scrubbed.jsonl --dry-run   # preview first
kv-cache-analyzer scrub prod_logs.jsonl scrubbed.jsonl             # write file

3. Analyze

kv-cache-analyzer analyze scrubbed.jsonl \
  --model llama-3-8b \
  --cache-size 100000

CLI Reference

`analyze`

kv-cache-analyzer analyze logs.jsonl \
  --format openai \
  --model llama-3-70b \
  --cache-size 200000 \
  --block-size 16 \
  --min-freq 0.03 \
  --min-length 5 \
  --tokenizer word \
  --limit 50000 \
  --session-field session_id \
  --output-json report.json

Option	Description
`--format`	Log format: openai | vllm | jsonl | plain
`--model`	Model used for VRAM estimates (see list-models)
`--cache-size`	KV cache capacity in tokens
`--block-size`	Cache block size in tokens (match your vLLM config)
`--min-freq`	Minimum prefix frequency to report
`--min-length`	Minimum prefix length (tokens) to report
`--tokenizer`	Tokenizer: word | char | tiktoken | hf
`--limit`	Analyze only the first N requests
`--session-field`	JSON key for the session ID (auto-detected for openai/vllm)
`--output-json`	Also write the full report as JSON

Supported log formats:

Format	Description	Example
`openai` (default)	OpenAI chat API	`{"messages": [{"role": "system", "content": "..."}]}`
`vllm`	vLLM request logs	`{"prompt": "...", "request_id": "..."}`
`jsonl`	Generic JSONL	`{"prompt": "..."}` (use `--prompt-field` to change key)
`plain`	One prompt per line	plaintext

Session ID is auto-detected from session_id, conversation_id, or thread_id fields in openai/vllm formats.
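
As a rough illustration of the format handling described above, the sketch below pulls a prompt string and a session ID out of a single log record. The function name `extract_prompt_and_session` and the rule for joining chat messages into one string are assumptions made for this example; they are not taken from the tool's parser.

```python
import json

# Field names checked for a session ID, in the order listed above.
SESSION_KEYS = ("session_id", "conversation_id", "thread_id")

def extract_prompt_and_session(line: str, prompt_field: str = "prompt"):
    """Return (prompt_text, session_id_or_None) for one JSONL record."""
    record = json.loads(line)
    if "messages" in record:
        # OpenAI chat shape: join role-tagged messages in order (assumed rule).
        prompt = "\n".join(f"{m['role']}: {m['content']}" for m in record["messages"])
    else:
        # vLLM / generic JSONL shape: the prompt sits under a single key.
        prompt = record[prompt_field]
    session = next((record[k] for k in SESSION_KEYS if k in record), None)
    return prompt, session

chat = '{"messages": [{"role": "system", "content": "Be brief."}], "thread_id": "t-1"}'
flat = '{"prompt": "Hello", "request_id": "r-9"}'
print(extract_prompt_and_session(chat))
print(extract_prompt_and_session(flat))
```

Records without any of the three session keys come back with `None`, which is the case where only cross-session analysis would apply.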

`generate-logs`

Generate realistic synthetic logs for testing across different production scenarios.

kv-cache-analyzer generate-logs output.jsonl \
  --scenario prod \
  --count 500 \
  --seed 42

Option	Description
`--scenario`	Scenario name (see table below)
`--count`	Number of log entries (turns, not sessions)
`--seed`	Random seed for reproducibility

Scenarios and expected KV cache profiles:

Scenario	Type	LRU Hit Rate	Within-Session	Best for testing
`prod`	single-turn	~75–85%	n/a	General mixed workload
`chatbot`	single-turn	~85–90%	n/a	High-cacheability baseline
`diverse`	single-turn	~5–10%	n/a	Low-cacheability baseline
`no-system`	single-turn	~5%	n/a	No system prompt workload
`customer-support`	multi-turn	~92%	~67%	Support ticket sessions
`ecommerce`	multi-turn	~94%	~69%	Shopping journey sessions
`code-assistant`	multi-turn	~92%	~60%	IDE debugging/refactor sessions
`rag-qa`	multi-turn	~93%*	~30%	RAG pipelines — shows cache limits
`legal-review`	multi-turn	~96%	~64%	Very long system prompts
`healthcare`	multi-turn	~95%	~45%	Mixed single/multi-turn
`content-moderation`	single-turn	~96%	~0%	Pure prefix caching pattern
`fintech`	multi-turn	~96%	~56%	Portfolio analysis + quick queries

* Note on rag-qa: The high cross-session LRU rate is an artifact of the word tokenizer seeing the static system header as a long shared prefix. With --tokenizer tiktoken, the rate drops significantly because retrieved context chunks vary per request. The lower figure is the one that reflects real RAG deployments, so use this scenario to demonstrate cache limits before assuming RAG will benefit from prefix caching.

`scrub`

kv-cache-analyzer scrub input.jsonl output.jsonl \
  --format openai \
  --dry-run

The --dry-run flag previews the replacements without writing the output file.

Replaces PII automatically:

Pattern	Placeholder
Email addresses	`[SCRUBBED_EMAIL]`
Phone numbers	`[SCRUBBED_PHONE]`
IPv4 / IPv6	`[SCRUBBED_IP]`
Credit card numbers	`[SCRUBBED_CARD]`
SSNs	`[SCRUBBED_SSN]`
UUIDs	`[SCRUBBED_ID]`
Long hex/base64 tokens	`[SCRUBBED_TOKEN]`
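
To make the placeholder table concrete, here is a minimal regex-based sketch of the same idea. The patterns are deliberately simplified and are assumptions for illustration only; the tool's own patterns are not shown in this document and are likely stricter.

```python
import re

# Illustrative patterns only. Order matters: more specific patterns run first.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[SCRUBBED_EMAIL]"),
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "[SCRUBBED_ID]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[SCRUBBED_IP]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SCRUBBED_SSN]"),
]

def scrub_text(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sample = "Contact bob@example.com from 10.0.0.12, SSN 123-45-6789."
print(scrub_text(sample))
```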

`list-models`

kv-cache-analyzer list-models

Model           Layers  KV Heads  Head Dim  KB/token (BF16)
llama-3-8b          32         8       128              128
llama-3-70b         80         8       128              320
llama-3-405b       126         8       128              504
mistral-7b          32         8       128              128
qwen2-7b            28         4       128               56
gemma-2-9b          42         8       256              336
...
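
The KB/token column can be reproduced from the other three columns with the usual per-token KV footprint: one K and one V tensor per layer, each of size KV heads × head dim, at 2 bytes per element for BF16. The sketch below assumes that is how the figure is derived; the function names are hypothetical and not part of the package.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: K and V in every layer (BF16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def prefix_vram_mib(prefix_tokens: int, layers: int, kv_heads: int, head_dim: int) -> float:
    """VRAM in MiB needed to keep a prefix of the given length cached."""
    return prefix_tokens * kv_bytes_per_token(layers, kv_heads, head_dim) / 2**20

print(kv_bytes_per_token(32, 8, 128) // 1024)      # llama-3-8b  -> 128
print(kv_bytes_per_token(80, 8, 128) // 1024)      # llama-3-70b -> 320
print(kv_bytes_per_token(28, 4, 128) // 1024)      # qwen2-7b    -> 56
print(round(prefix_vram_mib(113, 32, 8, 128), 1))  # 113-token prompt on llama-3-8b -> 14.1
```

The last line matches the "14 MB VRAM on Llama 3 8B" figure quoted for a 113-token system prompt at the top of this README.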

Python API

Basic analysis

from kv_cache_analyzer import analyze

report = analyze(
    log_path="scrubbed_requests.jsonl",
    model_name="llama-3-8b",
    cache_size_tokens=100_000,
)

lru = next(r for r in report.simulation_results if r.policy == "lru_prefix")

print(f"Requests:     {report.total_requests:,}")
print(f"LRU rate:     {lru.cache_hit_rate:.1%}")
print(f"Hot prefixes: {len(report.hot_prefixes)}")

for rec in report.recommendations:
    print(f"\n• {rec}")

Session-aware analysis

When logs include session_id fields, the report automatically includes a per-session breakdown:

from kv_cache_analyzer import analyze

report = analyze("session_logs.jsonl")

# Cross-session (prefix caching) vs within-session (stateful KV)
print(f"Sessions detected:       {report.sessions_detected}")
print(f"Cross-session hit rate:  {next(r for r in report.simulation_results if r.policy == 'lru_prefix').cache_hit_rate:.1%}")
print(f"Within-session hit rate: {report.within_session_hit_rate:.1%}")

# Per-session breakdown (sorted by within-session hit rate descending)
for session in report.session_breakdown[:5]:
    print(f"\n  Session {session.session_id}")
    print(f"    Turns:            {session.request_count}")
    print(f"    Avg turn length:  {session.avg_turn_length:.0f} tokens")
    print(f"    Within-sess rate: {session.within_session_hit_rate:.1%}")
    print(f"    Reusable tokens:  {session.within_session_cached_tokens:,}")

Interpreting the two hit rates:

Metric	What it measures	Served by
Cross-session (LRU)	System prompt + few-shot sharing across all users	vLLM APC, SGLang RadixAttention
Within-session	Turn-level prefix reuse within one conversation	Stateful KV persistence (e.g. LayerScale)

High within-session rate → your longest sessions would benefit from a server that keeps each user's KV state alive between turns, not just caches the common system prefix.
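
A small worked example may help separate the two rates. The sketch below computes a within-session rate as the share of tokens in each later turn that repeat the previous turn as a prefix. Both the function names and the choice of denominator (tokens in turns 2..N) are assumptions for illustration; the tool may normalize differently.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def within_session_hit_rate(turns: list[list[str]]) -> float:
    """Share of tokens in turns 2..N that repeat the previous turn as a prefix."""
    reusable = sum(common_prefix_len(prev, cur) for prev, cur in zip(turns, turns[1:]))
    total = sum(len(t) for t in turns[1:])
    return reusable / total if total else 0.0

# Turn 2 resends the whole of turn 1, then appends the reply and a new question.
session = [
    "sys: be brief | user: hi".split(),
    "sys: be brief | user: hi | bot: hello | user: refund?".split(),
]
print(f"{within_session_hit_rate(session):.1%}")
```

Here the first 6 of the second turn's 12 tokens are a replay of the first turn, so half of the prefill work could be skipped by a server that kept that session's KV state.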

Inspect hot prefixes

from kv_cache_analyzer import analyze

report = analyze("requests.jsonl", min_frequency=0.05, min_prefix_length=10)

for prefix in report.hot_prefixes:
    print(f"Frequency : {prefix.frequency:.1%}")
    print(f"Length    : {prefix.length} tokens")
    print(f"Stable    : {prefix.is_stable}")   # True = good pinning candidate
    print(f"Score     : {prefix.savings_potential:.1f}")  # frequency × length
    print(f"Preview   : {prefix.text_preview}")

Cache size sweep — find the elbow

from kv_cache_analyzer import analyze

report = analyze("requests.jsonl")

for r in report.sweep_results:
    bar = "█" * int(r.cache_hit_rate * 30)
    print(f"{r.cache_size_tokens:>12,}  {bar} {r.cache_hit_rate:.1%}")

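# Find where adding more cache stops helping: the first size whose next step
# up improves the hit rate by less than one percentage point.
# Assumes sweep_results is non-empty and ordered by ascending cache size.
sweep = report.sweep_results
elbow = sweep[-1]  # fallback: gains never flatten within the sweep
for current, larger in zip(sweep, sweep[1:]):
    if larger.cache_hit_rate - current.cache_hit_rate < 0.01:
        elbow = current
        break
print(f"Elbow: {elbow.cache_size_tokens:,} tokens → {elbow.cache_hit_rate:.1%}")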

Use tiktoken or HuggingFace tokenizers for precise counts

pip install "kv-cache-analyzer[tiktoken]"

from kv_cache_analyzer import analyze

# OpenAI BPE (GPT-4 / text-embedding vocab)
report = analyze("requests.jsonl", tokenizer_name="tiktoken")

# Exact model vocab (e.g. Llama 3)
report = analyze(
    "requests.jsonl",
    tokenizer_name="hf",
    hf_model="meta-llama/Meta-Llama-3-8B-Instruct",
)

CI/CD cache health gate

from kv_cache_analyzer import analyze

def check_cache_regression(log_path: str, min_hit_rate: float = 0.60) -> None:
    """Fail CI if a prompt change would drop the predicted cache hit rate."""
    report = analyze(log_path, cache_size_tokens=100_000)
    lru = next(r for r in report.simulation_results if r.policy == "lru_prefix")

    if lru.cache_hit_rate < min_hit_rate:
        # Guard against an empty recommendation list before indexing it.
        fix = report.recommendations[0] if report.recommendations else "n/a"
        raise SystemExit(
            f"Cache regression: {lru.cache_hit_rate:.1%} < threshold {min_hit_rate:.1%}\n"
            f"Likely cause: dynamic content in system prompt.\n"
            f"Fix: {fix}"
        )
    print(f"✓ Cache hit rate {lru.cache_hit_rate:.1%} — OK")

check_cache_regression("staging_requests.jsonl", min_hit_rate=0.70)
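
The "dynamic content in system prompt" failure mode named in the gate above is easy to see with two toy prompt layouts. This is a self-contained sketch with made-up prompt text and hypothetical helper names; it uses whitespace tokens rather than a real tokenizer.

```python
STATIC = "You are a support agent for Acme. Answer briefly and cite policy."

def dynamic_first(user: str, date: str) -> list[str]:
    """Per-request values placed before the static instructions."""
    return f"Today is {date}. User name: {user}. {STATIC}".split()

def static_first(user: str, date: str) -> list[str]:
    """Static instructions first, per-request values last."""
    return f"{STATIC} Today is {date}. User name: {user}.".split()

def shared_prefix_len(prompts: list[list[str]]) -> int:
    """Length of the token prefix common to every prompt."""
    n = 0
    for column in zip(*prompts):
        if len(set(column)) != 1:
            break
        n += 1
    return n

requests = [("ana", "2024-05-01"), ("raj", "2024-05-02")]
for build in (dynamic_first, static_first):
    prompts = [build(u, d) for u, d in requests]
    print(f"{build.__name__}: {shared_prefix_len(prompts)} of {len(prompts[0])} tokens shared")
```

With the date at the front, the shared prefix ends at the first differing token, so almost nothing is cacheable. Moving the static text first lets the whole instruction block be reused.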

Use the trie and simulator directly

from kv_cache_analyzer import PrefixTrie, CacheSimulator
from kv_cache_analyzer.tokenizer import get_tokenizer

tokenizer = get_tokenizer("word")

trie = PrefixTrie()
prompts = ["You are helpful. User: Hello", "You are helpful. User: Goodbye"]
for prompt in prompts:
    trie.insert(tokenizer.encode(prompt))

hot = trie.find_hot_prefixes(min_frequency=0.5, min_length=3)
for p in hot:
    print(p.text_preview, "→", p.frequency)

sim = CacheSimulator()
sequences = [tokenizer.encode(p) for p in prompts]
result = sim.simulate(sequences, cache_size_tokens=10_000, policy="lru_prefix")
print(f"Hit rate: {result.cache_hit_rate:.1%}")

Save report as JSON

import json
from kv_cache_analyzer import analyze

report = analyze("requests.jsonl", model_name="llama-3-8b")

output = {
    "summary": {
        "total_requests": report.total_requests,
        "total_tokens": report.total_tokens,
        "avg_request_length": report.avg_request_length,
    },
    "simulations": [
        {"policy": r.policy, "hit_rate": r.cache_hit_rate, "tokens_saved": r.cached_tokens}
        for r in report.simulation_results
    ],
    "hot_prefixes": [
        {"frequency": p.frequency, "length": p.length, "preview": p.text_preview}
        for p in report.hot_prefixes
    ],
    "session_analysis": {
        "sessions_detected": report.sessions_detected,
        "within_session_hit_rate": report.within_session_hit_rate,
        "breakdown": [
            {"session_id": s.session_id, "turns": s.request_count,
             "within_session_hit_rate": s.within_session_hit_rate}
            for s in report.session_breakdown
        ],
    },
    "recommendations": report.recommendations,
}

with open("report.json", "w") as f:
    json.dump(output, f, indent=2)

Or use the CLI flag directly:

kv-cache-analyzer analyze logs.jsonl --output-json report.json

How It Works

Request logs
     │
     ▼
┌─────────────┐   validate+scrub   ┌──────────────┐
│   Parser    │ ─────────────────▶ │ Prefix Trie  │
│  (JSONL /   │                    │              │
│  OpenAI /   │                    │ Counts how   │
│  vLLM /     │                    │ many requests│
│  plain)     │                    │ share each   │
│             │                    │ prefix path  │
└─────────────┘                    └──────┬───────┘
       │                                  │ hot prefixes
       │ session_id                        ▼
       ▼                           ┌──────────────┐
┌─────────────┐                    │  Simulator   │
│  Session    │                    │              │
│  Analysis   │                    │ no_cache     │
│             │                    │ lru_prefix   │ ──▶  SimulationResult
│  within-    │                    │ oracle       │
│  session    │                    │ sweep        │
│  hit rate   │                    └──────┬───────┘
└─────────────┘                           │
                                          ▼
                                   ┌──────────────┐
                                   │ Recommender  │ ──▶  Actionable text
                                   └──────────────┘
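
The Simulator stage in the diagram can be pictured as a replay loop over a block-level LRU cache. The sketch below is one simple way to write such a loop, assuming a block is identified by the full token prefix that ends at it and that a request stops hitting at its first missing block. `simulate_lru_prefix` is a hypothetical name, not the package's `CacheSimulator`.

```python
from collections import OrderedDict

def simulate_lru_prefix(sequences, cache_size_tokens, block_size=16):
    """Replay token sequences through a block-level LRU prefix cache; return hit rate."""
    capacity = cache_size_tokens // block_size   # number of blocks that fit
    cache = OrderedDict()                        # block key -> None, oldest first
    hit_tokens = total_tokens = 0
    for seq in sequences:
        total_tokens += len(seq)
        prefix_intact = True
        # Only full blocks are cacheable; a trailing partial block never hits.
        for end in range(block_size, len(seq) + 1, block_size):
            # Keyed by the whole prefix for clarity; real systems chain hashes instead.
            key = tuple(seq[:end])
            if prefix_intact and key in cache:
                hit_tokens += block_size
                cache.move_to_end(key)           # mark as most recently used
            else:
                prefix_intact = False            # later blocks cannot hit after a miss
                cache[key] = None
                cache.move_to_end(key)
                if len(cache) > capacity:
                    cache.popitem(last=False)    # evict the least recently used block
    return hit_tokens / total_tokens if total_tokens else 0.0

system = list(range(32))                                   # 32-token shared prefix
sequences = [system + [1000 + i] * 16 for i in range(4)]   # unique 16-token tails
print(simulate_lru_prefix(sequences, cache_size_tokens=10_000))
print(simulate_lru_prefix(sequences, cache_size_tokens=32))
```

With ample capacity, every request after the first reuses the two system-prompt blocks. With room for only two blocks, the unique tails evict the shared prefix before it can be reused, which is the thrashing regime a cache size sweep is meant to expose.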

Key concepts:

Prefix trie — every tokenized request is inserted into a trie. Each node tracks how many requests pass through it. High-count nodes at sufficient depth are "hot prefixes."
LRU simulation — replays the request trace through an LRU block cache (default 16 tokens/block, matching vLLM). Reports realistic hit rate accounting for cold starts and evictions.
Oracle simulation — infinite cache upper bound. Shows maximum achievable hit rate if memory were free.
Within-session analysis — for each session with multiple turns, counts the longest common prefix between consecutive turns. This measures how much additional caching is possible if the server maintains per-user KV state (stateful serving).
Savings potential — frequency × length. Longer prefixes shared by more requests have the highest cache ROI.
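
The prefix trie and the savings score described above can be sketched in a few dozen lines. This is an illustrative toy, not the package's `PrefixTrie`: the class name `TinyPrefixTrie` and the rule of reporting only the deepest node on each hot path are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    count: int = 0                                 # requests passing through this node
    children: dict = field(default_factory=dict)

class TinyPrefixTrie:
    def __init__(self):
        self.root = Node()
        self.total = 0                             # number of inserted requests

    def insert(self, tokens):
        self.total += 1
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
            node.count += 1

    def hot_prefixes(self, min_frequency, min_length):
        """Yield (prefix, frequency, score) for the deepest node of each hot path."""
        if self.total == 0:
            return
        stack = [(self.root, [])]
        while stack:
            node, path = stack.pop()
            hot = [(t, c) for t, c in node.children.items()
                   if c.count / self.total >= min_frequency]
            if not hot and len(path) >= min_length:
                freq = node.count / self.total
                yield path, freq, freq * len(path)  # score = frequency x length
            for tok, child in hot:
                stack.append((child, path + [tok]))

trie = TinyPrefixTrie()
for prompt in ["You are helpful. User: Hello",
               "You are helpful. User: Goodbye",
               "Translate this: cat"]:
    trie.insert(prompt.split())

for prefix, freq, score in trie.hot_prefixes(min_frequency=0.5, min_length=3):
    print(" ".join(prefix), f"{freq:.0%}", f"{score:.2f}")
```

Two of the three requests share the four-token path `You are helpful. User:`, so it is reported at a frequency of two thirds, while the one-off translation prompt falls below the frequency threshold.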

Security

The tool is designed to be safe to run against production logs:

Input hardening:

File size limit: Rejects files >512 MB by default (override: KV_CACHE_MAX_FILE_MB)
Per-line size limit: Rejects individual lines >1 MB — prevents memory bombs (override: KV_CACHE_MAX_LINE_BYTES)
JSON depth limit: Rejects deeply nested JSON objects — prevents stack overflow attacks (override: KV_CACHE_MAX_JSON_DEPTH)
Prompt length limit: Caps prompts at ~32K tokens — prevents quadratic trie memory usage (override: KV_CACHE_MAX_PROMPT_TOKENS)
Symlink protection: Refuses to open symlinks pointing to /dev/, /proc/, /sys/
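
As a rough picture of the per-line and depth limits above, the sketch below checks both before handing a line to the JSON parser. The environment variable names come from the list; the depth default of 32, the function names, and the decision to count brackets without parsing are assumptions for this example.

```python
import json
import os

# The 1 MB line limit follows the text; the depth default of 32 is an assumption.
MAX_LINE_BYTES = int(os.environ.get("KV_CACHE_MAX_LINE_BYTES", 1_000_000))
MAX_JSON_DEPTH = int(os.environ.get("KV_CACHE_MAX_JSON_DEPTH", 32))

def bracket_depth(line: str) -> int:
    """Maximum nesting of {} and [] outside string literals, without parsing."""
    depth = max_depth = 0
    in_string = escaped = False
    for ch in line:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch in "}]":
            depth -= 1
    return max_depth

def load_record(line: str, max_line_bytes: int = MAX_LINE_BYTES,
                max_depth: int = MAX_JSON_DEPTH):
    """Parse one JSONL line, rejecting oversized or deeply nested input first."""
    if len(line.encode("utf-8")) > max_line_bytes:
        raise ValueError("line exceeds size limit")
    if bracket_depth(line) > max_depth:
        raise ValueError("JSON nesting exceeds depth limit")
    return json.loads(line)

print(load_record('{"prompt": "hi"}'))
```

Counting brackets before parsing means a hostile line is rejected without ever reaching the recursive parser.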

Output hardening:

Path traversal prevention: Rejects output paths resolving to /etc/, /sys/, /proc/, and macOS system directories
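
One common way to implement this kind of check is to resolve the path first and compare it against a deny list afterwards, so that `..` segments and symlinks cannot hide the real destination. The sketch below assumes a POSIX system and covers only the three roots named above; `safe_output_path` is a hypothetical helper, not the tool's function.

```python
from pathlib import Path

# Subset of the roots named in the text. They are resolved too, so that a
# symlinked system directory is compared in its real location.
FORBIDDEN_ROOTS = tuple(Path(p).resolve() for p in ("/etc", "/sys", "/proc"))

def safe_output_path(raw: str) -> Path:
    """Resolve symlinks and '..' first, then reject paths under a forbidden root."""
    resolved = Path(raw).resolve()
    for root in FORBIDDEN_ROOTS:
        if resolved.is_relative_to(root):
            raise ValueError(f"refusing to write under {root}")
    return resolved

print(safe_output_path("report.json").name)
```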

Supply chain:

Weekly pip-audit scan for known CVEs in all dependencies
Bandit static analysis on every push
Trivy filesystem scan pinned to a commit SHA (not a mutable tag — addresses the Trivy supply-chain attack vector)
CycloneDX SBOM generated on every build

All security limits are overridable via environment variables for power users. See src/kv_cache_analyzer/security.py for the full list.

Supported Serving Frameworks

Framework	Prefix Caching	How to enable
vLLM	✅ APC (block-level LRU)	`--enable-prefix-caching`
SGLang	✅ RadixAttention (default ON)	Nothing — already on
TensorRT-LLM	✅ Block reuse	`KVCacheConfig(enable_block_reuse=True)`
TGI	✅ Prefix caching (v3 and later)	On by default in v3; upgrade older releases
Triton	✅ Via TensorRT-LLM backend	See TRT-LLM docs

Contributing

git clone https://github.com/bs258q/kv-cache-analyzer
cd kv-cache-analyzer
pip install -e ".[dev]"
pytest

Good first issues:

Add support for new log formats (LiteLLM, Triton, OpenLLM)
Build a Grafana dashboard JSON template for the JSON report output
Add per-surface breakdown when logs contain a surface field
Benchmark simulated hit rates against real vLLM gpu_prefix_cache_hit_rate metrics
Add support for multi-modal content (image token counting)

To add a new log format, implement a parser in src/kv_cache_analyzer/parser.py and add it to the LogFormat enum in src/kv_cache_analyzer/models.py.

To add a new scenario generator, add conversation templates and a gen_* function to src/kv_cache_analyzer/generators.py and register it in SESSION_GENERATORS.

License

Apache 2.0
