kv-cache-analyzer

Analyze LLM request logs to find KV cache optimization opportunities.

Tells you which prefixes to cache, how much VRAM it costs, what hit rate to expect, and whether your multi-tenant sessions benefit from stateful KV persistence — before you touch a single config file.

┌─────────────────────────────────────────────────────────┐
│  Your request logs  →  kv-cache-analyzer  →  Report     │
│                                                         │
│  "Your system prompt (113 tokens) appears in 94% of     │
│   requests. Enabling prefix caching would skip 87.6%    │
│   of prefill tokens. Cost: 14 MB VRAM on Llama 3 8B."   │
│                                                         │
│  "8 sessions detected. Within-session hit rate: 67%.    │
│   Stateful KV persistence would save 5,100 additional   │
│   tokens beyond prefix caching alone."                  │
└─────────────────────────────────────────────────────────┘

Why this exists

Prefix caching of the KV cache (vLLM APC, SGLang RadixAttention) can eliminate 50–90% of prefill compute, but only if your prompts are structured to take advantage of it. Most teams enable it and hope for the best. This tool tells you what hit rate to expect, where the gains are, and how to restructure prompts to maximize it.

It also tells you something prefix caching alone cannot: whether your multi-turn user sessions would benefit from stateful KV persistence — the pattern where the server keeps each user's KV state alive between turns rather than recomputing the full history from scratch on every request.

What it analyzes:

  • Which prefixes are shared across requests (and how frequently)
  • Simulated cache hit rate under LRU, oracle, and no-cache policies
  • Cache size sensitivity — the elbow where more VRAM stops helping
  • VRAM cost to cache each hot prefix, per model architecture
  • Session-aware analysis: cross-session vs within-session prefix reuse
  • Per-session breakdown sorted by stateful-serving benefit
  • Actionable prompt structure recommendations

Installation

Requires Python 3.11+.

pip install kv-cache-analyzer

Optional extras:

pip install "kv-cache-analyzer[tiktoken]"    # precise BPE token counts (OpenAI vocab)
pip install "kv-cache-analyzer[hf]"          # HuggingFace tokenizers for exact model vocab

Quick Start

1. Generate sample logs (no real logs needed)

# Realistic multi-surface prod simulation
kv-cache-analyzer generate-logs sample.jsonl --count 500 --scenario prod

# Multi-turn sessions with session_id (picks up within-session analysis automatically)
kv-cache-analyzer generate-logs sample.jsonl --count 100 --scenario customer-support
kv-cache-analyzer generate-logs sample.jsonl --count 100 --scenario code-assistant

# Full scenario list
kv-cache-analyzer generate-logs --help

2. Scrub PII from real logs

Always run this before analyzing production logs.

kv-cache-analyzer scrub prod_logs.jsonl scrubbed.jsonl --dry-run   # preview first
kv-cache-analyzer scrub prod_logs.jsonl scrubbed.jsonl             # write file

3. Analyze

kv-cache-analyzer analyze scrubbed.jsonl \
  --model llama-3-8b \
  --cache-size 100000

CLI Reference

analyze

# Flags:
#   --format         openai | vllm | jsonl | plain
#   --model          model for VRAM estimates (see list-models)
#   --cache-size     KV cache capacity in tokens
#   --block-size     cache block size (match your vLLM config)
#   --min-freq       minimum prefix frequency to report
#   --min-length     minimum prefix length (tokens) to report
#   --tokenizer      word | char | tiktoken | hf
#   --limit          analyze only the first N requests
#   --session-field  JSON key for session ID (auto-detected for openai/vllm)
#   --output-json    also write the full report as JSON
kv-cache-analyzer analyze logs.jsonl \
  --format openai \
  --model llama-3-70b \
  --cache-size 200000 \
  --block-size 16 \
  --min-freq 0.03 \
  --min-length 5 \
  --tokenizer word \
  --limit 50000 \
  --session-field session_id \
  --output-json report.json
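
Why --block-size matters: block-level prefix caches such as vLLM's reuse only whole blocks, so a shared prefix is effectively rounded down to a multiple of the block size. The sketch below shows that arithmetic. The helper name is illustrative and not part of this tool's API, and whether the analyzer rounds in exactly this way is an assumption.

```python
def cacheable_tokens(prefix_len: int, block_size: int = 16) -> int:
    """Tokens of a shared prefix that fit into whole cache blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return (prefix_len // block_size) * block_size


# A 113-token system prompt fills 7 full blocks of 16 tokens.
print(cacheable_tokens(113))      # 112
# With 32-token blocks only 3 full blocks fit.
print(cacheable_tokens(113, 32))  # 96
```

Smaller blocks waste fewer tail tokens, which is one reason to match the value to your serving configuration.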

Supported log formats:

Format            Description          Example
openai (default)  OpenAI chat API      {"messages": [{"role": "system", "content": "..."}]}
vllm              vLLM request logs    {"prompt": "...", "request_id": "..."}
jsonl             Generic JSONL        {"prompt": "..."} (use --prompt-field to change key)
plain             One prompt per line  plaintext

Session ID is auto-detected from session_id, conversation_id, or thread_id fields in openai/vllm formats.
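
A rough sketch of what that auto-detection could look like. The function name, the precedence order of the three keys, and the override parameter are assumptions for illustration, not the tool's actual implementation.

```python
import json
from typing import Optional

# Candidate keys named above; the lookup order here is an assumption.
SESSION_KEYS = ("session_id", "conversation_id", "thread_id")


def detect_session_id(record: dict, override: Optional[str] = None) -> Optional[str]:
    """Return the session ID from a parsed log record, or None if absent."""
    keys = (override,) if override else SESSION_KEYS
    for key in keys:
        value = record.get(key)
        if value is not None:
            return str(value)
    return None


line = '{"messages": [{"role": "user", "content": "hi"}], "conversation_id": "c-42"}'
print(detect_session_id(json.loads(line)))  # c-42
```

Records without any of these keys yield None and would be treated as single-turn requests in this sketch.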

generate-logs

Generate realistic synthetic logs for testing across different production scenarios.

# --scenario: see table below
# --count:    number of log entries (turns, not sessions)
# --seed:     for reproducibility
kv-cache-analyzer generate-logs output.jsonl \
  --scenario <name> \
  --count 500 \
  --seed 42

Scenarios and expected KV cache profiles:

Scenario            Type         LRU Hit Rate  Within-Session  Best for testing
prod                single-turn  ~75–85%       n/a             General mixed workload
chatbot             single-turn  ~85–90%       n/a             High-cacheability baseline
diverse             single-turn  ~5–10%        n/a             Low-cacheability baseline
no-system           single-turn  ~5%           n/a             No system prompt workload
customer-support    multi-turn   ~92%          ~67%            Support ticket sessions
ecommerce           multi-turn   ~94%          ~69%            Shopping journey sessions
code-assistant      multi-turn   ~92%          ~60%            IDE debugging/refactor sessions
rag-qa              multi-turn   ~93%*         ~30%            RAG pipelines (shows cache limits)
legal-review        multi-turn   ~96%          ~64%            Very long system prompts
healthcare          multi-turn   ~95%          ~45%            Mixed single/multi-turn
content-moderation  single-turn  ~96%          ~0%             Pure prefix caching pattern
fintech             multi-turn   ~96%          ~56%            Portfolio analysis + quick queries
* Note on rag-qa: The high cross-session LRU rate is an artifact of the word tokenizer seeing the static system header as a long shared prefix. With --tokenizer tiktoken, the rate drops significantly because retrieved context chunks vary per request. That lower rate is what real RAG deployments see, so use this scenario to demonstrate cache limits before assuming RAG will benefit from prefix caching.

scrub

kv-cache-analyzer scrub input.jsonl output.jsonl \
  --format openai \
  --dry-run    # preview without writing

Replaces PII automatically:

Pattern                 Placeholder
Email addresses         [SCRUBBED_EMAIL]
Phone numbers           [SCRUBBED_PHONE]
IPv4 / IPv6             [SCRUBBED_IP]
Credit card numbers     [SCRUBBED_CARD]
SSNs                    [SCRUBBED_SSN]
UUIDs                   [SCRUBBED_ID]
Long hex/base64 tokens  [SCRUBBED_TOKEN]
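
To illustrate the general idea, here is a minimal regex-based sketch covering three of the patterns above. The regular expressions are simplified assumptions and are not the ones the scrub command uses; real PII detection needs broader coverage.

```python
import re

# Illustrative patterns only (assumed, simplified); order matters when patterns overlap.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[SCRUBBED_EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SCRUBBED_SSN]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[SCRUBBED_IP]"),
]


def scrub(text: str) -> str:
    """Replace each match with its fixed placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(scrub("Contact jane.doe@example.com from 10.0.0.12, SSN 123-45-6789"))
# Contact [SCRUBBED_EMAIL] from [SCRUBBED_IP], SSN [SCRUBBED_SSN]
```

Fixed placeholders have a useful side effect for this kind of analysis: two prompts that differed only in a scrubbed value become identical at that position, so scrubbing tends not to break shared prefixes.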

list-models

kv-cache-analyzer list-models
Model           Layers  KV Heads  Head Dim  KB/token (BF16)
llama-3-8b          32         8       128              128
llama-3-70b         80         8       128              320
llama-3-405b       126         8       128              504
mistral-7b          32         8       128              128
qwen2-7b            28         4       128               56
gemma-2-9b          42         8       256              336
...
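
The KB/token column follows from the other three: each token stores one key and one value vector per layer and KV head, at 2 bytes per element in BF16. A small sketch of that arithmetic (the function name is illustrative, not part of the package):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes needed per token. dtype_bytes=2 corresponds to BF16."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


llama_3_8b = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(llama_3_8b // 1024)                  # 128 (KiB per token)

# VRAM to keep a 113-token system prompt cached, in MiB.
print(round(113 * llama_3_8b / 2**20, 1))  # 14.1
```

This matches the "14 MB VRAM on Llama 3 8B" figure quoted for a 113-token system prompt at the top of this README.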

Python API

Basic analysis

from kv_cache_analyzer import analyze

report = analyze(
    log_path="scrubbed_requests.jsonl",
    model_name="llama-3-8b",
    cache_size_tokens=100_000,
)

lru = next(r for r in report.simulation_results if r.policy == "lru_prefix")

print(f"Requests:     {report.total_requests:,}")
print(f"LRU rate:     {lru.cache_hit_rate:.1%}")
print(f"Hot prefixes: {len(report.hot_prefixes)}")

for rec in report.recommendations:
    print(f"\n{rec}")

Session-aware analysis

When logs include session_id fields, the report automatically includes a per-session breakdown:

from kv_cache_analyzer import analyze

report = analyze("session_logs.jsonl")

# Cross-session (prefix caching) vs within-session (stateful KV)
lru = next(r for r in report.simulation_results if r.policy == "lru_prefix")
print(f"Sessions detected:       {report.sessions_detected}")
print(f"Cross-session hit rate:  {lru.cache_hit_rate:.1%}")
print(f"Within-session hit rate: {report.within_session_hit_rate:.1%}")

# Per-session breakdown (sorted by within-session hit rate descending)
for session in report.session_breakdown[:5]:
    print(f"\n  Session {session.session_id}")
    print(f"    Turns:            {session.request_count}")
    print(f"    Avg turn length:  {session.avg_turn_length:.0f} tokens")
    print(f"    Within-sess rate: {session.within_session_hit_rate:.1%}")
    print(f"    Reusable tokens:  {session.within_session_cached_tokens:,}")

Interpreting the two hit rates:

Metric               What it measures                                    Served by
Cross-session (LRU)  System prompt + few-shot sharing across all users   vLLM APC, SGLang RadixAttention
Within-session       Turn-level prefix reuse within one conversation     Stateful KV persistence (e.g. LayerScale)

A high within-session rate means your longest sessions would benefit from a server that keeps each user's KV state alive between turns, rather than one that only caches the common system prefix.
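
For intuition, the within-session rate can be approximated by hand as the tokens each turn shares with the previous turn's prompt, divided by all tokens in the session. The sketch below makes that assumption explicit. It is not the package's implementation, and it counts system-prompt tokens too, which the real report may attribute to cross-session caching instead.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def within_session_hit_rate(turns: list[list[str]]) -> float:
    """Reusable tokens between consecutive turns, over all tokens in the session."""
    total = sum(len(t) for t in turns)
    reused = sum(common_prefix_len(prev, cur) for prev, cur in zip(turns, turns[1:]))
    return reused / total if total else 0.0


# Each turn resends the full history plus the new message.
turns = [
    "sys hello".split(),
    "sys hello reply thanks".split(),
    "sys hello reply thanks reply bye".split(),
]
print(f"{within_session_hit_rate(turns):.1%}")  # 50.0%
```

The first turn of a session can never hit, which is why short sessions show lower within-session rates than long ones.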

Inspect hot prefixes

from kv_cache_analyzer import analyze

report = analyze("requests.jsonl", min_frequency=0.05, min_prefix_length=10)

for prefix in report.hot_prefixes:
    print(f"Frequency : {prefix.frequency:.1%}")
    print(f"Length    : {prefix.length} tokens")
    print(f"Stable    : {prefix.is_stable}")   # True = good pinning candidate
    print(f"Score     : {prefix.savings_potential:.1f}")  # frequency × length
    print(f"Preview   : {prefix.text_preview}")

Cache size sweep — find the elbow

from kv_cache_analyzer import analyze

report = analyze("requests.jsonl")

for r in report.sweep_results:
    bar = "█" * int(r.cache_hit_rate * 30)
    print(f"{r.cache_size_tokens:>12,}  {bar} {r.cache_hit_rate:.1%}")

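# Find where adding more cache stops helping: the first size whose next step
# up adds less than one percentage point of hit rate. Assumes sweep_results
# is sorted by ascending cache size.
results = report.sweep_results
elbow = next(
    (cur for cur, nxt in zip(results, results[1:])
     if nxt.cache_hit_rate - cur.cache_hit_rate < 0.01),
    results[-1],  # gains never flattened: fall back to the largest size
)
print(f"Elbow: {elbow.cache_size_tokens:,} tokens → {elbow.cache_hit_rate:.1%}")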

Use tiktoken or HuggingFace tokenizers for precise counts

pip install "kv-cache-analyzer[tiktoken]"

from kv_cache_analyzer import analyze

# OpenAI BPE (GPT-4 / text-embedding vocab)
report = analyze("requests.jsonl", tokenizer_name="tiktoken")

# Exact model vocab (e.g. Llama 3)
report = analyze(
    "requests.jsonl",
    tokenizer_name="hf",
    hf_model="meta-llama/Meta-Llama-3-8B-Instruct",
)

CI/CD cache health gate

from kv_cache_analyzer import analyze

def check_cache_regression(log_path: str, min_hit_rate: float = 0.60) -> None:
    """Fail CI if a prompt change would drop the predicted cache hit rate."""
    report = analyze(log_path, cache_size_tokens=100_000)
    lru = next(r for r in report.simulation_results if r.policy == "lru_prefix")

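    if lru.cache_hit_rate < min_hit_rate:
        # The recommendations list may be empty, so do not index it blindly.
        fix = report.recommendations[0] if report.recommendations else "n/a"
        raise SystemExit(
            f"Cache regression: {lru.cache_hit_rate:.1%} < threshold {min_hit_rate:.1%}\n"
            f"Likely cause: dynamic content in system prompt.\n"
            f"Fix: {fix}"
        )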
    print(f"✓ Cache hit rate {lru.cache_hit_rate:.1%} — OK")

check_cache_regression("staging_requests.jsonl", min_hit_rate=0.70)

Use the trie and simulator directly

from kv_cache_analyzer import PrefixTrie, CacheSimulator
from kv_cache_analyzer.tokenizer import get_tokenizer

tokenizer = get_tokenizer("word")

trie = PrefixTrie()
prompts = ["You are helpful. User: Hello", "You are helpful. User: Goodbye"]
for prompt in prompts:
    trie.insert(tokenizer.encode(prompt))

hot = trie.find_hot_prefixes(min_frequency=0.5, min_length=3)
for p in hot:
    print(p.text_preview, "→", p.frequency)

sim = CacheSimulator()
sequences = [tokenizer.encode(p) for p in prompts]
result = sim.simulate(sequences, cache_size_tokens=10_000, policy="lru_prefix")
print(f"Hit rate: {result.cache_hit_rate:.1%}")
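
For readers who want to see what an LRU prefix simulation involves, here is a self-contained sketch of the general technique. It is not the package's CacheSimulator. The function name, the hash-chained block keys, and the rule that a block only hits when every earlier block of the same request also hit are assumptions modeled on block-level prefix caches.

```python
from collections import OrderedDict


def simulate_lru_prefix(sequences, cache_size_tokens: int, block_size: int = 16) -> float:
    """Replay token sequences through an LRU block cache and return the hit rate."""
    capacity = max(cache_size_tokens // block_size, 0)  # capacity in blocks
    cache: OrderedDict = OrderedDict()                   # oldest entry first
    hit_tokens = total_tokens = 0

    for tokens in sequences:
        total_tokens += len(tokens)
        prefix_intact = True
        key = None
        # Only full blocks are cacheable; a trailing partial block is skipped.
        for start in range(0, len(tokens) - block_size + 1, block_size):
            # Chain the parent key so equal blocks under different prefixes differ.
            key = hash((key, tuple(tokens[start:start + block_size])))
            if prefix_intact and key in cache:
                hit_tokens += block_size
                cache.move_to_end(key)
            else:
                prefix_intact = False
                cache[key] = None
                cache.move_to_end(key)
                while len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used

    return hit_tokens / total_tokens if total_tokens else 0.0


a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
b = [1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
rate = simulate_lru_prefix([a, b], cache_size_tokens=100, block_size=4)
print(f"{rate:.1%}")  # 33.3%
```

The first request is a cold start and misses entirely. The second reuses its first two blocks, so 8 of 24 total tokens are served from cache.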

Save report as JSON

import json
from kv_cache_analyzer import analyze

report = analyze("requests.jsonl", model_name="llama-3-8b")

output = {
    "summary": {
        "total_requests": report.total_requests,
        "total_tokens": report.total_tokens,
        "avg_request_length": report.avg_request_length,
    },
    "simulations": [
        {"policy": r.policy, "hit_rate": r.cache_hit_rate, "tokens_saved": r.cached_tokens}
        for r in report.simulation_results
    ],
    "hot_prefixes": [
        {"frequency": p.frequency, "length": p.length, "preview": p.text_preview}
        for p in report.hot_prefixes
    ],
    "session_analysis": {
        "sessions_detected": report.sessions_detected,
        "within_session_hit_rate": report.within_session_hit_rate,
        "breakdown": [
            {"session_id": s.session_id, "turns": s.request_count,
             "within_session_hit_rate": s.within_session_hit_rate}
            for s in report.session_breakdown
        ],
    },
    "recommendations": report.recommendations,
}

with open("report.json", "w") as f:
    json.dump(output, f, indent=2)

Or use the CLI flag directly:

kv-cache-analyzer analyze logs.jsonl --output-json report.json

How It Works

Request logs
     │
     ▼
┌─────────────┐   validate+scrub   ┌──────────────┐
│   Parser    │ ─────────────────▶ │ Prefix Trie  │
│  (JSONL /   │                    │              │
│  OpenAI /   │                    │ Counts how   │
│  vLLM /     │                    │ many requests│
│  plain)     │                    │ share each   │
│             │                    │ prefix path  │
└─────────────┘                    └──────┬───────┘
       │                                  │ hot prefixes
       │ session_id                        ▼
       ▼                           ┌──────────────┐
┌─────────────┐                    │  Simulator   │
│  Session    │                    │              │
│  Analysis   │                    │ no_cache     │
│             │                    │ lru_prefix   │ ──▶  SimulationResult
│  within-    │                    │ oracle       │
│  session    │                    │ sweep        │
│  hit rate   │                    └──────┬───────┘
└─────────────┘                           │
                                          ▼
                                   ┌──────────────┐
                                   │ Recommender  │ ──▶  Actionable text
                                   └──────────────┘

Key concepts:

  • Prefix trie: every tokenized request is inserted into a trie. Each node tracks how many requests pass through it. High-count nodes at sufficient depth are "hot prefixes."
  • LRU simulation: replays the request trace through an LRU block cache (default 16 tokens/block, matching vLLM). Reports a realistic hit rate that accounts for cold starts and evictions.
  • Oracle simulation: an infinite-cache upper bound. Shows the maximum achievable hit rate if memory were free.
  • Within-session analysis: for each session with multiple turns, measures the longest common prefix between consecutive turns. This shows how much additional caching is possible if the server maintains per-user KV state (stateful serving).
  • Savings potential: frequency × length. Longer prefixes shared by more requests have the highest cache ROI.
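
The prefix trie idea can be sketched in a few lines. The class below is a toy written for this explanation, not the package's PrefixTrie, and its rule for what counts as a hot prefix (the deepest node on a path whose frequency still meets the threshold) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    count: int = 0                                  # requests passing through this node
    children: dict = field(default_factory=dict)    # token -> Node


class TinyPrefixTrie:
    def __init__(self) -> None:
        self.root = Node()
        self.total = 0

    def insert(self, tokens: list[str]) -> None:
        self.total += 1
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
            node.count += 1

    def hot_prefixes(self, min_frequency: float, min_length: int):
        """Return (prefix, frequency) for the deepest still-hot node on each path."""
        results = []
        stack = [(self.root, [])]
        while stack:  # iterative walk, so long prompts cannot hit the recursion limit
            node, path = stack.pop()
            hot = [(t, c) for t, c in node.children.items()
                   if c.count / self.total >= min_frequency]
            if path and not hot and len(path) >= min_length:
                results.append((path, node.count / self.total))
            stack.extend((child, path + [tok]) for tok, child in hot)
        return results


trie = TinyPrefixTrie()
for prompt in ["You are helpful. User: Hello", "You are helpful. User: Goodbye"]:
    trie.insert(prompt.split())

for path, freq in trie.hot_prefixes(min_frequency=0.6, min_length=3):
    # Savings potential = frequency × length
    print(" ".join(path), "|", freq, "|", freq * len(path))
# You are helpful. User: | 1.0 | 4.0
```

Both prompts share the first four word tokens, so that path has frequency 1.0, while the diverging final token has frequency 0.5 and falls below the threshold.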

Security

The tool is designed to be safe to run against production logs:

Input hardening:

  • File size limit: Rejects files >512 MB by default (override: KV_CACHE_MAX_FILE_MB)
  • Per-line size limit: Rejects individual lines >1 MB — prevents memory bombs (override: KV_CACHE_MAX_LINE_BYTES)
  • JSON depth limit: Rejects deeply nested JSON objects — prevents stack overflow attacks (override: KV_CACHE_MAX_JSON_DEPTH)
  • Prompt length limit: Caps prompts at ~32K tokens — prevents quadratic trie memory usage (override: KV_CACHE_MAX_PROMPT_TOKENS)
  • Symlink protection: Refuses to open symlinks pointing to /dev/, /proc/, /sys/
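
As a rough illustration of the per-line size and JSON depth checks, here is a sketch under stated assumptions. The function names and the depth limit of 32 are hypothetical; the real defaults live in src/kv_cache_analyzer/security.py.

```python
import json

MAX_LINE_BYTES = 1_000_000  # the 1 MB per-line limit described above
MAX_JSON_DEPTH = 32         # assumed value; the real default is not stated here


def exceeds_depth(value, limit: int) -> bool:
    """Return True if dicts/lists nest deeper than `limit` (iterative, no recursion)."""
    stack = [(value, 1)]
    while stack:
        item, depth = stack.pop()
        if isinstance(item, (dict, list)):
            if depth > limit:
                return True
            children = item.values() if isinstance(item, dict) else item
            stack.extend((child, depth + 1) for child in children)
    return False


def parse_log_line(raw: bytes):
    """Parse one JSONL line, rejecting oversized or deeply nested input."""
    if len(raw) > MAX_LINE_BYTES:
        raise ValueError("line exceeds size limit")
    try:
        record = json.loads(raw)
    except RecursionError as exc:
        # json.loads itself recurses on nested input; treat that as a rejection.
        raise ValueError("JSON nesting too deep") from exc
    if exceeds_depth(record, MAX_JSON_DEPTH):
        raise ValueError("JSON nesting too deep")
    return record


print(parse_log_line(b'{"prompt": "hello"}'))  # {'prompt': 'hello'}
```

Checking the byte length before parsing is the important ordering: it bounds the work the JSON parser can be made to do.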

Output hardening:

  • Path traversal prevention: Rejects output paths resolving to /etc/, /sys/, /proc/, and macOS system directories
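
A minimal sketch of such an output-path check, assuming a blocklist of three system roots. The function name and the exact list are illustrative and are not taken from the package.

```python
from pathlib import Path

# Resolve the roots as well, since e.g. /etc is itself a symlink on macOS.
BLOCKED_ROOTS = tuple(Path(p).resolve() for p in ("/etc", "/sys", "/proc"))


def is_safe_output_path(path: str) -> bool:
    """Reject paths that resolve (after '..' and symlinks) into a blocked root."""
    resolved = Path(path).resolve()
    return not any(resolved.is_relative_to(root) for root in BLOCKED_ROOTS)


print(is_safe_output_path("/tmp/report.json"))         # True
print(is_safe_output_path("/tmp/../etc/report.json"))  # False
```

Resolving before comparing is what defeats "../" tricks: the check runs on the final location, not on the string the user typed.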

Supply chain:

  • Weekly pip-audit scan for known CVEs in all dependencies
  • Bandit static analysis on every push
  • Trivy filesystem scan pinned to a commit SHA (not a mutable tag — addresses the Trivy supply-chain attack vector)
  • CycloneDX SBOM generated on every build

All security limits are overridable via environment variables for power users. See src/kv_cache_analyzer/security.py for the full list.


Supported Serving Frameworks

Framework     Prefix Caching                  How to enable
vLLM          ✅ APC (block-level LRU)         --enable-prefix-caching
SGLang        ✅ RadixAttention (default ON)   Nothing, already on
TensorRT-LLM  ✅ Block reuse                   KVCacheConfig(enable_block_reuse=True)
TGI           ✅ Prefix caching (v3.0+)        On by default; on older releases, upgrade or migrate to vLLM or SGLang
Triton        ✅ Via TensorRT-LLM backend      See TRT-LLM docs

Contributing

git clone https://github.com/bs258q/kv-cache-analyzer
cd kv-cache-analyzer
pip install -e ".[dev]"
pytest

Good first issues:

  • Add support for new log formats (LiteLLM, Triton, OpenLLM)
  • Build a Grafana dashboard JSON template for the JSON report output
  • Add per-surface breakdown when logs contain a surface field
  • Benchmark simulated hit rates against real vLLM gpu_prefix_cache_hit_rate metrics
  • Add support for multi-modal content (image token counting)

To add a new log format, implement a parser in src/kv_cache_analyzer/parser.py and add it to the LogFormat enum in src/kv_cache_analyzer/models.py.

To add a new scenario generator, add conversation templates and a gen_* function to src/kv_cache_analyzer/generators.py and register it in SESSION_GENERATORS.


License

Apache 2.0
