Direct Access: https://KevinCheung2259.github.io/kvcache-hit-rate-calculator/
A tool for calculating the theoretical KVCache hit rate in LLM inference services. Based on queuing theory and cache theory, it provides accurate performance predictions and optimization suggestions.
- Precise Modeling: Mathematical modeling based on queuing theory and cache theory
- Visual Interface: Modern web interface with real-time calculation and chart display
- Parameterized Configuration: Supports model layers, KV heads, data types, and other parameters
- Optimization Suggestions: Automatic analysis and memory configuration optimization recommendations
- Sensitivity Analysis: Analysis of how different parameters affect hit rates
- Preset Configurations: Built-in mainstream model configurations (Mistral, Llama3, Qwen3, etc.)
kvcache-hit-rate-calculator/
├── kvcache_calculator.py   # Core calculation logic
├── index.html              # Web interface
├── style.css               # Style files
├── calculator.js           # Frontend JavaScript logic
├── example.py              # Python usage examples
├── test.py                 # Test suite
├── README.md               # Project documentation
├── requirements.txt        # Dependency management
└── LICENSE                 # Open source license
- Open https://KevinCheung2259.github.io/kvcache-hit-rate-calculator/
- Fill in the model configuration, system configuration, and conversation pattern parameters
- Model Configuration: Number of layers, KV heads, head dimension, data type
- System Configuration: Available memory
- Conversation Pattern: Average conversation length, new-conversation arrival rate, within-conversation interval, average sequence length
# Run examples
python example.py
# Or use the API directly
python -c "
from kvcache_calculator import *
calculator = KVCacheCalculator()
# ... your code
"

In LLM inference, KVCache stores the Key and Value matrices of the attention mechanism to avoid repeated calculations:
Memory per token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes
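For example, a hypothetical 32-layer model with 8 KV heads, a head dimension of 128, and an FP16 KVCache needs 2 × 32 × 8 × 128 × 2 = 131,072 bytes (128 KiB) per cached token.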
This tool is based on the following theoretical models:
- Little's Law: Average conversations = Arrival rate × Average stay time
- LRU Cache Strategy: Least Recently Used cache replacement algorithm
- Conversation-level Modeling: Considering temporal locality within the same conversation
- Hit Rate: Proportion of requests that hit KVCache
- Cache Utilization: Efficiency of cache space usage
- System QPS: Queries per second the system can handle
- Memory Efficiency: Effective utilization of cache memory
KVCache Memory per Token:
memory_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes
Where:
- 2 represents Key and Value
- num_layers: Number of model layers
- num_kv_heads: Number of Key-Value heads
- head_dim: Dimension of each attention head
- dtype_bytes: Bytes per element (FP16 = 2, FP8 = 1, etc.)
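As a minimal sketch (independent of the actual kvcache_calculator.py API), the formula maps directly to Python; the 32-layer, 8-KV-head configuration below is an illustrative assumption rather than one of the built-in presets:

DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}

def kv_memory_per_token(num_layers, num_kv_heads, head_dim, dtype="fp16"):
    # 2 accounts for Key and Value; the rest follows the formula above
    return 2 * num_layers * num_kv_heads * head_dim * DTYPE_BYTES[dtype]

# Hypothetical 32-layer model with 8 KV heads and head_dim 128
print(kv_memory_per_token(32, 8, 128, "fp16"))  # 131072 bytes (128 KiB) per token
print(kv_memory_per_token(32, 8, 128, "fp8"))   # 65536 bytes, half the footprint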
Maximum Cached Tokens:
max_cached_tokens = (available_memory - model_memory × 1.2) / memory_per_token
Active Conversations (Little's Law):
active_conversations = conversation_arrival_rate ร conversation_lifetime
conversation_lifetime = avg_conversation_length ร within_conversation_interval
Maximum Cached Conversations:
max_cached_conversations = max_cached_tokens / avg_tokens_per_conversation
avg_tokens_per_conversation = avg_conversation_length × avg_sequence_length
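Putting these formulas together with purely illustrative numbers (80 GiB of device memory, a 16 GiB model, and assumed conversation statistics; none of these values come from the tool's presets):

# All inputs below are assumptions for illustration only
memory_per_token = 131072            # bytes per token (FP16 example above)
available_memory = 80 * 1024**3      # 80 GiB of device memory
model_memory     = 16 * 1024**3      # 16 GiB of weights, reserved with a 1.2x margin

max_cached_tokens = (available_memory - model_memory * 1.2) / memory_per_token

conversation_arrival_rate    = 0.5   # new conversations per second
avg_conversation_length      = 8     # requests per conversation
within_conversation_interval = 30    # seconds between requests
avg_sequence_length          = 1024  # tokens per request

# Little's Law: concurrent conversations = arrival rate x lifetime
conversation_lifetime = avg_conversation_length * within_conversation_interval
active_conversations  = conversation_arrival_rate * conversation_lifetime

avg_tokens_per_conversation = avg_conversation_length * avg_sequence_length
max_cached_conversations    = max_cached_tokens / avg_tokens_per_conversation

print(f"{max_cached_tokens:,.0f} cacheable tokens")               # 498,074
print(f"{active_conversations:.0f} active conversations")         # 120
print(f"{max_cached_conversations:.1f} cacheable conversations")  # 60.8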
Case 1: Sufficient Cache (active_conversations ≤ max_cached_conversations)
hit_rate = 1 - (1 / avg_conversation_length)
Case 2: Insufficient Cache (active_conversations > max_cached_conversations)
cache_ratio = max_cached_conversations / active_conversations
intra_conversation_hit = 1 - (1 / avg_conversation_length)
inter_conversation_hit = cache_ratio
hit_rate = intra_conversation_hit × inter_conversation_hit
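A sketch of the two-case rule; with the illustrative numbers above (120 active versus roughly 61 cacheable conversations) it falls into the insufficient-cache branch:

def kvcache_hit_rate(active_conversations, max_cached_conversations,
                     avg_conversation_length):
    # Reuse within a conversation: every request after the first can hit
    intra_conversation_hit = 1 - 1 / avg_conversation_length
    if active_conversations <= max_cached_conversations:
        return intra_conversation_hit                # Case 1: sufficient cache
    # Case 2: only a fraction of conversations still resides in the cache
    cache_ratio = max_cached_conversations / active_conversations
    return intra_conversation_hit * cache_ratio

print(kvcache_hit_rate(120, 60.8, 8))  # 0.875 * 0.507 ≈ 0.44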
Derived QPS:
qps_per_conversation = avg_sequence_length / within_conversation_interval
derived_qps = conversation_arrival_rate × qps_per_conversation
Performance Metrics:
tokens_per_second = derived_qps × avg_sequence_length
cache_hits_per_second = tokens_per_second × hit_rate
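The throughput metrics, derived from the same assumed conversation parameters:

# Derived load, using the illustrative values from the sketches above
conversation_arrival_rate    = 0.5    # conversations per second
within_conversation_interval = 30     # seconds between requests
avg_sequence_length          = 1024   # tokens per request
hit_rate                     = 0.44   # from the hit-rate sketch

qps_per_conversation = avg_sequence_length / within_conversation_interval
derived_qps          = conversation_arrival_rate * qps_per_conversation

tokens_per_second     = derived_qps * avg_sequence_length
cache_hits_per_second = tokens_per_second * hit_rate

print(f"{derived_qps:.1f} QPS")                         # 17.1
print(f"{tokens_per_second:,.0f} tokens/s")             # 17,476
print(f"{cache_hits_per_second:,.0f} cached tokens/s")  # 7,690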
- Memory Size: Larger memory → More cache → Higher hit rate
- Conversation Pattern: Longer conversations → Higher hit rate
- Data Type: Lower precision types → Smaller memory usage → More cache
- System Load: Higher QPS → More competition → May reduce hit rate
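A small sensitivity sketch in the same spirit, sweeping available memory while reusing kvcache_hit_rate and the assumed workload from the sketches above (16 GiB model, 128 KiB per token, 120 active conversations of 8 requests each):

# Sweep available memory and watch the hit rate saturate at 1 - 1/8 = 0.875
for gib in (24, 40, 80, 160):
    max_tokens = (gib * 1024**3 - 16 * 1024**3 * 1.2) / 131072
    max_convs  = max_tokens / (8 * 1024)   # 8 requests x 1024 tokens per conversation
    print(f"{gib:>3} GiB -> hit rate {kvcache_hit_rate(120, max_convs, 8):.3f}")
    # -> 0.035, 0.152, 0.443, 0.875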
- Memory Optimization:
  - Use FP8, INT8, or FP16 precision KVCache to reduce memory usage
  - FP8 provides a good balance between precision and memory efficiency
  - Choose an appropriate memory configuration based on business requirements
- System Design:
  - Consider session-affinity load balancing
  - Optimize conversation distribution strategies
- Model Selection:
  - Balance precision against memory efficiency
  - Consider using MQA/GQA to reduce the number of KV heads
MIT License - See LICENSE file for details
Issues and Pull Requests to improve this tool are welcome!
- Support more cache management policies (FIFO, LFU, etc.)
- Support multi-instance with different schedule strategies
- Integrate more model architecture presets
Thanks to the following resources and projects for inspiration:
- Transformer architecture papers
- Various open-source LLM projects
- Cache theory and queuing theory related research
For questions or suggestions, please reach out by:
- Submitting a GitHub Issue
- Sending me an email
Based on queuing theory and other theoretical modeling | Actual performance may vary due to implementation details