
add sliding kv_cache #400

Merged
mike-ferguson merged 2 commits into main from add_kv_cache
Apr 2, 2026

Conversation

@mike-ferguson
Member

@mike-ferguson mike-ferguson commented Mar 31, 2026

Add KV cache with sliding window to HuggingfaceSubject for behavioral and engineering benchmarks

Summary

  • KV cache for behavioral tasks: digest_text now reuses past key/value pairs between steps instead of recomputing the full prefix attention on every forward pass. This applies only to behavioral and engineering tasks (Futrell, SyntaxGym) where no layer activations are needed — neural benchmarks are unchanged.
  • Sliding window cache: When the accumulated context exceeds model.config.max_position_embeddings, the oldest KV entries are dropped to keep the attention map at a constant O(max_len) size. This ensures O(1) compute per step for the entire corpus regardless of length.
  • DynamicCache compatibility: Sliding uses the public to_legacy_cache() / from_legacy_cache() API to avoid depending on internal DynamicCache attributes that have changed across transformers versions.
  • Additional patch (safetensors): previously, some models failed to load because torch raised an error related to safetensors. This PR fixes that loading path.
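The sliding eviction above can be sketched in a few lines. The real code operates on transformers' DynamicCache via the public to_legacy_cache() / from_legacy_cache() round-trip, where each layer's entry is a (key, value) tensor pair trimmed along the sequence dimension; in this illustrative sketch, plain lists stand in for those tensors and slide_kv is a hypothetical helper name, not the PR's code.

```python
def slide_kv(legacy_cache, max_len):
    """Keep only the most recent max_len cached entries per layer.

    legacy_cache mimics transformers' legacy cache format: a tuple of
    (keys, values) pairs, one per layer. In the real implementation the
    entries are tensors and the slice is taken along the sequence axis.
    """
    return tuple(
        (keys[-max_len:], values[-max_len:])
        for keys, values in legacy_cache
    )

# Usage: a 2-layer cache holding 5 tokens, window of 3 -> oldest 2 dropped
cache = tuple(([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]) for _ in range(2))
slid = slide_kv(cache, 3)
assert slid[0] == ([3, 4, 5], [30, 40, 50])
```

Converting through the legacy tuple format keeps the trim logic independent of DynamicCache internals, which is the compatibility point the third bullet makes.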

Motivation

Without KV cache, scoring Futrell2018 required one full forward pass per word, each recomputing attention over the entire growing prefix. For a ~10k word corpus this is O(n²) compute and O(n) peak memory — causing OOM on models ≥1B parameters and runtimes of 1.5–8+ hours depending on model size.

With this change, each step attends only over 1–2 new tokens (against the cached prefix), reducing total compute to O(n).
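As a back-of-envelope check of those complexity claims, the toy model below (not the benchmark code; tokens_processed is a hypothetical name, and it assumes one new token per step rather than 1-2) counts how many token positions the model must forward in each regime:

```python
def tokens_processed(n_words, use_cache):
    # Without the cache, scoring word t re-runs the forward pass over the
    # entire t-token prefix; with the cache, only the new token is forwarded.
    return sum(1 if use_cache else t for t in range(1, n_words + 1))

# For a ~10k-word corpus: ~50M token positions recomputed without the
# cache vs ~10k with it.
assert tokens_processed(10_000, use_cache=False) == 50_005_000
assert tokens_processed(10_000, use_cache=True) == 10_000
```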

Performance

| Model          | Before                  | After   |
|----------------|-------------------------|---------|
| TinyLlama-1.1B | ~1.5 hrs                | ~10 min |
| Falcon-7B      | OOM (28% at 1.5 hrs)    | ~21 min |
| Pythia-12B     | OOM (unknown time)      | ~26 min |

Scientific note

For models whose context window is smaller than the corpus (e.g. TinyLlama 2k on Natural Stories ~13k tokens), the sliding window introduces mild RoPE positional extrapolation beyond the trained range. This is a known tradeoff: the alternative (hard reset on overflow) is O(1) but discards all prior context, which is arguably a worse approximation. Scores for small-context models on long corpora should be interpreted with this caveat. Models with sufficiently large context windows (Mistral 32k, Qwen 128k) are unaffected.

@KartikP KartikP added the OOM label Mar 31, 2026
@mike-ferguson
Member Author

The unittest_plugins checks fail due to OOM; this is a known issue that will be resolved in a later PR.

@mike-ferguson mike-ferguson merged commit 8c92806 into main Apr 2, 2026
7 of 10 checks passed