Add KV cache with sliding window to HuggingfaceSubject for behavioral and engineering benchmarks

Summary

- digest_text now reuses past key/value pairs between steps instead of recomputing the full prefix attention on every forward pass. This applies only to behavioral and engineering tasks (Futrell, SyntaxGym) where no layer activations are needed; neural benchmarks are unchanged.
- Sliding window: once the cache grows past model.config.max_position_embeddings, the oldest KV entries are dropped to keep the attention map at a constant O(max_len) size. This ensures O(1) compute per step for the entire corpus regardless of length.
- DynamicCache compatibility: sliding uses the public to_legacy_cache()/from_legacy_cache() API to avoid depending on internal DynamicCache attributes that have changed across transformers versions.
- safetensors: previously, some models would fail because torch raised an error regarding safetensors. This fixes that.

Motivation
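The sliding-window trim described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes the legacy transformers cache layout (a tuple of per-layer (key, value) tensors of shape (batch, num_heads, seq_len, head_dim), which is what to_legacy_cache() returns), and trim_cache is a hypothetical helper name.

```python
import torch

def trim_cache(legacy_cache, max_len):
    """Drop the oldest KV entries so the cache never exceeds max_len tokens.

    legacy_cache: tuple of per-layer (key, value) pairs, each of shape
    (batch, num_heads, seq_len, head_dim) -- the legacy format produced
    by DynamicCache.to_legacy_cache(). Slicing the sequence dimension
    keeps only the most recent max_len positions.
    """
    return tuple(
        (key[:, :, -max_len:, :], value[:, :, -max_len:, :])
        for key, value in legacy_cache
    )

# Toy cache: 2 layers, batch=1, 4 heads, 10 cached tokens, head_dim=8.
cache = tuple(
    (torch.randn(1, 4, 10, 8), torch.randn(1, 4, 10, 8)) for _ in range(2)
)
trimmed = trim_cache(cache, max_len=6)
print(trimmed[0][0].shape)  # torch.Size([1, 4, 6, 8])
```

After trimming, the result would be fed back through DynamicCache.from_legacy_cache() before the next forward pass, which is why the PR relies on the public legacy-cache API rather than mutating DynamicCache internals.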
Without KV cache, scoring Futrell2018 required one full forward pass per word, each recomputing attention over the entire growing prefix. For a ~10k word corpus this is O(n²) compute and O(n) peak memory — causing OOM on models ≥1B parameters and runtimes of 1.5–8+ hours depending on model size.
With this change, each step attends only over 1–2 new tokens (against the cached prefix), reducing total compute to O(n).
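The compute saving can be made concrete with a token-count sketch (illustrative only; the real loop calls the model with past_key_values instead of counting tokens):

```python
def tokens_processed(n, use_cache):
    """Total tokens pushed through the model to score an n-token corpus
    one word at a time, with and without KV caching."""
    total = 0
    cached = 0
    for step in range(1, n + 1):
        if use_cache:
            total += step - cached  # only the new token attends against the cache
            cached = step
        else:
            total += step  # entire growing prefix is recomputed each step
    return total

n = 10_000
print(tokens_processed(n, use_cache=False))  # 50005000 -- O(n^2)
print(tokens_processed(n, use_cache=True))   # 10000    -- O(n)
```

For a ~10k-word corpus this is the difference between roughly 50 million and 10 thousand token-forwards, which matches the observed drop from multi-hour runtimes to minutes.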
Performance
Scientific note
For models whose context window is smaller than the corpus (e.g. TinyLlama 2k on Natural Stories ~13k tokens), the sliding window introduces mild RoPE positional extrapolation beyond the trained range. This is a known tradeoff: the alternative (hard reset on overflow) is O(1) but discards all prior context, which is arguably a worse approximation. Scores for small-context models on long corpora should be interpreted with this caveat. Models with sufficiently large context windows (Mistral 32k, Qwen 128k) are unaffected.