Add KV cache with sliding window to HuggingfaceSubject for behavioral and engineering benchmarks

Summary

- digest_text now reuses past key/value pairs between steps instead of recomputing the full prefix attention on every forward pass. This applies only to behavioral and engineering tasks (Futrell, SyntaxGym) where no layer activations are needed; neural benchmarks are unchanged.
- Sliding window: once the cache grows past model.config.max_position_embeddings, the oldest KV entries are dropped to keep the attention map at a constant O(max_len) size. This ensures O(1) compute per step for the entire corpus regardless of length.
- DynamicCache compatibility: sliding uses the public to_legacy_cache()/from_legacy_cache() API to avoid depending on internal DynamicCache attributes that have changed across transformers versions.
- safetensors: previously, some models would fail because torch raised an error regarding safetensors. This fixes that.

Motivation
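The sliding-window trim described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes the legacy transformers cache layout (a tuple of per-layer (key, value) tensors of shape (batch, num_heads, seq_len, head_dim), which is what to_legacy_cache() returns), and trim_cache is a hypothetical helper name.

```python
import torch

def trim_cache(legacy_cache, max_len):
    """Drop the oldest KV entries so the cache never exceeds max_len tokens.

    legacy_cache: tuple of per-layer (key, value) pairs, each of shape
    (batch, num_heads, seq_len, head_dim) -- the legacy format produced
    by DynamicCache.to_legacy_cache(). Slicing the sequence dimension
    keeps only the most recent max_len positions.
    """
    return tuple(
        (key[:, :, -max_len:, :], value[:, :, -max_len:, :])
        for key, value in legacy_cache
    )

# Toy cache: 2 layers, batch=1, 4 heads, 10 cached tokens, head_dim=8.
cache = tuple(
    (torch.randn(1, 4, 10, 8), torch.randn(1, 4, 10, 8)) for _ in range(2)
)
trimmed = trim_cache(cache, max_len=6)
print(trimmed[0][0].shape)  # torch.Size([1, 4, 6, 8])
```

After trimming, the result would be fed back through DynamicCache.from_legacy_cache() before the next forward pass, which is why the PR relies on the public legacy-cache API rather than mutating DynamicCache internals.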
Without KV cache, scoring Futrell2018 required one full forward pass per word, each recomputing attention over the entire growing prefix. For a ~10k word corpus this is O(n²) compute and O(n) peak memory — causing OOM on models ≥1B parameters and runtimes of 1.5–8+ hours depending on model size.
With this change, each step attends only over 1–2 new tokens (against the cached prefix), reducing total compute to O(n).
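The compute saving can be made concrete with a token-count sketch (illustrative only; the real loop calls the model with past_key_values instead of counting tokens):

```python
def tokens_processed(n, use_cache):
    """Total tokens pushed through the model to score an n-token corpus
    one word at a time, with and without KV caching."""
    total = 0
    cached = 0
    for step in range(1, n + 1):
        if use_cache:
            total += step - cached  # only the new token attends against the cache
            cached = step
        else:
            total += step  # entire growing prefix is recomputed each step
    return total

n = 10_000
print(tokens_processed(n, use_cache=False))  # 50005000 -- O(n^2)
print(tokens_processed(n, use_cache=True))   # 10000    -- O(n)
```

For a ~10k-word corpus this is the difference between roughly 50 million and 10 thousand token-forwards, which matches the observed drop from multi-hour runtimes to minutes.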
Performance
Scientific note
For models whose context window is smaller than the corpus (e.g. TinyLlama 2k on Natural Stories ~13k tokens), the sliding window introduces mild RoPE positional extrapolation beyond the trained range. This is a known tradeoff: the alternative (hard reset on overflow) is O(1) but discards all prior context, which is arguably a worse approximation. Scores for small-context models on long corpora should be interpreted with this caveat. Models with sufficiently large context windows (Mistral 32k, Qwen 128k) are unaffected.