Following up on the design paper posted here earlier — we ran an empirical test of the central claim.
Setup: a 78-turn conversation (~6,400 words) with natural noise accumulation (topic shifts, corrections, abandoned approaches, off-topic tangents), fed to Llama 3.1 8B via Ollama. Ten fact-retrieval questions were tested under two conditions: the full flat-log context vs. a curated thread-only context.
Result: 43.3% accuracy (full context) vs. 100% accuracy (curated context).
The model hallucinated facts, denied that information existed in the conversation, and picked up superseded details from the noise. These are exactly the failure modes predicted by the paper.
The takeaway for kernel-level orchestration: a well-curated short context dramatically outperforms a noisy long context, even when the long context fits comfortably within the model's window. Context selection dominates context length.
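For concreteness, here is a minimal sketch of how the two conditions differ. The turn and thread structure below is hypothetical (the real harness is in the linked repo); the point is only that the curated condition filters the flat log down to the turns relevant to the question being asked:

```python
# Sketch of the two context conditions: full flat log vs. curated thread.
# Turn/thread tagging is a hypothetical illustration, not the actual harness.

def build_contexts(turns, thread):
    """Return (full flat-log context, curated thread-only context).

    `turns` is a list of dicts like {"thread": str, "text": str};
    `thread` selects the turns relevant to the current question.
    """
    full = "\n".join(t["text"] for t in turns)
    curated = "\n".join(t["text"] for t in turns if t["thread"] == thread)
    return full, curated

# Noise (tangents, superseded statements from other threads) is dropped
# from the curated view; the relevant thread, including its own
# corrections, is kept in order.
turns = [
    {"thread": "budget",  "text": "The Q3 budget is $40k."},
    {"thread": "tangent", "text": "Unrelated: lunch at noon?"},
    {"thread": "budget",  "text": "Correction: the Q3 budget is $45k."},
]
full, curated = build_contexts(turns, "budget")
```

Both strings would then be sent as context to the same model with the same question; the only variable is which turns survive curation.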
Full results, methodology, and reproducible code: github.com/MikeyBeez/fuzzyOS/discussions/2