feat: add epistemic safety engine to prevent LLM psychosis #1

Open
bcd532 wants to merge 21 commits into main from claude/improve-llm-safety-TC7zG

Conversation


@bcd532 bcd532 commented Mar 17, 2026

The core problem: LLM-generated speculation was being stored as "knowledge,"
recalled as context for future prompts, and used as evidence to generate more
speculation — creating a self-reinforcing delusion loop where confidence
inflated without any empirical grounding.

New module: epistemic_safety.py

  • Provenance tracking: every claim tagged with actual source type
  • Confidence ceilings: LLM self-verification capped at 0.60, synthesis at
    0.45, deep synthesis decays 0.10 per depth level
  • Circular reference detection: blocks LLM output too similar to existing
    LLM-generated memories (prevents self-reinforcement)
  • False certainty detection: flags phrases like "this proves" or "beyond
    doubt" from LLM outputs and reduces confidence
  • Epistemic honesty scoring: rewards hedging and citations, penalizes
    overconfidence
  • Epistemic markers: all LLM-generated content prefixed with provenance tags
    so future contexts show where claims actually came from
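
The ceiling-and-decay rules above can be sketched as a small clamping function. The ceiling values (0.60 for self-verification, 0.45 for synthesis, 0.10 decay per depth level) come from this description; the function name, signature, and source labels are illustrative assumptions, not the actual epistemic_safety.py API.

```python
# Hypothetical sketch of the confidence-ceiling rules described above.
# Source-type strings and the function signature are assumptions.

SELF_VERIFICATION_CEILING = 0.60
SYNTHESIS_CEILING = 0.45
DEPTH_DECAY = 0.10

def cap_confidence(raw: float, source: str, depth: int = 0) -> float:
    """Clamp an LLM-reported confidence to its provenance ceiling."""
    if source == "llm_self_verification":
        ceiling = SELF_VERIFICATION_CEILING
    elif source == "llm_synthesis":
        # Deep synthesis loses 0.10 of headroom per recursion level.
        ceiling = max(0.0, SYNTHESIS_CEILING - DEPTH_DECAY * depth)
    else:
        ceiling = 1.0  # empirically sourced claims keep their confidence
    return min(raw, ceiling)
```

For example, a depth-2 synthesis claiming 0.95 confidence would be clamped to 0.45 - 0.20 = 0.25 under this sketch.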

Integrated into:

  • synthesis.py: hypotheses get provenance tags, confidence ceilings, circular
    reference blocking. Deep synthesis gets stricter limits per depth level.
  • verification.py: LLM "verifying" LLM output no longer inflates confidence.
    Claims marked VERIFIED without specific citations downgraded to UNVERIFIABLE.
  • research.py: all LLM-sourced findings tagged with provenance warnings.
  • dialectic.py: "universal patterns" relabeled as LLM-speculated, confidence
    capped. Thesis/antithesis/synthesis all get provenance tracking.
  • gate.py: false certainty in LLM output penalized during storage scoring.
  • auto.py: epistemic safety protocol injected into autonomous loop system
    prompt. Context injection includes provenance warnings.
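
The verification.py downgrade rule (VERIFIED without a specific citation becomes UNVERIFIABLE) might look roughly like the following. The verdict strings match the description above, but the function name and the citation heuristic are assumptions for illustration.

```python
# Illustrative sketch of the citation-gated downgrade described above.
# The citation pattern (URLs, DOIs, bracketed refs, page numbers) is a
# guess at what "specific citation" means, not the real implementation.

import re

CITATION_RE = re.compile(r"(https?://\S+|\bdoi:\S+|\[\d+\]|\bp\.\s*\d+)")

def audit_verdict(verdict: str, justification: str) -> str:
    """Downgrade VERIFIED verdicts whose justification lacks a citation."""
    if verdict == "VERIFIED" and not CITATION_RE.search(justification):
        return "UNVERIFIABLE"
    return verdict
```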

https://claude.ai/code/session_01HD6xGRMauZKsgE8ehWXnAd

rail and others added 21 commits March 15, 2026 02:26
Knowledge graph (one/graph.py):
  - d3.js force-directed visualization served at /graph
  - Nodes colored by type, sized by observation count
  - Click nodes to see linked memories
  - Auto-refreshes every 10 seconds
  - Endpoints: GET /graph, /api/graph, /api/entity/<name>/memories

Watch mode (one/watch.py):
  - /watch [dir] monitors for file changes
  - Auto-logs diffs to memory store with entity linking
  - Polls every 2s, ignores .git/venv/pycache
  - /unwatch stops monitoring
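
The polling core of a watcher like this is just an mtime snapshot diffed every cycle. A minimal sketch, assuming the real one/watch.py also computes diffs and logs to the memory store (omitted here); the helper names are hypothetical.

```python
# Minimal sketch of the 2s polling watcher described above: walk the
# tree, skip ignored dirs, and report files whose mtimes changed.

import os

IGNORED = {".git", "venv", ".venv", "__pycache__"}

def snapshot(root: str) -> dict[str, float]:
    """Map each watched file path to its last-modified time."""
    mtimes: dict[str, float] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends.
        dirnames[:] = [d for d in dirnames if d not in IGNORED]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.path.getmtime(path)
            except OSError:
                pass  # file vanished between walk and stat
    return mtimes

def changed_files(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """Paths that are new or whose mtime moved since the last poll."""
    return [p for p, t in new.items() if old.get(p) != t]
```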

CLAUDE.md generator (one/claudemd.py):
  - /generate exports rule tree + entities as CLAUDE.md
  - Claude reads it natively on every session start
  - Grouped by context with key files, concepts, tools
  - Endpoint: GET /api/claudemd

VISION.md added with full project roadmap.
6,547 lines across 22 files.
Synthesis (one/synthesis.py):
  - Scans entity graph for cross-domain connections
  - Generates hypotheses via Gemma when unrelated concepts co-occur
  - Recursive deep synthesis builds a DAG of layered insights
  - /synthesize triggers on current project

Deep Research (one/research.py):
  - /research <topic> launches autonomous research loop
  - Structured prompts: findings, open problems, cross-disciplinary, contrarian
  - Extracts findings, builds citation graph, identifies gaps
  - Fills gaps with targeted follow-up, runs synthesis across findings
  - /frontier shows open questions and active research topics

Playbook System (one/playbook.py):
  - Auto-generated after every /auto completion
  - Distills key decisions, reusable patterns, pitfalls via Gemma
  - Recalled by vector similarity on similar future tasks
  - Injected into auto loop context so the same class of problem is never
    solved twice
  - /playbooks lists all with category and recall count

Auto loop now:
  - Injects relevant playbooks before starting
  - Generates playbook on completion

Stats now track syntheses, playbooks, and research topics.
8,166 lines across 25 files.
…, type safety

Upgrade core engines: research.py (iterative deepening, adversarial prompts,
source quality scoring, quantitative extraction), synthesis.py (novelty scoring,
hypothesis testing, contradiction detection), auto.py (reflection checkpoints,
milestone tracking, crash recovery via state serialization).

Expand entity extraction to 10+ types with relationship extraction. Harden
server with API key auth, rate limiting, and CORS. Fix encode_tagged tag vector
norm bug and profanity substring false positives. Resolve 24 Pyright type errors
across core modules.

Add comprehensive test suite: 440 tests across 10 files covering hdc, gate,
entities, store, excitation, rules, research, synthesis, server, and playbook.
Swarm multi-agent orchestration with Conductor and 14 agent roles.
Dialectic chains (thesis→antithesis→synthesis→verification→meta).
Analogical transfer with cross-domain structural isomorphism via HDC.
Contradiction mining with severity levels and resolution tracking.
Self-verifying knowledge engine with confidence lifecycle and source quality.
Active question generation via frontier mapping and information value scoring.
Executable verification engine with code and LLM experiment paths.
Swarm TUI dashboard with sparklines, breakthrough alerts, dialectic panels.
Knowledge health metrics with volume/entities/intelligence/quality/warnings.
Morgoth Mode — 7-phase autonomous research loop with eureka capture.
Foundry audit with quality scoring, duplicate detection, garbage cleanup.

883 tests passing (443 new + 440 existing), zero Pyright errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Research findings auto-inject into Claude's next message context
  via _preloaded_context after research completes
- Added all MIROBEAR commands to ONE_COMMANDS set so they route
  through our system instead of falling through to Claude
- Notification fires when research completes
- Escape now stops current response/auto loop, doesn't exit the app
- Ctrl+Q is the new quit
- /swarm, /morgoth, /health, /audit, /focus, /inject all routed
  through our system — never leak to Claude
- Help text updated with intelligence section showing all new commands
- Swarm, morgoth, health, audit handlers wired to their modules
Auto system prompt now requires:
- Actually running features as a user would, not just writing pytest
- Verifying all imports resolve before declaring done
- Testing through the actual app entry point
- Curling endpoints, running CLI commands, importing from app
- Smoke tests = failure. Real behavior under real conditions.

Removed all orion references from Claude memory system.
17 bug fixes across 8 files:
- audit.py: run_full_audit signature, auto_fix implementation, entity column queries
- health.py: wrong column names (entity_type→type, rules→rule_nodes, created→timestamp)
- research.py: schema migration for 7 missing columns
- entities.py: filter .venv/site-packages/slash commands from file entities
- contradictions.py + verification.py: recall("") zero vector → get_recent()
- app.py: audit result key, help text, /watch backend, command dispatch
- server.py: malformed CORS do_OPTIONS
- client.py: push_entity wrong kwarg (entity_id not accepted by Foundry action)

New: engine.py — Zero Hallucination Engine (1400+ lines)
- AST-parses Python, extracts SQL, checks against live PRAGMA table_info
- Multi-language: Python, C/C99, JS/TS, HTML, CSS, JSON
- Codebase ontology: 483 symbols, 4311 calls, 122 file deps mapped
- Impact analysis: knows what breaks before you break it
- Signature change detection with caller warnings
- Symbol removal detection blocks edits that break callers
- Decision logging + turn logging to knowledge graph
- Session logs at ~/.one/logs/ survive crashes
- verify_edit_with_impact() wired into post-edit hooks
- Post-completion auto-verify after every Claude turn
- Foundry sync pushes ontology through MemoryEntry + Entity
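
The SQL-vs-live-schema check can be sketched against PRAGMA table_info. The real engine.py AST-parses whole Python files and extracts embedded SQL; this sketch checks one statement against one table, with a deliberately crude identifier filter, and assumes the table name is trusted (it is interpolated into the PRAGMA).

```python
# Hedged sketch of the schema check described above: pull column-like
# identifiers out of a SQL string and compare them to what the live
# database reports via PRAGMA table_info.

import re
import sqlite3

def table_columns(conn: sqlite3.Connection, table: str) -> set[str]:
    """Columns the live database actually has (PRAGMA row[1] is the name)."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def unknown_columns(conn: sqlite3.Connection, table: str, sql: str) -> set[str]:
    """Identifiers in the SQL that are neither keywords nor real columns."""
    known = table_columns(conn, table)
    idents = set(re.findall(r"[A-Za-z_]\w*", sql))
    # Crude filter: SQL keywords and the table name are not column refs.
    keywords = {"select", "from", "where", "and", "or", table.lower()}
    return {i for i in idents if i.lower() not in keywords} - known
```

A query referencing a column that was renamed in the live schema (e.g. entity_type vs type, one of the bugs fixed above) would surface here before the edit ships.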

New: ground.py — Ground truth population
- Introspects live DB schemas and runtime module signatures
- Stores verified facts at 0.95-0.99 confidence
- 165 ground truths: schemas, signatures, contracts, traps
- Surfaces in recall when context is relevant
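
Harvesting ground truths from runtime signatures can be done with the inspect module. A sketch under stated assumptions: the function name and the flat fact format are hypothetical, and the real ground.py also introspects DB schemas, which is omitted here.

```python
# Illustrative sketch of "ground truth" harvesting from a live module:
# record the exact runtime signature of every public callable.

import inspect
from types import ModuleType

def harvest_signatures(module: ModuleType) -> dict[str, str]:
    """Map each public callable to its exact runtime signature string."""
    facts: dict[str, str] = {}
    for name, obj in vars(module).items():
        if callable(obj) and not name.startswith("_"):
            try:
                facts[f"{module.__name__}.{name}"] = str(inspect.signature(obj))
            except (TypeError, ValueError):
                pass  # some builtins have no introspectable signature
    return facts
```

Facts harvested this way reflect what the code actually exposes at runtime, which is the point: they are verified, not LLM-asserted.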

New: LOCK.md — Single source of truth
- Exact SQLite schemas, exact class signatures, command wiring
- Known traps, verification checklist

Morgoth rewrite:
- Phases use start_research() and knowledge engines instead of spawning
  Claude subprocesses that silently fail
- Phase.CONTINUOUS now reachable (was unreachable due to _advance_phase bug)
- Swarm/morgoth stored on self (no more GC killing them)
- /stop kills auto + swarm + morgoth
- All phases log to ~/.one/logs/ via engine session logger

TUI:
- Input box: TextArea with word wrap, auto-expand 3-10 lines
- /verify, /ground commands
- Boot sequence: auto map + ground + verify + foundry sync (background)
- Post-completion: re-map + re-verify edited files after every turn
- Foundry sync moved to background thread (non-blocking boot)

Auto prompt: GROUND TRUTH PROTOCOL + LOCK.md reference + ground truth
injection in context gathering
_call_ollama() now tries Claude (via proxy.quick_ask) first, falls back
to Gemma. Every module that calls _call_ollama automatically gets
Claude's brain: research, dialectic, synthesis, contradictions,
analogy, verification, experiments.

Added ClaudeProxy.ask() for synchronous prompt/response and
ClaudeProxy.quick_ask() static method for one-shot questions.
The AifGate was named "Active Inference" but was just a weighted linear
combination of hand-tuned heuristics. This replaces the scoring engine
with an actual Friston free-energy framework:

- Generative model: mixture of learned Regime clusters in HDC vector space
- Variational free energy: F = precision * surprise - log(prior)
- Precision learning: inverse variance per regime, updated online
- Expected free energy: epistemic value of storing (information gain)
- Belief updating: regime centroids shift toward observations
- Regime lifecycle: creation, merging, replacement of topic clusters

The hard noise/redaction filters and content priors are retained as
pre-processing — they handle classification that vector similarity cannot.
The AIF replaces the ad-hoc weighted combination with a principled
decision mechanism where high free energy = genuinely informative.

Key properties:
- First observations are maximally novel (no model yet)
- Repeated topics decrease surprise as regimes tighten
- Precision increases for predictable regimes (routine conversations)
- Novel topics get high epistemic value (would shift beliefs)
- Redundant messages blocked by cosine similarity in recent buffer
- Epistemic safety checks still penalize overconfident LLM output
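
A toy sketch of the F = precision * surprise - log(prior) score, under stated assumptions: surprise is squared distance to the nearest regime centroid, precision is that regime's learned inverse variance, and the prior is its relative weight. The real AifGate operates on HDC vectors with online belief updates; the regime dict layout here is hypothetical.

```python
# Toy free-energy score against a mixture of regime clusters, matching
# the formula above: F = precision * surprise - log(prior).

import math

def free_energy(x: list[float], regimes: list[dict]) -> float:
    """Score an observation against the closest regime; higher = more novel."""
    if not regimes:
        return float("inf")  # no model yet: maximally novel, always store
    total_weight = sum(r["weight"] for r in regimes)

    def sq_dist(r: dict) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, r["centroid"]))

    best = min(regimes, key=sq_dist)          # nearest regime explains x
    surprise = sq_dist(best)                  # prediction error
    prior = best["weight"] / total_weight     # how expected this regime is
    return best["precision"] * surprise - math.log(prior)
```

This reproduces the key properties listed above: with no regimes the score is infinite (first observations are maximally novel), and points far from every centroid score higher than points near a well-established regime.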

https://claude.ai/code/session_01HD6xGRMauZKsgE8ehWXnAd
bcd532 force-pushed the main branch 2 times, most recently from 2fe6262 to ce3e039 on March 17, 2026 at 15:00