feat: add epistemic safety engine to prevent LLM psychosis #1

Open
bcd532 wants to merge 21 commits into main from claude/improve-llm-safety-TC7zG

Conversation


@bcd532 bcd532 commented Mar 17, 2026

The core problem: LLM-generated speculation was being stored as "knowledge,"
recalled as context for future prompts, and used as evidence to generate more
speculation — creating a self-reinforcing delusion loop where confidence
inflated without any empirical grounding.

New module: epistemic_safety.py

  • Provenance tracking: every claim tagged with actual source type
  • Confidence ceilings: LLM self-verification capped at 0.60, synthesis at
    0.45, deep synthesis decays 0.10 per depth level
  • Circular reference detection: blocks LLM output too similar to existing
    LLM-generated memories (prevents self-reinforcement)
  • False certainty detection: flags phrases like "this proves" or "beyond
    doubt" from LLM outputs and reduces confidence
  • Epistemic honesty scoring: rewards hedging and citations, penalizes
    overconfidence
  • Epistemic markers: all LLM-generated content prefixed with provenance tags
    so future contexts show where claims actually came from
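
The ceiling-and-decay rules above can be sketched as a small clamping function. The ceiling values (0.60 for self-verification, 0.45 for synthesis, 0.10 decay per depth level) come from this description; the function name, signature, and source labels are illustrative assumptions, not the actual epistemic_safety.py API.

```python
# Hypothetical sketch of the confidence-ceiling rules described above.
# Source-type strings and the function signature are assumptions.

SELF_VERIFICATION_CEILING = 0.60
SYNTHESIS_CEILING = 0.45
DEPTH_DECAY = 0.10

def cap_confidence(raw: float, source: str, depth: int = 0) -> float:
    """Clamp an LLM-reported confidence to its provenance ceiling."""
    if source == "llm_self_verification":
        ceiling = SELF_VERIFICATION_CEILING
    elif source == "llm_synthesis":
        # Deep synthesis loses 0.10 of headroom per recursion level.
        ceiling = max(0.0, SYNTHESIS_CEILING - DEPTH_DECAY * depth)
    else:
        ceiling = 1.0  # empirically sourced claims keep their confidence
    return min(raw, ceiling)
```

For example, a depth-2 synthesis claiming 0.95 confidence would be clamped to 0.45 - 0.20 = 0.25 under this sketch.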

Integrated into:

  • synthesis.py: hypotheses get provenance tags, confidence ceilings, circular
    reference blocking. Deep synthesis gets stricter limits per depth level.
  • verification.py: LLM "verifying" LLM output no longer inflates confidence.
    Claims marked VERIFIED without specific citations downgraded to UNVERIFIABLE.
  • research.py: all LLM-sourced findings tagged with provenance warnings.
  • dialectic.py: "universal patterns" relabeled as LLM-speculated, confidence
    capped. Thesis/antithesis/synthesis all get provenance tracking.
  • gate.py: false certainty in LLM output penalized during storage scoring.
  • auto.py: epistemic safety protocol injected into autonomous loop system
    prompt. Context injection includes provenance warnings.
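
The verification.py downgrade rule (VERIFIED without a specific citation becomes UNVERIFIABLE) might look roughly like the following. The verdict strings match the description above, but the function name and the citation heuristic are assumptions for illustration.

```python
# Illustrative sketch of the citation-gated downgrade described above.
# The citation pattern (URLs, DOIs, bracketed refs, page numbers) is a
# guess at what "specific citation" means, not the real implementation.

import re

CITATION_RE = re.compile(r"(https?://\S+|\bdoi:\S+|\[\d+\]|\bp\.\s*\d+)")

def audit_verdict(verdict: str, justification: str) -> str:
    """Downgrade VERIFIED verdicts whose justification lacks a citation."""
    if verdict == "VERIFIED" and not CITATION_RE.search(justification):
        return "UNVERIFIABLE"
    return verdict
```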

https://claude.ai/code/session_01HD6xGRMauZKsgE8ehWXnAd

rail and others added 21 commits March 15, 2026 02:26
Knowledge graph (one/graph.py):
  - d3.js force-directed visualization served at /graph
  - Nodes colored by type, sized by observation count
  - Click nodes to see linked memories
  - Auto-refreshes every 10 seconds
  - Endpoints: GET /graph, /api/graph, /api/entity/<name>/memories

Watch mode (one/watch.py):
  - /watch [dir] monitors for file changes
  - Auto-logs diffs to memory store with entity linking
  - Polls every 2s, ignores .git/venv/pycache
  - /unwatch stops monitoring
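
The polling core of a watcher like this is just an mtime snapshot diffed every cycle. A minimal sketch, assuming the real one/watch.py also computes diffs and logs to the memory store (omitted here); the helper names are hypothetical.

```python
# Minimal sketch of the 2s polling watcher described above: walk the
# tree, skip ignored dirs, and report files whose mtimes changed.

import os

IGNORED = {".git", "venv", ".venv", "__pycache__"}

def snapshot(root: str) -> dict[str, float]:
    """Map each watched file path to its last-modified time."""
    mtimes: dict[str, float] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends.
        dirnames[:] = [d for d in dirnames if d not in IGNORED]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.path.getmtime(path)
            except OSError:
                pass  # file vanished between walk and stat
    return mtimes

def changed_files(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """Paths that are new or whose mtime moved since the last poll."""
    return [p for p, t in new.items() if old.get(p) != t]
```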

CLAUDE.md generator (one/claudemd.py):
  - /generate exports rule tree + entities as CLAUDE.md
  - Claude reads it natively on every session start
  - Grouped by context with key files, concepts, tools
  - Endpoint: GET /api/claudemd

VISION.md added with full project roadmap.
6,547 lines across 22 files.
Synthesis (one/synthesis.py):
  - Scans entity graph for cross-domain connections
  - Generates hypotheses via Gemma when unrelated concepts co-occur
  - Recursive deep synthesis builds a DAG of layered insights
  - /synthesize triggers on current project

Deep Research (one/research.py):
  - /research <topic> launches autonomous research loop
  - Structured prompts: findings, open problems, cross-disciplinary, contrarian
  - Extracts findings, builds citation graph, identifies gaps
  - Fills gaps with targeted follow-up, runs synthesis across findings
  - /frontier shows open questions and active research topics

Playbook System (one/playbook.py):
  - Auto-generated after every /auto completion
  - Distills key decisions, reusable patterns, pitfalls via Gemma
  - Recalled by vector similarity on similar future tasks
  - Injected into auto loop context so the same class of problem is never
    solved twice
  - /playbooks lists all with category and recall count

Auto loop now:
  - Injects relevant playbooks before starting
  - Generates playbook on completion

Stats now track syntheses, playbooks, and research topics.
8,166 lines across 25 files.
…, type safety

Upgrade core engines: research.py (iterative deepening, adversarial prompts,
source quality scoring, quantitative extraction), synthesis.py (novelty scoring,
hypothesis testing, contradiction detection), auto.py (reflection checkpoints,
milestone tracking, crash recovery via state serialization).

Expand entity extraction to 10+ types with relationship extraction. Harden
server with API key auth, rate limiting, and CORS. Fix encode_tagged tag vector
norm bug and profanity substring false positives. Resolve 24 Pyright type errors
across core modules.

Add comprehensive test suite: 440 tests across 10 files covering hdc, gate,
entities, store, excitation, rules, research, synthesis, server, and playbook.
Swarm multi-agent orchestration with Conductor and 14 agent roles.
Dialectic chains (thesis→antithesis→synthesis→verification→meta).
Analogical transfer with cross-domain structural isomorphism via HDC.
Contradiction mining with severity levels and resolution tracking.
Self-verifying knowledge engine with confidence lifecycle and source quality.
Active question generation via frontier mapping and information value scoring.
Executable verification engine with code and LLM experiment paths.
Swarm TUI dashboard with sparklines, breakthrough alerts, dialectic panels.
Knowledge health metrics with volume/entities/intelligence/quality/warnings.
Morgoth Mode — 7-phase autonomous research loop with eureka capture.
Foundry audit with quality scoring, duplicate detection, garbage cleanup.

883 tests passing (443 new + 440 existing), zero Pyright errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Research findings auto-inject into Claude's next message context
  via _preloaded_context after research completes
- Added all MIROBEAR commands to ONE_COMMANDS set so they route
  through our system instead of falling through to Claude
- Notification fires when research completes
- Escape now stops current response/auto loop, doesn't exit the app
- Ctrl+Q is the new quit
- /swarm, /morgoth, /health, /audit, /focus, /inject all routed
  through our system — never leak to Claude
- Help text updated with intelligence section showing all new commands
- Swarm, morgoth, health, audit handlers wired to their modules
Auto system prompt now requires:
- Actually running features as a user would, not just writing pytest
- Verifying all imports resolve before declaring done
- Testing through the actual app entry point
- Curling endpoints, running CLI commands, importing from app
- Smoke tests = failure. Real behavior under real conditions.

Removed all orion references from Claude memory system.
17 bug fixes across 8 files:
- audit.py: run_full_audit signature, auto_fix implementation, entity column queries
- health.py: wrong column names (entity_type→type, rules→rule_nodes, created→timestamp)
- research.py: schema migration for 7 missing columns
- entities.py: filter .venv/site-packages/slash commands from file entities
- contradictions.py + verification.py: recall("") zero vector → get_recent()
- app.py: audit result key, help text, /watch backend, command dispatch
- server.py: malformed CORS do_OPTIONS
- client.py: push_entity wrong kwarg (entity_id not accepted by Foundry action)

New: engine.py — Zero Hallucination Engine (1400+ lines)
- AST-parses Python, extracts SQL, checks against live PRAGMA table_info
- Multi-language: Python, C/C99, JS/TS, HTML, CSS, JSON
- Codebase ontology: 483 symbols, 4311 calls, 122 file deps mapped
- Impact analysis: knows what breaks before you break it
- Signature change detection with caller warnings
- Symbol removal detection blocks edits that break callers
- Decision logging + turn logging to knowledge graph
- Session logs at ~/.one/logs/ survive crashes
- verify_edit_with_impact() wired into post-edit hooks
- Post-completion auto-verify after every Claude turn
- Foundry sync pushes ontology through MemoryEntry + Entity
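
The SQL-vs-live-schema check can be sketched against PRAGMA table_info. The real engine.py AST-parses whole Python files and extracts embedded SQL; this sketch checks one statement against one table, with a deliberately crude identifier filter, and assumes the table name is trusted (it is interpolated into the PRAGMA).

```python
# Hedged sketch of the schema check described above: pull column-like
# identifiers out of a SQL string and compare them to what the live
# database reports via PRAGMA table_info.

import re
import sqlite3

def table_columns(conn: sqlite3.Connection, table: str) -> set[str]:
    """Columns the live database actually has (PRAGMA row[1] is the name)."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def unknown_columns(conn: sqlite3.Connection, table: str, sql: str) -> set[str]:
    """Identifiers in the SQL that are neither keywords nor real columns."""
    known = table_columns(conn, table)
    idents = set(re.findall(r"[A-Za-z_]\w*", sql))
    # Crude filter: SQL keywords and the table name are not column refs.
    keywords = {"select", "from", "where", "and", "or", table.lower()}
    return {i for i in idents if i.lower() not in keywords} - known
```

A query referencing a column that was renamed in the live schema (e.g. entity_type vs type, one of the bugs fixed above) would surface here before the edit ships.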

New: ground.py — Ground truth population
- Introspects live DB schemas and runtime module signatures
- Stores verified facts at 0.95-0.99 confidence
- 165 ground truths: schemas, signatures, contracts, traps
- Surfaces in recall when context is relevant
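
Harvesting ground truths from runtime signatures can be done with the inspect module. A sketch under stated assumptions: the function name and the flat fact format are hypothetical, and the real ground.py also introspects DB schemas, which is omitted here.

```python
# Illustrative sketch of "ground truth" harvesting from a live module:
# record the exact runtime signature of every public callable.

import inspect
from types import ModuleType

def harvest_signatures(module: ModuleType) -> dict[str, str]:
    """Map each public callable to its exact runtime signature string."""
    facts: dict[str, str] = {}
    for name, obj in vars(module).items():
        if callable(obj) and not name.startswith("_"):
            try:
                facts[f"{module.__name__}.{name}"] = str(inspect.signature(obj))
            except (TypeError, ValueError):
                pass  # some builtins have no introspectable signature
    return facts
```

Facts harvested this way reflect what the code actually exposes at runtime, which is the point: they are verified, not LLM-asserted.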

New: LOCK.md — Single source of truth
- Exact SQLite schemas, exact class signatures, command wiring
- Known traps, verification checklist

Morgoth rewrite:
- Phases use start_research() and knowledge engines instead of spawning
  Claude subprocesses that silently fail
- Phase.CONTINUOUS now reachable (was unreachable due to _advance_phase bug)
- Swarm/morgoth stored on self (no more GC killing them)
- /stop kills auto + swarm + morgoth
- All phases log to ~/.one/logs/ via engine session logger

TUI:
- Input box: TextArea with word wrap, auto-expand 3-10 lines
- /verify, /ground commands
- Boot sequence: auto map + ground + verify + foundry sync (background)
- Post-completion: re-map + re-verify edited files after every turn
- Foundry sync moved to background thread (non-blocking boot)

Auto prompt: GROUND TRUTH PROTOCOL + LOCK.md reference + ground truth
injection in context gathering
_call_ollama() now tries Claude (via proxy.quick_ask) first, falls back
to Gemma. Every module that calls _call_ollama automatically gets
Claude's brain: research, dialectic, synthesis, contradictions,
analogy, verification, experiments.

Added ClaudeProxy.ask() for synchronous prompt/response and
ClaudeProxy.quick_ask() static method for one-shot questions.
The AifGate was named "Active Inference" but was just a weighted linear
combination of hand-tuned heuristics. This replaces the scoring engine
with an actual Friston free-energy framework:

- Generative model: mixture of learned Regime clusters in HDC vector space
- Variational free energy: F = precision * surprise - log(prior)
- Precision learning: inverse variance per regime, updated online
- Expected free energy: epistemic value of storing (information gain)
- Belief updating: regime centroids shift toward observations
- Regime lifecycle: creation, merging, replacement of topic clusters

The hard noise/redaction filters and content priors are retained as
pre-processing — they handle classification that vector similarity cannot.
The AIF replaces the ad-hoc weighted combination with a principled
decision mechanism where high free energy = genuinely informative.

Key properties:
- First observations are maximally novel (no model yet)
- Repeated topics decrease surprise as regimes tighten
- Precision increases for predictable regimes (routine conversations)
- Novel topics get high epistemic value (would shift beliefs)
- Redundant messages blocked by cosine similarity in recent buffer
- Epistemic safety checks still penalize overconfident LLM output
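
A toy sketch of the F = precision * surprise - log(prior) score, under stated assumptions: surprise is squared distance to the nearest regime centroid, precision is that regime's learned inverse variance, and the prior is its relative weight. The real AifGate operates on HDC vectors with online belief updates; the regime dict layout here is hypothetical.

```python
# Toy free-energy score against a mixture of regime clusters, matching
# the formula above: F = precision * surprise - log(prior).

import math

def free_energy(x: list[float], regimes: list[dict]) -> float:
    """Score an observation against the closest regime; higher = more novel."""
    if not regimes:
        return float("inf")  # no model yet: maximally novel, always store
    total_weight = sum(r["weight"] for r in regimes)

    def sq_dist(r: dict) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, r["centroid"]))

    best = min(regimes, key=sq_dist)          # nearest regime explains x
    surprise = sq_dist(best)                  # prediction error
    prior = best["weight"] / total_weight     # how expected this regime is
    return best["precision"] * surprise - math.log(prior)
```

This reproduces the key properties listed above: with no regimes the score is infinite (first observations are maximally novel), and points far from every centroid score higher than points near a well-established regime.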

https://claude.ai/code/session_01HD6xGRMauZKsgE8ehWXnAd
bcd532 force-pushed the main branch 2 times, most recently from 2fe6262 to ce3e039 on March 17, 2026 at 15:00