feat: add epistemic safety engine to prevent LLM psychosis (#1)
Knowledge graph (one/graph.py):
- d3.js force-directed visualization served at /graph
- Nodes colored by type, sized by observation count
- Click nodes to see linked memories
- Auto-refreshes every 10 seconds
- Endpoints: GET /graph, /api/graph, /api/entity/<name>/memories

Watch mode (one/watch.py):
- /watch [dir] monitors for file changes
- Auto-logs diffs to the memory store with entity linking
- Polls every 2s; ignores .git, venv, and __pycache__
- /unwatch stops monitoring

CLAUDE.md generator (one/claudemd.py):
- /generate exports the rule tree + entities as CLAUDE.md
- Claude reads it natively on every session start
- Grouped by context with key files, concepts, and tools
- Endpoint: GET /api/claudemd

VISION.md added with the full project roadmap. 6,547 lines across 22 files.
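The watch-mode behavior above (poll every 2s, skip .git/venv/__pycache__, report changed files) can be sketched as below. This is a minimal illustration of the polling approach, not the actual one/watch.py API; the function names and the `on_change` callback are hypothetical.

```python
import os
import time

IGNORED = {".git", "venv", ".venv", "__pycache__"}

def snapshot(root):
    """Map each watched file to its mtime, pruning ignored directories."""
    seen = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place pruning stops os.walk from descending into ignored dirs.
        dirnames[:] = [d for d in dirnames if d not in IGNORED]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                seen[path] = os.path.getmtime(path)
            except OSError:
                pass  # file vanished between listing and stat
    return seen

def watch(root, on_change, poll_seconds=2.0, ticks=None):
    """Poll for created/modified/deleted paths; ticks=None runs forever."""
    before = snapshot(root)
    while ticks is None or ticks > 0:
        time.sleep(poll_seconds)
        after = snapshot(root)
        for path in after.keys() | before.keys():
            if before.get(path) != after.get(path):
                on_change(path)  # e.g. diff the file and log to the memory store
        before = after
        if ticks is not None:
            ticks -= 1
```

A real implementation would diff file contents rather than just mtimes, but the prune-then-compare loop is the core of the 2-second poller.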
Synthesis (one/synthesis.py):
- Scans the entity graph for cross-domain connections
- Generates hypotheses via Gemma when unrelated concepts co-occur
- Recursive deep synthesis builds a DAG of layered insights
- /synthesize triggers on the current project

Deep Research (one/research.py):
- /research <topic> launches an autonomous research loop
- Structured prompts: findings, open problems, cross-disciplinary, contrarian
- Extracts findings, builds a citation graph, identifies gaps
- Fills gaps with targeted follow-ups, then runs synthesis across findings
- /frontier shows open questions and active research topics

Playbook System (one/playbook.py):
- Auto-generated after every /auto completion
- Distills key decisions, reusable patterns, and pitfalls via Gemma
- Recalled by vector similarity on similar future tasks
- Injected into the auto loop context, so the same class of problem is never solved twice
- /playbooks lists all playbooks with category and recall count

The auto loop now:
- Injects relevant playbooks before starting
- Generates a playbook on completion

Stats now track syntheses, playbooks, and research topics. 8,166 lines across 25 files.
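"Recalled by vector similarity on similar future tasks" can be sketched as a cosine-similarity top-k lookup over stored playbook vectors. This is an illustrative sketch only; the playbook record shape (`{"vec": ...}`), the threshold, and the function name are assumptions, not the real one/playbook.py interface.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_playbooks(task_vec, playbooks, top_k=3, threshold=0.3):
    """Return the stored playbooks most similar to a new task's vector,
    dropping anything below a minimum-similarity threshold."""
    scored = [(cosine(task_vec, pb["vec"]), pb) for pb in playbooks]
    scored = [(s, pb) for s, pb in scored if s >= threshold]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [pb for _, pb in scored[:top_k]]
```

The threshold matters: without it, a completely unrelated task would still recall its "nearest" playbooks and pollute the auto-loop context.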
…, type safety.

Upgrade core engines:
- research.py: iterative deepening, adversarial prompts, source-quality scoring, quantitative extraction
- synthesis.py: novelty scoring, hypothesis testing, contradiction detection
- auto.py: reflection checkpoints, milestone tracking, crash recovery via state serialization

Expand entity extraction to 10+ types with relationship extraction. Harden the server with API-key auth, rate limiting, and CORS. Fix the encode_tagged tag-vector norm bug and profanity substring false positives. Resolve 24 Pyright type errors across core modules.

Add a comprehensive test suite: 440 tests across 10 files covering hdc, gate, entities, store, excitation, rules, research, synthesis, server, and playbook.
- Swarm multi-agent orchestration with a Conductor and 14 agent roles
- Dialectic chains (thesis → antithesis → synthesis → verification → meta)
- Analogical transfer with cross-domain structural isomorphism via HDC
- Contradiction mining with severity levels and resolution tracking
- Self-verifying knowledge engine with confidence lifecycle and source quality
- Active question generation via frontier mapping and information-value scoring
- Executable verification engine with code and LLM experiment paths
- Swarm TUI dashboard with sparklines, breakthrough alerts, and dialectic panels
- Knowledge health metrics: volume, entities, intelligence, quality, warnings
- Morgoth Mode: 7-phase autonomous research loop with eureka capture
- Foundry audit with quality scoring, duplicate detection, and garbage cleanup
- 883 tests passing (443 new + 440 existing), zero Pyright errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Research findings auto-inject into Claude's next message context via _preloaded_context after research completes
- Added all MIROBEAR commands to the ONE_COMMANDS set so they route through our system instead of falling through to Claude
- A notification fires when research completes
- Escape now stops the current response/auto loop instead of exiting the app
- Ctrl+Q is the new quit
- /swarm, /morgoth, /health, /audit, /focus, /inject are all routed through our system and never leak to Claude
- Help text updated with an intelligence section showing all new commands
- Swarm, morgoth, health, and audit handlers wired to their modules
The auto system prompt now requires:
- Actually running features as a user would, not just writing pytest
- Verifying that all imports resolve before declaring done
- Testing through the actual app entry point
- Curling endpoints, running CLI commands, importing from the app
- Smoke tests alone count as failure; real behavior under real conditions

Removed all orion references from the Claude memory system.
17 bug fixes across 8 files:
- audit.py: run_full_audit signature, auto_fix implementation, entity column queries
- health.py: wrong column names (entity_type→type, rules→rule_nodes, created→timestamp)
- research.py: schema migration for 7 missing columns
- entities.py: filter .venv/site-packages/slash commands from file entities
- contradictions.py + verification.py: recall("") zero vector → get_recent()
- app.py: audit result key, help text, /watch backend, command dispatch
- server.py: malformed CORS do_OPTIONS
- client.py: push_entity wrong kwarg (entity_id not accepted by Foundry action)
New: engine.py — Zero Hallucination Engine (1400+ lines)
- AST-parses Python, extracts SQL, checks against live PRAGMA table_info
- Multi-language: Python, C/C99, JS/TS, HTML, CSS, JSON
- Codebase ontology: 483 symbols, 4311 calls, 122 file deps mapped
- Impact analysis: knows what breaks before you break it
- Signature change detection with caller warnings
- Symbol removal detection blocks edits that break callers
- Decision logging + turn logging to knowledge graph
- Session logs at ~/.one/logs/ survive crashes
- verify_edit_with_impact() wired into post-edit hooks
- Post-completion auto-verify after every Claude turn
- Foundry sync pushes ontology through MemoryEntry + Entity
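The "AST-parses Python, extracts SQL, checks against live PRAGMA table_info" step above can be sketched as follows. This is a deliberately simplified illustration (it only handles `SELECT col, ... FROM table` shaped string literals and assumes a sqlite3 connection); the real engine.py presumably covers far more SQL shapes.

```python
import ast
import re
import sqlite3

def table_columns(conn, table):
    """Live column names from PRAGMA table_info, never from memory."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def check_sql_columns(source, conn, table):
    """AST-parse Python source, find string constants that look like SQL
    against `table`, and flag column names missing from the live schema."""
    live = table_columns(conn, table)
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            m = re.match(r"SELECT\s+(.+?)\s+FROM\s+" + table, node.value, re.I)
            if m:
                for col in (c.strip() for c in m.group(1).split(",")):
                    if col != "*" and col not in live:
                        problems.append(col)
    return problems
```

Checking column names against the live schema rather than against LLM memory is exactly what catches bugs like the `entity_type→type` mismatch fixed earlier in this PR.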
New: ground.py — Ground truth population
- Introspects live DB schemas and runtime module signatures
- Stores verified facts at 0.95-0.99 confidence
- 165 ground truths: schemas, signatures, contracts, traps
- Surfaces in recall when context is relevant
New: LOCK.md — Single source of truth
- Exact SQLite schemas, exact class signatures, command wiring
- Known traps, verification checklist
Morgoth rewrite:
- Phases use start_research() and the knowledge engines instead of spawning Claude subprocesses that silently fail
- Phase.CONTINUOUS now reachable (was unreachable due to _advance_phase bug)
- Swarm/morgoth stored on self (no more GC killing them)
- /stop kills auto + swarm + morgoth
- All phases log to ~/.one/logs/ via engine session logger
TUI:
- Input box: TextArea with word wrap, auto-expand 3-10 lines
- /verify, /ground commands
- Boot sequence: auto map + ground + verify + foundry sync (background)
- Post-completion: re-map + re-verify edited files after every turn
- Foundry sync moved to background thread (non-blocking boot)
Auto prompt: GROUND TRUTH PROTOCOL + LOCK.md reference + ground-truth injection in context gathering
_call_ollama() now tries Claude first (via proxy.quick_ask) and falls back to Gemma. Every module that calls _call_ollama automatically gets Claude's brain: research, dialectic, synthesis, contradictions, analogy, verification, and experiments. Added ClaudeProxy.ask() for synchronous prompt/response and a ClaudeProxy.quick_ask() static method for one-shot questions.
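The Claude-first-with-Gemma-fallback behavior described above can be sketched as a try/fall-through wrapper. The callables are injected here to keep the sketch self-contained; `claude_ask` and `ollama_ask` are hypothetical stand-ins for proxy.quick_ask and the Ollama call, not the real signatures.

```python
def call_llm(prompt, claude_ask, ollama_ask):
    """Try Claude first; fall back to the local Gemma model when the
    proxy is unavailable, errors out, or returns an empty reply."""
    try:
        reply = claude_ask(prompt)
        if reply and reply.strip():
            return reply
    except Exception:
        pass  # proxy down, no active session, etc.
    return ollama_ask(prompt)
```

Treating an empty reply as a miss matters: a proxy that "succeeds" with an empty string would otherwise silently starve every downstream module of output.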
The core problem: LLM-generated speculation was being stored as "knowledge," recalled as context for future prompts, and used as evidence to generate more speculation, creating a self-reinforcing delusion loop where confidence inflated without any empirical grounding.

New module: epistemic_safety.py
- Provenance tracking: every claim tagged with its actual source type
- Confidence ceilings: LLM self-verification capped at 0.60, synthesis at 0.45; deep synthesis decays 0.10 per depth level
- Circular reference detection: blocks LLM output too similar to existing LLM-generated memories (prevents self-reinforcement)
- False certainty detection: flags phrases like "this proves" or "beyond doubt" in LLM outputs and reduces confidence
- Epistemic honesty scoring: rewards hedging and citations, penalizes overconfidence
- Epistemic markers: all LLM-generated content prefixed with provenance tags so future contexts show where claims actually came from

Integrated into:
- synthesis.py: hypotheses get provenance tags, confidence ceilings, and circular-reference blocking; deep synthesis gets stricter limits per depth level
- verification.py: an LLM "verifying" LLM output no longer inflates confidence; claims marked VERIFIED without specific citations are downgraded to UNVERIFIABLE
- research.py: all LLM-sourced findings tagged with provenance warnings
- dialectic.py: "universal patterns" relabeled as LLM-speculated, confidence capped; thesis/antithesis/synthesis all get provenance tracking
- gate.py: false certainty in LLM output penalized during storage scoring
- auto.py: epistemic safety protocol injected into the autonomous loop system prompt; context injection includes provenance warnings

https://claude.ai/code/session_01HD6xGRMauZKsgE8ehWXnAd
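The ceiling-and-decay rule stated above (self-verification capped at 0.60, synthesis at 0.45, deep synthesis losing 0.10 per depth level) can be sketched as a small clamp function. The numeric values come from the commit message; the provenance keys and function name are illustrative, not the real epistemic_safety.py API.

```python
# Ceilings by provenance type (values taken from the commit message)
CEILINGS = {
    "llm_self_verified": 0.60,
    "llm_synthesis": 0.45,
}
DEPTH_DECAY = 0.10  # deep synthesis loses 0.10 confidence per depth level

def cap_confidence(raw, provenance, depth=0):
    """Clamp an LLM-assigned confidence to its provenance ceiling,
    then decay it for each deep-synthesis level below the surface.
    Unknown provenance types get no ceiling (assumed empirical)."""
    ceiling = CEILINGS.get(provenance, 1.0)
    capped = min(raw, ceiling)
    return max(0.0, capped - DEPTH_DECAY * depth)
```

The key property is that no amount of LLM self-assessment can push a claim's stored confidence above its provenance ceiling, which is what breaks the inflation loop.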
The AifGate was named "Active Inference" but was just a weighted linear combination of hand-tuned heuristics. This replaces the scoring engine with an actual Friston free-energy framework:
- Generative model: a mixture of learned Regime clusters in HDC vector space
- Variational free energy: F = precision * surprise - log(prior)
- Precision learning: inverse variance per regime, updated online
- Expected free energy: epistemic value of storing (information gain)
- Belief updating: regime centroids shift toward observations
- Regime lifecycle: creation, merging, and replacement of topic clusters

The hard noise/redaction filters and content priors are retained as pre-processing; they handle classification that vector similarity cannot. The AIF replaces the ad-hoc weighted combination with a principled decision mechanism where high free energy means genuinely informative.

Key properties:
- First observations are maximally novel (no model yet)
- Repeated topics decrease surprise as regimes tighten
- Precision increases for predictable regimes (routine conversations)
- Novel topics get high epistemic value (they would shift beliefs)
- Redundant messages are blocked by cosine similarity against a recent buffer
- Epistemic safety checks still penalize overconfident LLM output

https://claude.ai/code/session_01HD6xGRMauZKsgE8ehWXnAd
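The scoring formula above, F = precision * surprise - log(prior), can be made concrete with a small sketch. The choice of surprise as squared distance to the nearest regime centroid, precision as the regime's inverse variance, and prior as its mixture weight follows the bullet list, but every name and the exact parameterization here is an assumption, not the real gate implementation.

```python
import math

def free_energy(surprise, precision, prior):
    """Variational free energy as stated above:
    F = precision * surprise - log(prior).
    High F means the observation is genuinely informative."""
    return precision * surprise - math.log(prior)

def score_observation(distance_to_centroid, regime_variance, regime_weight):
    """Sketch of scoring one observation against its nearest regime:
    surprise = squared distance to the centroid, precision = inverse
    variance (tight regimes predict confidently), prior = mixture weight."""
    precision = 1.0 / max(regime_variance, 1e-6)  # guard against zero variance
    return free_energy(distance_to_centroid ** 2, precision, regime_weight)
```

This reproduces the listed properties: a far-from-centroid observation in a rare regime scores high (store it), while a near-centroid observation in a dominant regime scores low (routine, skip it).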