Audit remediation + autonomous skill dispatcher + LLM decision capture#1
Remove unused typing.Any import in gateway/persistent_agent_router.py.
Simple pure helper that classifies a user turn as "positive"/"negative"/None based on explicit feedback markers. Used by the upcoming turn-boundary record_outcome hook and the reflection NULL-outcome backfill to close the signal loop.
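The marker heuristic can be sketched roughly like this (the marker lists and the negative-wins tie-break are illustrative, not the actual HERMES patterns):

```python
import re
from typing import Optional

# Illustrative feedback markers; the real pattern set lives in HERMES.
_POSITIVE = re.compile(r"\b(thanks|thank you|perfect|great|that worked)\b", re.I)
_NEGATIVE = re.compile(r"\b(wrong|broken|didn't work|useless)\b", re.I)

def infer_outcome_from_turn(user_text: str) -> Optional[str]:
    """Classify a user turn as 'positive'/'negative' on an explicit
    marker; return None for neutral/unknown so the DB column stays NULL."""
    if _NEGATIVE.search(user_text):
        return "negative"   # negative markers win ties in this sketch
    if _POSITIVE.search(user_text):
        return "positive"
    return None
```

Keeping the helper pure (no DB access) is what lets both the turn-boundary hook and the reflection backfill share it.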
…_turn Before this, record_outcome had zero callers in run_agent.py — the outcome column on sessions was only ever set by gateway/telegram reaction handlers, leaving the CLI path (and most sessions) permanently NULL. Now every completed turn tries to infer a coarse positive/negative signal from user feedback markers and persists it via DecisionDB.record_outcome. Wrapped in try/except so outcome recording can never break the turn.
Sessions that stay NULL for more than AGED_NULL_OUTCOME_DAYS (3) never received a reaction signal and are unlikely to. Run the turn-boundary heuristic over the last user message and, when it yields a confident label, persist it so subsequent reflection cycles can actually learn from the session instead of bucketing it under "no outcome recorded". Neutral/unknown cases are left NULL by design.
…eline Exercises all three Phase 1 pieces composed: infer_outcome_from_turn -> SessionDB.record_outcome -> reflection._query_sessions backfill + bucket assignment. Uses a temp state.db via the isolate_hermes_home conftest fixture and monkeypatches reflection._state_db_path to point at it. Also fixes a stale DecisionDB class reference in outcome_signals.py docstring (the class is actually SessionDB).
…loop Calling asyncio.run() from within a thread that already owns a running event loop (the gateway dispatches handle_message via an executor) raises RuntimeError. Mirror the thread-pool bridge pattern used in agent/context_references.py so the Anthropic image fallback works in both sync CLI paths and gateway concurrency.
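A minimal sketch of that bridge pattern, with a hypothetical `_describe_image` coroutine standing in for the real Anthropic fallback call:

```python
import asyncio
import concurrent.futures

async def _describe_image(url: str) -> str:
    # Placeholder for the real async Anthropic fallback request.
    await asyncio.sleep(0)
    return f"description of {url}"

def describe_image_sync(url: str) -> str:
    """Safe whether or not the calling thread owns a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread (plain CLI path): asyncio.run is fine.
        return asyncio.run(_describe_image(url))
    # A loop is already running (gateway executor thread): asyncio.run
    # would raise, so hop to a fresh thread that owns no loop.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, _describe_image(url)).result()
```

The `.result()` call blocks the calling thread, which is acceptable here because that thread is already an executor worker, not the loop thread itself.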
asyncio.get_event_loop().run_until_complete() inside a coroutine (or any thread with a live loop) is a hard crash under Python 3.12's stricter event-loop policy. Make gather_reflection_input an async function that awaits _try_compile_context directly, and have run_reflection call it rather than duplicating the inline session/tool-usage/memory gather. Top-level cron entry (run_reflection_job) still owns the only asyncio.run() call — the ticker dispatches it from a worker thread with no running loop, which is the supported pattern.
Exercises the two async-in-sync hazards Phase 2 fixed: * _describe_image_for_anthropic_fallback called directly from a live event loop (plus 10-way asyncio.to_thread fan-out) — would raise "asyncio.run() cannot be called from a running event loop" without the thread-pool bridge. * gather_reflection_input awaited from inside a running loop — would raise via get_event_loop().run_until_complete() without the async conversion. Verified the suite fails against HEAD~2 (pre-fix) and passes against HEAD.
Adds classify_task() heuristic that routes delegate invocations to the cheapest compile mode: skip-compile for self-contained tasks, technical/full compile for debugging, user/fast for preferences, and a default fast compile otherwise. Wired into invoke() so the right context reaches the subagent without always hitting the broad full-compile path.
Adds PersistentDelegateTool.invoke_batch: a single broad compile against the joined tasks, then per-subagent slicing of the returned decisions via token-overlap scoring. Each subagent's invoke() accepts a precompiled CompiledContext so the broad-compile output can be passed through without re-hitting /api/compile. Same-project guard: mixed-project batches fall back to the unchanged per-task parallel path to avoid cross-project context leakage. The natural run_agent.py fan-out site (_execute_tool_calls_concurrent) uses a thread pool over sync tool handlers; wiring the async batch path through it is out of scope for this commit. Callers and the integration test exercise invoke_batch directly; the TTL cache in the next commit absorbs the same fan-out when the batch path isn't used.
…ntent Adds _drop_redundant_compiled: for each compiled decision, compute token-overlap against the tail of the parent's recent messages; drop decisions already carried by the conversation (>=80% token overlap). If the whole compile is redundant, zero it out so the prompt block doesn't nag the model with duplicates. Wired into invoke() via parent_agent._session_messages or an explicit recent_messages kwarg.
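A crude illustration of the token-overlap filter (the function names, bag-of-words tokenization, and 5-message tail window are hypothetical simplifications of `_drop_redundant_compiled`):

```python
def _token_overlap(a: str, b: str) -> float:
    """Fraction of a's tokens that also appear in b (crude bag-of-words)."""
    ta = set(a.lower().split())
    if not ta:
        return 0.0
    tb = set(b.lower().split())
    return len(ta & tb) / len(ta)

def drop_redundant(decisions: list[str], recent_messages: list[str],
                   threshold: float = 0.8) -> list[str]:
    """Drop compiled decisions already carried by the conversation tail."""
    tail = " ".join(recent_messages[-5:])  # tail window size is illustrative
    return [d for d in decisions if _token_overlap(d, tail) < threshold]
```

If the returned list is empty, the whole compile block is omitted, matching the "zero it out" behavior described above.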
In-memory dict keyed on (project_id, sha256(task)[:16], fast_mode, namespace) with a 300s TTL. Consulted before provider.compile() and populated with non-degraded results only (avoids pinning the agent in degraded mode past the 5m window). Absorbs N-subagent fan-out when invoke_batch isn't used.
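A minimal sketch of such a cache — the key shape and 300s TTL follow the description above; the class name and `degraded` flag are illustrative:

```python
import hashlib
import time

class CompileCache:
    """In-process TTL cache consulted before provider.compile()."""

    def __init__(self, ttl: float = 300.0):
        self._ttl = ttl
        self._store: dict[tuple, tuple[float, object]] = {}

    @staticmethod
    def _key(project_id: str, task: str, fast_mode: bool, namespace: str):
        # (project_id, sha256(task)[:16], fast_mode, namespace)
        digest = hashlib.sha256(task.encode()).hexdigest()[:16]
        return (project_id, digest, fast_mode, namespace)

    def get(self, project_id, task, fast_mode, namespace):
        key = self._key(project_id, task, fast_mode, namespace)
        hit = self._store.get(key)
        if hit and time.monotonic() - hit[0] < self._ttl:
            return hit[1]
        self._store.pop(key, None)  # expired or missing
        return None

    def put(self, project_id, task, fast_mode, namespace, result,
            degraded=False):
        if degraded:
            return  # never pin degraded output for the full 5m window
        key = self._key(project_id, task, fast_mode, namespace)
        self._store[key] = (time.monotonic(), result)
```

Hashing the task keeps keys bounded in size while still distinguishing distinct task texts.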
tests/test_task_classifier.py: pure unit coverage for classify_task (self-contained skip, technical/user namespace routing, default). tests/integration/test_multiagent_routing.py: spawns 3 heterogeneous delegate tasks via invoke_batch, asserts exactly 1 compile call reaches the mock HIPP0, each subagent gets a task-relevant slice, user_facts propagate to every subagent, and a separate case that verifies the 5m TTL cache absorbs a repeat compile.
Skills can now be auto-created from reflection proposals, capped at 1 per cycle and gated by an evidence eval: the candidate must be anchored to at least one NEGATIVE-outcome session in the lookback window that mentions a topic token from the proposed skill name/hint. Proposals without prior-failure evidence are logged as skill_eval_gate_failed and skipped.
…delta After a skill is created by the reflection gate, wire it into the outcome ledger so its value can be measured:
- Baseline: same-topic outcomes in the 7d BEFORE creation are written as kind='baseline' rows in a new skill_outcomes table (created on demand).
- Matches: up to 3 recent sessions whose first-user-message contains a topic token are written as kind='match' rows.
- New record_skill_outcome_for_session hook lets the outcome pipeline append kind='post' rows so an A/B delta can be computed later.
An A/B summary (baseline total / positive / negative / ratio) is appended to the reflection log at registration time.
Add _propose_unused_skill_deprecation(), called once per reflection cycle. Cross-references skill SKILL.md mtimes against the reflection input's tool_usage map: skills with mtime >30d old whose name tokens do not appear in any recently-used tool are logged as skill_deprecation_proposal entries in the reflection log. Never auto-deletes — a human reviews the log.
Add _prune_reflection_log() invoked at the end of each reflection cycle. Reads the per-agent reflection_log.jsonl, drops JSON entries whose timestamp is older than REFLECTION_LOG_RETENTION_DAYS (180), and rewrites atomically via a .tmp sibling. Fail-open for malformed lines or entries missing a timestamp — we keep them rather than silently destroying rows we can't parse.
… + 2m cooldown Prevent pileup on dead HIPP0 by short-circuiting compile() to degraded-mode while OPEN. CLOSED -> OPEN after 3 unavailable events inside a 60s sliding window; HALF_OPEN probe after 2m cooldown; CLOSE on probe success; re-OPEN on probe failure.
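The state machine can be sketched as follows (thresholds from the description above; the injectable clock is for testing only):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after 3 failures in a 60s sliding window;
    HALF_OPEN probe allowed after a 2m cooldown; probe success closes,
    probe failure re-opens."""
    WINDOW, THRESHOLD, COOLDOWN = 60.0, 3, 120.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures: list[float] = []
        self._opened_at = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # CLOSED
        # OPEN: only let a probe through once the cooldown has elapsed
        return self._clock() - self._opened_at >= self.COOLDOWN

    def record_failure(self) -> None:
        now = self._clock()
        if self._opened_at is not None:
            self._opened_at = now  # probe failed: re-OPEN, restart cooldown
            return
        self._failures = [t for t in self._failures if now - t < self.WINDOW]
        self._failures.append(now)
        if len(self._failures) >= self.THRESHOLD:
            self._opened_at = now  # CLOSED -> OPEN

    def record_success(self) -> None:
        self._failures.clear()
        self._opened_at = None  # probe succeeded: CLOSE
```

While `allow_request()` is False, compile() short-circuits straight to degraded mode instead of queueing on the dead backend.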
4xx replay failures are contract bugs — don't silently drop. Move them to dead_letter.jsonl next to pending.jsonl with timestamp, status, and error body for operator inspection. Dead-lettered entries are never retried. Adds `hermes wal status` CLI showing per-agent WAL depth, dead-letter depth, and oldest entry age.
Prepend `[STALE MEMORY: last successful compile {N}m ago]` to the rendered compile block when the circuit breaker is OPEN or the last successful compile is older than 30 minutes, so the model knows recall may be out of date.
The `f.get("key") or f.get("fact_key")` fallback masked a HIPP0 contract bug: legacy `fact_key`/`fact_value` entries kept rendering under the current endpoint shape. Require `key` strictly; log-and-drop malformed entries so contract violations surface.
Cap the compile-context fetch in gather_reflection_input at 5s so a slow/dead HIPP0 can't stall the reflection pipeline. On timeout, log and proceed without compiled context.
Bump schema to v9: rebuild messages_fts with session_id, role and timestamp UNINDEXED so session/role filters can be served directly from the FTS layer instead of round-tripping to the messages table. Migration drops the old virtual table and triggers, recreates them with the new column set, and repopulates from messages. Add a 5-minute TTL in-process cache on list_sessions_rich() for the top-10 recent sessions hot path (offset=0, limit<=10). Dashboard polling absorbs the churn without hammering SQLite.
Add TrajectoryCompressor.compress_many_async(entries) for ad-hoc batch callers: runs process_entry_async for every entry concurrently behind an asyncio.Semaphore(10) via asyncio.gather, preserving input order. Caps outbound LLM summarization fan-out independently of max_concurrent_requests (which governs the full-directory pipeline).
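The order-preserving bounded fan-out reduces to a standard pattern — `asyncio.gather` returns results in argument order even though the semaphore reorders execution (sketch; names follow the description above):

```python
import asyncio

async def compress_many_async(entries, process_entry_async, limit: int = 10):
    """Run process_entry_async over every entry concurrently, with at
    most `limit` in flight, returning results in input order."""
    sem = asyncio.Semaphore(limit)

    async def _one(entry):
        async with sem:
            return await process_entry_async(entry)

    # gather preserves argument order regardless of completion order
    return await asyncio.gather(*(_one(e) for e in entries))
```

Because the semaphore lives inside this function, the cap is independent of whatever concurrency limit governs the full-directory pipeline.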
…_rough Route all ad-hoc `len(x) // 4` token-estimation sites through the existing `agent.model_metadata.estimate_tokens_rough` / `estimate_messages_tokens_rough` helpers so capacity, compression, and cost-estimate math share one source of truth. Sites swapped:
- trajectory_compressor.count_tokens() tokenizer fallback
- agent/prompt_builder.estimate_prompt_tokens()
- agent/hipp0_memory_provider degraded-compile fallback
- tools/skills_tool._estimate_tokens()
- gateway/platforms/web_platform (transcript/soul/input/output estimates)
- scripts/sample_and_compress tokenizer fallback
Adds tests/test_token_estimation.py with fixture messages and sensible-range assertions.
…+ CI gate Phase 10 final gate. Extends tests/integration/test_closed_loop.py to cover the full task -> subagent -> compile -> outcome inference -> record_outcome -> reflection backfill -> recompile re-ranks chain end-to-end, using a FakeHipp0Provider that records every call and simulates the hipp0-side trust-multiplier effect to prove HERMES-side wiring. Adds two failure-mode tests:
- record_outcome silently dropped -> second compile keeps baseline ranking
- provider.record_outcome raises -> caller surfaces the error
Adds a dedicated closed-loop CI job to .github/workflows/tests.yml so a regression fails with a clear signal independent of the main test suite.
_drain_wal() read the WAL into memory then rewrote it with write_text() — a concurrent _wal_append between those two steps would be silently clobbered. Serialize both operations under an asyncio.Lock and use atomic replace for rewrites. WAL and dead_letter.jsonl hold full conversation payloads but were created with umask defaults (commonly 0o644), exposing memory to other local users on shared hosts. Create both via os.open() with mode 0o600. Tests added:
- test_wal_files_are_mode_0o600 — permission regression
- test_drain_rewrite_preserves_concurrent_append — race regression
prune_sessions_older_than() issued one DELETE FROM messages + one DELETE FROM sessions per session_id in a Python loop. Replace with two IN-list statements so a 10k-session prune collapses from 20k round-trips to 2.
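The batched delete is a straightforward pair of parameterized IN-lists (table and column names assumed from the description). One caveat: very large ID lists can exceed SQLite's per-statement parameter limit on older builds, so a real implementation may need to chunk:

```python
import sqlite3

def prune_sessions(conn: sqlite3.Connection, session_ids: list) -> None:
    """Delete messages and sessions for the given IDs in two statements
    instead of 2*N single-row round-trips."""
    if not session_ids:
        return
    placeholders = ",".join("?" * len(session_ids))
    conn.execute(
        f"DELETE FROM messages WHERE session_id IN ({placeholders})",
        session_ids)
    conn.execute(
        f"DELETE FROM sessions WHERE session_id IN ({placeholders})",
        session_ids)
    conn.commit()
```

Only the placeholder count is interpolated into the SQL; the IDs themselves stay bound parameters, so there is no injection surface.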
…ter) Pure-CPU microbenches on the per-turn classifiers: perf_counter sampler with baseline JSON in tests/bench/budgets.json, 1.4x tolerance. CI runs serial (no xdist) so worker variance doesn't pollute p95. HERMES_BENCH_UPDATE=1 reseeds the baseline.
…comes Handler aggregates the routing-outcomes JSONL and returns per-class decision_count / outcome distribution / positive_rate. Auth-gated via the existing _check_auth path so it respects HERMES_API_TOKEN. Consumed by Phase 13's nightly threshold-tuning job and on-demand dashboards; returns 503 when tools.routing_outcomes can't be imported (e.g. trimmed-down distribution) rather than taking the server down.
Adds FaultyHipp0Provider with switchable compile/record faults:
- compile: hipp0_500, circuit_open, budget_exceeded (BudgetExceeded)
- record: wal_full (OSError ENOSPC), circuit_open
Three new parametrized tests:
1. compile faults must raise a distinguishable exception the turn loop can match on (no silent swallowing).
2. record_outcome faults on WAL / circuit must propagate typed errors, while the local SessionDB record path stays functional so the turn loop keeps making progress with the remote down.
3. recovery test: after a transient compile fault, the next compile succeeds (guards against sticky-failure regressions).
13/13 closed-loop tests green (8 existing + 5 new).
…or passive decision capture Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… priority ordering
…r as fallback Adds agent/skills/llm_adapter.py (AuxiliaryLLMAdapter) bridging the dispatcher's minimal LLMClient Protocol to hermulti's auxiliary_client primitives. In run_agent.py, the existing regex-only decision-signal capture now first tries the SkillDispatcher (fire-and-forget OUTBOUND_MESSAGE event) and falls back to extract_decision_signals when the dispatcher is disabled or no LLM is configured.
…rd matrix SyncError
All in-scope test directories (bench, cron, e2e, environments, honcho_plugin, plugins) pass cleanly. Document the 4 cron tests that skip when the optional croniter package is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ateway progress topics
- test_models_dev: add autouse fixture to save/restore the module-global _models_dev_cache. It was being overwritten with SAMPLE_REGISTRY, breaking downstream opencode-go detection via list_authenticated_providers.
- test_run_progress_topics: pre-load tools.terminal_tool so the tool registry is populated before gateway.run emits progress. Previously the emoji depended on test ordering (registry loaded vs not).
- test_internal_event_bypass_pairing: redirect gateway.pairing.PAIRING_DIR to tmp_path so the _rate_limits.json state does not leak between tests via the real ~/.hermes/platforms/pairing directory.
Companion test to hipp0ai e2e scenario 04. Wires a RecordingLLM returning a record_decision action and a RecordingProvider, then dispatches an OUTBOUND_MESSAGE and asserts the action propagates to provider.record_decision() with the expected payload. Lives under tests/integration/ so the fast unit loop can skip it via collection-time filtering; relies on /root/audit/hipp0ai/skills/ being mounted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Hipp0MemoryProvider.record_decision was POSTing {title, content, ...} to /api/decisions. hipp0 rejects this: the route requires description (not content) and project_id. Without these, every captured decision silently swallowed a 400 in the try/except and returned False.
- test_multi_turn_conversation used a literal string for session_id and `content` on the decision payload. hipp0 enforces that session_id is a UUID, so the test always failed. Register a hermes agent, start a real session to obtain a UUID, and use `description` on the decision.
Wipes tools.approval module-level dicts (_gateway_queues, _gateway_notify_cbs, _session_approved, _permanent_approved, _pending) before and after every approval-related test in tests/gateway/ so xdist workers cannot observe torn state from sibling runs. tools/approval.py already serializes every mutation through _lock, so thread-safety of the module itself is already covered. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
save_permanent_allowlist() iterates via list(patterns) outside the lock. If another thread calls approve_permanent() during that iteration, CPython raises "Set changed size during iteration". Three call sites (check_dangerous_command, check_all_command_guards x2) previously passed the live set; now they copy under _lock first. Completes the thread-safety audit started in 5cda48f. The autouse isolation fixture already covers cross-test state pollution under xdist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
All hermulti-side work for the hipp0 GBrain-tier program. Scope expanded from initial audit remediation to include Phase 2 (signal detector) and Phase 5 (autonomous skill dispatcher, Tier B).
Audit remediation (earlier commits)
record_outcome, outcome_signals.py, reflection NULL-outcome backfill, closed-loop integration tests
Phase 2: signal detector (1 commit)
feat(signals): extract_decision_signals() + DecisionSignal dataclass added to agent/outcome_signals.py; record_decision() added to Hipp0MemoryProvider; wired into the turn loop with a try/except fallback. Regex-based (5 patterns for explicit decisions, rejections, confidence inference). 5 tests.
Phase 5: autonomous skill dispatcher (Tier B, 5 commits)
Upgrades from regex-only signal capture to a full skill execution framework driven by an LLM with structured-output actions.
- feat(skills): SkillLoader parses RESOLVER.md + SKILL.md files. Custom YAML frontmatter parser (no external dep). Reads from HIPP0_SKILLS_DIR or defaults to /root/audit/hipp0ai/skills.
- feat(skills): TriggerMatcher — regex-first matching of events against skill triggers, with optional LLM-classifier fallback for ambiguous events. Pre-compiles trigger phrases to EventType tags and literal-substring regexes.
- feat(skills): SkillRunner — builds the LLM prompt from skill body + event, parses structured JSON actions, dispatches to hipp0_provider.record_decision / record_outcome / log / noop. Fully typed via LLMClient and Hipp0ProviderProto Protocols — no hard coupling to concrete classes.
- feat(skills): SkillDispatcher orchestrator with priority ordering: brain-ops READ phase fires FIRST on PRE_TASK events; signal-detector runs in PARALLEL (fire-and-forget, never blocks) on INBOUND/OUTBOUND messages; brain-ops WRITE phase fires LAST on POST_DECISION / POST_OUTCOME.
- feat(skills): Turn loop integration — _get_skill_dispatcher() lazy-init helper on AIAgent. When enabled, dispatches SkillEvent(OUTBOUND_MESSAGE) via fire-and-forget asyncio.create_task(). Regex fallback preserves Phase 2 semantics when the dispatcher is off.
- Uses hermulti's existing auxiliary_client.py via AuxiliaryLLMAdapter (prefers async clients, falls back to sync call_llm through a thread executor).
Config flags
- HIPP0_SKILL_DISPATCHER=auto|on|off (default: auto, enabled iff an LLM is configured)
- HIPP0_SKILL_LLM_PROVIDER=anthropic|codex|openai-codex (default: probe all)
- HIPP0_SKILLS_DIR (default: /root/audit/hipp0ai/skills)
Test plan
- pytest tests/skills/ — 122 tests covering loader (5), matcher (9), runner (8), dispatcher (8) + integration
- pytest tests/agent/test_decision_signals.py — 5 Phase 2 regex tests
- py_compile run_agent.py parses cleanly after the integration edit
Companion PR
Paired with hipp0ai/#3 which adds the hipp0-side GBrain-tier work (phases 1-5).