v1.0.0: Live monitoring + circuit breakers #8

@Siddhant-K-code

Problem

All current agent-trace commands are post-hoc. You analyze a session after it finishes. But the most expensive failures are the ones you could have stopped mid-session:

  • Agent retrying the same failing command 15 times
  • Cost spiraling past $20 on a simple task
  • Agent writing to production config files
  • Infinite loop: read → edit → test → fail → read → edit → test → fail

You need real-time monitoring with the ability to kill a runaway session.

Proposal

agent-strace watch

Real-time session monitoring with configurable circuit breakers. Watches the active session's event stream and triggers alerts or kills the session when thresholds are exceeded.

Built-in watchers:

| Watcher | Trigger | Default |
| --- | --- | --- |
| RetryWatcher | Same command fails N times | 5 retries |
| CostWatcher | Estimated cost exceeds threshold | $10 |
| ScopeWatcher | Agent touches file outside .agent-scope.json policy | any violation |
| DurationWatcher | Session exceeds time limit | 30 minutes |
| LoopWatcher | Same sequence of 3+ events repeats N times | 3 repetitions |
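
As a rough illustration of the LoopWatcher trigger, the sketch below checks whether the most recent events are one sequence of at least `sequence_length` items repeated `max_repeats` times back to back. It assumes each event has already been reduced to a short string such as "read" or "edit". That reduction, the `observe` method name, and the `history` cap are all hypothetical and not part of the proposal.

```python
from collections import deque


class LoopWatcher:
    """Flags when the latest events are one sequence repeated back to back."""

    def __init__(self, sequence_length=3, max_repeats=3, history=256):
        self.sequence_length = sequence_length  # minimum length of the repeating unit
        self.max_repeats = max_repeats          # repetitions needed to trigger
        self.history = deque(maxlen=history)    # bounded so memory stays constant

    def observe(self, kind):
        """Record one event kind; return the repeating sequence as a tuple, or None."""
        self.history.append(kind)
        events = list(self.history)
        # Longest unit that could still fit max_repeats times in the history.
        longest = len(events) // self.max_repeats
        for length in range(self.sequence_length, longest + 1):
            window = events[-length * self.max_repeats:]
            pattern = window[:length]
            # The window is a loop if every item matches the pattern cyclically.
            if all(window[i] == pattern[i % length] for i in range(len(window))):
                return tuple(pattern)
        return None
```

With the defaults, feeding read, edit, test three times in a row returns `("read", "edit", "test")` on the ninth event. Note that nine identical events in a row also count as a loop under this definition, which may or may not be the intended behavior.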

Alert actions:

  • terminal — print warning to stderr
  • file — append to .agent-traces/alerts.log
  • webhook — POST JSON to a URL (Slack, PagerDuty, etc.)
  • kill — send SIGTERM to the agent process
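
One possible way to wire up the two local alert actions is a small lookup table from action name to handler. The function names, the `[watch]` prefix on terminal output, and the `dispatch` helper are assumptions made for illustration. The webhook and kill actions are left out here.

```python
import sys
from pathlib import Path


def alert_terminal(message, stream=None):
    """Print a warning line to stderr (or to a supplied stream, e.g. in tests)."""
    target = stream if stream is not None else sys.stderr
    print(f"[watch] {message}", file=target)


def alert_file(message, path=".agent-traces/alerts.log"):
    """Append one line per alert, creating the log directory if it is missing."""
    log = Path(path)
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as handle:
        handle.write(message + "\n")


# Hypothetical registry mapping the action names from the list above to handlers.
ACTIONS = {"terminal": alert_terminal, "file": alert_file}


def dispatch(action, message, **kwargs):
    """Run the handler registered for `action`; reject unknown action names."""
    try:
        handler = ACTIONS[action]
    except KeyError:
        raise ValueError(f"unknown alert action: {action!r}") from None
    handler(message, **kwargs)
```

Keeping handlers behind a name lookup means the config file can refer to actions by string without the watchers knowing how each one is delivered.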

CLI

# Quick start with defaults
agent-strace watch

# Custom thresholds
agent-strace watch --max-retries 3 --max-cost 5 --max-duration 10m --alert terminal

# Kill on violation
agent-strace watch --max-retries 3 --on-violation kill

# Config file
agent-strace watch --config .agent-watch.json
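
The flags above could be parsed with `argparse` roughly as follows. This is a sketch only: the defaults are taken from the built-in watcher table, while the `choices` lists, the default for `--alert`, the `alert` value for `--on-violation`, and the `s`/`h` duration suffixes are assumptions (the proposal only shows `10m`).

```python
import argparse
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings such as '10m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def build_parser():
    """Build a parser exposing only the proposed `watch` subcommand."""
    parser = argparse.ArgumentParser(prog="agent-strace")
    sub = parser.add_subparsers(dest="command", required=True)
    watch = sub.add_parser("watch", help="monitor the active session")
    watch.add_argument("--max-retries", type=int, default=5)
    watch.add_argument("--max-cost", type=float, default=10.0)
    watch.add_argument("--max-duration", type=parse_duration, default=30 * 60)
    watch.add_argument("--alert", choices=["terminal", "file", "webhook"],
                       default="terminal")  # assumed default
    watch.add_argument("--on-violation", choices=["alert", "kill"],
                       default="alert")     # assumed values
    watch.add_argument("--config", default=None)
    return parser
```

Storing the duration in seconds after parsing keeps DurationWatcher free of any string handling.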

Config file format (.agent-watch.json)

{
  "watchers": {
    "retry": { "max": 3, "alert": "terminal" },
    "cost": { "max_dollars": 5.0, "alert": "terminal" },
    "scope": { "policy": ".agent-scope.json", "alert": "kill" },
    "duration": { "max_minutes": 15, "alert": "terminal" },
    "loop": { "sequence_length": 3, "max_repeats": 3, "alert": "file" }
  },
  "webhook": {
    "url": "https://hooks.slack.com/services/...",
    "events": ["cost", "scope"]
  }
}
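
A loader for this file might merge each watcher's settings over the built-in defaults, so a config only needs to list what it changes. The sketch below assumes that watchers missing from the file keep their defaults and that the default alert action is `terminal`. Neither is stated in the proposal, and `load_config` is a hypothetical name.

```python
import json

# Defaults from the built-in watcher table; the "alert" defaults are an assumption.
DEFAULTS = {
    "retry": {"max": 5, "alert": "terminal"},
    "cost": {"max_dollars": 10.0, "alert": "terminal"},
    "scope": {"policy": ".agent-scope.json", "alert": "terminal"},
    "duration": {"max_minutes": 30, "alert": "terminal"},
    "loop": {"sequence_length": 3, "max_repeats": 3, "alert": "terminal"},
}
VALID_ALERTS = {"terminal", "file", "webhook", "kill"}


def load_config(path):
    """Read .agent-watch.json and return settings with defaults filled in."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)

    configured = raw.get("watchers", {})
    unknown = set(configured) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown watchers: {sorted(unknown)}")

    watchers = {}
    for name, defaults in DEFAULTS.items():
        merged = {**defaults, **configured.get(name, {})}
        if merged["alert"] not in VALID_ALERTS:
            raise ValueError(f"{name}: unknown alert action {merged['alert']!r}")
        watchers[name] = merged

    # The webhook section is optional; None means no webhook is configured.
    return {"watchers": watchers, "webhook": raw.get("webhook")}
```

Rejecting unknown watcher names and alert actions at load time surfaces typos before a session starts rather than silently disabling a circuit breaker.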

How it works

  1. Tail the active session's events.ndjson file (or hook into the event callback pipeline in hooks.py)
  2. Each new event is passed through all registered watchers
  3. Watchers maintain state (retry count, running cost, event history for loop detection)
  4. When a watcher triggers, execute the configured alert action
  5. For kill: find the agent PID from session metadata and send SIGTERM
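
Steps 2 to 4 can be sketched as a loop that hands each event to every watcher and forwards any returned message to an alert callback. The event shape used here (a dict with `command` and `exit_code` keys), the `observe` method, and `process_event` are assumptions for illustration. The real event schema in events.ndjson may differ.

```python
class RetryWatcher:
    """Counts consecutive failures per command and reports at the threshold."""

    def __init__(self, max_retries=5):
        self.max_retries = max_retries
        self.failures = {}  # command string -> consecutive failure count

    def observe(self, event):
        """Return an alert message when a command has failed too often, else None."""
        command = event.get("command")
        if command is None:
            return None  # not a command event
        if event.get("exit_code", 0) == 0:
            self.failures.pop(command, None)  # a success resets the count
            return None
        count = self.failures.get(command, 0) + 1
        self.failures[command] = count
        if count >= self.max_retries:
            return f'RetryWatcher: "{command}" failed {count} times'
        return None


def process_event(event, watchers, on_alert):
    """Pass one event through all watchers and forward any alerts."""
    for watcher in watchers:
        message = watcher.observe(event)
        if message is not None:
            on_alert(watcher, message)
```

In this sketch a watcher keeps alerting on every further failure once the threshold is reached. Whether alerts should fire once or repeatedly is left open by the proposal.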

Terminal output (live)

[watch] Monitoring session abc123...
[watch] +0:30  ✅ 5 events, $0.12, 0 retries
[watch] +1:05  ⚠️  RetryWatcher: "python -m pytest" failed 3 times
[watch] +2:30  ⚠️  CostWatcher: $5.12 (threshold: $5.00)
[watch] +3:00  ❌ LoopWatcher: detected loop (read→edit→test) × 4 — killing session
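
The periodic status line shown above could be produced by a small formatter such as this one. `format_status` and its parameters are hypothetical names, and the layout simply mirrors the sample output.

```python
def format_status(elapsed_seconds, events, cost, retries):
    """Render the healthy-session status line with elapsed time as +M:SS."""
    minutes, seconds = divmod(int(elapsed_seconds), 60)
    return (
        f"[watch] +{minutes}:{seconds:02d}  "
        f"✅ {events} events, ${cost:.2f}, {retries} retries"
    )
```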

Implementation

  • New file: src/agent_trace/watch.py
  • Event tailing: poll events.ndjson with os.stat for size changes, read new lines
  • Webhook: stdlib urllib.request (no requests library)
  • Process kill: os.kill(pid, signal.SIGTERM)
  • New tests: tests/test_watch.py (~9 tests)
    • test_retry_detection
    • test_cost_threshold
    • test_scope_violation
    • test_duration_limit
    • test_loop_detection
    • test_alert_terminal
    • test_alert_webhook
    • test_config_file_loading
    • test_session_kill
  • CLI: add watch subcommand to cli.py
  • Depends on: cost.py for cost estimation (#3, "v0.4.0: Session explain + cost estimation") and audit.py for scope checking (#7, "v0.7.0: Permission audit trail")
  • Zero new dependencies
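
The event-tailing bullet could look roughly like the following: compare the file size from `os.stat` with the last read offset and parse only the bytes added since. `EventTail` and `read_new_lines` are hypothetical names, and the handling of truncation and of half-written trailing lines is an assumption about what a robust version would need.

```python
import json
import os


class EventTail:
    """Incrementally reads newly appended NDJSON lines from a growing file."""

    def __init__(self, path):
        self.path = path
        self.offset = 0      # bytes already consumed
        self.partial = b""   # trailing bytes of a line not yet terminated

    def read_new_lines(self):
        """Return events appended since the last call (possibly an empty list)."""
        try:
            size = os.stat(self.path).st_size
        except FileNotFoundError:
            return []  # session has not written anything yet
        if size < self.offset:
            # File shrank (truncated or replaced): start again from the top.
            self.offset = 0
            self.partial = b""
        if size == self.offset:
            return []
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            chunk = handle.read(size - self.offset)
        self.offset += len(chunk)
        lines = (self.partial + chunk).split(b"\n")
        # The last element is either b"" or an incomplete line; keep it for later.
        self.partial = lines.pop()
        return [json.loads(line) for line in lines if line.strip()]
```

A watch loop would call `read_new_lines()` and then `time.sleep()` for a short interval. Buffering the incomplete last line avoids a JSON parse error when the agent is halfway through writing an event.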

Constraint

Python stdlib only. HTTP via urllib.request. Process management via os and signal.
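
To show that the constraint is workable, here is a minimal sketch of the webhook and kill actions using only `urllib.request`, `os`, and `signal`. The JSON payload shape is an assumption (a Slack incoming webhook, for instance, expects its own field names), and the function names are hypothetical.

```python
import json
import os
import signal
import urllib.request


def build_webhook_request(url, watcher, message):
    """Build (but do not send) a JSON POST request describing one alert."""
    body = json.dumps({"watcher": watcher, "message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_webhook(url, watcher, message, timeout=5.0):
    """POST the alert and return the HTTP status code."""
    request = build_webhook_request(url, watcher, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def kill_session(pid):
    """Send SIGTERM to the agent; return False if the process is already gone."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    return True
```

Separating request construction from sending makes the webhook path testable without network access, which fits the proposed test_alert_webhook test.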
