👉 Try the live demo — read-only, seeded with realistic incidents, no signup
AI-powered incident response for DevOps teams. Connects to your monitoring stack via MCP to investigate incidents, scan for problems before they alert, and deliver structured RCA reports with evidence — automatically. Skip the part where you tab between five dashboards at 3am trying to figure out what broke.
The Operations Desk — live service catalog, investigation log, recent scan runs, and event rail in one view.
- Automated RCA pipeline — 6-phase investigation: prefetch → anomaly detection → planning → parallel evidence gathering (metrics + logs + infra + changes) → synthesis → report
- Proactive scanning — cron-driven probe evaluates PromQL and LogQL rules across every configured service and kicks off headless investigations when thresholds trip. Four-track evaluator covers availability, pod-restart storms, log-error bursts, and custom rules
- AI service discovery — guided setup wizard walks your Prometheus and Loki stack, with live progress (discover → validate → review) and a one-click Accept to populate the service catalog with canonical metrics, log labels, and per-service + stack-wide probe rules. Headless `npm run discover` available for CI
- MCP-agnostic providers — pluggable architecture. Wire in Grafana, Kubernetes, GitLab, Coroot, or any MCP-compatible backend and assign roles (metrics, logs, infrastructure, changes, dependencies) in config
- Four trigger sources — operator messages in chat, Alertmanager webhooks, health-poller transitions, and scheduled scans
- Notifications — deliver every completed investigation to Slack and/or email. Per-recipient severity threshold and source allowlist (webhook / scan / poller / manual). Teams-safe HTML email
- Operations Desk — live catalog of services with health chips, investigation stream, recent scan runs, and an event rail in one view
- Activity center — unified `/activity` route with four tabs (Investigations, Scans, Patterns, Events). Shared filter-bar idiom, URL-driven state, paginated history, and a 30-day persistent event feed
- Multi-stack — run prod, staging, and dev side-by-side in one deployment. Each stack has its own providers, services, probe rules, and investigation history
- Web UI + CLI — real-time progress over WebSocket, or a terminal REPL (Ink) with tool call visibility
- LLM resilience — every model call retries with exponential backoff on transient provider blips (HTTP 408/409/429/5xx, connection errors). When the LLM stays down, investigations fail loudly with the actual reason instead of spinning forever. Tunable via `llm.retry` in config
- Deploy anywhere — single Docker image, Helm chart for Kubernetes, or run `npm run web` behind your own process manager
```bash
npm install
cp config.yaml.example config.yaml   # then edit
export OPENAI_API_KEY=sk-...
npm run web                          # port 3000
```

Open http://localhost:3000. The setup wizard walks you through Connect Provider → Discover Services → Monitor — point it at your Grafana MCP server and the AI populates the service catalog from your Prometheus labels. Headless equivalent: `npm run discover`.
For the full configuration reference (providers, scan rules, webhooks, notifications, SMTP), see the Ops Runbook or config.yaml.example.
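For orientation, here is a minimal sketch of what a config might look like. The key names below are illustrative assumptions, not the canonical schema; `config.yaml.example` is authoritative.

```yaml
# Illustrative sketch; field names are assumptions, config.yaml.example is canonical.
llm:
  model: gpt-4o              # any OpenAI-compatible model (assumed key)
  retry:                     # the llm.retry knobs mentioned above (names assumed)
    maxAttempts: 5
    initialDelayMs: 1000
providers:
  - name: grafana
    url: http://grafana-mcp:8000/mcp   # your Grafana MCP endpoint (shape assumed)
    roles: [metrics, logs]
scan:
  enabled: true
  schedule: "*/5 * * * *"    # cron cadence for the proactive probe
```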
User messages are classified by an intent router. Questions go to a chat agent; incident reports trigger the investigation workflow.
Six phases. Evidence gathering (metrics, logs, infra, changes) runs in parallel for speed. Each agent gets only the MCP tools relevant to its role, so the metrics agent never sees log query tools and vice versa.
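That role scoping comes straight from the provider config. A hedged sketch of how roles might be assigned (shape assumed; the Provider YAML Spec has the real format):

```yaml
providers:
  - name: grafana
    roles: [metrics, logs]       # metrics and logs agents get Grafana's MCP tools
  - name: kubernetes
    roles: [infrastructure]      # the infra agent sees only k8s tools
  - name: gitlab
    roles: [changes]             # the change-analysis agent sees repo/deploy history
```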
Investigations can start four main ways (a fifth trigger, the K8s event poller, is summarized in the comparison table below).
Type a message like "admin-task is returning 500 errors since 4pm". The intent router detects an incident report and launches the full investigation pipeline with your message as context. Results stream back in real time.
Generate a bearer token in Settings → Alert Webhooks, then point Alertmanager at the webhook URL shown for that stack. The default stack uses `POST /api/webhook/alert`; non-default stacks use `POST /api/webhook/alert/<stack-slug>`. The handler validates the stack-scoped bearer token, dedupes recent alerts, extracts service/severity/labels, merges with the service's known metrics and log selectors from `services.yaml`, and runs a headless investigation.
```yaml
# alertmanager.yml
receivers:
  - name: alert-assistant
    webhook_configs:
      - url: http://assistant:3000/api/webhook/alert/<stack-slug>
        http_config:
          authorization:
            type: Bearer
            credentials: "<your-token>"
```

Investigation depth (quick / standard / full) can be mapped per severity in config.
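For example, a per-severity depth mapping might look like this (key names assumed; see the Ops Runbook for the real ones):

```yaml
webhook:
  depthBySeverity:    # assumed key name
    critical: full
    warning: standard
    info: quick
```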
A background poller queries Prometheus every 60 seconds for deployment replica counts and `up` metrics. When a service transitions from healthy to down, it auto-fires a quick investigation. No extra config — the poller uses the service registry and each service's known selectors.
A cron-scheduled probe walks every service on the cadence you set and evaluates four tracks of rules:
- Global rules — stack-wide availability written by the discovery agent, aware of whichever label key your stack actually uses (`app`, `service`, `job`, `deployment`)
- Per-service metric rules — discovery-written thresholds like pod-restart storms, using each service's real Kubernetes namespace
- Per-service log rules — LogQL `count_over_time(... |= "error")` scoped to the service's real Loki labels
- Config-file defaults — hardcoded fallback rules from `config.yaml` when discovery hasn't run
Each rule has hysteresis (consecutive-tick counters) so a single flap doesn't fire a scan. When a rule trips, the probe spawns a headless investigation and the scan run lands on the Operations Desk with status, phase breakdown, and links to each child investigation.
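To make the rule shape concrete, here is a hedged sketch of a custom rule. The field names are assumptions; `config.yaml.example` has the real schema.

```yaml
scan:
  rules:
    - name: error-burst
      type: logql
      expr: 'count_over_time({app="admin-task"} |= "error" [5m]) > 50'
      forTicks: 3       # hysteresis: rule must trip on 3 consecutive ticks
      depth: standard   # investigation depth when the rule fires
```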
Every tick creates a durable ScanRun record at `/scan/runs/:id` — copy the link, download as PNG or Markdown, or fire it to Slack with one click.
A scan run that dispatched two investigations — the 3-phase probe/triage/investigate breakdown is preserved forever.
| Trigger | Context | Depth | Requires |
|---|---|---|---|
| Operator | High (natural language + time refs) | Configurable | Nothing extra |
| Alert webhook | Medium (alert labels + service config) | Per-severity template | Alertmanager config |
| Health poller | Medium (transition info + service config) | Quick | Prometheus provider |
| K8s event poller | Medium (pod restart + reason + service config) | Standard | Infrastructure (k8s) MCP provider |
| Proactive scan | Medium (rule trigger + service config) | Configurable per rule | `scan.enabled: true` |
Every investigation gets a shareable URL, a phase rail that streams live, and a structured RCA report with root cause, contributing factors, timeline, evidence (metrics + logs + infra + changes), and recommended actions.
Everything that happens in the system — investigations, scan runs, learned patterns, and lifecycle events — lives under a single `/activity` route, split into four tabs that share a filter-bar idiom (search, severity pills, status toggles, time-window shortcuts, URL-driven state).
Investigations — every investigation, filterable by severity / status / service / time. URL-driven filters mean bookmarks and browser history Just Work.
Scans — every probe tick the scheduler ran, with trigger (cron / manual / webhook), status, hits dispatched, and a deep link to each run's Probe → Triage → Investigate breakdown. Click into a run from anywhere it's referenced.
Patterns — the learned-pattern catalog. Every confirmed RCA contributes to a service's pattern library, scoped by severity + service, with drill-down to the source investigation that taught it.
Events — the persistent system feed. Investigation lifecycle (investigation_started / _completed / _failed), alert webhooks, scan-run completions, manual scan triggers, and provider health crossings — all backed by a 30-day retention window and filterable by kind / severity / service / time.
Every completed investigation can be delivered to Slack and email. Recipients are filtered independently on two axes — minimum severity and allowed trigger source — so each inbox only sees what it wants.
Slack (via incoming webhook): per-investigation summary posts, plus optional run-level scan summaries (always / hits-only / off).
Email (via SMTP): Teams-safe HTML body that renders the full RCA report — severity banner, summary, root cause with confidence, contributing factors, timeline, evidence (metrics + logs + infra + changes), recommended actions, and a deep link back to the investigation. Plain-text fallback included. Works with Microsoft Teams channel email addresses.
Manage recipients at Settings → Notifications in the UI — add, edit, toggle, and send a fixture RCA through the real pipeline with the per-row Test button.
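Roughly, the recipient shape (field names assumed; the Settings → Notifications UI is the supported way to manage these):

```yaml
notifications:
  recipients:
    - channel: slack
      webhookUrl: https://hooks.slack.com/services/...   # incoming webhook (assumed key)
      minSeverity: warning                # severity threshold
      sources: [webhook, scan]            # trigger-source allowlist
    - channel: email
      address: oncall@example.com         # Teams channel addresses work too
      minSeverity: critical
      sources: [webhook, scan, poller, manual]
```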
The default landing page is a live SOC-style console: health strip, service catalog with status chips, investigation log, recent scan runs with a one-click Scan now button, and an event stream rail. Drilling into a service opens a tabbed detail view (metrics, history, dependencies, AI brief).
Live demo: wz.github.io/dops-assistant — fully interactive, hosted on GitHub Pages, no signup. Click into any investigation, browse the service catalog, drill into a scan run. Mutations are disabled (it's static), but every read path is real.
Want to run the same thing yourself?
Locally (Node server):

```bash
npm install
npm run seed:demo   # writes fixture data to data-demo/
npm run demo        # boots with DEMO_MODE=true on port 3000
```

As a static site (GitHub Pages — zero infra, zero cost):

```bash
npm run build:demo-static   # SPA + seed + static JSON snapshots
npx serve dist/web --single # any static server works
```

The static build is what the deploy-demo GitHub Actions workflow ships to Pages — see demo/README.md for the one-time setup (a single toggle: Settings → Pages → Source: GitHub Actions). All mutating endpoints are disabled, no LLM calls are made, and no real infrastructure is touched.
- Docker — single image, mount `config.yaml` and `services.yaml`, pass `OPENAI_API_KEY`
- Helm — chart at `deploy/helm/dops-assistant`. Supports sub-path ingress via `APP_BASE_PATH`, SMTP creds via `extraEnvFrom` on an existing Secret, and ingress WebSocket timeout annotations for the ~60s LLM silent-thinking phase
- Process manager — `npm run build:web && npm run web` behind systemd, pm2, or your stack of choice
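A hedged values.yaml sketch for the Helm options above. Only `APP_BASE_PATH` and `extraEnvFrom` are named in this README; the surrounding keys are assumptions, so check the chart's own values.yaml.

```yaml
env:
  APP_BASE_PATH: /assistant        # serve under a sub-path ingress (assumed placement)
extraEnvFrom:
  - secretRef:
      name: smtp-credentials       # existing Secret holding SMTP creds
ingress:
  annotations:
    # keep the WebSocket open through the ~60s LLM silent-thinking phase
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
```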
- Architecture Overview — system design, component details, data flow, design decisions
- Ops Runbook — MCP setup, full config reference, tuning, troubleshooting
- Email Notifications Setup — SMTP, Teams tenant rules, GUI walkthrough
- Provider YAML Spec — writing custom MCP providers
- Changelog — release history
```bash
npm run web                     # web server (loads dev/.env, port 3000)
npm run cli                     # terminal REPL
npm run build:web               # build frontend (Vite → dist/web/)
npm run discover                # run AI service discovery
npm run test:discover-eval      # score discovery output quality (CI gates at 75/100)
npx tsx src/eval/rca-eval.ts    # score RCA report quality
npx vitest run                  # run tests (100+ files)
npx tsc --noEmit                # type check
```

Contributions welcome. Please open an issue first to discuss what you'd like to change.