notifications: preliminary UI work#306
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Notification destination SSRF guard is not enforced at the actual egress boundary
[MEDIUM] Production notification overlay does not pass Grafana credentials to fleet-api
NotesThe new notification RPCs are registered behind the shared auth/validation interceptor chain, are session-only, and are permission-mapped. I did not find pool/wallet/stratum hijacking changes in the reviewed diff; the “pool” references are alert templates, nav, or existing permission context. Generated by Codex Security Review | |
9534768 to
e52b9ea
Compare
7811814 to
4370492
Compare
5ee6d9f to
a4625a5
Compare
4370492 to
8d92a42
Compare
494025b to
ba75406
Compare
Addresses the Codex security review on PR #306: - RBAC bypass (HIGH): add notification:read / notification:manage to the authz catalog and gate every notifications route through middleware.RequirePermission. The handler now loads effective permissions after session validation, mirroring the Connect auth interceptor. Reads sit on :read; all mutations including TestChannel sit on :manage. - Org-wide silences via empty scopes (HIGH): validate silence scopes in the domain layer — every scope kind now requires its target (rule_id / group_id / site_id / device_ids). The UI hides the group/site/device scope options until their pickers exist. - Channels not routed to alerts (HIGH): made the UI explicit that saved channels are not yet attached to alert routing; policy-tree routing lands as a follow-up. - Secret leakage in Grafana error logs (HIGH): redact authorization_credentials, smtpPassword, and other credential fields from the request/response bodies logged on non-2xx Grafana responses. - SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy rejects destinations that are or resolve to loopback, link-local, private, or unspecified addresses; opt out via FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev stacks (enabled in the dev compose overlay). - Updates wiping stored secrets (MEDIUM): UpdateChannel now carries the existing settings' secret field into the PUT payload when the request doesn't include a fresh secret, so renames and destination edits no longer clear webhook credentials or SMTP passwords. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-ups from the Codex re-review on PR #306: - DNS rebinding caveat (HIGH, partial): destination DNS lookups now fail closed — an unresolvable host is rejected instead of waved through. The rebinding residual (re-resolution at Grafana delivery time) is documented on checkDestinationHost; a hard guarantee needs egress enforcement at the Grafana container's network boundary. - Webhook URLs in error logs (MEDIUM): "url" added to the redacted log keys — webhook URLs routinely embed capability tokens (Slack, PagerDuty) in the path or query. - Silence regex injection (MEDIUM): device ids in device-scoped silences are validated against the identifier alphabet, escaped with regexp.QuoteMeta, and compiled into an anchored regex matcher so a crafted id like ".*" can't widen the silence org-wide. matchersToScope strips the anchors/escapes on reads. - ADMIN backfill (MEDIUM): migration 000087 seeds notification:read / notification:manage and backfills them onto existing ADMIN roles — the additive reconciler intentionally never re-asserts seed keys on existing built-ins (mirrors 000069). - Grafana admin creds in deployment (MEDIUM, partial): the deployment overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes precedence over basic auth in the client) so operators can move fleet-api onto a least-privilege service-account token; the admin fallback remains until a token is configured. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Second-round review findings addressed in 72d9d9e: [HIGH] SSRF policy bypassable via DNS rebinding — Partially mitigated: DNS lookups now fail closed (an unresolvable destination is rejected rather than allowed), per the reviewer's minimum recommendation. The fundamental rebinding window can't be closed from fleet-api — Grafana re-resolves the hostname at delivery time — so the residual risk is documented on [MEDIUM] Webhook URLs leak into error logs — [MEDIUM] Device-scope silence regex injection — Device ids are now validated against the identifier alphabet ( [MEDIUM] Existing ADMIN roles not backfilled — Correct catch: the additive reconciler never re-asserts seed keys on existing built-ins. Migration [MEDIUM] Grafana admin creds in deployment overlay — The overlay now passes [MEDIUM] Generated Connect API not served — Intentionally deferred, as documented in the handler header: the hand-written JSON routes are the interim surface until the 🤖 Generated with Claude Code |
Follow-ups from the Codex re-review on PR #306: - Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed capability tokens in the path/query, so reads (reachable by notification:read) now return them host-only (scheme://host). UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as unchanged, mirroring the secret-carry path; a manage user submitting a fresh full URL overrides it. - Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError messages reach the browser via err.Error(), and update payloads now carry stored secrets, so a Grafana error echoing the request body could leak them. The message is redacted with the same key-based redactor used for logs. - Hand-rolled JSON routes bypass validation/body limits (MEDIUM): added http.MaxBytesReader (1 MiB) on the body, and enforce the proto's name (255) / comment (1024) / device-id-count (500) bounds in the hand-written path until the generated Connect handlers with the validate interceptor take over. DNS rebinding (MEDIUM) is already mitigated at the application layer (fail-closed resolution) and documented; the network-boundary fix is infra. HistoryService-outside-proto (MEDIUM) is part of the tracked generated-handler migration. Both noted on the PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Third-round findings addressed in 0b131db: [MEDIUM] Webhook URLs leak to read-only users — Good catch, consistent with the logger change. Reads now return webhook URLs host-only ( [MEDIUM] Raw Grafana error bodies returned to clients — [MEDIUM] Hand-rolled routes bypass validation/body limits — Added [MEDIUM] DNS rebinding bypasses SSRF checks — Unchanged from round two: mitigated at the application layer (fail-closed resolution) and documented on [MEDIUM] HistoryService outside the proto contract — Acknowledged; this is part of the same generated-handler migration the handler header documents. When 🤖 Generated with Claude Code |
|
Fourth-round review — risk is back at HIGH on two findings. Triage: [HIGH] Channel destination updates preserve old secrets — Valid, and a regression from my round-1 secret-carry. This is already fixed in the working tree (in-progress on the branch): the secret carry is now gated on the destination being unchanged — a webhook URL or SMTP host/port change without a fresh secret drops the stored credential instead of delivering the old bearer token / SMTP password to the new destination. A rename or recipients-only edit still preserves it. It'll land with the next push of the in-progress channel work on this branch. [HIGH] DNS rebinding bypasses SSRF check — Unchanged position: this cannot be closed in fleet-api. Pinning the resolved IP and handing Grafana an IP instead of the hostname would break TLS SNI / the Host header for real destinations (Slack, PagerDuty). The validation is an application-level backstop (fail-closed resolution, documented on [MEDIUM] Grafana error redaction passes non-JSON bodies through — Fixed in 988f58a. [MEDIUM] Saved channel tests ignore the saved channel ID — Valid; [MEDIUM] History endpoint outside the proto contract — Same disposition as prior rounds: part of the tracked hand-written-JSON → generated-Connect-handler migration documented in the handler header. When 🤖 Generated with Claude Code |
|
Fifth-round review (MEDIUM). Two fixed in 97ca093, three deferred with consistent rationale: [MEDIUM] Grafana JSON error bodies leak secrets in generic string fields — Fixed. My round-4 redaction was key-based only, so a secret echoed inside a value (e.g. [MEDIUM] Fleet API always receives the Grafana admin password — Fixed in the deployment overlay. It previously hard-required [MEDIUM] Hand-written JSON not Connect/protobuf-compatible & [MEDIUM] History endpoint missing from proto — Both are the same tracked item: the hand-written JSON routes are the documented interim surface (see the handler header) until the [MEDIUM] SSRF is only a pre-resolution check (DNS rebinding) — Unchanged: not closable in fleet-api without breaking TLS SNI/Host for real destinations. Application-level fail-closed validation stays as the backstop; the guarantee needs egress policy at the Grafana container boundary or a pinning delivery proxy. Flagged for human risk-acceptance + infra. (Noted the CGNAT range suggestion — worth folding into the host check if/when we extend it.) 🤖 Generated with Claude Code |
|
Sixth-round review (MEDIUM). One fixed, two are the same tracked migration item: [MEDIUM] Saved channel tests don't use the saved destination — Fixed in c587607. This was the round-4 item I'd deferred while Caveat noted in the finding: a saved Slack channel still can't be fully tested because Grafana stores the Slack URL as a secure field and doesn't return it on reads — same fundamental limitation as any stored secret, and not regressed by this change (it failed before too). Fully fixing it would require fleet-api to keep a parallel copy of the secret, which the design deliberately avoids. [MEDIUM] Manual wire format diverges from generated protobuf & [MEDIUM] History endpoint not in proto — Same tracked item as prior rounds: the hand-written JSON routes are the documented interim surface until the The DNS-rebinding item moved to the reviewer's Notes this round ("deployment should enforce Grafana egress at the network layer") — agreed, that's the resolution: infra-level egress, not a fleet-api code change. 🤖 Generated with Claude Code |
|
Seventh-round review — converged. No new actionable findings this round; both remaining items are ones already dispositioned over prior rounds, and I'm not making code changes for them (flagging here so it's explicit rather than re-answering each re-review):
All concrete, code-level findings from rounds 1–6 (RBAC bypass, org-wide silences, secret-log/error leakage, SSRF pre-checks, secret-reuse-on-destination-change, secret redaction depth, token-only Grafana auth, saved-channel testing) are addressed and on the branch. The two items above are an infra task and a tracked refactor respectively — both appropriate to handle outside this preliminary-UI PR. I'll continue watching and will only follow up if a future re-review surfaces something genuinely new. 🤖 Generated with Claude Code |
Addresses the Codex security review on PR #306: - RBAC bypass (HIGH): add notification:read / notification:manage to the authz catalog and gate every notifications route through middleware.RequirePermission. The handler now loads effective permissions after session validation, mirroring the Connect auth interceptor. Reads sit on :read; all mutations including TestChannel sit on :manage. - Org-wide silences via empty scopes (HIGH): validate silence scopes in the domain layer — every scope kind now requires its target (rule_id / group_id / site_id / device_ids). The UI hides the group/site/device scope options until their pickers exist. - Channels not routed to alerts (HIGH): made the UI explicit that saved channels are not yet attached to alert routing; policy-tree routing lands as a follow-up. - Secret leakage in Grafana error logs (HIGH): redact authorization_credentials, smtpPassword, and other credential fields from the request/response bodies logged on non-2xx Grafana responses. - SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy rejects destinations that are or resolve to loopback, link-local, private, or unspecified addresses; opt out via FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev stacks (enabled in the dev compose overlay). - Updates wiping stored secrets (MEDIUM): UpdateChannel now carries the existing settings' secret field into the PUT payload when the request doesn't include a fresh secret, so renames and destination edits no longer clear webhook credentials or SMTP passwords. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-ups from the Codex re-review on PR #306: - DNS rebinding caveat (HIGH, partial): destination DNS lookups now fail closed — an unresolvable host is rejected instead of waved through. The rebinding residual (re-resolution at Grafana delivery time) is documented on checkDestinationHost; a hard guarantee needs egress enforcement at the Grafana container's network boundary. - Webhook URLs in error logs (MEDIUM): "url" added to the redacted log keys — webhook URLs routinely embed capability tokens (Slack, PagerDuty) in the path or query. - Silence regex injection (MEDIUM): device ids in device-scoped silences are validated against the identifier alphabet, escaped with regexp.QuoteMeta, and compiled into an anchored regex matcher so a crafted id like ".*" can't widen the silence org-wide. matchersToScope strips the anchors/escapes on reads. - ADMIN backfill (MEDIUM): migration 000087 seeds notification:read / notification:manage and backfills them onto existing ADMIN roles — the additive reconciler intentionally never re-asserts seed keys on existing built-ins (mirrors 000069). - Grafana admin creds in deployment (MEDIUM, partial): the deployment overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes precedence over basic auth in the client) so operators can move fleet-api onto a least-privilege service-account token; the admin fallback remains until a token is configured. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-ups from the Codex re-review on PR #306: - Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed capability tokens in the path/query, so reads (reachable by notification:read) now return them host-only (scheme://host). UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as unchanged, mirroring the secret-carry path; a manage user submitting a fresh full URL overrides it. - Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError messages reach the browser via err.Error(), and update payloads now carry stored secrets, so a Grafana error echoing the request body could leak them. The message is redacted with the same key-based redactor used for logs. - Hand-rolled JSON routes bypass validation/body limits (MEDIUM): added http.MaxBytesReader (1 MiB) on the body, and enforce the proto's name (255) / comment (1024) / device-id-count (500) bounds in the hand-written path until the generated Connect handlers with the validate interceptor take over. DNS rebinding (MEDIUM) is already mitigated at the application layer (fail-closed resolution) and documented; the network-boundary fix is infra. HistoryService-outside-proto (MEDIUM) is part of the tracked generated-handler migration. Both noted on the PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fourth-round Codex finding: redactSecrets passed non-JSON upstream error bodies through verbatim, and that string is both logged and returned to the browser via GrafanaError. A Grafana/proxy HTML or plaintext error that echoes the request payload could leak bearer tokens, SMTP passwords, or secret-bearing webhook URLs — key-based redaction can't reach secrets in unstructured text. - redactSecrets now returns a length marker for non-JSON input instead of the raw body. - The client-facing GrafanaError message is the generic status text for non-JSON responses; only Grafana's own structured (JSON) error is surfaced, after key redaction. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fifth-round Codex security findings:
- Grafana JSON error bodies can leak secrets in generic string fields
(MEDIUM): key-based redaction misses a secret Grafana echoes inside
a value like {"message":"failed to POST to https://hooks/.../TOKEN"}.
redactValue now scrubs webhook-URL and bearer-token substrings out
of every string value, wherever they sit — applied to both the log
body and the client-facing GrafanaError message.
- Fleet API always receives the Grafana admin password (MEDIUM): the
deployment overlay hard-required GRAFANA_ADMIN_PASSWORD for
fleet-api even when a service-account token is configured, so a
fleet-api compromise leaked the admin credential. The basic-auth
password now defaults empty and is an explicit opt-in fallback; a
token deployment no longer carries the admin credential in
fleet-api's environment. (Needs a companion run-fleet.sh change to
set the password for no-token deployments — flagged on the PR.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sixth-round Codex finding (also a regression surfaced by the read-time URL redaction): TestChannel ignored Channel.ID and always probed the request payload. Since reads now hand the UI a redacted destination (webhook URLs host-only, Slack URLs omitted), clicking "Test" on a saved channel hit the wrong target — or a host root that isn't the real destination. - When an id is supplied, TestChannel now resolves the owned Grafana contact point (ownership-checked via findOwnedChannel) and tests its stored settings — the full URL and Grafana's secure fields. An unsaved definition (no id) still validates and tests the request payload for the "Test before save" flow. - The handler now surfaces TestChannel's service error (unknown/foreign id, invalid destination, Grafana unreachable) as a 4xx/5xx instead of discarding it, distinct from a delivery failure reported in the body. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
c587607 to
223114e
Compare
Codex finding: the value-level bearer redaction used a narrow character class (Bearer\s+[A-Za-z0-9._-]+), so a base64 / base64url / JWT token containing +, /, =, ~, or : was only partially scrubbed — the suffix after the first unmatched character leaked into fleet-api logs and the browser-facing error. Switch to a delimiter-based pattern (Bearer\s+[^\s"']+) that consumes the whole token up to the next whitespace or quote, matching the URL pattern's approach. Tests cover base64, JWT, and punctuation-bearing tokens. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
New re-review surfaced one genuinely new, actionable finding — fixed in 497d229: [MEDIUM] Bearer-token redaction can leak token suffixes — Valid bug in the round-5 value-level redaction I added. The other two findings are the previously-converged items — proto/Connect wire-contract divergence and HistoryService not in proto — both part of the tracked generated-handler migration (hand-written JSON is the documented interim surface; the sole consumer today is the hand-written TS client). No code change here; dispositioned as before. 🤖 Generated with Claude Code |
Codex finding: the dashboard unconditionally rendered ActiveNotificationsCard, whose HistoryTable fetches notification history on mount. The server gates that RPC on notification:read, so any authenticated user without it (e.g. FIELD_TECH, custom roles) hit a guaranteed 403 and an error state on the main dashboard. - Render the dashboard notification card only when the user holds notification:read. - Add requiredPermission: "notification:read" to the /settings/notifications nav item so the link is hidden from users who can't use the page. Server-side RBAC was already correct; this fixes the client-side UX regression. Gating the manage-only mutation controls (Add/Test/Delete, pause/resume) on notification:manage is a follow-up better done alongside the in-flight notifications UI work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
One genuinely new finding addressed in 1ae32d8: [LOW] Dashboard fetches notification history for users without
Server-side RBAC was already correct; this is purely the client UX gap. The third sub-recommendation — gating the manage-only controls (Add/Test/Delete channel, pause/resume rule, add silence) on [MEDIUM] RPC routes don't match the protobuf contract — Unchanged disposition: the tracked generated-handler migration (hand-written JSON is the documented interim surface; sole consumer is the hand-written TS client). No churn. 🤖 Generated with Claude Code |
…JSON Addresses the recurring security-review finding that the notifications routes spoke a hand-written snake_case/lowercase-enum JSON contract divergent from the committed protobuf, and that HistoryService was served but undeclared in the proto. Proto: - Add HistoryService + ListNotifications messages + NotificationHistoryEntry so the history endpoint is part of the declared contract, and regenerate Go + TS bindings. Server: - Replace the hand-written http.Handler + wire DTOs with generated notificationsv1connect ChannelService/RuleService/SilenceService/ HistoryService implementations that convert protobuf <-> domain models. - Mount them through the shared interceptor chain (li): authentication, buf.validate request validation, error mapping, request-log redaction. - Register all 13 procedures in ProcedurePermissions (notification:read / notification:manage), the contract test's registeredServices, SessionOnlyProcedures (preserving the prior session-only behaviour), and RedactedRequestProcedures (channel mutations carry secrets). All security logic — org scoping, secret redaction, SSRF destination checks, silence scope/device-id validation, secret-carry-on-update, saved-channel testing — is unchanged: it lives in the domain service, which the new handlers call exactly as before. Client: - Talk to the generated Connect clients; map protobuf <-> the existing view types at the API boundary so the store and components are untouched. Removes the bespoke fetch/JSON wrapper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Done — the notifications routes now use the generated Proto/Connect wire-contract divergence — The hand-written snake_case fields + lowercase enum tokens are gone. The server mounts generated HistoryService not in the proto — Added How it's wired:
Verified: server build + 🤖 Generated with Claude Code |
Codex finding: ListNotifications keyset-paginates by (organization_id, id < cursor ORDER BY id DESC), but the only indexes on notification_history are on received_at — so each history page filters/scans rows, and the dashboard fetches this for every notification:read user. Add idx_notification_history_org_id (organization_id, id DESC) so each page is an index range scan. New migration; the deployed 000083 is left untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
The migration landed cleanly — this round's Notes confirm the new RPCs are session-only, RBAC-gated, on the shared interceptor chain, and the proto-divergence/HistoryService findings are gone. One genuinely new finding, fixed: [MEDIUM] History pagination lacks a matching index — Fixed in 50b913b. [MEDIUM] DNS-rebinding SSRF — Unchanged disposition: this is the standing infra-level item. fleet-api keeps its fail-closed pre-resolution check as an advisory backstop, but the hard guarantee has to be an egress boundary at the Grafana container (firewall/proxy blocking private/link-local/loopback/host-gateway/metadata) — the reviewer's recommendation, and not a fleet-api code change. Flagged for infra/risk-acceptance. 🤖 Generated with Claude Code |
Extract channelDestinationFields() — the kind/webhook/smtp/slack transformation was copy-pasted across createChannel, updateChannel, and testChannel. Pure cleanup; no behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
a5a1b58 to
3c5818f
Compare
3c5818f to
d25296b
Compare
…ioned admin credential The production notifications overlay left FLEET_METRICS_GRAFANA_PASSWORD defaulting to empty, so without a manually-provided token or password fleet-api fell back to admin with an empty password and every Grafana-proxying notification RPC 401'd while the stack still came up healthy. Default the basic-auth password to the GRAFANA_ADMIN_PASSWORD that run-fleet.sh already provisions for Grafana; a service-account token (FLEET_METRICS_GRAFANA_TOKEN) still takes precedence and avoids handing fleet-api the admin credential. Addresses Codex security review finding (MEDIUM) on #306. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Superseded by a 5-PR stack that splits this work into reviewable layers:
The stack tip is byte-identical to this branch, plus the fix for the [MEDIUM] Codex finding (prod overlay now defaults The outstanding [HIGH] Codex finding (Grafana egress SSRF enforcement) is not addressed here and should be tracked separately against #455. Closing this in favor of the stack. |
UI work expose notifications controls and alerting sinks.