notifications: preliminary UI work by illegalprime · Pull Request #306 · block/proto-fleet

illegalprime · 2026-05-22T19:58:59Z

UI work expose notifications controls and alerting sinks.

github-actions · 2026-05-22T20:02:19Z

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

Reviewed pull request diff only (45a10bc28c37f66376e61ebce1fe33d07f5757cc...d25296bc0755f80e859f92dc614bcb1a426de096, exact PR three-dot diff)

Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.

Review Summary

Overall Risk: HIGH

Findings

[HIGH] Notification destination SSRF guard is not enforced at the actual egress boundary

Category: Infrastructure
Location: server/internal/domain/notifications/service.go:367
Description: The backend classifies webhook/Slack/SMTP hosts with a one-time DNS lookup before saving/testing, but Grafana performs the actual outbound delivery later and resolves/connects independently. The code even notes this is “Not rebinding-proof,” but the PR does not add the required Grafana egress enforcement. A notification:manage user can use DNS rebinding, and potentially redirect behavior in Grafana’s notifier, to pass validation with a public IP and then make Grafana POST to private/LAN/link-local targets.
Impact: Blind SSRF from the Grafana container into the fleet network, host gateway, LAN miner APIs, or metadata/internal services. This bypasses the intended private-address block and is especially risky in a miner fleet where internal device APIs often sit on RFC1918 networks.
Recommendation: Add a hard egress control where Grafana connects out: network policy/firewall deny rules for private, loopback, link-local, multicast, and metadata ranges, or route notification delivery through a backend-controlled proxy with IP pinning and redirect target validation. Revalidate saved-channel tests as defense-in-depth, but do not rely on app-layer DNS checks alone.

[MEDIUM] Production notification overlay does not pass Grafana credentials to fleet-api

Category: Infrastructure
Location: deployment-files/docker-compose.notifications.yaml:11
Description: fleet-api is configured with FLEET_METRICS_GRAFANA_PASSWORD: "${FLEET_METRICS_GRAFANA_PASSWORD:-}", while the deployment flow provisions GRAFANA_ADMIN_PASSWORD for Grafana itself and does not set FLEET_METRICS_GRAFANA_PASSWORD or a service-account token. With the default overlay, fleet-api will try Grafana basic auth as admin with an empty password unless the operator manually provides a separate env var.
Impact: The beta notifications stack can come up healthy but all notification RPCs that proxy Grafana can fail with 401s, breaking channel/rule/silence management in production deployments.
Recommendation: Either generate/provision FLEET_METRICS_GRAFANA_TOKEN as part of run-fleet.sh, or default FLEET_METRICS_GRAFANA_PASSWORD to the generated GRAFANA_ADMIN_PASSWORD until service-account token provisioning exists. Document the required credential path.

Notes

The new notification RPCs are registered behind the shared auth/validation interceptor chain, are session-only, and are permission-mapped. I did not find pool/wallet/stratum hijacking changes in the reviewed diff; the “pool” references are alert templates, nav, or existing permission context.

_{Generated by Codex Security Review |

Triggered by: @illegalprime |

Review workflow run}

Addresses the Codex security review on PR #306: - RBAC bypass (HIGH): add notification:read / notification:manage to the authz catalog and gate every notifications route through middleware.RequirePermission. The handler now loads effective permissions after session validation, mirroring the Connect auth interceptor. Reads sit on :read; all mutations including TestChannel sit on :manage. - Org-wide silences via empty scopes (HIGH): validate silence scopes in the domain layer — every scope kind now requires its target (rule_id / group_id / site_id / device_ids). The UI hides the group/site/device scope options until their pickers exist. - Channels not routed to alerts (HIGH): made the UI explicit that saved channels are not yet attached to alert routing; policy-tree routing lands as a follow-up. - Secret leakage in Grafana error logs (HIGH): redact authorization_credentials, smtpPassword, and other credential fields from the request/response bodies logged on non-2xx Grafana responses. - SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy rejects destinations that are or resolve to loopback, link-local, private, or unspecified addresses; opt out via FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev stacks (enabled in the dev compose overlay). - Updates wiping stored secrets (MEDIUM): UpdateChannel now carries the existing settings' secret field into the PUT payload when the request doesn't include a fresh secret, so renames and destination edits no longer clear webhook credentials or SMTP passwords. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Follow-ups from the Codex re-review on PR #306: - DNS rebinding caveat (HIGH, partial): destination DNS lookups now fail closed — an unresolvable host is rejected instead of waved through. The rebinding residual (re-resolution at Grafana delivery time) is documented on checkDestinationHost; a hard guarantee needs egress enforcement at the Grafana container's network boundary. - Webhook URLs in error logs (MEDIUM): "url" added to the redacted log keys — webhook URLs routinely embed capability tokens (Slack, PagerDuty) in the path or query. - Silence regex injection (MEDIUM): device ids in device-scoped silences are validated against the identifier alphabet, escaped with regexp.QuoteMeta, and compiled into an anchored regex matcher so a crafted id like ".*" can't widen the silence org-wide. matchersToScope strips the anchors/escapes on reads. - ADMIN backfill (MEDIUM): migration 000087 seeds notification:read / notification:manage and backfills them onto existing ADMIN roles — the additive reconciler intentionally never re-asserts seed keys on existing built-ins (mirrors 000069). - Grafana admin creds in deployment (MEDIUM, partial): the deployment overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes precedence over basic auth in the client) so operators can move fleet-api onto a least-privilege service-account token; the admin fallback remains until a token is configured. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

illegalprime · 2026-06-12T17:53:30Z

Second-round review findings addressed in 72d9d9e:

[HIGH] SSRF policy bypassable via DNS rebinding — Partially mitigated: DNS lookups now fail closed (an unresolvable destination is rejected rather than allowed), per the reviewer's minimum recommendation. The fundamental rebinding window can't be closed from fleet-api — Grafana re-resolves the hostname at delivery time — so the residual risk is documented on checkDestinationHost with a pointer to network-boundary egress enforcement on the Grafana container, which is the real fix and an infra follow-up.

[MEDIUM] Webhook URLs leak into error logs — url added to the redacted log keys (Slack/PagerDuty-style URLs embed capability tokens in the path). The GrafanaError message returned to the API caller is unchanged: it reaches only the org user who supplied the channel definition, and stripping it would gut the test-channel UX; flagging if anyone disagrees.

[MEDIUM] Device-scope silence regex injection — Device ids are now validated against the identifier alphabet ([A-Za-z0-9._:-]+, covers UUIDs/MACs/serials), regexp.QuoteMeta-escaped, and compiled into an anchored ^(?:id1|id2)$ matcher; read-back strips the anchors. A crafted .* or a|b id is rejected with invalid_argument.

[MEDIUM] Existing ADMIN roles not backfilled — Correct catch: the additive reconciler never re-asserts seed keys on existing built-ins. Migration 000087_seed_notification_permissions backfills notification:read/notification:manage onto existing ADMIN roles, mirroring the 000069 activity:read precedent.

[MEDIUM] Grafana admin creds in deployment overlay — The overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (the client already prefers it over basic auth), so operators can move fleet-api onto a least-privilege service-account token. The admin fallback remains until a token is provisioned — automating SA token creation in run-fleet.sh is a deployment follow-up.

[MEDIUM] Generated Connect API not served — Intentionally deferred, as documented in the handler header: the hand-written JSON routes are the interim surface until the notificationsv1connect codegen lands, at which point the swap is mechanical and the routes join the standard interceptor chain. The hand-written UI client is the only consumer of this surface today.

🤖 Generated with Claude Code

Follow-ups from the Codex re-review on PR #306: - Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed capability tokens in the path/query, so reads (reachable by notification:read) now return them host-only (scheme://host). UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as unchanged, mirroring the secret-carry path; a manage user submitting a fresh full URL overrides it. - Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError messages reach the browser via err.Error(), and update payloads now carry stored secrets, so a Grafana error echoing the request body could leak them. The message is redacted with the same key-based redactor used for logs. - Hand-rolled JSON routes bypass validation/body limits (MEDIUM): added http.MaxBytesReader (1 MiB) on the body, and enforce the proto's name (255) / comment (1024) / device-id-count (500) bounds in the hand-written path until the generated Connect handlers with the validate interceptor take over. DNS rebinding (MEDIUM) is already mitigated at the application layer (fail-closed resolution) and documented; the network-boundary fix is infra. HistoryService-outside-proto (MEDIUM) is part of the tracked generated-handler migration. Both noted on the PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

illegalprime · 2026-06-12T18:07:20Z

Third-round findings addressed in 0b131db:

[MEDIUM] Webhook URLs leak to read-only users — Good catch, consistent with the logger change. Reads now return webhook URLs host-only (scheme://host, dropping the userinfo/path/query where tokens live); the full URL stays only in Grafana's stored settings. UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as "unchanged," same pattern as the secret carry — so renames don't truncate it, while a manage user submitting a fresh full URL overrides it. Covered by new tests (TestListChannelsRedactsWebhookURL, TestUpdateChannelPreservesWebhookURLOnRename).

[MEDIUM] Raw Grafana error bodies returned to clients — GrafanaError.Message is now redacted with the same key-based redactor before it rides back to the browser, so a Grafana/proxy error echoing the (secret-carrying) request body can't leak credentials.

[MEDIUM] Hand-rolled routes bypass validation/body limits — Added http.MaxBytesReader (1 MiB) on the request body, and the hand-written path now enforces the proto's buf.validate bounds (channel name ≤255, silence comment ≤1024, device-id count ≤500) until the generated Connect handlers + validate interceptor take over.

[MEDIUM] DNS rebinding bypasses SSRF checks — Unchanged from round two: mitigated at the application layer (fail-closed resolution) and documented on checkDestinationHost. Closing the rebinding window requires re-resolving at dial time, which only Grafana's own egress path controls — that's a network-boundary/infra fix, not something fleet-api can guarantee. Tracking separately rather than implying a code-level fix here closes it.

[MEDIUM] HistoryService outside the proto contract — Acknowledged; this is part of the same generated-handler migration the handler header documents. When notificationsv1connect is wired, HistoryService will be added to the proto and regenerated alongside the others. No generated client consumes this surface today (the hand-written UI client is the only caller), so there's no client to break in the interim.

🤖 Generated with Claude Code

illegalprime · 2026-06-12T18:20:09Z

Fourth-round review — risk is back at HIGH on two findings. Triage:

[HIGH] Channel destination updates preserve old secrets — Valid, and a regression from my round-1 secret-carry. This is already fixed in the working tree (in-progress on the branch): the secret carry is now gated on the destination being unchanged — a webhook URL or SMTP host/port change without a fresh secret drops the stored credential instead of delivering the old bearer token / SMTP password to the new destination. A rename or recipients-only edit still preserves it. It'll land with the next push of the in-progress channel work on this branch.

[HIGH] DNS rebinding bypasses SSRF check — Unchanged position: this cannot be closed in fleet-api. Pinning the resolved IP and handing Grafana an IP instead of the hostname would break TLS SNI / the Host header for real destinations (Slack, PagerDuty). The validation is an application-level backstop (fail-closed resolution, documented on checkDestinationHost); the actual guarantee has to come from egress policy at the Grafana container's network boundary, or a delivery proxy that pins/validates at dial time. Flagging for a human risk-acceptance decision + an infra follow-up rather than implying a code fix exists here.

[MEDIUM] Grafana error redaction passes non-JSON bodies through — Fixed in 988f58a. redactSecrets no longer returns non-JSON bodies verbatim (key-based redaction can't scrub HTML/plaintext that echoes the payload); it returns a length marker. The client-facing GrafanaError message is now the generic status text for non-JSON responses, surfacing only Grafana's own structured JSON error after redaction.

[MEDIUM] Saved channel tests ignore the saved channel ID — Valid; TestChannel tests the request body, and with the URL now redacted on reads the UI sends a host-only URL + empty secret, so testing a saved channel hits the wrong destination. The right fix (load + ownership-check the stored contact point by id, test its stored settings, use the request body only for unsaved preview) lives in service.go, which is under active concurrent edit for the Slack channel work — I'm deferring it to that change rather than racing the same file. Noting it here so it isn't lost.

[MEDIUM] History endpoint outside the proto contract — Same disposition as prior rounds: part of the tracked hand-written-JSON → generated-Connect-handler migration documented in the handler header. When notificationsv1connect is wired, HistoryService gets added to the proto and regenerated with the rest. No generated client consumes it today.

🤖 Generated with Claude Code

illegalprime · 2026-06-12T18:30:16Z

Fifth-round review (MEDIUM). Two fixed in 97ca093, three deferred with consistent rationale:

[MEDIUM] Grafana JSON error bodies leak secrets in generic string fields — Fixed. My round-4 redaction was key-based only, so a secret echoed inside a value (e.g. {"message":"failed to POST to https://hooks.slack.com/.../TOKEN"}) slipped through. redactValue now also scrubs webhook-URL and Bearer <token> substrings out of every string value, wherever they sit — applied to both the logged body and the client-facing GrafanaError message. (SMTP passwords inside free-text aren't pattern-matchable generically; Grafana doesn't echo them in practice, and the secret-keyed cases are already covered.)

[MEDIUM] Fleet API always receives the Grafana admin password — Fixed in the deployment overlay. It previously hard-required GRAFANA_ADMIN_PASSWORD for fleet-api even with a token configured, so the admin credential sat in fleet-api's env regardless. The basic-auth password now defaults empty and is an explicit opt-in fallback — a token deployment carries no admin credential. ⚠️ This needs a companion run-fleet.sh change to set FLEET_METRICS_GRAFANA_PASSWORD for no-token deployments (otherwise basic auth gets an empty password); I can't see/test that script here, so flagging it for whoever owns the deploy tooling. The full "fail startup if neither token nor explicit fallback is present" check is a server-side follow-up.

[MEDIUM] Hand-written JSON not Connect/protobuf-compatible & [MEDIUM] History endpoint missing from proto — Both are the same tracked item: the hand-written JSON routes are the documented interim surface (see the handler header) until the notificationsv1connect bindings are served. The Slack work just regenerated the proto, so the bindings are moving; HistoryService gets added and the UI switched to generated bindings as part of that migration. No generated/SDK client consumes this surface today — the hand-written TS client is the only caller — so there's no client mismatch in practice yet.

[MEDIUM] SSRF is only a pre-resolution check (DNS rebinding) — Unchanged: not closable in fleet-api without breaking TLS SNI/Host for real destinations. Application-level fail-closed validation stays as the backstop; the guarantee needs egress policy at the Grafana container boundary or a pinning delivery proxy. Flagged for human risk-acceptance + infra. (Noted the CGNAT range suggestion — worth folding into the host check if/when we extend it.)

🤖 Generated with Claude Code

illegalprime · 2026-06-12T18:39:36Z

Sixth-round review (MEDIUM). One fixed, two are the same tracked migration item:

[MEDIUM] Saved channel tests don't use the saved destination — Fixed in c587607. This was the round-4 item I'd deferred while service.go was under active concurrent edit (the Slack work has since landed, so it's safe to touch now), and it was made worse by my own read-time URL redaction. TestChannel now, when given a channel id, resolves the owned Grafana contact point (ownership-checked) and tests its stored settings — the full webhook URL and Grafana's secure fields — instead of the redacted read-model the UI echoes back. Unsaved definitions (no id) still validate + test the request payload for the "Test before save" flow. The handler also now surfaces TestChannel's service error (unknown/foreign id, invalid destination, Grafana unreachable) as a 4xx/5xx rather than discarding it, distinct from a delivery failure reported in the body. Covered by two new tests (stored-URL probe + foreign-channel rejection).

Caveat noted in the finding: a saved Slack channel still can't be fully tested because Grafana stores the Slack URL as a secure field and doesn't return it on reads — same fundamental limitation as any stored secret, and not regressed by this change (it failed before too). Fully fixing it would require fleet-api to keep a parallel copy of the secret, which the design deliberately avoids.

[MEDIUM] Manual wire format diverges from generated protobuf & [MEDIUM] History endpoint not in proto — Same tracked item as prior rounds: the hand-written JSON routes are the documented interim surface until the notificationsv1connect bindings are served (handler header documents this). The contract mismatch (lowercase enum tokens vs CHANNEL_KIND_WEBHOOK, missing HistoryService) is real but only matters once a generated client/handler is actually wired — the hand-written TS client is the sole consumer today. The Slack work regenerated the proto, so the bindings exist; serving them (and adding HistoryService to the proto) is the migration's job, tracked separately rather than partially done here.

The DNS-rebinding item moved to the reviewer's Notes this round ("deployment should enforce Grafana egress at the network layer") — agreed, that's the resolution: infra-level egress, not a fleet-api code change.

🤖 Generated with Claude Code

illegalprime · 2026-06-12T18:45:57Z

Seventh-round review — converged. No new actionable findings this round; both remaining items are ones already dispositioned over prior rounds, and I'm not making code changes for them (flagging here so it's explicit rather than re-answering each re-review):

Destination checks don't enforce Grafana egress (DNS rebinding / redirects) — Confirmed not fixable in fleet-api: handing Grafana a pinned IP breaks TLS SNI/Host for real destinations, and Grafana owns the delivery-time resolution + redirect behavior. The reviewer's own recommendation is the resolution — deny egress to loopback/RFC1918/link-local/metadata at the Grafana container's network boundary (an infra/deployment change). fleet-api keeps its fail-closed pre-resolution check as a best-effort backstop. Tracking as an infra task, not a code fix on this PR.
HTTP surface diverges from generated protobuf + HistoryService not in proto — The hand-written JSON routes are the documented interim surface (see the handler package header) until the notificationsv1connect bindings are served; that migration is where the enum-name alignment and HistoryService proto declaration belong. The only consumer today is the hand-written TS client, so there's no generated-client mismatch in practice yet. Tracked with the generated-handler migration.

All concrete, code-level findings from rounds 1–6 (RBAC bypass, org-wide silences, secret-log/error leakage, SSRF pre-checks, secret-reuse-on-destination-change, secret redaction depth, token-only Grafana auth, saved-channel testing) are addressed and on the branch. The two items above are an infra task and a tracked refactor respectively — both appropriate to handle outside this preliminary-UI PR.

I'll continue watching and will only follow up if a future re-review surfaces something genuinely new.

🤖 Generated with Claude Code

Addresses the Codex security review on PR #306: - RBAC bypass (HIGH): add notification:read / notification:manage to the authz catalog and gate every notifications route through middleware.RequirePermission. The handler now loads effective permissions after session validation, mirroring the Connect auth interceptor. Reads sit on :read; all mutations including TestChannel sit on :manage. - Org-wide silences via empty scopes (HIGH): validate silence scopes in the domain layer — every scope kind now requires its target (rule_id / group_id / site_id / device_ids). The UI hides the group/site/device scope options until their pickers exist. - Channels not routed to alerts (HIGH): made the UI explicit that saved channels are not yet attached to alert routing; policy-tree routing lands as a follow-up. - Secret leakage in Grafana error logs (HIGH): redact authorization_credentials, smtpPassword, and other credential fields from the request/response bodies logged on non-2xx Grafana responses. - SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy rejects destinations that are or resolve to loopback, link-local, private, or unspecified addresses; opt out via FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev stacks (enabled in the dev compose overlay). - Updates wiping stored secrets (MEDIUM): UpdateChannel now carries the existing settings' secret field into the PUT payload when the request doesn't include a fresh secret, so renames and destination edits no longer clear webhook credentials or SMTP passwords. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Follow-ups from the Codex re-review on PR #306: - DNS rebinding caveat (HIGH, partial): destination DNS lookups now fail closed — an unresolvable host is rejected instead of waved through. The rebinding residual (re-resolution at Grafana delivery time) is documented on checkDestinationHost; a hard guarantee needs egress enforcement at the Grafana container's network boundary. - Webhook URLs in error logs (MEDIUM): "url" added to the redacted log keys — webhook URLs routinely embed capability tokens (Slack, PagerDuty) in the path or query. - Silence regex injection (MEDIUM): device ids in device-scoped silences are validated against the identifier alphabet, escaped with regexp.QuoteMeta, and compiled into an anchored regex matcher so a crafted id like ".*" can't widen the silence org-wide. matchersToScope strips the anchors/escapes on reads. - ADMIN backfill (MEDIUM): migration 000087 seeds notification:read / notification:manage and backfills them onto existing ADMIN roles — the additive reconciler intentionally never re-asserts seed keys on existing built-ins (mirrors 000069). - Grafana admin creds in deployment (MEDIUM, partial): the deployment overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes precedence over basic auth in the client) so operators can move fleet-api onto a least-privilege service-account token; the admin fallback remains until a token is configured. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Follow-ups from the Codex re-review on PR #306: - Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed capability tokens in the path/query, so reads (reachable by notification:read) now return them host-only (scheme://host). UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as unchanged, mirroring the secret-carry path; a manage user submitting a fresh full URL overrides it. - Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError messages reach the browser via err.Error(), and update payloads now carry stored secrets, so a Grafana error echoing the request body could leak them. The message is redacted with the same key-based redactor used for logs. - Hand-rolled JSON routes bypass validation/body limits (MEDIUM): added http.MaxBytesReader (1 MiB) on the body, and enforce the proto's name (255) / comment (1024) / device-id-count (500) bounds in the hand-written path until the generated Connect handlers with the validate interceptor take over. DNS rebinding (MEDIUM) is already mitigated at the application layer (fail-closed resolution) and documented; the network-boundary fix is infra. HistoryService-outside-proto (MEDIUM) is part of the tracked generated-handler migration. Both noted on the PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Fourth-round Codex finding: redactSecrets passed non-JSON upstream error bodies through verbatim, and that string is both logged and returned to the browser via GrafanaError. A Grafana/proxy HTML or plaintext error that echoes the request payload could leak bearer tokens, SMTP passwords, or secret-bearing webhook URLs — key-based redaction can't reach secrets in unstructured text. - redactSecrets now returns a length marker for non-JSON input instead of the raw body. - The client-facing GrafanaError message is the generic status text for non-JSON responses; only Grafana's own structured (JSON) error is surfaced, after key redaction. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Fifth-round Codex security findings: - Grafana JSON error bodies can leak secrets in generic string fields (MEDIUM): key-based redaction misses a secret Grafana echoes inside a value like {"message":"failed to POST to https://hooks/.../TOKEN"}. redactValue now scrubs webhook-URL and bearer-token substrings out of every string value, wherever they sit — applied to both the log body and the client-facing GrafanaError message. - Fleet API always receives the Grafana admin password (MEDIUM): the deployment overlay hard-required GRAFANA_ADMIN_PASSWORD for fleet-api even when a service-account token is configured, so a fleet-api compromise leaked the admin credential. The basic-auth password now defaults empty and is an explicit opt-in fallback; a token deployment no longer carries the admin credential in fleet-api's environment. (Needs a companion run-fleet.sh change to set the password for no-token deployments — flagged on the PR.) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Sixth-round Codex finding (also a regression surfaced by the read-time URL redaction): TestChannel ignored Channel.ID and always probed the request payload. Since reads now hand the UI a redacted destination (webhook URLs host-only, Slack URLs omitted), clicking "Test" on a saved channel hit the wrong target — or a host root that isn't the real destination. - When an id is supplied, TestChannel now resolves the owned Grafana contact point (ownership-checked via findOwnedChannel) and tests its stored settings — the full URL and Grafana's secure fields. An unsaved definition (no id) still validates and tests the request payload for the "Test before save" flow. - The handler now surfaces TestChannel's service error (unknown/foreign id, invalid destination, Grafana unreachable) as a 4xx/5xx instead of discarding it, distinct from a delivery failure reported in the body. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Codex finding: the value-level bearer redaction used a narrow character class (Bearer\s+[A-Za-z0-9._-]+), so a base64 / base64url / JWT token containing +, /, =, ~, or : was only partially scrubbed — the suffix after the first unmatched character leaked into fleet-api logs and the browser-facing error. Switch to a delimiter-based pattern (Bearer\s+[^\s"']+) that consumes the whole token up to the next whitespace or quote, matching the URL pattern's approach. Tests cover base64, JWT, and punctuation-bearing tokens. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

illegalprime · 2026-06-15T14:16:54Z

New re-review surfaced one genuinely new, actionable finding — fixed in 497d229:

[MEDIUM] Bearer-token redaction can leak token suffixes — Valid bug in the round-5 value-level redaction I added. bearerValuePattern used a narrow class (Bearer\s+[A-Za-z0-9._-]+), so a base64/base64url/JWT token containing + / = ~ : was only partially scrubbed — the suffix after the first unmatched character leaked into logs and the error response. Switched to a delimiter-based Bearer\s+[^\s"']+ (consumes the whole token up to whitespace/quote, matching the URL pattern's approach). Added tests for base64, JWT, and punctuation-bearing tokens.

The other two findings are the previously-converged items — proto/Connect wire-contract divergence and HistoryService not in proto — both part of the tracked generated-handler migration (hand-written JSON is the documented interim surface; the sole consumer today is the hand-written TS client). No code change here; dispositioned as before.

🤖 Generated with Claude Code

Codex finding: the dashboard unconditionally rendered ActiveNotificationsCard, whose HistoryTable fetches notification history on mount. The server gates that RPC on notification:read, so any authenticated user without it (e.g. FIELD_TECH, custom roles) hit a guaranteed 403 and an error state on the main dashboard. - Render the dashboard notification card only when the user holds notification:read. - Add requiredPermission: "notification:read" to the /settings/notifications nav item so the link is hidden from users who can't use the page. Server-side RBAC was already correct; this fixes the client-side UX regression. Gating the manage-only mutation controls (Add/Test/Delete, pause/resume) on notification:manage is a follow-up better done alongside the in-flight notifications UI work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

illegalprime · 2026-06-15T14:24:21Z

One genuinely new finding addressed in 1ae32d8:

[LOW] Dashboard fetches notification history for users without notification:read — Real RBAC/UX regression: the dashboard rendered ActiveNotificationsCard unconditionally, and its HistoryTable fetches history on mount, which the server correctly gates on notification:read — so any user lacking that permission (FIELD_TECH, custom roles) hit a guaranteed 403 + error state on the main dashboard. Fixed:

The dashboard card now renders only when the user holds notification:read (useHasPermission).
Added requiredPermission: "notification:read" to the /settings/notifications nav item so the link is hidden from users who can't use the page.

Server-side RBAC was already correct; this is purely the client UX gap. The third sub-recommendation — gating the manage-only controls (Add/Test/Delete channel, pause/resume rule, add silence) on notification:manage — is defense-in-depth UX (the server already enforces it) and touches several components under active edit for the Slack work, so I've left it as a follow-up for that UI pass rather than sprawl across it now.

[MEDIUM] RPC routes don't match the protobuf contract — Unchanged disposition: the tracked generated-handler migration (hand-written JSON is the documented interim surface; sole consumer is the hand-written TS client). No churn.

🤖 Generated with Claude Code

…JSON Addresses the recurring security-review finding that the notifications routes spoke a hand-written snake_case/lowercase-enum JSON contract divergent from the committed protobuf, and that HistoryService was served but undeclared in the proto. Proto: - Add HistoryService + ListNotifications messages + NotificationHistoryEntry so the history endpoint is part of the declared contract, and regenerate Go + TS bindings. Server: - Replace the hand-written http.Handler + wire DTOs with generated notificationsv1connect ChannelService/RuleService/SilenceService/ HistoryService implementations that convert protobuf <-> domain models. - Mount them through the shared interceptor chain (li): authentication, buf.validate request validation, error mapping, request-log redaction. - Register all 13 procedures in ProcedurePermissions (notification:read / notification:manage), the contract test's registeredServices, SessionOnlyProcedures (preserving the prior session-only behaviour), and RedactedRequestProcedures (channel mutations carry secrets). All security logic — org scoping, secret redaction, SSRF destination checks, silence scope/device-id validation, secret-carry-on-update, saved-channel testing — is unchanged: it lives in the domain service, which the new handlers call exactly as before. Client: - Talk to the generated Connect clients; map protobuf <-> the existing view types at the API boundary so the store and components are untouched. Removes the bespoke fetch/JSON wrapper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

illegalprime · 2026-06-15T14:49:15Z

Done — the notifications routes now use the generated notificationsv1connect Connect handlers instead of the hand-written JSON surface (7432d65). This resolves the two findings I'd been tracking as a deferred migration:

Proto/Connect wire-contract divergence — The hand-written snake_case fields + lowercase enum tokens are gone. The server mounts generated ChannelService / RuleService / SilenceService / HistoryService handlers that convert protobuf↔domain, and the client talks to the generated Connect clients. The wire is now real protojson (camelCase, CHANNEL_KIND_WEBHOOK, int64-as-string, Timestamp) — a generated client interoperates.

HistoryService not in the proto — Added HistoryService + ListNotifications{Request,Response} + NotificationHistoryEntry to notifications.proto and regenerated Go/TS. It's now a declared, discoverable part of the contract.

How it's wired:

Generated handlers run through the shared interceptor chain — authentication, buf.validate request validation (replacing the hand-rolled length/body checks), error mapping, and request-log redaction.
All 13 procedures are registered in ProcedurePermissions (notification:read / notification:manage), the RPC contract test's registeredServices, SessionOnlyProcedures (preserving the prior session-only behaviour — no API-key auth), and RedactedRequestProcedures (channel mutations carry secrets).
Every security control from the earlier rounds is untouched: it all lives in the domain service (org scoping, secret redaction, SSRF checks, silence scope/device-id validation, secret-carry-on-update, saved-channel testing), which the new handlers call exactly as before.
Client change is confined to the API boundary: notificationsApi.ts maps protobuf↔the existing view types, so the store and components are unchanged.

Verified: server build + golangci-lint clean, RPC contract test passes (all 13 procedures classified), notifications domain tests pass, client tsc + eslint clean, pre-push hooks green.

🤖 Generated with Claude Code

Codex finding: ListNotifications keyset-paginates by (organization_id, id < cursor ORDER BY id DESC), but the only indexes on notification_history are on received_at — so each history page filters/scans rows, and the dashboard fetches this for every notification:read user. Add idx_notification_history_org_id (organization_id, id DESC) so each page is an index range scan. New migration; the deployed 000083 is left untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

illegalprime · 2026-06-15T14:54:53Z

The migration landed cleanly — this round's Notes confirm the new RPCs are session-only, RBAC-gated, on the shared interceptor chain, and the proto-divergence/HistoryService findings are gone. One genuinely new finding, fixed:

[MEDIUM] History pagination lacks a matching index — Fixed in 50b913b. ListNotifications keyset-paginates by (organization_id, id < cursor ORDER BY id DESC), but notification_history's indexes were all on received_at, so each page scanned/filtered — and the dashboard card fetches this for every notification:read user. Added idx_notification_history_org_id (organization_id, id DESC) in a new migration (000088) so each page is an index range scan; the deployed 000083 is untouched.

[MEDIUM] DNS-rebinding SSRF — Unchanged disposition: this is the standing infra-level item. fleet-api keeps its fail-closed pre-resolution check as an advisory backstop, but the hard guarantee has to be an egress boundary at the Grafana container (firewall/proxy blocking private/link-local/loopback/host-gateway/metadata) — the reviewer's recommendation, and not a fleet-api code change. Flagged for infra/risk-acceptance.

🤖 Generated with Claude Code

Extract channelDestinationFields() — the kind/webhook/smtp/slack transformation was copy-pasted across createChannel, updateChannel, and testChannel. Pure cleanup; no behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ioned admin credential The production notifications overlay left FLEET_METRICS_GRAFANA_PASSWORD defaulting to empty, so without a manually-provided token or password fleet-api fell back to admin with an empty password and every Grafana-proxying notification RPC 401'd while the stack still came up healthy. Default the basic-auth password to the GRAFANA_ADMIN_PASSWORD that run-fleet.sh already provisions for Grafana; a service-account token (FLEET_METRICS_GRAFANA_TOKEN) still takes precedence and avoids handing fleet-api the admin credential. Addresses Codex security review finding (MEDIUM) on #306. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

illegalprime · 2026-06-15T16:08:23Z

Superseded by a 5-PR stack that splits this work into reviewable layers:

notifications (1/5): proto definitions and generated code #452 — proto definitions and generated code
notifications (2/5): notification_history keyset listing #453 — notification_history keyset listing
notifications (3/5): notification:read and notification:manage permissions #454 — notification:read / notification:manage permissions
notifications (4/5): Grafana-backed service, handlers, and server wiring #455 — Grafana-backed service, handlers, and server wiring
notifications (5/5): notifications UI #456 — notifications UI

The stack tip is byte-identical to this branch, plus the fix for the [MEDIUM] Codex finding (prod overlay now defaults FLEET_METRICS_GRAFANA_PASSWORD to the provisioned GRAFANA_ADMIN_PASSWORD), which landed in #455.

The outstanding [HIGH] Codex finding (Grafana egress SSRF enforcement) is not addressed here and should be tracked separately against #455. Closing this in favor of the stack.

github-actions Bot assigned illegalprime May 22, 2026

github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 22, 2026

illegalprime force-pushed the eden/notifications.grafana branch 2 times, most recently from 9534768 to e52b9ea Compare June 4, 2026 13:47

illegalprime force-pushed the eden/notifications.ui branch from 7811814 to 4370492 Compare June 4, 2026 19:07

github-actions Bot added documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file github_actions Pull requests that update GitHub Actions code automation labels Jun 4, 2026

illegalprime force-pushed the eden/notifications.grafana branch from 5ee6d9f to a4625a5 Compare June 11, 2026 17:45

Base automatically changed from eden/notifications.grafana to main June 11, 2026 19:17

illegalprime force-pushed the eden/notifications.ui branch from 4370492 to 8d92a42 Compare June 12, 2026 14:31

github-actions Bot removed documentation Improvements or additions to documentation github_actions Pull requests that update GitHub Actions code automation labels Jun 12, 2026

illegalprime force-pushed the eden/notifications.ui branch from 494025b to ba75406 Compare June 12, 2026 17:28

ui wip

21c273b

illegalprime and others added 12 commits June 15, 2026 10:10

add notifications history

f4488a2

reorder notifications items

5281dd4

reorder notifs settings page

2cf9207

fix test

6f55c23

slack notifs

413f692

resize status font to match others

223114e

illegalprime force-pushed the eden/notifications.ui branch from c587607 to 223114e Compare June 15, 2026 14:10

github-actions Bot removed the dependencies Pull requests that update a dependency file label Jun 15, 2026

illegalprime force-pushed the eden/notifications.ui branch from a5a1b58 to 3c5818f Compare June 15, 2026 15:54

cleanup comments

d25296b

illegalprime force-pushed the eden/notifications.ui branch from 3c5818f to d25296b Compare June 15, 2026 15:56

illegalprime closed this Jun 15, 2026

illegalprime mentioned this pull request Jun 15, 2026

Notifications: enforce Grafana egress restrictions (destination SSRF) #459

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

notifications: preliminary UI work#306

notifications: preliminary UI work#306
illegalprime wants to merge 19 commits into
mainfrom
eden/notifications.ui

illegalprime commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

illegalprime commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔐 Codex Security Review

Review Summary

Findings

[HIGH] Notification destination SSRF guard is not enforced at the actual egress boundary

[MEDIUM] Production notification overlay does not pass Grafana credentials to fleet-api

Notes

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 12, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

illegalprime commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 22, 2026 •

edited

Loading