Skip to content

notifications: preliminary UI work#306

Closed
illegalprime wants to merge 19 commits into
mainfrom
eden/notifications.ui
Closed

notifications: preliminary UI work#306
illegalprime wants to merge 19 commits into
mainfrom
eden/notifications.ui

Conversation

@illegalprime

Copy link
Copy Markdown
Contributor

UI work expose notifications controls and alerting sinks.

@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 22, 2026
@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (45a10bc28c37f66376e61ebce1fe33d07f5757cc...d25296bc0755f80e859f92dc614bcb1a426de096, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Notification destination SSRF guard is not enforced at the actual egress boundary

  • Category: Infrastructure
  • Location: server/internal/domain/notifications/service.go:367
  • Description: The backend classifies webhook/Slack/SMTP hosts with a one-time DNS lookup before saving/testing, but Grafana performs the actual outbound delivery later and resolves/connects independently. The code even notes this is “Not rebinding-proof,” but the PR does not add the required Grafana egress enforcement. A notification:manage user can use DNS rebinding, and potentially redirect behavior in Grafana’s notifier, to pass validation with a public IP and then make Grafana POST to private/LAN/link-local targets.
  • Impact: Blind SSRF from the Grafana container into the fleet network, host gateway, LAN miner APIs, or metadata/internal services. This bypasses the intended private-address block and is especially risky in a miner fleet where internal device APIs often sit on RFC1918 networks.
  • Recommendation: Add a hard egress control where Grafana connects out: network policy/firewall deny rules for private, loopback, link-local, multicast, and metadata ranges, or route notification delivery through a backend-controlled proxy with IP pinning and redirect target validation. Revalidate saved-channel tests as defense-in-depth, but do not rely on app-layer DNS checks alone.

[MEDIUM] Production notification overlay does not pass Grafana credentials to fleet-api

  • Category: Infrastructure
  • Location: deployment-files/docker-compose.notifications.yaml:11
  • Description: fleet-api is configured with FLEET_METRICS_GRAFANA_PASSWORD: "${FLEET_METRICS_GRAFANA_PASSWORD:-}", while the deployment flow provisions GRAFANA_ADMIN_PASSWORD for Grafana itself and does not set FLEET_METRICS_GRAFANA_PASSWORD or a service-account token. With the default overlay, fleet-api will try Grafana basic auth as admin with an empty password unless the operator manually provides a separate env var.
  • Impact: The beta notifications stack can come up healthy but all notification RPCs that proxy Grafana can fail with 401s, breaking channel/rule/silence management in production deployments.
  • Recommendation: Either generate/provision FLEET_METRICS_GRAFANA_TOKEN as part of run-fleet.sh, or default FLEET_METRICS_GRAFANA_PASSWORD to the generated GRAFANA_ADMIN_PASSWORD until service-account token provisioning exists. Document the required credential path.

Notes

The new notification RPCs are registered behind the shared auth/validation interceptor chain, are session-only, and are permission-mapped. I did not find pool/wallet/stratum hijacking changes in the reviewed diff; the “pool” references are alert templates, nav, or existing permission context.


Generated by Codex Security Review |
Triggered by: @illegalprime |
Review workflow run

@illegalprime illegalprime force-pushed the eden/notifications.grafana branch 2 times, most recently from 9534768 to e52b9ea Compare June 4, 2026 13:47
@illegalprime illegalprime force-pushed the eden/notifications.ui branch from 7811814 to 4370492 Compare June 4, 2026 19:07
@github-actions github-actions Bot added documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file github_actions Pull requests that update GitHub Actions code automation labels Jun 4, 2026
@illegalprime illegalprime force-pushed the eden/notifications.grafana branch from 5ee6d9f to a4625a5 Compare June 11, 2026 17:45
Base automatically changed from eden/notifications.grafana to main June 11, 2026 19:17
@illegalprime illegalprime force-pushed the eden/notifications.ui branch from 4370492 to 8d92a42 Compare June 12, 2026 14:31
@github-actions github-actions Bot removed documentation Improvements or additions to documentation github_actions Pull requests that update GitHub Actions code automation labels Jun 12, 2026
@illegalprime illegalprime force-pushed the eden/notifications.ui branch from 494025b to ba75406 Compare June 12, 2026 17:28
illegalprime added a commit that referenced this pull request Jun 12, 2026
Addresses the Codex security review on PR #306:

- RBAC bypass (HIGH): add notification:read / notification:manage to
  the authz catalog and gate every notifications route through
  middleware.RequirePermission. The handler now loads effective
  permissions after session validation, mirroring the Connect auth
  interceptor. Reads sit on :read; all mutations including
  TestChannel sit on :manage.

- Org-wide silences via empty scopes (HIGH): validate silence scopes
  in the domain layer — every scope kind now requires its target
  (rule_id / group_id / site_id / device_ids). The UI hides the
  group/site/device scope options until their pickers exist.

- Channels not routed to alerts (HIGH): made the UI explicit that
  saved channels are not yet attached to alert routing; policy-tree
  routing lands as a follow-up.

- Secret leakage in Grafana error logs (HIGH): redact
  authorization_credentials, smtpPassword, and other credential
  fields from the request/response bodies logged on non-2xx
  Grafana responses.

- SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy
  rejects destinations that are or resolve to loopback, link-local,
  private, or unspecified addresses; opt out via
  FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev
  stacks (enabled in the dev compose overlay).

- Updates wiping stored secrets (MEDIUM): UpdateChannel now carries
  the existing settings' secret field into the PUT payload when the
  request doesn't include a fresh secret, so renames and destination
  edits no longer clear webhook credentials or SMTP passwords.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 12, 2026
Follow-ups from the Codex re-review on PR #306:

- DNS rebinding caveat (HIGH, partial): destination DNS lookups now
  fail closed — an unresolvable host is rejected instead of waved
  through. The rebinding residual (re-resolution at Grafana delivery
  time) is documented on checkDestinationHost; a hard guarantee needs
  egress enforcement at the Grafana container's network boundary.

- Webhook URLs in error logs (MEDIUM): "url" added to the redacted
  log keys — webhook URLs routinely embed capability tokens (Slack,
  PagerDuty) in the path or query.

- Silence regex injection (MEDIUM): device ids in device-scoped
  silences are validated against the identifier alphabet, escaped
  with regexp.QuoteMeta, and compiled into an anchored regex matcher
  so a crafted id like ".*" can't widen the silence org-wide.
  matchersToScope strips the anchors/escapes on reads.

- ADMIN backfill (MEDIUM): migration 000087 seeds notification:read /
  notification:manage and backfills them onto existing ADMIN roles —
  the additive reconciler intentionally never re-asserts seed keys on
  existing built-ins (mirrors 000069).

- Grafana admin creds in deployment (MEDIUM, partial): the deployment
  overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes
  precedence over basic auth in the client) so operators can move
  fleet-api onto a least-privilege service-account token; the admin
  fallback remains until a token is configured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

Second-round review findings addressed in 72d9d9e:

[HIGH] SSRF policy bypassable via DNS rebinding — Partially mitigated: DNS lookups now fail closed (an unresolvable destination is rejected rather than allowed), per the reviewer's minimum recommendation. The fundamental rebinding window can't be closed from fleet-api — Grafana re-resolves the hostname at delivery time — so the residual risk is documented on checkDestinationHost with a pointer to network-boundary egress enforcement on the Grafana container, which is the real fix and an infra follow-up.

[MEDIUM] Webhook URLs leak into error logsurl added to the redacted log keys (Slack/PagerDuty-style URLs embed capability tokens in the path). The GrafanaError message returned to the API caller is unchanged: it reaches only the org user who supplied the channel definition, and stripping it would gut the test-channel UX; flagging if anyone disagrees.

[MEDIUM] Device-scope silence regex injection — Device ids are now validated against the identifier alphabet ([A-Za-z0-9._:-]+, covers UUIDs/MACs/serials), regexp.QuoteMeta-escaped, and compiled into an anchored ^(?:id1|id2)$ matcher; read-back strips the anchors. A crafted .* or a|b id is rejected with invalid_argument.

[MEDIUM] Existing ADMIN roles not backfilled — Correct catch: the additive reconciler never re-asserts seed keys on existing built-ins. Migration 000087_seed_notification_permissions backfills notification:read/notification:manage onto existing ADMIN roles, mirroring the 000069 activity:read precedent.

[MEDIUM] Grafana admin creds in deployment overlay — The overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (the client already prefers it over basic auth), so operators can move fleet-api onto a least-privilege service-account token. The admin fallback remains until a token is provisioned — automating SA token creation in run-fleet.sh is a deployment follow-up.

[MEDIUM] Generated Connect API not served — Intentionally deferred, as documented in the handler header: the hand-written JSON routes are the interim surface until the notificationsv1connect codegen lands, at which point the swap is mechanical and the routes join the standard interceptor chain. The hand-written UI client is the only consumer of this surface today.

🤖 Generated with Claude Code

illegalprime added a commit that referenced this pull request Jun 12, 2026
Follow-ups from the Codex re-review on PR #306:

- Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed
  capability tokens in the path/query, so reads (reachable by
  notification:read) now return them host-only (scheme://host).
  UpdateChannel carries the stored full URL through an edit that
  didn't replace it — an empty or still-redacted URL is treated as
  unchanged, mirroring the secret-carry path; a manage user submitting
  a fresh full URL overrides it.

- Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError
  messages reach the browser via err.Error(), and update payloads now
  carry stored secrets, so a Grafana error echoing the request body
  could leak them. The message is redacted with the same key-based
  redactor used for logs.

- Hand-rolled JSON routes bypass validation/body limits (MEDIUM):
  added http.MaxBytesReader (1 MiB) on the body, and enforce the
  proto's name (255) / comment (1024) / device-id-count (500) bounds
  in the hand-written path until the generated Connect handlers with
  the validate interceptor take over.

DNS rebinding (MEDIUM) is already mitigated at the application layer
(fail-closed resolution) and documented; the network-boundary fix is
infra. HistoryService-outside-proto (MEDIUM) is part of the tracked
generated-handler migration. Both noted on the PR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

Third-round findings addressed in 0b131db:

[MEDIUM] Webhook URLs leak to read-only users — Good catch, consistent with the logger change. Reads now return webhook URLs host-only (scheme://host, dropping the userinfo/path/query where tokens live); the full URL stays only in Grafana's stored settings. UpdateChannel carries the stored full URL through an edit that didn't replace it — an empty or still-redacted URL is treated as "unchanged," same pattern as the secret carry — so renames don't truncate it, while a manage user submitting a fresh full URL overrides it. Covered by new tests (TestListChannelsRedactsWebhookURL, TestUpdateChannelPreservesWebhookURLOnRename).

[MEDIUM] Raw Grafana error bodies returned to clientsGrafanaError.Message is now redacted with the same key-based redactor before it rides back to the browser, so a Grafana/proxy error echoing the (secret-carrying) request body can't leak credentials.

[MEDIUM] Hand-rolled routes bypass validation/body limits — Added http.MaxBytesReader (1 MiB) on the request body, and the hand-written path now enforces the proto's buf.validate bounds (channel name ≤255, silence comment ≤1024, device-id count ≤500) until the generated Connect handlers + validate interceptor take over.

[MEDIUM] DNS rebinding bypasses SSRF checks — Unchanged from round two: mitigated at the application layer (fail-closed resolution) and documented on checkDestinationHost. Closing the rebinding window requires re-resolving at dial time, which only Grafana's own egress path controls — that's a network-boundary/infra fix, not something fleet-api can guarantee. Tracking separately rather than implying a code-level fix here closes it.

[MEDIUM] HistoryService outside the proto contract — Acknowledged; this is part of the same generated-handler migration the handler header documents. When notificationsv1connect is wired, HistoryService will be added to the proto and regenerated alongside the others. No generated client consumes this surface today (the hand-written UI client is the only caller), so there's no client to break in the interim.

🤖 Generated with Claude Code

@illegalprime

Copy link
Copy Markdown
Contributor Author

Fourth-round review — risk is back at HIGH on two findings. Triage:

[HIGH] Channel destination updates preserve old secrets — Valid, and a regression from my round-1 secret-carry. This is already fixed in the working tree (in-progress on the branch): the secret carry is now gated on the destination being unchanged — a webhook URL or SMTP host/port change without a fresh secret drops the stored credential instead of delivering the old bearer token / SMTP password to the new destination. A rename or recipients-only edit still preserves it. It'll land with the next push of the in-progress channel work on this branch.

[HIGH] DNS rebinding bypasses SSRF check — Unchanged position: this cannot be closed in fleet-api. Pinning the resolved IP and handing Grafana an IP instead of the hostname would break TLS SNI / the Host header for real destinations (Slack, PagerDuty). The validation is an application-level backstop (fail-closed resolution, documented on checkDestinationHost); the actual guarantee has to come from egress policy at the Grafana container's network boundary, or a delivery proxy that pins/validates at dial time. Flagging for a human risk-acceptance decision + an infra follow-up rather than implying a code fix exists here.

[MEDIUM] Grafana error redaction passes non-JSON bodies through — Fixed in 988f58a. redactSecrets no longer returns non-JSON bodies verbatim (key-based redaction can't scrub HTML/plaintext that echoes the payload); it returns a length marker. The client-facing GrafanaError message is now the generic status text for non-JSON responses, surfacing only Grafana's own structured JSON error after redaction.

[MEDIUM] Saved channel tests ignore the saved channel ID — Valid; TestChannel tests the request body, and with the URL now redacted on reads the UI sends a host-only URL + empty secret, so testing a saved channel hits the wrong destination. The right fix (load + ownership-check the stored contact point by id, test its stored settings, use the request body only for unsaved preview) lives in service.go, which is under active concurrent edit for the Slack channel work — I'm deferring it to that change rather than racing the same file. Noting it here so it isn't lost.

[MEDIUM] History endpoint outside the proto contract — Same disposition as prior rounds: part of the tracked hand-written-JSON → generated-Connect-handler migration documented in the handler header. When notificationsv1connect is wired, HistoryService gets added to the proto and regenerated with the rest. No generated client consumes it today.

🤖 Generated with Claude Code

@illegalprime

Copy link
Copy Markdown
Contributor Author

Fifth-round review (MEDIUM). Two fixed in 97ca093, three deferred with consistent rationale:

[MEDIUM] Grafana JSON error bodies leak secrets in generic string fields — Fixed. My round-4 redaction was key-based only, so a secret echoed inside a value (e.g. {"message":"failed to POST to https://hooks.slack.com/.../TOKEN"}) slipped through. redactValue now also scrubs webhook-URL and Bearer <token> substrings out of every string value, wherever they sit — applied to both the logged body and the client-facing GrafanaError message. (SMTP passwords inside free-text aren't pattern-matchable generically; Grafana doesn't echo them in practice, and the secret-keyed cases are already covered.)

[MEDIUM] Fleet API always receives the Grafana admin password — Fixed in the deployment overlay. It previously hard-required GRAFANA_ADMIN_PASSWORD for fleet-api even with a token configured, so the admin credential sat in fleet-api's env regardless. The basic-auth password now defaults empty and is an explicit opt-in fallback — a token deployment carries no admin credential. ⚠️ This needs a companion run-fleet.sh change to set FLEET_METRICS_GRAFANA_PASSWORD for no-token deployments (otherwise basic auth gets an empty password); I can't see/test that script here, so flagging it for whoever owns the deploy tooling. The full "fail startup if neither token nor explicit fallback is present" check is a server-side follow-up.

[MEDIUM] Hand-written JSON not Connect/protobuf-compatible & [MEDIUM] History endpoint missing from proto — Both are the same tracked item: the hand-written JSON routes are the documented interim surface (see the handler header) until the notificationsv1connect bindings are served. The Slack work just regenerated the proto, so the bindings are moving; HistoryService gets added and the UI switched to generated bindings as part of that migration. No generated/SDK client consumes this surface today — the hand-written TS client is the only caller — so there's no client mismatch in practice yet.

[MEDIUM] SSRF is only a pre-resolution check (DNS rebinding) — Unchanged: not closable in fleet-api without breaking TLS SNI/Host for real destinations. Application-level fail-closed validation stays as the backstop; the guarantee needs egress policy at the Grafana container boundary or a pinning delivery proxy. Flagged for human risk-acceptance + infra. (Noted the CGNAT range suggestion — worth folding into the host check if/when we extend it.)

🤖 Generated with Claude Code

@illegalprime

Copy link
Copy Markdown
Contributor Author

Sixth-round review (MEDIUM). One fixed, two are the same tracked migration item:

[MEDIUM] Saved channel tests don't use the saved destination — Fixed in c587607. This was the round-4 item I'd deferred while service.go was under active concurrent edit (the Slack work has since landed, so it's safe to touch now), and it was made worse by my own read-time URL redaction. TestChannel now, when given a channel id, resolves the owned Grafana contact point (ownership-checked) and tests its stored settings — the full webhook URL and Grafana's secure fields — instead of the redacted read-model the UI echoes back. Unsaved definitions (no id) still validate + test the request payload for the "Test before save" flow. The handler also now surfaces TestChannel's service error (unknown/foreign id, invalid destination, Grafana unreachable) as a 4xx/5xx rather than discarding it, distinct from a delivery failure reported in the body. Covered by two new tests (stored-URL probe + foreign-channel rejection).

Caveat noted in the finding: a saved Slack channel still can't be fully tested because Grafana stores the Slack URL as a secure field and doesn't return it on reads — same fundamental limitation as any stored secret, and not regressed by this change (it failed before too). Fully fixing it would require fleet-api to keep a parallel copy of the secret, which the design deliberately avoids.

[MEDIUM] Manual wire format diverges from generated protobuf & [MEDIUM] History endpoint not in proto — Same tracked item as prior rounds: the hand-written JSON routes are the documented interim surface until the notificationsv1connect bindings are served (handler header documents this). The contract mismatch (lowercase enum tokens vs CHANNEL_KIND_WEBHOOK, missing HistoryService) is real but only matters once a generated client/handler is actually wired — the hand-written TS client is the sole consumer today. The Slack work regenerated the proto, so the bindings exist; serving them (and adding HistoryService to the proto) is the migration's job, tracked separately rather than partially done here.

The DNS-rebinding item moved to the reviewer's Notes this round ("deployment should enforce Grafana egress at the network layer") — agreed, that's the resolution: infra-level egress, not a fleet-api code change.

🤖 Generated with Claude Code

@illegalprime

Copy link
Copy Markdown
Contributor Author

Seventh-round review — converged. No new actionable findings this round; both remaining items are ones already dispositioned over prior rounds, and I'm not making code changes for them (flagging here so it's explicit rather than re-answering each re-review):

  1. Destination checks don't enforce Grafana egress (DNS rebinding / redirects) — Confirmed not fixable in fleet-api: handing Grafana a pinned IP breaks TLS SNI/Host for real destinations, and Grafana owns the delivery-time resolution + redirect behavior. The reviewer's own recommendation is the resolution — deny egress to loopback/RFC1918/link-local/metadata at the Grafana container's network boundary (an infra/deployment change). fleet-api keeps its fail-closed pre-resolution check as a best-effort backstop. Tracking as an infra task, not a code fix on this PR.

  2. HTTP surface diverges from generated protobuf + HistoryService not in proto — The hand-written JSON routes are the documented interim surface (see the handler package header) until the notificationsv1connect bindings are served; that migration is where the enum-name alignment and HistoryService proto declaration belong. The only consumer today is the hand-written TS client, so there's no generated-client mismatch in practice yet. Tracked with the generated-handler migration.

All concrete, code-level findings from rounds 1–6 (RBAC bypass, org-wide silences, secret-log/error leakage, SSRF pre-checks, secret-reuse-on-destination-change, secret redaction depth, token-only Grafana auth, saved-channel testing) are addressed and on the branch. The two items above are an infra task and a tracked refactor respectively — both appropriate to handle outside this preliminary-UI PR.

I'll continue watching and will only follow up if a future re-review surfaces something genuinely new.

🤖 Generated with Claude Code

illegalprime and others added 12 commits June 15, 2026 10:10
Addresses the Codex security review on PR #306:

- RBAC bypass (HIGH): add notification:read / notification:manage to
  the authz catalog and gate every notifications route through
  middleware.RequirePermission. The handler now loads effective
  permissions after session validation, mirroring the Connect auth
  interceptor. Reads sit on :read; all mutations including
  TestChannel sit on :manage.

- Org-wide silences via empty scopes (HIGH): validate silence scopes
  in the domain layer — every scope kind now requires its target
  (rule_id / group_id / site_id / device_ids). The UI hides the
  group/site/device scope options until their pickers exist.

- Channels not routed to alerts (HIGH): made the UI explicit that
  saved channels are not yet attached to alert routing; policy-tree
  routing lands as a follow-up.

- Secret leakage in Grafana error logs (HIGH): redact
  authorization_credentials, smtpPassword, and other credential
  fields from the request/response bodies logged on non-2xx
  Grafana responses.

- SSRF via webhook/SMTP destinations (MEDIUM): new DestinationPolicy
  rejects destinations that are or resolve to loopback, link-local,
  private, or unspecified addresses; opt out via
  FLEET_METRICS_NOTIFICATIONS_ALLOW_PRIVATE_DESTINATIONS for dev
  stacks (enabled in the dev compose overlay).

- Updates wiping stored secrets (MEDIUM): UpdateChannel now carries
  the existing settings' secret field into the PUT payload when the
  request doesn't include a fresh secret, so renames and destination
  edits no longer clear webhook credentials or SMTP passwords.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-ups from the Codex re-review on PR #306:

- DNS rebinding caveat (HIGH, partial): destination DNS lookups now
  fail closed — an unresolvable host is rejected instead of waved
  through. The rebinding residual (re-resolution at Grafana delivery
  time) is documented on checkDestinationHost; a hard guarantee needs
  egress enforcement at the Grafana container's network boundary.

- Webhook URLs in error logs (MEDIUM): "url" added to the redacted
  log keys — webhook URLs routinely embed capability tokens (Slack,
  PagerDuty) in the path or query.

- Silence regex injection (MEDIUM): device ids in device-scoped
  silences are validated against the identifier alphabet, escaped
  with regexp.QuoteMeta, and compiled into an anchored regex matcher
  so a crafted id like ".*" can't widen the silence org-wide.
  matchersToScope strips the anchors/escapes on reads.

- ADMIN backfill (MEDIUM): migration 000087 seeds notification:read /
  notification:manage and backfills them onto existing ADMIN roles —
  the additive reconciler intentionally never re-asserts seed keys on
  existing built-ins (mirrors 000069).

- Grafana admin creds in deployment (MEDIUM, partial): the deployment
  overlay now passes FLEET_METRICS_GRAFANA_TOKEN through (it takes
  precedence over basic auth in the client) so operators can move
  fleet-api onto a least-privilege service-account token; the admin
  fallback remains until a token is configured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-ups from the Codex re-review on PR #306:

- Webhook URLs leak to read-only users (MEDIUM): webhook URLs embed
  capability tokens in the path/query, so reads (reachable by
  notification:read) now return them host-only (scheme://host).
  UpdateChannel carries the stored full URL through an edit that
  didn't replace it — an empty or still-redacted URL is treated as
  unchanged, mirroring the secret-carry path; a manage user submitting
  a fresh full URL overrides it.

- Raw Grafana error bodies returned to clients (MEDIUM): GrafanaError
  messages reach the browser via err.Error(), and update payloads now
  carry stored secrets, so a Grafana error echoing the request body
  could leak them. The message is redacted with the same key-based
  redactor used for logs.

- Hand-rolled JSON routes bypass validation/body limits (MEDIUM):
  added http.MaxBytesReader (1 MiB) on the body, and enforce the
  proto's name (255) / comment (1024) / device-id-count (500) bounds
  in the hand-written path until the generated Connect handlers with
  the validate interceptor take over.

DNS rebinding (MEDIUM) is already mitigated at the application layer
(fail-closed resolution) and documented; the network-boundary fix is
infra. HistoryService-outside-proto (MEDIUM) is part of the tracked
generated-handler migration. Both noted on the PR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fourth-round Codex finding: redactSecrets passed non-JSON upstream
error bodies through verbatim, and that string is both logged and
returned to the browser via GrafanaError. A Grafana/proxy HTML or
plaintext error that echoes the request payload could leak bearer
tokens, SMTP passwords, or secret-bearing webhook URLs — key-based
redaction can't reach secrets in unstructured text.

- redactSecrets now returns a length marker for non-JSON input
  instead of the raw body.
- The client-facing GrafanaError message is the generic status text
  for non-JSON responses; only Grafana's own structured (JSON) error
  is surfaced, after key redaction.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fifth-round Codex security findings:

- Grafana JSON error bodies can leak secrets in generic string fields
  (MEDIUM): key-based redaction misses a secret Grafana echoes inside
  a value like {"message":"failed to POST to https://hooks/.../TOKEN"}.
  redactValue now scrubs webhook-URL and bearer-token substrings out
  of every string value, wherever they sit — applied to both the log
  body and the client-facing GrafanaError message.

- Fleet API always receives the Grafana admin password (MEDIUM): the
  deployment overlay hard-required GRAFANA_ADMIN_PASSWORD for
  fleet-api even when a service-account token is configured, so a
  fleet-api compromise leaked the admin credential. The basic-auth
  password now defaults empty and is an explicit opt-in fallback; a
  token deployment no longer carries the admin credential in
  fleet-api's environment. (Needs a companion run-fleet.sh change to
  set the password for no-token deployments — flagged on the PR.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sixth-round Codex finding (also a regression surfaced by the read-time
URL redaction): TestChannel ignored Channel.ID and always probed the
request payload. Since reads now hand the UI a redacted destination
(webhook URLs host-only, Slack URLs omitted), clicking "Test" on a
saved channel hit the wrong target — or a host root that isn't the
real destination.

- When an id is supplied, TestChannel now resolves the owned Grafana
  contact point (ownership-checked via findOwnedChannel) and tests its
  stored settings — the full URL and Grafana's secure fields. An
  unsaved definition (no id) still validates and tests the request
  payload for the "Test before save" flow.
- The handler now surfaces TestChannel's service error (unknown/foreign
  id, invalid destination, Grafana unreachable) as a 4xx/5xx instead of
  discarding it, distinct from a delivery failure reported in the body.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications.ui branch from c587607 to 223114e Compare June 15, 2026 14:10
@github-actions github-actions Bot removed the dependencies Pull requests that update a dependency file label Jun 15, 2026
Codex finding: the value-level bearer redaction used a narrow
character class (Bearer\s+[A-Za-z0-9._-]+), so a base64 / base64url /
JWT token containing +, /, =, ~, or : was only partially scrubbed —
the suffix after the first unmatched character leaked into fleet-api
logs and the browser-facing error.

Switch to a delimiter-based pattern (Bearer\s+[^\s"']+) that consumes
the whole token up to the next whitespace or quote, matching the URL
pattern's approach. Tests cover base64, JWT, and punctuation-bearing
tokens.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

New re-review surfaced one genuinely new, actionable finding — fixed in 497d229:

[MEDIUM] Bearer-token redaction can leak token suffixes — Valid bug in the round-5 value-level redaction I added. bearerValuePattern used a narrow class (Bearer\s+[A-Za-z0-9._-]+), so a base64/base64url/JWT token containing + / = ~ : was only partially scrubbed — the suffix after the first unmatched character leaked into logs and the error response. Switched to a delimiter-based Bearer\s+[^\s"']+ (consumes the whole token up to whitespace/quote, matching the URL pattern's approach). Added tests for base64, JWT, and punctuation-bearing tokens.

The other two findings are the previously-converged items — proto/Connect wire-contract divergence and HistoryService not in proto — both part of the tracked generated-handler migration (hand-written JSON is the documented interim surface; the sole consumer today is the hand-written TS client). No code change here; dispositioned as before.

🤖 Generated with Claude Code

Codex finding: the dashboard unconditionally rendered
ActiveNotificationsCard, whose HistoryTable fetches notification
history on mount. The server gates that RPC on notification:read, so
any authenticated user without it (e.g. FIELD_TECH, custom roles) hit
a guaranteed 403 and an error state on the main dashboard.

- Render the dashboard notification card only when the user holds
  notification:read.
- Add requiredPermission: "notification:read" to the
  /settings/notifications nav item so the link is hidden from users
  who can't use the page.

Server-side RBAC was already correct; this fixes the client-side UX
regression. Gating the manage-only mutation controls (Add/Test/Delete,
pause/resume) on notification:manage is a follow-up better done
alongside the in-flight notifications UI work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

One genuinely new finding addressed in 1ae32d8:

[LOW] Dashboard fetches notification history for users without notification:read — Real RBAC/UX regression: the dashboard rendered ActiveNotificationsCard unconditionally, and its HistoryTable fetches history on mount, which the server correctly gates on notification:read — so any user lacking that permission (FIELD_TECH, custom roles) hit a guaranteed 403 + error state on the main dashboard. Fixed:

  • The dashboard card now renders only when the user holds notification:read (useHasPermission).
  • Added requiredPermission: "notification:read" to the /settings/notifications nav item so the link is hidden from users who can't use the page.

Server-side RBAC was already correct; this is purely the client UX gap. The third sub-recommendation — gating the manage-only controls (Add/Test/Delete channel, pause/resume rule, add silence) on notification:manage — is defense-in-depth UX (the server already enforces it) and touches several components under active edit for the Slack work, so I've left it as a follow-up for that UI pass rather than sprawl across it now.

[MEDIUM] RPC routes don't match the protobuf contract — Unchanged disposition: the tracked generated-handler migration (hand-written JSON is the documented interim surface; sole consumer is the hand-written TS client). No churn.

🤖 Generated with Claude Code

…JSON

Addresses the recurring security-review finding that the notifications
routes spoke a hand-written snake_case/lowercase-enum JSON contract
divergent from the committed protobuf, and that HistoryService was
served but undeclared in the proto.

Proto:
- Add HistoryService + ListNotifications messages + NotificationHistoryEntry
  so the history endpoint is part of the declared contract, and regenerate
  Go + TS bindings.

Server:
- Replace the hand-written http.Handler + wire DTOs with generated
  notificationsv1connect ChannelService/RuleService/SilenceService/
  HistoryService implementations that convert protobuf <-> domain models.
- Mount them through the shared interceptor chain (li): authentication,
  buf.validate request validation, error mapping, request-log redaction.
- Register all 13 procedures in ProcedurePermissions (notification:read /
  notification:manage), the contract test's registeredServices,
  SessionOnlyProcedures (preserving the prior session-only behaviour),
  and RedactedRequestProcedures (channel mutations carry secrets).

All security logic — org scoping, secret redaction, SSRF destination
checks, silence scope/device-id validation, secret-carry-on-update,
saved-channel testing — is unchanged: it lives in the domain service,
which the new handlers call exactly as before.

Client:
- Talk to the generated Connect clients; map protobuf <-> the existing
  view types at the API boundary so the store and components are
  untouched. Removes the bespoke fetch/JSON wrapper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

Done — the notifications routes now use the generated notificationsv1connect Connect handlers instead of the hand-written JSON surface (7432d65). This resolves the two findings I'd been tracking as a deferred migration:

Proto/Connect wire-contract divergence — The hand-written snake_case fields + lowercase enum tokens are gone. The server mounts generated ChannelService / RuleService / SilenceService / HistoryService handlers that convert protobuf↔domain, and the client talks to the generated Connect clients. The wire is now real protojson (camelCase, CHANNEL_KIND_WEBHOOK, int64-as-string, Timestamp) — a generated client interoperates.

HistoryService not in the proto — Added HistoryService + ListNotifications{Request,Response} + NotificationHistoryEntry to notifications.proto and regenerated Go/TS. It's now a declared, discoverable part of the contract.

How it's wired:

  • Generated handlers run through the shared interceptor chain — authentication, buf.validate request validation (replacing the hand-rolled length/body checks), error mapping, and request-log redaction.
  • All 13 procedures are registered in ProcedurePermissions (notification:read / notification:manage), the RPC contract test's registeredServices, SessionOnlyProcedures (preserving the prior session-only behaviour — no API-key auth), and RedactedRequestProcedures (channel mutations carry secrets).
  • Every security control from the earlier rounds is untouched: it all lives in the domain service (org scoping, secret redaction, SSRF checks, silence scope/device-id validation, secret-carry-on-update, saved-channel testing), which the new handlers call exactly as before.
  • Client change is confined to the API boundary: notificationsApi.ts maps protobuf↔the existing view types, so the store and components are unchanged.

Verified: server build + golangci-lint clean, RPC contract test passes (all 13 procedures classified), notifications domain tests pass, client tsc + eslint clean, pre-push hooks green.

🤖 Generated with Claude Code

Codex finding: ListNotifications keyset-paginates by
(organization_id, id < cursor ORDER BY id DESC), but the only indexes
on notification_history are on received_at — so each history page
filters/scans rows, and the dashboard fetches this for every
notification:read user.

Add idx_notification_history_org_id (organization_id, id DESC) so each
page is an index range scan. New migration; the deployed 000083 is
left untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

The migration landed cleanly — this round's Notes confirm the new RPCs are session-only, RBAC-gated, on the shared interceptor chain, and the proto-divergence/HistoryService findings are gone. One genuinely new finding, fixed:

[MEDIUM] History pagination lacks a matching index — Fixed in 50b913b. ListNotifications keyset-paginates by (organization_id, id < cursor ORDER BY id DESC), but notification_history's indexes were all on received_at, so each page scanned/filtered — and the dashboard card fetches this for every notification:read user. Added idx_notification_history_org_id (organization_id, id DESC) in a new migration (000088) so each page is an index range scan; the deployed 000083 is untouched.

[MEDIUM] DNS-rebinding SSRF — Unchanged disposition: this is the standing infra-level item. fleet-api keeps its fail-closed pre-resolution check as an advisory backstop, but the hard guarantee has to be an egress boundary at the Grafana container (firewall/proxy blocking private/link-local/loopback/host-gateway/metadata) — the reviewer's recommendation, and not a fleet-api code change. Flagged for infra/risk-acceptance.

🤖 Generated with Claude Code

Extract channelDestinationFields() — the kind/webhook/smtp/slack
transformation was copy-pasted across createChannel, updateChannel,
and testChannel. Pure cleanup; no behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications.ui branch from a5a1b58 to 3c5818f Compare June 15, 2026 15:54
@illegalprime illegalprime force-pushed the eden/notifications.ui branch from 3c5818f to d25296b Compare June 15, 2026 15:56
illegalprime added a commit that referenced this pull request Jun 15, 2026
…ioned admin credential

The production notifications overlay left FLEET_METRICS_GRAFANA_PASSWORD
defaulting to empty, so without a manually-provided token or password
fleet-api fell back to admin with an empty password and every Grafana-proxying
notification RPC 401'd while the stack still came up healthy. Default the
basic-auth password to the GRAFANA_ADMIN_PASSWORD that run-fleet.sh already
provisions for Grafana; a service-account token (FLEET_METRICS_GRAFANA_TOKEN)
still takes precedence and avoids handing fleet-api the admin credential.

Addresses Codex security review finding (MEDIUM) on #306.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

Superseded by a 5-PR stack that splits this work into reviewable layers:

  1. notifications (1/5): proto definitions and generated code #452 — proto definitions and generated code
  2. notifications (2/5): notification_history keyset listing #453 — notification_history keyset listing
  3. notifications (3/5): notification:read and notification:manage permissions #454 — notification:read / notification:manage permissions
  4. notifications (4/5): Grafana-backed service, handlers, and server wiring #455 — Grafana-backed service, handlers, and server wiring
  5. notifications (5/5): notifications UI #456 — notifications UI

The stack tip is byte-identical to this branch, plus the fix for the [MEDIUM] Codex finding (prod overlay now defaults FLEET_METRICS_GRAFANA_PASSWORD to the provisioned GRAFANA_ADMIN_PASSWORD), which landed in #455.

The outstanding [HIGH] Codex finding (Grafana egress SSRF enforcement) is not addressed here and should be tracked separately against #455. Closing this in favor of the stack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant