Skip to content

notifications (4/5): Grafana-backed service, handlers, and server wiring#455

Closed
illegalprime wants to merge 3 commits into
mainfrom
eden/notifications-4-server
Closed

notifications (4/5): Grafana-backed service, handlers, and server wiring#455
illegalprime wants to merge 3 commits into
mainfrom
eden/notifications-4-server

Conversation

@illegalprime

@illegalprime illegalprime commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Stack 4/5 — base: eden/notifications-3-authz (#454)

1. Summary

This PR makes the Notifications API real: operators can create and test alert-delivery channels (webhook / SMTP / Slack), browse the org's provisioned alert rules and pause/resume them, and schedule maintenance windows that mute alerts for a rule, group, site, or set of devices. fleet-api implements all four notifications.v1 Connect services as a thin, org-scoped proxy in front of a Grafana sidecar — translating each RPC into Grafana HTTP API calls and persisting all state in Grafana, with no new notification tables of its own. Channel mutations are marked secret-redacted and the whole API is session-only so a leaked API key can't reconfigure delivery or silence alerts.

2. How it works

The server owns no notification storage; Grafana is the source of truth and fleet-api is a translation + isolation layer in front of it.

Request path (every RPC): client → Connect handler → authorize() runs the per-route RBAC gate (notification:read for lists, notification:manage for mutations) and extracts the caller's organization_id from the session (a zero org id fails closed) → the domain Service does org-scoped work → the Grafana HTTP adapter calls the sidecar → state lands in Grafana (which itself queries TimescaleDB for alert evaluation).

The three Grafana surfaces each concept maps to:

  • Channel → Grafana contact point via the provisioning API (/api/v1/provisioning/contact-points). Test posts a synthetic alert to .../test.
  • Rule → Grafana alert rule via the provisioning API (read-only list + single-rule get). No create/update/delete: rules are YAML-provisioned and Grafana 11.6+ blocks in-place API edits.
  • MaintenanceWindow + rule pause → Grafana silence via the Alertmanager API (/api/alertmanager/grafana/api/v2/silences).

Org isolation is enforced entirely inside the domain layer because Grafana does not sandbox these objects per Proto-Fleet org:

  • Channels — every contact point name is prefixed org-<id>-; reads filter by prefix, writes inject it, and an update/delete first re-verifies the prefix before touching the row.
  • Rules — a rule is visible if it has no organization_id label (a global YAML default) or one matching the caller.
  • Maintenance windows / pauses — every silence carries an organization_id=<id> equality matcher; reads filter on it, writes always inject it. A cross-org id never matches, so it surfaces as not-found rather than acting as a list oracle.

Secrets are write-only. Webhook bearer credentials, SMTP passwords, and Slack URLs are accepted on writes but never echoed on reads (the proto omits them; webhook URLs are reduced to scheme://host). On an edit with no fresh secret, the stored secret is carried forward — but only if the destination is unchanged, so an old credential can never be replayed against a new destination. Grafana request/response bodies are scrubbed (known secret keys redacted, plus regex scrubbing of URLs and Bearer … substrings) before they reach logs.

Pause/resume does not flip Grafana's isPaused (forbidden on YAML-provisioned rules). Instead PauseRule writes a marker "pause silence" — an org-scoped silence carrying __alert_rule_uid__=<rule> plus a proto_fleet_pause=true matcher and a far-future end time. ResumeRule deletes that silence. List responses overlay active pause silences onto each rule's Enabled flag, and pause silences are hidden from the maintenance-window list since they're an implementation detail.

Maintenance-window scope → matchers. A structured scope (rule / group / site / device-ids) compiles to Alertmanager matchers alongside the org matcher. Device lists become an anchored, regex-escaped alternation (^(?:a|b)$); ids are restricted to a safe alphabet and capped at 500 so a crafted id like .* can't widen the mute. Targetless scopes are rejected (they'd compile to just the org matcher and silence everything).

SSRF pre-flight. Because Grafana opens the outbound connection, a user-supplied webhook/Slack URL or SMTP host is an SSRF vector. Before any write, fleet-api parses the destination, requires http(s), and resolves the host — rejecting loopback / link-local / private / unspecified addresses (DNS failure fails closed). This is a guardrail, not the egress boundary (see decisions).

History (ListNotifications) is the one read backed by Proto Fleet's own DB: it clamps page size (default 50, max 200), fetches one extra row to compute has_more exactly, and calls the #453 keyset store.

Auth to Grafana. The adapter prefers a bearer service-account token and falls back to basic auth only when no token is set. run-fleet.sh mints a least-privilege token post-boot (an fleet-api service account, Editor on org 1), writes it to .env (chmod 600), and recreates fleet-api so it stops using the admin password; the basic-auth password now defaults empty.

flowchart LR
  client["UI / API client"] -->|"Connect RPC"| handler["notifications Connect handlers"]
  handler -->|"RBAC gate + org id"| authz["interceptors + middleware<br/>(session-only, redacted)"]
  authz --> svc["notifications.Service<br/>org-scoping, secret redaction,<br/>SSRF pre-flight, scope compile"]
  svc --> adapter["Grafana HTTP adapter<br/>(grafana_client.go)"]
  adapter -->|"provisioning + Alertmanager API"| grafana["Grafana sidecar"]
  grafana --> tsdb[("TimescaleDB")]
  handler -->|"ListNotifications only"| histstore["notification_history store (#453)"]
  histstore --> tsdb
Loading

CreateChannel (representative mutating flow)

sequenceDiagram
  participant C as Client
  participant H as Handler
  participant S as Service
  participant G as Grafana adapter
  participant GF as Grafana
  C->>H: CreateChannel(name, kind, secret)
  H->>H: authorize(notification:manage) -> orgID
  H->>S: CreateChannel(orgID, channel)
  S->>S: validateDestination (scheme + SSRF host check)
  S->>S: prefix name "org-<id>-", encode settings
  S->>G: POST /provisioning/contact-points
  G->>GF: create contact point
  GF-->>G: 2xx (secret stripped)
  G-->>S: contact point
  S-->>H: Channel (HasSecret=true, secret value omitted)
  H-->>C: CreateChannelResponse
Loading

Rule paused/enabled state (via pause silence, not isPaused)

stateDiagram-v2
  [*] --> Enabled
  Enabled --> Paused: "PauseRule writes org-scoped pause silence"
  Paused --> Enabled: "ResumeRule deletes pause silence"
  Paused --> Paused: "PauseRule (idempotent)"
  Enabled --> Enabled: "ResumeRule (idempotent)"
  note right of Paused
    ListRules ORs an active pause
    silence onto the rule's Enabled flag.
    A YAML isPaused stays paused even after Resume.
  end note
Loading

3. Areas of the code involved

Area / package / file What changed Why it matters for review
server/internal/domain/notifications/grafana_client.go (new) Grafana HTTP adapter: contact-point / alert-rule / silence calls, X-Disable-Provenance on writes, token-vs-basic-auth, secret redaction in logs/errors. Keeps Grafana's "silence" vocabulary because it models the Alertmanager API. The only boundary that talks to Grafana. Check the API paths, the GrafanaError surfacing, and that redaction covers logs and returned error messages.
server/internal/domain/notifications/service.go (new) Domain logic: org isolation (name prefix / label / org matcher), write-only secret carry-forward + redaction, SSRF pre-flight, pause-via-silence, scope→matcher compilation, no rule CUD. The core of the PR and where all isolation/secret/SSRF guarantees live. Highest-scrutiny file.
server/internal/domain/notifications/models.go (new) Domain types: Channel/Rule/MaintenanceWindow/MaintenanceWindowScope, kinds, validation state. Read shapes deliberately exclude secret values.
server/internal/handlers/notifications/handler.go (new) Connect handlers for all four services; per-route RBAC + org extraction; proto⇄domain mapping; ListNotifications page-size clamp + has_more. Thin glue. Confirm every mutation gates on notification:manage, reads on notification:read, and secrets aren't mapped back into responses.
server/cmd/fleetd/main.go Constructs the Grafana client + service; mounts Channel / Rule / MaintenanceWindow / History handlers. Wiring only.
server/internal/handlers/interceptors/config.go Channel mutations → redacted-request; all notification procedures → session-only. Defense-in-depth: a leaked API key cannot reconfigure delivery or mute alerts.
server/internal/infrastructure/metrics/provider.go Adds embedded GrafanaConfig and DestinationPolicy config (env-prefixed). Config surface for Grafana connection + SSRF policy.
deployment-files/run-fleet.sh Mints/reuses a least-privilege Grafana service-account token post-boot, persists to .env, recreates fleet-api. Loud failure on restart error. Operational. Check idempotency and that failures aren't swallowed.
deployment-files/docker-compose.notifications.yaml, server/docker-compose.notifications.yaml Token env wiring; prod basic-auth password defaults empty; dev allows private destinations. Verify prod never holds the Grafana admin password by default.
server/internal/handlers/middleware/rpc_permissions_test.go Registers the four services in the contract-test fixture (moved here from #454, where they aren't mounted). Keeps TestRPCContract_RegisteredServicesMatchMainMux green.
server/internal/domain/notifications/service_test.go, grafana_client_test.go (new) Unit tests for redaction, SSRF validation, scope validation/compilation, secret preservation, cross-org rejection. Primary correctness evidence.

Generated pb.go / connect / pb.ts are not in this PR (they ship in #452) — nothing here is generated.

4. Key technical decisions & trade-offs

  • Proxy Grafana rather than own notification state — no new tables/duplication, at the cost of every RPC being a live Grafana call and isolation living in app code instead of the datastore.
  • Org isolation in the domain layer (name prefix / label / matcher) — chosen because Grafana doesn't sandbox these objects per Proto-Fleet org; the alternative (Grafana orgs/folders per tenant) was heavier and not adopted.
  • Pause via marker silence, not isPaused — required because Grafana 11.6+ forbids the provisioning API from editing YAML-provisioned rules; trade-off is an overlay computed at list time and a hidden "pause silence" class.
  • Secret carry-forward gated on unchanged destination — preserves stored secrets on benign edits while guaranteeing an old credential can't be delivered to a new destination; simpler "always carry" was rejected as unsafe.
  • Service-account token over admin basic-auth — a least-privilege Editor token minted post-boot; prod basic-auth password defaults empty so a default deploy never holds the Grafana admin credential.
  • Cross-org id → not-found, not permission-denied detail — avoids turning id probes into an existence oracle.

5. Testing & validation

Covered (unit): secret redaction (known keys, empty-value preservation, arrays, secrets embedded in string values, punctuation-bearing bearer tokens, non-JSON bodies); SSRF destination validation; maintenance-window scope validation + device-id regex compilation/rejection (.*, alternation, anchors); webhook/SMTP/Slack secret preservation, replacement, and redaction; cross-org rejection; the RPC contract test confirming all four services are mounted.

Explicitly NOT covered / known limitations:

  • Routing not shipped. Created channels are not yet wired into Grafana's notification-policy tree — they can be listed and tested, but live alerts still go only to the internal receiver.
  • Group/site maintenance windows are inert today. They write group_id / site_id matchers that the current alert rules don't emit, so only rule- and device-scoped windows actually match.
  • SSRF pre-flight is a guardrail, not the egress boundary. Grafana performs the actual outbound delivery and the DNS check isn't rebinding-proof; durable enforcement is Grafana-side egress restriction, tracked separately.
  • No end-to-end test against a live Grafana sidecar in this PR.

🤖 Generated with Claude Code

@illegalprime illegalprime requested a review from a team as a code owner June 15, 2026 15:53
@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (74b4320925d9356ca9898364accb2867f5aea2c1...85f424eea112cfcbec15c37a77dfaea85b3bed9c, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Notification destination SSRF validation is not enforced at send time

  • Category: gRPC | Infrastructure
  • Location: server/internal/domain/notifications/service.go:357
  • Description: The new notification service validates webhook/Slack/SMTP hosts once in fleet-api, then stores the destination in Grafana. Grafana later performs the actual outbound request. The code explicitly notes this is “not rebinding-proof,” and the compose changes do not add a Grafana egress boundary. A user with notification:manage can pass validation with a public host, then use DNS rebinding or an HTTP redirect from that host to make Grafana call loopback, link-local metadata, Docker bridge services, or host-gateway reachable services.
  • Impact: This creates an SSRF path from the Grafana container despite AllowPrivateDestinations=false. Depending on deployment, it can reach internal-only services, cloud metadata endpoints, or Proto Fleet host services.
  • Recommendation: Do not rely on pre-storage DNS classification for destinations that Grafana sends later. Enforce private-address blocking at Grafana’s network boundary, or route notifications through a fleet-api relay that resolves, pins, dials, and redirect-checks the final destination with private/link-local/loopback ranges blocked on every send/test.

[MEDIUM] Grafana admin password is exposed in process arguments during token provisioning

  • Category: Infrastructure
  • Location: deployment-files/run-fleet.sh:923
  • Description: The new provisioning function passes GRAFANA_ADMIN_PASSWORD to curl via -u "admin:${admin_pass}" on several calls. On typical Linux hosts, process arguments are readable by other local users through ps or /proc/<pid>/cmdline while the command runs.
  • Impact: A local unprivileged user could capture the Grafana admin password and then access the loopback-only Grafana UI/API from the same host.
  • Recommendation: Avoid putting secrets in command-line arguments. Use a temporary 0600 netrc/config file with a cleanup trap, or another mechanism that passes credentials via stdin/file descriptor rather than argv.

Notes

No pool configuration, miner command execution, plugin execution, frontend rendering, protobuf wire-format changes, or nmap/discovery code was changed in .git/codex-review.diff. I reviewed only the supplied diff scope.


Generated by Codex Security Review |
Triggered by: @illegalprime |
Review workflow run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8c55e88d7a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread deployment-files/docker-compose.notifications.yaml
Comment thread server/internal/domain/notifications/service.go Outdated
Comment thread server/internal/domain/notifications/service.go
Comment thread server/internal/domain/notifications/service.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cdd7d2724

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/service.go
Comment thread server/internal/handlers/notifications/handler.go Outdated
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 507a4e5 to 3bd492f Compare June 15, 2026 17:37
illegalprime added a commit that referenced this pull request Jun 15, 2026
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 6cdd7d2 to 6655a15 Compare June 15, 2026 17:37
@illegalprime

Copy link
Copy Markdown
Contributor Author

Codex findings on this PR:

  • [MEDIUM] prod overlay defaults fleet-api to Grafana admin creds — fixed (commit 6655a15). Reverted the admin-password default back to empty and instead provision a least-privilege Grafana service-account token (Editor on org 1) post-boot in run-fleet.sh, persisted to .env, with fleet-api recreated to pick it up. Basic-auth password stays empty by default, so a token deployment never holds the admin credential. Idempotent across re-runs.
  • [MEDIUM] destination validation bypass / SSRF — not fixed inline; the durable fix is Grafana-side egress enforcement (the pre-flight DNS check in fleet-api isn't the real egress boundary). Tracked in Notifications: enforce Grafana egress restrictions (destination SSRF) #459.

⚠️ The run-fleet.sh token provisioning calls the live Grafana API and could not be end-to-end tested here — please validate on a real --enable-beta-notifications deploy.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6655a150e3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/grafana_client.go Outdated
Comment thread server/internal/domain/notifications/service.go Outdated
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 3bd492f to 66e55db Compare June 15, 2026 17:44
illegalprime added a commit that referenced this pull request Jun 15, 2026
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 6655a15 to 753a5b8 Compare June 15, 2026 17:44
illegalprime added a commit that referenced this pull request Jun 15, 2026
…Ls in parse errors

Two follow-ups from review:
- run-fleet.sh swallowed the post-token fleet-api restart (>/dev/null 2>&1
  || true), so a failed recreate left fleet-api on the pre-token env (still
  401ing) while the script reported success. Check the result and fail loudly
  with manual-restart instructions.
- checkDestinationURL returned url.Parse's error verbatim, which embeds the
  raw input — a webhook/Slack URL can carry a capability token. Return a
  generic 'not parseable' message so the secret can't leak via errors or logs.

Addresses Codex security review findings (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

Re the re-review (the first two are now fixed in commit 75185f6):

  • [MEDIUM] malformed webhook URLs leak tokens through errors — fixed. checkDestinationURL no longer returns url.Parse's error verbatim (it embeds the raw input, which can carry a capability token); returns a generic "not parseable" message instead.
  • [MEDIUM] token provisioning can strand fleet-api on hidden restart failure — fixed. run-fleet.sh no longer swallows the post-token fleet-api recreate (>/dev/null 2>&1 || true); it checks the result and fails loudly with manual-restart instructions.
  • [MEDIUM] TestChannel validates stored channel instead of edited definition — by design (documented at the call site): for a saved channel the echoed-back payload is redacted, so testing the request body would probe a secret-stripped target; we test the stored settings instead. New/unsaved channels are tested from the entered payload directly (AddChannelModal → testChannelApi). Not changing this.
  • [HIGH] DNS-rebinding SSRF — unchanged: tracked in Notifications: enforce Grafana egress restrictions (destination SSRF) #459 (needs Grafana-side egress enforcement, not a fleet-api pre-flight fix).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75185f6639

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +804 to +807
"smtpHost": c.SMTP.Host,
"smtpPort": c.SMTP.Port,
"smtpUsername": c.SMTP.Username,
"fromAddress": c.SMTP.From,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't store SMTP server settings on email contact points

For SMTP channels, these per-channel smtpHost/smtpPort/credential fields are written into the Grafana email contact point, but Grafana OSS reads SMTP server configuration from [smtp]/GF_SMTP_*, not from the contact point; the Grafana email docs require SMTP to be configured in grafana.ini and list the email contact-point settings as addresses/single-email/template fields (https://grafana.com/docs/grafana/latest/alerting/configure-notifications/manage-contact-points/integrations/configure-email/). Since the bundled server/monitoring/grafana/grafana.ini has no [smtp] section, creating/testing an SMTP channel with a host/password here can appear saved while real email delivery still uses an unconfigured global SMTP transport and fails.

Useful? React with 👍 / 👎.

Comment thread server/internal/domain/notifications/service.go Outdated
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 66e55db to d2685fc Compare June 15, 2026 18:15
illegalprime added a commit that referenced this pull request Jun 15, 2026
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 15, 2026
…Ls in parse errors

Two follow-ups from review:
- run-fleet.sh swallowed the post-token fleet-api restart (>/dev/null 2>&1
  || true), so a failed recreate left fleet-api on the pre-token env (still
  401ing) while the script reported success. Check the result and fail loudly
  with manual-restart instructions.
- checkDestinationURL returned url.Parse's error verbatim, which embeds the
  raw input — a webhook/Slack URL can carry a capability token. Return a
  generic 'not parseable' message so the secret can't leak via errors or logs.

Addresses Codex security review findings (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 15, 2026
…K, slack kind-change, and open-ended silences

- Re-add the notification services to registeredServices here (with the main.go
  mounts), so TestRPCContract_RegisteredServicesMatchMainMux passes; the entries
  were moved out of the authz PR where the services aren't mounted yet.
- UpdateContactPoint: Grafana's provisioning PUT answers a 202 Ack, not the
  contact point. Stop decoding the (empty) body and build the UpdateChannel
  response from the contact point we sent, so a successful edit is no longer
  reported as an RPC error.
- UpdateChannel: only keep the stored Slack URL when the channel was already
  Slack; changing kind to Slack without a fresh URL no longer grafts the prior
  kind's secret (e.g. a webhook URL) onto the Slack contact point.
- domainSilenceToGrafana: materialize an open-ended mute's zero EndsAt to the
  far-future sentinel pause silences already use, so Alertmanager (which
  requires endsAt) accepts indefinite user silences.

Addresses Codex connector findings (P1/P2) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 75185f6 to 876a9df Compare June 15, 2026 18:15
@illegalprime

Copy link
Copy Markdown
Contributor Author

codex-connector triage (the earlier #459 reference is folded in here — tracking on this thread, not as issues):

Fixed:

  • [P1] Pass Grafana password to fleet-api — resolved by provisioning a least-privilege service-account token in run-fleet.sh (basic-auth password stays empty by default).
  • [P1] Handle Grafana contact-point update ACK — fixed: PUT returns a 202 Ack, so UpdateChannel now builds its response from the contact point we sent instead of decoding the empty body.
  • [P2] Require a Slack URL when changing channel kind — fixed: keepStoredSlackURL now requires owned.Kind == Slack, so switching kind to Slack can't graft the prior kind's secret.
  • [P2] Require/materialize a silence end time — fixed: a zero EndsAt is materialized to the far-future sentinel pause silences already use, so Alertmanager accepts open-ended mutes.

Won't-change (intentional):

  • [P2] Honor unsaved test definitions when an id is present — the impl deliberately tests the stored channel when an id is set: the echoed-back payload is redacted (would probe a secret-stripped target) and this enforces org ownership of the id (TestTestChannelUsesStoredDestinationForSavedChannel / …RejectsForeignSavedChannel assert both). The proto comment ("unsaved wins when both set") is the part that's out of sync and should be corrected to match the impl via a proto regen (deferred so we don't re-run the proto generator mid-stack).

Deferred — design-level, tracked here (no separate issue):

  • [P1] Route created channels into Grafana notification policies — created contact points aren't wired into the policy tree (single root receiver), so live alerts don't reach user channels. This is the routing feature the UI already marks "not yet shipped"; needs a routing model + policy-tree provisioning. Not a review-fix.
  • [P2 ×2] Group/site silences don't match emitted labels — rules only emit organization_id/device_id, so group_id/site_id matchers never match. Needs either those labels added to the metric/rule pipeline (data-model change) or rejecting those scopes. Rule/device silences work today.
  • [HIGH, security-review] DNS-rebinding SSRF — needs Grafana-side egress enforcement (the fleet-api pre-flight DNS check isn't the egress boundary). Deferred for infra/design.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 876a9df6f9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/service.go Outdated
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from d2685fc to 23cd610 Compare June 15, 2026 18:36
illegalprime added a commit that referenced this pull request Jun 15, 2026
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 15, 2026
…Ls in parse errors

Two follow-ups from review:
- run-fleet.sh swallowed the post-token fleet-api restart (>/dev/null 2>&1
  || true), so a failed recreate left fleet-api on the pre-token env (still
  401ing) while the script reported success. Check the result and fail loudly
  with manual-restart instructions.
- checkDestinationURL returned url.Parse's error verbatim, which embeds the
  raw input — a webhook/Slack URL can carry a capability token. Return a
  generic 'not parseable' message so the secret can't leak via errors or logs.

Addresses Codex security review findings (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 15, 2026
…K, slack kind-change, and open-ended silences

- Re-add the notification services to registeredServices here (with the main.go
  mounts), so TestRPCContract_RegisteredServicesMatchMainMux passes; the entries
  were moved out of the authz PR where the services aren't mounted yet.
- UpdateContactPoint: Grafana's provisioning PUT answers a 202 Ack, not the
  contact point. Stop decoding the (empty) body and build the UpdateChannel
  response from the contact point we sent, so a successful edit is no longer
  reported as an RPC error.
- UpdateChannel: only keep the stored Slack URL when the channel was already
  Slack; changing kind to Slack without a fresh URL no longer grafts the prior
  kind's secret (e.g. a webhook URL) onto the Slack contact point.
- domainSilenceToGrafana: materialize an open-ended mute's zero EndsAt to the
  far-future sentinel pause silences already use, so Alertmanager (which
  requires endsAt) accepts indefinite user silences.

Addresses Codex connector findings (P1/P2) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 876a9df to 5d7cfa0 Compare June 15, 2026 18:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d7cfa0298

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/grafana_client.go
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 23cd610 to 8cd5d9a Compare June 15, 2026 19:24
illegalprime added a commit that referenced this pull request Jun 15, 2026
The prod overlay left FLEET_METRICS_GRAFANA_PASSWORD empty so fleet-api never
held the Grafana admin credential, but that meant a default deployment had no
way to authenticate to Grafana and every Grafana-proxying notification RPC
401'd while the stack came up healthy.

Instead of defaulting fleet-api to the admin password, mint a least-privilege
Grafana service account (Editor on org 1) and token post-boot in run-fleet.sh,
persist it to .env, and recreate fleet-api so it authenticates with the scoped
token. The basic-auth password stays empty by default. Idempotent across
re-runs.

Addresses Codex security review finding (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 15, 2026
…Ls in parse errors

Two follow-ups from review:
- run-fleet.sh swallowed the post-token fleet-api restart (>/dev/null 2>&1
  || true), so a failed recreate left fleet-api on the pre-token env (still
  401ing) while the script reported success. Check the result and fail loudly
  with manual-restart instructions.
- checkDestinationURL returned url.Parse's error verbatim, which embeds the
  raw input — a webhook/Slack URL can carry a capability token. Return a
  generic 'not parseable' message so the secret can't leak via errors or logs.

Addresses Codex security review findings (MEDIUM) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
illegalprime added a commit that referenced this pull request Jun 15, 2026
…K, slack kind-change, and open-ended silences

- Re-add the notification services to registeredServices here (with the main.go
  mounts), so TestRPCContract_RegisteredServicesMatchMainMux passes; the entries
  were moved out of the authz PR where the services aren't mounted yet.
- UpdateContactPoint: Grafana's provisioning PUT answers a 202 Ack, not the
  contact point. Stop decoding the (empty) body and build the UpdateChannel
  response from the contact point we sent, so a successful edit is no longer
  reported as an RPC error.
- UpdateChannel: only keep the stored Slack URL when the channel was already
  Slack; changing kind to Slack without a fresh URL no longer grafts the prior
  kind's secret (e.g. a webhook URL) onto the Slack contact point.
- domainSilenceToGrafana: materialize an open-ended mute's zero EndsAt to the
  far-future sentinel pause silences already use, so Alertmanager (which
  requires endsAt) accepts indefinite user silences.

Addresses Codex connector findings (P1/P2) on #455.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 5d7cfa0 to 382a0e6 Compare June 15, 2026 19:24
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 382a0e6 to ffdd5ad Compare June 16, 2026 14:43
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 8cd5d9a to 341655e Compare June 16, 2026 14:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ffdd5addbf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/service.go Outdated
@illegalprime illegalprime force-pushed the eden/notifications-3-authz branch from 341655e to 9548ca8 Compare June 16, 2026 14:59
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from ffdd5ad to 7b2fdcd Compare June 16, 2026 15:10
Base automatically changed from eden/notifications-3-authz to main June 16, 2026 18:23
Add the Grafana-backed notifications domain service and client, the Connect
handlers, and the fleetd server wiring (mounting the Channel, Rule,
MaintenanceWindow, and History services). run-fleet.sh provisions a
least-privilege Grafana service-account token for fleet-api.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@illegalprime illegalprime force-pushed the eden/notifications-4-server branch from 7b2fdcd to 69f0098 Compare June 16, 2026 18:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69f0098d5d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/handlers/notifications/handler.go Outdated
…ator, SSRF ranges)

- UpdateChannel: only reuse the stored webhook URL when the channel was
  already a webhook, mirroring the Slack fix, so a kind change can't graft
  the prior kind's secret (e.g. a Slack URL) onto a webhook contact point.
- UpdateMaintenanceWindow: carry the matched silence's CreatedBy so an
  update (which has no created_by) doesn't wipe the audit owner.
- checkDestinationHost: reject CGNAT (100.64/10), benchmarking (198.18/15),
  and reserved (240/4) ranges that net.IP.IsPrivate misses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6901a8b145

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


create_body=$(curl -fsS --max-time 10 -u "admin:${admin_pass}" \
-H "Content-Type: application/json" \
-d "{\"name\":\"${sa_name}\",\"role\":\"Editor\",\"isDisabled\":false}" \

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Mint a token with provisioning permissions

In the production --enable-beta-notifications path, this creates the fleet-api service account as Editor, but the service only talks to Grafana's Alerting Provisioning API (/api/v1/provisioning/...) for contact points and rules; Grafana's role mapping assigns fixed:alerting.provisioning:writer to Admin, not Editor (https://grafana.com/docs/grafana/latest/administration/roles-and-permissions/access-control/rbac-fixed-basic-role-definitions/). With the generated token, channel/rule management will authenticate but still get Grafana 403s in deployments enforcing those roles, so mint/update the service account with a role that includes provisioning access.

Useful? React with 👍 / 👎.

TestChannel loads the stored Grafana contact point whenever an id is set
and ignores kind/config, so don't run kind conversion (which rejects the
default CHANNEL_KIND_UNSPECIFIED) for id-only test requests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85f424eea1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +867 to +871
out := Channel{
ID: cp.UID,
OrganizationID: orgID,
Name: channelDisplayName(orgID, cp.Name),
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve channel timestamps when round-tripping Grafana contact points

When channels are created or listed, the only state reconstructed from Grafana is the contact point payload, and this initializes Channel without CreatedAt or UpdatedAt. Since CreateChannel and UpdateChannel both return the result of contactPointToChannel, and ListChannels sorts by CreatedAt, every Grafana-backed channel comes back with the Go zero timestamp rather than the time set just before the write; clients will display bogus dates and ordering is effectively meaningless unless these fields are persisted in settings or copied into the response.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant