notifications (4/6): rules + maintenance windows#496
Conversation
Adds the alerting-management surface on top of the channels foundation. Rules and maintenance windows ship together because they are coupled at the UI: the rule row's kebab attaches/lifts a maintenance window, and the maintenance-window modal needs the rule list to populate its rule picker. Server: the RuleService (list provisioned rules, pause/resume via marker silences) and MaintenanceWindowService (schedule/edit/lift org-scoped silences with rule/group/site/device scope compilation), the shared Grafana silence + alert-rule adapter calls, and the alert-rule/silence domain mappers. Both services are mounted and registered in the RPC contract test and session-only interceptor list. Client: the Rules section (read-only list + pause/resume + maintenance window/lift kebab), the Maintenance Windows section, the add/edit maintenance-window modal with quick presets and scope picker, and the rules+MW parts of the API module, store, and view-model types. History + the dashboard card ship in slice 3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🔐 Codex Security Review
Review SummaryOverall Risk: MEDIUM Findings[MEDIUM] Device Alert Query Failures Can Now Resolve Silently
NotesNo auth bypass, SQL injection, command injection, plugin boundary issue, frontend XSS, pool/wallet hijack, or protobuf wire-format issue was evident in the scoped diff. Generated by Codex Security Review | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd0647bec7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Address PR #496 review: - Pause marker (P1): buildPauseSilence carried `proto_fleet_pause=true` as a silence matcher. Alertmanager ANDs matchers against an alert's labels, and no provisioned rule emits that label, so the silence muted nothing even though the rule reported as paused. Move the marker into the silence comment (the only non-matcher field) and identify pause silences by it. Guard maintenance-window comments so a same-org caller can't smuggle the marker in. - Rule visibility: ruleVisibleToOrg treated rules with no organization_id label as visible to every org (fail open), so a tenant-specific rule provisioned without its label would leak cross-org. Fail closed: unlabeled rules are hidden; shared rules opt into cross-org visibility with proto_fleet_scope=global, now set on the provisioned default rules. Adds unit coverage for both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4faf3a2ab8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…e, pause attribution) Follow-up to the previous fixup, addressing the Codex review of 4faf3a2: - [P2] The proto-fleet-self ingest-stalled rule uses noDataState=Alerting and can emit a single unscoped alert (no organization_id), which an org-scoped silence can never match. Marking it proto_fleet_scope=global would have given every org a Pause that reports success yet leaves the alert firing, so drop the marker: the rule stays operator-only (fails closed). The three device rules (noDataState=OK, always org-labeled) keep the marker. - [MEDIUM] Rule-scoped maintenance windows only checked rule_id != "". Resolve the target through requireRule (same visibility/existence check PauseRule uses) so a notification:manage user can't silence a rule they can't list or a guessed UID. - [MEDIUM] Rule pauses hard-coded CreatedBy: "Proto Fleet", losing actor attribution. Thread the authenticated username through PauseRule/buildPauseSilence and record it on the silence (createdBy + comment); fall back to the app name when unknown. Adds an authorizeActor handler helper and uses it for CreateMaintenanceWindow too. Adds unit coverage for all three. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 916133d186
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
… match Address the Codex re-review of 916133d (both the inline review and the security review flagged this): a group/site-scoped maintenance window emits a group_id/site_id matcher, but the provisioned alert rules only label instances with organization_id and device_id (and the metric contract names the group label device_group, not group_id), so such a window is saved and reported active while Alertmanager mutes nothing. Reject group and site scopes at validation until the alert queries emit a matching label. Rule scope (uses __alert_rule_uid__) and device scope (device_id is emitted) are unchanged and still functional; the UI only exposes rule scope today, so this only tightens the API surface. Updates the scope-validation test accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…itoring rule The earlier fail-closed ruleVisibleToOrg hid every provisioned rule that lacked a proto_fleet_scope=global label, so the Rules list went empty. Invert the model: a rule with no org label and no internal marker is a shared default that monitors every org's devices and is visible to each of them; tenant-specific rules stay scoped by their organization_id label. Drop the global markers from the device/telemetry default rules (they're shared defaults, not "global" infra) so every org sees and can pause them — the pause silence is org-scoped, so a pause only mutes the caller's own org. The self-monitoring ingest-stalled rule, whose noDataState=Alerting path emits an unscoped instance an org silence can't match, now opts out explicitly with proto_fleet_scope=internal instead of relying on the absence of a marker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83b1f773a1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…ale row on id change Address the Codex review of 83b1f77 (two client P2s in notificationsStore): - The active flag was a Date.now() snapshot frozen at load/create/update, but the Active/Expired badge, the list sort, and RulesSection's pause/lift action all read it, so a window that started or expired while the page stayed open showed stale state. Recompute active at render against a ticking clock: extract isMaintenanceWindowActive (now injectable) and a useNow() hook; RulesSection and MaintenanceWindowsSection derive active from current time instead of the frozen field. - A history-affecting edit (e.g. scope change) makes Alertmanager expire the original silence and return a new id, so upserting by the returned id left the old input.id row behind (often still active). updateMaintenanceWindow now drops the old row when the id changes, so the window isn't listed twice. Removes the now-unused computeMaintenanceWindowActive export. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the Codex security review of 567477c: the UI requires ends_at and ends > starts, but CreateMaintenanceWindow/UpdateMaintenanceWindow didn't, so a direct notification:manage RPC could omit ends_at (compiled to the 2099 far-future sentinel — a decades-long silence) or pass ends_at <= starts_at to persist an already-expired/invalid window. Add validateMaintenanceWindowTimes before PutSilence: require starts_at and ends_at and require ends_at strictly after starts_at. Indefinite suppression stays exclusive to the PauseRule path. Adds TestMaintenanceWindowRejectsInvalidTimes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e0e36cd0d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Address the Codex review of 9e0e36c: - [Security MEDIUM] ListMaintenanceWindows treated any org-matched Grafana silence as a Proto Fleet maintenance window, and Update/Delete used that list as ownership proof, so a notification:manage user could mutate (and notification:read users could see) operator- created silences that merely shared the org matcher. Stamp PF-created windows with a reserved comment marker ([proto-fleet-mw], like the pause marker — in the comment, not a matcher, so it can't affect alert matching); list/update/delete now require it, external silences are invisible, the marker is stripped for display, and user comments containing it are rejected. - [Codex Review P2] A window scheduled for the future rendered as "Expired" and sorted in with old history, so operators could think it wasn't created. Add a distinct "Scheduled" badge and order active > scheduled > expired. Adds TestListMaintenanceWindowsIgnoresUnmarkedSilences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the remaining Codex security finding (unlabeled rules visible to every org). ruleVisibleToOrg is now fail-closed and driven entirely by proto_fleet_scope: - shared -> visible to every org (shared platform defaults) - internal -> hidden from all orgs (operator-only self-monitoring) - neither -> visible only if the rule carries this org's organization_id label An unmarked, unlabeled rule is hidden, so a tenant-specific rule provisioned without its org label can't leak across orgs. Tag the three bundled device/telemetry default rules proto_fleet_scope=shared so they stay visible to every org; the self-monitoring rule keeps proto_fleet_scope=internal. NOTE: Grafana must be re-provisioned to pick up the shared marker, or these rules will not appear in the Rules list. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the Codex security finding "[MEDIUM] Rule list fails open when pause-silence state cannot be loaded". pauseSilencedRules swallowed the silences-read error and returned nil, so ListRules rendered every rule from !IsPaused alone — a rule muted by a Proto Fleet pause silence showed as enabled whenever the Alertmanager silences read failed, and PauseRule (which resolves through ListRules) could write a duplicate pause silence during such an outage. pauseSilencedRules now returns (map, error) and ListRules propagates it, so the request fails closed instead of presenting a muted rule as confidently enabled. PauseRule inherits the same guard via requireRule. Adds TestListRulesFailsClosedWhenSilencesUnavailable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 678a1ccae8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Address the Codex review of 678a1cc: the three shared device/telemetry rules used execErrState: Error, but Grafana emits DatasourceError instances with rule/datasource labels and no organization_id. Since PauseRule and rule-scoped maintenance windows always add an organization_id matcher, an org could pause/schedule one of these shared rules during a datasource outage, see a successful UI state, yet the error alert would keep firing (nothing carries the org label to match the silence). Set execErrState: OK on the three org-facing shared rules so they only emit org-scoped, pausable instances. Datasource/ingest outages remain covered globally by the operator-only protofleet-ingest-stalled rule (proto_fleet_scope=internal), which keeps execErrState: Error. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The shared device rules run execErrState=OK so org-scoped pause/maintenance silences stay correct, but that also means a schema regression on notification_metric_sample would make their queries error and silently resolve to OK, hiding offline/overheat/telemetry detection. Grafana's own evaluation-failure metrics aren't scraped (TimescaleDB is the only datasource), so add an operator-only (proto_fleet_scope: internal) canary that references every column those rules read and always evaluates to 0 — it never fires on data, but a schema drift errors its query and execErrState=Error raises an operator-only alert. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Re: Codex Security Review [MEDIUM] — Default Alert Rules Now Hide Query/Evaluation Failures Fixed in b8f51bf. The finding is correct that Reverting to So instead the device rules keep
Scope caveat I want to be explicit about: this catches the high-impact systemic failure (the shared table/schema the device queries depend on breaking). It does not catch a typo confined to one rule's expression that doesn't touch a shared column — those are caught at PR/provisioning review and don't drift spontaneously. A full datasource outage is still caught directly by |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8f51bfc77
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| open={open} | ||
| onDismiss={onDismiss} | ||
| title={isEditing ? "Edit maintenance window" : "Add maintenance window"} | ||
| description="Mute alerts during planned work. Suppressed events still record to the activity log so you can audit what would have fired." |
There was a problem hiding this comment.
Remove the suppressed-event audit promise
When a maintenance window actually matches an alert, this feature mutes it by creating an Alertmanager silence, and the repo's notification history is only written from the Alertmanager webhook path (server/internal/handlers/alertmanagerwebhook/handler.go) behind the provisioned receiver. Silenced alerts are suppressed before that receiver is invoked, so operators relying on this copy will expect audit/history rows for alerts that the app never records; either remove the claim or add a separate alert-state audit source for suppressed events.
Useful? React with 👍 / 👎.
| Rule, | ||
| } from "@/protoFleet/features/notifications/types"; | ||
|
|
||
| interface NotificationsState { |
There was a problem hiding this comment.
maybe this didnt get rebased properly? but the zustand store came back. Shared store should only be for UI persistence and not an RPC cache
| itemName={{ singular: "maintenance window", plural: "maintenance windows" }} | ||
| noDataElement={ | ||
| <div className="py-10 text-center text-text-primary-50"> | ||
| No maintenanceWindows right now — click Add maintenanceWindow to mute during planned work. |
There was a problem hiding this comment.
copy shouldn't be camel cased -> "maintenance windows"
| <Header title="Rules" titleSize="text-heading-200" /> | ||
| <p className="text-300 text-text-primary-50"> | ||
| Provisioned conditions that decide when a notification fires. The rule set is managed by ops — pause one to | ||
| maintenanceWindow it indefinitely, or attach a maintenanceWindow to mute it for a finite maintenance window. |
There was a problem hiding this comment.
copy should not be camel cased -> "maintenance window"
Reviewable diff: +2070/-4 across 16 files (excludes generated, test, and story files).
Summary
Adds alert rules and maintenance windows management to the Notifications feature, end-to-end (server domain + Connect RPC handlers + protoFleet UI). Operators can see the org's Grafana-provisioned alert rules, pause/resume a noisy rule, and schedule maintenance windows that silence alerts for a chosen scope (a rule, a group, a site, or specific devices) over a time range. Neither rules nor windows are stored by proto-fleet — both are projections over Grafana's alerting state, so Grafana stays the single source of truth.
Stack
This is part of the notifications series. The chain (bottom → top):
main→eden/notif-2-rules-maintenance← this PR →eden/notif-3-history→eden/notif-1b-smtp-channelThe diff is relative to
main; this is the first PR in the reordered stack, so there are no ancestor PRs to re-review. It builds on context already merged tomain: the channels + Grafana client foundation (#486), thenotification:read/notification:managepermissions (#454), and the notifications proto contract + generatedRuleService/MaintenanceWindowServiceConnect code (so no generated files appear in this diff).Out of scope here, landing in branches stacked on top (not yet open as PRs):
eden/notif-3-history— delivery history listing + dashboard active-alerts card.eden/notif-1b-smtp-channel— SMTP delivery channel.How it works
Everything is mediated through Grafana's alerting API; proto-fleet adds the org-scoping, the pause semantics, and the UI.
Rules (read + pause/resume). Alert rules are defined in Grafana via YAML provisioning.
ListRulesfetches Grafana's alert rules, keeps only those visible to the caller's org, and maps them to a domainRule. Because Grafana 11.6+ forbids the provisioning API from editing a YAML-provisioned rule, "pause" does not flip the rule'sisPausedflag. Instead, pausing writes a marker silence scoped to that rule's UID and the org; resuming deletes that silence.ListRulesthen overlays paused state by checking which rules currently have an active pause-silence, so theEnabledfield a client sees reflects the silence, not the stored rule.Maintenance windows (CRUD). A maintenance window is a Grafana silence carrying an org matcher plus scope matchers (rule / group / site / device). Create and update both POST the silence (Grafana has no separate update endpoint — a POST with the existing id replaces it); delete removes it. A window's
Activeflag is derived at read time fromnow ∈ [starts_at, ends_at)rather than persisted. Pause-silences are filtered out of the windows list since they're an internal detail of rule pausing.Authorization. List endpoints require
notification:read; all mutations requirenotification:manage. The new RPCs are registered as session-authenticated procedures in the interceptor config.Client.
clients.tsregisters Connect clients forRuleServiceandMaintenanceWindowService;notificationsApi.tswraps them in typed calls. A Zustand store holds the rules and maintenance windows because the Rules and Maintenance sections share that state; the Notifications page triggers the initial load. (Channels, by contrast, use the per-featureuseChannelshook introduced in #486 — this store is the one place feature state still lives in Zustand, justified by the cross-section sharing.)flowchart LR subgraph Client["protoFleet UI"] Page["Notifications page"] Rules["RulesSection"] MW["MaintenanceWindowsSection + AddMaintenanceWindowModal"] Store["notificationsStore (Zustand)"] Page --> Store Rules --> Store MW --> Store Store --> API["notificationsApi"] end API -->|"Connect RPC"| H["RuleService / MaintenanceWindowService handlers"] H -->|"notification:read / notification:manage"| Svc["notifications.Service"] Svc --> GC["Grafana client"] GC -->|"alert rules"| G[("Grafana")] GC -->|"silences"| GAreas of the code involved
server/internal/domain/notifications/service.goListRules,PauseRule/ResumeRule, maintenance-window CRUD, org filtering, scope validationserver/internal/domain/notifications/grafana_client.goListAlertRules,GetAlertRule,ListSilences,PutSilence,DeleteSilenceserver/internal/domain/notifications/models.goRule,RuleTemplate,MaintenanceWindow,MaintenanceWindowScopeserver/internal/handlers/notifications/handler.goserver/internal/handlers/interceptors/config.goserver/cmd/fleetd/main.goclient/src/protoFleet/api/clients.tsclient/.../notifications/api/notificationsApi.tsclient/.../notifications/store/notificationsStore.tsclient/.../components/RulesSection.tsxclient/.../components/MaintenanceWindowsSection.tsxclient/.../components/AddMaintenanceWindowModal.tsxclient/.../components/SinglePickerField.tsxclient/.../lib/maintenanceWindowOptions.tsclient/.../pages/Notifications.tsxclient/.../notifications/types/index.tsRule/MaintenanceWindowTS types*_test.goKey technical decisions & trade-offs
isPaused— chosen because Grafana 11.6+ blocks provisioning-API edits to YAML-provisioned rules. Trade-off: paused state lives in a silence, soListRulesmust overlay it rather than read it off the rule.Activeis computed at read time rather than persisted.Testing & validation
service_test.goadds unit coverage for the rules and maintenance-window paths against a fake Grafana;rpc_permissions_test.goupdated for the new procedures.go build ./...,go vet,go test ./internal/domain/notifications/... ./internal/handlers/notifications/..., clienttsc --noEmit, andeslinton the notifications feature all pass.