Skip to content

notifications (4/6): rules + maintenance windows#496

Open
illegalprime wants to merge 12 commits into
mainfrom
eden/notif-2-rules-maintenance
Open

notifications (4/6): rules + maintenance windows#496
illegalprime wants to merge 12 commits into
mainfrom
eden/notif-2-rules-maintenance

Conversation

@illegalprime

@illegalprime illegalprime commented Jun 18, 2026

Copy link
Copy Markdown
Contributor
Screenshot 2026-06-18 at 15-26-47 Proto Fleet

Reviewable diff: +2070/-4 across 16 files (excludes generated, test, and story files).

Summary

Adds alert rules and maintenance windows management to the Notifications feature, end-to-end (server domain + Connect RPC handlers + protoFleet UI). Operators can see the org's Grafana-provisioned alert rules, pause/resume a noisy rule, and schedule maintenance windows that silence alerts for a chosen scope (a rule, a group, a site, or specific devices) over a time range. Neither rules nor windows are stored by proto-fleet — both are projections over Grafana's alerting state, so Grafana stays the single source of truth.

Stack

This is part of the notifications series. The chain (bottom → top):

  • maineden/notif-2-rules-maintenance ← this PReden/notif-3-historyeden/notif-1b-smtp-channel

The diff is relative to main; this is the first PR in the reordered stack, so there are no ancestor PRs to re-review. It builds on context already merged to main: the channels + Grafana client foundation (#486), the notification:read / notification:manage permissions (#454), and the notifications proto contract + generated RuleService / MaintenanceWindowService Connect code (so no generated files appear in this diff).

Out of scope here, landing in branches stacked on top (not yet open as PRs):

  • eden/notif-3-history — delivery history listing + dashboard active-alerts card.
  • eden/notif-1b-smtp-channel — SMTP delivery channel.

How it works

Everything is mediated through Grafana's alerting API; proto-fleet adds the org-scoping, the pause semantics, and the UI.

Rules (read + pause/resume). Alert rules are defined in Grafana via YAML provisioning. ListRules fetches Grafana's alert rules, keeps only those visible to the caller's org, and maps them to a domain Rule. Because Grafana 11.6+ forbids the provisioning API from editing a YAML-provisioned rule, "pause" does not flip the rule's isPaused flag. Instead, pausing writes a marker silence scoped to that rule's UID and the org; resuming deletes that silence. ListRules then overlays paused state by checking which rules currently have an active pause-silence, so the Enabled field a client sees reflects the silence, not the stored rule.

Maintenance windows (CRUD). A maintenance window is a Grafana silence carrying an org matcher plus scope matchers (rule / group / site / device). Create and update both POST the silence (Grafana has no separate update endpoint — a POST with the existing id replaces it); delete removes it. A window's Active flag is derived at read time from now ∈ [starts_at, ends_at) rather than persisted. Pause-silences are filtered out of the windows list since they're an internal detail of rule pausing.

Authorization. List endpoints require notification:read; all mutations require notification:manage. The new RPCs are registered as session-authenticated procedures in the interceptor config.

Client. clients.ts registers Connect clients for RuleService and MaintenanceWindowService; notificationsApi.ts wraps them in typed calls. A Zustand store holds the rules and maintenance windows because the Rules and Maintenance sections share that state; the Notifications page triggers the initial load. (Channels, by contrast, use the per-feature useChannels hook introduced in #486 — this store is the one place feature state still lives in Zustand, justified by the cross-section sharing.)

flowchart LR
  subgraph Client["protoFleet UI"]
    Page["Notifications page"]
    Rules["RulesSection"]
    MW["MaintenanceWindowsSection + AddMaintenanceWindowModal"]
    Store["notificationsStore (Zustand)"]
    Page --> Store
    Rules --> Store
    MW --> Store
    Store --> API["notificationsApi"]
  end
  API -->|"Connect RPC"| H["RuleService / MaintenanceWindowService handlers"]
  H -->|"notification:read / notification:manage"| Svc["notifications.Service"]
  Svc --> GC["Grafana client"]
  GC -->|"alert rules"| G[("Grafana")]
  GC -->|"silences"| G
Loading
sequenceDiagram
  actor Op as Operator
  participant Svc as notifications.Service
  participant G as Grafana
  Op->>Svc: PauseRule(id)
  Svc->>G: PutSilence(matcher: ruleUID + org)
  Note over Svc,G: YAML-provisioned rules can't be edited,<br/>so a marker silence mutes instead
  Op->>Svc: ListRules()
  Svc->>G: ListAlertRules() + ListSilences()
  Svc-->>Op: rules with Enabled=false where an active pause-silence matches
  Op->>Svc: ResumeRule(id)
  Svc->>G: DeleteSilence(pause-silence for ruleUID)
Loading

Areas of the code involved

Area / file What changed Why it matters for review
server/internal/domain/notifications/service.go +545: ListRules, PauseRule/ResumeRule, maintenance-window CRUD, org filtering, scope validation Core logic; the pause-via-silence and org-isolation rules live here
server/internal/domain/notifications/grafana_client.go +84: ListAlertRules, GetAlertRule, ListSilences, PutSilence, DeleteSilence The Grafana REST surface this feature depends on
server/internal/domain/notifications/models.go +54: Rule, RuleTemplate, MaintenanceWindow, MaintenanceWindowScope Domain shapes and the scope kinds
server/internal/handlers/notifications/handler.go +225: Rule + MaintenanceWindow RPC handlers with permission checks Maps RPCs to the service; read vs. manage permission boundary
server/internal/handlers/interceptors/config.go +7: registers new RPCs as session-only procedures Auth wiring — confirm no endpoint is accidentally public
server/cmd/fleetd/main.go +2: mounts the two new service handlers Server registration
client/src/protoFleet/api/clients.ts +9: Rule + MaintenanceWindow Connect clients Client transport registration
client/.../notifications/api/notificationsApi.ts +157: typed wrappers for rules/maintenance Client ↔ proto mapping
client/.../notifications/store/notificationsStore.ts new +112: shared rules + windows store The one feature-state store (see decision below)
client/.../components/RulesSection.tsx new +186: rules list + pause/resume
client/.../components/MaintenanceWindowsSection.tsx new +179: windows list + active badge
client/.../components/AddMaintenanceWindowModal.tsx new +265: create/edit window form
client/.../components/SinglePickerField.tsx new +163: scope picker UI
client/.../lib/maintenanceWindowOptions.ts new +23: scope option helpers
client/.../pages/Notifications.tsx +19: wires sections + initial load
client/.../notifications/types/index.ts +40: Rule / MaintenanceWindow TS types
*_test.go service + permission tests generated/test — skip for architecture review

Key technical decisions & trade-offs

  • Pause writes a marker silence instead of flipping isPaused — chosen because Grafana 11.6+ blocks provisioning-API edits to YAML-provisioned rules. Trade-off: paused state lives in a silence, so ListRules must overlay it rather than read it off the rule.
  • Rules are read-only projections of Grafana alert rules (no proto-fleet rule store) — keeps Grafana the single source of truth; proto-fleet only lists, pauses, and resumes.
  • Maintenance windows are Grafana silences, updated via PUT-replace since Grafana has no silence-update endpoint; Active is computed at read time rather than persisted.
  • Org isolation is enforced on every read (matcher/label filter) and ownership is verified before every mutation, so an id from another org is not a write or list oracle.
  • A Zustand store is retained for rules + maintenance because the two sibling sections share the same data; channels (from notifications (3/6): webhook + Slack channels end-to-end + Grafana foundation #486) use a per-feature hook instead. A reviewer may prefer converting these to hooks too — flagged as the deliberate inconsistency.

Testing & validation

  • service_test.go adds unit coverage for the rules and maintenance-window paths against a fake Grafana; rpc_permissions_test.go updated for the new procedures.
  • Verified locally: go build ./..., go vet, go test ./internal/domain/notifications/... ./internal/handlers/notifications/..., client tsc --noEmit, and eslint on the notifications feature all pass.
  • Not covered: Playwright E2E for rules/maintenance (deferred), and behavior against a real Grafana beyond the fake client.

Adds the alerting-management surface on top of the channels foundation.
Rules and maintenance windows ship together because they are coupled at
the UI: the rule row's kebab attaches/lifts a maintenance window, and the
maintenance-window modal needs the rule list to populate its rule picker.

Server: the RuleService (list provisioned rules, pause/resume via marker
silences) and MaintenanceWindowService (schedule/edit/lift org-scoped
silences with rule/group/site/device scope compilation), the shared
Grafana silence + alert-rule adapter calls, and the alert-rule/silence
domain mappers. Both services are mounted and registered in the RPC
contract test and session-only interceptor list.

Client: the Rules section (read-only list + pause/resume + maintenance
window/lift kebab), the Maintenance Windows section, the add/edit
maintenance-window modal with quick presets and scope picker, and the
rules+MW parts of the API module, store, and view-model types.

History + the dashboard card ship in slice 3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@illegalprime illegalprime requested a review from a team as a code owner June 18, 2026 15:04
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server labels Jun 18, 2026
@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (46cf862cb3b29b48405a3c91e841b62c7b76c017...b8f51bfc77882b4031e30c05d5f3b7fa12aa5ee5, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Device Alert Query Failures Can Now Resolve Silently

  • Category: Reliability
  • Location: server/monitoring/grafana/provisioning/alerting/proto-fleet-rules.yaml:44
  • Description: The PR changes the shared device alert rules from execErrState: Error to execErrState: OK at lines 44, 102, and 157. The added internal canary only checks that selected columns in notification_metric_sample can be read; it does not execute the actual alert queries, including the Timescale last(value, time) aggregate, per-rule grouping, joins, or Grafana expression behavior.
  • Impact: A rule-specific query or expression failure can make offline, temperature, or telemetry-poll alerts silently resolve to OK while the canary stays healthy. Operators may lose detection for miner outages or overheating without an alert that the rule itself is broken.
  • Recommendation: Keep rule-specific evaluation failures observable. Either retain execErrState: Error and route those unscoped errors to an operator-only receiver, or add internal canaries that execute the same query/expression paths as each suppressed rule, or alert from Grafana rule evaluation error metrics if available.

Notes

No auth bypass, SQL injection, command injection, plugin boundary issue, frontend XSS, pool/wallet hijack, or protobuf wire-format issue was evident in the scoped diff.


Generated by Codex Security Review |
Triggered by: @illegalprime |
Review workflow run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd0647bec7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/service.go Outdated
@illegalprime illegalprime changed the title notifications (2/3): rules + maintenance windows end-to-end notifications (4/6): rules + maintenance windows end-to-end Jun 18, 2026
Address PR #496 review:

- Pause marker (P1): buildPauseSilence carried `proto_fleet_pause=true` as a
  silence matcher. Alertmanager ANDs matchers against an alert's labels, and no
  provisioned rule emits that label, so the silence muted nothing even though the
  rule reported as paused. Move the marker into the silence comment (the only
  non-matcher field) and identify pause silences by it. Guard maintenance-window
  comments so a same-org caller can't smuggle the marker in.

- Rule visibility: ruleVisibleToOrg treated rules with no organization_id label as
  visible to every org (fail open), so a tenant-specific rule provisioned without
  its label would leak cross-org. Fail closed: unlabeled rules are hidden; shared
  rules opt into cross-org visibility with proto_fleet_scope=global, now set on the
  provisioned default rules.

Adds unit coverage for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4faf3a2ab8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/monitoring/grafana/provisioning/alerting/proto-fleet-rules.yaml Outdated
…e, pause attribution)

Follow-up to the previous fixup, addressing the Codex review of 4faf3a2:

- [P2] The proto-fleet-self ingest-stalled rule uses noDataState=Alerting and can
  emit a single unscoped alert (no organization_id), which an org-scoped silence can
  never match. Marking it proto_fleet_scope=global would have given every org a Pause
  that reports success yet leaves the alert firing, so drop the marker: the rule stays
  operator-only (fails closed). The three device rules (noDataState=OK, always
  org-labeled) keep the marker.

- [MEDIUM] Rule-scoped maintenance windows only checked rule_id != "". Resolve the
  target through requireRule (same visibility/existence check PauseRule uses) so a
  notification:manage user can't silence a rule they can't list or a guessed UID.

- [MEDIUM] Rule pauses hard-coded CreatedBy: "Proto Fleet", losing actor attribution.
  Thread the authenticated username through PauseRule/buildPauseSilence and record it on
  the silence (createdBy + comment); fall back to the app name when unknown. Adds an
  authorizeActor handler helper and uses it for CreateMaintenanceWindow too.

Adds unit coverage for all three.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 916133d186

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/notifications/service.go
illegalprime and others added 2 commits June 18, 2026 12:42
… match

Address the Codex re-review of 916133d (both the inline review and the security
review flagged this): a group/site-scoped maintenance window emits a group_id/site_id
matcher, but the provisioned alert rules only label instances with organization_id and
device_id (and the metric contract names the group label device_group, not group_id), so
such a window is saved and reported active while Alertmanager mutes nothing.

Reject group and site scopes at validation until the alert queries emit a matching label.
Rule scope (uses __alert_rule_uid__) and device scope (device_id is emitted) are unchanged
and still functional; the UI only exposes rule scope today, so this only tightens the
API surface. Updates the scope-validation test accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…itoring rule

The earlier fail-closed ruleVisibleToOrg hid every provisioned rule that lacked a
proto_fleet_scope=global label, so the Rules list went empty. Invert the model: a rule
with no org label and no internal marker is a shared default that monitors every org's
devices and is visible to each of them; tenant-specific rules stay scoped by their
organization_id label. Drop the global markers from the device/telemetry default rules
(they're shared defaults, not "global" infra) so every org sees and can pause them — the
pause silence is org-scoped, so a pause only mutes the caller's own org.

The self-monitoring ingest-stalled rule, whose noDataState=Alerting path emits an
unscoped instance an org silence can't match, now opts out explicitly with
proto_fleet_scope=internal instead of relying on the absence of a marker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83b1f773a1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread client/src/protoFleet/features/notifications/store/notificationsStore.ts Outdated
illegalprime and others added 2 commits June 18, 2026 13:19
…ale row on id change

Address the Codex review of 83b1f77 (two client P2s in notificationsStore):

- The active flag was a Date.now() snapshot frozen at load/create/update, but the
  Active/Expired badge, the list sort, and RulesSection's pause/lift action all read it,
  so a window that started or expired while the page stayed open showed stale state.
  Recompute active at render against a ticking clock: extract isMaintenanceWindowActive
  (now injectable) and a useNow() hook; RulesSection and MaintenanceWindowsSection derive
  active from current time instead of the frozen field.

- A history-affecting edit (e.g. scope change) makes Alertmanager expire the original
  silence and return a new id, so upserting by the returned id left the old input.id row
  behind (often still active). updateMaintenanceWindow now drops the old row when the id
  changes, so the window isn't listed twice.

Removes the now-unused computeMaintenanceWindowActive export.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the Codex security review of 567477c: the UI requires ends_at and ends > starts,
but CreateMaintenanceWindow/UpdateMaintenanceWindow didn't, so a direct notification:manage
RPC could omit ends_at (compiled to the 2099 far-future sentinel — a decades-long silence)
or pass ends_at <= starts_at to persist an already-expired/invalid window.

Add validateMaintenanceWindowTimes before PutSilence: require starts_at and ends_at and
require ends_at strictly after starts_at. Indefinite suppression stays exclusive to the
PauseRule path. Adds TestMaintenanceWindowRejectsInvalidTimes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e0e36cd0d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

illegalprime and others added 3 commits June 18, 2026 14:18
Address the Codex review of 9e0e36c:

- [Security MEDIUM] ListMaintenanceWindows treated any org-matched Grafana silence as a
  Proto Fleet maintenance window, and Update/Delete used that list as ownership proof, so
  a notification:manage user could mutate (and notification:read users could see) operator-
  created silences that merely shared the org matcher. Stamp PF-created windows with a
  reserved comment marker ([proto-fleet-mw], like the pause marker — in the comment, not a
  matcher, so it can't affect alert matching); list/update/delete now require it, external
  silences are invisible, the marker is stripped for display, and user comments containing
  it are rejected.

- [Codex Review P2] A window scheduled for the future rendered as "Expired" and sorted in
  with old history, so operators could think it wasn't created. Add a distinct "Scheduled"
  badge and order active > scheduled > expired.

Adds TestListMaintenanceWindowsIgnoresUnmarkedSilences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the remaining Codex security finding (unlabeled rules visible to every org).
ruleVisibleToOrg is now fail-closed and driven entirely by proto_fleet_scope:
- shared   -> visible to every org (shared platform defaults)
- internal -> hidden from all orgs (operator-only self-monitoring)
- neither  -> visible only if the rule carries this org's organization_id label

An unmarked, unlabeled rule is hidden, so a tenant-specific rule provisioned without its
org label can't leak across orgs. Tag the three bundled device/telemetry default rules
proto_fleet_scope=shared so they stay visible to every org; the self-monitoring rule keeps
proto_fleet_scope=internal.

NOTE: Grafana must be re-provisioned to pick up the shared marker, or these rules will not
appear in the Rules list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the Codex security finding "[MEDIUM] Rule list fails open when pause-silence state
cannot be loaded". pauseSilencedRules swallowed the silences-read error and returned nil, so
ListRules rendered every rule from !IsPaused alone — a rule muted by a Proto Fleet pause
silence showed as enabled whenever the Alertmanager silences read failed, and PauseRule
(which resolves through ListRules) could write a duplicate pause silence during such an
outage.

pauseSilencedRules now returns (map, error) and ListRules propagates it, so the request
fails closed instead of presenting a muted rule as confidently enabled. PauseRule inherits
the same guard via requireRule. Adds TestListRulesFailsClosedWhenSilencesUnavailable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@illegalprime illegalprime changed the title notifications (4/6): rules + maintenance windows end-to-end notifications (4/6): rules + maintenance windows Jun 18, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 678a1ccae8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

illegalprime and others added 2 commits June 18, 2026 15:33
Address the Codex review of 678a1cc: the three shared device/telemetry rules used
execErrState: Error, but Grafana emits DatasourceError instances with rule/datasource
labels and no organization_id. Since PauseRule and rule-scoped maintenance windows always
add an organization_id matcher, an org could pause/schedule one of these shared rules during
a datasource outage, see a successful UI state, yet the error alert would keep firing
(nothing carries the org label to match the silence).

Set execErrState: OK on the three org-facing shared rules so they only emit org-scoped,
pausable instances. Datasource/ingest outages remain covered globally by the operator-only
protofleet-ingest-stalled rule (proto_fleet_scope=internal), which keeps execErrState: Error.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The shared device rules run execErrState=OK so org-scoped pause/maintenance
silences stay correct, but that also means a schema regression on
notification_metric_sample would make their queries error and silently
resolve to OK, hiding offline/overheat/telemetry detection. Grafana's own
evaluation-failure metrics aren't scraped (TimescaleDB is the only
datasource), so add an operator-only (proto_fleet_scope: internal) canary
that references every column those rules read and always evaluates to 0 —
it never fires on data, but a schema drift errors its query and
execErrState=Error raises an operator-only alert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@illegalprime

Copy link
Copy Markdown
Contributor Author

Re: Codex Security Review [MEDIUM] — Default Alert Rules Now Hide Query/Evaluation Failures

Fixed in b8f51bf. The finding is correct that execErrState: OK on the three shared device rules turns a query/evaluation failure into a silent OK, and that protofleet-ingest-stalled doesn't cover it — that rule reads fleet_telemetry_poll_heartbeat, so a schema regression on notification_metric_sample (the table the device rules read) would leave heartbeat ingest healthy while every device query errors silently.

Reverting to execErrState: Error isn't viable: it reopens the inline P2 (Error-state instances are unscoped — no organization_id — so an org-scoped pause/maintenance silence reports success but can't mute them). And Grafana's own evaluation-failure metrics aren't reachable here — TimescaleDB is the only provisioned datasource, there's no Prometheus scrape of Grafana's /metrics.

So instead the device rules keep execErrState: OK (org pauses stay correct) and I added an operator-only canary, protofleet-device-rule-schema-drift, that exercises the same table:

  • It selects every column the three device rules depend on (organization_id, device_id, sensor_kind, result, value, metric, time) but always evaluates to 0, so it never fires on data.
  • If that schema drifts (renamed/dropped column, migration drift, datasource permission change), its query errors and execErrState: Error raises an alert.
  • It carries proto_fleet_scope: internal (same as protofleet-ingest-stalled), so ruleVisibleToOrg hides it from every org — no org gets a pause action on an unscoped error instance it could never mute. noDataState: OK so a fresh deploy with an empty table doesn't page.

Scope caveat I want to be explicit about: this catches the high-impact systemic failure (the shared table/schema the device queries depend on breaking). It does not catch a typo confined to one rule's expression that doesn't touch a shared column — those are caught at PR/provisioning review and don't drift spontaneously. A full datasource outage is still caught directly by protofleet-ingest-stalled (execErrState: Error).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8f51bfc77

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

open={open}
onDismiss={onDismiss}
title={isEditing ? "Edit maintenance window" : "Add maintenance window"}
description="Mute alerts during planned work. Suppressed events still record to the activity log so you can audit what would have fired."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove the suppressed-event audit promise

When a maintenance window actually matches an alert, this feature mutes it by creating an Alertmanager silence, and the repo's notification history is only written from the Alertmanager webhook path (server/internal/handlers/alertmanagerwebhook/handler.go) behind the provisioned receiver. Silenced alerts are suppressed before that receiver is invoked, so operators relying on this copy will expect audit/history rows for alerts that the app never records; either remove the claim or add a separate alert-state audit source for suppressed events.

Useful? React with 👍 / 👎.

Rule,
} from "@/protoFleet/features/notifications/types";

interface NotificationsState {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this didnt get rebased properly? but the zustand store came back. Shared store should only be for UI persistence and not an RPC cache

itemName={{ singular: "maintenance window", plural: "maintenance windows" }}
noDataElement={
<div className="py-10 text-center text-text-primary-50">
No maintenanceWindows right now — click Add maintenanceWindow to mute during planned work.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy shouldn't be camel cased -> "maintenance windows"

<Header title="Rules" titleSize="text-heading-200" />
<p className="text-300 text-text-primary-50">
Provisioned conditions that decide when a notification fires. The rule set is managed by ops — pause one to
maintenanceWindow it indefinitely, or attach a maintenanceWindow to mute it for a finite maintenance window.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy should not be camel cased -> "maintenance window"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants