Curtailment: Enforce MQTT OFF automation demand#481
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: NONE FindingsNo security, correctness, or reliability findings identified in the scoped diff. NotesReviewed No evidence found of auth bypass, SQL injection, command injection, unsafe frontend rendering, cryptostealing/pool hijack behavior, or protobuf wire-format breakage. I did not run tests; this was a static PR-diff review. Generated by Codex Security Review | |
Require internal automation ownership before applying the OFF restore guard, move the guard into the restore transition transaction, and bound repeated OFF reassertion work while preserving recurtail recovery.
Only coalesce a recent repeated OFF after confirming the linked automation event is still non-restoring, so force or max-duration restore can be recurtailled while OFF remains asserted.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d535bfa316
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Add a migration for the reassert_off pending direction so repeated OFF demand assertions can persist and reach automation on deployed databases.
Sequence preview refresh results, bound repeated OFF heartbeat work, document API semantics, and tighten reserved automation source validation.
Prevent repeated OFF from re-curtailing events after their max-duration cap and bound MQTT no-signal timeout values across API, server, and UI validation.
Summary
Enforces MQTT
OFFas an active automation demand level instead of a one-time edge. Operators can no longer perform a normal restore of an automation-owned curtailment while the owning enabled MQTT automation rule still hasOFFasserted;force=trueremains the explicit admin override, and MQTTONremains the automation-owned restore path.This also tightens MQTT automation source behavior and response-profile authoring. Fresh repeated
OFFpayloads can reassert demand without letting high-frequency heartbeats create unbounded database work, duplicate retained/replayed payloads remain suppressed, max-duration restores remain hard caps, source rows expose a bounded "No signal timeout", pending OFF retries are reflected in source status, and response-profile creation no longer treats transient live-preview errors as save blockers.Fixes #451.
How it works
MQTT ingest persists source state through the existing pending-edge machinery. When a source is already
OFF, a freshOFFpublisher timestamp becomes a distinctreassert_offedge after the source-level heartbeat window, while exact same-timestamp duplicates remain no-ops. This keeps durable level reassertion separate from normalon_to_offtransitions without allowing per-secondOFFheartbeats to run the full automation/database path.The automation executor coalesces recent repeated
OFFsignals only after checking the linked active event still exists and is not restoring. Reassertions outside the coalescing window, rules without a known active event, stale active-event state, or restoring events continue through the normal path so automation can repair state or re-curtail after an operator restore. If a restoring event is restoring because its finite max-duration cap has elapsed, repeatedOFFis ignored so the cap remains a hard stop instead of creating a curtail/restore loop.StopCurtailmentnow treats an event as automation-owned only when it has the internalSourceActorAutomationactor and the reservedcurtailment_automationexternal source. Public/manual starts cannot use the reserved automation source name, including whitespace-padded variants. For real automation events, theOFFdemand check runs inside the SQL restore-transition transaction and locks the owning automation rule-state row, so a concurrent MQTTOFFeither blocks the restore before it starts or observes the event as restoring and can re-curtail when allowed.MQTT source create/update carries
staleness_threshold_secfrom the source form. The UI defaults to 240 seconds, validates positive whole seconds up to 86,400 seconds, and the proto/server layers enforce the same upper bound so typoed or direct API values cannot effectively disable fail-safeOFF. Runtime watchdog behavior continues to use the source staleness field to synthesize fail-safeOFFwhen no broker signal arrives.Source status serialization projects a pending edge into the displayed last signal. A failed OFF execution retry therefore shows the demand being retried instead of stale settled state. Response-profile previews still use the live selector, but the UI refreshes previews periodically, ignores stale async preview responses, and shows a neutral unavailable state for response-profile creation when live fleet telemetry is temporarily unsuitable; saving the profile remains allowed while the separate "Run curtailment" test action remains blocked on preview errors.
Diagrams
flowchart TD A["MQTT payload"] --> B["Source worker decodes target and timestamp"] B --> C{"Target already processed at timestamp?"} C -->|"yes"| D["No-op duplicate/replay"] C -->|"no"| E{"Target OFF and source already OFF?"} E -->|"yes, inside heartbeat window"| F["Advance liveness only"] E -->|"yes, outside heartbeat window"| G["Persist reassert_off pending edge"] E -->|"no"| H["Normal edge detection"] G --> I{"Recent active OFF already recorded?"} I -->|"yes"| J["Coalesce automation work"] I -->|"no"| K["Automation executor"] H --> K K --> L["Start, recurtail, or restore event"]flowchart TD A["StopCurtailment request"] --> B{"Internal automation-owned event?"} B -->|"no"| C["Existing stop flow"] B -->|"yes"| D{"force=true or automation ON restore?"} D -->|"yes"| C D -->|"no"| E["BeginRestoreTransition transaction"] E --> F["Lock owning automation rule-state row"] F --> G{"Latest rule signal OFF?"} G -->|"yes"| H["Reject normal restore"] G -->|"no"| I["Flip event to restoring"]flowchart TD A["Create or edit MQTT source"] --> B["No signal timeout field"] B --> C{"1 to 86,400 seconds?"} C -->|"no"| D["Reject before save"] C -->|"yes"| E["Create/Update MQTT source RPC"] E --> F["Persist staleness_threshold_sec"] F --> G{"No broker signal before timeout?"} G -->|"yes"| H["Watchdog emits OFF edge"] G -->|"no"| I["Normal broker edge handling"]Areas of the code involved
server/internal/domain/curtailment/mqttingestreassert_off, same-timestamp replay suppression, source-level repeated-OFF throttling, pending-edge rehydration coverage, and bounded source staleness validation.server/internal/domain/curtailmentserver/internal/domain/stores/interfacesBeginRestoreTransitionwith restore options carrying the automation demand guard.server/internal/domain/stores/sqlstoresserver/sqlc/queries/curtailment_automation.sqlFOR UPDATE OF st.server/internal/handlers/curtailment/mqtt_settings.goserver/migrations000089to allowreassert_offas a durable MQTT pending direction.proto/curtailment/v1and generated API codeclient/src/protoFleet/api/useMqttCurtailmentSources.tsstalenessThresholdSec.client/src/protoFleet/features/settings/components/Curtailmentclient/src/protoFleet/features/energyserver/generated/sqlc,server/generated/grpc,client/src/protoFleet/api/generatedKey technical decisions & trade-offs
SourceActorAutomationplus the reserved external source, rather than trusting client-supplied external attribution alone.BeginRestoreTransitioninstead of a service pre-check, so the guard and restore state write share one transaction.OFFhandling to serialize with operator restore.OFFasreassert_off, keeping transition semantics distinct from level reassertion and allowing that value in the deployed MQTT source-state constraint.force=trueas the explicit operator escape hatch instead of introducing a new public RPC field.staleness_threshold_secmodel for no-signal timeout, with a 24-hour cap instead of a new settings table or global cooldown knob.Testing & validation
just genbin/go test ./server/internal/domain/curtailment ./server/internal/domain/curtailment/mqttingestbin/go test ./server/internal/domain/curtailment/... ./server/internal/handlers/curtailmentbin/go test ./server/internal/domain/stores/sqlstores -run '^$'../bin/golangci-lint run -c .golangci.yaml ./internal/domain/curtailment/... ./internal/handlers/curtailment ./internal/domain/stores/sqlstoresfromserver/npm test -- --run src/protoFleet/features/energy/CurtailmentStartModal.test.tsx src/protoFleet/features/energy/useCurtailmentPlanPreview.test.ts src/protoFleet/features/settings/components/Curtailment/CurtailmentSettingsPage.test.tsx src/protoFleet/api/useMqttCurtailmentSources.test.tsxnpm test -- --run src/protoFleet/features/settings/components/Curtailment/CurtailmentSettingsPage.test.tsxnpx tsc --noEmitnpm run lintserver-lint,client-typecheckFocused coverage now includes manual spoofing of the reserved automation external source, same-timestamp
OFFreplay afterON, frequent repeatedOFFsuppression, repeatedOFFre-curtailment of restoring events, max-duration restoration staying a hard cap, durablereassert_offpending-edge rehydration, automation-owned stop guard behavior, source staleness form/API plumbing and upper bounds, preview refresh sequencing, and response-profile save behavior when live preview is temporarily unavailable.