Skip to content

Curtailment: Enforce MQTT OFF automation demand#481

Open
rongxin-liu wants to merge 8 commits into
mainfrom
rongxin/451-mqtt-off-enforcement
Open

Curtailment: Enforce MQTT OFF automation demand#481
rongxin-liu wants to merge 8 commits into
mainfrom
rongxin/451-mqtt-off-enforcement

Conversation

@rongxin-liu

@rongxin-liu rongxin-liu commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Enforces MQTT OFF as an active automation demand level instead of a one-time edge. Operators can no longer perform a normal restore of an automation-owned curtailment while the owning enabled MQTT automation rule still has OFF asserted; force=true remains the explicit admin override, and MQTT ON remains the automation-owned restore path.

This also tightens MQTT automation source behavior and response-profile authoring. Fresh repeated OFF payloads can reassert demand without letting high-frequency heartbeats create unbounded database work, duplicate retained/replayed payloads remain suppressed, max-duration restores remain hard caps, source rows expose a bounded "No signal timeout", pending OFF retries are reflected in source status, and response-profile creation no longer treats transient live-preview errors as save blockers.

Fixes #451.

How it works

MQTT ingest persists source state through the existing pending-edge machinery. When a source is already OFF, a fresh OFF publisher timestamp becomes a distinct reassert_off edge after the source-level heartbeat window, while exact same-timestamp duplicates remain no-ops. This keeps durable level reassertion separate from normal on_to_off transitions without allowing per-second OFF heartbeats to run the full automation/database path.

The automation executor coalesces recent repeated OFF signals only after checking the linked active event still exists and is not restoring. Reassertions outside the coalescing window, rules without a known active event, stale active-event state, or restoring events continue through the normal path so automation can repair state or re-curtail after an operator restore. If a restoring event is restoring because its finite max-duration cap has elapsed, repeated OFF is ignored so the cap remains a hard stop instead of creating a curtail/restore loop.

StopCurtailment now treats an event as automation-owned only when it has the internal SourceActorAutomation actor and the reserved curtailment_automation external source. Public/manual starts cannot use the reserved automation source name, including whitespace-padded variants. For real automation events, the OFF demand check runs inside the SQL restore-transition transaction and locks the owning automation rule-state row, so a concurrent MQTT OFF either blocks the restore before it starts or observes the event as restoring and can re-curtail when allowed.

MQTT source create/update carries staleness_threshold_sec from the source form. The UI defaults to 240 seconds, validates positive whole seconds up to 86,400 seconds, and the proto/server layers enforce the same upper bound so typoed or direct API values cannot effectively disable fail-safe OFF. Runtime watchdog behavior continues to use the source staleness field to synthesize fail-safe OFF when no broker signal arrives.

Source status serialization projects a pending edge into the displayed last signal. A failed OFF execution retry therefore shows the demand being retried instead of stale settled state. Response-profile previews still use the live selector, but the UI refreshes previews periodically, ignores stale async preview responses, and shows a neutral unavailable state for response-profile creation when live fleet telemetry is temporarily unsuitable; saving the profile remains allowed while the separate "Run curtailment" test action remains blocked on preview errors.

Diagrams

flowchart TD
    A["MQTT payload"] --> B["Source worker decodes target and timestamp"]
    B --> C{"Target already processed at timestamp?"}
    C -->|"yes"| D["No-op duplicate/replay"]
    C -->|"no"| E{"Target OFF and source already OFF?"}
    E -->|"yes, inside heartbeat window"| F["Advance liveness only"]
    E -->|"yes, outside heartbeat window"| G["Persist reassert_off pending edge"]
    E -->|"no"| H["Normal edge detection"]
    G --> I{"Recent active OFF already recorded?"}
    I -->|"yes"| J["Coalesce automation work"]
    I -->|"no"| K["Automation executor"]
    H --> K
    K --> L["Start, recurtail, or restore event"]
Loading
flowchart TD
    A["StopCurtailment request"] --> B{"Internal automation-owned event?"}
    B -->|"no"| C["Existing stop flow"]
    B -->|"yes"| D{"force=true or automation ON restore?"}
    D -->|"yes"| C
    D -->|"no"| E["BeginRestoreTransition transaction"]
    E --> F["Lock owning automation rule-state row"]
    F --> G{"Latest rule signal OFF?"}
    G -->|"yes"| H["Reject normal restore"]
    G -->|"no"| I["Flip event to restoring"]
Loading
flowchart TD
    A["Create or edit MQTT source"] --> B["No signal timeout field"]
    B --> C{"1 to 86,400 seconds?"}
    C -->|"no"| D["Reject before save"]
    C -->|"yes"| E["Create/Update MQTT source RPC"]
    E --> F["Persist staleness_threshold_sec"]
    F --> G{"No broker signal before timeout?"}
    G -->|"yes"| H["Watchdog emits OFF edge"]
    G -->|"no"| I["Normal broker edge handling"]
Loading

Areas of the code involved

Area / package / file What changed Why it matters for review
server/internal/domain/curtailment/mqttingest Adds reassert_off, same-timestamp replay suppression, source-level repeated-OFF throttling, pending-edge rehydration coverage, and bounded source staleness validation. Core source-level behavior for demand enforcement, retry UX, replay safety, and fail-safe timeout correctness.
server/internal/domain/curtailment Requires internal automation ownership for the stop guard, reserves the automation external source from manual starts, handles repeated OFF reassertions, preserves max-duration hard caps, sets automation restore intent on MQTT ON, and bypasses cooldown for automation profile starts. Core enforcement boundary for operator restore behavior, broker-noise resilience, automation recovery, and duration-cap safety.
server/internal/domain/stores/interfaces Extends BeginRestoreTransition with restore options carrying the automation demand guard. Makes the restore guard a compile-time store contract instead of an optional runtime extension.
server/internal/domain/stores/sqlstores Runs the automation OFF-demand check inside the restore transition transaction and locks the rule-state row. Closes the Stop/MQTT OFF race by serializing signal writes with restore transition.
server/sqlc/queries/curtailment_automation.sql Adds the guarded automation rule lookup for event/external reference with FOR UPDATE OF st. Provides transactional concurrency control for the restore guard.
server/internal/handlers/curtailment/mqtt_settings.go Returns pending edge status as the displayed source signal and accepts bounded source staleness updates. Keeps source trigger UI aligned with retrying OFF demand and supports configurable no-signal timeout.
server/migrations Adds migration 000089 to allow reassert_off as a durable MQTT pending direction. Keeps deployed DB constraints in sync with the new source-worker edge direction.
proto/curtailment/v1 and generated API code Documents Stop restore preconditions, documents pending status projection, and validates MQTT source staleness timeout at the API boundary. Keeps client/server contracts explicit for new failure modes and status semantics.
client/src/protoFleet/api/useMqttCurtailmentSources.ts Sends and reads source stalenessThresholdSec. Wires the existing backend staleness field to the settings UI.
client/src/protoFleet/features/settings/components/Curtailment Adds the "No signal timeout" source field, 86,400-second cap, and related validation/test fixtures. Lets operators configure watchdog fail-safe behavior without accidentally disabling it.
client/src/protoFleet/features/energy Periodically refreshes plan previews, ignores stale preview responses, and softens response-profile preview errors without blocking profile save. Avoids confusing insufficient-load/cooldown messages during profile authoring while preserving validation for live run/test actions.
server/generated/sqlc, server/generated/grpc, client/src/protoFleet/api/generated Generated output from sqlc/proto changes. Generated; review only when validating source/query/proto shape.

Key technical decisions & trade-offs

  • Gate automation ownership on server-derived SourceActorAutomation plus the reserved external source, rather than trusting client-supplied external attribution alone.
  • Put the OFF-demand check inside BeginRestoreTransition instead of a service pre-check, so the guard and restore state write share one transaction.
  • Lock the automation rule-state row during the guarded restore path, allowing concurrent MQTT OFF handling to serialize with operator restore.
  • Represent fresh repeated OFF as reassert_off, keeping transition semantics distinct from level reassertion and allowing that value in the deployed MQTT source-state constraint.
  • Bound repeated OFF heartbeat work at the source worker before the executor path, while still allowing reassertions outside the window to repair stale or restoring automation state.
  • Treat max-duration restore as a hard cap for the same event instead of letting persistent OFF demand create a curtail/restore loop.
  • Keep force=true as the explicit operator escape hatch instead of introducing a new public RPC field.
  • Use the existing per-source staleness_threshold_sec model for no-signal timeout, with a 24-hour cap instead of a new settings table or global cooldown knob.
  • Let automation starts bypass post-event cooldown, while response-profile previews continue to show live fleet availability as a non-blocking authoring signal.

Testing & validation

  • just gen
  • bin/go test ./server/internal/domain/curtailment ./server/internal/domain/curtailment/mqttingest
  • bin/go test ./server/internal/domain/curtailment/... ./server/internal/handlers/curtailment
  • bin/go test ./server/internal/domain/stores/sqlstores -run '^$'
  • ../bin/golangci-lint run -c .golangci.yaml ./internal/domain/curtailment/... ./internal/handlers/curtailment ./internal/domain/stores/sqlstores from server/
  • npm test -- --run src/protoFleet/features/energy/CurtailmentStartModal.test.tsx src/protoFleet/features/energy/useCurtailmentPlanPreview.test.ts src/protoFleet/features/settings/components/Curtailment/CurtailmentSettingsPage.test.tsx src/protoFleet/api/useMqttCurtailmentSources.test.tsx
  • npm test -- --run src/protoFleet/features/settings/components/Curtailment/CurtailmentSettingsPage.test.tsx
  • npx tsc --noEmit
  • npm run lint
  • Latest pre-push hook: server-lint, client-typecheck

Focused coverage now includes manual spoofing of the reserved automation external source, same-timestamp OFF replay after ON, frequent repeated OFF suppression, repeated OFF re-curtailment of restoring events, max-duration restoration staying a hard cap, durable reassert_off pending-edge rehydration, automation-owned stop guard behavior, source staleness form/API plumbing and upper bounds, preview refresh sequencing, and response-profile save behavior when live preview is temporarily unavailable.

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (d34b3cd7fc43747cf82d7170d3bd4e0b5756b8c9...c57a71e961e44557bcb1ca03b7dfc4c6d7221511, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: NONE

Findings

No security, correctness, or reliability findings identified in the scoped diff.

Notes

Reviewed .git/codex-review.diff only, covering the MQTT curtailment automation changes, Stop guard behavior, protobuf validation/comment updates, SQL migration/query changes, generated Go/TS artifacts, and frontend settings/preview updates.

No evidence found of auth bypass, SQL injection, command injection, unsafe frontend rendering, cryptostealing/pool hijack behavior, or protobuf wire-format breakage. I did not run tests; this was a static PR-diff review.


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

Require internal automation ownership before applying the OFF restore guard, move the guard into the restore transition transaction, and bound repeated OFF reassertion work while preserving recurtail recovery.
@rongxin-liu rongxin-liu marked this pull request as ready for review June 17, 2026 01:19
@rongxin-liu rongxin-liu requested a review from a team as a code owner June 17, 2026 01:19
Copilot AI review requested due to automatic review settings June 17, 2026 01:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Only coalesce a recent repeated OFF after confirming the linked automation event is still non-restoring, so force or max-duration restore can be recurtailled while OFF remains asserted.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d535bfa316

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/mqttingest/worker.go
Add a migration for the reassert_off pending direction so repeated OFF demand assertions can persist and reach automation on deployed databases.
@github-actions github-actions Bot added javascript Pull requests that update javascript code client labels Jun 17, 2026
Sequence preview refresh results, bound repeated OFF heartbeat work, document API semantics, and tighten reserved automation source validation.
Prevent repeated OFF from re-curtailing events after their max-duration cap and bound MQTT no-signal timeout values across API, server, and UI validation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enforce MQTT curtailment automation while OFF remains asserted

2 participants