Skip to content

feat(sites): atomic cross-site reparent (2/4)#464

Merged
flesher merged 6 commits into
mainfrom
feat/atomic-cross-site-reparent-2of4
Jun 16, 2026
Merged

feat(sites): atomic cross-site reparent (2/4)#464
flesher merged 6 commits into
mainfrom
feat/atomic-cross-site-reparent-2of4

Conversation

@flesher

@flesher flesher commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Part 2 of 4 — Atomic Cross-Site Reparent

Stacks on PR #463. Closes the cross-site orphan window that survived the original issue #420 fix.

1. Summary

Adds an opt-in force_clear_conflicting_rack_membership flag to AssignDevicesToSite so the server can clear a miner's rack membership inside the same transaction as the site assignment. Without the flag, today's behavior is preserved: a miner whose current rack lives on a different site rejects the whole batch. With the flag, the rack rows are dropped and the site write commits atomically — eliminating the client-side remove-then-reassign loop and the orphan-state window it left behind. The picker UI in MinerReparentPicker collapses to a single RPC for the cross-site flow.

2. How it works

Pre-PR cross-site flow (orphan window):

  1. Operator picks a target site for a set of miners that includes some currently in racks at the source site.
  2. Server's AssignDevicesToSite rejects with PerDeviceConflict[] of DEVICE_IN_RACK_AT_OTHER_SITE.
  3. UI prompts: "these miners are in rack X at the source site; unassign them and continue?"
  4. On confirm, UI loops RemoveDevicesFromDeviceSet per source rack, then calls AssignDevicesToSite again.
  5. Failure between step 4's loop and the final assign → miners are rack-less but still on the old site.

Post-PR cross-site flow:

  1. UI sees the conflict response and prompts the operator the same way.
  2. On confirm, UI calls AssignDevicesToSite once with force_clear_conflicting_rack_membership: true.
  3. Server opens one transaction:
    a. Lock target site row.
    b. Lock device rows for reassign.
    c. Compute conflicts (per-device validation).
    d. Partition conflicts: clearable (rack-at-other-site) vs residual (e.g. DEVICE_NOT_FOUND).
    e. If residual conflicts remain → return them without any deletion. (Don't half-commit.)
    f. If only clearable: RemoveDevicesFromAnyRack(clearableIDs, target=0) drops the rack-membership rows for only the conflicting devices. Devices already at the target site keep their rack rows intact.
    g. Apply the site write (UPDATE device SET site_id = …).
  4. Activity event records both the site move and the cascade clear (count + identifier sample), so audits trace the rack detachment.

Auth boundary. AssignDevicesToSite is gated by PermSiteManage at the procedure level. When the caller sets force_clear=true, the handler performs a second RequirePermission(PermRackManage) check before invoking the service — the cascade deletes device_set_membership rows, which sibling rack RPCs require PermRackManage to do.

Retry safety. Two store wrappers (RemoveDevicesFromAnyRack, LockDevicesForReassign) now wrap their PG errors with %w so the SQLTransactor retry loop can unwrap to pgconn.PgError and detect 40P01 (deadlock_detected). Cross-RPC lock-order collisions become silent retries instead of Internal 500s.

3. Diagrams

Cross-site reparent flow (pre vs post)

flowchart TB
    subgraph Pre["Pre-PR: 3 RPCs, orphan window between them"]
        A1["UI: AssignDevicesToSite"] --> A2["Server: reject<br/>PerDeviceConflict DEVICE_IN_RACK_AT_OTHER_SITE"]
        A2 --> A3["UI: loop RemoveDevicesFromDeviceSet<br/>per source rack"]
        A3 -.->|"network fail here<br/>= rack-less + old site"| A4["UI: AssignDevicesToSite retry"]
        A4 --> A5["Server: apply site write"]
    end
    subgraph Post["Post-PR: 1 RPC, atomic"]
        B1["UI: AssignDevicesToSite<br/>force_clear=false"] --> B2["Server: reject<br/>PerDeviceConflict DEVICE_IN_RACK_AT_OTHER_SITE"]
        B2 --> B3["UI: prompt unassign and continue?"]
        B3 --> B4["UI: AssignDevicesToSite<br/>force_clear=true"]
        B4 --> B5["Server: one tx<br/>1. lock site + devices<br/>2. partition conflicts<br/>3. clear conflicting rack rows<br/>4. apply site write"]
    end
Loading

Server-side AssignDevicesToSite with force_clear=true

sequenceDiagram
    participant Client
    participant Handler
    participant Auth as middleware.RequirePermission
    participant Svc as sites.Service
    participant Store
    participant Coll as collectionStore
    participant DB

    Client->>Handler: AssignDevicesToSite force_clear=true
    Handler->>Auth: PermSiteManage
    Auth-->>Handler: ok
    Handler->>Auth: PermRackManage gated by flag
    Auth-->>Handler: ok
    Handler->>Svc: AssignDevicesToSite params
    Svc->>DB: BEGIN
    Svc->>Store: LockSiteForWrite target
    Svc->>Store: LockDevicesForReassign ids
    Svc->>Svc: computeReassignConflicts
    alt residual non-clearable conflicts
        Svc-->>Handler: txConflicts no write
        Svc->>DB: COMMIT lock-only
    else only clearable rack-at-other-site
        Svc->>Coll: RemoveDevicesFromAnyRack clearableIDs target=0
        Svc->>Store: AssignDevicesToSite bulk UPDATE
        Svc->>DB: COMMIT
        Svc->>Svc: emit activity event with force_cleared metadata
    end
    Svc-->>Handler: rowsAffected or conflicts
    Handler-->>Client: response
Loading

Conflict partitioning state

stateDiagram-v2
    [*] --> ComputeConflicts
    ComputeConflicts --> NoConflicts: conflicts == 0
    ComputeConflicts --> Partition: conflicts > 0
    NoConflicts --> SiteWrite
    Partition --> CheckResidual
    CheckResidual --> AbortNoDelete: residual greater than 0
    CheckResidual --> ClearableOnly: residual==0 AND clearable greater than 0
    CheckResidual --> ReturnAsConflicts: residual==0 AND clearable==0 AND force_clear=false
    AbortNoDelete --> ReturnConflicts
    ClearableOnly --> RemoveRackRows
    RemoveRackRows --> SiteWrite
    SiteWrite --> EmitAudit
    EmitAudit --> [*]
    ReturnConflicts --> [*]
    ReturnAsConflicts --> [*]
Loading

4. Areas of the code involved

Area What changed Why it matters for review
proto/sites/v1/sites.proto Added optional bool force_clear_conflicting_rack_membership = 3 to AssignDevicesToSiteRequest. Fresh tag (no reuse). Wire-compat: default-false preserves current behavior; only callers that explicitly set the flag get the new semantics.
server/generated/grpc/sites/v1/sites.pb.go Proto regen. Generated — skip.
client/src/protoFleet/api/generated/sites/v1/sites_pb.ts Proto regen. Generated — skip.
server/internal/handlers/sites/handler.go (modified) After the existing PermSiteManage check, when the flag is set, runs a second RequirePermission(PermRackManage) before invoking the service. Security boundary. Reviewer should confirm the flag-gated dual check + that the order (site-gate first, then conditional rack-gate) reflects the catalog's intent.
server/internal/handlers/sites/translate.go (modified) toAssignDevicesParams threads req.GetForceClearConflictingRackMembership() into the domain params. Trivial mapping; confirm the flag isn't dropped.
server/internal/handlers/sites/handler_test.go (modified) Adds TestHandler_AssignDevicesToSite_forceClearRequiresRackManage with 3 subcases (perm matrix). Pins the security boundary; small mock-driven test.
server/internal/domain/sites/models/models.go (modified) New ForceClearConflictingRackMembership bool on AssignDevicesToSiteParams. Param shape; one field.
server/internal/domain/sites/service.go (modified) AssignDevicesToSite now uses RunInTxWithResult returning a per-attempt assignDevicesToSiteTx{rowsAffected, txConflicts, forceClearedIDs} struct (retry-safe). Force-clear branch partitions conflicts and clears only the conflicting devices' rack rows. Activity event extended with force_cleared_rack_membership_count + truncated identifiers when the cascade fired. Core logic. Reviewer should focus on (a) the conflict-partition / abort-before-delete order, (b) lock order (site → device row → rack membership row), (c) per-attempt struct shape, (d) audit-event completeness.
server/internal/domain/sites/service_test.go (modified) Adds 6 tests covering the force-clear branch: cascade + write success, no-conflict no-op, residual DEVICE_NOT_FOUND aborts BEFORE delete, rollback on site-write failure, scoped-clear (only conflicting devices touched), Unassigned target + force_clear. Plus 2 audit-log assertions (cascade metadata present / absent). Service-level coverage.
server/internal/domain/stores/sqlstores/collection.go (modified) One-character change: %v%w in RemoveDevicesFromAnyRack error wrap. Lets the retry loop unwrap to pgconn.PgError and detect 40P01.
server/internal/domain/stores/sqlstores/site.go (modified) Same %v%w in LockDevicesForReassign. Same reason; covers the other half of the cross-RPC deadlock pair.
client/src/protoFleet/api/sites.ts (modified) Adds forceClearConflictingRackMembership?: boolean to AssignDevicesToSiteProps. Threaded through to sitesClient.assignDevicesToSite. Surface for callers.
client/src/protoFleet/features/fleetManagement/components/MinerActionsMenu/MinerReparentPicker.tsx (modified) Removes the per-rack removeDevicesFromDeviceSet loop in dispatchSiteMoveWithUnassign and the now-unused removeFromRack helper / hook destructure. The single call now is dispatchSiteReassign(targetSiteId, ids, true). UX behavior. Reviewer should confirm the abort-signal threading + that the consolidated toast wording matches the operator's mental model.

5. Key technical decisions & trade-offs

  • Single RPC + opt-in flag over a new dedicated AssignDevicesToSiteAndClearRack RPC. Keeps the surface small; gives the server one transaction boundary; preserves today's strict-conflict behavior for callers that don't set the flag. Trade-off: a single RPC now has two semantic modes.
  • Dual permission gate at the handler (PermSiteManage + conditional PermRackManage) over (a) widening AssignDevicesToSite to also require PermRackManage unconditionally or (b) declaring the permission union at the procedure-map level. Handler-level conditional check preserves the existing site-only gating for the non-cascade path and avoids middleware that has to peek the request body.
  • Conflict partitioning that aborts BEFORE any deletion when residual conflicts remain. Alternative — delete-then-return-conflicts — would leave rack-stripped devices on the old site if any device was missing. Symmetric tx semantics: either the whole batch moves and the rack memberships drop, or nothing changes.
  • Clear only the conflicting identifiers, not the full request list. Alternative — clear everything — would drop rack memberships for devices already at the target site (and their rack_slot children). Scoped clear preserves untouched rack assemblies.
  • %w wrapping in two sqlstores as the cheap fix for the deadlock-retry path. The proper fix — aligning lock order between this RPC and AssignDevicesToRack — needs a new sqlc LockSourceRacksForDevices query and is filed as a follow-up. The %w change makes the deadlock window recoverable (silent retry) in the meantime.
  • Per-attempt struct via RunInTxWithResult over outer-scope accumulators. Matches the convention introduced by PR refactor(reparent): atomic reparent foundation (1/4) #463 for AssignBuildingsToSite / AssignRacksToSite / AssignRacksToBuilding. Tx retries start each attempt from zero; final response and audit event reflect only the committed attempt.
  • Audit-event extension over a new event type. Reviewer-discovery: investigators looking at "devices reassigned to site" rows see both the site move and the rack detachment in one row's metadata; alternative was a separate eventDevicesRackMembershipForceCleared event, which would split the side-effect from the cause.
  • MinerReparentPicker collapses to a single RPC, removing the per-rack remove loop entirely. Trade-off: the operator's "unassigning from N racks…" progress toast is gone; replaced with a single "moving miners…" toast since the whole operation is now atomic on the server.

6. Testing & validation

Service-layer tests (server/internal/domain/sites/service_test.go):

  • forceClearCascadesRackMembership — cascade + write success; mock asserts RemoveDevicesFromAnyRack(..., []string{"d1"}, 0) receives only the conflicting id.
  • forceClearWithoutConflicts — no rack conflicts → no cascade call.
  • forceClearMissingDeviceStillRejects — residual DEVICE_NOT_FOUND aborts BEFORE the rack-clear delete fires.
  • forceClearRollsBackOnSiteWriteFailure — tx.calls == 1.
  • forceClearOnlyConflictingDevices — scoped clear regression.
  • forceClearWithUnassignedTargettarget_site_id = nil + flag works (no site lock; rack clear still fires).
  • forceClearAuditLogsCascade — audit metadata includes force_cleared_rack_membership_count + identifiers + description suffix.
  • noForceClearOmitsCascadeMetadata — plain reassign event has none of the cascade keys.

Handler-layer tests (server/internal/handlers/sites/handler_test.go):

  • forceClearRequiresRackManage matrix: site-only + flag-false → ok; site-only + flag-true → PermissionDenied; both perms + flag-true → ok.

Verification:

  • go build ./... — clean
  • go test ./server/internal/domain/sites/... ./server/internal/handlers/sites/... ./server/internal/domain/collection/... -count=1 — pass
  • npx tsc --noEmit (client) — clean
  • npm run lint — clean

Explicitly NOT covered:

  • End-to-end on a real Postgres — the deadlock-retry path (%w unwrap → IsRetryablePostgresError → SQLTransactor retries) is exercised only via unit-test mocks. A focused integration test that fires concurrent AssignDevicesToSite(force_clear=true) + AssignDevicesToRack against the same device set would verify both the deadlock detection and the retry recovery. Filed as a follow-up.
  • Lock-order alignment between this RPC and AssignDevicesToRack (which would eliminate the deadlock window, not just make it recoverable). Needs a new sqlc LockSourceRacksForDevices query + interface/mock regen. Filed as a follow-up; the %w retry safety net is the bridge.

How it fits into the larger refactor (4 PRs)

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

@flesher flesher requested a review from a team as a code owner June 15, 2026 20:43
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels Jun 15, 2026
@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (c84c07cd80d20fbc5cf8b2006ddfa540729b0c72...8b95245c930e1440e9f66bfae39058375ef5a31d, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Force-clear site moves can deadlock with concurrent rack moves

  • Category: Concurrency | Reliability
  • Location: server/internal/domain/sites/service.go:372
  • Description: The new force-clear branch deletes rack membership after AssignDevicesToSite has already locked the device rows. The existing rack reparent path removes/adds device_set_membership rows first, then cascades device.site_id. If two operators concurrently move the same miner to a site and to a rack, the site transaction can hold the device row while waiting on the rack transaction’s membership row, while the rack transaction waits on the device row for the site cascade.
  • Impact: PostgreSQL will detect a deadlock. The transaction layer retries, but only a fixed number of times, so repeated collisions can surface as failed/internal errors for normal concurrent reparent operations.
  • Recommendation: Make the site force-clear path and rack reparent path use a consistent lock order. For example, have rack reparent lock the affected device rows before membership delete/insert/cascade writes, or introduce an explicit membership/device locking helper used by both paths and add an integration test that runs concurrent site-force-clear and rack assignment for the same device.

Notes

No auth, SQL injection, protobuf wire compatibility, pool hijacking, or frontend XSS issues were found in the reviewed diff.


Generated by Codex Security Review |
Triggered by: @flesher |
Review workflow run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac47c992ba

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/sites/service.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c73d160c29

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/sites/service.go
@flesher flesher force-pushed the feat/atomic-cross-site-reparent-2of4 branch 3 times, most recently from 44a1f0c to 184614e Compare June 15, 2026 22:21
Base automatically changed from refactor/reparent-foundation-1of4 to main June 16, 2026 00:01
flesher and others added 2 commits June 15, 2026 17:02
…sToSite

Adds optional force_clear_conflicting_rack_membership flag to
AssignDevicesToSiteRequest. When true the server, inside the same tx
as the site write, drops conflicting rack memberships for devices
hitting DEVICE_IN_RACK_AT_OTHER_SITE before applying the site
assignment. Residual non-clearable conflicts (e.g. DEVICE_NOT_FOUND)
abort the tx BEFORE any deletion so partial-state writes are
impossible. Only conflicting identifiers are passed to
RemoveDevicesFromAnyRack so unrelated rack rows aren't dropped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Threads the new force_clear_conflicting_rack_membership flag through
the sites client API and collapses MinerReparentPicker's
dispatchSiteMoveWithUnassign from a per-rack
removeDevicesFromDeviceSet loop + assignDevicesToSite into a single
server-side atomic call. This closes the orphan window where a
transport failure between the per-rack unassign and the site
reassign could leave miners rack-less but still on the old site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@flesher flesher force-pushed the feat/atomic-cross-site-reparent-2of4 branch from 184614e to b53a01f Compare June 16, 2026 00:02
flesher and others added 4 commits June 15, 2026 18:07
The force_clear_conflicting_rack_membership flag on AssignDevicesToSite
deletes device_set_membership rows — the same write that sibling rack
RPCs require rack:manage to perform. Site:manage alone was sufficient
to trip the cascade, letting a site-only operator bypass the rack auth
gate via this flag. Now both keys are required for the flagged path;
plain reassignments still go through with site:manage only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NewInternalErrorf("...: %v", err) lost the underlying pgconn.PgError,
so IsRetryablePostgresError couldn't recognize 40P01 deadlock_detected
and the SQLTransactor retry loop bailed instead of retrying. Switch to
%w in both LockDevicesForReassign and RemoveDevicesFromAnyRack so the
deadlock-detection path on the cross-site reparent flow works as
intended.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When force_clear_conflicting_rack_membership actually deletes rack
rows, the activity event now records the side effect alongside the
site move. Metadata gets force_cleared_rack_membership_count and a
truncated force_cleared_device_identifiers list; the description
appends an "(N rack membership(s) force-cleared)" suffix.

Also converts AssignDevicesToSite to RunInTxWithResult + a per-attempt
struct (assignDevicesToSiteTx). Lifting state out of captured closure
vars matches the convention PR 1 introduced for the other group RPCs
and keeps retries (deadlock / serialization) clean: each attempt
starts from zero, and the committed totals — including the cleared
identifier list the audit log needs — flow out of the transactor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pins the cascade-clear semantic on the Unassigned path: when the
caller passes target_site_id == nil and force_clear = true, devices
that live in racks at any site get their rack memberships dropped
before the site_id is nulled out. No target site lock fires (no
target), but the conflict-partition + RemoveDevicesFromAnyRack +
nil-target site write sequence still applies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 16, 2026 01:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Extends the AssignDevicesToSite RPC to support an atomic cross-site reparent path by optionally clearing conflicting rack memberships inside the same transaction as the site update, and updates the ProtoFleet UI to use that single atomic call (eliminating the client-side “remove-from-rack loop” orphan window).

Changes:

  • Add force_clear_conflicting_rack_membership (optional) to AssignDevicesToSiteRequest, plumbed through handler translation and domain params.
  • Enforce additional auth (rack:manage) when the force-clear cascade is requested, and implement server-side conflict partitioning + in-tx rack detach + audit metadata.
  • Update MinerReparentPicker to collapse cross-site moves into a single assignDevicesToSite(..., forceClearConflictingRackMembership: true) dispatch; add handler/service test coverage.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated no comments.

Show a summary per file
File Description
server/internal/handlers/sites/translate.go Maps the new force-clear request flag into service params.
server/internal/handlers/sites/handler.go Requires rack:manage when force-clear cascade is requested.
server/internal/handlers/sites/handler_test.go Adds tests pinning the handler auth gate for the force-clear branch.
server/internal/domain/stores/sqlstores/site.go Switches internal error formatting to %w for proper wrapping.
server/internal/domain/stores/sqlstores/collection.go Switches internal error formatting to %w for proper wrapping.
server/internal/domain/sites/service.go Implements force-clear conflict partitioning, in-tx rack detach, retry-safe per-attempt counters, and cascade-aware audit metadata.
server/internal/domain/sites/service_test.go Adds service-layer tests for cascade success/no-op/reject-before-delete/rollback/scoped-clear/unassigned target/audit metadata.
server/internal/domain/sites/models/models.go Adds ForceClearConflictingRackMembership to the service params shape and documents behavior.
proto/sites/v1/sites.proto Adds the optional force_clear_conflicting_rack_membership field with API documentation.
client/src/protoFleet/features/fleetManagement/components/MinerActionsMenu/MinerReparentPicker.tsx Replaces the per-rack unassign loop + site reassign with a single atomic site RPC using the new flag.
client/src/protoFleet/api/sites.ts Exposes forceClearConflictingRackMembership on the client hook and passes it through to the RPC.
server/generated/grpc/sites/v1/sites.pb.go Regenerated Go protobuf output for the new request field.
client/src/protoFleet/api/generated/sites/v1/sites_pb.ts Regenerated TS protobuf output for the new request field.

@flesher flesher merged commit 7324718 into main Jun 16, 2026
77 checks passed
@flesher flesher deleted the feat/atomic-cross-site-reparent-2of4 branch June 16, 2026 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants