feat(sites): atomic cross-site reparent (2/4)#464
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: MEDIUM Findings[MEDIUM] Force-clear site moves can deadlock with concurrent rack moves
NotesNo auth, SQL injection, protobuf wire compatibility, pool hijacking, or frontend XSS issues were found in the reviewed diff. Generated by Codex Security Review | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac47c992ba
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
ac47c99 to
c73d160
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c73d160c29
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
44a1f0c to
184614e
Compare
…sToSite Adds optional force_clear_conflicting_rack_membership flag to AssignDevicesToSiteRequest. When true the server, inside the same tx as the site write, drops conflicting rack memberships for devices hitting DEVICE_IN_RACK_AT_OTHER_SITE before applying the site assignment. Residual non-clearable conflicts (e.g. DEVICE_NOT_FOUND) abort the tx BEFORE any deletion so partial-state writes are impossible. Only conflicting identifiers are passed to RemoveDevicesFromAnyRack so unrelated rack rows aren't dropped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Threads the new force_clear_conflicting_rack_membership flag through the sites client API and collapses MinerReparentPicker's dispatchSiteMoveWithUnassign from a per-rack removeDevicesFromDeviceSet loop + assignDevicesToSite into a single server-side atomic call. This closes the orphan window where a transport failure between the per-rack unassign and the site reassign could leave miners rack-less but still on the old site. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
184614e to
b53a01f
Compare
The force_clear_conflicting_rack_membership flag on AssignDevicesToSite deletes device_set_membership rows — the same write that sibling rack RPCs require rack:manage to perform. Site:manage alone was sufficient to trip the cascade, letting a site-only operator bypass the rack auth gate via this flag. Now both keys are required for the flagged path; plain reassignments still go through with site:manage only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NewInternalErrorf("...: %v", err) lost the underlying pgconn.PgError,
so IsRetryablePostgresError couldn't recognize 40P01 deadlock_detected
and the SQLTransactor retry loop bailed instead of retrying. Switch to
%w in both LockDevicesForReassign and RemoveDevicesFromAnyRack so the
deadlock-detection path on the cross-site reparent flow works as
intended.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When force_clear_conflicting_rack_membership actually deletes rack rows, the activity event now records the side effect alongside the site move. Metadata gets force_cleared_rack_membership_count and a truncated force_cleared_device_identifiers list; the description appends an "(N rack membership(s) force-cleared)" suffix. Also converts AssignDevicesToSite to RunInTxWithResult + a per-attempt struct (assignDevicesToSiteTx). Lifting state out of captured closure vars matches the convention PR 1 introduced for the other group RPCs and keeps retries (deadlock / serialization) clean: each attempt starts from zero, and the committed totals — including the cleared identifier list the audit log needs — flow out of the transactor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pins the cascade-clear semantic on the Unassigned path: when the caller passes target_site_id == nil and force_clear = true, devices that live in racks at any site get their rack memberships dropped before the site_id is nulled out. No target site lock fires (no target), but the conflict-partition + RemoveDevicesFromAnyRack + nil-target site write sequence still applies. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Extends the AssignDevicesToSite RPC to support an atomic cross-site reparent path by optionally clearing conflicting rack memberships inside the same transaction as the site update, and updates the ProtoFleet UI to use that single atomic call (eliminating the client-side “remove-from-rack loop” orphan window).
Changes:
- Add
force_clear_conflicting_rack_membership(optional) toAssignDevicesToSiteRequest, plumbed through handler translation and domain params. - Enforce additional auth (
rack:manage) when the force-clear cascade is requested, and implement server-side conflict partitioning + in-tx rack detach + audit metadata. - Update
MinerReparentPickerto collapse cross-site moves into a singleassignDevicesToSite(..., forceClearConflictingRackMembership: true)dispatch; add handler/service test coverage.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| server/internal/handlers/sites/translate.go | Maps the new force-clear request flag into service params. |
| server/internal/handlers/sites/handler.go | Requires rack:manage when force-clear cascade is requested. |
| server/internal/handlers/sites/handler_test.go | Adds tests pinning the handler auth gate for the force-clear branch. |
| server/internal/domain/stores/sqlstores/site.go | Switches internal error formatting to %w for proper wrapping. |
| server/internal/domain/stores/sqlstores/collection.go | Switches internal error formatting to %w for proper wrapping. |
| server/internal/domain/sites/service.go | Implements force-clear conflict partitioning, in-tx rack detach, retry-safe per-attempt counters, and cascade-aware audit metadata. |
| server/internal/domain/sites/service_test.go | Adds service-layer tests for cascade success/no-op/reject-before-delete/rollback/scoped-clear/unassigned target/audit metadata. |
| server/internal/domain/sites/models/models.go | Adds ForceClearConflictingRackMembership to the service params shape and documents behavior. |
| proto/sites/v1/sites.proto | Adds the optional force_clear_conflicting_rack_membership field with API documentation. |
| client/src/protoFleet/features/fleetManagement/components/MinerActionsMenu/MinerReparentPicker.tsx | Replaces the per-rack unassign loop + site reassign with a single atomic site RPC using the new flag. |
| client/src/protoFleet/api/sites.ts | Exposes forceClearConflictingRackMembership on the client hook and passes it through to the RPC. |
| server/generated/grpc/sites/v1/sites.pb.go | Regenerated Go protobuf output for the new request field. |
| client/src/protoFleet/api/generated/sites/v1/sites_pb.ts | Regenerated TS protobuf output for the new request field. |
Part 2 of 4 — Atomic Cross-Site Reparent
Stacks on PR #463. Closes the cross-site orphan window that survived the original issue #420 fix.
1. Summary
Adds an opt-in
force_clear_conflicting_rack_membershipflag toAssignDevicesToSiteso the server can clear a miner's rack membership inside the same transaction as the site assignment. Without the flag, today's behavior is preserved: a miner whose current rack lives on a different site rejects the whole batch. With the flag, the rack rows are dropped and the site write commits atomically — eliminating the client-side remove-then-reassign loop and the orphan-state window it left behind. The picker UI inMinerReparentPickercollapses to a single RPC for the cross-site flow.2. How it works
Pre-PR cross-site flow (orphan window):
AssignDevicesToSiterejects withPerDeviceConflict[]ofDEVICE_IN_RACK_AT_OTHER_SITE.RemoveDevicesFromDeviceSetper source rack, then callsAssignDevicesToSiteagain.Post-PR cross-site flow:
AssignDevicesToSiteonce withforce_clear_conflicting_rack_membership: true.a. Lock target site row.
b. Lock device rows for reassign.
c. Compute conflicts (per-device validation).
d. Partition conflicts: clearable (rack-at-other-site) vs residual (e.g.
DEVICE_NOT_FOUND).e. If residual conflicts remain → return them without any deletion. (Don't half-commit.)
f. If only clearable:
RemoveDevicesFromAnyRack(clearableIDs, target=0)drops the rack-membership rows for only the conflicting devices. Devices already at the target site keep their rack rows intact.g. Apply the site write (
UPDATE device SET site_id = …).Auth boundary.
AssignDevicesToSiteis gated byPermSiteManageat the procedure level. When the caller setsforce_clear=true, the handler performs a secondRequirePermission(PermRackManage)check before invoking the service — the cascade deletesdevice_set_membershiprows, which sibling rack RPCs requirePermRackManageto do.Retry safety. Two store wrappers (
RemoveDevicesFromAnyRack,LockDevicesForReassign) now wrap their PG errors with%wso the SQLTransactor retry loop can unwrap topgconn.PgErrorand detect40P01(deadlock_detected). Cross-RPC lock-order collisions become silent retries instead of Internal 500s.3. Diagrams
Cross-site reparent flow (pre vs post)
flowchart TB subgraph Pre["Pre-PR: 3 RPCs, orphan window between them"] A1["UI: AssignDevicesToSite"] --> A2["Server: reject<br/>PerDeviceConflict DEVICE_IN_RACK_AT_OTHER_SITE"] A2 --> A3["UI: loop RemoveDevicesFromDeviceSet<br/>per source rack"] A3 -.->|"network fail here<br/>= rack-less + old site"| A4["UI: AssignDevicesToSite retry"] A4 --> A5["Server: apply site write"] end subgraph Post["Post-PR: 1 RPC, atomic"] B1["UI: AssignDevicesToSite<br/>force_clear=false"] --> B2["Server: reject<br/>PerDeviceConflict DEVICE_IN_RACK_AT_OTHER_SITE"] B2 --> B3["UI: prompt unassign and continue?"] B3 --> B4["UI: AssignDevicesToSite<br/>force_clear=true"] B4 --> B5["Server: one tx<br/>1. lock site + devices<br/>2. partition conflicts<br/>3. clear conflicting rack rows<br/>4. apply site write"] endServer-side
AssignDevicesToSitewithforce_clear=truesequenceDiagram participant Client participant Handler participant Auth as middleware.RequirePermission participant Svc as sites.Service participant Store participant Coll as collectionStore participant DB Client->>Handler: AssignDevicesToSite force_clear=true Handler->>Auth: PermSiteManage Auth-->>Handler: ok Handler->>Auth: PermRackManage gated by flag Auth-->>Handler: ok Handler->>Svc: AssignDevicesToSite params Svc->>DB: BEGIN Svc->>Store: LockSiteForWrite target Svc->>Store: LockDevicesForReassign ids Svc->>Svc: computeReassignConflicts alt residual non-clearable conflicts Svc-->>Handler: txConflicts no write Svc->>DB: COMMIT lock-only else only clearable rack-at-other-site Svc->>Coll: RemoveDevicesFromAnyRack clearableIDs target=0 Svc->>Store: AssignDevicesToSite bulk UPDATE Svc->>DB: COMMIT Svc->>Svc: emit activity event with force_cleared metadata end Svc-->>Handler: rowsAffected or conflicts Handler-->>Client: responseConflict partitioning state
stateDiagram-v2 [*] --> ComputeConflicts ComputeConflicts --> NoConflicts: conflicts == 0 ComputeConflicts --> Partition: conflicts > 0 NoConflicts --> SiteWrite Partition --> CheckResidual CheckResidual --> AbortNoDelete: residual greater than 0 CheckResidual --> ClearableOnly: residual==0 AND clearable greater than 0 CheckResidual --> ReturnAsConflicts: residual==0 AND clearable==0 AND force_clear=false AbortNoDelete --> ReturnConflicts ClearableOnly --> RemoveRackRows RemoveRackRows --> SiteWrite SiteWrite --> EmitAudit EmitAudit --> [*] ReturnConflicts --> [*] ReturnAsConflicts --> [*]4. Areas of the code involved
proto/sites/v1/sites.protooptional bool force_clear_conflicting_rack_membership = 3toAssignDevicesToSiteRequest. Fresh tag (no reuse).server/generated/grpc/sites/v1/sites.pb.goclient/src/protoFleet/api/generated/sites/v1/sites_pb.tsserver/internal/handlers/sites/handler.go(modified)PermSiteManagecheck, when the flag is set, runs a secondRequirePermission(PermRackManage)before invoking the service.server/internal/handlers/sites/translate.go(modified)toAssignDevicesParamsthreadsreq.GetForceClearConflictingRackMembership()into the domain params.server/internal/handlers/sites/handler_test.go(modified)TestHandler_AssignDevicesToSite_forceClearRequiresRackManagewith 3 subcases (perm matrix).server/internal/domain/sites/models/models.go(modified)ForceClearConflictingRackMembership boolonAssignDevicesToSiteParams.server/internal/domain/sites/service.go(modified)AssignDevicesToSitenow usesRunInTxWithResultreturning a per-attemptassignDevicesToSiteTx{rowsAffected, txConflicts, forceClearedIDs}struct (retry-safe). Force-clear branch partitions conflicts and clears only the conflicting devices' rack rows. Activity event extended withforce_cleared_rack_membership_count+ truncated identifiers when the cascade fired.server/internal/domain/sites/service_test.go(modified)DEVICE_NOT_FOUNDaborts BEFORE delete, rollback on site-write failure, scoped-clear (only conflicting devices touched), Unassigned target + force_clear. Plus 2 audit-log assertions (cascade metadata present / absent).server/internal/domain/stores/sqlstores/collection.go(modified)%v→%winRemoveDevicesFromAnyRackerror wrap.pgconn.PgErrorand detect40P01.server/internal/domain/stores/sqlstores/site.go(modified)%v→%winLockDevicesForReassign.client/src/protoFleet/api/sites.ts(modified)forceClearConflictingRackMembership?: booleantoAssignDevicesToSiteProps. Threaded through tositesClient.assignDevicesToSite.client/src/protoFleet/features/fleetManagement/components/MinerActionsMenu/MinerReparentPicker.tsx(modified)removeDevicesFromDeviceSetloop indispatchSiteMoveWithUnassignand the now-unusedremoveFromRackhelper / hook destructure. The single call now isdispatchSiteReassign(targetSiteId, ids, true).5. Key technical decisions & trade-offs
AssignDevicesToSiteAndClearRackRPC. Keeps the surface small; gives the server one transaction boundary; preserves today's strict-conflict behavior for callers that don't set the flag. Trade-off: a single RPC now has two semantic modes.PermSiteManage+ conditionalPermRackManage) over (a) wideningAssignDevicesToSiteto also requirePermRackManageunconditionally or (b) declaring the permission union at the procedure-map level. Handler-level conditional check preserves the existing site-only gating for the non-cascade path and avoids middleware that has to peek the request body.rack_slotchildren). Scoped clear preserves untouched rack assemblies.%wwrapping in two sqlstores as the cheap fix for the deadlock-retry path. The proper fix — aligning lock order between this RPC andAssignDevicesToRack— needs a new sqlcLockSourceRacksForDevicesquery and is filed as a follow-up. The%wchange makes the deadlock window recoverable (silent retry) in the meantime.RunInTxWithResultover outer-scope accumulators. Matches the convention introduced by PR refactor(reparent): atomic reparent foundation (1/4) #463 forAssignBuildingsToSite/AssignRacksToSite/AssignRacksToBuilding. Tx retries start each attempt from zero; final response and audit event reflect only the committed attempt.eventDevicesRackMembershipForceClearedevent, which would split the side-effect from the cause.MinerReparentPickercollapses to a single RPC, removing the per-rack remove loop entirely. Trade-off: the operator's "unassigning from N racks…" progress toast is gone; replaced with a single "moving miners…" toast since the whole operation is now atomic on the server.6. Testing & validation
Service-layer tests (
server/internal/domain/sites/service_test.go):forceClearCascadesRackMembership— cascade + write success; mock assertsRemoveDevicesFromAnyRack(..., []string{"d1"}, 0)receives only the conflicting id.forceClearWithoutConflicts— no rack conflicts → no cascade call.forceClearMissingDeviceStillRejects— residualDEVICE_NOT_FOUNDaborts BEFORE the rack-clear delete fires.forceClearRollsBackOnSiteWriteFailure— tx.calls == 1.forceClearOnlyConflictingDevices— scoped clear regression.forceClearWithUnassignedTarget—target_site_id = nil+ flag works (no site lock; rack clear still fires).forceClearAuditLogsCascade— audit metadata includesforce_cleared_rack_membership_count+ identifiers + description suffix.noForceClearOmitsCascadeMetadata— plain reassign event has none of the cascade keys.Handler-layer tests (
server/internal/handlers/sites/handler_test.go):forceClearRequiresRackManagematrix: site-only + flag-false → ok; site-only + flag-true →PermissionDenied; both perms + flag-true → ok.Verification:
go build ./...— cleango test ./server/internal/domain/sites/... ./server/internal/handlers/sites/... ./server/internal/domain/collection/... -count=1— passnpx tsc --noEmit(client) — cleannpm run lint— cleanExplicitly NOT covered:
%wunwrap →IsRetryablePostgresError→ SQLTransactor retries) is exercised only via unit-test mocks. A focused integration test that fires concurrentAssignDevicesToSite(force_clear=true)+AssignDevicesToRackagainst the same device set would verify both the deadlock detection and the retry recovery. Filed as a follow-up.AssignDevicesToRack(which would eliminate the deadlock window, not just make it recoverable). Needs a new sqlcLockSourceRacksForDevicesquery + interface/mock regen. Filed as a follow-up; the%wretry safety net is the bridge.How it fits into the larger refactor (4 PRs)
force_clear_conflicting_rack_membership.LockRacksForReparenthelper that the lock-order alignment follow-up would build on).Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com