Summary
sub_batch_terminal status promotion in POST /api/internal/agent/exec-batches/{batch_id}/progress performs three non-atomic HGETs followed by a conditional HSET. Under ≥3 concurrent sub-jobs with at least one failure, an interleaving exists where the hash ends up with status="complete" but failed=1 — operator misses the error.
Surfaced by the Phase 28 code review (CR-01); classified by the verifier as a residual defect, not a phase-goal hole. Must land before v4.0 multi-host deployment scales to ≥3 concurrent sub-jobs per batch.
Location
src/phaze/routers/agent_exec_batches.py:189-198
Repro (logical trace)
3 sub-jobs A/C succeed, B fails, all terminal:
- A's pipeline:
subjobs_completed=1, completed+=1
- C's pipeline:
subjobs_completed=2, completed+=1
- B's pipeline begins:
HINCRBY subjobs_completed=3 lands; HINCRBY failed is queued but server has not yet processed it
- C's handler enters terminal block:
HGET subjobs_completed=3, subjobs_expected=3, failed=0 (B's HINCRBY failed not yet processed). C HSET status="complete".
- B's
HINCRBY failed runs → failed=1
- B's handler enters terminal block: reads
failed=1, HSET status="complete_with_errors"
- If C's HSET lands after B's HSET, final state is
status="complete" with failed=1
The SSE close-event name (event: complete vs event: complete_with_errors) drives the operator's HTMX sse-close listener — wrong status produces wrong completion UI.
transaction=False on a redis-py async pipeline does not atomize across connections; concurrent connections interleave on the Redis server.
Fix
Use a server-side Lua script (one atomic round-trip):
_PROMOTE_STATUS_LUA = """
local key = KEYS[1]
local sc = tonumber(redis.call('HGET', key, 'subjobs_completed') or '0')
local se = tonumber(redis.call('HGET', key, 'subjobs_expected') or '0')
if sc ~= se then return 0 end
local failed = tonumber(redis.call('HGET', key, 'failed') or '0')
local new_status = (failed == 0) and 'complete' or 'complete_with_errors'
redis.call('HSET', key, 'status', new_status)
return 1
"""
Register once at module load via redis_client.register_script(...) and replace the three-HGET-plus-HSET block with a single call.
Acceptance
- New test in
tests/test_routers/test_agent_exec_batches.py that drives ≥3 concurrent terminal POSTs (one failed) using asyncio.gather against a real Redis instance and asserts final status is consistent with failed count
- Existing 28-V-13 / 28-V-14 / 28-V-16 sub_batch_terminal tests remain GREEN
- Pre-commit clean (ruff / mypy / bandit)
Context
- Phase 28 review:
.planning/phases/28-distributed-execution-dispatch/28-REVIEW.md (CR-01)
- Phase 28 verification:
.planning/phases/28-distributed-execution-dispatch/28-VERIFICATION.md
- Locked decisions D-04 / D-07 specify the read-then-write semantics but do not constrain atomicity — Lua-based fix preserves the contract
Summary
sub_batch_terminalstatus promotion inPOST /api/internal/agent/exec-batches/{batch_id}/progressperforms three non-atomicHGETs followed by a conditionalHSET. Under ≥3 concurrent sub-jobs with at least one failure, an interleaving exists where the hash ends up withstatus="complete"butfailed=1— operator misses the error.Surfaced by the Phase 28 code review (CR-01); classified by the verifier as a residual defect, not a phase-goal hole. Must land before v4.0 multi-host deployment scales to ≥3 concurrent sub-jobs per batch.
Location
src/phaze/routers/agent_exec_batches.py:189-198Repro (logical trace)
3 sub-jobs A/C succeed, B fails, all terminal:
subjobs_completed=1,completed+=1subjobs_completed=2,completed+=1HINCRBY subjobs_completed=3lands;HINCRBY failedis queued but server has not yet processed itHGET subjobs_completed=3,subjobs_expected=3,failed=0(B'sHINCRBY failednot yet processed). CHSET status="complete".HINCRBY failedruns →failed=1failed=1,HSET status="complete_with_errors"status="complete"withfailed=1The SSE close-event name (
event: completevsevent: complete_with_errors) drives the operator's HTMXsse-closelistener — wrong status produces wrong completion UI.transaction=Falseon a redis-py async pipeline does not atomize across connections; concurrent connections interleave on the Redis server.Fix
Use a server-side Lua script (one atomic round-trip):
Register once at module load via
redis_client.register_script(...)and replace the three-HGET-plus-HSET block with a single call.Acceptance
tests/test_routers/test_agent_exec_batches.pythat drives ≥3 concurrent terminal POSTs (one failed) usingasyncio.gatheragainst a real Redis instance and asserts finalstatusis consistent withfailedcountContext
.planning/phases/28-distributed-execution-dispatch/28-REVIEW.md(CR-01).planning/phases/28-distributed-execution-dispatch/28-VERIFICATION.md