Skip to content

Fix HINCRBY/HSET race in agent exec-batch terminal status promotion #61

@SimplicityGuy

Description

@SimplicityGuy

Summary

sub_batch_terminal status promotion in POST /api/internal/agent/exec-batches/{batch_id}/progress performs three non-atomic HGETs followed by a conditional HSET. Under ≥3 concurrent sub-jobs with at least one failure, an interleaving exists where the hash ends up with status="complete" but failed=1 — operator misses the error.

Surfaced by the Phase 28 code review (CR-01); classified by the verifier as a residual defect, not a phase-goal hole. Must land before v4.0 multi-host deployment scales to ≥3 concurrent sub-jobs per batch.

Location

src/phaze/routers/agent_exec_batches.py:189-198

Repro (logical trace)

3 sub-jobs A/C succeed, B fails, all terminal:

  1. A's pipeline: subjobs_completed=1, completed+=1
  2. C's pipeline: subjobs_completed=2, completed+=1
  3. B's pipeline begins: HINCRBY subjobs_completed=3 lands; HINCRBY failed is queued but server has not yet processed it
  4. C's handler enters terminal block: HGET subjobs_completed=3, subjobs_expected=3, failed=0 (B's HINCRBY failed not yet processed). C HSET status="complete".
  5. B's HINCRBY failed runs → failed=1
  6. B's handler enters terminal block: reads failed=1, HSET status="complete_with_errors"
  7. If C's HSET lands after B's HSET, final state is status="complete" with failed=1

The SSE close-event name (event: complete vs event: complete_with_errors) drives the operator's HTMX sse-close listener — wrong status produces wrong completion UI.

transaction=False on a redis-py async pipeline does not atomize across connections; concurrent connections interleave on the Redis server.

Fix

Use a server-side Lua script (one atomic round-trip):

_PROMOTE_STATUS_LUA = """
local key = KEYS[1]
local sc = tonumber(redis.call('HGET', key, 'subjobs_completed') or '0')
local se = tonumber(redis.call('HGET', key, 'subjobs_expected') or '0')
if sc ~= se then return 0 end
local failed = tonumber(redis.call('HGET', key, 'failed') or '0')
local new_status = (failed == 0) and 'complete' or 'complete_with_errors'
redis.call('HSET', key, 'status', new_status)
return 1
"""

Register once at module load via redis_client.register_script(...) and replace the three-HGET-plus-HSET block with a single call.

Acceptance

  • New test in tests/test_routers/test_agent_exec_batches.py that drives ≥3 concurrent terminal POSTs (one failed) using asyncio.gather against a real Redis instance and asserts final status is consistent with failed count
  • Existing 28-V-13 / 28-V-14 / 28-V-16 sub_batch_terminal tests remain GREEN
  • Pre-commit clean (ruff / mypy / bandit)

Context

  • Phase 28 review: .planning/phases/28-distributed-execution-dispatch/28-REVIEW.md (CR-01)
  • Phase 28 verification: .planning/phases/28-distributed-execution-dispatch/28-VERIFICATION.md
  • Locked decisions D-04 / D-07 specify the read-then-write semantics but do not constrain atomicity — Lua-based fix preserves the contract

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions