Enforce timeout invariant: batch < worker < prune #211

Open
jkurashcvx wants to merge 9 commits into release-3 from hotfix/timeout-invariant

@jkurashcvx
Member

Summary

Introduces a timeout invariant that prevents cascade failures observed in Job 17eed641:

batch_timeout (10s) < worker_timeout (720s) < prune_timeout (max(720, worker_timeout + 30))

Changes (3 expressions)

  • addprocs_with_timeout: Use JULIA_AZMANAGERS_BATCH_TIMEOUT (default 10s) instead of worker_timeout() + 30. Workers already completed Phase 1 (connected + validated), so Phase 2 back-connection should be fast.

  • setup_launched_worker: Per-worker timeout = batch_timeout - 2. Fails individual workers before batch timeout fires.

  • prune_scalesets: max(VM_JOIN_TIMEOUT, worker_timeout+30). Prune never fires before a batch has time to complete.
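
A minimal sketch of how the three values above relate, for reviewers who want to check the arithmetic. The function name `timeout_plan` and the returned field names are hypothetical and not part of this PR; only the environment variable names, their defaults, and the formulas come from the description.

```julia
# Hypothetical helper (not the PR's code): derive the timeouts described above
# from an environment-like dictionary and check the ordering invariant.
function timeout_plan(env::AbstractDict=ENV)
    batch      = parse(Float64, get(env, "JULIA_AZMANAGERS_BATCH_TIMEOUT", "10"))
    worker     = parse(Float64, get(env, "JULIA_WORKER_TIMEOUT", "60"))
    join_floor = parse(Float64, get(env, "JULIA_AZMANAGERS_VM_JOIN_TIMEOUT", "720"))

    per_worker = batch - 2                   # setup_launched_worker: fail before the batch does
    prune      = max(join_floor, worker + 30) # prune_scalesets: never earlier than worker + 30

    # The invariant from the summary: batch < worker < prune.
    batch < worker < prune ||
        throw(ArgumentError("timeout invariant violated: batch=$batch worker=$worker prune=$prune"))

    return (; batch, per_worker, worker, prune)
end
```

With the defaults from the Env Vars table, this yields a per-worker timeout of 8s and a prune timeout of 720s; setting the batch timeout at or above the worker timeout raises an error instead of silently reordering the deadlines.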

Problem

  1. One bad worker blocks entire batch for ~830s (PID ordering in addprocs_locked)
  2. Prune fires at 720s < batch timeout of 830s → kills VMs still being processed
  3. Killed VMs reconnect → PID inflation (947 PIDs, 8 real workers)

Env Vars

| Variable | Default | Purpose |
| --- | --- | --- |
| JULIA_AZMANAGERS_BATCH_TIMEOUT | 10 | Phase 2 batch timeout (new) |
| JULIA_WORKER_TIMEOUT | 60 (stdlib) | Worker boot+join |
| JULIA_AZMANAGERS_VM_JOIN_TIMEOUT | 720 | Prune floor |

…ilure

Timeout & deadline fixes:
- Enforce invariant: batch_timeout < worker_timeout < prune_timeout
- Add hard deadlines to addprocs_with_timeout and setup_launched_worker
- Reduce default prune interval from 600s to 120s

VM state tracking:
- Add in_flight tracking to prevent prune/sync interference with pipeline
- Fix nworkers_provisioned: subtract pending_down, fix scaleset_sync double-count
- Upgrade VMSS API to 2024-07-01 with $expand=instanceView for power state
- Deduplicate list_scaleset_vms calls (one fetch shared by prune_cluster + prune_scalesets)

Reimage-on-first-failure:
- Failed VMs are reimaged before deletion (second failure = delete)
- Bulk reimage API with empty-ids guard
- pending_reimage tracking through updating/creating phase
- Add empty-ids guard to delete_vms
jkurashcvx force-pushed the hotfix/timeout-invariant branch from a0dc806 to 4f12f68 on May 20, 2026 13:21
Josh and others added 8 commits May 20, 2026 15:56
- check_service_health: query Microsoft.ResourceHealth/events API for active
  ServiceIssue events affecting VM/Network in the VMSS region; blocks scaling
  in scaleset_create_or_update when an incident is active (fails open)
- list_scaleset_nics: bulk fetch NIC provisioning states per scaleset (one API
  call, not per-VM); integrated into scaleset_pruning loop
- prune_scalesets: VMs with succeeded provisioningState but failed NIC are now
  treated as failed (reimage on first failure, delete on second)
- nic_state included in reimage/delete warning log messages for diagnostics
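
The "fails open" behaviour of check_service_health could look roughly like the sketch below. Everything here is an assumption made for illustration: `scaling_allowed`, the `fetch_events` callable, and the event fields (`event_type`, `status`, `regions`) are hypothetical stand-ins, not the PR's code or the Resource Health response schema.

```julia
# Hypothetical sketch of a fail-open health gate.
# `fetch_events` is an assumed zero-argument callable standing in for the
# Resource Health query; each event is assumed to be a NamedTuple.
function scaling_allowed(fetch_events, region::AbstractString)
    events = try
        fetch_events()
    catch err
        # Fail open: if the health query itself fails, do not block scaling.
        @warn "service health check failed; allowing scaling" exception = err
        return true
    end
    blocked = any(events) do e
        e.event_type == "ServiceIssue" && e.status == "Active" && region in e.regions
    end
    return !blocked
end
```

The design choice is that an outage of the health API should never be the reason a cluster cannot scale; only a successfully retrieved, active incident in the target region blocks scaling.
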
…andoned batch cleanup

P0-1: Purge pending_reimage/reimaged entries when VMs are deleted in
delete_pending_down_vms, delete_orphan_pending_down_vms, and delete_scaleset.
Prevents ghost entries that cause 99-100% reimage failure rates.

P1-6: Cap reimage API failures at 2 attempts per instance. After 2 consecutive
failures, escalate to orphan_pending_down for deletion instead of retrying
indefinitely. Add reimage_failures tracking field to AzManager.
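
The retry cap in P1-6 can be sketched as follows. The function names, the constant, and the container types are hypothetical; the PR only states that a reimage_failures field is added to AzManager and that instances escalate to orphan_pending_down after 2 consecutive failures.

```julia
# Hypothetical sketch of the P1-6 cap; names and types are assumptions.
const MAX_REIMAGE_FAILURES = 2

# Record one failed reimage API call for `id`. Returns :retry while under the
# cap, or :escalate once the cap is reached (the id moves to the deletion set).
function record_reimage_failure!(failures::Dict{String,Int},
                                 orphan_pending_down::Set{String},
                                 id::String)
    n = get(failures, id, 0) + 1
    if n >= MAX_REIMAGE_FAILURES
        delete!(failures, id)            # stop tracking; deletion takes over
        push!(orphan_pending_down, id)
        return :escalate
    end
    failures[id] = n
    return :retry
end

# A successful reimage resets the count, so only consecutive failures add up.
record_reimage_success!(failures::Dict{String,Int}, id::String) = delete!(failures, id)
```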

P1-7: Log reimage errors at @warn level with scaleset context instead of @error
without context.

P1-5: When a batch is abandoned due to timeout, close sockets for unregistered
workers and remove from in_flight. Workers detect the closed socket and retry
via azure_worker's retry loop instead of being permanently lost.
…double-encoding

The $filter query parameter was being pre-encoded with HTTP.escapeuri(),
causing Azure to receive %27 instead of single quotes, resulting in
400 Bad Request with InvalidODataQueryOptions. Also downgraded
api-version from 2025-05-01 to 2024-02-01.

The 'or' clause in the OData filter was causing InvalidODataQueryOptions.
Without encoding, the spaces broke the URL. With encoding, Azure rejected
the encoded single quotes. Simplified to filter on 'Virtual Machines' only
with proper URL encoding.
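
To illustrate the encoding problem described in these commits, here is a small self-contained sketch. `pct_encode` is a hypothetical stand-in written for this example, not HTTP.escapeuri itself, and the assumption is that the request layer percent-encodes query values once more after the caller has already done so.

```julia
# Hypothetical minimal percent-encoder: RFC 3986 unreserved characters pass
# through, every other byte becomes %XX.
function pct_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        c = Char(b)
        if b < 0x80 && (isletter(c) || isdigit(c) || c in "-._~")
            write(io, b)
        else
            print(io, '%', uppercase(string(b, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

filter_value = "service eq 'Virtual Machines'"
once  = pct_encode(filter_value)  # quotes become %27, spaces become %20
twice = pct_encode(once)          # the '%' itself is encoded, giving %2527 and %2520
```

If the value is encoded twice, the server decodes only one layer and sees the literal text `%27` where it expects a quote, which is consistent with the 400 response reported above.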

The Azure Resource Health Events API requires ISO 8601 dates. Using
m/d/yyyy format (e.g. 5/27/2026) caused InvalidODataQueryOptions 400
errors because the slashes were percent-encoded and the format was
not recognized by the OData parser.
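
A short sketch of the date formatting difference, using only the Dates standard library. The helper name `iso_query_date` is hypothetical, and whether the API expects a date or a full timestamp is an assumption not settled by the commit message.

```julia
using Dates

# Hypothetical helper: format a date for use as a query start time.
iso_query_date(d::Date) = Dates.format(d, Dates.ISODateFormat)

d = Date(2026, 5, 27)
old_style = Dates.format(d, dateformat"m/d/yyyy")  # the slash-separated form that was rejected
new_style = iso_query_date(d)                      # ISO 8601 calendar date, no slashes to encode
```

The ISO form contains only digits and hyphens, so it also sidesteps the percent-encoding of slashes mentioned in the commit.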
…eryOptions)

The $filter parameter with 'service eq Virtual Machines' returns 400
on api-version 2024-02-01. Dropping it and using only queryStartTime
works (200). The code already filters by event type, service, and
region client-side.
3 participants