Enforce timeout invariant: batch < worker < prune by jkurashcvx · Pull Request #211 · ChevronETC/AzManagers.jl

jkurashcvx · 2026-05-18T18:34:47Z

Summary

Introduces a timeout invariant that prevents cascade failures observed in Job 17eed641:

batch_timeout (10s) < worker_timeout (720s) < prune_timeout (max(720, wt+30))

Changes (3 expressions)

addprocs_with_timeout: Use JULIA_AZMANAGERS_BATCH_TIMEOUT (default 10s) instead of worker_timeout() + 30. Workers already completed Phase 1 (connected + validated), so Phase 2 back-connection should be fast.
setup_launched_worker: Per-worker timeout = batch_timeout - 2. Fails individual workers before batch timeout fires.
prune_scalesets: max(VM_JOIN_TIMEOUT, worker_timeout+30). Prune never fires before a batch has time to complete.
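The three expressions above can be sketched as one small derivation. This is an illustration only: `env_seconds`, `derive_timeouts`, and `check_invariant` are hypothetical names, not functions from AzManagers.jl. The environment variable names and defaults are taken from the Env Vars table in this description.

```julia
# Sketch only: helper names are hypothetical. The environment variable names
# and defaults come from the Env Vars table in the PR description.
env_seconds(env, name, default) = parse(Int, get(env, name, string(default)))

function derive_timeouts(env = ENV)
    batch      = env_seconds(env, "JULIA_AZMANAGERS_BATCH_TIMEOUT", 10)
    worker     = env_seconds(env, "JULIA_WORKER_TIMEOUT", 60)
    join_floor = env_seconds(env, "JULIA_AZMANAGERS_VM_JOIN_TIMEOUT", 720)

    per_worker = batch - 2                    # individual workers fail before the batch does
    prune      = max(join_floor, worker + 30) # prune never precedes worker_timeout + 30

    return (; batch, per_worker, worker, prune)
end

# The invariant from the summary, plus the per-worker deadline sitting below the batch deadline.
check_invariant(t) = t.per_worker < t.batch < t.worker < t.prune

# With a 720 s worker timeout, as in the summary, the prune timeout becomes 750 s.
t = derive_timeouts(Dict("JULIA_WORKER_TIMEOUT" => "720"))
```

Passing a `Dict` instead of `ENV` keeps the sketch testable without touching the process environment.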

Problem

One bad worker blocks entire batch for ~830s (PID ordering in addprocs_locked)
Prune fires at 720s < batch timeout of 830s → kills VMs still being processed
Killed VMs reconnect → PID inflation (947 PIDs, 8 real workers)

Env Vars

| Variable | Default | Purpose |
| --- | --- | --- |
| `JULIA_AZMANAGERS_BATCH_TIMEOUT` | `10` | Phase 2 batch timeout (new) |
| `JULIA_WORKER_TIMEOUT` | `60` (stdlib) | Worker boot+join |
| `JULIA_AZMANAGERS_VM_JOIN_TIMEOUT` | `720` | Prune floor |

…ilure

Timeout & deadline fixes:
- Enforce invariant: batch_timeout < worker_timeout < prune_timeout
- Add hard deadlines to addprocs_with_timeout and setup_launched_worker
- Reduce default prune interval from 600s to 120s

VM state tracking:
- Add in_flight tracking to prevent prune/sync interference with pipeline
- Fix nworkers_provisioned: subtract pending_down, fix scaleset_sync double-count
- Upgrade VMSS API to 2024-07-01 with $expand=instanceView for power state
- Deduplicate list_scaleset_vms calls (one fetch shared by prune_cluster + prune_scalesets)

Reimage-on-first-failure:
- Failed VMs are reimaged before deletion (second failure = delete)
- Bulk reimage API with empty-ids guard
- pending_reimage tracking through updating/creating phase
- Add empty-ids guard to delete_vms

- check_service_health: query Microsoft.ResourceHealth/events API for active ServiceIssue events affecting VM/Network in the VMSS region; blocks scaling in scaleset_create_or_update when an incident is active (fails open)
- list_scaleset_nics: bulk fetch NIC provisioning states per scaleset (one API call, not per-VM); integrated into scaleset_pruning loop
- prune_scalesets: VMs with succeeded provisioningState but failed NIC are now treated as failed (reimage on first failure, delete on second)
- nic_state included in reimage/delete warning log messages for diagnostics
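The reimage-on-first-failure rule, including the new NIC check, might look roughly like the sketch below. `VMStatus`, `is_failed`, and `failure_action!` are hypothetical names chosen for illustration, and the state strings are assumptions. The actual bookkeeping in AzManagers.jl is not shown in this description.

```julia
# Hypothetical types and names; this is not the AzManagers.jl implementation.
struct VMStatus
    instance_id::String
    provisioning_state::String  # assumed values such as "Succeeded" or "Failed"
    nic_state::String           # NIC provisioning state from the bulk NIC fetch
end

# A VM counts as failed if either the VM itself or its NIC reports failure.
is_failed(vm::VMStatus) =
    lowercase(vm.provisioning_state) == "failed" || lowercase(vm.nic_state) == "failed"

# First failure: reimage and remember the instance. Second failure: delete it.
function failure_action!(reimaged::Set{String}, vm::VMStatus)
    is_failed(vm) || return :keep
    if vm.instance_id in reimaged
        delete!(reimaged, vm.instance_id)  # purge the entry so no ghost record remains
        return :delete
    end
    push!(reimaged, vm.instance_id)
    return :reimage
end
```

Calling `failure_action!` twice on a VM whose NIC stays failed yields `:reimage` and then `:delete`.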


…andoned batch cleanup

P0-1: Purge pending_reimage/reimaged entries when VMs are deleted in delete_pending_down_vms, delete_orphan_pending_down_vms, and delete_scaleset. Prevents ghost entries that cause 99-100% reimage failure rates.

P1-6: Cap reimage API failures at 2 attempts per instance. After 2 consecutive failures, escalate to orphan_pending_down for deletion instead of retrying indefinitely. Add reimage_failures tracking field to AzManager.

P1-7: Log reimage errors at @warn level with scaleset context instead of @error without context.

P1-5: When a batch is abandoned due to timeout, close sockets for unregistered workers and remove from in_flight. Workers detect the closed socket and retry via azure_worker's retry loop instead of being permanently lost.
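The failure cap described in P1-6 can be sketched as a counter with an escalation step. The function name and container types below are assumptions made for illustration. Only the field names `reimage_failures` and `orphan_pending_down` and the limit of 2 come from the commit message.

```julia
# Sketch under assumptions: container types and the function name are hypothetical.
const MAX_REIMAGE_FAILURES = 2

function record_reimage_failure!(reimage_failures::Dict{String,Int},
                                 orphan_pending_down::Set{String},
                                 instance_id::String)
    n = get(reimage_failures, instance_id, 0) + 1
    if n >= MAX_REIMAGE_FAILURES
        # Stop retrying: drop the counter and hand the instance over for deletion.
        delete!(reimage_failures, instance_id)
        push!(orphan_pending_down, instance_id)
        return :escalate
    end
    reimage_failures[instance_id] = n
    return :retry
end
```

A successful reimage would presumably reset the counter for that instance, which is what makes the failures "consecutive"; that reset is not shown here.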

…double-encoding

The $filter query parameter was being pre-encoded with HTTP.escapeuri(), causing Azure to receive %27 instead of single quotes, resulting in 400 Bad Request with InvalidODataQueryOptions. Also downgraded api-version from 2025-05-01 to 2024-02-01.

The 'or' clause in the OData filter was causing InvalidODataQueryOptions. Without encoding, the spaces broke the URL. With encoding, Azure rejected the encoded single quotes. Simplified to filter on 'Virtual Machines' only with proper URL encoding.
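The double-encoding problem can be reproduced without any HTTP library. The encoder below is a hand-written stand-in, not `HTTP.escapeuri` itself, and the filter string is a hypothetical example. It shows how encoding an already encoded value turns `%27` into `%2527`, so the server decodes once and still sees `%27` rather than a quote.

```julia
# Stand-in percent-encoder for illustration; not the HTTP.jl implementation.
is_unreserved(b::UInt8) =
    b in UInt8('A'):UInt8('Z') || b in UInt8('a'):UInt8('z') ||
    b in UInt8('0'):UInt8('9') || b in codeunits("-_.~")

function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        if is_unreserved(b)
            write(io, b)
        else
            print(io, '%', uppercase(string(b, base = 16, pad = 2)))
        end
    end
    return String(take!(io))
end

filter_expr = "eventType eq 'ServiceIssue'"  # hypothetical filter value
once  = percent_encode(filter_expr)  # "eventType%20eq%20%27ServiceIssue%27"
twice = percent_encode(once)         # "eventType%2520eq%2520%2527ServiceIssue%2527"
```

Encoding a query value exactly once, at the point where the URL is assembled, avoids this class of bug.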

The Azure Resource Health Events API requires ISO 8601 dates. Using m/d/yyyy format (e.g. 5/27/2026) caused InvalidODataQueryOptions 400 errors because the slashes were percent-encoded and the format was not recognized by the OData parser.
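A minimal sketch of producing an ISO 8601 timestamp with the `Dates` standard library follows. The helper name `query_start_time` and the trailing `Z` (UTC) suffix are assumptions. Note that in `Dates` format strings uppercase `S` is seconds while lowercase `s` is milliseconds, which is the distinction behind the "SS not ss" fix listed among the commits.

```julia
using Dates

# `\T` and `\Z` are escaped so they are emitted as literal characters.
const ISO_UTC = dateformat"yyyy-mm-dd\THH:MM:SS\Z"

# Hypothetical helper: format a UTC DateTime for use as a query parameter value.
query_start_time(dt::DateTime) = Dates.format(dt, ISO_UTC)

query_start_time(DateTime(2026, 5, 27, 8, 5, 9))  # "2026-05-27T08:05:09Z"
```

Unlike `5/27/2026`, this form contains no slashes, though the colons may still need percent-encoding when the value is placed in a URL.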

…eryOptions)

The $filter parameter with 'service eq Virtual Machines' returns 400 on api-version 2024-02-01. Dropping it and using only queryStartTime works (200). The code already filters by event type, service, and region client-side.
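Client-side filtering of this kind could look like the sketch below. `HealthEvent` is a deliberately flattened, hypothetical record and its field names and values are assumptions. The real Resource Health response is nested JSON, and the actual filtering code is not shown in this PR description.

```julia
# Hypothetical, flattened view of one event; field names are assumptions.
struct HealthEvent
    event_type::String
    status::String
    service::String
    region::String
end

# Services of interest, following the "VM/Network" wording in the commit message.
const WATCHED_SERVICES = ("Virtual Machines", "Network")

# Keep only active service issues that touch a watched service in the given region.
function blocking_events(events, region::AbstractString)
    filter(events) do ev
        ev.event_type == "ServiceIssue" &&
            ev.status == "Active" &&
            ev.service in WATCHED_SERVICES &&
            lowercase(ev.region) == lowercase(region)
    end
end

events = [
    HealthEvent("ServiceIssue", "Active", "Virtual Machines", "South Central US"),
    HealthEvent("PlannedMaintenance", "Active", "Virtual Machines", "South Central US"),
    HealthEvent("ServiceIssue", "Active", "Storage", "South Central US"),
    HealthEvent("ServiceIssue", "Active", "Network", "West Europe"),
]
blocking = blocking_events(events, "south central us")  # keeps only the first event
```

Filtering after the fetch trades a slightly larger response for a request that does not depend on the service accepting a particular `$filter` expression.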

samtkaplan approved these changes May 19, 2026


jkurashcvx force-pushed the hotfix/timeout-invariant branch from a0dc806 to 4f12f68 Compare May 20, 2026 13:21

Josh and others added 8 commits May 20, 2026 15:56

change API version from 2025-05-01 to 2024-02-01

69c6cb0

fix: correct Dates.format specifier for seconds (SS not ss)

2fa5808

jkurashcvx wants to merge 9 commits into release-3 from hotfix/timeout-invariant
