Enforce timeout invariant: batch < worker < prune #211
Open
jkurashcvx wants to merge 9 commits into
samtkaplan
approved these changes
May 19, 2026
…ilure
Timeout & deadline fixes:
- Enforce invariant: batch_timeout < worker_timeout < prune_timeout
- Add hard deadlines to addprocs_with_timeout and setup_launched_worker
- Reduce default prune interval from 600s to 120s
VM state tracking:
- Add in_flight tracking to prevent prune/sync interference with pipeline
- Fix nworkers_provisioned: subtract pending_down, fix scaleset_sync double-count
- Upgrade VMSS API to 2024-07-01 with $expand=instanceView for power state
- Deduplicate list_scaleset_vms calls (one fetch shared by prune_cluster + prune_scalesets)
Reimage-on-first-failure:
- Failed VMs are reimaged before deletion (second failure = delete)
- Bulk reimage API with empty-ids guard
- pending_reimage tracking through updating/creating phase
- Add empty-ids guard to delete_vms
a0dc806 to 4f12f68
- check_service_health: query Microsoft.ResourceHealth/events API for active ServiceIssue events affecting VM/Network in the VMSS region; blocks scaling in scaleset_create_or_update when an incident is active (fails open)
- list_scaleset_nics: bulk fetch NIC provisioning states per scaleset (one API call, not per-VM); integrated into scaleset_pruning loop
- prune_scalesets: VMs with succeeded provisioningState but failed NIC are now treated as failed (reimage on first failure, delete on second)
- nic_state included in reimage/delete warning log messages for diagnostics
…andoned batch cleanup
P0-1: Purge pending_reimage/reimaged entries when VMs are deleted in delete_pending_down_vms, delete_orphan_pending_down_vms, and delete_scaleset. Prevents ghost entries that cause 99-100% reimage failure rates.
P1-6: Cap reimage API failures at 2 attempts per instance. After 2 consecutive failures, escalate to orphan_pending_down for deletion instead of retrying indefinitely. Add reimage_failures tracking field to AzManager.
P1-7: Log reimage errors at @warn level with scaleset context instead of @error without context.
P1-5: When a batch is abandoned due to timeout, close sockets for unregistered workers and remove from in_flight. Workers detect the closed socket and retry via azure_worker's retry loop instead of being permanently lost.
…double-encoding
The $filter query parameter was being pre-encoded with HTTP.escapeuri(), causing Azure to receive %27 instead of single quotes, resulting in 400 Bad Request with InvalidODataQueryOptions. Also downgraded api-version from 2025-05-01 to 2024-02-01.
The 'or' clause in the OData filter was causing InvalidODataQueryOptions. Without encoding, the spaces broke the URL. With encoding, Azure rejected the encoded single quotes. Simplified to filter on 'Virtual Machines' only with proper URL encoding.
The Azure Resource Health Events API requires ISO 8601 dates. Using m/d/yyyy format (e.g. 5/27/2026) caused InvalidODataQueryOptions 400 errors because the slashes were percent-encoded and the format was not recognized by the OData parser.
…eryOptions)
The $filter parameter with 'service eq Virtual Machines' returns 400 on api-version 2024-02-01. Dropping it and using only queryStartTime works (200). The code already filters by event type, service, and region client-side.
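The encoding problems described in the commits above can be reproduced outside Azure. The following is a minimal sketch in Python (the project itself is Julia; `urllib.parse.quote` stands in for HTTP.escapeuri, and the filter string and date are illustrative values taken from the commit messages). It shows how a pre-encoded value that is encoded a second time decodes back to %27 rather than a single quote, and why an ISO 8601 date passes through percent-encoding unchanged while m/d/yyyy does not.

```python
from datetime import date
from urllib.parse import quote, unquote

# Illustrative filter string; the real query is built in the Julia code.
raw_filter = "service eq 'Virtual Machines'"

# Encoding once is what the HTTP client should do for us.
encoded_once = quote(raw_filter, safe="")

# Pre-encoding and then letting the client encode again double-encodes:
# '%' becomes '%25', so "'" ends up as '%2527'.
encoded_twice = quote(encoded_once, safe="")

# A server that decodes once sees the literal text %27, not a quote.
server_view = unquote(encoded_twice)

# Date formats: slashes are percent-encoded, ISO 8601 dates are not.
start = date(2026, 5, 27)
mdy = f"{start.month}/{start.day}/{start.year}"
iso = start.isoformat()

print(encoded_once)
print(encoded_twice)
print(server_view)
print(quote(mdy, safe=""), quote(iso, safe=""))
```

The takeaway matches the commit history: encode each query value exactly once, and pass dates such as queryStartTime in ISO 8601 form.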
Summary
Introduces a timeout invariant, batch_timeout < worker_timeout < prune_timeout, that prevents the cascade failures observed in Job 17eed641.
Changes (3 expressions)
- addprocs_with_timeout: Use JULIA_AZMANAGERS_BATCH_TIMEOUT (default 10s) instead of worker_timeout() + 30. Workers already completed Phase 1 (connected + validated), so Phase 2 back-connection should be fast.
- setup_launched_worker: Per-worker timeout = batch_timeout - 2. Fails individual workers before batch timeout fires.
- prune_scalesets: max(VM_JOIN_TIMEOUT, worker_timeout+30). Prune never fires before a batch has time to complete.

Problem
addprocs_locked)

Env Vars
| Variable | Default |
| --- | --- |
| JULIA_AZMANAGERS_BATCH_TIMEOUT | 10 |
| JULIA_WORKER_TIMEOUT | 60 (stdlib) |
| JULIA_AZMANAGERS_VM_JOIN_TIMEOUT | 720 |
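To make the relationship between the three expressions and the environment variables concrete, here is a minimal sketch in Python (the project is Julia; this is not the PR's code). The function names `resolve_timeouts` and `invariant_holds` are hypothetical, and the sketch assumes all values are integer seconds. It derives the per-worker and prune timeouts from the documented defaults and checks the ordering the PR describes.

```python
import os


def resolve_timeouts(env=None):
    """Derive the effective timeouts (seconds) from environment variables.

    Hypothetical helper: variable names and defaults come from the PR
    description, the arithmetic mirrors its three expressions.
    """
    env = os.environ if env is None else env
    batch = int(env.get("JULIA_AZMANAGERS_BATCH_TIMEOUT", "10"))
    worker = int(env.get("JULIA_WORKER_TIMEOUT", "60"))
    vm_join = int(env.get("JULIA_AZMANAGERS_VM_JOIN_TIMEOUT", "720"))
    return {
        "per_worker": batch - 2,              # setup_launched_worker
        "batch": batch,                       # addprocs_with_timeout
        "worker": worker,                     # stdlib worker timeout
        "prune": max(vm_join, worker + 30),   # prune_scalesets
    }


def invariant_holds(t):
    """True when per_worker < batch < worker < prune (all positive)."""
    return 0 < t["per_worker"] < t["batch"] < t["worker"] < t["prune"]


defaults = resolve_timeouts({})
print(defaults, invariant_holds(defaults))

# A batch timeout larger than the worker timeout breaks the ordering.
bad = resolve_timeouts({"JULIA_AZMANAGERS_BATCH_TIMEOUT": "90"})
print(bad, invariant_holds(bad))
```

With the defaults this gives a per-worker timeout of 8s, a batch timeout of 10s, a worker timeout of 60s, and a prune threshold of 720s, so each stage gives up before the next one fires. Because prune uses worker_timeout + 30 as a floor, raising JULIA_WORKER_TIMEOUT alone cannot push it past the prune threshold; overriding the batch timeout above the worker timeout is the case a check like this would catch.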