Azmgr fix #210
Open
jkwashbourne-oss wants to merge 7 commits into
Phase A: Diagnostic logging around Phase 1/Phase 2 worker connections
- A1: Log Phase 2 start/complete/fail in setup_launched_worker with staleness, elapsed time
- A2: Hex dump and null count on null cookie errors in azure_worker_start
- A3: Log peer address on every worker socket accept
- A4: Log Phase 1 validation with bind_addr, instanceid; stamp _validated_at timestamp

Phase B2: Prune timeout alignment
- prune_scalesets timeout is now max(VM_JOIN_TIMEOUT, worker_timeout+30)
- Prevents pruning VMs still being processed by addprocs_with_timeout
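
The prune timeout alignment described above can be sketched as a small pure function. This is an illustration only: `prune_timeout` and its `margin` keyword are hypothetical names, and the input values are made up. Only the `max(VM_JOIN_TIMEOUT, worker_timeout+30)` rule comes from the commit message.

```julia
# Hypothetical helper mirroring the rule in the commit message:
# the prune timeout must never be shorter than the time a worker may still
# legitimately spend inside addprocs_with_timeout, plus a safety margin.
function prune_timeout(vm_join_timeout::Real, worker_timeout::Real; margin::Real=30)
    return max(vm_join_timeout, worker_timeout + margin)
end

# Illustrative values in seconds (not the package defaults).
short_worker = prune_timeout(720, 60)   # join timeout dominates
long_worker  = prune_timeout(720, 900)  # worker timeout plus margin dominates
```

With a rule of this shape, raising the worker timeout can never leave the pruner with a shorter deadline than the worker setup it races against.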
Tests exercise:
- _read_worker_config: Phase 1 handshake, bad cookie rejection, _validated_at stamping
- validate_connection: timeout on slow/hanging socket
- prune timeout alignment: max(join_timeout, worker_timeout+30)
- setup_launched_worker scoping: tic variable accessible in catch and after try/end
- Worker cookie validation: null bytes, partial nulls, wrong cookie, hex dump
- _validated_at pipeline: timestamp flows through to staleness calculation

All tests run locally via loopback TCP, no Azure resources required.
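
For readers unfamiliar with the cookie checks listed above, the following is a minimal sketch of what a null-count and hex-dump diagnostic could look like. The function name `cookie_diagnostics` and the returned fields are assumptions made for illustration; the actual logging code in azure_worker_start is not shown in this conversation and may differ.

```julia
# Hypothetical diagnostic helper, not the code from this PR.
# Given the bytes read from a worker socket and the expected cluster cookie,
# report whether they match, how many bytes are null, and a hex dump.
function cookie_diagnostics(received::AbstractVector{UInt8}, expected::AbstractString)
    nulls = count(iszero, received)
    # Two lowercase hex digits per byte, separated by spaces.
    hexdump = join((string(b; base=16, pad=2) for b in received), " ")
    matches = received == codeunits(expected)
    return (matches=matches, nulls=nulls, total=length(received), hexdump=hexdump)
end

# An all-null read (for example from a socket that closed early) versus a good cookie.
all_null = cookie_diagnostics(zeros(UInt8, 4), "abcd")
good     = cookie_diagnostics(Vector{UInt8}("abcd"), "abcd")
```

Separating the null count from the match result makes it possible to tell an empty or truncated read apart from a genuinely wrong cookie in the logs.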
Connection pipeline:
- _read_worker_config: handshake, bad cookie, ppi>1, _validated_at stamp
- validate_connection: timeout on slow socket, success path
- _validated_at timestamp pipeline flow
- Worker cookie validation: null bytes, partial, wrong, hex dump
- setup_launched_worker variable scoping pattern

Instance management:
- ScaleSet lowercasing and equality
- add_instance_to_pending_down/pruned/deleted/preempted_list
- ispreempted with present, absent, and wrong-scaleset cases

Error handling:
- isretryable for all HTTP codes and error types
- status helper for HTTP vs non-HTTP errors
- logerror smoke test

Utilities:
- nthreads_filter, spin, build_envstring
- remaining_resource header extraction
- timestamp_metaformatter
- _replace string helper

Templates:
- build_sstemplate: base, datadisks, UltraSSD, tags, no-tags, cross-sub, encryption, multi-disk
- build_nictemplate: accelerated on/off
- build_vmtemplate: base, datadisks+tags
- save_template: create, append, overwrite
- templates_folder and filename helpers

Environment:
- compress/decompress roundtrip with and without LocalPreferences

Cluster:
- addprocs_with_timeout with empty wconfigs
- prune timeout alignment arithmetic

All tests run locally, no Azure resources required.
- process_pending_connections no longer blocks waiting for the previous batch to complete; batches are dispatched independently and tracked as in-flight tasks
- setup_launched_worker now skips connections whose staleness exceeds JULIA_AZMANAGERS_MAX_STALENESS (defaults to JULIA_WORKER_TIMEOUT), preventing dead-socket Phase 2 attempts after a pipeline stall
- When a connection is skipped, setup_launched_worker closes the stale socket and marks the instance for deletion
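
The staleness cutoff described above might be resolved and applied roughly as follows. This is a sketch under stated assumptions: `max_staleness` and `is_stale` are hypothetical names, the 60-second fallback for JULIA_WORKER_TIMEOUT is an assumption, and timestamps are taken to be plain seconds. Only the two environment variable names and the "defaults to JULIA_WORKER_TIMEOUT" rule come from the commit message.

```julia
# Hypothetical sketch of the staleness limit lookup.
# `env` defaults to ENV but accepts any dictionary, which keeps it testable.
function max_staleness(env=ENV)
    # Assumed fallback of 60 seconds when JULIA_WORKER_TIMEOUT is unset.
    worker_timeout = parse(Float64, get(env, "JULIA_WORKER_TIMEOUT", "60"))
    # JULIA_AZMANAGERS_MAX_STALENESS falls back to the worker timeout.
    return parse(Float64, get(env, "JULIA_AZMANAGERS_MAX_STALENESS", string(worker_timeout)))
end

# A connection is stale when more time than `limit` has passed since it was
# validated in Phase 1 (the _validated_at timestamp in the commit message).
is_stale(validated_at::Real, t_now::Real, limit::Real) = (t_now - validated_at) > limit

# Illustrative use with an explicit environment instead of the process ENV.
limit = max_staleness(Dict("JULIA_WORKER_TIMEOUT" => "120"))
skip_connection = is_stale(1000.0, 1150.0, limit)
```

A caller that finds `skip_connection` true would then close the socket and mark the instance for deletion, as the last bullet describes.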
When addprocs races with delete_pending_down_vms, the Azure-reported scaleset capacity still includes VMs scheduled for deletion. Adding δn on top of this inflated capacity causes target_capacity to exceed maxworkers (observed 771 when max was 700). Subtract the pending_down count before adding δn so we don't double-count VMs that are about to be removed.
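
The double-counting fix can be illustrated with a short arithmetic sketch. The function names and the numbers below are hypothetical and chosen only to show the difference between the two calculations; they are not the values from the observed 771/700 incident.

```julia
# Hypothetical illustration of the capacity calculation before and after the fix.
# reported:     scale set capacity as reported by Azure (still includes doomed VMs)
# pending_down: VMs already scheduled for deletion
# δn:           number of new workers requested by addprocs
naive_target(reported::Integer, δn::Integer) = reported + δn

function corrected_target(reported::Integer, pending_down::Integer, δn::Integer)
    # Remove VMs that are about to disappear before adding the new request.
    return reported - pending_down + δn
end

# Made-up numbers: 100 reported, 20 of them pending deletion, 30 requested.
naive     = naive_target(100, 30)
corrected = corrected_target(100, 20, 30)
```

In this example the naive target overshoots the corrected one by exactly the pending_down count, which is the surplus the commit removes.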
No description provided.