Azmgr fix #210

Open
jkwashbourne-oss wants to merge 7 commits into ChevronETC:release-3 from jkwashbourne-oss:azmgr-fix

@jkwashbourne-oss
Member

No description provided.

Josh and others added 5 commits May 15, 2026 20:26
Phase A: Diagnostic logging around Phase 1/Phase 2 worker connections
- A1: Log Phase 2 start/complete/fail in setup_launched_worker with staleness, elapsed time
- A2: Hex dump and null count on null cookie errors in azure_worker_start
- A3: Log peer address on every worker socket accept
- A4: Log Phase 1 validation with bind_addr, instanceid; stamp _validated_at timestamp

Phase B2: Prune timeout alignment
- prune_scalesets timeout is now max(VM_JOIN_TIMEOUT, worker_timeout+30)
- Prevents pruning VMs still being processed by addprocs_with_timeout
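
The timeout alignment described above can be sketched as a one-line calculation. This is a minimal illustration in Python, not the package's own code: the function name `prune_timeout` and the `margin` parameter are hypothetical, and only the `max(VM_JOIN_TIMEOUT, worker_timeout+30)` rule comes from the commit message.

```python
def prune_timeout(vm_join_timeout: float, worker_timeout: float, margin: float = 30.0) -> float:
    """Return the prune timeout in seconds.

    Hypothetical helper: mirrors max(VM_JOIN_TIMEOUT, worker_timeout + 30)
    from the commit message. The names and the `margin` keyword are
    assumptions made for this sketch.
    """
    # Pruning must wait at least as long as the worker-setup timeout plus a
    # safety margin, so VMs that are still being set up are not pruned.
    return max(vm_join_timeout, worker_timeout + margin)


if __name__ == "__main__":
    # A worker timeout longer than the join timeout now dominates.
    print(prune_timeout(600.0, 720.0))
    # A long join timeout is left unchanged.
    print(prune_timeout(900.0, 60.0))
```

With a join timeout of 600 s and a worker timeout of 720 s, the prune timeout becomes 750 s rather than 600 s, which is the case the change is meant to cover.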

Tests exercise:
- _read_worker_config: Phase 1 handshake, bad cookie rejection, _validated_at stamping
- validate_connection: timeout on slow/hanging socket
- prune timeout alignment: max(join_timeout, worker_timeout+30)
- setup_launched_worker scoping: tic variable accessible in catch and after try/end
- Worker cookie validation: null bytes, partial nulls, wrong cookie, hex dump
- _validated_at pipeline: timestamp flows through to staleness calculation
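
The cookie diagnostics listed above (null count and hex dump on a bad cookie) might look roughly like the following. This is a Python sketch under stated assumptions: `describe_cookie` and its return fields are hypothetical names chosen for illustration, and the real checks live in the package's Julia code, which is not shown here.

```python
def describe_cookie(received: bytes, expected: bytes) -> dict:
    """Summarize a received cookie for diagnostic logging.

    Hypothetical helper: field names are assumptions for this sketch.
    """
    null_count = received.count(0)  # number of 0x00 bytes in the cookie
    return {
        "ok": received == expected,
        "null_count": null_count,
        # True only when every byte is null (the "null cookie" case).
        "all_null": len(received) > 0 and null_count == len(received),
        # Space-separated hex dump, e.g. "00 00 61 62".
        "hex": received.hex(" "),
    }


if __name__ == "__main__":
    print(describe_cookie(b"\x00\x00ab", b"abcd"))
```

Separating the all-null case from the partially-null and wrong-cookie cases matches the three failure modes the tests enumerate.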

All tests run locally via loopback TCP; no Azure resources are required.

Connection pipeline:
- _read_worker_config: handshake, bad cookie, ppi>1, _validated_at stamp
- validate_connection: timeout on slow socket, success path
- _validated_at timestamp pipeline flow
- Worker cookie validation: null bytes, partial, wrong, hex dump
- setup_launched_worker variable scoping pattern

Instance management:
- ScaleSet lowercasing and equality
- add_instance_to_pending_down/pruned/deleted/preempted_list
- ispreempted with present, absent, and wrong-scaleset cases

Error handling:
- isretryable for all HTTP codes and error types
- status helper for HTTP vs non-HTTP errors
- logerror smoke test

Utilities:
- nthreads_filter, spin, build_envstring
- remaining_resource header extraction
- timestamp_metaformatter
- _replace string helper

Templates:
- build_sstemplate: base, datadisks, UltraSSD, tags, no-tags, cross-sub, encryption, multi-disk
- build_nictemplate: accelerated on/off
- build_vmtemplate: base, datadisks+tags
- save_template: create, append, overwrite
- templates_folder and filename helpers

Environment:
- compress/decompress roundtrip with and without LocalPreferences

Cluster:
- addprocs_with_timeout with empty wconfigs
- prune timeout alignment arithmetic

All tests run locally; no Azure resources are required.

- process_pending_connections no longer blocks waiting for the previous batch
  to complete; batches are dispatched independently and tracked as
  in-flight tasks
- setup_launched_worker now skips connections whose staleness exceeds
  JULIA_AZMANAGERS_MAX_STALENESS (defaults to JULIA_WORKER_TIMEOUT),
  preventing dead-socket Phase 2 attempts after a pipeline stall
- when a connection is skipped, the stale socket is closed and the instance
  is marked for deletion
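
The staleness check described above can be sketched as follows. This is an illustrative Python version, not the package's Julia implementation: `max_staleness` and `should_skip` are hypothetical names, and the fallback of 60 seconds when `JULIA_WORKER_TIMEOUT` is unset is an assumption of this sketch. Only the two environment variable names and the "defaults to JULIA_WORKER_TIMEOUT" rule come from the commit message.

```python
import os


def max_staleness(env=None) -> float:
    """Resolve the staleness limit in seconds.

    JULIA_AZMANAGERS_MAX_STALENESS wins if set; otherwise fall back to
    JULIA_WORKER_TIMEOUT. The 60-second fallback is an assumption.
    """
    env = os.environ if env is None else env
    worker_timeout = float(env.get("JULIA_WORKER_TIMEOUT", "60"))
    return float(env.get("JULIA_AZMANAGERS_MAX_STALENESS", worker_timeout))


def should_skip(validated_at: float, now: float, limit: float) -> bool:
    """Skip Phase 2 when the time since Phase 1 validation exceeds the limit."""
    return (now - validated_at) > limit


if __name__ == "__main__":
    limit = max_staleness({"JULIA_WORKER_TIMEOUT": "720"})
    # Validated 150 s ago with a 720 s limit: still fresh, so not skipped.
    print(should_skip(validated_at=100.0, now=250.0, limit=limit))
```

Passing the environment as a plain mapping keeps the sketch testable without modifying the process environment.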
@jkwashbourne-oss changed the base branch from master to release-3 May 16, 2026 21:12
When addprocs races with delete_pending_down_vms, the Azure-reported
scaleset capacity still includes VMs scheduled for deletion. Adding δn
on top of this inflated capacity causes target_capacity to exceed
maxworkers (observed 771 when max was 700).

Subtract the pending_down count before adding δn so we don't
double-count VMs that are about to be removed.
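
The corrected arithmetic can be sketched as below. This is a Python illustration only: the function and parameter names are hypothetical, the clamp at zero is an added safeguard that the commit message does not mention, and the numbers in the example are made up to reproduce the 771 versus 700 overshoot (the actual counts are not given in the commit message).

```python
def target_capacity(reported_capacity: int, pending_down: int, delta_n: int) -> int:
    """Compute the new scale-set capacity without double-counting.

    Hypothetical helper: VMs already scheduled for deletion are removed
    from the Azure-reported capacity before the requested delta is added.
    """
    # Clamp at zero as a defensive assumption; not stated in the source.
    live = max(reported_capacity - pending_down, 0)
    return live + delta_n


if __name__ == "__main__":
    reported, pending, delta = 700, 71, 71  # illustrative values only
    naive = reported + delta                # old behaviour: 771
    fixed = target_capacity(reported, pending, delta)
    print(naive, fixed)
```

In this example the naive sum overshoots a maximum of 700, while subtracting the pending-down count first keeps the target at 700.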
2 participants