Pre-validate connections before batching #208

Open
jkurashcvx wants to merge 4 commits into master from hotfix/prevalidate-connections-v4

@jkurashcvx

Move socket IO (cookie read + connection string parse) out of addprocs_locked into validate_connection(), which runs with a configurable timeout (JULIA_AZMANAGERS_VALIDATION_TIMEOUT, default 30s) before sockets enter the batch. Only validated WorkerConfig objects are batched and passed to addprocs.

Problem: Raw TCPSocket objects handed to addprocs_locked → launch → launch_on_machine could block on read() if a VM was slow/dead, holding worker_lock and causing other workers in the batch to time out and drop.

Changes:

  • New validate_connection() with async timeout + _read_worker_config()
  • process_pending_connections batches WorkerConfig[] instead of TCPSocket[]
  • launch() pushes pre-built WorkerConfigs (no IO inside worker_lock)
  • Remove launch_on_machine (logic moved to validate_connection)
  • Rename sockets kwarg to wconfigs through addprocs/addprocs_with_timeout
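The timeout-bounded validation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `validation_timeout`, `validate_with_timeout`, and the `read_config` callback are hypothetical names, and the callback stands in for the cookie read and connection string parse. Only the environment variable name and the 30s default come from the description.

```julia
# Timeout in seconds, read from the env var named in the PR (default 30).
validation_timeout() =
    parse(Float64, get(ENV, "JULIA_AZMANAGERS_VALIDATION_TIMEOUT", "30"))

# Hypothetical helper: run `read_config()` in its own task and give up after
# `timeout` seconds. Returns the config, or `nothing` on timeout or error.
function validate_with_timeout(read_config; timeout::Float64 = validation_timeout())
    result = Channel{Any}(1)  # buffered, so a late task never blocks on put!
    @async begin
        try
            put!(result, read_config())
        catch err
            put!(result, err)
        end
    end
    status = timedwait(() -> isready(result), timeout; pollint = 0.01)
    status === :ok || return nothing
    value = take!(result)
    return value isa Exception ? nothing : value
end

# Usage: a fast read succeeds, a slow or failing read yields `nothing`.
good = validate_with_timeout(() -> "worker-config"; timeout = 1.0)
slow = validate_with_timeout(() -> (sleep(2); "late"); timeout = 0.2)
bad  = validate_with_timeout(() -> error("bad cookie"); timeout = 1.0)
```

Because the blocking read happens in a separate task, the caller never waits longer than the timeout. A real implementation would presumably also close the socket on timeout so the abandoned read does not linger.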

Cherry-picked from hotfix/prevalidate-connections (release-3 hotfix).

Josh added 4 commits May 12, 2026 13:49
Move socket IO (cookie read + connection string parse) out of
addprocs_locked into validate_connection(), which runs with a
configurable timeout (JULIA_AZMANAGERS_VALIDATION_TIMEOUT, default 30s)
before sockets enter the batch. Only validated WorkerConfig objects
are batched and passed to addprocs.

Cherry-picked from hotfix/prevalidate-connections (release-3 hotfix).

validate_connection was called synchronously in the batch loop, causing
connections to serialize (up to 30s each). Move validation into
add_pending_connections via @async so all sockets validate in parallel.
process_pending_connections now reads from pending_validated channel
(instant take of pre-validated WorkerConfigs).
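The difference between serial and concurrent validation can be illustrated with a small sketch. Everything here is an assumption made for illustration: `FakeConfig`, `fake_validate`, and `validate_all` are hypothetical stand-ins, `sleep` mimics slow socket I/O, and `@sync` is used only so the demo finishes deterministically (the PR describes plain `@async` tasks feeding the `pending_validated` channel).

```julia
# Hypothetical stand-in for a validated WorkerConfig.
struct FakeConfig
    id::Int
end

# Pretend validation: sleeps to mimic socket I/O, returns `nothing` on failure.
function fake_validate(id::Int, delay::Float64, ok::Bool)
    sleep(delay)
    return ok ? FakeConfig(id) : nothing
end

# Validate every connection in its own task; only successes reach the channel.
function validate_all(connections, validated::Channel{FakeConfig})
    @sync for (id, delay, ok) in connections
        @async begin
            cfg = fake_validate(id, delay, ok)
            cfg === nothing || put!(validated, cfg)
        end
    end
end

connections = [(1, 0.5, true), (2, 0.5, false), (3, 0.5, true), (4, 0.5, true)]
validated = Channel{FakeConfig}(32)
elapsed = @elapsed validate_all(connections, validated)

# The batch loop then drains whatever is already validated, without blocking.
batch = FakeConfig[]
while isready(validated)
    push!(batch, take!(validated))
end
```

With four connections that each take 0.5s, the tasks overlap, so the total wall time is close to one delay rather than the sum of all four. The failed connection (id 2) never enters the batch.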

…ests

- Run socket validations concurrently via @async in add_pending_connections
  instead of serially in process_pending_connections. Adds pending_validated
  channel so batch loop reads pre-validated WorkerConfigs instantly.
- Downgrade validation failure log from @warn to @debug (expected operational
  outcome, not a warning).
- Add test_validate_connections.jl with unit tests (good/slow/dead/bad-cookie
  sockets) and pipeline integration test exercising the full accept → async
  validate → pending_validated channel flow.

…ntax

Cherry-picked from hotfix/prevalidate-connections (release-3):
- Run socket validations concurrently via @async in add_pending_connections
- Downgrade validation failure log from @warn to @debug
- Add test_validate_connections.jl (unit + pipeline integration tests)
- Fix _::AzManager write-only identifier for Julia 1.12