
Add vLLM dynamic scheduler reconfigure for single-server sweeps #1029

Open
JordanNanos wants to merge 8 commits into main from vllm-dynamic-scheduler-reconfigure

Conversation

@JordanNanos (Collaborator) commented Apr 14, 2026:

Summary

  • Opt-in VLLM_DYNAMIC_RECONFIGURE=1 hook in run_benchmark_serving that
    calls vLLM /pause/reconfigure/resume before each benchmark run
  • Reads VLLM_MAX_NUM_BATCHED_TOKENS and VLLM_MAX_NUM_SEQS env vars and
    sends them as a JSON body to POST /reconfigure
  • benchmarks/test_reconfigure_sweep.sh — standalone A/B test script that
    compares N cold starts (baseline) vs 1 cold start + N reconfigure cycles
  • Documentation in docs/vllm-dynamic-scheduler-reconfigure.md
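The env-var-to-JSON step can be sketched as follows. This is a hypothetical helper mirroring the summary above: the function name `build_reconfigure_body` and the JSON key names (taken from the env var names) are assumptions, not the actual script internals.

```shell
# Hypothetical sketch: assemble the /reconfigure JSON body from the two
# env vars named in this PR. Only set vars are included in the body.
build_reconfigure_body() {
  local json="{" sep=""
  if [[ -n "${VLLM_MAX_NUM_BATCHED_TOKENS:-}" ]]; then
    json+="\"max_num_batched_tokens\": ${VLLM_MAX_NUM_BATCHED_TOKENS}"
    sep=", "
  fi
  if [[ -n "${VLLM_MAX_NUM_SEQS:-}" ]]; then
    json+="${sep}\"max_num_seqs\": ${VLLM_MAX_NUM_SEQS}"
  fi
  json+="}"
  printf '%s\n' "$json"
}

# Example: the body that would be POSTed to /reconfigure.
VLLM_MAX_NUM_BATCHED_TOKENS=8192 VLLM_MAX_NUM_SEQS=256 build_reconfigure_body
```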

Pre-built image

docker pull semianalysiswork/vllm-reconfigure:latest

Based on vllm/vllm-openai:v0.18.0 with the reconfigure API overlaid.
Source: JordanNanos/vllm feature/reconfigure-scheduler

Single-node test

docker run --rm --init --network host \
  --runtime nvidia --gpus all --ipc host --privileged \
  --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$HF_HUB_CACHE":/root/.cache/huggingface \
  -v "$(pwd)":/workspace -w /workspace \
  -e HF_TOKEN -e PORT=8888 \
  -e MODEL=openai/gpt-oss-120b \
  -e TP=8 -e CONC=32 \
  -e ISL=1024 -e OSL=1024 \
  semianalysiswork/vllm-reconfigure:latest \
  bash benchmarks/test_reconfigure_sweep.sh

Sweeps 3 max_num_batched_tokens × 2 max_num_seqs = 6 configs.
Phase A: 6 cold starts. Phase B: 1 cold start + 5 reconfigure cycles (~1s each).
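The 3 × 2 grid can be enumerated with a nested loop; the token and seq values below are illustrative placeholders, not the values the script actually sweeps:

```shell
# Enumerate the sweep grid: 3 batched-token limits x 2 seq limits = 6 configs.
# Values are placeholders for illustration only.
configs=()
for tokens in 2048 8192 16384; do
  for seqs in 128 256; do
    configs+=("tokens=${tokens},seqs=${seqs}")
  done
done
printf '%s\n' "${configs[@]}"
echo "total: ${#configs[@]}"
```

Phase B would cold-start once for the first config, then reconfigure for the remaining five.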

Test plan

  • bash -n benchmarks/benchmark_lib.sh — syntax check
  • bash -n benchmarks/test_reconfigure_sweep.sh — syntax check
  • Build overlay image (semianalysiswork/vllm-reconfigure:latest)
  • Run A/B test on a single GPU node
  • Verify benchmark metrics match between baseline and reconfigure phases
  • Compare total wall-clock time (expect ~5× reduction in startup overhead)

AI assistance was used to prepare this change.

@github-actions (Contributor) commented:

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

…t script

- Fix reconfigure_vllm_scheduler() to use POST /reconfigure with a JSON
  body instead of query params to the non-existent /reconfigure_scheduler
- Remove max_num_scheduled_tokens (internal name, not exposed by API)
- Use mode=abort&clear_cache=true on /pause for clean reconfigure cycles
- Add benchmarks/test_reconfigure_sweep.sh for standalone A/B testing on
  a cluster: runs N cold starts (baseline) vs 1 start + N reconfigure
  cycles and prints wall-clock comparison
- Update docs to match actual API surface

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JordanNanos changed the title from "Add vLLM dynamic scheduler reconfigure hook" to "Add vLLM dynamic scheduler reconfigure for single-server sweeps" on Apr 14, 2026
@JordanNanos (Collaborator, Author) commented:

Added the requested patched-vLLM distribution paths:

  • install_patched_vllm helper in benchmarks/benchmark_lib.sh supporting wheel, git ref, and editable checkout installs.
  • docs/vllm-patched-distribution.md with custom image, pinned wheel, mounted editable checkout, and pinned git-ref workflows.

Recommended for cluster sweeps: use a custom image or pinned wheel, then enable VLLM_DYNAMIC_RECONFIGURE=1 only for jobs running a vLLM build with the runtime reconfiguration API.
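The three install modes could dispatch roughly like this. This is a sketch: the helper name `install_patched_vllm` comes from this PR, but the dispatch logic, flags, and the `DRY_RUN` switch are assumptions for illustration.

```shell
# Hypothetical sketch of install_patched_vllm's three modes. DRY_RUN=1
# prints the pip command instead of running it (illustrative only).
install_patched_vllm() {
  local mode="$1" src="$2" cmd
  case "$mode" in
    # Wheel mode: replace only the vllm package, keep resolved deps.
    wheel)    cmd="pip install --no-deps --force-reinstall $src" ;;
    # Git-ref mode: pinned branch/tag/commit from a fork.
    git)      cmd="pip install git+$src" ;;
    # Editable mode: mounted checkout for fast iteration.
    editable) cmd="pip install --no-deps -e $src" ;;
    *)        echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  if [[ "${DRY_RUN:-0}" == "1" ]]; then echo "$cmd"; else $cmd; fi
}

DRY_RUN=1 install_patched_vllm wheel ./dist/vllm-0.18.0-py3-none-any.whl
```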

The vLLM /reconfigure endpoint requires PAUSED_ALL state, which maps to
pause mode="keep". Using mode="abort" would leave the scheduler in
PAUSED_NEW state, causing reconfigure to reject the request.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +38 to +40
fi
json+="}"

Contributor commented:

🔴 The three curl calls in reconfigure_vllm_scheduler() are not chained with &&, so a failure of /reconfigure_scheduler is masked by the subsequent /resume success — the function returns 0 even when reconfiguration failed. Additionally, the call site in run_benchmark_serving() ignores the return value entirely, so the benchmark always proceeds regardless of reconfiguration outcome, silently producing incorrect results with stale scheduler settings.

Extended reasoning...

Bug 1 — internal error masking in reconfigure_vllm_scheduler() (lines 38-40):

In bash, a function's return code is the exit code of its last executed command. The three curl calls run unconditionally with no chaining:

curl -fsS -X POST "$base_url/pause?mode=keep"
curl -fsS -X POST -G "$base_url/reconfigure_scheduler" "${params[@]}"
curl -fsS -X POST "$base_url/resume"

The -f flag in -fsS makes curl exit with code 22 on HTTP 4xx/5xx responses. If /reconfigure_scheduler returns an error (e.g., HTTP 400 for an invalid parameter value), curl exits 22 — but execution continues unconditionally to /resume. If /resume succeeds (exit 0), the function returns 0, masking the reconfiguration failure completely.

Bug 2 — return value ignored at the call site in run_benchmark_serving() (~line 361):

if [[ "${VLLM_DYNAMIC_RECONFIGURE:-0}" == "1" && "$backend" == "vllm" ]]; then
    reconfigure_vllm_scheduler "$port"
fi

There is no || return 1 or any check on the return value. There is no set -e in the script (only set +x/set -x). Even if the function were fixed to propagate errors, the benchmark would still proceed unconditionally.

Combined effect — step-by-step proof:

  1. User sets VLLM_DYNAMIC_RECONFIGURE=1, VLLM_MAX_NUM_SEQS=999999 (invalid, exceeds server capacity)
  2. run_benchmark_serving calls reconfigure_vllm_scheduler "$port"
  3. Inside the function: curl -fsS -X POST .../pause succeeds (exit 0)
  4. curl -fsS -X POST -G .../reconfigure_scheduler ... → server returns HTTP 400 → curl exits 22
  5. Execution continues (no &&, no error check): curl -fsS -X POST .../resume → succeeds (exit 0)
  6. Function returns 0 (last command's exit code) — failure masked
  7. Back in run_benchmark_serving: return value not checked, benchmark proceeds
  8. vLLM server is still running with its original scheduler limits
  9. Benchmark results are recorded as if they were obtained with the requested settings — silently incorrect

Fix: Chain the curl calls with && inside the function, and add || return 1 at the call site:

# Inside reconfigure_vllm_scheduler():
curl -fsS -X POST "$base_url/pause?mode=keep" && \
  curl -fsS -X POST -G "$base_url/reconfigure_scheduler" "${params[@]}" && \
  curl -fsS -X POST "$base_url/resume"

# At the call site in run_benchmark_serving():
reconfigure_vllm_scheduler "$port" || return 1

1. Double reconfigure in test_reconfigure_sweep.sh: Phase B called
   reconfigure_vllm_scheduler manually then run_benchmark_serving
   called it again via the VLLM_DYNAMIC_RECONFIGURE hook. Remove the
   manual call and let the hook handle it.

2. Doc listed mode=abort but vLLM /reconfigure requires PAUSED_ALL
   which maps to mode=keep. Fix the Requirements section.

3. No error recovery in reconfigure_vllm_scheduler: if /reconfigure
   failed, curl exited non-zero, set -e killed the function, and the
   server stayed paused forever. Now capture the exit code, always
   call /resume, then propagate the error.

4. --force-reinstall in wheel mode reinstalls all dependencies. Use
   --no-deps --force-reinstall to only replace the vllm package.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
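The error-recovery pattern from item 3 above (capture the exit code, always resume, then propagate) can be demonstrated with stub functions standing in for the curl calls. Everything here is illustrative: the stubs simulate `curl -f` exiting 22 on an HTTP error.

```shell
# Stubs simulating the three HTTP calls; reconfigure fails the way
# curl -f does on an HTTP 4xx (exit code 22).
pause()       { return 0; }
reconfigure() { return 22; }
resume()      { echo "resumed"; return 0; }

reconfigure_cycle() {
  pause || return 1          # nothing to undo yet, bail immediately
  local rc=0
  reconfigure || rc=$?       # capture the failure but keep going
  resume || rc=$?            # always resume so the server is never left paused
  return "$rc"               # propagate the first failure to the caller
}

reconfigure_cycle || echo "reconfigure failed with rc=$?"
```

Here the server is resumed despite the failed reconfigure, and the caller still sees the nonzero exit code instead of a masked success.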
@functionstackx (Contributor) left a comment:
does this work with cudagraphs, aiter, amd, flashinfer, torch compile or any other

JordanNanos and others added 2 commits April 14, 2026 15:08
Image: semianalysiswork/vllm-reconfigure:latest
Based on vllm/vllm-openai:v0.18.0 with reconfigure API overlay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standalone workflow_dispatch workflow that runs
benchmarks/test_reconfigure_sweep.sh on any GPU runner using the
semianalysiswork/vllm-reconfigure:latest image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread on .github/workflows/test-reconfigure.yml: Fixed
@JordanNanos (Collaborator, Author) commented:

does this work with cudagraphs, aiter, amd, flashinfer, torch compile or any other

@functionstackx unlikely, only vLLM has /pause and /resume from what I can tell

…ntain permissions'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
