Publish-only rollout endpoint #3

Draft
jvmncs wants to merge 21 commits into main from jvmncs/rollout-endpoint

@jvmncs jvmncs commented Jun 4, 2026

Motivation

slime currently assumes it owns the rollout backend: it launches or registers SGLang engines (or fixed --rollout-external-engine-addrs) and pushes weight updates to each engine via per-engine RPCs. This PR lets slime train against an elastic, externally managed inference fleet behind a single HTTP endpoint — one with no stable per-engine handles and no SGLang-router worker-management APIs, where workers may scale up/down mid-run. Three opt-in pieces compose to support this:

  1. Opaque HTTP rollout endpoint — send /generate requests to a base URL without launching or registering any SGLang workers.
  2. Version-pinned rollout requests — generation payloads can pin an exact weight version, so an elastic fleet only serves samples from the intended policy version.
  3. Publish-only disk delta sync — instead of direct update_weights_from_disk RPCs, the trainer publishes each complete delta version through a custom hook (e.g. to shared storage that the endpoint's workers consume), and the publish overlaps the next training step.

All features are off by default; existing behavior is unchanged when the new flags are unset.

Modifications

Opaque HTTP rollout endpoint (--rollout-http-endpoint-url)

  • New slime/backends/sglang_utils/http_endpoint.py: URL normalization/validation and HttpEndpointRolloutServer, a no-engine rollout server stub (no offload/onload, no fault-tolerance recover).
  • slime/ray/rollout.py returns it from the server-startup path; get_model_url() in sglang_rollout.py returns the endpoint URL (with the requested route appended) and never assumes router APIs exist.
  • slime/ray/placement_group.py allocates no rollout GPUs in this mode.
  • Mutually exclusive with --rollout-external-engine-addrs (validated).
  • --rollout-http-endpoint-abort-strategy {cancel-only,router-workers}: cancel-only (the default when an endpoint is set) cancels local pending tasks without calling router /list_workers; the existing router-based abort is refactored into _drain_aborted_pending_tasks and remains the default otherwise.
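As a rough illustration of the URL handling described above, the sketch below validates a base URL and appends the requested route. The helper names `normalize_endpoint_url` and `model_url`, and the exact validation rules, are assumptions made for this example; they are not taken from http_endpoint.py.

```python
from urllib.parse import urlsplit


def normalize_endpoint_url(url: str) -> str:
    """Validate a base URL and drop any trailing slash (hypothetical helper)."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an http(s) base URL, got {url!r}")
    if parts.query or parts.fragment:
        raise ValueError("endpoint URL must not carry a query string or fragment")
    return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"


def model_url(base_url: str, route: str = "/generate") -> str:
    """Append a route to the opaque endpoint without touching any router API."""
    return f"{normalize_endpoint_url(base_url)}/{route.lstrip('/')}"
```

Under these assumptions, `model_url("http://fleet.example:8000/", "/generate")` and `model_url("http://fleet.example:8000", "generate")` resolve to the same address, so callers do not need to care about slashes.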

Version-pinned rollout requests (--rollout-weight-version-policy exact-rollout-id)

  • /generate payloads gain weight_version={"exact_version": rollout_id}, scoped per rollout via rollout_weight_version_context.
  • Retries while the target version is unavailable are tunable via --rollout-weight-version-retry-attempts/-sleep; slime/utils/http_utils.py post() gains a retry_sleep parameter.
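The pinning and retry behaviour can be sketched as follows. Only the `weight_version={"exact_version": rollout_id}` payload shape comes from the description above. The `VersionUnavailable` exception, the helper names, and the way an endpoint signals a missing version are hypothetical, and the other payload keys assume an SGLang-style /generate body.

```python
import contextvars
import time
from contextlib import contextmanager

# Holds the rollout id whose weights the current rollout must be served from.
_pinned_version = contextvars.ContextVar("pinned_version", default=None)


class VersionUnavailable(Exception):
    """Hypothetical signal that no worker serves the requested version yet."""


@contextmanager
def weight_version_context(rollout_id):
    """Pin a weight version for every payload built inside the block."""
    token = _pinned_version.set(rollout_id)
    try:
        yield
    finally:
        _pinned_version.reset(token)


def build_generate_payload(input_ids, sampling_params):
    payload = {"input_ids": input_ids, "sampling_params": sampling_params}
    version = _pinned_version.get()
    if version is not None:
        payload["weight_version"] = {"exact_version": version}
    return payload


def post_with_retry(send, payload, attempts=3, retry_sleep=1.0, sleep=time.sleep):
    """Call send(payload), retrying while the pinned version is unavailable."""
    for attempt in range(attempts):
        try:
            return send(payload)
        except VersionUnavailable:
            if attempt == attempts - 1:
                raise
            sleep(retry_sleep)
```

A context variable keeps the pin scoped to one rollout even when several rollouts are generated concurrently in the same process, which is one plausible reason to prefer it over a module-level global.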

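Publish-only disk delta sync (--update-weight-delta-publish-only)

  • Publishes each complete delta version through a custom hook instead of issuing update_weights_from_disk RPCs, and skips version-dir cleanup (--update-weight-delta-keep-files is required).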
  • The dispatched publish intentionally stays in flight across the training step and is drained at the start of the next sync (or on disconnect_rollout_engines), so publish latency overlaps training; a failed publish surfaces one sync late on rank 0.
  • Argument validation enforces delta mode + disk transport + publish path + keep-files.
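A minimal sketch of this kind of validation, limited to the flags named in this description. Treating the two delta flags as boolean switches is an assumption, and the checks for delta mode, disk transport, and the publish path are left out because their flag names are not given here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--rollout-http-endpoint-url", default=None)
    parser.add_argument("--rollout-external-engine-addrs", nargs="+", default=None)
    # Assumed to be boolean switches for the purpose of this sketch.
    parser.add_argument("--update-weight-delta-publish-only", action="store_true")
    parser.add_argument("--update-weight-delta-keep-files", action="store_true")
    return parser


def validate(args: argparse.Namespace) -> argparse.Namespace:
    """Reject flag combinations that the description calls invalid."""
    if args.rollout_http_endpoint_url and args.rollout_external_engine_addrs:
        raise ValueError(
            "--rollout-http-endpoint-url and --rollout-external-engine-addrs "
            "are mutually exclusive"
        )
    if args.update_weight_delta_publish_only and not args.update_weight_delta_keep_files:
        raise ValueError(
            "--update-weight-delta-publish-only requires "
            "--update-weight-delta-keep-files"
        )
    return args
```

With every new flag unset, `validate(build_parser().parse_args([]))` passes, matching the statement that existing behavior is unchanged by default.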

Tests

  • New: tests/test_rollout_http_endpoint.py (8 tests: URL validation, endpoint routing, payload pinning, retry-until-version-available, cancel-only abort, no-engine server), tests/test_delta_publish_only.py (4 tests: hook invocation without engine RPCs or cleanup, no-op version publish, drain-on-disconnect, publish deferred to finalize).
  • Extended: tests/test_placement_group.py (http-endpoint layouts), tests/test_megatron_argument_validation.py.

Checklist

  • Format your code
  • Add unit tests accordingly: 12 new tests + 3 extended parametrizations, all passing
  • Update documentation
  • Provide accuracy and speed benchmark results

@jvmncs jvmncs force-pushed the jvmncs/rollout-endpoint branch from c5b3993 to 5b48529 Compare June 11, 2026 04:59
@jvmncs jvmncs force-pushed the jvmncs/rollout-endpoint branch from 5b48529 to 4d5645f Compare June 12, 2026 00:27
@jvmncs jvmncs force-pushed the jvmncs/rollout-endpoint branch from 4d5645f to 2bb8465 Compare June 12, 2026 18:11
@jvmncs jvmncs force-pushed the jvmncs/rollout-endpoint branch from 2bb8465 to 1ab3399 Compare June 16, 2026 17:30
jvmncs added 5 commits June 16, 2026 22:31
In publish-only mode, _finalize_sync now dispatches the publish hook
without awaiting its refs; the start of the next sync (or
disconnect_rollout_engines) drains them, so the publish overlaps a full
training step with at most one version outstanding. Failures surface
one sync late. Direct disk transport still drains before cleanup and
resume.
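The dispatch-then-drain pattern in this commit message can be sketched with a thread pool standing in for the Ray object refs. The class and method names below are hypothetical and only mirror the behaviour described: finalize dispatches without waiting, the next sync or a disconnect drains, and a failure surfaces one sync late.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class OverlappedPublisher:
    """Keeps at most one publish in flight across a training step."""

    def __init__(self, publish_hook: Callable[[int], None]):
        self._hook = publish_hook
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def drain(self) -> None:
        """Wait for the outstanding publish, re-raising its error if it failed."""
        pending, self._pending = self._pending, None
        if pending is not None:
            pending.result()

    def begin_sync(self) -> None:
        # The previous version's publish must finish before a new sync starts.
        self.drain()

    def finalize_sync(self, version: int) -> None:
        # Dispatch without waiting, so the publish overlaps the training step.
        assert self._pending is None, "previous publish was not drained"
        self._pending = self._pool.submit(self._hook, version)

    def disconnect(self) -> None:
        self.drain()
        self._pool.shutdown()
```

Because `drain` clears the pending slot before waiting, a failed publish is reported once, at the start of the following sync, rather than on every later call.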
@nanjiangwill nanjiangwill force-pushed the jvmncs/rollout-endpoint branch from 1ab3399 to 1499c3e Compare June 16, 2026 22:33
start_rollout_servers' HTTP-endpoint branch returned the bare servers dict from
start_http_endpoint_rollout_servers, but the caller (RolloutManager.__init__,
rollout.py:435) and this function's `-> tuple[dict, list]` annotation expect a
(servers, init_handles) tuple — every other branch returns one. In endpoint mode
this raised `ValueError: not enough values to unpack (expected 2, got 1)` at
RolloutManager init. HTTP endpoints have no local engine init handles, so the branch now returns an empty list for them.
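A reduced sketch of the fix, with the function bodies and the single-entry servers dict invented for illustration; only the function names and the `(servers, init_handles)` contract come from the commit message.

```python
def start_http_endpoint_rollout_servers(url: str) -> dict:
    # Stand-in body: one no-engine server entry keyed by a made-up name.
    return {"default": {"url": url}}


def start_rollout_servers(http_endpoint_url=None) -> tuple[dict, list]:
    if http_endpoint_url is not None:
        servers = start_http_endpoint_rollout_servers(http_endpoint_url)
        # No local engines are launched, so there are no init handles to await.
        return servers, []
    raise NotImplementedError("engine-launching branches are omitted in this sketch")


def unpack_bare_dict(url: str):
    """Reproduces the failure: unpacking a one-key dict into two names."""
    servers, init_handles = start_http_endpoint_rollout_servers(url)
    return servers, init_handles
```

Calling `unpack_bare_dict` raises the ValueError quoted above because tuple unpacking iterates over the dict's keys, and a one-entry dict yields a single value.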

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>