From bacd6a8bd7db50ae71ec59bf5cb3c28d0d5a96d3 Mon Sep 17 00:00:00 2001
From: inaniloquentee <3051000145@qq.com>
Date: Tue, 16 Jun 2026 16:40:29 +0800
Subject: [PATCH] docs: define vime integration design

Signed-off-by: inaniloquentee <3051000145@qq.com>
---
 docs/design/vime-integration-design.md | 419 +++++++++++++++++++++++++
 1 file changed, 419 insertions(+)
 create mode 100644 docs/design/vime-integration-design.md

diff --git a/docs/design/vime-integration-design.md b/docs/design/vime-integration-design.md
new file mode 100644
index 0000000..edd98d3
--- /dev/null
+++ b/docs/design/vime-integration-design.md
@@ -0,0 +1,419 @@
+# RL-Kernel to Vime Integration Design
+
+This document defines a non-intrusive integration path for using RL-Kernel
+operators under Vime. It builds on the hook inventory in
+`docs/design/vime-architecture-and-hook-points.md` and is scoped to issue
+#119: define the positioning, contracts, activation shape, change surface, and
+fallback policy before broad implementation work starts.
+
+## Goals
+
+- Keep Vime's rollout, training, Ray placement, and weight-sync lifecycle in
+  charge.
+- Add RL-Kernel as an optional operator-level layer under selected Vime
+  callsites.
+- Preserve Vime's pure native behavior when RL-Kernel is not enabled, not
+  installed, or not usable on the current hardware.
+- Make the first PoC small enough to validate with Vime's existing custom
+  function hooks.
+- Define the contracts that a later production integration must satisfy before
+  claiming train-inference consistency.
+
+## Issue Checklist Mapping
+
+This document covers the planned #119 PR items. The exact checklist phrases are
+kept here intentionally so reviewers can map the design back to the issue:
+
+- Integration design doc covering insertion points, expected tensor/API contracts, lifecycle, ownership boundaries, and rollout/training-side responsibilities.
+- Non-intrusive activation path: custom op, optional backend, and config flag
+  shape, with pure Vime inference and native RL paths unaffected when disabled.
+- Change-surface estimate: what can live entirely in RL-Kernel, what requires a Vime/slime adapter, and what would need upstream vLLM/vllm-router changes.
+- Positioning statement: RL-Kernel provides operator-level consistency beneath
+  Vime's framework-level alignment; complementary rather than competing.
+- Risk and fallback section: private API reliance, version pinning, missing extension points, unsupported hardware, and how the PoC should fail safely.
+
+## Non-Goals
+
+- No production integration in this issue.
+- No Vime, slime, vLLM, or Megatron upstream PR in this issue.
+- No claim that this closes real-engine train-inference consistency.
+- No vLLM internal kernel replacement from Vime's HTTP layer alone.
+- No replacement of Vime's scheduler, rollout buffer, data source, Megatron
+  actor, or weight-update path.
+
+## Positioning
+
+Vime is the framework-level RL post-training orchestrator. It owns the macro
+dataflow:
+
+```text
+prompt/data source
+  -> rollout function
+  -> vLLM/vllm-router generation
+  -> Sample objects
+  -> train-data conversion
+  -> Megatron logprob/loss/reduction
+  -> weight update back to rollout engines
+```
+
+RL-Kernel should not compete with that layer. Its role is to provide
+operator-level consistency and performance beneath Vime:
+
+- selected-token logprob implementations;
+- GRPO/PPO loss or loss sub-primitives;
+- deterministic masked reductions;
+- rollout-side sampling/logprob kernels only after WS3 exposes an engine-level
+  insertion point with the required tensors.
+
+In short: Vime aligns the framework workflow; RL-Kernel aligns and accelerates
+the kernels used inside that workflow.
+
+## Activation Model
+
+The integration should be opt-in and fail-safe.
+
+Recommended future flag shape:
+
+| Flag | Values | Default | Purpose |
+| --- | --- | --- | --- |
+| `--rl-kernel-mode` | `off`, `observe`, `train_ops`, `rollout_ops` | `off` | Top-level enablement gate. |
+| `--rl-kernel-logp` | `auto`, `native`, `disabled` | `auto` when enabled | Choose the selected-logprob backend through RL-Kernel dispatch. |
+| `--rl-kernel-loss` | `auto`, `native`, `disabled` | `disabled` initially | Select fused GRPO/PPO loss or sub-primitives. |
+| `--rl-kernel-strict` | boolean | `False` | Raise on RL-Kernel import/backend failure instead of falling back. |
+| `--rl-kernel-metrics-prefix` | string | `rl_kernel` | Prefix for invocation, fallback, and drift metrics. |
+
+For the minimal PoC in #120, these do not need to be added to Vime. The PoC can
+use a Vime custom function path and an environment variable or tiny shim-local
+setting. A production-quality integration should expose explicit flags so the
+native path remains the default and CI can test both native and enabled modes.
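+
+A minimal `argparse` sketch of the flag table above, assuming the flags are
+registered on whatever parser the adapter or Vime plugin already owns. The
+helper names `add_rl_kernel_args` and `rl_kernel_enabled` are hypothetical, and
+Vime's real argument plumbing may differ.
+
+```python
+import argparse
+
+
+def add_rl_kernel_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
+    """Register the proposed flags; every default keeps the native path."""
+    group = parser.add_argument_group("rl-kernel")
+    group.add_argument(
+        "--rl-kernel-mode",
+        choices=["off", "observe", "train_ops", "rollout_ops"],
+        default="off",
+    )
+    # `auto` is only consulted once the mode is something other than `off`.
+    group.add_argument(
+        "--rl-kernel-logp", choices=["auto", "native", "disabled"], default="auto"
+    )
+    group.add_argument(
+        "--rl-kernel-loss", choices=["auto", "native", "disabled"], default="disabled"
+    )
+    group.add_argument("--rl-kernel-strict", action="store_true")
+    group.add_argument("--rl-kernel-metrics-prefix", default="rl_kernel")
+    return parser
+
+
+def rl_kernel_enabled(args: argparse.Namespace) -> bool:
+    """Only a non-`off` mode may touch any callsite."""
+    return args.rl_kernel_mode != "off"
+
+
+args = add_rl_kernel_args(argparse.ArgumentParser()).parse_args([])
+```
+
+Parsing an empty argument list yields `rl_kernel_mode == "off"`, so a run that
+passes no RL-Kernel flags stays on the native path.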
+
+Fallback rules:
+
+1. `off` means Vime runs exactly as it does today.
+2. `observe` may import RL-Kernel and emit diagnostics, but must not change
+   tensors consumed by training.
+3. `train_ops` may replace training-side operator calls only after tensor
+   contracts pass validation.
+4. `rollout_ops` is blocked for selected-logprob recomputation until vLLM or
+   vllm-router exposes logits or an internal custom-op insertion point.
+5. In non-strict mode, any import, shape, dtype, device, or backend failure logs
+   one warning and falls back to native Vime behavior for that callsite.
+
+## Insertion Points and Ownership
+
+The hook inventory in #118 identified several Vime-level extension points. The
+integration should use them in this order.
+
+| Stage | Vime hook or callsite | Integration role | Recommendation |
+| --- | --- | --- | --- |
+| Rollout invocation proof | `--custom-generate-function-path` in `vime.rollout.vllm_rollout.generate_and_rm` | Prove Vime can call RL-Kernel without replacing rollout orchestration. | Use for #120 PoC. |
+| Rollout diagnostics | `--rollout-data-postprocess-path` after rollout data reaches the actor | Add validation, counters, or metadata before training. | Useful in observe mode. |
+| Train-data conversion | `--custom-convert-samples-to-train-data-path` in `RolloutManager._convert_samples_to_train_data` | Preserve or add optional RL-Kernel metadata. | Use only if default fields are insufficient. |
+| Training selected logprob | `get_log_probs_and_entropy` / `calculate_log_probs_and_entropy` / `compute_log_probs` | Replace selected-token logprob extraction from Megatron logits. | Best production operator target. |
+| Training loss | `--custom-loss-function-path` with `--loss-type custom_loss` | Exercise RL-Kernel loss while preserving Vime's actor loop. | Good staged integration for isolated loss tests. |
+| PG loss reducer | `--custom-pg-loss-reducer-function-path` | Try deterministic masked reduction or custom normalization. | Low-risk reducer target. |
+| Megatron lifecycle hooks | `--custom-megatron-init-path`, `--custom-megatron-before-log-prob-hook-path`, `--custom-megatron-before-train-step-hook-path` | Initialize RL-Kernel, validate device, or emit metrics. | Auxiliary hooks; not sufficient on their own to replace logprob. |
+| vLLM internal sampling/logprob | Below vllm-router HTTP generation | Replace rollout-side kernels. | Needs WS3/vLLM insertion point. |
+
+The first real operator integration should be training-side selected logprob or
+masked reduction. These are closer to the tensors RL-Kernel already accepts and
+do not require changing Vime's macro dataflow.
+
+Ownership boundary:
+
+- Vime owns scheduling, Ray actors, rollout buffering, reward execution,
+  train-data conversion, Megatron model forward, optimizer steps, checkpointing,
+  and weight sync.
+- RL-Kernel owns operator implementations, backend dispatch, local fallback
+  behavior, and operator-level reference tests.
+- A Vime adapter owns tensor layout conversion between Vime's packed/list
+  structures and RL-Kernel's operator contracts.
+
+## Tensor and API Contracts
+
+### Rollout Sample Contract
+
+Vime's rollout path returns `Sample` objects. Any RL-Kernel wrapper that touches
+rollout data must preserve these fields:
+
+| Field | Contract |
+| --- | --- |
+| `tokens` | Prompt plus response token ids as `list[int]`. |
+| `response_length` | Number of response tokens. |
+| `loss_mask` | Optional `list[int]`; when present, length must equal `response_length`. |
+| `rollout_log_probs` | Optional `list[float]`; when present, length must equal `response_length`. |
+| `reward` | Must be populated before training unless a later hook intentionally fills it. |
+| `status` | Must remain a `Sample.Status` value. |
+| `rollout_id` | Shared by sibling samples from one rollout execution when fan-out is used. |
+
+The default vLLM rollout asks vllm-router for `logprobs: 1` and parses selected
+response logprobs into `Sample.rollout_log_probs`. That is sufficient for
+comparison and mismatch metrics, but not sufficient for recomputing selected
+logprobs with RL-Kernel because RL-Kernel's logprob operator requires logits.
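+
+The length rules in the table can be checked with a few lines of Python. The
+sketch below is hypothetical (`check_sample_contract` is not a Vime or
+RL-Kernel function) and uses a `SimpleNamespace` as a stand-in for `Sample`;
+it assumes only the field names listed above.
+
+```python
+from types import SimpleNamespace
+from typing import Any
+
+
+def check_sample_contract(sample: Any) -> list[str]:
+    """Return contract violations for one Sample-like object (empty means ok)."""
+    problems = []
+    n = sample.response_length
+    # `tokens` holds prompt plus response, so it cannot be shorter than n.
+    if n > len(sample.tokens):
+        problems.append("response_length exceeds len(tokens)")
+    for name in ("loss_mask", "rollout_log_probs"):
+        value = getattr(sample, name, None)
+        if value is not None and len(value) != n:
+            problems.append(f"{name} has length {len(value)}, expected {n}")
+    return problems
+
+
+sample = SimpleNamespace(
+    tokens=[11, 12, 13, 14, 15],
+    response_length=3,
+    loss_mask=[1, 1, 1],
+    rollout_log_probs=[-0.1, -0.2],  # deliberately one entry short
+)
+problems = check_sample_contract(sample)
+```
+
+A wrapper could run this check before and after touching a sample and fall
+back to the untouched sample when the list is non-empty.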
+
+### Train-Data Contract
+
+`RolloutManager._convert_samples_to_train_data` produces the dictionary that is
+later split by DP rank. RL-Kernel-aware code must preserve the default required
+keys:
+
+```text
+tokens
+response_lengths
+rewards
+raw_reward
+truncated
+sample_indices
+rollout_ids
+loss_masks
+rollout_mask_sums
+```
+
+Optional fields that matter for train-inference comparison:
+
+```text
+rollout_log_probs
+teacher_log_probs
+rollout_routed_experts
+metadata
+multimodal_train_inputs
+```
+
+`rollout_mask_sums` is important for Vime's per-rollout normalization. A custom
+converter must not drop or recompute it with only the current micro-batch, since
+Vime may split sibling samples across micro-batches.
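+
+A small worked example of why the recomputation is unsafe. It assumes, for
+illustration only, that `rollout_mask_sums` holds the loss-mask total over all
+sibling samples that share a `rollout_id`; the helper name is hypothetical and
+Vime's exact definition should be checked against its source.
+
+```python
+from collections import defaultdict
+
+
+def mask_sums_by_rollout(rollout_ids, loss_masks):
+    """Per-sample denominator: mask total over all siblings in the same rollout."""
+    totals = defaultdict(int)
+    for rid, mask in zip(rollout_ids, loss_masks):
+        totals[rid] += sum(mask)
+    return [totals[rid] for rid in rollout_ids]
+
+
+rollout_ids = [7, 7, 8]
+loss_masks = [[1, 1, 1], [1, 1, 0, 1, 1], [1, 1]]
+
+# Computed once over the full batch: siblings 0 and 1 share one denominator.
+full = mask_sums_by_rollout(rollout_ids, loss_masks)
+
+# Recomputed after a split that separates the siblings into two micro-batches.
+split = mask_sums_by_rollout(rollout_ids[:1], loss_masks[:1]) + mask_sums_by_rollout(
+    rollout_ids[1:], loss_masks[1:]
+)
+```
+
+Here `full` is `[7, 7, 2]` while `split` is `[3, 4, 2]`, so the two siblings
+of rollout 7 would be normalized by different, smaller denominators.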
+
+### Training Logprob Contract
+
+RL-Kernel's existing logprob operator accepts:
+
+```python
+logp_op = kernel_registry.get_op("logp")
+selected_logps = logp_op(logits, token_ids)
+```
+
+Contract for a Vime adapter:
+
+| Tensor | Shape | Dtype | Notes |
+| --- | --- | --- | --- |
+| `logits` | `[N, V]` or reshapeable to that form | fp32/bf16 depending on backend | Must be contiguous for fused CUDA paths. |
+| `token_ids` | `[N]` or same leading shape as `logits` | int64/int32 accepted by wrapper | Target token ids aligned to response logits. |
+| output | `[N]` or original leading shape | fp32 preferred for comparison | One selected logprob per response token. |
+
+The adapter must handle Vime's layouts before calling RL-Kernel:
+
+- THD and BSHD packing;
+- context parallel slicing;
+- tensor parallel vocabulary partitioning;
+- optional all-gather-CP mode;
+- temperature scaling with `args.rollout_temperature`;
+- response-only token alignment where logits at position `t - 1` score token
+  `t`.
+
+The existing RL-Kernel logprob op does not by itself replace Vime's
+tensor-parallel distributed log-sum-exp semantics. A production integration
+must either add a TP-aware RL-Kernel entry point or call RL-Kernel only after
+Vime has materialized full-vocab logits for the local response rows.
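+
+For parity tests it helps to have a slow, obviously correct reference. The
+plain-Python sketch below is neither RL-Kernel's nor Vime's implementation; it
+only illustrates the `[N, V]` contract, temperature scaling, and the `t - 1`
+alignment on full-vocab logits, with hypothetical helper names.
+
+```python
+import math
+
+
+def selected_logprobs(logits, token_ids, temperature=1.0):
+    """One log-softmax value per row: logits is [N][V], token_ids is [N]."""
+    out = []
+    for row, tok in zip(logits, token_ids):
+        scaled = [x / temperature for x in row]
+        peak = max(scaled)  # subtract the max for a stable log-sum-exp
+        lse = peak + math.log(sum(math.exp(x - peak) for x in scaled))
+        out.append(scaled[tok] - lse)
+    return out
+
+
+def response_rows(seq_logits, tokens, response_length):
+    """Pair the logits at position t - 1 with token t, response tokens only.
+
+    Assumes a non-empty prompt, so the first response token has a predecessor.
+    """
+    start = len(tokens) - response_length
+    rows = [seq_logits[t - 1] for t in range(start, len(tokens))]
+    return rows, tokens[start:]
+
+
+tokens = [0, 2, 1]  # one prompt token followed by two response tokens
+seq_logits = [[0.0, 0.0, 0.0], [1.0, 3.0, 2.0], [5.0, 5.0, 5.0]]
+rows, targets = response_rows(seq_logits, tokens, response_length=2)
+logps = selected_logprobs(rows, targets, temperature=1.0)
+```
+
+The logits of the last position are never used, because no token follows it.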
+
+### Loss and Reduction Contract
+
+RL-Kernel's current GRPO loss operator is shaped around dense `[B, T]` tensors:
+
+```python
+loss, policy_loss, kl = grpo_loss(
+    current_logps,
+    old_logps,
+    ref_logps,
+    rewards,
+    completion_mask,
+    clip_eps=...,
+    beta=...,
+    samples_per_prompt=...,
+)
+```
+
+Vime's policy loss path is list/packed oriented:
+
+```text
+batch["advantages"]       -> list[Tensor], response local
+batch["log_probs"]        -> list[Tensor], old train logprobs
+batch["rollout_log_probs"] -> list[Tensor], optional rollout logprobs
+batch["ref_log_probs"]    -> list[Tensor], optional ref logprobs
+batch["loss_masks"]       -> list[Tensor], full response masks
+sum_of_sample_mean        -> callable normalizer
+```
+
+Therefore a direct fused-loss replacement must first define a packing adapter
+that converts Vime's per-sample lists into dense or flat tensors while
+preserving:
+
+- variable response lengths;
+- masked tokens;
+- per-rollout denominators from `rollout_mask_sums`;
+- CP local chunks and all-gather behavior;
+- GRPO/GSPO/PPO differences;
+- optional KL loss, entropy loss, OPSM, TIS/MIS, OPD, and custom reducers.
+
+The safer staged path is to integrate smaller RL-Kernel primitives first:
+selected logprob, ratio/KL, or masked reductions. A full `policy_loss_function`
+replacement should come later, after parity tests cover the active Vime flags.
+
+## Lifecycle
+
+### Import and Initialization
+
+An adapter should lazily import RL-Kernel at the first enabled callsite:
+
+```python
+try:
+    from rl_engine.kernels.registry import kernel_registry
+except Exception as exc:  # ImportError or a failed CUDA extension load
+    fallback_or_raise(exc)  # adapter helper: raise if strict, else go native
+```
+
+Do not import or initialize CUDA extensions on Vime's driver process unless the
+operator will actually run there. Most training-side operators should initialize
+inside `MegatronTrainRayActor` processes after the CUDA device and distributed
+groups are ready.
+
+The `--custom-megatron-init-path` hook is a reasonable place to validate
+installation and log selected RL-Kernel backends. It should not force the
+integration to be active unless the explicit RL-Kernel mode is enabled.
+
+### Rollout Loop
+
+The rollout loop should remain:
+
+```text
+RolloutManager.generate
+  -> configured rollout function
+  -> Sample objects
+  -> default logging
+  -> _convert_samples_to_train_data
+  -> _split_train_data_by_dp
+```
+
+An observe-mode PoC can wrap `--custom-generate-function-path`, call a cheap
+RL-Kernel fallback op on synthetic or available tensors, and record a counter in
+`sample.metadata`. It must return a valid `Sample` and leave generated tokens,
+logprobs, rewards, and loss masks unchanged.
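+
+A minimal sketch of such a wrapper, under loud assumptions: the real signature
+of the `--custom-generate-function-path` hook is not reproduced here (it may
+be asynchronous), `sample.metadata` is treated as a dict, and
+`observe_generate` and the probe callable are hypothetical names.
+
+```python
+from types import SimpleNamespace
+from typing import Any, Callable
+
+
+def observe_generate(
+    native_generate: Callable[..., Any], probe: Callable[[], str]
+) -> Callable[..., Any]:
+    """Wrap a generate callable; only `sample.metadata` is written."""
+
+    def wrapped(*args: Any, **kwargs: Any) -> Any:
+        sample = native_generate(*args, **kwargs)
+        try:
+            backend, fallback = probe(), False
+        except Exception:  # missing package, unsupported device, and so on
+            backend, fallback = "native", True
+        metadata = sample.metadata if sample.metadata is not None else {}
+        metadata["rl_kernel/invoked"] = metadata.get("rl_kernel/invoked", 0) + 1
+        metadata["rl_kernel/backend"] = backend
+        metadata["rl_kernel/fallback"] = fallback
+        sample.metadata = metadata
+        return sample
+
+    return wrapped
+
+
+def _missing_probe() -> str:
+    # Stand-in for "import RL-Kernel and resolve one op"; here it always fails.
+    raise ImportError("rl_engine is not installed")
+
+
+def _native_generate() -> SimpleNamespace:
+    # Stand-in for the native rollout call that returns a Sample.
+    return SimpleNamespace(tokens=[1, 2, 3], response_length=2, metadata={})
+
+
+sample = observe_generate(_native_generate, _missing_probe)()
+```
+
+With the failing probe, the sample comes back with its tokens untouched and
+`rl_kernel/fallback` set to `True`, which is the outcome the fallback run of
+the PoC needs to demonstrate.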
+
+### Training Loop
+
+The training loop should remain:
+
+```text
+MegatronTrainRayActor._get_rollout_data
+  -> compute old/ref/current logprobs as configured
+  -> compute_advantages_and_returns
+  -> optional rollout_data_postprocess
+  -> train(...)
+  -> loss_function(...)
+```
+
+Training-side RL-Kernel work should run only inside actor processes and only on
+tensors already owned by the current rank. The adapter must not add hidden
+collectives beyond the collectives Vime already performs unless the contract
+states them explicitly.
+
+### Weight Sync
+
+RL-Kernel should not participate in Vime's initial weight push or periodic
+Megatron-to-vLLM update. It may report diagnostics that compare rollout
+logprobs and training logprobs, but the ownership of weight update remains:
+
+```text
+actor_model.update_weights()
+  -> Megatron weight iterator/updater
+  -> VLLMEngine update endpoints
+```
+
+## Change Surface
+
+| Area | Can live in RL-Kernel | Needs Vime adapter | Needs upstream vLLM/vllm-router |
+| --- | --- | --- | --- |
+| Operator dispatch and fallback | Yes | No | No |
+| Selected-logprob CUDA/Triton/PyTorch kernels | Yes | Thin callsite adapter | No for training side |
+| Vime tensor packing/unpacking | No | Yes | No |
+| Vime CLI flags for RL-Kernel enablement | No | Yes | No |
+| Training selected-logprob replacement | Mostly | Yes | No |
+| Masked reduction helpers | Mostly | Yes | No |
+| Custom loss PoC through `--custom-loss-function-path` | Mostly | User/Vime plugin shim | No |
+| Rollout invocation proof | Mostly | User/Vime plugin shim | No |
+| Rollout-side true logprob recomputation | Partly | Yes | Yes, unless vLLM exposes logits through a supported API |
+| vLLM sampling kernel replacement | Partly | Maybe | Yes |
+| Weight update | No | Existing Vime path | Existing vLLM endpoints |
+
+This separation keeps the core RL-Kernel repo focused on reusable operators and
+small adapters, while any framework-specific lifecycle wiring stays in Vime or a
+Vime plugin.
+
+## Recommended PoC for #120
+
+Start with an invocation proof, not a numerical replacement:
+
+1. Use the #117 baseline, preferably
+   `examples/fully_async/run-qwen2.5-0.5B-fully_async.sh` or the corresponding
+   short CI test.
+2. Add a tiny `--custom-generate-function-path` shim that imports RL-Kernel,
+   obtains one operator through `kernel_registry`, and increments a structured
+   counter.
+3. Return the same `Sample` fields Vime would have returned without the shim.
+4. Log `rl_kernel/invoked`, `rl_kernel/backend`, and `rl_kernel/fallback` in
+   metadata or rollout metrics.
+5. Run once with RL-Kernel installed and once with RL-Kernel import disabled to
+   prove fallback does not break native Vime.
+
+After that, the first operator-semantic PoC should be training-side selected
+logprob on a constrained layout. It should compare RL-Kernel output with Vime's
+native `compute_log_probs` on the same fixed logits and target tokens before it
+is used for loss.
+
+## Validation Plan
+
+Minimum checks before enabling any real operator path:
+
+- import/fallback test for missing RL-Kernel;
+- backend selection test for CPU/PyTorch fallback;
+- shape validation for `logits`, `token_ids`, `loss_masks`, and
+  `response_lengths`;
+- numerical parity with Vime's native selected-logprob reference on fixed
+  logits;
+- CP/TP layout test or an explicit guard that disables the integration when the
+  unsupported layout is active;
+- native Vime run with RL-Kernel mode off;
+- enabled-mode run that proves the intended callsite was invoked;
+- debug dump comparison using Vime's `save_debug_rollout_data` and
+  `save_debug_train_data` when available.
+
+For rollout-side claims, add WS3 validation first:
+
+- exact vLLM `logprobs_mode`;
+- temperature semantics;
+- fixed token replay to remove sampling randomness;
+- availability of logits or an internal vLLM custom-op insertion point.
+
+## Risks and Fallbacks
+
+| Risk | Impact | Fallback |
+| --- | --- | --- |
+| RL-Kernel not installed in Vime workers | Import failure in Ray actor or custom hook | Non-strict mode logs once and uses native Vime path. |
+| CUDA extension unavailable or unsupported GPU | Runtime backend failure | Use RL-Kernel PyTorch fallback or native Vime path. |
+| Vime TP/CP layout not supported by RL-Kernel op | Incorrect logprobs or shape errors | Disable RL-Kernel for that callsite until adapter coverage exists. |
+| vLLM HTTP response lacks logits | Cannot recompute rollout logprobs | Treat rollout path as observe/compare only; wait for WS3. |
+| Private vLLM/vllm-router API reliance | Breakage across versions | Pin tested versions and isolate upstream-specific code in a small adapter. |
+| Loss replacement misses Vime options | Training behavior regression | Start with sub-primitives; require parity tests for each enabled flag. |
+| Silent numerical drift | Hard-to-debug train instability | Emit drift metrics and support strict mode for CI. |
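+
+One possible shape for the drift metrics in the last row, sketched in plain
+Python. The metric names and the `logprob_drift` helper are hypothetical; the
+acceptable threshold is deliberately left open, as noted under Open Questions.
+
+```python
+def logprob_drift(train_logps, rollout_logps, loss_mask, prefix="rl_kernel"):
+    """Mean and max absolute train-vs-rollout logprob gap over unmasked tokens."""
+    diffs = [
+        abs(train - rollout)
+        for train, rollout, keep in zip(train_logps, rollout_logps, loss_mask)
+        if keep
+    ]
+    if not diffs:  # fully masked response: report zero rather than divide by 0
+        return {f"{prefix}/drift_mean_abs": 0.0, f"{prefix}/drift_max_abs": 0.0}
+    return {
+        f"{prefix}/drift_mean_abs": sum(diffs) / len(diffs),
+        f"{prefix}/drift_max_abs": max(diffs),
+    }
+
+
+metrics = logprob_drift(
+    train_logps=[-1.0, -2.0, -3.0],
+    rollout_logps=[-1.5, -2.0, -9.0],
+    loss_mask=[1, 1, 0],  # the third token is masked and must not count
+)
+```
+
+In strict mode a CI job could fail when the max gap exceeds an agreed bound,
+while non-strict runs would only log the values.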
+
+## Open Questions
+
+- Should the production integration live in Vime, in RL-Kernel as a Vime plugin,
+  or in a separate adapter package?
+- What exact TP-aware logprob contract should RL-Kernel expose for Megatron's
+  sharded vocabulary path?
+- Which Vime layouts are in scope for the first semantic operator PoC:
+  `qkv_format=thd`, `bshd`, CP=1 only, or all-gather-CP?
+- Can vLLM expose logits or selected-logprob custom-op hooks without expanding
+  the rollout HTTP payload?
+- What metric threshold should define acceptable train-vs-rollout logprob drift
+  for WS5 reporting?