Updates to slurm launcher #86
raviguptaamd wants to merge 12 commits into ROCm:coketaste/refactor-dis from
Conversation
- Add docker_image field to built_images and built_models in --use-image mode
- Merge model's distributed.launcher into deployment_config for proper detection
- Make reservation optional (empty string is ignored)
- Pass reservation to SlurmNodeSelector for health checks

Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error in generated sbatch script when using --build-on-compute option. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace legacy launcher names with unified slurm_multi:
- Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS
- Add slurm_multi as the only baremetal multi-node launcher
- Update deployment logic in slurm.py and kubernetes.py
- Rename example config files from sglang-disagg-* to slurm-multi-*
- Update docs/launchers.md references

Breaking change: sglang-disagg and vllm-disagg are no longer valid launcher names. Use slurm_multi instead.

Made-with: Cursor
…ures

docs/cli-reference.md:
- Add --use-image and --build-on-compute options to build command
- Add examples for pre-built image and compute node build workflows
- Document when to use each mode

docs/usage.md:
- Add "Pre-built Image Mode" section with examples
- Add "Build on Compute Node" section explaining workflow

docs/deployment.md:
- Add "SLURM Allocation Detection" section
- Add "Baremetal Execution (slurm_multi)" section
- Document environment variables and behavior inside salloc

docs/launchers.md:
- Fix example file paths (slurm_multi -> slurm-multi)

Made-with: Cursor
--use-image changes:
- Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME)
- Add mutual exclusivity with --registry and --build-on-compute
- Handle multiple models with different images (use first, warn)

--build-on-compute changes:
- Require --registry option (build once, push, pull everywhere)
- Build on 1 compute node only, push to registry
- Merge SLURM config from model card + additional-context (CLI overrides)
- Add registry credential validation before build
- Add docker login step in build script
- Smart registry image naming (namespace/repo vs namespace formats)
- Parallel docker pull on all nodes in run phase
- Clear error messages for missing partition, credentials

Run phase changes:
- Detect built_on_compute flag in manifest
- Pull image in parallel on all nodes via srun before model execution

Documentation updated for new workflows.

Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image
- Parallel docker pull on all nodes for any registry image (not just built_on_compute)
- DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds
- Documentation updated for slurm_multi requirements

Made-with: Cursor
Ensure consistent documentation across:
- deployment.md
- launchers.md
- usage.md

All now mention that slurm_multi requires --registry or --use-image.

Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image)
- Remove model name truncation in CLI results table
- Merge model card slurm and distributed config into build_manifest.json
- Add skip_monitoring flag for synchronous salloc execution
- Add completion marker file for robust sbatch job completion detection

Made-with: Cursor
…cher column

- Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist is set via model card or --additional-context
- Include nodelist in model card slurm merge key list so it propagates to the build manifest
- Rename perf table column from "Launcher" to "Workload" for clarity

Made-with: Cursor
…time)

When --additional-context provides partial slurm overrides (e.g. only nodelist), ConfigLoader defaults were overwriting model card values like nodes and time. Fixed by capturing original user slurm keys before ConfigLoader runs, and deep-merging manifest slurm config with runtime context instead of replacing it wholesale.

Made-with: Cursor
- Rename BAREMETAL_LAUNCHERS to SELF_MANAGED_LAUNCHERS to avoid confusion with PR ROCm#62's baremetal_vm (Docker-less nodes with VMs)
- Rename _run_on_baremetal -> _run_self_managed, _prepare_baremetal_script -> _prepare_slurm_multi_script throughout
- Add slurm_multi to common.py VALID_LAUNCHERS (canonical list) and add hyphen-variant normalization (slurm-multi -> slurm_multi)
- Remove duplicate VALID_LAUNCHERS from kubernetes.py (now inherits from common.py)
- Fix SBATCH output filename pattern: _%j -> _%j_%t to match the glob pattern in collect_results (madengine-*_{deployment_id}_*.out)
- Add _collect_slurm_multi_results for explicit slurm_multi results collection path with perf.csv retry and NFS propagation handling
- Add --use-image / --build-on-compute mutual exclusivity check in build_orchestrator.py
- Document --use-image vs MAD_CONTAINER_IMAGE relationship in cli-reference.md

Made-with: Cursor
fe22e7e to 87f8b7f
Pull request overview
This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.
Changes:
- Add `--use-image` and `--build-on-compute` build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
- Add `slurm_multi` launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
- Improve manifest/runtime config merging behavior for dict-valued deployment keys.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/madengine/orchestration/run_orchestrator.py | Shallow-merges dict-valued deployment config keys into runtime additional context. |
| src/madengine/orchestration/build_orchestrator.py | Implements --use-image and --build-on-compute build flows and writes manifests accordingly. |
| src/madengine/execution/docker_builder.py | Writes DOCKER_IMAGE_NAME into built_models.env_vars when a registry image exists (supports parallel pulls). |
| src/madengine/execution/container_runner.py | Adds “self-managed” launcher path that runs scripts directly on host instead of Madengine-managed Docker. |
| src/madengine/deployment/slurm_node_selector.py | Adds reservation support for srun health checks / cleanup. |
| src/madengine/deployment/slurm.py | Adds slurm_multi wrapper script generation, reservation support, and existing-allocation detection (salloc). |
| src/madengine/deployment/kubernetes.py | Treats slurm_multi as a disaggregated launcher option in K8s template context and adds launcher command generators. |
| src/madengine/deployment/common.py | Updates valid launcher list and normalizes slurm-multi → slurm_multi. |
| src/madengine/deployment/base.py | Adds skip_monitoring flag to support synchronous deploy flows (e.g., within salloc). |
| src/madengine/cli/validators.py | Extends validator signature to accept use_image (currently unused). |
| src/madengine/cli/utils.py | Adjusts display/extraction formatting (e.g., model name truncation and column header rename). |
| src/madengine/cli/commands/build.py | Adds CLI flags --use-image / --build-on-compute and enforces mutual exclusivity/requirements. |
| examples/slurm-configs/minimal/slurm-multi-minimal.json | Updates launcher value to slurm_multi. |
| examples/slurm-configs/basic/slurm-multi-multi-node.json | Updates launcher value to slurm_multi. |
| examples/slurm-configs/basic/slurm-multi-custom-split.json | Updates launcher value to slurm_multi. |
| examples/k8s-configs/minimal/slurm-multi-minimal.json | Updates launcher value to slurm_multi. |
| examples/k8s-configs/basic/slurm-multi-multi-node-basic.json | Updates launcher value to slurm_multi. |
| examples/k8s-configs/basic/slurm-multi-custom-split.json | Updates launcher value to slurm_multi. |
| docs/usage.md | Documents --use-image, --build-on-compute, and slurm_multi usage. |
| docs/launchers.md | Renames launcher references and documents slurm_multi registry requirement. |
| docs/deployment.md | Documents allocation detection and self-managed slurm_multi execution model. |
| docs/cli-reference.md | Adds CLI reference entries and usage examples for new build options and slurm_multi. |
Comments suppressed due to low confidence (1)
src/madengine/cli/validators.py:307
`validate_additional_context()` now accepts `use_image`, and the docstring says it "skips required field validation", but the parameter is not used and validation runs unconditionally via `finalize_additional_context_dict(context)`. Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.
```python
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context
```
```python
# Merge model's distributed config
model_distributed = models[0].get("distributed", {})
if model_distributed:
    if "distributed" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["distributed"] = {}

    # Copy launcher and other critical fields from model config
    for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
        if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
            saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]

# Merge model's slurm config into deployment_config.slurm
# This enables run phase to auto-detect SLURM deployment without --additional-context
model_slurm = models[0].get("slurm", {})
if model_slurm:
    if "slurm" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["slurm"] = {}

    # Copy slurm settings from model config (model card fills in
    # values not explicitly set by --additional-context).
    # Use _original_user_slurm_keys (captured before ConfigLoader
    # applies defaults) so model card values override defaults
    # but user's explicit CLI values still win.
    for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
        if key in model_slurm and key not in self._original_user_slurm_keys:
            saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]
```
_execute_with_prebuilt_image() populates deployment_config.distributed and deployment_config.slurm from models[0] only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per built_model entry).
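One way to address this would be a fail-fast check before reading `models[0]`. The helper below is a hypothetical sketch (its name and error message are not from the PR) of the "validate that only one effective config is allowed" option:

```python
def check_uniform_configs(models):
    """Return the shared (distributed, slurm) config if all discovered
    models agree; raise instead of silently using models[0]."""
    configs = [
        (m.get("distributed", {}), m.get("slurm", {})) for m in models
    ]
    if any(c != configs[0] for c in configs[1:]):
        raise ValueError(
            "--use-image selected multiple models with conflicting "
            "distributed/slurm configs; narrow --tags to a single model"
        )
    return configs[0]
```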
```python
def _execute_build_on_compute(
    self,
    registry: Optional[str] = None,
    clean_cache: bool = False,
    manifest_output: str = "build_manifest.json",
    batch_build_metadata: Optional[Dict] = None,
) -> str:
    """
    Execute Docker build on a SLURM compute node and push to registry.

    Build workflow:
    1. Build on 1 compute node only
    2. Push image to registry
    3. Store registry image name in manifest
    4. Run phase will pull image in parallel on all nodes

    Args:
        registry: Registry to push images to (REQUIRED)
        clean_cache: Whether to use --no-cache for Docker builds
        manifest_output: Output file for build manifest
        batch_build_metadata: Optional batch build metadata

    Returns:
        Path to generated build_manifest.json
    """
    import subprocess
    import os
    import glob

    self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
    self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
    self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
    self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")

    # Discover models first to get SLURM config from model card
    self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
    discover_models = DiscoverModels(args=self.args)
    models = discover_models.run()

    if not models:
        raise DiscoveryError(
            "No models discovered for build-on-compute",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                "Check if models.json exists",
                "Verify --tags parameter is correct",
            ],
        )

    model = models[0]
    model_name = model.get("name", "unknown")
    self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")

    # Merge SLURM config: model card (base) + additional-context (override)
    model_slurm_config = model.get("slurm", {})
    context_slurm_config = self.additional_context.get("slurm", {})

    # Start with model card config, then override with command-line context
    slurm_config = {**model_slurm_config, **context_slurm_config}

    self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
    if model_slurm_config:
        self.rich_console.print(f"  [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
    if context_slurm_config:
        self.rich_console.print(f"  [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")

    # Validate required fields
    partition = slurm_config.get("partition")
    if not partition:
        raise ConfigurationError(
            "Missing required SLURM field: partition",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                'Add "partition" to model card\'s slurm section',
                'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
            ],
        )

    reservation = slurm_config.get("reservation", "")
    time_limit = slurm_config.get("time", "02:00:00")

    self.rich_console.print(f"  Partition: {partition}")
    self.rich_console.print(f"  Time limit: {time_limit}")
    if reservation:
        self.rich_console.print(f"  Reservation: {reservation}")
    self.rich_console.print("")

    # Validate registry credentials
    self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
    self.rich_console.print(f"  Registry: {registry}")

    # Check for credentials - either from environment or credential.json
    dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
    dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")

    # Try to load from credential.json if env vars not set
    credential_file = Path("credential.json")
    if not dockerhub_user and credential_file.exists():
        try:
            with open(credential_file) as f:
                creds = json.load(f)
            dockerhub_creds = creds.get("dockerhub", {})
            dockerhub_user = dockerhub_creds.get("username", "")
            dockerhub_password = dockerhub_creds.get("password", "")
            if dockerhub_user:
                self.rich_console.print("  Credentials: Found in credential.json")
        except (json.JSONDecodeError, IOError) as e:
            self.rich_console.print(f"  [yellow]Warning: Could not read credential.json: {e}[/yellow]")
    elif dockerhub_user:
        self.rich_console.print("  Credentials: Found in environment (MAD_DOCKERHUB_USER)")

    # Determine if registry requires authentication
    requires_auth = True
    public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
    registry_lower = registry.lower() if registry else ""

    # For docker.io pushes, authentication is always required
    if any(pub_reg in registry_lower for pub_reg in public_registries):
        if not dockerhub_user or not dockerhub_password:
            raise ConfigurationError(
```
_execute_build_on_compute() treats registry as required (calls registry.lower(), registry.replace(...), etc.) but the function signature allows registry: Optional[str] = None and there’s no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an AttributeError rather than a user-facing ConfigurationError. Add an explicit if not registry: raise ConfigurationError(...) near the top of this method.
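A minimal sketch of the suggested guard, using a stand-in exception class (the real code would raise madengine's `ConfigurationError` with its `context`/`suggestions` arguments):

```python
class ConfigurationError(Exception):
    """Stand-in for madengine's ConfigurationError (assumption)."""

def require_registry(registry):
    # Explicit guard so a None/empty registry fails with a clear,
    # user-facing error instead of an AttributeError on registry.lower().
    if not registry:
        raise ConfigurationError(
            "--build-on-compute requires --registry; "
            "pass e.g. --registry docker.io/myorg"
        )
    return registry.lower()
```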
```python
result = subprocess.run(
    cmd,
    shell=True,
    cwd=script_dir,
    env=env,
    timeout=timeout if timeout > 0 else None,
)
```
subprocess.run(..., shell=True) is used with a command string that includes model_args (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., ["bash", script_path, *shlex.split(model_args)] / ["python3", ...]) and run with shell=False.
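A sketch of the argv-based approach (`build_argv` is a hypothetical helper name): `shlex.split` tokenizes the user-supplied arguments, and because no shell is involved, metacharacters stay literal tokens rather than being executed.

```python
import shlex

def build_argv(script_path, model_args):
    # Tokenize user-supplied args without letting a shell interpret
    # metacharacters; pass the result to subprocess.run(..., shell=False).
    return ["bash", script_path, *shlex.split(model_args)]
```

For example, an injection attempt like `--x '$(touch pwned)'` becomes the literal argument `$(touch pwned)` instead of a command substitution.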
| "#SBATCH --exclusive", | ||
| ] |
The slurm_multi wrapper script always adds #SBATCH --exclusive regardless of the user’s slurm.exclusive setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting self.slurm_config.get("exclusive", True) here as well to avoid changing behavior for users who explicitly disable exclusivity.
| "#SBATCH --exclusive", | |
| ] | |
| ] | |
| if self.slurm_config.get("exclusive", True): | |
| script_lines.append("#SBATCH --exclusive") |
```python
from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.utils.path_utils import scripts_base_dir_from
import json
from typing import Optional
```
There is a duplicate import of Optional (from typing import Any, Dict, List, Optional and again from typing import Optional). Drop the second import to avoid redundancy.
Suggested change:

```diff
-from typing import Optional
```
```python
# Add each discovered model with the pre-built image
# Use the image name as the key (matches how madengine build does it)
for model in models:
    model_name = model.get("name", "unknown")
    model_distributed = model.get("distributed", {})

    # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
    model_env_vars = model.get("env_vars", {}).copy()
    model_env_vars["DOCKER_IMAGE_NAME"] = use_image

    # Use image name as key so slurm.py can find docker_image
    manifest["built_models"][use_image] = {
        "name": model_name,
        "image": use_image,
        "docker_image": use_image,
        "dockerfile": model.get("dockerfile", ""),
        "scripts": model.get("scripts", ""),
        "data": model.get("data", ""),
        "n_gpus": model.get("n_gpus", "8"),
        "owner": model.get("owner", ""),
        "training_precision": model.get("training_precision", ""),
        "multiple_results": model.get("multiple_results", ""),
        "tags": model.get("tags", []),
        "timeout": model.get("timeout", -1),
        "args": model.get("args", ""),
        "slurm": model.get("slurm", {}),
        "distributed": model_distributed,
        "env_vars": model_env_vars,
        "prebuilt": True,
    }
```
In pre-built image mode, the for model in models: loop always writes to manifest["built_models"][use_image], so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in built_models even though summary.successful_builds lists all of them. Use a per-model unique key in built_models (and keep it consistent with built_images lookups), while setting docker_image to the shared use_image value.
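A condensed sketch of the per-model keying (field list trimmed for brevity; `add_built_models` is a hypothetical helper name): keying `built_models` by the model name keeps every entry while `docker_image` still points at the shared pre-built image.

```python
def add_built_models(manifest, models, use_image):
    # Key built_models by model name (unique per model) instead of the
    # shared image name, so later iterations don't overwrite earlier ones.
    for model in models:
        name = model.get("name", "unknown")
        manifest["built_models"][name] = {
            "name": name,
            "docker_image": use_image,  # shared pre-built image
            "env_vars": {**model.get("env_vars", {}), "DOCKER_IMAGE_NAME": use_image},
            "prebuilt": True,
        }
```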
```python
# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
    if not dockerhub_user or not dockerhub_password:
        raise ConfigurationError(
            f"Registry credentials required for pushing to {registry}",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
                registry=registry,
            ),
            suggestions=[
                "Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
                'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
                "For Docker Hub, use a Personal Access Token (PAT) as password",
                "Example: export MAD_DOCKERHUB_USER=myuser",
                "Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
            ],
        )
```
The registry auth/error messaging here is Docker Hub–specific (MAD_DOCKERHUB_USER/PASSWORD, PAT guidance) but the logic also classifies ghcr.io, gcr.io, quay.io, nvcr.io as registries that “require credentials”. This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on registry_host.
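One possible generalization is to derive the suggested env vars from the registry host. This is a hypothetical sketch: only `MAD_DOCKERHUB_USER`/`MAD_DOCKERHUB_PASSWORD` appear in the PR; the other variable names are invented placeholders for illustration.

```python
def credential_env_vars(registry):
    # Map a registry host to the env vars a user should set, instead of
    # always pointing at Docker Hub credentials. Non-DockerHub names
    # below are assumptions, not existing madengine variables.
    host = registry.split("/")[0].lower()
    mapping = {
        "docker.io": ("MAD_DOCKERHUB_USER", "MAD_DOCKERHUB_PASSWORD"),
        "ghcr.io": ("MAD_GHCR_USER", "MAD_GHCR_TOKEN"),
        "quay.io": ("MAD_QUAY_USER", "MAD_QUAY_TOKEN"),
    }
    # Fall back to generic names for private/unknown registries.
    return mapping.get(host, ("MAD_REGISTRY_USER", "MAD_REGISTRY_PASSWORD"))
```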
```python
srun_cmd.extend(["bash", "-c", cleanup_script])

try:
    srun_cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if self.reservation:
        srun_cmd.append(f"--reservation={self.reservation}")
    srun_cmd.extend(["bash", "-c", cleanup_script])
```
cleanup_node() builds srun_cmd (including job_name) before the try:, but then immediately reassigns srun_cmd inside the try: block. The reassignment drops the job_name argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both --job-name (when provided) and --reservation consistently.
Suggested change:

```diff
-        srun_cmd.extend(["bash", "-c", cleanup_script])
-
-        try:
-            srun_cmd = [
-                "srun",
-                f"--nodelist={node}",
-                "--ntasks=1",
-                "--time=00:01:00",
-                "--overlap",
-                "--quiet",
-            ]
-            if self.reservation:
-                srun_cmd.append(f"--reservation={self.reservation}")
-            srun_cmd.extend(["bash", "-c", cleanup_script])
+        if self.reservation:
+            srun_cmd.append(f"--reservation={self.reservation}")
+        srun_cmd.extend(["bash", "-c", cleanup_script])
+        try:
```
```diff
 # Merge deployment_config into additional_context (for deployment layer to use).
 # For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
 # values fill in gaps while runtime --additional-context wins on conflicts.
 for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
-    if key in deployment_config and key not in self.additional_context:
-        self.additional_context[key] = deployment_config[key]
+    if key in deployment_config:
+        if key not in self.additional_context:
+            self.additional_context[key] = deployment_config[key]
+        elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
+            merged = dict(deployment_config[key])
+            merged.update(self.additional_context[key])
+            self.additional_context[key] = merged
```
The comment says “deep-merge” but the implementation here is a one-level dict merge (merged = dict(...); merged.update(...)). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.
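If the recursive behavior is the intent, a standard deep merge looks like the sketch below (a generic pattern, not madengine code); override values win on conflicts and nested dicts are merged rather than replaced.

```python
def deep_merge(base, override):
    # Recursively merge two dicts: values from `override` win on
    # conflicts; nested dicts are merged level by level instead of
    # being overwritten wholesale.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```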
Motivation
PR Summary: SGLang Disaggregated Inference Support for Madengine v2
Overview
This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.
Files Modified

1. src/madengine/cli/commands/build.py (+18 lines)
   Purpose: Add CLI options for flexible build workflows
   - `--use-image` option: Skip Docker build and use pre-built image
   - `--build-on-compute` option: Build Docker images on SLURM compute node
2. src/madengine/deployment/slurm.py (+344 lines)
   Purpose: Support baremetal execution for disagg launchers
   - `self.reservation` parsing from slurm config in `prepare()`
   - `_prepare_baremetal_script()` method that generates simple wrapper scripts
   - `_run_inside_existing_allocation()` for salloc support
   - `_validate_allocation_nodes()` for node count validation
   - `sglang-disagg` or `vllm-disagg` launchers
   - `job.sh.j2` template path
3. src/madengine/deployment/slurm_node_selector.py (+29/-18 lines)
   Purpose: Add reservation support to node health checks
   - `reservation` parameter to srun commands for health checks and cleanup
4. src/madengine/execution/container_runner.py (+206 lines)
   Purpose: Execute scripts directly on baremetal for disagg launchers
   - `BAREMETAL_LAUNCHERS` constant
   - `_run_on_baremetal()` method for direct script execution
5. src/madengine/orchestration/build_orchestrator.py (+283 lines)
   Purpose: Support pre-built images and compute-node builds
   - `_execute_with_prebuilt_image()` for `--use-image` option
   - `_execute_build_on_compute()` for `--build-on-compute` option
   - `execute()` flow unchanged when options not used

Impact Analysis
Backward Compatibility
Testing
Tested with:
- `pyt_sglang_disagg_qwen3-32b` model on 3-node SLURM cluster
- `salloc` (existing allocation) and `sbatch` (new job) modes
- `--use-image`

How to test
Made with Cursor