Updates to slurm launcher#86

Open
raviguptaamd wants to merge 12 commits into ROCm:coketaste/refactor-dis from raviguptaamd:raviguptaamd/update_slurm_launcher

Conversation

@raviguptaamd

Motivation

PR Summary: SGLang Disaggregated Inference Support for Madengine v2

Replaces PR #68 - now sourced from fork with all latest commits.

Overview

This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.

Files Modified

1. src/madengine/cli/commands/build.py (+18 lines)

Purpose: Add CLI options for flexible build workflows

  • Added --use-image option: Skip Docker build and use pre-built image
  • Added --build-on-compute option: Build Docker images on SLURM compute node
  • Both options are optional and default to disabled
  • Existing build workflows unchanged when options not specified
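The interplay between these flags (tightened further in later commits) can be sketched as a small validator. This is a hypothetical helper, not the actual build.py code; the rules are inferred from this PR description: --use-image and --build-on-compute are mutually exclusive, and --build-on-compute requires --registry.

```python
def validate_build_flags(use_image=None, build_on_compute=False, registry=None):
    """Illustrative sketch of the flag rules described in this PR.

    - --use-image and --build-on-compute are mutually exclusive
    - --build-on-compute requires --registry (build once, push, pull everywhere)
    Both flags default to disabled, leaving existing build workflows unchanged.
    """
    if use_image and build_on_compute:
        raise ValueError("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        raise ValueError("--build-on-compute requires --registry")
```

When neither flag is passed, the validator is a no-op, matching the "existing workflows unchanged" guarantee above.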

2. src/madengine/deployment/slurm.py (+344 lines)

Purpose: Support baremetal execution for disagg launchers

  • Added self.reservation parsing from slurm config
  • Added baremetal launcher detection in prepare()
  • Added _prepare_baremetal_script() method that generates simple wrapper scripts
  • Added _run_inside_existing_allocation() for salloc support
  • Added _validate_allocation_nodes() for node count validation
  • Detection is explicit: Only triggers for sglang-disagg or vllm-disagg launchers
  • All other launchers continue to use existing job.sh.j2 template path
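The explicit detection step can be sketched as follows, assuming the launcher name lives under distributed.launcher in the deployment config (the real prepare() logic is more involved; the function name here is illustrative):

```python
# Launchers that manage their own Docker containers via SLURM.
# Early revisions of this PR used sglang-disagg / vllm-disagg; later
# commits replace both with the unified slurm_multi launcher.
SELF_MANAGED_LAUNCHERS = {"sglang-disagg", "vllm-disagg", "slurm_multi"}

def is_self_managed(deployment_config: dict) -> bool:
    """True when deployment should emit a thin wrapper script
    instead of rendering the standard job.sh.j2 template."""
    launcher = deployment_config.get("distributed", {}).get("launcher", "")
    return launcher in SELF_MANAGED_LAUNCHERS
```

Because membership is checked against an explicit allow-list, every other launcher falls through to the existing template path untouched.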

3. src/madengine/deployment/slurm_node_selector.py (+29/-18 lines)

Purpose: Add reservation support to node health checks

  • Added reservation parameter to srun commands for health checks and cleanup
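Concretely, the reservation is just one extra srun flag, appended only when non-empty (mirroring the cleanup snippet quoted later in this review). The health-check payload here is a placeholder; only the argv shape is taken from the PR:

```python
def srun_health_check_cmd(node: str, reservation: str = "") -> list:
    """Build the srun argv for a node health check.
    An empty reservation string is ignored, per the PR description."""
    cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if reservation:
        cmd.append(f"--reservation={reservation}")
    cmd.extend(["bash", "-c", "docker info > /dev/null"])
    return cmd
```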

4. src/madengine/execution/container_runner.py (+206 lines)

Purpose: Execute scripts directly on baremetal for disagg launchers

  • Added BAREMETAL_LAUNCHERS constant
  • Added _run_on_baremetal() method for direct script execution
  • Added launcher detection before Docker container creation
  • All other launchers continue to Docker execution unchanged
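The dispatch amounts to a guard in front of the Docker path. The callables below are stand-ins for the real methods (the actual _run_on_baremetal signature is not shown in this PR), so treat this as a sketch of the control flow only:

```python
# Later commits rename this constant to SELF_MANAGED_LAUNCHERS.
BAREMETAL_LAUNCHERS = {"sglang-disagg", "vllm-disagg"}

def run_model(launcher, run_on_baremetal, run_in_docker):
    """Disagg launchers execute their script directly on the host;
    every other launcher keeps the existing Docker execution path."""
    if launcher in BAREMETAL_LAUNCHERS:
        return run_on_baremetal()
    return run_in_docker()
```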

5. src/madengine/orchestration/build_orchestrator.py (+283 lines)

Purpose: Support pre-built images and compute-node builds

  • Added _execute_with_prebuilt_image() for --use-image option
  • Added _execute_build_on_compute() for --build-on-compute option
  • Standard execute() flow unchanged when options not used
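For the --use-image path, the essential output is a per-model manifest entry carrying the pre-built image name. This sketch follows the field names in the built_models snippet quoted later in the review, trimmed to the fields relevant to multi-node pulls:

```python
def prebuilt_manifest_entry(model: dict, use_image: str) -> dict:
    """Sketch of what the pre-built-image flow records per model."""
    env_vars = dict(model.get("env_vars", {}))
    # DOCKER_IMAGE_NAME lets the run phase pull the image in parallel on all nodes
    env_vars["DOCKER_IMAGE_NAME"] = use_image
    return {
        "name": model.get("name", "unknown"),
        "image": use_image,
        "docker_image": use_image,
        "env_vars": env_vars,
        "prebuilt": True,
    }
```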

Impact Analysis

Component              Affected Launchers           Other Launchers
slurm.py               sglang-disagg, vllm-disagg   No change - uses existing template
container_runner.py    sglang-disagg, vllm-disagg   No change - uses Docker execution
build.py               All (optional flags)         No change - flags default to disabled
build_orchestrator.py  All (optional flags)         No change - standard build if flags not used

Backward Compatibility

  • Existing models: All existing models continue to work unchanged
  • Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) all unchanged
  • Existing build workflows: Default behavior preserved
  • Existing SLURM deployments: Standard job template path unchanged

Testing

Tested with:

  • pyt_sglang_disagg_qwen3-32b model on 3-node SLURM cluster
  • Both salloc (existing allocation) and sbatch (new job) modes
  • Pre-built image workflow with --use-image
  • Benchmarks completed successfully (GSM8K accuracy + throughput sweep)

How to test

# Clone from fork
git clone https://github.com/raviguptaamd/madengine.git
cd madengine
git checkout raviguptaamd/update_slurm_launcher
pip install -e .

Made with Cursor

raviguptaamd and others added 12 commits April 16, 2026 07:14
- Add docker_image field to built_images and built_models in --use-image mode
- Merge model's distributed.launcher into deployment_config for proper detection
- Make reservation optional (empty string is ignored)
- Pass reservation to SlurmNodeSelector for health checks

Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error
in generated sbatch script when using --build-on-compute option.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace legacy launcher names with unified slurm_multi:
- Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS
- Add slurm_multi as the only baremetal multi-node launcher
- Update deployment logic in slurm.py and kubernetes.py
- Rename example config files from sglang-disagg-* to slurm-multi-*
- Update docs/launchers.md references

Breaking change: sglang-disagg and vllm-disagg are no longer valid
launcher names. Use slurm_multi instead.

Made-with: Cursor
…ures

docs/cli-reference.md:
- Add --use-image and --build-on-compute options to build command
- Add examples for pre-built image and compute node build workflows
- Document when to use each mode

docs/usage.md:
- Add "Pre-built Image Mode" section with examples
- Add "Build on Compute Node" section explaining workflow

docs/deployment.md:
- Add "SLURM Allocation Detection" section
- Add "Baremetal Execution (slurm_multi)" section
- Document environment variables and behavior inside salloc

docs/launchers.md:
- Fix example file paths (slurm_multi -> slurm-multi)

Made-with: Cursor
--use-image changes:
- Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME)
- Add mutual exclusivity with --registry and --build-on-compute
- Handle multiple models with different images (use first, warn)

--build-on-compute changes:
- Require --registry option (build once, push, pull everywhere)
- Build on 1 compute node only, push to registry
- Merge SLURM config from model card + additional-context (CLI overrides)
- Add registry credential validation before build
- Add docker login step in build script
- Smart registry image naming (namespace/repo vs namespace formats)
- Parallel docker pull on all nodes in run phase
- Clear error messages for missing partition, credentials

Run phase changes:
- Detect built_on_compute flag in manifest
- Pull image in parallel on all nodes via srun before model execution

Documentation updated for new workflows.

Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image
- Parallel docker pull on all nodes for any registry image (not just built_on_compute)
- DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds
- Documentation updated for slurm_multi requirements

Made-with: Cursor
Ensure consistent documentation across:
- deployment.md
- launchers.md
- usage.md

All now mention that slurm_multi requires --registry or --use-image.

Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image)
- Remove model name truncation in CLI results table
- Merge model card slurm and distributed config into build_manifest.json
- Add skip_monitoring flag for synchronous salloc execution
- Add completion marker file for robust sbatch job completion detection

Made-with: Cursor
…cher column

- Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist
  is set via model card or --additional-context
- Include nodelist in model card slurm merge key list so it propagates
  to the build manifest
- Rename perf table column from "Launcher" to "Workload" for clarity

Made-with: Cursor
…time)

When --additional-context provides partial slurm overrides (e.g. only
nodelist), ConfigLoader defaults were overwriting model card values like
nodes and time. Fixed by capturing original user slurm keys before
ConfigLoader runs, and deep-merging manifest slurm config with runtime
context instead of replacing it wholesale.

Made-with: Cursor
- Rename BAREMETAL_LAUNCHERS to SELF_MANAGED_LAUNCHERS to avoid
  confusion with PR ROCm#62's baremetal_vm (Docker-less nodes with VMs)
- Rename _run_on_baremetal -> _run_self_managed, _prepare_baremetal_script
  -> _prepare_slurm_multi_script throughout
- Add slurm_multi to common.py VALID_LAUNCHERS (canonical list) and
  add hyphen-variant normalization (slurm-multi -> slurm_multi)
- Remove duplicate VALID_LAUNCHERS from kubernetes.py (now inherits
  from common.py)
- Fix SBATCH output filename pattern: _%j -> _%j_%t to match the glob
  pattern in collect_results (madengine-*_{deployment_id}_*.out)
- Add _collect_slurm_multi_results for explicit slurm_multi results
  collection path with perf.csv retry and NFS propagation handling
- Add --use-image / --build-on-compute mutual exclusivity check in
  build_orchestrator.py
- Document --use-image vs MAD_CONTAINER_IMAGE relationship in
  cli-reference.md

Made-with: Cursor
@raviguptaamd raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from fe22e7e to 87f8b7f Compare April 16, 2026 08:05
Copilot AI review requested due to automatic review settings April 16, 2026 08:05

Copilot AI left a comment


Pull request overview

This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.

Changes:

  • Add --use-image and --build-on-compute build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
  • Add slurm_multi launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
  • Improve manifest/runtime config merging behavior for dict-valued deployment keys.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file

  • src/madengine/orchestration/run_orchestrator.py: Shallow-merges dict-valued deployment config keys into runtime additional context.
  • src/madengine/orchestration/build_orchestrator.py: Implements --use-image and --build-on-compute build flows and writes manifests accordingly.
  • src/madengine/execution/docker_builder.py: Writes DOCKER_IMAGE_NAME into built_models.env_vars when a registry image exists (supports parallel pulls).
  • src/madengine/execution/container_runner.py: Adds "self-managed" launcher path that runs scripts directly on host instead of Madengine-managed Docker.
  • src/madengine/deployment/slurm_node_selector.py: Adds reservation support for srun health checks / cleanup.
  • src/madengine/deployment/slurm.py: Adds slurm_multi wrapper script generation, reservation support, and existing-allocation detection (salloc).
  • src/madengine/deployment/kubernetes.py: Treats slurm_multi as a disaggregated launcher option in K8s template context and adds launcher command generators.
  • src/madengine/deployment/common.py: Updates valid launcher list and normalizes slurm-multi -> slurm_multi.
  • src/madengine/deployment/base.py: Adds skip_monitoring flag to support synchronous deploy flows (e.g., within salloc).
  • src/madengine/cli/validators.py: Extends validator signature to accept use_image (currently unused).
  • src/madengine/cli/utils.py: Adjusts display/extraction formatting (e.g., model name truncation and column header rename).
  • src/madengine/cli/commands/build.py: Adds CLI flags --use-image / --build-on-compute and enforces mutual exclusivity/requirements.
  • examples/slurm-configs/minimal/slurm-multi-minimal.json: Updates launcher value to slurm_multi.
  • examples/slurm-configs/basic/slurm-multi-multi-node.json: Updates launcher value to slurm_multi.
  • examples/slurm-configs/basic/slurm-multi-custom-split.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/minimal/slurm-multi-minimal.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/basic/slurm-multi-multi-node-basic.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/basic/slurm-multi-custom-split.json: Updates launcher value to slurm_multi.
  • docs/usage.md: Documents --use-image, --build-on-compute, and slurm_multi usage.
  • docs/launchers.md: Renames launcher references and documents slurm_multi registry requirement.
  • docs/deployment.md: Documents allocation detection and self-managed slurm_multi execution model.
  • docs/cli-reference.md: Adds CLI reference entries and usage examples for new build options and slurm_multi.
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:307

  • validate_additional_context() now accepts use_image, and the docstring says it “skips required field validation”, but the parameter is not used and validation runs unconditionally via finalize_additional_context_dict(context). Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context


Comment on lines +651 to +676
# Merge model's distributed config
model_distributed = models[0].get("distributed", {})
if model_distributed:
    if "distributed" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["distributed"] = {}

    # Copy launcher and other critical fields from model config
    for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
        if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
            saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]

# Merge model's slurm config into deployment_config.slurm
# This enables run phase to auto-detect SLURM deployment without --additional-context
model_slurm = models[0].get("slurm", {})
if model_slurm:
    if "slurm" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["slurm"] = {}

    # Copy slurm settings from model config (model card fills in
    # values not explicitly set by --additional-context).
    # Use _original_user_slurm_keys (captured before ConfigLoader
    # applies defaults) so model card values override defaults
    # but user's explicit CLI values still win.
    for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
        if key in model_slurm and key not in self._original_user_slurm_keys:
            saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]

Copilot AI Apr 16, 2026


_execute_with_prebuilt_image() populates deployment_config.distributed and deployment_config.slurm from models[0] only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per built_model entry).

Comment on lines +779 to +904
def _execute_build_on_compute(
    self,
    registry: Optional[str] = None,
    clean_cache: bool = False,
    manifest_output: str = "build_manifest.json",
    batch_build_metadata: Optional[Dict] = None,
) -> str:
    """
    Execute Docker build on a SLURM compute node and push to registry.

    Build workflow:
    1. Build on 1 compute node only
    2. Push image to registry
    3. Store registry image name in manifest
    4. Run phase will pull image in parallel on all nodes

    Args:
        registry: Registry to push images to (REQUIRED)
        clean_cache: Whether to use --no-cache for Docker builds
        manifest_output: Output file for build manifest
        batch_build_metadata: Optional batch build metadata

    Returns:
        Path to generated build_manifest.json
    """
    import subprocess
    import os
    import glob

    self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
    self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
    self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
    self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")

    # Discover models first to get SLURM config from model card
    self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
    discover_models = DiscoverModels(args=self.args)
    models = discover_models.run()

    if not models:
        raise DiscoveryError(
            "No models discovered for build-on-compute",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                "Check if models.json exists",
                "Verify --tags parameter is correct",
            ],
        )

    model = models[0]
    model_name = model.get("name", "unknown")
    self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")

    # Merge SLURM config: model card (base) + additional-context (override)
    model_slurm_config = model.get("slurm", {})
    context_slurm_config = self.additional_context.get("slurm", {})

    # Start with model card config, then override with command-line context
    slurm_config = {**model_slurm_config, **context_slurm_config}

    self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
    if model_slurm_config:
        self.rich_console.print(f" [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
    if context_slurm_config:
        self.rich_console.print(f" [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")

    # Validate required fields
    partition = slurm_config.get("partition")
    if not partition:
        raise ConfigurationError(
            "Missing required SLURM field: partition",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                'Add "partition" to model card\'s slurm section',
                'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
            ],
        )

    reservation = slurm_config.get("reservation", "")
    time_limit = slurm_config.get("time", "02:00:00")

    self.rich_console.print(f" Partition: {partition}")
    self.rich_console.print(f" Time limit: {time_limit}")
    if reservation:
        self.rich_console.print(f" Reservation: {reservation}")
    self.rich_console.print("")

    # Validate registry credentials
    self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
    self.rich_console.print(f" Registry: {registry}")

    # Check for credentials - either from environment or credential.json
    dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
    dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")

    # Try to load from credential.json if env vars not set
    credential_file = Path("credential.json")
    if not dockerhub_user and credential_file.exists():
        try:
            with open(credential_file) as f:
                creds = json.load(f)
            dockerhub_creds = creds.get("dockerhub", {})
            dockerhub_user = dockerhub_creds.get("username", "")
            dockerhub_password = dockerhub_creds.get("password", "")
            if dockerhub_user:
                self.rich_console.print(f" Credentials: Found in credential.json")
        except (json.JSONDecodeError, IOError) as e:
            self.rich_console.print(f" [yellow]Warning: Could not read credential.json: {e}[/yellow]")
    elif dockerhub_user:
        self.rich_console.print(f" Credentials: Found in environment (MAD_DOCKERHUB_USER)")

    # Determine if registry requires authentication
    requires_auth = True
    public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
    registry_lower = registry.lower() if registry else ""

    # For docker.io pushes, authentication is always required
    if any(pub_reg in registry_lower for pub_reg in public_registries):
        if not dockerhub_user or not dockerhub_password:
            raise ConfigurationError(

Copilot AI Apr 16, 2026


_execute_build_on_compute() treats registry as required (calls registry.lower(), registry.replace(...), etc.) but the function signature allows registry: Optional[str] = None and there’s no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an AttributeError rather than a user-facing ConfigurationError. Add an explicit if not registry: raise ConfigurationError(...) near the top of this method.

Comment on lines +787 to +793
result = subprocess.run(
    cmd,
    shell=True,
    cwd=script_dir,
    env=env,
    timeout=timeout if timeout > 0 else None,
)

Copilot AI Apr 16, 2026


subprocess.run(..., shell=True) is used with a command string that includes model_args (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., ["bash", script_path, *shlex.split(model_args)] / ["python3", ...]) and run with shell=False.
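The reviewer's suggestion can be sketched as follows; the function and parameter names are illustrative, not the real container_runner signatures:

```python
import shlex
import subprocess

def run_script(script_path, model_args, env=None, timeout=None):
    """Build the argv explicitly so shell metacharacters inside
    model_args are passed as literal arguments, never interpreted
    by a shell (closes the injection vector noted above)."""
    argv = ["bash", script_path, *shlex.split(model_args)]
    return subprocess.run(argv, shell=False, env=env, timeout=timeout)
```

With shell=False, an argument like 'a; rm -rf /' reaches the script verbatim as $1 instead of being executed as a second command.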

Comment on lines +434 to +435
"#SBATCH --exclusive",
]

Copilot AI Apr 16, 2026


The slurm_multi wrapper script always adds #SBATCH --exclusive regardless of the user’s slurm.exclusive setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting self.slurm_config.get("exclusive", True) here as well to avoid changing behavior for users who explicitly disable exclusivity.

Suggested change

    # Removed (unconditional):
            "#SBATCH --exclusive",
        ]

    # Added (respect the user's slurm.exclusive setting):
        ]
        if self.slurm_config.get("exclusive", True):
            script_lines.append("#SBATCH --exclusive")

from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.utils.path_utils import scripts_base_dir_from
import json
from typing import Optional

Copilot AI Apr 16, 2026


There is a duplicate import of Optional (from typing import Any, Dict, List, Optional and again from typing import Optional). Drop the second import to avoid redundancy.

Suggested change: delete the duplicate "from typing import Optional" line.
Comment on lines +603 to +632
# Add each discovered model with the pre-built image
# Use the image name as the key (matches how madengine build does it)
for model in models:
    model_name = model.get("name", "unknown")
    model_distributed = model.get("distributed", {})

    # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
    model_env_vars = model.get("env_vars", {}).copy()
    model_env_vars["DOCKER_IMAGE_NAME"] = use_image

    # Use image name as key so slurm.py can find docker_image
    manifest["built_models"][use_image] = {
        "name": model_name,
        "image": use_image,
        "docker_image": use_image,
        "dockerfile": model.get("dockerfile", ""),
        "scripts": model.get("scripts", ""),
        "data": model.get("data", ""),
        "n_gpus": model.get("n_gpus", "8"),
        "owner": model.get("owner", ""),
        "training_precision": model.get("training_precision", ""),
        "multiple_results": model.get("multiple_results", ""),
        "tags": model.get("tags", []),
        "timeout": model.get("timeout", -1),
        "args": model.get("args", ""),
        "slurm": model.get("slurm", {}),
        "distributed": model_distributed,
        "env_vars": model_env_vars,
        "prebuilt": True,
    }

Copilot AI Apr 16, 2026


In pre-built image mode, the for model in models: loop always writes to manifest["built_models"][use_image], so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in built_models even though summary.successful_builds lists all of them. Use a per-model unique key in built_models (and keep it consistent with built_images lookups), while setting docker_image to the shared use_image value.

Comment on lines +896 to +918
# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
    if not dockerhub_user or not dockerhub_password:
        raise ConfigurationError(
            f"Registry credentials required for pushing to {registry}",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
                registry=registry,
            ),
            suggestions=[
                "Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
                'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
                "For Docker Hub, use a Personal Access Token (PAT) as password",
                f"Example: export MAD_DOCKERHUB_USER=myuser",
                f"Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
            ],
        )

Copilot AI Apr 16, 2026


The registry auth/error messaging here is Docker Hub–specific (MAD_DOCKERHUB_USER/PASSWORD, PAT guidance) but the logic also classifies ghcr.io, gcr.io, quay.io, nvcr.io as registries that “require credentials”. This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on registry_host.

Comment on lines 331 to +345
srun_cmd.extend(["bash", "-c", cleanup_script])

try:
    srun_cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if self.reservation:
        srun_cmd.append(f"--reservation={self.reservation}")
    srun_cmd.extend(["bash", "-c", cleanup_script])


Copilot AI Apr 16, 2026


cleanup_node() builds srun_cmd (including job_name) before the try:, but then immediately reassigns srun_cmd inside the try: block. The reassignment drops the job_name argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both --job-name (when provided) and --reservation consistently.

Suggested change

    # Removed (the duplicate reassignment drops --job-name):
    srun_cmd.extend(["bash", "-c", cleanup_script])

    try:
        srun_cmd = [
            "srun",
            f"--nodelist={node}",
            "--ntasks=1",
            "--time=00:01:00",
            "--overlap",
            "--quiet",
        ]
        if self.reservation:
            srun_cmd.append(f"--reservation={self.reservation}")
        srun_cmd.extend(["bash", "-c", cleanup_script])

    # Added (keep the original srun_cmd construction, then append):
    if self.reservation:
        srun_cmd.append(f"--reservation={self.reservation}")
    srun_cmd.extend(["bash", "-c", cleanup_script])

    try:

Comment on lines +233 to +243
# Merge deployment_config into additional_context (for deployment layer to use).
# For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
# values fill in gaps while runtime --additional-context wins on conflicts.
for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
    if key in deployment_config:
        if key not in self.additional_context:
            self.additional_context[key] = deployment_config[key]
        elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
            merged = dict(deployment_config[key])
            merged.update(self.additional_context[key])
            self.additional_context[key] = merged

Copilot AI Apr 16, 2026


The comment says “deep-merge” but the implementation here is a one-level dict merge (merged = dict(...); merged.update(...)). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.
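For reference, a recursive deep merge of the kind the comment describes would look like this (an illustrative helper, not code from this PR):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge two dicts: override wins on scalar conflicts,
    while nested dicts are merged level by level instead of being
    replaced wholesale, as the one-level dict.update() merge does."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With the shallow merge, overriding only slurm.nodes would discard slurm.time from the manifest; the recursive version keeps both.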

