Updates to slurm launcher#86

Open
raviguptaamd wants to merge 12 commits into ROCm:coketaste/refactor-dis from raviguptaamd:raviguptaamd/update_slurm_launcher

Conversation

@raviguptaamd

Motivation

PR Summary: SGLang Disaggregated Inference Support for Madengine v2

Replaces PR #68 - now sourced from fork with all latest commits.

Overview

This PR adds support for running SGLang Disaggregated (P/D) inference workloads on SLURM clusters. The key architectural change is introducing a "baremetal launcher" path for disaggregated inference launchers that manage their own Docker containers via SLURM.

Files Modified

1. src/madengine/cli/commands/build.py (+18 lines)

Purpose: Add CLI options for flexible build workflows

  • Added --use-image option: Skip Docker build and use pre-built image
  • Added --build-on-compute option: Build Docker images on SLURM compute node
  • Both options are optional and default to disabled
  • Existing build workflows unchanged when options not specified
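The interplay between these flags (tightened further in later commits) can be sketched as a small validator. This is a hypothetical helper, not the actual build.py code; the rules are inferred from this PR description: --use-image and --build-on-compute are mutually exclusive, and --build-on-compute requires --registry.

```python
def validate_build_flags(use_image=None, build_on_compute=False, registry=None):
    """Illustrative sketch of the flag rules described in this PR.

    - --use-image and --build-on-compute are mutually exclusive
    - --build-on-compute requires --registry (build once, push, pull everywhere)
    Both flags default to disabled, leaving existing build workflows unchanged.
    """
    if use_image and build_on_compute:
        raise ValueError("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        raise ValueError("--build-on-compute requires --registry")
```

When neither flag is passed, the validator is a no-op, matching the "existing workflows unchanged" guarantee above.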

2. src/madengine/deployment/slurm.py (+344 lines)

Purpose: Support baremetal execution for disagg launchers

  • Added self.reservation parsing from slurm config
  • Added baremetal launcher detection in prepare()
  • Added _prepare_baremetal_script() method that generates simple wrapper scripts
  • Added _run_inside_existing_allocation() for salloc support
  • Added _validate_allocation_nodes() for node count validation
  • Detection is explicit: Only triggers for sglang-disagg or vllm-disagg launchers
  • All other launchers continue to use existing job.sh.j2 template path
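The explicit detection step can be sketched as follows, assuming the launcher name lives under distributed.launcher in the deployment config (the real prepare() logic is more involved; the function name here is illustrative):

```python
# Launchers that manage their own Docker containers via SLURM.
# Early revisions of this PR used sglang-disagg / vllm-disagg; later
# commits replace both with the unified slurm_multi launcher.
SELF_MANAGED_LAUNCHERS = {"sglang-disagg", "vllm-disagg", "slurm_multi"}

def is_self_managed(deployment_config: dict) -> bool:
    """True when deployment should emit a thin wrapper script
    instead of rendering the standard job.sh.j2 template."""
    launcher = deployment_config.get("distributed", {}).get("launcher", "")
    return launcher in SELF_MANAGED_LAUNCHERS
```

Because membership is checked against an explicit allow-list, every other launcher falls through to the existing template path untouched.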

3. src/madengine/deployment/slurm_node_selector.py (+29/-18 lines)

Purpose: Add reservation support to node health checks

  • Added reservation parameter to srun commands for health checks and cleanup
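Concretely, the reservation is just one extra srun flag, appended only when non-empty (mirroring the cleanup snippet quoted later in this review). The health-check payload here is a placeholder; only the argv shape is taken from the PR:

```python
def srun_health_check_cmd(node: str, reservation: str = "") -> list:
    """Build the srun argv for a node health check.
    An empty reservation string is ignored, per the PR description."""
    cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if reservation:
        cmd.append(f"--reservation={reservation}")
    cmd.extend(["bash", "-c", "docker info > /dev/null"])
    return cmd
```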

4. src/madengine/execution/container_runner.py (+206 lines)

Purpose: Execute scripts directly on baremetal for disagg launchers

  • Added BAREMETAL_LAUNCHERS constant
  • Added _run_on_baremetal() method for direct script execution
  • Added launcher detection before Docker container creation
  • All other launchers continue to Docker execution unchanged
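The dispatch amounts to a guard in front of the Docker path. The callables below are stand-ins for the real methods (the actual _run_on_baremetal signature is not shown in this PR), so treat this as a sketch of the control flow only:

```python
# Later commits rename this constant to SELF_MANAGED_LAUNCHERS.
BAREMETAL_LAUNCHERS = {"sglang-disagg", "vllm-disagg"}

def run_model(launcher, run_on_baremetal, run_in_docker):
    """Disagg launchers execute their script directly on the host;
    every other launcher keeps the existing Docker execution path."""
    if launcher in BAREMETAL_LAUNCHERS:
        return run_on_baremetal()
    return run_in_docker()
```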

5. src/madengine/orchestration/build_orchestrator.py (+283 lines)

Purpose: Support pre-built images and compute-node builds

  • Added _execute_with_prebuilt_image() for --use-image option
  • Added _execute_build_on_compute() for --build-on-compute option
  • Standard execute() flow unchanged when options not used
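For the --use-image path, the essential output is a per-model manifest entry carrying the pre-built image name. This sketch follows the field names in the built_models snippet quoted later in the review, trimmed to the fields relevant to multi-node pulls:

```python
def prebuilt_manifest_entry(model: dict, use_image: str) -> dict:
    """Sketch of what the pre-built-image flow records per model."""
    env_vars = dict(model.get("env_vars", {}))
    # DOCKER_IMAGE_NAME lets the run phase pull the image in parallel on all nodes
    env_vars["DOCKER_IMAGE_NAME"] = use_image
    return {
        "name": model.get("name", "unknown"),
        "image": use_image,
        "docker_image": use_image,
        "env_vars": env_vars,
        "prebuilt": True,
    }
```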

Impact Analysis

Component              Affected Launchers           Other Launchers
slurm.py               sglang-disagg, vllm-disagg   No change - uses existing template
container_runner.py    sglang-disagg, vllm-disagg   No change - uses Docker execution
build.py               All (optional flags)         No change - flags default to disabled
build_orchestrator.py  All (optional flags)         No change - standard build if flags not used

Backward Compatibility

  • Existing models: All existing models continue to work unchanged
  • Existing launchers: torchrun, deepspeed, megatron-lm, vllm, sglang (non-disagg) all unchanged
  • Existing build workflows: Default behavior preserved
  • Existing SLURM deployments: Standard job template path unchanged

Testing

Tested with:

  • pyt_sglang_disagg_qwen3-32b model on 3-node SLURM cluster
  • Both salloc (existing allocation) and sbatch (new job) modes
  • Pre-built image workflow with --use-image
  • Benchmarks completed successfully (GSM8K accuracy + throughput sweep)

How to test

# Clone from fork
git clone https://github.com/raviguptaamd/madengine.git
cd madengine
git checkout raviguptaamd/update_slurm_launcher
pip install -e .

Made with Cursor

raviguptaamd and others added 12 commits April 16, 2026 07:14
- Add docker_image field to built_images and built_models in --use-image mode
- Merge model's distributed.launcher into deployment_config for proper detection
- Make reservation optional (empty string is ignored)
- Pass reservation to SlurmNodeSelector for health checks

Co-authored-by: Cursor <cursoragent@cursor.com>
Add missing newline after 'fi' statement to prevent syntax error
in generated sbatch script when using --build-on-compute option.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace legacy launcher names with unified slurm_multi:
- Remove sglang-disagg, vllm-disagg from VALID_LAUNCHERS
- Add slurm_multi as the only baremetal multi-node launcher
- Update deployment logic in slurm.py and kubernetes.py
- Rename example config files from sglang-disagg-* to slurm-multi-*
- Update docs/launchers.md references

Breaking change: sglang-disagg and vllm-disagg are no longer valid
launcher names. Use slurm_multi instead.

Made-with: Cursor
…ures

docs/cli-reference.md:
- Add --use-image and --build-on-compute options to build command
- Add examples for pre-built image and compute node build workflows
- Document when to use each mode

docs/usage.md:
- Add "Pre-built Image Mode" section with examples
- Add "Build on Compute Node" section explaining workflow

docs/deployment.md:
- Add "SLURM Allocation Detection" section
- Add "Baremetal Execution (slurm_multi)" section
- Document environment variables and behavior inside salloc

docs/launchers.md:
- Fix example file paths (slurm_multi -> slurm-multi)

Made-with: Cursor
--use-image changes:
- Make image name optional (auto-detect from model card's DOCKER_IMAGE_NAME)
- Add mutual exclusivity with --registry and --build-on-compute
- Handle multiple models with different images (use first, warn)

--build-on-compute changes:
- Require --registry option (build once, push, pull everywhere)
- Build on 1 compute node only, push to registry
- Merge SLURM config from model card + additional-context (CLI overrides)
- Add registry credential validation before build
- Add docker login step in build script
- Smart registry image naming (namespace/repo vs namespace formats)
- Parallel docker pull on all nodes in run phase
- Clear error messages for missing partition, credentials

Run phase changes:
- Detect built_on_compute flag in manifest
- Pull image in parallel on all nodes via srun before model execution

Documentation updated for new workflows.

Made-with: Cursor
- slurm_multi launcher now requires --registry or --use-image
- Parallel docker pull on all nodes for any registry image (not just built_on_compute)
- DOCKER_IMAGE_NAME added to manifest env_vars for all registry builds
- Documentation updated for slurm_multi requirements

Made-with: Cursor
Ensure consistent documentation across:
- deployment.md
- launchers.md
- usage.md

All now mention that slurm_multi requires --registry or --use-image.

Made-with: Cursor
- Make gpu_vendor/guest_os optional when using --use-image (pre-built image)
- Remove model name truncation in CLI results table
- Merge model card slurm and distributed config into build_manifest.json
- Add skip_monitoring flag for synchronous salloc execution
- Add completion marker file for robust sbatch job completion detection

Made-with: Cursor
…cher column

- Emit #SBATCH --nodelist in _prepare_baremetal_script() when nodelist
  is set via model card or --additional-context
- Include nodelist in model card slurm merge key list so it propagates
  to the build manifest
- Rename perf table column from "Launcher" to "Workload" for clarity

Made-with: Cursor
…time)

When --additional-context provides partial slurm overrides (e.g. only
nodelist), ConfigLoader defaults were overwriting model card values like
nodes and time. Fixed by capturing original user slurm keys before
ConfigLoader runs, and deep-merging manifest slurm config with runtime
context instead of replacing it wholesale.

Made-with: Cursor
- Rename BAREMETAL_LAUNCHERS to SELF_MANAGED_LAUNCHERS to avoid
  confusion with PR ROCm#62's baremetal_vm (Docker-less nodes with VMs)
- Rename _run_on_baremetal -> _run_self_managed, _prepare_baremetal_script
  -> _prepare_slurm_multi_script throughout
- Add slurm_multi to common.py VALID_LAUNCHERS (canonical list) and
  add hyphen-variant normalization (slurm-multi -> slurm_multi)
- Remove duplicate VALID_LAUNCHERS from kubernetes.py (now inherits
  from common.py)
- Fix SBATCH output filename pattern: _%j -> _%j_%t to match the glob
  pattern in collect_results (madengine-*_{deployment_id}_*.out)
- Add _collect_slurm_multi_results for explicit slurm_multi results
  collection path with perf.csv retry and NFS propagation handling
- Add --use-image / --build-on-compute mutual exclusivity check in
  build_orchestrator.py
- Document --use-image vs MAD_CONTAINER_IMAGE relationship in
  cli-reference.md

Made-with: Cursor
@raviguptaamd raviguptaamd force-pushed the raviguptaamd/update_slurm_launcher branch from fe22e7e to 87f8b7f Compare April 16, 2026 08:05
Copilot AI review requested due to automatic review settings April 16, 2026 08:05

Copilot AI left a comment


Pull request overview

This PR updates Madengine’s SLURM (and related) workflows to better support disaggregated / self-managed multi-node inference by introducing a slurm_multi launcher path, plus new build options for using pre-built images or building on compute nodes.

Changes:

  • Add --use-image and --build-on-compute build workflows and propagate the resulting image name into manifests/env for multi-node pulls.
  • Add slurm_multi launcher handling across SLURM/K8s deployments, including self-managed (host-executed) script paths and SLURM reservation support.
  • Improve manifest/runtime config merging behavior for dict-valued deployment keys.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file

  • src/madengine/orchestration/run_orchestrator.py: Shallow-merges dict-valued deployment config keys into runtime additional context.
  • src/madengine/orchestration/build_orchestrator.py: Implements --use-image and --build-on-compute build flows and writes manifests accordingly.
  • src/madengine/execution/docker_builder.py: Writes DOCKER_IMAGE_NAME into built_models.env_vars when a registry image exists (supports parallel pulls).
  • src/madengine/execution/container_runner.py: Adds "self-managed" launcher path that runs scripts directly on host instead of Madengine-managed Docker.
  • src/madengine/deployment/slurm_node_selector.py: Adds reservation support for srun health checks / cleanup.
  • src/madengine/deployment/slurm.py: Adds slurm_multi wrapper script generation, reservation support, and existing-allocation detection (salloc).
  • src/madengine/deployment/kubernetes.py: Treats slurm_multi as a disaggregated launcher option in K8s template context and adds launcher command generators.
  • src/madengine/deployment/common.py: Updates valid launcher list and normalizes slurm-multi -> slurm_multi.
  • src/madengine/deployment/base.py: Adds skip_monitoring flag to support synchronous deploy flows (e.g., within salloc).
  • src/madengine/cli/validators.py: Extends validator signature to accept use_image (currently unused).
  • src/madengine/cli/utils.py: Adjusts display/extraction formatting (e.g., model name truncation and column header rename).
  • src/madengine/cli/commands/build.py: Adds CLI flags --use-image / --build-on-compute and enforces mutual exclusivity/requirements.
  • examples/slurm-configs/minimal/slurm-multi-minimal.json: Updates launcher value to slurm_multi.
  • examples/slurm-configs/basic/slurm-multi-multi-node.json: Updates launcher value to slurm_multi.
  • examples/slurm-configs/basic/slurm-multi-custom-split.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/minimal/slurm-multi-minimal.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/basic/slurm-multi-multi-node-basic.json: Updates launcher value to slurm_multi.
  • examples/k8s-configs/basic/slurm-multi-custom-split.json: Updates launcher value to slurm_multi.
  • docs/usage.md: Documents --use-image, --build-on-compute, and slurm_multi usage.
  • docs/launchers.md: Renames launcher references and documents slurm_multi registry requirement.
  • docs/deployment.md: Documents allocation detection and self-managed slurm_multi execution model.
  • docs/cli-reference.md: Adds CLI reference entries and usage examples for new build options and slurm_multi.
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:307

  • validate_additional_context() now accepts use_image, and the docstring says it “skips required field validation”, but the parameter is not used and validation runs unconditionally via finalize_additional_context_dict(context). Either remove the unused parameter (and update callers), or implement the intended conditional validation behavior.
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context


Comment on lines +651 to +676
# Merge model's distributed config
model_distributed = models[0].get("distributed", {})
if model_distributed:
    if "distributed" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["distributed"] = {}

    # Copy launcher and other critical fields from model config
    for key in ["launcher", "nnodes", "nproc_per_node", "backend", "port", "sglang_disagg", "vllm_disagg"]:
        if key in model_distributed and key not in saved_manifest["deployment_config"]["distributed"]:
            saved_manifest["deployment_config"]["distributed"][key] = model_distributed[key]

# Merge model's slurm config into deployment_config.slurm
# This enables run phase to auto-detect SLURM deployment without --additional-context
model_slurm = models[0].get("slurm", {})
if model_slurm:
    if "slurm" not in saved_manifest["deployment_config"]:
        saved_manifest["deployment_config"]["slurm"] = {}

    # Copy slurm settings from model config (model card fills in
    # values not explicitly set by --additional-context).
    # Use _original_user_slurm_keys (captured before ConfigLoader
    # applies defaults) so model card values override defaults
    # but user's explicit CLI values still win.
    for key in ["partition", "nodes", "gpus_per_node", "time", "exclusive", "reservation", "output_dir", "nodelist"]:
        if key in model_slurm and key not in self._original_user_slurm_keys:
            saved_manifest["deployment_config"]["slurm"][key] = model_slurm[key]

Copilot AI Apr 16, 2026


_execute_with_prebuilt_image() populates deployment_config.distributed and deployment_config.slurm from models[0] only. If the tag selection yields multiple models with differing distributed/slurm settings, the manifest will silently contain config for just the first model. Either validate that only one model is allowed in this mode, or merge per-model configs in a deterministic way (and/or store them per built_model entry).

Comment on lines +779 to +904
def _execute_build_on_compute(
    self,
    registry: Optional[str] = None,
    clean_cache: bool = False,
    manifest_output: str = "build_manifest.json",
    batch_build_metadata: Optional[Dict] = None,
) -> str:
    """
    Execute Docker build on a SLURM compute node and push to registry.

    Build workflow:
    1. Build on 1 compute node only
    2. Push image to registry
    3. Store registry image name in manifest
    4. Run phase will pull image in parallel on all nodes

    Args:
        registry: Registry to push images to (REQUIRED)
        clean_cache: Whether to use --no-cache for Docker builds
        manifest_output: Output file for build manifest
        batch_build_metadata: Optional batch build metadata

    Returns:
        Path to generated build_manifest.json
    """
    import subprocess
    import os
    import glob

    self.rich_console.print(f"\n[dim]{'=' * 60}[/dim]")
    self.rich_console.print("[bold blue]🔨 BUILD PHASE (Compute Node Mode)[/bold blue]")
    self.rich_console.print("[cyan]Building on 1 compute node, pushing to registry...[/cyan]")
    self.rich_console.print(f"[dim]{'=' * 60}[/dim]\n")

    # Discover models first to get SLURM config from model card
    self.rich_console.print("[bold cyan]🔍 Discovering models...[/bold cyan]")
    discover_models = DiscoverModels(args=self.args)
    models = discover_models.run()

    if not models:
        raise DiscoveryError(
            "No models discovered for build-on-compute",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                "Check if models.json exists",
                "Verify --tags parameter is correct",
            ],
        )

    model = models[0]
    model_name = model.get("name", "unknown")
    self.rich_console.print(f"[green]✓ Found model: {model_name}[/green]\n")

    # Merge SLURM config: model card (base) + additional-context (override)
    model_slurm_config = model.get("slurm", {})
    context_slurm_config = self.additional_context.get("slurm", {})

    # Start with model card config, then override with command-line context
    slurm_config = {**model_slurm_config, **context_slurm_config}

    self.rich_console.print("[bold cyan]📋 SLURM Configuration (merged):[/bold cyan]")
    if model_slurm_config:
        self.rich_console.print(f" [dim]From model card:[/dim] {list(model_slurm_config.keys())}")
    if context_slurm_config:
        self.rich_console.print(f" [dim]From --additional-context (overrides):[/dim] {list(context_slurm_config.keys())}")

    # Validate required fields
    partition = slurm_config.get("partition")
    if not partition:
        raise ConfigurationError(
            "Missing required SLURM field: partition",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
            ),
            suggestions=[
                'Add "partition" to model card\'s slurm section',
                'Or specify via --additional-context \'{"slurm": {"partition": "gpu"}}\'',
            ],
        )

    reservation = slurm_config.get("reservation", "")
    time_limit = slurm_config.get("time", "02:00:00")

    self.rich_console.print(f" Partition: {partition}")
    self.rich_console.print(f" Time limit: {time_limit}")
    if reservation:
        self.rich_console.print(f" Reservation: {reservation}")
    self.rich_console.print("")

    # Validate registry credentials
    self.rich_console.print("[bold cyan]🔐 Registry Configuration:[/bold cyan]")
    self.rich_console.print(f" Registry: {registry}")

    # Check for credentials - either from environment or credential.json
    dockerhub_user = os.environ.get("MAD_DOCKERHUB_USER", "")
    dockerhub_password = os.environ.get("MAD_DOCKERHUB_PASSWORD", "")

    # Try to load from credential.json if env vars not set
    credential_file = Path("credential.json")
    if not dockerhub_user and credential_file.exists():
        try:
            with open(credential_file) as f:
                creds = json.load(f)
            dockerhub_creds = creds.get("dockerhub", {})
            dockerhub_user = dockerhub_creds.get("username", "")
            dockerhub_password = dockerhub_creds.get("password", "")
            if dockerhub_user:
                self.rich_console.print(f" Credentials: Found in credential.json")
        except (json.JSONDecodeError, IOError) as e:
            self.rich_console.print(f" [yellow]Warning: Could not read credential.json: {e}[/yellow]")
    elif dockerhub_user:
        self.rich_console.print(f" Credentials: Found in environment (MAD_DOCKERHUB_USER)")

    # Determine if registry requires authentication
    requires_auth = True
    public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
    registry_lower = registry.lower() if registry else ""

    # For docker.io pushes, authentication is always required
    if any(pub_reg in registry_lower for pub_reg in public_registries):
        if not dockerhub_user or not dockerhub_password:
            raise ConfigurationError(

Copilot AI Apr 16, 2026


_execute_build_on_compute() treats registry as required (calls registry.lower(), registry.replace(...), etc.) but the function signature allows registry: Optional[str] = None and there’s no guard. If this path is invoked programmatically (or via a code path that bypasses the CLI validation), it will raise an AttributeError rather than a user-facing ConfigurationError. Add an explicit if not registry: raise ConfigurationError(...) near the top of this method.

Comment on lines +787 to +793
result = subprocess.run(
    cmd,
    shell=True,
    cwd=script_dir,
    env=env,
    timeout=timeout if timeout > 0 else None,
)

Copilot AI Apr 16, 2026


subprocess.run(..., shell=True) is used with a command string that includes model_args (which can come from manifests / user input). This allows shell metacharacters in args to be interpreted by the shell (command injection). Build the argv list explicitly (e.g., ["bash", script_path, *shlex.split(model_args)] / ["python3", ...]) and run with shell=False.
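The reviewer's suggestion can be sketched as follows; the function and parameter names are illustrative, not the real container_runner signatures:

```python
import shlex
import subprocess

def run_script(script_path, model_args, env=None, timeout=None):
    """Build the argv explicitly so shell metacharacters inside
    model_args are passed as literal arguments, never interpreted
    by a shell (closes the injection vector noted above)."""
    argv = ["bash", script_path, *shlex.split(model_args)]
    return subprocess.run(argv, shell=False, env=env, timeout=timeout)
```

With shell=False, an argument like 'a; rm -rf /' reaches the script verbatim as $1 instead of being executed as a second command.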

Comment on lines +434 to +435
"#SBATCH --exclusive",
]

Copilot AI Apr 16, 2026


The slurm_multi wrapper script always adds #SBATCH --exclusive regardless of the user’s slurm.exclusive setting. In the standard SLURM template, exclusivity is conditional (job.sh.j2:20-22). Consider respecting self.slurm_config.get("exclusive", True) here as well to avoid changing behavior for users who explicitly disable exclusivity.

Suggested change

    # Removed (unconditional):
            "#SBATCH --exclusive",
        ]

    # Added (respect the user's slurm.exclusive setting):
        ]
        if self.slurm_config.get("exclusive", True):
            script_lines.append("#SBATCH --exclusive")

from madengine.utils.run_details import get_build_number, get_pipeline
from madengine.utils.path_utils import scripts_base_dir_from
import json
from typing import Optional

Copilot AI Apr 16, 2026


There is a duplicate import of Optional (from typing import Any, Dict, List, Optional and again from typing import Optional). Drop the second import to avoid redundancy.

Suggested change: delete the duplicate "from typing import Optional" line.
Comment on lines +603 to +632
# Add each discovered model with the pre-built image
# Use the image name as the key (matches how madengine build does it)
for model in models:
    model_name = model.get("name", "unknown")
    model_distributed = model.get("distributed", {})

    # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
    model_env_vars = model.get("env_vars", {}).copy()
    model_env_vars["DOCKER_IMAGE_NAME"] = use_image

    # Use image name as key so slurm.py can find docker_image
    manifest["built_models"][use_image] = {
        "name": model_name,
        "image": use_image,
        "docker_image": use_image,
        "dockerfile": model.get("dockerfile", ""),
        "scripts": model.get("scripts", ""),
        "data": model.get("data", ""),
        "n_gpus": model.get("n_gpus", "8"),
        "owner": model.get("owner", ""),
        "training_precision": model.get("training_precision", ""),
        "multiple_results": model.get("multiple_results", ""),
        "tags": model.get("tags", []),
        "timeout": model.get("timeout", -1),
        "args": model.get("args", ""),
        "slurm": model.get("slurm", {}),
        "distributed": model_distributed,
        "env_vars": model_env_vars,
        "prebuilt": True,
    }

Copilot AI Apr 16, 2026


In pre-built image mode, the for model in models: loop always writes to manifest["built_models"][use_image], so each iteration overwrites the previous one. If multiple models are discovered for the selected tags, only the last model will be present in built_models even though summary.successful_builds lists all of them. Use a per-model unique key in built_models (and keep it consistent with built_images lookups), while setting docker_image to the shared use_image value.

Comment on lines +896 to +918
# Determine if registry requires authentication
requires_auth = True
public_registries = ["docker.io", "ghcr.io", "gcr.io", "quay.io", "nvcr.io"]
registry_lower = registry.lower() if registry else ""

# For docker.io pushes, authentication is always required
if any(pub_reg in registry_lower for pub_reg in public_registries):
    if not dockerhub_user or not dockerhub_password:
        raise ConfigurationError(
            f"Registry credentials required for pushing to {registry}",
            context=create_error_context(
                operation="build_on_compute",
                component="BuildOrchestrator",
                registry=registry,
            ),
            suggestions=[
                "Set environment variables: MAD_DOCKERHUB_USER and MAD_DOCKERHUB_PASSWORD",
                'Or create credential.json: {"dockerhub": {"username": "...", "password": "..."}}',
                "For Docker Hub, use a Personal Access Token (PAT) as password",
                f"Example: export MAD_DOCKERHUB_USER=myuser",
                f"Example: export MAD_DOCKERHUB_PASSWORD=dckr_pat_xxxxx",
            ],
        )

Copilot AI Apr 16, 2026


The registry auth/error messaging here is Docker Hub–specific (MAD_DOCKERHUB_USER/PASSWORD, PAT guidance) but the logic also classifies ghcr.io, gcr.io, quay.io, nvcr.io as registries that “require credentials”. This will produce misleading guidance for non-DockerHub registries (and may block valid flows where different credentials are required). Consider either limiting this branch to Docker Hub only, or generalizing the env var names / suggestions based on registry_host.

Comment on lines 331 to +345
srun_cmd.extend(["bash", "-c", cleanup_script])

try:
    srun_cmd = [
        "srun",
        f"--nodelist={node}",
        "--ntasks=1",
        "--time=00:01:00",
        "--overlap",
        "--quiet",
    ]
    if self.reservation:
        srun_cmd.append(f"--reservation={self.reservation}")
    srun_cmd.extend(["bash", "-c", cleanup_script])


Copilot AI Apr 16, 2026


cleanup_node() builds srun_cmd (including job_name) before the try:, but then immediately reassigns srun_cmd inside the try: block. The reassignment drops the job_name argument and duplicates logic, making the first construction dead code and preventing health-check job cleanup from working as intended. Remove the duplicate reassignment and append both --job-name (when provided) and --reservation consistently.

Suggested change

    # Removed (the duplicate reassignment drops --job-name):
    srun_cmd.extend(["bash", "-c", cleanup_script])

    try:
        srun_cmd = [
            "srun",
            f"--nodelist={node}",
            "--ntasks=1",
            "--time=00:01:00",
            "--overlap",
            "--quiet",
        ]
        if self.reservation:
            srun_cmd.append(f"--reservation={self.reservation}")
        srun_cmd.extend(["bash", "-c", cleanup_script])

    # Added (keep the original srun_cmd construction, then append):
    if self.reservation:
        srun_cmd.append(f"--reservation={self.reservation}")
    srun_cmd.extend(["bash", "-c", cleanup_script])

    try:

Comment on lines +233 to +243
# Merge deployment_config into additional_context (for deployment layer to use).
# For dict-valued keys (slurm, k8s, etc.), deep-merge so manifest
# values fill in gaps while runtime --additional-context wins on conflicts.
for key in ["slurm", "k8s", "kubernetes", "distributed", "vllm", "env_vars", "debug"]:
    if key in deployment_config:
        if key not in self.additional_context:
            self.additional_context[key] = deployment_config[key]
        elif isinstance(deployment_config[key], dict) and isinstance(self.additional_context[key], dict):
            merged = dict(deployment_config[key])
            merged.update(self.additional_context[key])
            self.additional_context[key] = merged

Copilot AI Apr 16, 2026


The comment says “deep-merge” but the implementation here is a one-level dict merge (merged = dict(...); merged.update(...)). Nested dictionaries under these keys will still be overwritten rather than recursively merged. Either adjust the comment to describe a shallow merge, or implement an actual recursive deep merge for nested structures.
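For reference, a recursive deep merge of the kind the comment describes would look like this (an illustrative helper, not code from this PR):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge two dicts: override wins on scalar conflicts,
    while nested dicts are merged level by level instead of being
    replaced wholesale, as the one-level dict.update() merge does."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With the shallow merge, overriding only slurm.nodes would discard slurm.time from the manifest; the recursive version keeps both.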

