CLI Command Reference

Complete reference for all madengine CLI commands with detailed options and examples.

Overview

madengine provides a modern CLI for AI model automation and distributed execution. All commands follow a consistent pattern with rich terminal output and comprehensive error handling.

madengine [OPTIONS] COMMAND [ARGS]...

Global Options

These options are available for the main madengine command:

| Option | Description |
| --- | --- |
| --version | Show version and exit |
| --help | Show help message and exit |

Commands

discover - Discover Available Models

List the models available in the MAD package, optionally filtered by tag.

Usage:

madengine discover [OPTIONS]

Options:

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --tags | -t | TEXT | [] | Model tags to discover (can specify multiple) |
| --verbose | -v | FLAG | False | Enable verbose logging |

Examples:

# Discover all models
madengine discover

# Discover specific models by tag
madengine discover --tags dummy pyt_huggingface_bert

# Multiple tags with comma separation
madengine discover --tags dummy,multi,vllm

# With verbose output
madengine discover --tags model --verbose

# Directory-specific models
madengine discover --tags dummy2:dummy_2

# Dynamic models with parameters
madengine discover --tags dummy3:dummy_3:batch_size=512

Discovery Methods:

  1. Root models - From models.json in MAD package root
  2. Directory-specific - From scripts/{dir}/models.json
  3. Dynamic models - Generated by scripts/{dir}/get_models_json.py
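
The tag forms used in the examples above (a plain tag, directory:model, and directory:model:parameters) can be told apart with a small shell helper. This is only an illustrative sketch: the function name describe_tag and its output labels are hypothetical, and the colon-separated grammar is inferred from the examples rather than taken from madengine's own parser.

```shell
#!/bin/bash

# Hypothetical helper (not part of madengine): classify a --tags value
# by the number of colon-separated fields it contains.
describe_tag() {
  local tag="$1" dir rest model params
  case "$tag" in
    *:*:*)
      dir="${tag%%:*}"        # text before the first colon
      rest="${tag#*:}"        # text after the first colon
      model="${rest%%:*}"     # second field
      params="${rest#*:}"     # everything after the second colon
      echo "directory: $dir, model: $model, parameters: $params"
      ;;
    *:*)
      echo "directory: ${tag%%:*}, model: ${tag#*:}"
      ;;
    *)
      echo "plain tag: $tag"
      ;;
  esac
}

describe_tag "dummy"
describe_tag "dummy2:dummy_2"
describe_tag "dummy3:dummy_3:batch_size=512"
```

A helper like this may be useful in wrapper scripts that need to log which discovery method a tag is expected to use before calling madengine discover.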

build - Build Docker Images

Build Docker images for models, optionally pushing them to a registry.

Usage:

madengine build [OPTIONS]

Options:

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --tags | -t | TEXT | [] | Model tags to build (can specify multiple) |
| --target-archs | -a | TEXT | [] | Target GPU architectures (e.g., gfx908,gfx90a,gfx942) |
| --registry | -r | TEXT | None | Docker registry to push images to |
| --batch-manifest | | TEXT | None | Input batch.json file for batch build mode |
| --additional-context | -c | TEXT | "{}" | Additional context as JSON string |
| --additional-context-file | -f | TEXT | None | File containing additional context JSON |
| --clean-docker-cache | | FLAG | False | Rebuild images without using cache |
| --manifest-output | -m | TEXT | build_manifest.json | Output file for build manifest |
| --summary-output | -s | TEXT | None | Output file for build summary JSON |
| --live-output | -l | FLAG | False | Print output in real-time |
| --verbose | -v | FLAG | False | Enable verbose logging |

Examples:

# Basic build
madengine build --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Build with registry
madengine build --tags model \
  --registry docker.io/myorg \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Build multiple models
madengine build --tags model1 model2 model3 \
  --registry localhost:5000

# Build for multiple GPU architectures
madengine build --tags model \
  --target-archs gfx908 gfx90a gfx942 \
  --registry gcr.io/myproject

# Clean rebuild without cache
madengine build --tags model --clean-docker-cache

# Batch build mode (selective builds)
madengine build --batch-manifest batch.json \
  --registry docker.io/myorg \
  --additional-context-file config.json

# Custom manifest output
madengine build --tags model \
  --manifest-output my_manifest.json \
  --summary-output build_summary.json

# Real-time output with verbose logging
madengine build --tags model --live-output --verbose

Default Values:

The build command applies the following defaults if not specified:

  • gpu_vendor: AMD
  • guest_os: UBUNTU

Example with defaults:

# Equivalent to providing {"gpu_vendor": "AMD", "guest_os": "UBUNTU"}
madengine build --tags dummy

You will see a message indicating which defaults were applied:

ℹ️  Using default values for build configuration:
   • gpu_vendor: AMD (default)
   • guest_os: UBUNTU (default)

💡 To customize, use --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'

Supported Values:

  • gpu_vendor: "AMD" or "NVIDIA"
  • guest_os: "UBUNTU" or "CENTOS"
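
A wrapper script can reject unsupported values before invoking a build. The following is a minimal sketch, assuming the two supported-value lists above are exhaustive and case-sensitive; check_build_context is a hypothetical helper, not a madengine command.

```shell
#!/bin/bash

# Hypothetical pre-flight check (not part of madengine): verify that
# gpu_vendor and guest_os are among the documented supported values.
check_build_context() {
  local vendor="$1" os="$2"
  case "$vendor" in
    AMD|NVIDIA) ;;
    *) echo "unsupported gpu_vendor: $vendor" >&2; return 1 ;;
  esac
  case "$os" in
    UBUNTU|CENTOS) ;;
    *) echo "unsupported guest_os: $os" >&2; return 1 ;;
  esac
}

if check_build_context "AMD" "UBUNTU"; then
  echo "context values look valid"
fi
```

Running the check first gives a clear error message in CI logs before any Docker work starts.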

Batch Build Mode:

When using --batch-manifest, provide a JSON file with selective build configuration:

[
  {
    "model_name": "model1",
    "build_new": true,
    "registry": "docker.io/myorg",
    "registry_image": "custom-namespace/model1"
  },
  {
    "model_name": "model2",
    "build_new": false
  }
]

See Batch Build Guide for details.


run - Execute Models

Run models locally or deploy to Kubernetes/SLURM clusters.

Usage:

madengine run [OPTIONS]

Options:

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --tags | -t | TEXT | [] | Model tags to run (can specify multiple) |
| --manifest-file | -m | TEXT | "" | Build manifest file path (for pre-built images) |
| --rocm-path | | TEXT | None | ROCm installation root (default: ROCM_PATH env or /opt/rocm). Use when ROCm is not in /opt/rocm (e.g. Rock, pip). |
| --registry | -r | TEXT | None | Docker registry URL |
| --timeout | | INT | -1 | Timeout in seconds (-1=default 7200s, 0=no timeout) |
| --additional-context | -c | TEXT | "{}" | Additional context as JSON string |
| --additional-context-file | -f | TEXT | None | File containing additional context JSON |
| --keep-alive | | FLAG | False | Keep Docker containers alive after run |
| --keep-model-dir | | FLAG | False | Keep model directory after run |
| --clean-docker-cache | | FLAG | False | Rebuild images without using cache (full workflow) |
| --skip-model-run | | FLAG | False | After a build in this invocation, skip executing models (manifest/images still produced). Ignored when using --manifest-file with an existing manifest (run-only), or when no build ran in this invocation. See Usage — Skip model run. |
| --manifest-output | | TEXT | build_manifest.json | Output file for build manifest (full workflow) |
| --summary-output | -s | TEXT | None | Output file for summary JSON |
| --live-output | -l | FLAG | False | Print output in real-time |
| --output | -o | TEXT | perf_entry.csv | Performance output file |
| --ignore-deprecated | | FLAG | False | Force run deprecated models |
| --data-config | | TEXT | data.json | Custom data configuration file |
| --tools-config | | TEXT | tools.json | Custom tools JSON configuration |
| --sys-env-details | | FLAG | True | Generate system config env details |
| --force-mirror-local | | TEXT | None | Path to force local data mirroring |
| --disable-skip-gpu-arch | | FLAG | False | Disable skipping models based on GPU architecture |
| --verbose | -v | FLAG | False | Enable verbose logging |
| --cleanup-perf | | FLAG | False | Remove intermediate perf_entry files after run (keeps perf.csv and perf_super files) |
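
The --timeout values have two special cases that are easy to misread. The sketch below restates the documented semantics as a shell function; resolve_timeout is a hypothetical name used only for illustration, and the word "none" is an arbitrary marker for "no timeout", not something madengine prints.

```shell
#!/bin/bash

# Hypothetical helper (not part of madengine): map a --timeout value to
# the effective limit described in the options table.
resolve_timeout() {
  local requested="$1"
  case "$requested" in
    -1) echo 7200 ;;          # -1 selects the default of 7200 seconds
    0)  echo "none" ;;        # 0 disables the timeout
    *)  echo "$requested" ;;  # any other value is used as given
  esac
}

resolve_timeout -1
resolve_timeout 0
resolve_timeout 3600
```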

Examples:

# Local execution
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Custom ROCm path (when ROCm is not in /opt/rocm, e.g. Rock or pip install)
madengine run --tags dummy --rocm-path /path/to/rocm \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Run with pre-built images (manifest-based)
madengine run --manifest-file build_manifest.json

# Build in this invocation but skip executing containers (CI: images + manifest only)
madengine run --tags model \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' \
  --skip-model-run

# Multi-GPU with torchrun
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0,1,2,3",
    "distributed": {
      "launcher": "torchrun",
      "nproc_per_node": 4
    }
  }'

# Kubernetes deployment (minimal config)
madengine run --tags model \
  --additional-context '{"k8s": {"gpu_count": 2}}'

# Kubernetes multi-node with vLLM
madengine run --tags model \
  --additional-context '{
    "k8s": {"gpu_count": 8},
    "distributed": {
      "launcher": "vllm",
      "nnodes": 2,
      "nproc_per_node": 4
    }
  }'

# SLURM deployment
madengine run --tags model \
  --additional-context '{
    "slurm": {
      "partition": "gpu",
      "nodes": 4,
      "gpus_per_node": 8
    },
    "distributed": {
      "launcher": "torchtitan",
      "nnodes": 4,
      "nproc_per_node": 8
    }
  }'

# With profiling tools
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "rocprof"},
      {"name": "gpu_info_power_profiler"}
    ]
  }'

# Explicit timeout (2 hours, the same as the default)
madengine run --tags model --timeout 7200

# No timeout (run indefinitely)
madengine run --tags model --timeout 0

# Keep container alive for debugging
madengine run --tags model --keep-alive --verbose

# Real-time output
madengine run --tags model --live-output

# Custom performance output file
madengine run --tags model --output my_perf_results.csv

# Clean up intermediate perf files after run
madengine run --tags model --cleanup-perf

# Using configuration file
madengine run --tags model \
  --additional-context-file k8s-config.json

Execution Modes:

  1. Full Workflow - Build + Run (when no manifest exists)
  2. Execution Only - Run only (when manifest-file provided and exists)
  3. Manifest-based - Use pre-built images from manifest

Deployment Targets:

  • Local - Docker containers on local machine
  • Kubernetes - Detected when k8s key present in context
  • SLURM - Detected when slurm key present in context
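
The target selection described above can be approximated in a wrapper script to log where a run is expected to go. This is a rough sketch under stated assumptions: detect_target is a hypothetical helper, it uses a plain substring check rather than a JSON parser, and the choice to test k8s before slurm is arbitrary because the behavior when both keys are present is not specified here.

```shell
#!/bin/bash

# Hypothetical helper (not part of madengine): guess the deployment
# target from the additional-context JSON string.
detect_target() {
  local context="$1"
  case "$context" in
    *'"k8s"'*)   echo "kubernetes" ;;  # k8s key present
    *'"slurm"'*) echo "slurm" ;;       # slurm key present
    *)           echo "local" ;;       # neither key: local Docker
  esac
}

detect_target '{"k8s": {"gpu_count": 2}}'
detect_target '{"slurm": {"partition": "gpu", "nodes": 4}}'
detect_target '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

A substring check would also match the text "k8s" inside a string value, so a real implementation should parse the JSON.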

Performance Output:

Results are saved to a CSV file (default: perf_entry.csv) with metrics including:

  • Execution time
  • GPU utilization
  • Memory usage
  • Model-specific performance metrics

report - Generate Reports

Generate HTML reports from CSV performance files.

Subcommands

report to-html - Convert CSV to HTML

Convert a single CSV file to HTML table format.

Usage:

madengine report to-html [OPTIONS]

Options:

| Option | Short | Type | Required | Description |
| --- | --- | --- | --- | --- |
| --csv-file | | TEXT | Yes | Path to the CSV file to convert |
| --verbose | -v | FLAG | No | Enable verbose logging |

Examples:

# Convert CSV to HTML
madengine report to-html --csv-file perf_entry.csv

# With custom CSV file
madengine report to-html --csv-file results/perf_mi300.csv

# Verbose output
madengine report to-html --csv-file perf.csv --verbose

Output: Creates {filename}.html in the same directory as the CSV file.


report to-email - Generate Email Report

Convert all CSV files in a directory to a consolidated email-ready HTML report.

Usage:

madengine report to-email [OPTIONS]

Options:

| Option | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --directory | --dir | TEXT | "." | Path to directory containing CSV files |
| --output | -o | TEXT | run_results.html | Output HTML filename |
| --verbose | -v | FLAG | False | Enable verbose logging |

Examples:

# Generate email report from current directory
madengine report to-email

# Specify directory
madengine report to-email --directory ./results

# Custom output filename
madengine report to-email --dir ./results --output summary.html

# Verbose output
madengine report to-email --directory ./results --verbose

Output: Creates consolidated HTML report suitable for email distribution.


database - Upload to MongoDB

Upload CSV performance data to a MongoDB database.

Usage:

madengine database [OPTIONS]

Options:

| Option | Short | Type | Default | Required | Description |
| --- | --- | --- | --- | --- | --- |
| --csv-file | | TEXT | perf_entry.csv | No | Path to the CSV file to upload |
| --database-name | --db | TEXT | None | Yes | Name of the MongoDB database |
| --collection-name | --collection | TEXT | None | Yes | Name of the MongoDB collection |
| --verbose | -v | FLAG | False | No | Enable verbose logging |

Examples:

# Upload to MongoDB
madengine database \
  --csv-file perf_entry.csv \
  --database-name mydb \
  --collection-name results

# Short option names
madengine database \
  --csv-file perf.csv \
  --db test \
  --collection perf_data

# With verbose output
madengine database \
  --csv-file perf.csv \
  --db mydb \
  --collection results \
  --verbose

Environment Variables:

MongoDB connection details are read from environment variables:

| Variable | Description | Example |
| --- | --- | --- |
| MONGO_HOST | MongoDB host address | localhost or mongodb.example.com |
| MONGO_PORT | MongoDB port | 27017 |
| MONGO_USER | MongoDB username | admin |
| MONGO_PASSWORD | MongoDB password | secretpassword |

Example Setup:

export MONGO_HOST=mongodb.example.com
export MONGO_PORT=27017
export MONGO_USER=myuser
export MONGO_PASSWORD=mypassword

madengine database \
  --csv-file perf_entry.csv \
  --db performance_db \
  --collection model_runs
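
In CI it can help to fail early when credentials are absent, before the upload is attempted. The sketch below assumes that MONGO_USER and MONGO_PASSWORD must be set for your deployment (this reference does not state whether they are mandatory); mongo_env_missing is a hypothetical helper, not a madengine command.

```shell
#!/bin/bash

# Hypothetical pre-flight check (not part of madengine): print the names
# of the MongoDB credential variables that are unset or empty.
mongo_env_missing() {
  local missing=""
  [ -n "${MONGO_USER:-}" ] || missing="$missing MONGO_USER"
  [ -n "${MONGO_PASSWORD:-}" ] || missing="$missing MONGO_PASSWORD"
  echo "${missing# }"   # strip the leading space, if any
}

missing_vars="$(mongo_env_missing)"
if [ -n "$missing_vars" ]; then
  echo "Missing MongoDB settings: $missing_vars" >&2
else
  echo "MongoDB credentials are set"
fi
```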

Exit Codes

madengine uses standard exit codes so scripts and CI (e.g. Jenkins) can detect success or failure:

| Code | Constant | Description |
| --- | --- | --- |
| 0 | SUCCESS | Command completed successfully |
| 1 | FAILURE | General failure |
| 2 | BUILD_FAILURE | One or more image builds failed (e.g. Docker build error) |
| 3 | RUN_FAILURE | One or more model executions failed |
| 4 | INVALID_ARGS | Invalid command-line arguments or configuration |

Failure recording: Pre-run failures (e.g. image pull, setup) and run failures are recorded in the performance table (perf.csv) with status FAILURE, so all attempted models appear in the CSV. The file is created automatically if missing.

Example usage in scripts / CI:

#!/bin/bash

madengine build --tags model
rc=$?  # capture immediately; every later command overwrites $?
if [ "$rc" -eq 0 ]; then
  echo "Build successful"
  madengine run --manifest-file build_manifest.json
else
  echo "Build failed with exit code $rc"
  exit "$rc"
fi
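
To make CI logs easier to read, a script can translate the numeric code into the constant names from the table above. This is a small sketch; describe_exit_code is a hypothetical helper, and the UNKNOWN fallback is an addition for codes the table does not list.

```shell
#!/bin/bash

# Hypothetical helper (not part of madengine): map an exit code to the
# constant name documented in the exit code table.
describe_exit_code() {
  case "$1" in
    0) echo "SUCCESS" ;;
    1) echo "FAILURE" ;;
    2) echo "BUILD_FAILURE" ;;
    3) echo "RUN_FAILURE" ;;
    4) echo "INVALID_ARGS" ;;
    *) echo "UNKNOWN" ;;   # not listed in the table
  esac
}

# Usage after a command: save $? first, then describe it.
rc=2
echo "exit code $rc means $(describe_exit_code "$rc")"
```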

Configuration File Format

For complex configurations, use JSON files with --additional-context-file:

Example: config.json

{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "docker_gpus": "0,1,2,3",
  "timeout_multiplier": 2.0,
  "docker_env_vars": {
    "PYTORCH_TUNABLEOP_ENABLED": "1",
    "HSA_ENABLE_SDMA": "0",
    "NCCL_DEBUG": "INFO"
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 1,
    "nproc_per_node": 4
  }
}

Example: k8s-config.json

{
  "gpu_vendor": "AMD",
  "k8s": {
    "namespace": "ml-team",
    "gpu_count": 8,
    "cpu_request": "32",
    "memory_request": "256Gi",
    "node_selector": {
      "gpu-type": "mi300x"
    }
  },
  "distributed": {
    "launcher": "vllm",
    "nnodes": 2,
    "nproc_per_node": 4
  }
}

Example: slurm-config.json

{
  "gpu_vendor": "AMD",
  "slurm": {
    "partition": "gpu",
    "nodes": 4,
    "gpus_per_node": 8,
    "time": "24:00:00",
    "account": "ml_research",
    "qos": "high"
  },
  "distributed": {
    "launcher": "torchtitan",
    "nnodes": 4,
    "nproc_per_node": 8
  }
}

To run on specific nodes, add "nodelist": "node01,node02" to the slurm section. When set, the job runs only on those nodes and node health preflight is skipped. See examples/slurm-configs/basic/03-multi-node-basic-nodelist.json.

Run phase: log error pattern scan (optional)

These keys apply to local Docker runs when madengine post-processes the run log. Use them when substring matches cause false FAILURE status (for example benign RuntimeError: lines). Full details: Configuration — Run phase: log error pattern scan.

| Key | Description |
| --- | --- |
| log_error_pattern_scan | Default true. Set false to skip grep-based log failure detection. |
| log_error_benign_patterns | Array of extra strings to exclude from matching (merged with built-in benign list). |
| log_error_patterns | Non-empty array replaces the default substring list (advanced). |
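
The interaction between error patterns and benign patterns can be pictured with a short grep-based sketch. It is an approximation under stated assumptions, not madengine's implementation: scan_log is a hypothetical function, patterns are read from files with one fixed string per line, and a benign pattern is assumed to excuse the whole log line it appears on.

```shell
#!/bin/bash

# Hypothetical sketch (not madengine's code): report FAILURE when a log
# line contains an error substring and no benign substring.
scan_log() {
  local log="$1" patterns="$2" benign="$3" hits
  # Lines containing any error substring (fixed-string match).
  hits=$(grep -F -f "$patterns" "$log" || true)
  # Drop lines that also contain a benign substring, if any are given.
  if [ -s "$benign" ]; then
    hits=$(printf '%s\n' "$hits" | grep -v -F -f "$benign" || true)
  fi
  if [ -n "$hits" ]; then echo "FAILURE"; else echo "SUCCESS"; fi
}

# Example with made-up log content and patterns.
workdir=$(mktemp -d)
printf 'step ok\nRuntimeError: expected during warmup\n' > "$workdir/run.log"
printf 'RuntimeError:\n' > "$workdir/patterns.txt"
printf 'expected during warmup\n' > "$workdir/benign.txt"
scan_log "$workdir/run.log" "$workdir/patterns.txt" "$workdir/benign.txt"
```

In this picture, log_error_benign_patterns adds lines to the benign file, while log_error_patterns replaces the contents of the pattern file.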

Environment Variables

madengine recognizes these environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| MODEL_DIR | Path to MAD package directory | Auto-detected |
| ROCM_PATH | ROCm installation root (used when --rocm-path not set) | /opt/rocm |
| MAD_VERBOSE_CONFIG | Enable verbose configuration logging | false |
| MAD_DOCKERHUB_USER | Docker Hub username | None |
| MAD_DOCKERHUB_PASSWORD | Docker Hub password/token | None |
| MAD_DOCKERHUB_REPO | Docker Hub repository | None |
| MAD_CONTAINER_IMAGE | Pre-built container image to use | None |
| MONGO_HOST | MongoDB host for database command | localhost |
| MONGO_PORT | MongoDB port for database command | 27017 |
| MONGO_USER | MongoDB username | None |
| MONGO_PASSWORD | MongoDB password | None |
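
The ROCm root is described as coming from --rocm-path first, then ROCM_PATH, then /opt/rocm. The sketch below restates that precedence in shell; resolve_rocm_path is a hypothetical helper, and treating an empty ROCM_PATH the same as an unset one is an assumption.

```shell
#!/bin/bash

# Hypothetical helper (not part of madengine): pick the ROCm root using
# the documented precedence: CLI value, then ROCM_PATH, then /opt/rocm.
resolve_rocm_path() {
  local cli_value="${1:-}"
  if [ -n "$cli_value" ]; then
    echo "$cli_value"
  else
    echo "${ROCM_PATH:-/opt/rocm}"
  fi
}

resolve_rocm_path "/path/to/rocm"   # explicit value wins
resolve_rocm_path                   # falls back to ROCM_PATH or /opt/rocm
```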

Best Practices

  1. Use configuration files for complex setups instead of long command lines
  2. Separate build and run phases for distributed deployments
  3. Test locally first before deploying to clusters
  4. Use registries for distributed execution across multiple nodes
  5. Enable verbose logging (--verbose) when debugging issues
  6. Use real-time output (--live-output) for long-running operations
  7. Version your configuration files alongside your model code
  8. Use batch build mode for CI/CD pipelines to optimize build times

Version: 2.0.0
Last Updated: December 2025