Complete guide to using madengine for running AI models locally and in distributed environments.
📖 Quick Reference: For detailed command options and flags, see the CLI Command Reference.
- Python 3.8+ with madengine installed
- Docker with GPU support
- MAD package cloned locally
```bash
git clone https://github.com/ROCm/MAD.git
cd MAD
pip install git+https://github.com/ROCm/madengine.git
```

```bash
# Discover models
madengine discover --tags dummy

# Run locally (full workflow: discover/build/run as configured by the model)
madengine run --tags dummy

# Or with explicit configuration
madengine run --tags dummy \
    --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

Note: `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` for build operations. For production or non-AMD/Ubuntu environments, specify these values explicitly.

Results are saved to `perf_entry.csv`.
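Since the results file is plain CSV, it can be post-processed with standard tooling. A minimal sketch in Python (column names other than `model_name` vary by setup and are shown here only as an illustration):

```python
import csv

def summarize(csv_path="perf_entry.csv"):
    """Print one line per model from a madengine results CSV and return the rows."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        # Column names beyond model_name may differ between madengine versions.
        print(row.get("model_name"), row.get("execution_time"))
    return rows
```

This reads every row into a dictionary keyed by the CSV header, so it keeps working if extra columns are added.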
madengine provides five main commands:
| Command | Purpose | Common Options |
|---|---|---|
| `discover` | Find available models | `--tags`, `--verbose` |
| `build` | Build Docker images | `--tags`, `--registry`, `--batch-manifest` |
| `run` | Execute models | `--tags`, `--manifest-file`, `--timeout` |
| `report` | Generate HTML reports | `to-html`, `to-email` |
| `database` | Upload to MongoDB | `--csv-file`, `--database-name` |
For complete command options and detailed examples, see CLI Command Reference.
```bash
# Discover models
madengine discover --tags dummy

# Build image (uses AMD/UBUNTU defaults)
madengine build --tags model

# Run model
madengine run --tags model

# For NVIDIA or other configurations, specify explicitly:
# madengine build --tags model --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}'

# Generate HTML report
madengine report to-html --csv-file perf_entry.csv

# Upload to MongoDB
madengine database --csv-file perf_entry.csv \
    --database-name mydb --collection-name results
```

madengine supports three discovery methods:
Central model definitions in MAD package root:

```bash
madengine discover --tags dummy pyt_huggingface_bert
```

Models organized in subdirectories (`scripts/{dir}/models.json`):

```bash
madengine discover --tags dummy2:dummy_2
```

Python-generated models (`scripts/{dir}/get_models_json.py`):

```bash
madengine discover --tags dummy3:dummy_3:batch_size=512:in=32
```

Create Docker images and manifest:

```bash
madengine build --tags model \
    --registry localhost:5000 \
    --additional-context-file config.json
```

Creates `build_manifest.json`:
```json
{
  "models": [
    {
      "model_name": "my_model",
      "image": "localhost:5000/my_model:20240115_123456",
      "tag": "my_model"
    }
  ],
  "registry": "localhost:5000",
  "build_timestamp": "2024-01-15T12:34:56Z"
}
```

Include deployment configuration:
```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "k8s": {
    "gpu_count": 2,
    "namespace": "ml-team"
  }
}
```

```bash
madengine build --tags model \
    --registry docker.io/myorg \
    --additional-context-file k8s-config.json
```

The deployment config is saved in `build_manifest.json` and used during the run phase.
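Because the manifest is plain JSON, downstream scripts can inspect it to see which images a build produced. A small sketch, assuming only the fields shown in the example manifest above:

```python
import json

def list_built_images(manifest_path="build_manifest.json"):
    """Return (model_name, image) pairs from a madengine build manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    # "models", "model_name", and "image" match the manifest example above.
    return [(m["model_name"], m["image"]) for m in manifest.get("models", [])]
```

This is handy in CI for logging or for pushing the produced images to another registry.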
Configure in `credential.json` (MAD package root):

```json
{
  "dockerhub": {
    "username": "your_username",
    "password": "your_token",
    "repository": "myorg"
  }
}
```

Or use environment variables:

```bash
export MAD_DOCKERHUB_USER=your_username
export MAD_DOCKERHUB_PASSWORD=your_token
export MAD_DOCKERHUB_REPO=myorg
```

Batch build mode enables selective builds with per-model configuration, ideal for CI/CD pipelines where you need fine-grained control over which models to rebuild.
Create a JSON file (e.g., batch.json) with a list of model entries:
```json
[
  {
    "model_name": "model1",
    "build_new": true,
    "registry": "my-registry.com",
    "registry_image": "custom-namespace/model1"
  },
  {
    "model_name": "model2",
    "build_new": false,
    "registry": "my-registry.com",
    "registry_image": "custom-namespace/model2"
  },
  {
    "model_name": "model3",
    "build_new": true
  }
]
```

Fields:

- `model_name` (required): Model tag to include
- `build_new` (optional, default: false): If true, build this model; if false, reference the existing image
- `registry` (optional): Per-model registry override
- `registry_image` (optional): Custom registry image name/namespace
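In a CI pipeline, the batch manifest can be generated from the list of changed models rather than written by hand. A sketch using only the four fields above (model names and registry are placeholders):

```python
import json

def make_batch_manifest(all_models, changed, registry=None):
    """Build a batch-manifest list: rebuild changed models, reference the rest."""
    entries = []
    for name in all_models:
        entry = {"model_name": name, "build_new": name in changed}
        if registry:
            entry["registry"] = registry  # optional per-model override
        entries.append(entry)
    return entries

manifest = make_batch_manifest(["model1", "model2"], changed={"model1"},
                               registry="localhost:5000")
with open("batch.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

The resulting `batch.json` can be passed straight to `madengine build --batch-manifest batch.json`.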
```bash
# Basic batch build
madengine build --batch-manifest batch.json \
    --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# With global registry (can be overridden per model)
madengine build --batch-manifest batch.json \
    --registry localhost:5000 \
    --additional-context-file config.json

# Verbose output
madengine build --batch-manifest batch.json \
    --registry my-registry.com \
    --verbose
```

Selective Building: Only models with `"build_new": true` are built. Models with `"build_new": false` are added to the output manifest without building, useful for referencing existing images.

Per-Model Registry Override: Each model can specify its own `registry` and `registry_image`, overriding the global `--registry` flag.

Mutually Exclusive: Cannot use `--batch-manifest` and `--tags` together.
CI/CD Incremental Builds:

```json
[
  {"model_name": "changed_model", "build_new": true},
  {"model_name": "unchanged_model1", "build_new": false},
  {"model_name": "unchanged_model2", "build_new": false}
]
```

Multi-Registry Deployment:

```json
[
  {
    "model_name": "public_model",
    "build_new": true,
    "registry": "docker.io/myorg"
  },
  {
    "model_name": "private_model",
    "build_new": true,
    "registry": "gcr.io/myproject"
  }
]
```

Development vs Production:

```json
[
  {
    "model_name": "dev_model",
    "build_new": true,
    "registry": "localhost:5000"
  },
  {
    "model_name": "prod_model",
    "build_new": false,
    "registry": "prod-registry.com",
    "registry_image": "production/model"
  }
]
```

When `madengine run` builds in the same invocation (no pre-existing `--manifest-file`), you can pass `--skip-model-run` to produce images and `build_manifest.json` without running model containers.
- Ignored when `--manifest-file` points at an existing manifest (execution-only mode): use plain `madengine run --manifest-file ...` to run later.
- Ignored with a warning if this invocation did not perform a build (for example, a manifest was already present and no rebuild occurred).

```bash
madengine run --tags model \
    --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' \
    --skip-model-run
```

See CLI Reference - run and `madengine run --help`.
Run on local machine:

```bash
madengine run --tags model \
    --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

Required for Local:

- `gpu_vendor`: "AMD", "NVIDIA"
- `guest_os`: "UBUNTU", "CENTOS"
When ROCm is not installed under /opt/rocm (e.g. TheRock or pip), set the ROCm root so GPU detection and container environment use the correct paths:

```bash
# Via environment variable
export ROCM_PATH=/path/to/rocm
madengine run --tags model --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

# Via CLI (overrides ROCM_PATH)
madengine run --tags model --rocm-path /path/to/rocm \
    --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

`--rocm-path` applies only to the run command (not build). See CLI Reference - run.
```bash
# Build phase
madengine build --tags model \
    --registry gcr.io/myproject \
    --additional-context '{"k8s": {"gpu_count": 2}}'

# Deploy phase
madengine run --manifest-file build_manifest.json
```

The deployment target is automatically detected from the `k8s` key in the configuration.
```bash
# Build phase (local or CI)
madengine build --tags model \
    --registry my-registry.io \
    --additional-context '{"slurm": {"partition": "gpu", "gpus_per_node": 4}}'

# Deploy phase (on SLURM login node)
ssh user@hpc-login.example.com
madengine run --manifest-file build_manifest.json
```

The deployment target is automatically detected from the `slurm` key in the configuration. To run on specific nodes, set `slurm.nodelist` (e.g. `"nodelist": "node01,node02"`); see Configuration and examples/slurm-configs/basic/03-multi-node-basic-nodelist.json.
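Combining the SLURM keys mentioned above, a context file for a pinned multi-node run could look like the following sketch (only `partition`, `gpus_per_node`, and `nodelist` come from this guide; `gpu_vendor` and `guest_os` follow the usual context format):

```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "slurm": {
    "partition": "gpu",
    "gpus_per_node": 4,
    "nodelist": "node01,node02"
  }
}
```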
Use configuration files for complex settings:
config.json:
```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "timeout_multiplier": 2.0,
  "docker_env_vars": {
    "PYTORCH_TUNABLEOP_ENABLED": "1",
    "HSA_ENABLE_SDMA": "0"
  }
}
```

```bash
madengine run --tags model --additional-context-file config.json
```

```bash
# Override default timeout
madengine run --tags model --timeout 7200

# No timeout (run indefinitely)
madengine run --tags model --timeout 0
```

```bash
# Keep containers alive
madengine run --tags model --keep-alive

# Verbose output
madengine run --tags model --verbose --live-output

# Both
madengine run --tags model --keep-alive --verbose --live-output
```

If the run is marked FAILURE because the log contains benign substrings (for example `RuntimeError:`) while the workload actually passed, configure log error pattern scan (`log_error_pattern_scan`, `log_error_benign_patterns`).
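The idea behind the scan can be sketched in a few lines: treat a log line as a failure only if it matches an error pattern and no benign pattern. This is a conceptual illustration, not madengine's actual implementation; the configuration keys above control the real behavior:

```python
import re

def log_has_real_error(log_lines, error_patterns, benign_patterns):
    """Return True if any line matches an error pattern and no benign pattern."""
    for line in log_lines:
        if any(re.search(p, line) for p in error_patterns):
            if not any(re.search(p, line) for p in benign_patterns):
                return True  # genuine failure indicator
    return False
```

A line like "testing RuntimeError: handling works" would then be suppressed by a benign pattern covering the test harness message.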
```bash
# Rebuild without Docker cache
madengine build --tags model --clean-docker-cache
```

Profile GPU usage and library calls:
```bash
# GPU profiling
madengine run --tags model \
    --additional-context '{
      "gpu_vendor": "AMD",
      "guest_os": "UBUNTU",
      "tools": [{"name": "rocprof"}]
    }'

# Library tracing
madengine run --tags model \
    --additional-context '{"tools": [{"name": "rocblas_trace"}]}'

# Multiple tools (stackable)
madengine run --tags model \
    --additional-context '{"tools": [
      {"name": "rocprof"},
      {"name": "miopen_trace"}
    ]}'
```

See Profiling Guide and CLI Reference - run command for details.
Convert performance CSV files to viewable HTML reports:
```bash
# Single CSV to HTML
madengine report to-html --csv-file perf_entry.csv
# Result: creates perf_entry.html in the same directory
```

Generate a single HTML report from multiple CSV files:
```bash
# Process all CSV files in current directory
madengine report to-email

# Specify directory
madengine report to-email --directory ./results

# Custom output filename
madengine report to-email --dir ./results --output weekly_summary.html
```

Use Cases:
- Weekly performance summaries
- CI/CD result reports
- Team email distributions
- Performance trend analysis
Store performance data in MongoDB for long-term tracking:
```bash
# Configure MongoDB connection
export MONGO_HOST=mongodb.example.com
export MONGO_PORT=27017
export MONGO_USER=performance_user
export MONGO_PASSWORD=secretpassword

# Upload results
madengine database \
    --csv-file perf_entry.csv \
    --database-name performance_tracking \
    --collection-name model_runs

# Upload specific results
madengine database \
    --csv-file results/perf_mi300.csv \
    --db benchmarks \
    --collection mi300_results
```

Integration Workflow:
```bash
# 1. Run benchmarks
madengine run --tags model1 model2 model3 \
    --output perf_entry.csv

# 2. Generate HTML report
madengine report to-html --csv-file perf_entry.csv

# 3. Upload to database
madengine database \
    --csv-file perf_entry.csv \
    --db benchmarks \
    --collection daily_runs

# 4. Send email report
madengine report to-email --output daily_summary.html
# (Then use your email tool to send daily_summary.html)
```

See the CLI Reference for complete options.
Configure distributed training:
```json
{
  "k8s": {
    "gpu_count": 8
  },
  "distributed": {
    "launcher": "torchrun",
    "nnodes": 2,
    "nproc_per_node": 4
  }
}
```

Supported Launchers:

- `torchrun` - PyTorch DDP/FSDP
- `deepspeed` - ZeRO optimization
- `megatron` - Large transformers (K8s + SLURM)
- `torchtitan` - LLM pre-training
- `vllm` - LLM inference
- `sglang` - Structured generation
See Launchers Guide for details.
Results are saved to perf_entry.csv:
```csv
model_name,execution_time,gpu_utilization,memory_used,...
my_model,125.3,98.5,15.2,...
```

build_manifest.json contains:
- Built image names and tags
- Model configurations
- Deployment configuration
- Build timestamp
Use this manifest to run pre-built images:
```bash
madengine run --manifest-file build_manifest.json
```

```bash
# Ensure you're in the MAD directory
cd /path/to/MAD
madengine discover --tags your_model
```

```bash
# Add user to docker group (Linux)
sudo usermod -aG docker $USER
newgrp docker
```

```bash
# AMD GPUs
rocm-smi

# NVIDIA GPUs
nvidia-smi

# Test with Docker
docker run --rm --device=/dev/kfd --device=/dev/dri \
    rocm/pytorch:latest rocm-smi
```

```bash
# Check Docker daemon
docker ps

# Rebuild without cache
madengine build --tags model --clean-docker-cache --verbose
```

| Variable | Description | Example |
|---|---|---|
| `MODEL_DIR` | MAD package directory | /path/to/MAD |
| `ROCM_PATH` | ROCm installation root (used when `--rocm-path` is not set). Use when ROCm is not in /opt/rocm (e.g. TheRock, pip). | /path/to/rocm |
| `MAD_VERBOSE_CONFIG` | Verbose config logging | "true" |
| `MAD_DOCKERHUB_USER` | Docker Hub username | "myusername" |
| `MAD_DOCKERHUB_PASSWORD` | Docker Hub password | "mytoken" |
| `MAD_DOCKERHUB_REPO` | Docker Hub repository | "myorg" |
- Use configuration files for complex settings
- Separate build and run for distributed deployments
- Test locally first before deploying to clusters
- Use registries for distributed execution
- Enable verbose logging when debugging
- Start with small timeouts and increase as needed
For complex configurations, use JSON files:
```bash
# Create config.json
cat > config.json << 'EOF'
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "docker_gpus": "0,1,2,3",
  "timeout_multiplier": 2.0,
  "distributed": {
    "launcher": "torchrun",
    "nproc_per_node": 4
  }
}
EOF

# Use with commands
madengine build --tags model --additional-context-file config.json
madengine run --tags model --additional-context-file config.json
```

Specify tags in multiple ways:
```bash
# Repeated flags
madengine run --tags model1 --tags model2 --tags model3

# Comma-separated
madengine run --tags model1,model2,model3

# Mix both
madengine run --tags model1 --tags model2,model3
```

```bash
# Full verbose output with real-time logs
madengine run --tags model --verbose --live-output

# Keep container alive for inspection
madengine run --tags model --keep-alive

# Check what will be discovered
madengine discover --tags model --verbose
```

madengine uses consistent exit codes (0 = success, 2 = build failure, 3 = run failure, 4 = invalid args). Failed runs are still written to perf.csv with status FAILURE. See CLI Reference - Exit Codes for the full table.
```bash
#!/bin/bash
# Example CI script
set -e  # Exit on error during the build phase

# Build images
madengine build --batch-manifest batch.json \
    --registry docker.io/myorg \
    --verbose

# Run tests and branch on the exit code
# (0=success, 2=build failure, 3=run failure; see CLI Reference)
if madengine run --manifest-file build_manifest.json --timeout 3600; then
    echo "✅ Tests passed"
    # Generate and upload results
    madengine report to-email --output ci_results.html
    madengine database \
        --csv-file perf.csv \
        --db ci_results \
        --collection ${CI_BUILD_ID}
else
    status=$?
    echo "❌ Tests failed"
    exit $status
fi
```

- CLI Reference - Complete command options and examples
- Configuration Guide - Advanced configuration options
- Deployment Guide - Kubernetes and SLURM deployment
- Batch Build Guide - Selective builds for CI/CD
- Profiling Guide - Performance analysis
- Launchers Guide - Multi-node training frameworks
- Main README - Project overview
- Installation Guide - Setup instructions
- Contributing Guide - How to contribute
- GitHub Issues - Report issues or get help