Draft
Commits
50 commits
834fe82
Add MI325X DeepSeek-R1 FP8 disaggregated inference (1P1D, Broadcom Th…
Mar 31, 2026
7b50476
Update amd-master.yaml
JordanNanos Mar 31, 2026
b40908c
Add MTP config, expand sweep to full pareto frontier, use -good image
Mar 31, 2026
2421ca5
Add perf-changelog entry for MI325X disagg configs
Mar 31, 2026
6abdf85
Fix MI325X QoS detection and NFS-safe cleanup for disagg benchmarks
JordanNanos Apr 1, 2026
3716258
Add local NVMe model caching for faster model loading
JordanNanos Apr 1, 2026
db677bd
Switch model caching from rsync to rclone sync
JordanNanos Apr 1, 2026
0a485de
Add MTP baseline to single-node MI325X DeepSeek-R1 FP8 config
JordanNanos Apr 1, 2026
67dec7c
Split MI325X single-node MTP into separate config key
JordanNanos Apr 1, 2026
f18257f
Fix MI325X single-node script resolution and add MTP support
JordanNanos Apr 2, 2026
3ccfba3
Fix decode dispatch token limit for DP attention disagg configs
JordanNanos Apr 2, 2026
0213032
Disable EP8/DP disagg configs on MI325X and bump MTP to 3 tokens
JordanNanos Apr 2, 2026
2afb24a
Add single-node EP8/DP test configs for MI325X disagg
JordanNanos Apr 2, 2026
36aebfd
Move container image to semianalysiswork Docker Hub and fix launcher …
JordanNanos Apr 3, 2026
b5a0bc2
Test EP8/DP workaround: drop MoRI a2a backend on MI325X bnxt_re
JordanNanos Apr 4, 2026
beb3808
Fix MODEL_NAME for EP8/DP test configs with MODEL_YAML_KEY override
JordanNanos Apr 4, 2026
23c2931
fix: resolve MODEL_NAME from flat repo dir when HF snapshot absent
JordanNanos Apr 4, 2026
e5b9d00
Tune EP8/DP test: lower concurrency + QP params for SQ full fix
JordanNanos Apr 4, 2026
76d89d0
fix: lower bnxt_re QP limits and concurrency for MI325X EP8/DP disagg
JordanNanos Apr 5, 2026
4d9ee30
Add GLM-5 FP8 single-node benchmark for MI325X
JordanNanos Apr 9, 2026
13c1167
Skip HF download validation when model is cached on MI325X
JordanNanos Apr 9, 2026
d4d6e19
Add Qwen3.5 and GLM-5 FP8 disaggregated inference for MI325X
JordanNanos Apr 9, 2026
5228c62
Fix HF cache path resolution: use sed instead of tr for org/repo sepa…
JordanNanos Apr 9, 2026
b08abaf
Sanitize MODEL_NAME in Docker container name
JordanNanos Apr 10, 2026
6dbaa19
Force-reinstall transformers for GLM-5 in disagg Docker containers
JordanNanos Apr 10, 2026
2c24d0d
Switch GLM-5 MI325X configs to v0.5.10 image
JordanNanos Apr 10, 2026
d3522ec
Switch GLM-5 MI325X to MI355X GLM-5 image (rocm/sgl-dev mori-0402)
JordanNanos Apr 10, 2026
5dd235f
Switch Qwen3.5/GLM-5 disagg to v0.5.10 image + no-MoRI transfer
JordanNanos Apr 10, 2026
d8abc66
Switch Qwen3.5/GLM-5 disagg to v0.5.10 image + no-MoRI transfer
JordanNanos Apr 10, 2026
44780e0
Fix YAML: switch Qwen3.5/GLM-5 disagg to v0.5.10 + no-MoRI transfer
JordanNanos Apr 10, 2026
21ce11a
Remove MODEL_NAME overrides — let launcher resolve HF cache path
JordanNanos Apr 10, 2026
fc2f0d9
Fix TP mismatch for non-MLA models in Qwen3.5/GLM-5 disagg
JordanNanos Apr 10, 2026
c956ce2
Add MI325X container image build scripts and documentation
JordanNanos Apr 10, 2026
18f1c5c
Use latest SGLang main for MI325X image build
JordanNanos Apr 10, 2026
13be2f6
Update build script default to SGL_BRANCH=v0.5.10
JordanNanos Apr 10, 2026
9ec6e9d
Add transformers patch layer for GLM-5/Qwen3.5 model type support
JordanNanos Apr 10, 2026
02645c7
Build from SGLang main for Qwen3.5/GLM-5 PD disagg fixes
JordanNanos Apr 10, 2026
947e339
Switch Qwen3.5/GLM-5 to main-bnxt image with PD disagg fixes
JordanNanos Apr 10, 2026
d648774
Switch to v0.5.10-bnxt-patched (PD fixes + transformers patch)
JordanNanos Apr 11, 2026
d6053e1
Add thin bnxt layer Dockerfile for existing SGLang images
JordanNanos Apr 11, 2026
757d015
Switch Qwen3.5/GLM-5 to amd-disagg-bnxt-lite image
JordanNanos Apr 11, 2026
a205fda
Switch to amd-main-bnxt image (full AMD fork build)
JordanNanos Apr 13, 2026
e661747
Switch to amd-main-bnxt-nopatch (no transformers override)
JordanNanos Apr 13, 2026
03015e0
Update MI325X runners to new amds naming convention
JordanNanos Apr 14, 2026
268be7b
Install tiktoken/sentencepiece in disagg server for GLM-5 tokenizer
JordanNanos Apr 14, 2026
02154ef
Switch Qwen3.5/GLM-5 disagg to explicit Mooncake transfer backend
JordanNanos Apr 14, 2026
dcade3d
Build MI325X image from MI355X Qwen3.5 disagg base + bnxt
JordanNanos Apr 14, 2026
b8d3d44
Fix router health check: use /health instead of /readiness
JordanNanos Apr 14, 2026
3b9eb4f
Add --trust-remote-code to disagg benchmark for Qwen3.5/GLM-5
JordanNanos Apr 14, 2026
ec7e4f3
Switch to PR #22665 image with MoRI DSA/GDN fix
JordanNanos Apr 15, 2026
752 changes: 752 additions & 0 deletions .github/configs/amd-master.yaml

Large diffs are not rendered by default.

17 changes: 13 additions & 4 deletions .github/configs/runners.yaml
@@ -71,10 +71,19 @@ mi300x:
- 'mi300x-amds_2'
- 'mi300x-amds_3'
mi325x:
- - 'mi325x-amd_0'
- - 'mi325x-amd_1'
- - 'mi325x-amd_2'
- - 'mi325x-amd_3'
+ - 'mi325x-amds_00'
+ - 'mi325x-amds_01'
+ - 'mi325x-amds_02'
+ - 'mi325x-amds_03'
+ - 'mi325x-amds_04'
+ - 'mi325x-amds_05'
+ - 'mi325x-amds_06'
+ - 'mi325x-amds_08'
+ mi325x-disagg:
+ - 'mi325x-amds_00'
+ - 'mi325x-amds_01'
+ - 'mi325x-amds_02'
+ - 'mi325x-amds_03'
mi355x:
- 'mi355x-amds_0'
- 'mi355x-amds_1'
1 change: 1 addition & 0 deletions benchmarks/multi_node/amd_utils/bench.sh
@@ -57,6 +57,7 @@ for max_concurrency in ${chosen_concurrencies[@]}; do
--max-concurrency "$max_concurrency" \
--result-filename "$export_file" \
--result-dir /workspace/ \
+ --trust-remote-code \
$( [ "$IS_MTP" = "true" ] && echo "--use-chat-template" )

echo "-----------------------------------------"
27 changes: 25 additions & 2 deletions benchmarks/multi_node/amd_utils/env.sh
@@ -20,6 +20,9 @@ if [[ -z "$IBDEVICES" ]]; then
export IBDEVICES=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7
elif [[ $NODENAME == mia1* ]]; then
export IBDEVICES=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
+ elif [[ $NODENAME == chi-mi325x* ]]; then
+ # Vultr/CPE MI325X cluster: Broadcom RoCE (bnxt_re); bnxt_re6 is DOWN, skip it
+ export IBDEVICES=bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re7,bnxt_re8
else
echo "ERROR: Unable to detect cluster from hostname $NODENAME and IBDEVICES not set" >&2
exit 1
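The new branch hard-codes the device list and leaves out `bnxt_re6` by hand. As an alternative, the list could be derived from port state at runtime. The following is a minimal sketch, not part of this PR: the helper name `active_ibdevices` is hypothetical, and it assumes each NIC exposes its link on port 1 under the standard Linux RDMA sysfs tree.

```bash
# List RDMA devices whose port 1 is ACTIVE, as a comma-separated string.
# $1: sysfs root to scan (defaults to /sys/class/infiniband).
# Assumption: port 1 is the port of interest on every device.
active_ibdevices() (
    root="${1:-/sys/class/infiniband}"
    list=""
    for dev in "$root"/*; do
        state_file="$dev/ports/1/state"
        [ -r "$state_file" ] || continue
        # The kernel reports states such as "4: ACTIVE" or "1: DOWN".
        case "$(cat "$state_file")" in
            *ACTIVE*) list="${list:+$list,}$(basename "$dev")" ;;
        esac
    done
    printf '%s\n' "$list"
)
```

With this in place, `export IBDEVICES="$(active_ibdevices)"` would skip any DOWN port without a per-cluster edit, at the cost of silently shrinking the list if a link flaps.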
@@ -42,6 +45,13 @@ export SGLANG_USE_AITER=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=1200
export SGLANG_DISAGGREGATION_WAITING_TIMEOUT=1200

+ # GLM-5: uses NSA (not MLA), needs fused-decode-MLA disabled + fast loading
+ if [[ "$MODEL_NAME" == *GLM-5* ]]; then
+ export SGLANG_ROCM_FUSED_DECODE_MLA=0
+ export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
+ export SAFETENSORS_FAST_GPU=1
+ fi
+
# Disable allocating memory in one pass
export MORI_SHMEM_MODE=ISOLATION
export SGLANG_MORI_FP8_DISP=True
@@ -64,8 +74,11 @@ export MORI_MAX_DISPATCH_TOKENS_DECODE=160
export SGLANG_MORI_DISPATCH_INTER_KERNEL_SWITCH_THRESHOLD=$((MORI_MAX_DISPATCH_TOKENS_DECODE * 2))

export MORI_EP_LAUNCH_CONFIG_MODE=AUTO
- export MORI_IO_QP_MAX_SEND_WR=16384
- export MORI_IO_QP_MAX_CQE=32768
+ # Broadcom bnxt_re NICs cap SQ depth at ~4351 entries. Lower from upstream
+ # defaults (16384/32768) to avoid SQ overflow under EP8 RDMA traffic.
+ # See sgl-project/sglang#22072
+ export MORI_IO_QP_MAX_SEND_WR=4096
+ export MORI_IO_QP_MAX_CQE=8192
export MORI_IO_QP_MAX_SGE=4

export MORI_APP_LOG_LEVEL=INFO
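The new values are consistent with rounding the quoted ~4351-entry SQ limit down to a power of two and keeping the 1:2 send-WR to CQE ratio of the upstream defaults. Whether the author derived them this way is an assumption; the sketch below only shows that the arithmetic lines up. `floor_pow2` is a hypothetical helper.

```bash
# Largest power of two that does not exceed $1 (expects an integer >= 1).
floor_pow2() (
    n=$1
    p=1
    while [ $((p * 2)) -le "$n" ]; do
        p=$((p * 2))
    done
    echo "$p"
)

sq_cap=4351                          # approximate bnxt_re SQ limit quoted in the diff
max_send_wr=$(floor_pow2 "$sq_cap")  # 4096
max_cqe=$((max_send_wr * 2))         # 8192, same 1:2 ratio as 16384/32768
echo "MORI_IO_QP_MAX_SEND_WR=$max_send_wr MORI_IO_QP_MAX_CQE=$max_cqe"
```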
@@ -101,6 +114,11 @@ $1 == "DSCP" && $2 == ":" && $NF == p {
elif [[ $NODENAME == mia1* ]]; then
export MORI_RDMA_TC=104
echo "[INFO] Auto-detected MORI_RDMA_TC=$MORI_RDMA_TC from hostname $NODENAME"
+ elif [[ $NODENAME == chi-mi325x* ]]; then
+ # Vultr/CPE MI325X: Broadcom Thor 2, DSCP AF31(26)->prio 3, TC=4*26=104
+ export MORI_RDMA_TC=104
+ export MORI_RDMA_SL=3
+ echo "[INFO] Auto-detected MORI_RDMA_TC=$MORI_RDMA_TC, MORI_RDMA_SL=$MORI_RDMA_SL from hostname $NODENAME"
else
echo "[INFO] Unable to detect MORI_RDMA_TC from hostname. Skipping RDMA QoS configuration."
fi
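The `TC=4*26=104` in the comment follows from the layout of the IP traffic-class byte: the 6-bit DSCP sits above the two ECN bits, so the byte value is the DSCP shifted left by two. A small sketch of that conversion (`dscp_to_tc` is a hypothetical helper, not something the PR defines):

```bash
# Traffic-class byte = DSCP shifted left past the two ECN bits.
dscp_to_tc() {
    echo $(( $1 << 2 ))
}

dscp_to_tc 26   # AF31 -> 104
```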
@@ -114,6 +132,11 @@ else
elif [[ $NODENAME == mia1* ]]; then
export MORI_RDMA_TC=104
echo "[INFO] Auto-detected MORI_RDMA_TC=$MORI_RDMA_TC from hostname $NODENAME"
+ elif [[ $NODENAME == chi-mi325x* ]]; then
+ # Vultr/CPE MI325X: Broadcom Thor 2, DSCP AF31(26)->prio 3, TC=4*26=104
+ export MORI_RDMA_TC=104
+ export MORI_RDMA_SL=3
+ echo "[INFO] Auto-detected MORI_RDMA_TC=$MORI_RDMA_TC, MORI_RDMA_SL=$MORI_RDMA_SL from hostname $NODENAME"
else
echo "[INFO] nicctl not found and unable to detect from hostname. Skipping RDMA QoS configuration."
echo " This is normal for clusters without QoS or outside Docker containers."
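The hostname fallback now appears twice in env.sh with identical `mia1*` and `chi-mi325x*` branches. One way to keep the two copies from drifting is a shared helper. This is a sketch only: `rdma_qos_from_hostname` is a hypothetical name, and it covers just the two hostname branches visible in this diff, not any branch elided from the hunks.

```bash
# Set MORI_RDMA_TC (and MORI_RDMA_SL where the diff sets it) from a hostname.
# Returns non-zero when the hostname matches no known cluster.
rdma_qos_from_hostname() {
    case "$1" in
        mia1*)
            export MORI_RDMA_TC=104 ;;
        chi-mi325x*)
            export MORI_RDMA_TC=104 MORI_RDMA_SL=3 ;;
        *)
            return 1 ;;
    esac
}
```

Both call sites could then reduce to `rdma_qos_from_hostname "$NODENAME" || echo "[INFO] ... Skipping RDMA QoS configuration."`, each keeping its own log message.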
75 changes: 64 additions & 11 deletions benchmarks/multi_node/amd_utils/job.slurm
@@ -30,14 +30,18 @@ if [[ ! -f "$MODELS_YAML" ]]; then
exit 1
fi

- # Validate MODEL_NAME exists as a top-level key in models.yaml
- if ! grep -q "^${MODEL_NAME}:" "$MODELS_YAML"; then
- echo "Error: Model '$MODEL_NAME' not found in models.yaml"
+ # MODEL_YAML_KEY is the models.yaml lookup key (bare model name, e.g. DeepSeek-R1-0528).
+ # MODEL_NAME may be a longer HF cache path (e.g. models--org--repo/snapshots/<hash>).
+ _MODEL_YAML_KEY="${MODEL_YAML_KEY:-$MODEL_NAME}"
+
+ # Validate the yaml key exists as a top-level key in models.yaml
+ if ! grep -q "^${_MODEL_YAML_KEY}:" "$MODELS_YAML"; then
+ echo "Error: Model '$_MODEL_YAML_KEY' not found in models.yaml"
echo "Available models:"
grep -E '^[A-Za-z]' "$MODELS_YAML" | sed 's/:.*$//' | sed 's/^/ - /'
exit 1
fi
- echo "Model found: $MODEL_NAME"
+ echo "Model found: $_MODEL_YAML_KEY"

# All models use server.sh as the entrypoint
RUN_FILE="server.sh"
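One caveat with the `grep -q "^${_MODEL_YAML_KEY}:"` check: the key is interpreted as a regular expression, so a `.` in a name such as `Qwen3.5` matches any character. A literal comparison avoids that. The sketch below is an alternative, not what the PR does; `has_top_level_key` is a hypothetical helper and it assumes top-level keys are unquoted and start in column one.

```bash
# Succeeds only if the YAML file $2 has a top-level key exactly equal to $1.
# Assumption: keys are unquoted and unindented, as the grep in the diff implies.
has_top_level_key() {
    awk -F: -v key="$1" '$1 == key { found = 1 } END { exit !found }' "$2"
}
```

Usage would mirror the existing check: `has_top_level_key "$_MODEL_YAML_KEY" "$MODELS_YAML" || { echo "Error: ..."; exit 1; }`.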
@@ -249,10 +253,9 @@ echo "NNODES is ${NNODES}"
echo "REPO Directory is ${DI_REPO_DIR}"
echo "USER_NAME is ${USER_NAME}"

- # Get the RDMA priority and DSCP value from the NIC
+ # Get the RDMA priority and DSCP value from the NIC (optional - env.sh handles absence gracefully)
if ! command -v nicctl >/dev/null 2>&1; then
- echo "Error: nicctl command not found. Please ensure nicctl is installed and available." >&2
- exit 1
+ echo "[INFO] nicctl not found. RDMA QoS configuration will be skipped inside the container." >&2
fi

# Reduce log spam
@@ -286,7 +289,8 @@ export DRY_RUN="${DRY_RUN:-0}"
export BENCHMARK_LOGS_DIR="${BENCHMARK_LOGS_DIR:-$(pwd)/benchmark_logs}"

SANITIZED_USER=$(echo "$USER_NAME" | tr -c 'a-zA-Z0-9_.-' '_')
- export DOCKER_CONT_NAME="container_sbatch_${SANITIZED_USER}_${MODEL_NAME}_${SLURM_JOB_ID}"
+ SANITIZED_MODEL=$(echo "$MODEL_NAME" | tr -c 'a-zA-Z0-9_.-' '_')
+ export DOCKER_CONT_NAME="container_sbatch_${SANITIZED_USER}_${SANITIZED_MODEL}_${SLURM_JOB_ID}"
export RUN_FILE_FULL="$SGLANG_WS_PATH/${RUN_FILE}"
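A detail of the `echo ... | tr -c` idiom used for both sanitized values: `echo` appends a newline, and because `tr -c` replaces every character outside the allowed set, that newline becomes a trailing `_`. The result is still a valid container name, just with a doubled underscore before the next field. A sketch of a variant that avoids it (`sanitize` is a hypothetical helper; the sample path is made up):

```bash
# Replace every character outside [a-zA-Z0-9_.-] with "_".
# printf '%s' emits no trailing newline, so nothing extra gets translated.
sanitize() {
    printf '%s' "$1" | tr -c 'a-zA-Z0-9_.-' '_'
}

sanitize 'models--org--repo/snapshots/abc123'; echo
```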


@@ -296,8 +300,8 @@ SELECTED_NODELIST_SRUN=$(echo "$SELECTED_NODES" | paste -sd,)

cleanup() {
echo "[${SLURM_JOB_ID}] termination received on $(hostname); cleaning stale logs folder..."
- # clean up the logs folder
- sudo rm -rf ${SLURM_SUBMIT_DIR}/logs 2>/dev/null || true
+ # NFS-safe cleanup: use timeout to avoid hanging on stale NFS locks
+ timeout --kill-after=5 30 sudo rm -rf ${SLURM_SUBMIT_DIR}/logs 2>/dev/null || true

echo "[${SLURM_JOB_ID}] cleanup done."
}
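With `|| true`, a cleanup that hits the 30-second limit looks the same in the log as one that succeeded. GNU `timeout` exits with status 124 when the command times out, so the two cases can be told apart. A minimal sketch, assuming GNU coreutils; `bounded_cleanup` is a hypothetical wrapper and the messages are illustrative:

```bash
# Run a cleanup command under a deadline and report how it ended.
# $1: seconds before SIGTERM is sent; remaining args: the command to run.
bounded_cleanup() {
    limit=$1
    shift
    rc=0
    timeout --kill-after=5 "$limit" "$@" || rc=$?
    case "$rc" in
        0)   echo "cleanup finished" ;;
        124) echo "cleanup timed out after ${limit}s" ;;
        *)   echo "cleanup failed with status $rc" ;;
    esac
    return 0   # never fail the trap handler, matching the `|| true` in the diff
}
```

In the trap this would read `bounded_cleanup 30 sudo rm -rf "${SLURM_SUBMIT_DIR}/logs"`.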
@@ -318,6 +322,54 @@ srun --nodelist="$SELECTED_NODELIST_SRUN" bash -c '
echo "NFS cache refreshed on $(hostname)"
'

+ # =============================================================================
+ # Optional: Pre-stage model to local NVMe for faster loading
+ # =============================================================================
+ # LOCAL_MODEL_CACHE_DIR: mount point for fast local storage (NVMe/SSD) on compute nodes.
+ # Set per-cluster via the runner/launch script. When set, model weights are synced
+ # with rclone from shared storage to local NVMe before Docker starts. This is
+ # idempotent: subsequent runs skip files already cached locally.
+ #
+ # If unset or the local path doesn't exist, the model is served directly from
+ # shared storage (NFS/Lustre) as before.
+ if [[ -n "${LOCAL_MODEL_CACHE_DIR:-}" ]]; then
+ LOCAL_MODEL_FULL="${LOCAL_MODEL_CACHE_DIR}/${MODEL_NAME}"
+ echo "[cache] Pre-staging model to local NVMe on all nodes..."
+ echo "[cache] Source: $MODEL_PATH"
+ echo "[cache] Dest: $LOCAL_MODEL_FULL"
+
+ srun --nodelist="$SELECTED_NODELIST_SRUN" bash -c '
+ set -euo pipefail
+ SRC="'"$MODEL_PATH"'"
+ DST="'"$LOCAL_MODEL_FULL"'"
+ CACHE_DIR="'"${LOCAL_MODEL_CACHE_DIR}"'"
+
+ # Create destination directory
+ sudo mkdir -p "$CACHE_DIR" 2>/dev/null || mkdir -p "$CACHE_DIR"
+ sudo chown -R "$(whoami)" "$CACHE_DIR" 2>/dev/null || true
+
+ echo "[cache] $(hostname): Syncing model to local NVMe..."
+ START=$(date +%s)
+
+ rclone sync "$SRC/" "$DST/" \
+ --transfers 32 \
+ --checkers 32 \
+ --links \
+ --progress
+
+ ELAPSED=$(( $(date +%s) - START ))
+ SIZE=$(du -sh "$DST" 2>/dev/null | cut -f1)
+ echo "[cache] $(hostname): Done in ${ELAPSED}s ($SIZE)"
+ ' 2>&1
+
+ if [[ $? -eq 0 ]]; then
+ echo "[cache] Model pre-staged successfully. Updating MODEL_DIR."
+ MODEL_DIR="${LOCAL_MODEL_CACHE_DIR}"
+ else
+ echo "[cache] WARNING: Local caching failed on some nodes. Falling back to shared storage."
+ fi
+ fi

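The pre-staging step above passes `MODEL_PATH` and friends into the single-quoted `bash -c` script by closing and reopening the quotes (`"'"$VAR"'"`). That works for ordinary paths but breaks if a value ever contains a double quote. Passing the values as positional parameters is one alternative; the sketch below illustrates the pattern in isolation (`run_inner` and the sample paths are hypothetical, and using the same form under `srun` is an assumption, though `srun` does forward trailing arguments to the command it launches).

```bash
# Hand outer-shell values to an inner `bash -c` script as positional
# parameters instead of splicing them into the script text.
run_inner() {
    bash -c '
        set -eu
        SRC=$1
        DST=$2
        printf "%s -> %s\n" "$SRC" "$DST"
    ' _ "$1" "$2"
}

run_inner "/shared/models/demo model" "/nvme/cache/demo model"
```

The `_` fills `$0` of the inner script, so the real values start at `$1`.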
srun \
--nodelist="$SELECTED_NODELIST_SRUN" \
--kill-on-bad-exit=1 \
@@ -357,7 +409,7 @@ exec sudo docker run --rm \
--privileged \
-v ${MODEL_DIR}:/models \
-v \$HOME/.ssh:/root/.ssh \
- -v $(which nicctl):/usr/sbin/nicctl \
+ $(command -v nicctl &>/dev/null && echo "-v $(which nicctl):/usr/sbin/nicctl") \
Comment on lines 256 to +412
Contributor

Can you verify whether these changes break mi355 disagg? +viz @Oseltamivir

Collaborator Author

The nicctl check was failing on this cluster. MoRI needs nicctl to enforce QoS, but the check is disabled for now: nicctl is not installed on these nodes or in the container image, and it seems unnecessary here.

--shm-size 128G \
-v /tmp:/run_logs \
-v ${BENCHMARK_LOGS_DIR}:/benchmark_logs \
@@ -373,6 +425,7 @@ exec sudo docker run --rm \
-e xP=\$xP \
-e yD=\$yD \
-e MODEL_NAME=\$MODEL_NAME \
+ -e MODEL_YAML_KEY=${_MODEL_YAML_KEY} \
-e IPADDRS=\$IPADDRS \
-e PREFILL_TP_SIZE=\$PREFILL_TP_SIZE \
-e PREFILL_ENABLE_EP=\$PREFILL_ENABLE_EP \
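The conditional nicctl mount in this `docker run` resolves the binary twice, once with `command -v` and once with `which`. A small helper that resolves it once and prints the flag only when the tool exists is sketched below. `optional_tool_mount` is a hypothetical name, and how it would be spliced into the escaped launch command in job.slurm is left open.

```bash
# Print "-v <host path>:<container path>" when tool $1 exists on the host,
# and print nothing (still succeeding) when it does not.
optional_tool_mount() {
    tool_path=$(command -v "$1" 2>/dev/null) || return 0
    printf '%s %s:%s\n' -v "$tool_path" "$2"
}

optional_tool_mount nicctl /usr/sbin/nicctl
```

On a node without nicctl the call prints nothing, so the surrounding command line is unchanged.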