feat(autotuner): autonomous kernel and inference configuration tuning for AMD GPUs#522
Open
ChuanLi1101 wants to merge 11 commits into main from chuali/gpt-oss-120b-mi355x-perf-experiment
Conversation
…ults

Targeted Pareto optimization for GPT-OSS-120B MXFP4 on a single MI355X:
- Throughput: +3.6% at c256 (12023 -> 12458 tok/s)
- TTFT: -78% at c256 (1042 ms -> 227 ms) with max_num_batched_tokens=8192
- 8K/1K TTFT: -42% at c256 with the combined config

Key findings:
- max_num_batched_tokens=8192 is the single best optimization for high concurrency
- gpu_memory_utilization=0.95 provides +3.3% throughput at c256
- ATOM_DUAL_STREAM_MOE_TOKEN_THRESHOLD=512 gives +1.3% at medium concurrency

Infrastructure:
- orchestrator.py: master experiment driver with a targeted search strategy
- experiment_tracker.py: Pareto frontier tracking with automatic status-file generation
- notifier.py: multi-channel push notifications (ntfy/Slack/Discord/Telegram)
- status.py: CLI tool for remote experiment monitoring
- run_bench.py: enhanced benchmark runner with integrated tracking

Made-with: Cursor
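The Pareto frontier tracking that experiment_tracker.py provides (maximize throughput, minimize TTFT) can be sketched as below. The names `Result`, `dominates`, and `pareto_frontier` are illustrative, not the PR's actual API; the numbers are the c256 figures quoted in the commit message.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    name: str
    throughput: float  # tok/s, higher is better
    ttft_ms: float     # time to first token, lower is better

def dominates(a: Result, b: Result) -> bool:
    """a dominates b if a is at least as good on both axes and strictly
    better on at least one."""
    return (a.throughput >= b.throughput and a.ttft_ms <= b.ttft_ms
            and (a.throughput > b.throughput or a.ttft_ms < b.ttft_ms))

def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only results that no other result dominates."""
    return [r for r in results
            if not any(dominates(o, r) for o in results if o is not r)]

# c256 numbers from the commit message: baseline vs tuned.
runs = [
    Result("baseline", 12023, 1042),
    Result("mnbt_8192", 12458, 227),  # max_num_batched_tokens=8192
]
print([r.name for r in pareto_frontier(runs)])  # -> ['mnbt_8192']
```

Because the tuned run is better on both axes, it dominates the baseline and is the entire frontier here; two configs that trade throughput against TTFT would both survive.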
…placeholders Made-with: Cursor
…i355x-perf-experiment
…github.com/ROCm/ATOM into chuali/gpt-oss-120b-mi355x-perf-experiment
…oard changes Made-with: Cursor
…github.com/ROCm/ATOM into chuali/gpt-oss-120b-mi355x-perf-experiment
…ning for AMD GPUs

Framework-agnostic autotuner inspired by NVIDIA AIConfigurator (offline perf modeling + config search) and Karpathy's autoresearch (agent-driven experiment loop). Targets MI355X/MI325X/MI300X on ROCm.

Key components:
- Collector: LLM-workload-informed micro-benchmarks for GEMM, attention, MoE, and RCCL
- Database: RBF interpolation + roofline SOL modeling with 4 accuracy modes
- Search: grid / Bayesian / agent-guided strategies with Pareto frontier analysis
- Agent: autonomous propose -> benchmark -> evaluate -> keep/discard loop
- Adapters: pluggable backends for ATOM, vLLM, and SGLang
- CLI: python -m atom.autotuner.cli run --model <hf_id> --system mi355x

Includes 49 unit tests (no GPU required) covering all components.

Made-with: Cursor
Comment on lines +29 to +35
```python
from atom.autotuner.types import (
    BenchmarkResult,
    ExperimentStatus,
    GPUInfo,
    InferenceConfig,
    TunerState,
)
```
Contributor
atom.autotuner.types.ExperimentStatus imported but unused
Suggested change
```diff
 from atom.autotuner.types import (
     BenchmarkResult,
-    ExperimentStatus,
     GPUInfo,
     InferenceConfig,
     TunerState,
 )
```
```python
from atom.autotuner.database.estimator import E2EEstimator, ModelArch
from atom.autotuner.database.perf_model import PerformanceModel
from atom.autotuner.search.pareto import ParetoAnalyzer
from atom.autotuner.search.space import ConfigSpace, SearchBounds
```
Contributor
```python
    Returns the experiment tracker with all results.
    """
    self._setup_signal_handlers()
    start_time = time.time()
```
Contributor
```python
    strategy = self._build_strategy()
    evaluate_fn = self._build_evaluate_fn()

    last_checkpoint = time.time()
```
Contributor
```python
def _cmd_run(args: argparse.Namespace) -> int:
    """Run the autonomous tuning loop."""
    from atom.autotuner.types import DatabaseMode, GPUInfo
```
Contributor
```python
            grid[y][x] = "."

    lines = []
    lines.append(f" tokens/s/gpu vs tokens/s/user (* = Pareto frontier)")
```
Contributor
Comment on lines +18 to +19
```python
import math
from dataclasses import dataclass
```
Contributor
Comment on lines +14 to +15
```python
import time
from abc import ABC, abstractmethod
```
Contributor
```python
import random
import time
from abc import ABC, abstractmethod
from typing import Callable, Optional
```
Contributor
Comment on lines +15 to +16
```python
import json
import logging
```
Contributor
Summary

Add atom.autotuner -- an autonomous kernel and inference configuration tuning framework for AMD GPUs (MI355X/MI325X/MI300X), with an InferenceAdapter ABC for pluggable backends.

Architecture

Code cleanup in this PR
- Consolidated shared logic into the InferenceAdapter base class, eliminating ~140 lines of duplicated code across 3 adapters
- Added tests/autotuner/__init__.py

Test plan
- Unit tests (python -m pytest tests/autotuner/ -v) -- no GPU required
- python -m atom.autotuner.cli run --model meta-llama/Llama-3.1-70B --system mi355x --total-gpus 8 on MI355X
- Real-benchmark evaluation via --adapter atom --eval-mode real_bench
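For context on the Database component's roofline SOL modeling: a speed-of-light estimate bounds a kernel's time by the slower of its compute and memory-bandwidth limits. A minimal sketch with placeholder hardware peaks (not real MI355X specs, and not the PR's actual estimator code):

```python
def roofline_time_s(flops: float, bytes_moved: float,
                    peak_tflops: float, peak_bw_gbs: float) -> float:
    """Speed-of-light (SOL) kernel time: the kernel can go no faster than
    either the compute limit or the memory-bandwidth limit allows."""
    compute_s = flops / (peak_tflops * 1e12)
    memory_s = bytes_moved / (peak_bw_gbs * 1e9)
    return max(compute_s, memory_s)  # bound by the slower resource

# Toy GEMM C = A @ B with M = N = K = 4096 in fp16, using made-up peaks.
M = N = K = 4096
flops = 2 * M * N * K                        # 2 flops per multiply-add
bytes_moved = 2 * (M * K + K * N + M * N)    # fp16 = 2 bytes/element
t = roofline_time_s(flops, bytes_moved, peak_tflops=1000, peak_bw_gbs=5000)
print(f"SOL estimate: {t * 1e6:.1f} us")
```

At this size the GEMM is compute-bound, so the SOL time is just flops over peak throughput; a small-batch decode GEMM would instead hit the bandwidth term, which is what makes this kind of model useful for config search.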