[VIBE CODED - Review Only] E2E model competition support #442
base: main
Conversation
Extend the platform to support model-level competitions where users submit vLLM forks as tarballs. The system pip installs the fork, starts a vLLM server, runs serving benchmarks, and checks perplexity against a baseline.

- Add `Language.Model` and `RankCriterion.CUSTOM` to support model tasks
- Add `ModelTaskData` with benchmark shapes, perplexity config, timeouts
- Add `run_model_benchmark()` with 5-phase pipeline (install, server, perplexity, benchmark, cleanup)
- Add `score_ascending` field for higher-is-better ranking (throughput vs time)
- Add tarball upload support (50MB limit) in API
- Add Modal image with vLLM deps, sccache, and model weights volume
- Add `download_model.py` for pre-populating model weights
- Add example task definition for Llama-3.1-8B serving
- Add reuse documentation listing unchanged components
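As a rough illustration of what the new task config carries, here is a sketch based only on the bullets above; the real `ModelTaskData` in `src/libkernelbot/task.py` may use different field names and defaults.

```python
# Illustrative sketch only; field names and defaults are assumptions, not the PR's code.
from dataclasses import dataclass, field


@dataclass
class ModelTaskData:
    model_name: str                                               # e.g. an HF identifier for Llama-3.1-8B
    tensor_parallel: int = 1
    benchmark_shapes: list[dict] = field(default_factory=list)    # num_prompts / input_len / output_len per shape
    perplexity_baseline: float = 0.0
    perplexity_tolerance: float = 0.05
    install_timeout: int = 1800                                   # seconds for pip installing the fork
    server_timeout: int = 600                                     # seconds to wait for the vLLM server
    benchmark_timeout: int = 1800                                 # seconds per serving benchmark run
```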
Pull request overview
Adds end-to-end “model competition” support where users submit vLLM forks as archives that are installed and benchmarked via a new runner path, with leaderboard ranking able to support both lower-is-better and higher-is-better scores.
Changes:
- Introduces `Language.Model` + `ModelTaskData`, plus a `run_model_benchmark()` pipeline (install → serve → perplexity → benchmark → cleanup).
- Adds score direction (`score_ascending`) wiring through task config, DB ranking queries, and API responses.
- Extends submission handling to accept binary archives (50MB) and adds Modal infra (new image + volumes) and a weight pre-download script.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_task.py | Updates expected task config dicts to include score_ascending. |
| src/runners/modal_runner_archs.py | Registers Modal functions for model benchmarking on selected GPUs with volumes mounted. |
| src/runners/modal_runner.py | Adds dedicated model_image and Modal Volumes for model weights + sccache. |
| src/runners/download_model.py | Adds a Modal app to pre-download HF model weights into a shared volume. |
| src/libkernelbot/task.py | Adds ModelTaskData, extends LeaderboardTask to support model tasks + score_ascending. |
| src/libkernelbot/submission.py | Adds custom metric scoring, and threads score_ascending into competition/ranking display. |
| src/libkernelbot/run_eval.py | Routes lang=model to new run_model_benchmark() implementation. |
| src/libkernelbot/leaderboard_db.py | Stores bytes submissions and adds ranking direction support to leaderboard queries. |
| src/libkernelbot/launchers/modal.py | Dispatches Modal function name based on lang including model. |
| src/libkernelbot/consts.py | Adds Language.Model and RankCriterion.CUSTOM. |
| src/libkernelbot/backend.py | Base64-encodes model archives for transport and avoids .lower() on bytes. |
| src/kernelbot/api/main.py | Ensures /submissions endpoint uses correct score ordering for the given leaderboard. |
| src/kernelbot/api/api_utils.py | Accepts larger binary uploads for model tasks (50MB) and validates archive extension. |
| examples/llama_8b_serving/task.yml | Adds an example model task configuration (custom ranking metric + descending score). |
| docs/model-competitions-reuse.md | Documents which existing components are reused unchanged for model competitions. |
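The `leaderboard_db` and API rows above add ranking-direction support driven by `score_ascending`. As a rough illustration of how such a flag could be threaded into the ranking query, here is a minimal sketch; the table and column names are illustrative, not this repo's schema.

```python
# Sketch only: demonstrates switching the ORDER BY direction from a score_ascending flag.
def ranked_submissions_query(score_ascending: bool) -> str:
    direction = "ASC" if score_ascending else "DESC"   # lower-is-better vs higher-is-better
    return (
        "SELECT submission_id, user_id, score "
        "FROM leaderboard_runs "            # hypothetical table name
        "WHERE leaderboard_id = %s "
        f"ORDER BY score {direction} "
        "LIMIT %s"
    )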
Comments suppressed due to low confidence (1)
src/runners/modal_runner.py:1
- These pins look risky: I'm not aware of a `torch==2.9.1` release or a `cu130` wheel index in the standard PyTorch distribution scheme. If this is intentional for your environment, consider documenting/validating it; otherwise, pin to a known-available Torch/CUDA combo (or make it configurable) to avoid Modal image build failures.
```python
import signal
```
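On the "make it configurable" suggestion: a minimal sketch, assuming Modal's `Image.pip_install` with `index_url`; the environment variable names and the Torch/CUDA versions below are illustrative defaults, not the pins from this PR.

```python
# Sketch only: pull the Torch pin and wheel index from the environment so the
# Modal image build does not hard-code an unavailable combination.
import os

import modal

TORCH_VERSION = os.environ.get("MODEL_IMAGE_TORCH_VERSION", "2.5.1")       # assumed default
TORCH_INDEX_URL = os.environ.get(
    "MODEL_IMAGE_TORCH_INDEX", "https://download.pytorch.org/whl/cu124"    # assumed default
)

model_image = modal.Image.debian_slim(python_version="3.10").pip_install(
    f"torch=={TORCH_VERSION}", index_url=TORCH_INDEX_URL
)
```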
src/libkernelbot/run_eval.py
Outdated
```python
if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"
```
Copilot AI · Feb 10, 2026
tar.extractall() / ZipFile.extractall() are vulnerable to path traversal (e.g., ../../...) and can write outside extract_dir. Use a safe extraction routine that validates each member path stays within extract_dir (reject absolute paths and .. segments) before extracting.
Suggested change (safe extraction replacing the block above):

```python
def _safe_extract_tar(tar: tarfile.TarFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for member in tar.getmembers():
        name = member.name
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in tar archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in tar archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Tar path escapes destination directory: {name!r}")
    tar.extractall(path=dest_dir)


def _safe_extract_zip(zf: zipfile.ZipFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for name in zf.namelist():
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in zip archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in zip archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Zip path escapes destination directory: {name!r}")
    zf.extractall(path=dest_dir)


try:
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            _safe_extract_tar(tar, extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            _safe_extract_zip(zf, extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
except ValueError as e:
    return False, "", f"Submission archive contains unsafe paths: {e}"
```
```python
work_dir = tempfile.mkdtemp(prefix="model_submission_")
archive_path = os.path.join(work_dir, "submission.tar.gz")

with open(archive_path, "wb") as f:
    f.write(archive_bytes)

# Extract
import tarfile
import zipfile

extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)

if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"

# Find the actual package directory (may be nested one level)
entries = os.listdir(extract_dir)
if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
    pkg_dir = os.path.join(extract_dir, entries[0])
else:
    pkg_dir = extract_dir

# pip install
result = subprocess.run(
    ["pip", "install", "-e", pkg_dir],
    capture_output=True,
    text=True,
    timeout=install_timeout,
)

return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```
Copilot AI · Feb 10, 2026
tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).
Suggested change (scoped work directory replacing the block above):

```python
with tempfile.TemporaryDirectory(prefix="model_submission_") as work_dir:
    archive_path = os.path.join(work_dir, "submission.tar.gz")
    with open(archive_path, "wb") as f:
        f.write(archive_bytes)

    # Extract
    import tarfile
    import zipfile

    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"

    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir

    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )

    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```
```python
extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)
```
Copilot AI · Feb 10, 2026
tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).
src/libkernelbot/run_eval.py
Outdated
```python
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
```
Copilot AI · Feb 10, 2026
Starting the server with stdout=PIPE and stderr=PIPE without continuously draining them risks blocking the vLLM process once its output buffers fill, potentially hanging runs. Redirect to files/DEVNULL, merge streams, or spawn reader threads to drain and store logs safely.
Suggested change:

```python
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
```
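The follow-up commits further down capture the server output to a log file instead of discarding it. A minimal sketch of that approach, with an illustrative log path and command (not the PR's exact code):

```python
# Sketch: drain the vLLM server's output into a log file so the process can
# never block on a full pipe buffer; the log can be attached to the result on failure.
import subprocess

vllm_cmd = ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "my-model"]  # illustrative
server_log_path = "/tmp/vllm_server.log"                                                   # illustrative

log_file = open(server_log_path, "w")
server_proc = subprocess.Popen(
    vllm_cmd,
    stdout=log_file,
    stderr=subprocess.STDOUT,   # merge stderr into the same log
)
# Later, on startup failure, read server_log_path back and include it in the run result.
```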
src/libkernelbot/run_eval.py
Outdated
```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]

# Prefer the benchmark_serving script approach
cmd = [
    "python3", "-m", "vllm.benchmarks.benchmark_serving",
    "--backend", "openai-chat",
    "--base-url", f"http://localhost:{port}",
    "--model", model_name,
    "--endpoint", "/v1/chat/completions",
    "--num-prompts", str(shape.get("num_prompts", 100)),
    "--random-input-len", str(shape.get("input_len", 512)),
    "--random-output-len", str(shape.get("output_len", 128)),
    "--save-result",
]

result = subprocess.run(
    cmd,
    capture_output=True,
    text=True,
    timeout=benchmark_timeout,
)

if result.returncode != 0:
    all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
    continue

# Parse the saved JSON result file
# vLLM saves to a json file in current directory
import glob
json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
if json_files:
    try:
        with open(json_files[0]) as f:
            bench_result = json.load(f)
        for key in [
            "request_throughput",
            "output_throughput",
            "mean_ttft_ms",
            "median_ttft_ms",
            "p99_ttft_ms",
            "mean_tpot_ms",
            "median_tpot_ms",
            "p99_tpot_ms",
            "mean_itl_ms",
            "median_itl_ms",
            "p99_itl_ms",
        ]:
            if key in bench_result:
                all_metrics[key] = bench_result[key]
        os.remove(json_files[0])
    except (json.JSONDecodeError, OSError):
        pass

all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```
Copilot AI · Feb 10, 2026
Metrics are overwritten across shapes because all_metrics[key] is reused for every shape; only the last shape’s values will survive. Also, glob('*.json') in the current working directory can pick up unrelated files and is race-prone. Write results to a per-shape, known filepath (or run in a temp working directory) and namespace metrics per shape (e.g., shape_{i}_{key}) or return a list keyed by shape.
Suggested change (per-shape temp directory and namespaced metrics, replacing the block above):

```python
with tempfile.TemporaryDirectory() as tmpdir:
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.run_batch",
    ]

    # Prefer the benchmark_serving script approach
    cmd = [
        "python3", "-m", "vllm.benchmarks.benchmark_serving",
        "--backend", "openai-chat",
        "--base-url", f"http://localhost:{port}",
        "--model", model_name,
        "--endpoint", "/v1/chat/completions",
        "--num-prompts", str(shape.get("num_prompts", 100)),
        "--random-input-len", str(shape.get("input_len", 512)),
        "--random-output-len", str(shape.get("output_len", 128)),
        "--save-result",
    ]

    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=benchmark_timeout,
        cwd=tmpdir,
    )

    if result.returncode != 0:
        all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
        continue

    # Parse the saved JSON result file
    # vLLM saves to a json file in the working directory
    import glob
    json_files = sorted(
        glob.glob(os.path.join(tmpdir, "*.json")),
        key=os.path.getmtime,
        reverse=True,
    )
    if json_files:
        try:
            with open(json_files[0]) as f:
                bench_result = json.load(f)
            for key in [
                "request_throughput",
                "output_throughput",
                "mean_ttft_ms",
                "median_ttft_ms",
                "p99_ttft_ms",
                "mean_tpot_ms",
                "median_tpot_ms",
                "p99_tpot_ms",
                "mean_itl_ms",
                "median_itl_ms",
                "p99_itl_ms",
            ]:
                if key in bench_result:
                    all_metrics[f"shape_{i}_{key}"] = bench_result[key]
            os.remove(json_files[0])
        except (json.JSONDecodeError, OSError):
            pass

    all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```
src/libkernelbot/run_eval.py
Outdated
```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]
```
Copilot AI · Feb 10, 2026
The initial cmd assignment to vllm.entrypoints.openai.run_batch is immediately overwritten and has no effect. Remove the dead assignment to reduce confusion and keep the benchmark invocation single-sourced.
Suggested change: delete the dead `cmd` assignment shown above.
```python
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
```
Copilot AI · Feb 10, 2026
The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.
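One way to implement this suggestion, as a hedged sketch: function and parameter names below are illustrative, not from this PR, and the real check's return shape may differ.

```python
# Sketch: record request failures instead of swallowing them, require a minimum
# success ratio, and surface error details alongside the pass/fail decision.
import math
from typing import Callable


def check_perplexity(
    prompts: list[str],
    query_nll: Callable[[str], float],   # returns mean negative log-likelihood per token for one prompt
    baseline_ppl: float,
    tolerance: float = 0.05,
    min_success_ratio: float = 0.5,
) -> tuple[bool, float, str]:
    nlls: list[float] = []
    errors: list[str] = []
    for prompt in prompts:
        try:
            nlls.append(query_nll(prompt))
        except Exception as e:            # request / parse failures are recorded, not ignored
            errors.append(f"{type(e).__name__}: {e}")

    if len(nlls) < min_success_ratio * len(prompts):
        return False, float("inf"), f"{len(errors)} of {len(prompts)} requests failed"

    measured_ppl = math.exp(sum(nlls) / len(nlls))
    passed = measured_ppl <= baseline_ppl * (1 + tolerance)
    return passed, measured_ppl, "; ".join(errors[:3])   # keep a few error details for the run result
```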
src/libkernelbot/run_eval.py
Outdated
```python
except Exception:
    continue
```
Copilot AI · Feb 10, 2026
The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.
```python
def compute_score(result: FullResult, task: LeaderboardTask, submission_id: int) -> float:
    if task.ranking_by == RankCriterion.CUSTOM:
        ranking_metric = task.config.ranking_metric
```
Copilot AI · Feb 10, 2026
RankCriterion.CUSTOM implicitly assumes task.config has ranking_metric, but LeaderboardTask.config can also be CudaTaskData/PythonTaskData, which don’t define it. Enforce CUSTOM only for Language.Model (e.g., in LeaderboardTask.__post_init__) or store ranking_metric at the task level so this doesn’t depend on a specific config dataclass.
Suggested change:

```python
# Some task configurations (e.g., CudaTaskData/PythonTaskData) may not
# define a `ranking_metric` attribute. Guard against that here so we
# don't rely on a specific config dataclass shape.
config = getattr(task, "config", None)
if config is None or not hasattr(config, "ranking_metric"):
    raise KernelBotError(
        "RankCriterion.CUSTOM requires task.config to define a 'ranking_metric' "
        f"attribute; got config type '{type(config).__name__}' instead."
    )
ranking_metric = getattr(config, "ranking_metric")
```
```python
    return passed, measured_ppl


def run_model_benchmark(config: dict) -> FullResult:  # noqa: C901
```
Copilot AI · Feb 10, 2026
The new run_model_benchmark() path (install, server startup/timeout handling, perplexity pass/fail, benchmark parsing, and cleanup) introduces substantial logic but isn’t covered by unit tests. Since the repo already has pytest coverage (e.g., tests/test_task.py), add focused tests that mock subprocess.run / subprocess.Popen and urllib.request.urlopen to deterministically validate success and failure modes.
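A hedged sketch of one such test follows; the config keys passed to `run_model_benchmark()` and the `FullResult` attribute asserted on are assumptions based on the PR description, not the actual API.

```python
# Sketch: mock subprocess in the run_eval module so the install-failure path
# can be exercised deterministically, without touching pip or a GPU.
from unittest import mock

from libkernelbot import run_eval


def test_model_benchmark_reports_install_failure(monkeypatch):
    # pip install fails -> the pipeline should stop before launching the server
    failed_install = mock.Mock(returncode=1, stdout="", stderr="build error")
    monkeypatch.setattr(run_eval.subprocess, "run", mock.Mock(return_value=failed_install))
    popen_spy = mock.Mock()
    monkeypatch.setattr(run_eval.subprocess, "Popen", popen_spy)

    result = run_eval.run_model_benchmark({"archive": b"", "model": "dummy"})  # assumed config keys

    assert not result.success          # assumed FullResult field
    popen_spy.assert_not_called()      # the vLLM server was never started
```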
Follow-up fixes addressing the review:

- Fix path traversal vulnerability in tar/zip extraction (validate members)
- Fix metrics overwritten across shapes (namespace by shape index)
- Fix vLLM server stdout/stderr PIPE blocking (redirect to DEVNULL)
- Fix perplexity check silently swallowing errors (require >50% success)
- Remove dead cmd assignment in benchmark runner
- Add hasattr guard for CUSTOM ranking_metric in compute_score
- Remove docs/model-competitions-reuse.md
- Fix lang_name KeyError crash for model submissions in GitHub launcher
- Upload model archives as Git blobs to bypass workflow dispatch size limits
- Add nvidia_model_workflow.yml with 60-min timeout for model benchmarking
- Update github-runner.py to download blob archives before running
- Add model-specific timeout computation from model_config
- Add expected run name pattern for model workflow dispatch
- Block model competitions on AMD GPUs (NVIDIA only for now)
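For reference, the blob upload mentioned above maps onto the GitHub REST endpoint `POST /repos/{owner}/{repo}/git/blobs`. A minimal sketch follows; the token handling and how the returned SHA reaches the workflow dispatch are assumptions, not this PR's implementation.

```python
# Sketch: store a large archive as a Git blob and return its SHA, so only the
# small SHA needs to travel through workflow_dispatch inputs.
import base64
import os

import requests


def upload_archive_as_blob(owner: str, repo: str, archive_bytes: bytes) -> str:
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/git/blobs",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",   # assumed token source
            "Accept": "application/vnd.github+json",
        },
        json={
            "content": base64.b64encode(archive_bytes).decode(),
            "encoding": "base64",
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["sha"]   # the runner later fetches the blob content by this SHA
```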
Isolates model benchmark dependencies in a venv instead of polluting the runner's system Python. Falls back to pip if uv is not available.
- Persistent venv at /opt/model-venv with torch + vLLM deps pre-cached (mirrors the Modal model_image pattern: install vllm for deps, then uninstall)
- Set SETUPTOOLS_SCM_PRETEND_VERSION for tarball submissions without .git
- Pin Python 3.10 in the venv, add sccache for CUDA compilation caching
Drop /opt persistent venv (permission issues on containerized runners). Bootstrap fresh venv each run with torch + vllm deps. Optimize later.
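A minimal sketch of the bootstrap with the uv-to-pip fallback described above; the venv location and package list are illustrative, not the runner's actual script.

```python
# Sketch: create an isolated venv for the benchmark run, preferring uv when
# available and falling back to the stdlib venv module plus pip otherwise.
import shutil
import subprocess
import sys


def bootstrap_venv(venv_dir: str, packages: list[str]) -> str:
    if shutil.which("uv"):
        subprocess.run(["uv", "venv", venv_dir, "--python", "3.10"], check=True)
        subprocess.run(
            ["uv", "pip", "install", "--python", f"{venv_dir}/bin/python", *packages],
            check=True,
        )
    else:
        subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
        subprocess.run([f"{venv_dir}/bin/pip", "install", *packages], check=True)
    return f"{venv_dir}/bin/python"   # interpreter to use for the benchmark run
```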
- Only use --download-dir /models if the path exists (Modal volume); on GitHub runners, fall back to the default HF cache.
- Capture server stdout/stderr to a log file instead of DEVNULL.
- Include the server log in the result on startup failure for debugging.
Summary
End-to-end support for model competitions where users submit vLLM forks and are benchmarked on serving throughput/latency. This mirrors the existing kernel submission flow but for full model inference serving.
- `Language.Model` type with `ModelTaskData` config (model name, tensor parallel, benchmark shapes, perplexity baseline)
- `run_model_benchmark()`: 5-phase pipeline: extract archive → pip install fork → start vLLM server → perplexity check → benchmark serving
- GitHub Actions workflow (`nvidia_model_workflow.yml`) for B200 self-hosted runners
- `score_ascending` field for higher-is-better metrics (e.g., throughput)

E2E Testing Status (GitHub Actions route)
Tested against B200 self-hosted runner (`l-bgx-01`). Pipeline validated through multiple iterations:

- pip install failed
- pip install failed
- `/opt/model-venv` permission denied
- vLLM server failed to start (`--download-dir /models` doesn't exist on GH runners)
- vLLM server failed to start

Current state: The full pipeline works up to model weight download. The vLLM fork compiles successfully on B200 (sm_100), the server launches, but fails because the runner node lacks an `HF_TOKEN` for gated models.

Remaining work / sources of overhead to eliminate
Blockers for full E2E
- Needs an `HF_TOKEN` to download gated models like Llama-3.1-8B. Add as a repo secret and pass via workflow env.

Performance (40 min cold start)
- Fresh venv bootstrap each run (`/opt` is read-only)
- Model weight download each run (no equivalent of the Modal `model_weights` volume)

Nice to have
- Persistent venv at `/opt/model-venv`, but the runner is containerized with no write access outside the workspace. Need to coordinate with infra.
- Use a small model (`facebook/opt-125m`) for CI smoke tests

Test plan
- Unit tests (`test_backend.py`, `test_task.py`)