
Conversation

@msaroufim msaroufim commented Feb 10, 2026

Summary

End-to-end support for model competitions where users submit vLLM forks and are benchmarked on serving throughput/latency. This mirrors the existing kernel submission flow but for full model inference serving.

  • New Language.Model type with ModelTaskData config (model name, tensor parallel, benchmark shapes, perplexity baseline); a rough config sketch follows this list
  • run_model_benchmark() — 5-phase pipeline: extract archive → pip install fork → start vLLM server → perplexity check → benchmark serving
  • GitHub Actions workflow (nvidia_model_workflow.yml) for B200 self-hosted runners
  • Modal runner with persistent model weights volume and sccache for CUDA compilation
  • API support for 50MB binary archive uploads (tar.gz/zip)
  • score_ascending field for higher-is-better metrics (e.g., throughput)
  • Security: tar path traversal validation, metrics namespacing, perplexity success threshold
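
For orientation, here is a minimal sketch of what a model-task config along the lines of the first bullet could look like. The field names and example values below are assumptions inferred from this summary, not the actual ModelTaskData definition in the PR.

```python
# Hypothetical sketch only: field names and example values are inferred from
# the summary above, not the actual ModelTaskData dataclass in this PR.
from dataclasses import dataclass, field


@dataclass
class ModelTaskDataSketch:
    model_name: str                              # HF model to serve
    tensor_parallel: int = 1                     # vLLM tensor-parallel degree
    benchmark_shapes: list[dict] = field(default_factory=list)
    perplexity_baseline: float = 0.0             # reference value the fork must stay near
    ranking_metric: str = "request_throughput"   # used when ranking_by is CUSTOM


task = ModelTaskDataSketch(
    model_name="meta-llama/Llama-3.1-8B",
    tensor_parallel=1,
    benchmark_shapes=[{"num_prompts": 100, "input_len": 512, "output_len": 128}],
    perplexity_baseline=10.0,  # placeholder baseline, not a measured value
)
```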

E2E Testing Status (GitHub Actions route)

Tested against B200 self-hosted runner (l-bgx-01). Pipeline validated through multiple iterations:

| Run | Result | Issue |
|-----|--------|-------|
| 1 | pip install failed | Bad pyproject.toml build backend in test payload |
| 2 | pip install failed | setuptools-scm can't detect version without .git |
| 3 | /opt/model-venv permission denied | Runner is containerized, can't write to /opt |
| 4 | vLLM server failed to start | --download-dir /models doesn't exist on GH runners |
| 5 | vLLM server failed to start | HF token not available; Llama-3.1-8B is gated |

Current state: the full pipeline works up to model weight download. The vLLM fork compiles successfully on B200 (sm_100) and the server launches, but the run fails because the runner node lacks an HF_TOKEN for gated models.

Remaining work / sources of overhead to eliminate

Blockers for full E2E

  • HF_TOKEN as GitHub secret — runner nodes need HF_TOKEN to download gated models like Llama-3.1-8B. Add it as a repo secret and pass it via the workflow env (a minimal pre-flight check is sketched after this list).
  • Popcorn-CLI end-to-end test — the full flow still needs testing: popcorn-cli → API → GitHub Actions → result callback. Zip payload support exists but hasn't been validated.
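
As an illustration of the HF_TOKEN blocker above, a small pre-flight check along these lines could fail fast with a clear message instead of letting the vLLM server die mid-download; the function name and signature are hypothetical, not part of this PR.

```python
# Illustrative pre-flight check, not part of this PR: fail fast with a clear
# error when a gated model is requested and no HF_TOKEN is available.
import os


def require_hf_token(model_name: str, gated: bool) -> None:
    if gated and not os.environ.get("HF_TOKEN"):
        raise RuntimeError(
            f"{model_name} is gated on Hugging Face; set HF_TOKEN (e.g. as a repo "
            "secret passed through the workflow env) before starting the vLLM server."
        )
```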

Performance (40 min cold start)

  • vLLM compilation from source (~20 min) — every run recompiles CUDA extensions for B200. Options:
    • Persistent venv with pre-compiled vLLM deps (needs a writable path on runner nodes; /opt is read-only)
    • Pre-built wheel cache or sccache volume (an illustrative sccache setup is sketched after this list)
    • Docker image with vLLM pre-installed (only the user's diff gets compiled)
  • Model weight download (~10 min) — Llama-3.1-8B is ~16GB. Options:
    • Persistent HF cache directory on runner nodes
    • Pre-downloaded weights volume (like Modal's model_weights volume)
  • Environment setup (~2 min) — torch + vLLM deps are installed via uv on each run; this would be instant with a persistent venv or Docker image.
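
For the sccache option above, a rough sketch of how the compile cache could be wired in before the pip install step; whether vLLM's build honors these variables depends on its CMake setup, and the cache directory is a placeholder path.

```python
# Illustrative only: point C++/CUDA compilation at sccache before building the
# fork. Whether vLLM's build honors these depends on its CMake configuration.
import os

os.environ.setdefault("CMAKE_C_COMPILER_LAUNCHER", "sccache")
os.environ.setdefault("CMAKE_CXX_COMPILER_LAUNCHER", "sccache")
os.environ.setdefault("CMAKE_CUDA_COMPILER_LAUNCHER", "sccache")
os.environ.setdefault("SCCACHE_DIR", "/persistent/sccache")  # placeholder cache location
```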

Nice to have

  • Persistent venv on runner nodes — tried /opt/model-venv, but the runner is containerized with no write access outside the workspace. Needs coordination with infra.
  • sccache for CUDA compilation — the Modal runner has this; the GitHub runner doesn't yet
  • Smaller test model — use a non-gated model (e.g., facebook/opt-125m) for CI smoke tests

Test plan

  • Unit tests pass (test_backend.py, test_task.py)
  • GitHub Actions workflow dispatches and runs on B200 runner
  • vLLM source compilation works on B200 (sm_100)
  • Server log capture works (stderr visible in result.json on failure)
  • Full E2E with HF_TOKEN (server starts, perplexity check, benchmark)
  • Popcorn-CLI submission flow
  • Modal runner deployment and test

Extend the platform to support model-level competitions where users submit
vLLM forks as tarballs. The system pip installs the fork, starts a vLLM
server, runs serving benchmarks, and checks perplexity against a baseline.

- Add Language.Model and RankCriterion.CUSTOM to support model tasks
- Add ModelTaskData with benchmark shapes, perplexity config, timeouts
- Add run_model_benchmark() with a 5-phase pipeline (install, server, perplexity, benchmark, cleanup); a rough outline follows this list
- Add score_ascending field for higher-is-better ranking (throughput vs time)
- Add tarball upload support (50MB limit) in API
- Add Modal image with vLLM deps, sccache, and model weights volume
- Add download_model.py for pre-populating model weights
- Add example task definition for Llama-3.1-8B serving
- Add reuse documentation listing unchanged components
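
A rough, hypothetical outline of the five-phase flow described above; the extraction, perplexity, and benchmark steps are elided and the command lines are abbreviated, so treat this as a shape sketch rather than the PR's actual run_model_benchmark().

```python
# Hypothetical outline of the five-phase flow listed above; several steps are
# elided and the commands are abbreviated.
import shutil
import subprocess
import tempfile


def model_benchmark_outline(archive_bytes: bytes, model_name: str) -> dict:
    work_dir = tempfile.mkdtemp(prefix="model_submission_")
    server = None
    try:
        # 1. install: write the archive, extract it (elided), pip install the fork
        archive_path = f"{work_dir}/submission.tar.gz"
        with open(archive_path, "wb") as f:
            f.write(archive_bytes)
        subprocess.run(["pip", "install", "-e", work_dir], check=False)

        # 2. server: launch the vLLM OpenAI-compatible server for the target model
        server = subprocess.Popen(
            ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", model_name]
        )

        # 3. perplexity: query the server and compare against the task baseline (elided)
        # 4. benchmark: run serving benchmarks per configured shape, collect metrics (elided)
        return {"success": True, "metrics": {}}
    finally:
        # 5. cleanup: stop the server and remove the temp directory
        if server is not None:
            server.terminate()
        shutil.rmtree(work_dir, ignore_errors=True)
```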
Copilot AI review requested due to automatic review settings February 10, 2026 20:38

Copilot AI left a comment


Pull request overview

Adds end-to-end “model competition” support where users submit vLLM forks as archives that are installed and benchmarked via a new runner path, with leaderboard ranking able to support both lower-is-better and higher-is-better scores.

Changes:

  • Introduces Language.Model + ModelTaskData, plus run_model_benchmark() pipeline (install → serve → perplexity → benchmark → cleanup).
  • Adds score direction (score_ascending) wiring through task config, DB ranking queries, and API responses (a toy ordering example follows this list).
  • Extends submission handling to accept binary archives (50MB) and adds Modal infra (new image + volumes) and a weight pre-download script.
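
As a toy illustration of the score-direction wiring, ordering a ranking query by score_ascending might look like the following; the table and column names are invented for the example and do not reflect leaderboard_db.py.

```python
# Toy illustration of score direction in a ranking query; table and column
# names are invented for this example.
def ranking_query(score_ascending: bool) -> str:
    # ascending = lower is better (e.g. latency); descending = higher is better (e.g. throughput)
    direction = "ASC" if score_ascending else "DESC"
    return f"SELECT user_id, score FROM submissions ORDER BY score {direction} LIMIT 10"


print(ranking_query(score_ascending=False))  # throughput-style leaderboard
```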

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.

Summary per file:

| File | Description |
|------|-------------|
| tests/test_task.py | Updates expected task config dicts to include score_ascending. |
| src/runners/modal_runner_archs.py | Registers Modal functions for model benchmarking on selected GPUs with volumes mounted. |
| src/runners/modal_runner.py | Adds dedicated model_image and Modal Volumes for model weights + sccache. |
| src/runners/download_model.py | Adds a Modal app to pre-download HF model weights into a shared volume. |
| src/libkernelbot/task.py | Adds ModelTaskData, extends LeaderboardTask to support model tasks + score_ascending. |
| src/libkernelbot/submission.py | Adds custom metric scoring and threads score_ascending into competition/ranking display. |
| src/libkernelbot/run_eval.py | Routes lang=model to the new run_model_benchmark() implementation. |
| src/libkernelbot/leaderboard_db.py | Stores bytes submissions and adds ranking-direction support to leaderboard queries. |
| src/libkernelbot/launchers/modal.py | Dispatches the Modal function name based on lang, including model. |
| src/libkernelbot/consts.py | Adds Language.Model and RankCriterion.CUSTOM. |
| src/libkernelbot/backend.py | Base64-encodes model archives for transport and avoids .lower() on bytes. |
| src/kernelbot/api/main.py | Ensures the /submissions endpoint uses correct score ordering for the given leaderboard. |
| src/kernelbot/api/api_utils.py | Accepts larger binary uploads for model tasks (50MB) and validates the archive extension. |
| examples/llama_8b_serving/task.yml | Adds an example model task configuration (custom ranking metric + descending score). |
| docs/model-competitions-reuse.md | Documents which existing components are reused unchanged for model competitions. |
Comments suppressed due to low confidence (1)

src/runners/modal_runner.py:1

  • These pins look risky: I’m not aware of a torch==2.9.1 release or a cu130 wheel index in the standard PyTorch distribution scheme. If this is intentional for your environment, consider documenting/validating it; otherwise, pin to a known-available Torch/CUDA combo (or make it configurable) to avoid Modal image build failures.
import signal


Comment on lines 899 to 907:

```python
if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"
```


Copilot AI Feb 10, 2026


tar.extractall() / ZipFile.extractall() are vulnerable to path traversal (e.g., ../../...) and can write outside extract_dir. Use a safe extraction routine that validates each member path stays within extract_dir (reject absolute paths and .. segments) before extracting.

Suggested change (replacing the lines quoted above):

```python
def _safe_extract_tar(tar: tarfile.TarFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for member in tar.getmembers():
        name = member.name
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in tar archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in tar archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Tar path escapes destination directory: {name!r}")
    tar.extractall(path=dest_dir)


def _safe_extract_zip(zf: zipfile.ZipFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for name in zf.namelist():
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in zip archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in zip archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Zip path escapes destination directory: {name!r}")
    zf.extractall(path=dest_dir)


try:
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            _safe_extract_tar(tar, extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            _safe_extract_zip(zf, extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
except ValueError as e:
    return False, "", f"Submission archive contains unsafe paths: {e}"
```

Comment on lines 886 to 923:

```python
work_dir = tempfile.mkdtemp(prefix="model_submission_")
archive_path = os.path.join(work_dir, "submission.tar.gz")

with open(archive_path, "wb") as f:
    f.write(archive_bytes)

# Extract
import tarfile
import zipfile

extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)

if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"

# Find the actual package directory (may be nested one level)
entries = os.listdir(extract_dir)
if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
    pkg_dir = os.path.join(extract_dir, entries[0])
else:
    pkg_dir = extract_dir

# pip install
result = subprocess.run(
    ["pip", "install", "-e", pkg_dir],
    capture_output=True,
    text=True,
    timeout=install_timeout,
)

return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```

Copilot AI Feb 10, 2026


tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).

Suggested change (replacing the lines quoted above):

```python
with tempfile.TemporaryDirectory(prefix="model_submission_") as work_dir:
    archive_path = os.path.join(work_dir, "submission.tar.gz")

    with open(archive_path, "wb") as f:
        f.write(archive_bytes)

    # Extract
    import tarfile
    import zipfile

    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"

    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir

    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )

    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```

Comment on lines +896 to +897:

```python
extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)
```

Copilot AI Feb 10, 2026


tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).

Comment on lines 943 to 944:

```python
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
```

Copilot AI Feb 10, 2026


Starting the server with stdout=PIPE and stderr=PIPE without continuously draining them risks blocking the vLLM process once its output buffers fill, potentially hanging runs. Redirect to files/DEVNULL, merge streams, or spawn reader threads to drain and store logs safely.

Suggested change (replacing the lines quoted above):

```python
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
```

Comment on lines 979 to 1034:

```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]

# Prefer the benchmark_serving script approach
cmd = [
    "python3", "-m", "vllm.benchmarks.benchmark_serving",
    "--backend", "openai-chat",
    "--base-url", f"http://localhost:{port}",
    "--model", model_name,
    "--endpoint", "/v1/chat/completions",
    "--num-prompts", str(shape.get("num_prompts", 100)),
    "--random-input-len", str(shape.get("input_len", 512)),
    "--random-output-len", str(shape.get("output_len", 128)),
    "--save-result",
]

result = subprocess.run(
    cmd,
    capture_output=True,
    text=True,
    timeout=benchmark_timeout,
)

if result.returncode != 0:
    all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
    continue

# Parse the saved JSON result file
# vLLM saves to a json file in current directory
import glob
json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
if json_files:
    try:
        with open(json_files[0]) as f:
            bench_result = json.load(f)
        for key in [
            "request_throughput",
            "output_throughput",
            "mean_ttft_ms",
            "median_ttft_ms",
            "p99_ttft_ms",
            "mean_tpot_ms",
            "median_tpot_ms",
            "p99_tpot_ms",
            "mean_itl_ms",
            "median_itl_ms",
            "p99_itl_ms",
        ]:
            if key in bench_result:
                all_metrics[key] = bench_result[key]
        os.remove(json_files[0])
    except (json.JSONDecodeError, OSError):
        pass

all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```

Copilot AI Feb 10, 2026


Metrics are overwritten across shapes because all_metrics[key] is reused for every shape; only the last shape’s values will survive. Also, glob('*.json') in the current working directory can pick up unrelated files and is race-prone. Write results to a per-shape, known filepath (or run in a temp working directory) and namespace metrics per shape (e.g., shape_{i}_{key}) or return a list keyed by shape.

Suggested change (replacing the lines quoted above):

```python
with tempfile.TemporaryDirectory() as tmpdir:
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.run_batch",
    ]

    # Prefer the benchmark_serving script approach
    cmd = [
        "python3", "-m", "vllm.benchmarks.benchmark_serving",
        "--backend", "openai-chat",
        "--base-url", f"http://localhost:{port}",
        "--model", model_name,
        "--endpoint", "/v1/chat/completions",
        "--num-prompts", str(shape.get("num_prompts", 100)),
        "--random-input-len", str(shape.get("input_len", 512)),
        "--random-output-len", str(shape.get("output_len", 128)),
        "--save-result",
    ]

    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=benchmark_timeout,
        cwd=tmpdir,
    )

    if result.returncode != 0:
        all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
        continue

    # Parse the saved JSON result file
    # vLLM saves to a json file in the working directory
    import glob
    json_files = sorted(
        glob.glob(os.path.join(tmpdir, "*.json")),
        key=os.path.getmtime,
        reverse=True,
    )
    if json_files:
        try:
            with open(json_files[0]) as f:
                bench_result = json.load(f)
            for key in [
                "request_throughput",
                "output_throughput",
                "mean_ttft_ms",
                "median_ttft_ms",
                "p99_ttft_ms",
                "mean_tpot_ms",
                "median_tpot_ms",
                "p99_tpot_ms",
                "mean_itl_ms",
                "median_itl_ms",
                "p99_itl_ms",
            ]:
                if key in bench_result:
                    all_metrics[f"shape_{i}_{key}"] = bench_result[key]
            os.remove(json_files[0])
        except (json.JSONDecodeError, OSError):
            pass

    all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```

Comment on lines 979 to 982:

```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]
```

Copilot AI Feb 10, 2026


The initial cmd assignment to vllm.entrypoints.openai.run_batch is immediately overwritten and has no effect. Remove the dead assignment to reduce confusion and keep the benchmark invocation single-sourced.

Suggested change: delete the dead cmd assignment quoted above.

Comment on lines +1085 to +1087:

```python
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
```

Copilot AI Feb 10, 2026


The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.

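A sketch of the success-ratio idea from this comment; the request payload is simplified, the function name is hypothetical, and the strictly-more-than-50% threshold mirrors the later fix commit rather than this PR's exact code.

```python
# Sketch of tracking request errors instead of swallowing them; payload and
# threshold details are simplified assumptions.
import json
import urllib.request


def query_prompts(url: str, prompts: list[str], timeout: float = 30.0) -> list[dict]:
    successes, errors = [], []
    for prompt in prompts:
        req = urllib.request.Request(
            url,
            data=json.dumps({"prompt": prompt, "max_tokens": 1, "logprobs": 1}).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                successes.append(json.loads(resp.read()))
        except Exception as exc:  # keep the details for the run result
            errors.append(f"{type(exc).__name__}: {exc}")
    # Require strictly more than half of the requests to succeed.
    if len(successes) * 2 <= len(prompts):
        raise RuntimeError(
            f"perplexity check: {len(errors)}/{len(prompts)} requests failed: {errors[:3]}"
        )
    return successes
```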
Comment on lines 1094 to 1095:

```python
except Exception:
    continue
```

Copilot AI Feb 10, 2026


The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.


```python
def compute_score(result: FullResult, task: LeaderboardTask, submission_id: int) -> float:
    if task.ranking_by == RankCriterion.CUSTOM:
        ranking_metric = task.config.ranking_metric
```

Copilot AI Feb 10, 2026


RankCriterion.CUSTOM implicitly assumes task.config has ranking_metric, but LeaderboardTask.config can also be CudaTaskData/PythonTaskData, which don’t define it. Enforce CUSTOM only for Language.Model (e.g., in LeaderboardTask.__post_init__) or store ranking_metric at the task level so this doesn’t depend on a specific config dataclass.

Suggested change (replacing the last line quoted above):

```python
        # Some task configurations (e.g., CudaTaskData/PythonTaskData) may not
        # define a `ranking_metric` attribute. Guard against that here so we
        # don't rely on a specific config dataclass shape.
        config = getattr(task, "config", None)
        if config is None or not hasattr(config, "ranking_metric"):
            raise KernelBotError(
                "RankCriterion.CUSTOM requires task.config to define a 'ranking_metric' "
                f"attribute; got config type '{type(config).__name__}' instead."
            )
        ranking_metric = getattr(config, "ranking_metric")
```

```python
    return passed, measured_ppl


def run_model_benchmark(config: dict) -> FullResult:  # noqa: C901
```

Copilot AI Feb 10, 2026


The new run_model_benchmark() path (install, server startup/timeout handling, perplexity pass/fail, benchmark parsing, and cleanup) introduces substantial logic but isn’t covered by unit tests. Since the repo already has pytest coverage (e.g., tests/test_task.py), add focused tests that mock subprocess.run / subprocess.Popen and urllib.request.urlopen to deterministically validate success and failure modes.

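A sketch of the kind of focused test suggested here; the module path comes from the file list above, but the config keys and the FullResult attributes asserted are assumptions about this PR's layout, so treat it as a template rather than a drop-in test.

```python
# Template test; config keys and FullResult attributes are assumptions.
from unittest import mock

from libkernelbot import run_eval  # module path taken from the file list above


def test_model_benchmark_fails_when_pip_install_fails():
    failed_install = mock.Mock(returncode=1, stdout="", stderr="pip install failed")
    with mock.patch.object(run_eval.subprocess, "run", return_value=failed_install), \
         mock.patch.object(run_eval.subprocess, "Popen") as popen:
        result = run_eval.run_model_benchmark({"archive": b"", "model_name": "dummy"})

    assert not result.success        # assumed: FullResult exposes a success flag
    popen.assert_not_called()        # the vLLM server should never start if install fails
```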
- Fix path traversal vulnerability in tar/zip extraction (validate members)
- Fix metrics overwritten across shapes (namespace by shape index)
- Fix vLLM server stdout/stderr PIPE blocking (redirect to DEVNULL)
- Fix perplexity check silently swallowing errors (require >50% success)
- Remove dead cmd assignment in benchmark runner
- Add hasattr guard for CUSTOM ranking_metric in compute_score
- Remove docs/model-competitions-reuse.md
- Fix lang_name KeyError crash for model submissions in GitHub launcher
- Upload model archives as Git blobs to bypass workflow dispatch size limits
- Add nvidia_model_workflow.yml with 60-min timeout for model benchmarking
- Update github-runner.py to download blob archives before running
- Add model-specific timeout computation from model_config
- Add expected run name pattern for model workflow dispatch
- Block model competitions on AMD GPUs (NVIDIA only for now)
@github-actions

Coverage report

| File | Lines missing |
|------|---------------|
| src/libkernelbot/backend.py | 198 |
| src/libkernelbot/consts.py | |
| src/libkernelbot/leaderboard_db.py | |
| src/libkernelbot/submission.py | 178-190, 235 |
| src/libkernelbot/task.py | 83, 87, 101, 156, 222 |
| src/libkernelbot/utils.py | |

This report was generated by python-coverage-comment-action

Isolates model benchmark dependencies in a venv instead of
polluting the runner's system Python. Falls back to pip if
uv is not available.
- Persistent venv at /opt/model-venv with torch + vLLM deps pre-cached
  (mirrors Modal model_image pattern: install vllm for deps, uninstall)
- Set SETUPTOOLS_SCM_PRETEND_VERSION for tarball submissions without .git
- Pin Python 3.10 in venv, add sccache for CUDA compilation caching
Drop /opt persistent venv (permission issues on containerized runners).
Bootstrap fresh venv each run with torch + vllm deps. Optimize later.
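
A sketch of the "use uv, fall back to pip" venv bootstrap described in these commits; the venv path handling and package list are placeholders, not the workflow's actual values.

```python
# Sketch of the uv-first, pip-fallback bootstrap; paths and packages are placeholders.
import shutil
import subprocess
import sys


def bootstrap_venv(venv_dir: str, packages: list[str]) -> str:
    python = f"{venv_dir}/bin/python"
    if shutil.which("uv"):
        subprocess.run(["uv", "venv", venv_dir], check=True)
        subprocess.run(["uv", "pip", "install", "--python", python, *packages], check=True)
    else:
        subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
        subprocess.run([python, "-m", "pip", "install", *packages], check=True)
    return python
```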
- Only use --download-dir /models if the path exists (Modal volume).
  On GitHub runners, fall back to HF cache default.
- Capture server stdout/stderr to a log file instead of DEVNULL.
- Include server log in result on startup failure for debugging.
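
A sketch of the two behaviors in this commit: pass --download-dir only when the /models volume exists, and capture server output to a log file that can be attached to the result on startup failure. The server command is abbreviated, not the full invocation used in this PR.

```python
# Sketch of the commit above: conditional --download-dir and log-file capture.
import os
import subprocess


def start_vllm_server(model_name: str, log_path: str) -> subprocess.Popen:
    cmd = ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", model_name]
    if os.path.isdir("/models"):  # Modal weights volume; absent on GitHub runners
        cmd += ["--download-dir", "/models"]

    log_file = open(log_path, "w")  # read this back into the result if startup fails
    return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
```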