[VIBE CODED - Review Only] E2E model competition support #442
base: main
Conversation
Extend the platform to support model-level competitions where users submit vLLM forks as tarballs. The system pip installs the fork, starts a vLLM server, runs serving benchmarks, and checks perplexity against a baseline.

- Add `Language.Model` and `RankCriterion.CUSTOM` to support model tasks
- Add `ModelTaskData` with benchmark shapes, perplexity config, timeouts
- Add `run_model_benchmark()` with 5-phase pipeline (install, server, perplexity, benchmark, cleanup)
- Add `score_ascending` field for higher-is-better ranking (throughput vs time)
- Add tarball upload support (50MB limit) in API
- Add Modal image with vLLM deps, sccache, and model weights volume
- Add `download_model.py` for pre-populating model weights
- Add example task definition for Llama-3.1-8B serving
- Add reuse documentation listing unchanged components
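As a rough illustration of what the new task config carries, here is a sketch based only on the bullets above; the real `ModelTaskData` in `src/libkernelbot/task.py` may use different field names and defaults.

```python
# Illustrative sketch only; field names and defaults are assumptions, not the PR's code.
from dataclasses import dataclass, field


@dataclass
class ModelTaskData:
    model_name: str                                               # e.g. an HF identifier for Llama-3.1-8B
    tensor_parallel: int = 1
    benchmark_shapes: list[dict] = field(default_factory=list)    # num_prompts / input_len / output_len per shape
    perplexity_baseline: float = 0.0
    perplexity_tolerance: float = 0.05
    install_timeout: int = 1800                                   # seconds for pip installing the fork
    server_timeout: int = 600                                     # seconds to wait for the vLLM server
    benchmark_timeout: int = 1800                                 # seconds per serving benchmark run
```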
Pull request overview
Adds end-to-end “model competition” support where users submit vLLM forks as archives that are installed and benchmarked via a new runner path, with leaderboard ranking able to support both lower-is-better and higher-is-better scores.
Changes:
- Introduces `Language.Model` + `ModelTaskData`, plus a `run_model_benchmark()` pipeline (install → serve → perplexity → benchmark → cleanup).
- Adds score direction (`score_ascending`) wiring through task config, DB ranking queries, and API responses.
- Extends submission handling to accept binary archives (50MB) and adds Modal infra (new image + volumes) and a weight pre-download script.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_task.py | Updates expected task config dicts to include score_ascending. |
| src/runners/modal_runner_archs.py | Registers Modal functions for model benchmarking on selected GPUs with volumes mounted. |
| src/runners/modal_runner.py | Adds dedicated model_image and Modal Volumes for model weights + sccache. |
| src/runners/download_model.py | Adds a Modal app to pre-download HF model weights into a shared volume. |
| src/libkernelbot/task.py | Adds ModelTaskData, extends LeaderboardTask to support model tasks + score_ascending. |
| src/libkernelbot/submission.py | Adds custom metric scoring, and threads score_ascending into competition/ranking display. |
| src/libkernelbot/run_eval.py | Routes lang=model to new run_model_benchmark() implementation. |
| src/libkernelbot/leaderboard_db.py | Stores bytes submissions and adds ranking direction support to leaderboard queries. |
| src/libkernelbot/launchers/modal.py | Dispatches Modal function name based on lang including model. |
| src/libkernelbot/consts.py | Adds Language.Model and RankCriterion.CUSTOM. |
| src/libkernelbot/backend.py | Base64-encodes model archives for transport and avoids .lower() on bytes. |
| src/kernelbot/api/main.py | Ensures /submissions endpoint uses correct score ordering for the given leaderboard. |
| src/kernelbot/api/api_utils.py | Accepts larger binary uploads for model tasks (50MB) and validates archive extension. |
| examples/llama_8b_serving/task.yml | Adds an example model task configuration (custom ranking metric + descending score). |
| docs/model-competitions-reuse.md | Documents which existing components are reused unchanged for model competitions. |
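The `leaderboard_db` and API rows above add ranking-direction support driven by `score_ascending`. As a rough illustration of how such a flag could be threaded into the ranking query, here is a minimal sketch; the table and column names are illustrative, not this repo's schema.

```python
# Sketch only: demonstrates switching the ORDER BY direction from a score_ascending flag.
def ranked_submissions_query(score_ascending: bool) -> str:
    direction = "ASC" if score_ascending else "DESC"   # lower-is-better vs higher-is-better
    return (
        "SELECT submission_id, user_id, score "
        "FROM leaderboard_runs "            # hypothetical table name
        "WHERE leaderboard_id = %s "
        f"ORDER BY score {direction} "
        "LIMIT %s"
    )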
Comments suppressed due to low confidence (1)
src/runners/modal_runner.py:1
- These pins look risky: I'm not aware of a `torch==2.9.1` release or a `cu130` wheel index in the standard PyTorch distribution scheme. If this is intentional for your environment, consider documenting/validating it; otherwise, pin to a known-available Torch/CUDA combo (or make it configurable) to avoid Modal image build failures.
```python
import signal
```
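On the "make it configurable" suggestion: a minimal sketch, assuming Modal's `Image.pip_install` with `index_url`; the environment variable names and the Torch/CUDA versions below are illustrative defaults, not the pins from this PR.

```python
# Sketch only: pull the Torch pin and wheel index from the environment so the
# Modal image build does not hard-code an unavailable combination.
import os

import modal

TORCH_VERSION = os.environ.get("MODEL_IMAGE_TORCH_VERSION", "2.5.1")       # assumed default
TORCH_INDEX_URL = os.environ.get(
    "MODEL_IMAGE_TORCH_INDEX", "https://download.pytorch.org/whl/cu124"    # assumed default
)

model_image = modal.Image.debian_slim(python_version="3.10").pip_install(
    f"torch=={TORCH_VERSION}", index_url=TORCH_INDEX_URL
)
```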
src/libkernelbot/run_eval.py
Outdated
```python
if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"
```
Copilot AI · Feb 10, 2026
tar.extractall() / ZipFile.extractall() are vulnerable to path traversal (e.g., ../../...) and can write outside extract_dir. Use a safe extraction routine that validates each member path stays within extract_dir (reject absolute paths and .. segments) before extracting.
Suggested change (safe extraction replacing the block above):

```python
def _safe_extract_tar(tar: tarfile.TarFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for member in tar.getmembers():
        name = member.name
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in tar archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in tar archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Tar path escapes destination directory: {name!r}")
    tar.extractall(path=dest_dir)


def _safe_extract_zip(zf: zipfile.ZipFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for name in zf.namelist():
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in zip archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in zip archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Zip path escapes destination directory: {name!r}")
    zf.extractall(path=dest_dir)


try:
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            _safe_extract_tar(tar, extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            _safe_extract_zip(zf, extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
except ValueError as e:
    return False, "", f"Submission archive contains unsafe paths: {e}"
```
```python
work_dir = tempfile.mkdtemp(prefix="model_submission_")
archive_path = os.path.join(work_dir, "submission.tar.gz")

with open(archive_path, "wb") as f:
    f.write(archive_bytes)

# Extract
import tarfile
import zipfile

extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)

if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"

# Find the actual package directory (may be nested one level)
entries = os.listdir(extract_dir)
if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
    pkg_dir = os.path.join(extract_dir, entries[0])
else:
    pkg_dir = extract_dir

# pip install
result = subprocess.run(
    ["pip", "install", "-e", pkg_dir],
    capture_output=True,
    text=True,
    timeout=install_timeout,
)

return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```
Copilot AI · Feb 10, 2026
tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).
Suggested change (scoped work directory replacing the block above):

```python
with tempfile.TemporaryDirectory(prefix="model_submission_") as work_dir:
    archive_path = os.path.join(work_dir, "submission.tar.gz")
    with open(archive_path, "wb") as f:
        f.write(archive_bytes)

    # Extract
    import tarfile
    import zipfile

    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"

    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir

    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )

    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```
```python
extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)
```
Copilot AI · Feb 10, 2026
tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).
src/libkernelbot/run_eval.py
Outdated
```python
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
```
Copilot AI · Feb 10, 2026
Starting the server with stdout=PIPE and stderr=PIPE without continuously draining them risks blocking the vLLM process once its output buffers fill, potentially hanging runs. Redirect to files/DEVNULL, merge streams, or spawn reader threads to drain and store logs safely.
Suggested change:

```python
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
```
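The follow-up commits further down capture the server output to a log file instead of discarding it. A minimal sketch of that approach, with an illustrative log path and command (not the PR's exact code):

```python
# Sketch: drain the vLLM server's output into a log file so the process can
# never block on a full pipe buffer; the log can be attached to the result on failure.
import subprocess

vllm_cmd = ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "my-model"]  # illustrative
server_log_path = "/tmp/vllm_server.log"                                                   # illustrative

log_file = open(server_log_path, "w")
server_proc = subprocess.Popen(
    vllm_cmd,
    stdout=log_file,
    stderr=subprocess.STDOUT,   # merge stderr into the same log
)
# Later, on startup failure, read server_log_path back and include it in the run result.
```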
src/libkernelbot/run_eval.py
Outdated
```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]

# Prefer the benchmark_serving script approach
cmd = [
    "python3", "-m", "vllm.benchmarks.benchmark_serving",
    "--backend", "openai-chat",
    "--base-url", f"http://localhost:{port}",
    "--model", model_name,
    "--endpoint", "/v1/chat/completions",
    "--num-prompts", str(shape.get("num_prompts", 100)),
    "--random-input-len", str(shape.get("input_len", 512)),
    "--random-output-len", str(shape.get("output_len", 128)),
    "--save-result",
]

result = subprocess.run(
    cmd,
    capture_output=True,
    text=True,
    timeout=benchmark_timeout,
)

if result.returncode != 0:
    all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
    continue

# Parse the saved JSON result file
# vLLM saves to a json file in current directory
import glob
json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
if json_files:
    try:
        with open(json_files[0]) as f:
            bench_result = json.load(f)
        for key in [
            "request_throughput",
            "output_throughput",
            "mean_ttft_ms",
            "median_ttft_ms",
            "p99_ttft_ms",
            "mean_tpot_ms",
            "median_tpot_ms",
            "p99_tpot_ms",
            "mean_itl_ms",
            "median_itl_ms",
            "p99_itl_ms",
        ]:
            if key in bench_result:
                all_metrics[key] = bench_result[key]
        os.remove(json_files[0])
    except (json.JSONDecodeError, OSError):
        pass

all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```
Copilot AI · Feb 10, 2026
Metrics are overwritten across shapes because all_metrics[key] is reused for every shape; only the last shape’s values will survive. Also, glob('*.json') in the current working directory can pick up unrelated files and is race-prone. Write results to a per-shape, known filepath (or run in a temp working directory) and namespace metrics per shape (e.g., shape_{i}_{key}) or return a list keyed by shape.
Suggested change (per-shape temp directory and namespaced metrics, replacing the block above):

```python
with tempfile.TemporaryDirectory() as tmpdir:
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.run_batch",
    ]

    # Prefer the benchmark_serving script approach
    cmd = [
        "python3", "-m", "vllm.benchmarks.benchmark_serving",
        "--backend", "openai-chat",
        "--base-url", f"http://localhost:{port}",
        "--model", model_name,
        "--endpoint", "/v1/chat/completions",
        "--num-prompts", str(shape.get("num_prompts", 100)),
        "--random-input-len", str(shape.get("input_len", 512)),
        "--random-output-len", str(shape.get("output_len", 128)),
        "--save-result",
    ]

    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=benchmark_timeout,
        cwd=tmpdir,
    )

    if result.returncode != 0:
        all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
        continue

    # Parse the saved JSON result file
    # vLLM saves to a json file in the working directory
    import glob
    json_files = sorted(
        glob.glob(os.path.join(tmpdir, "*.json")),
        key=os.path.getmtime,
        reverse=True,
    )
    if json_files:
        try:
            with open(json_files[0]) as f:
                bench_result = json.load(f)
            for key in [
                "request_throughput",
                "output_throughput",
                "mean_ttft_ms",
                "median_ttft_ms",
                "p99_ttft_ms",
                "mean_tpot_ms",
                "median_tpot_ms",
                "p99_tpot_ms",
                "mean_itl_ms",
                "median_itl_ms",
                "p99_itl_ms",
            ]:
                if key in bench_result:
                    all_metrics[f"shape_{i}_{key}"] = bench_result[key]
            os.remove(json_files[0])
        except (json.JSONDecodeError, OSError):
            pass

    all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```
src/libkernelbot/run_eval.py
Outdated
```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]
```
Copilot AI · Feb 10, 2026
The initial cmd assignment to vllm.entrypoints.openai.run_batch is immediately overwritten and has no effect. Remove the dead assignment to reduce confusion and keep the benchmark invocation single-sourced.
Suggested change: delete the dead `cmd` assignment shown above.
```python
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
```
Copilot AI · Feb 10, 2026
The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.
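One way to implement this suggestion, as a hedged sketch: function and parameter names below are illustrative, not from this PR, and the real check's return shape may differ.

```python
# Sketch: record request failures instead of swallowing them, require a minimum
# success ratio, and surface error details alongside the pass/fail decision.
import math
from typing import Callable


def check_perplexity(
    prompts: list[str],
    query_nll: Callable[[str], float],   # returns mean negative log-likelihood per token for one prompt
    baseline_ppl: float,
    tolerance: float = 0.05,
    min_success_ratio: float = 0.5,
) -> tuple[bool, float, str]:
    nlls: list[float] = []
    errors: list[str] = []
    for prompt in prompts:
        try:
            nlls.append(query_nll(prompt))
        except Exception as e:            # request / parse failures are recorded, not ignored
            errors.append(f"{type(e).__name__}: {e}")

    if len(nlls) < min_success_ratio * len(prompts):
        return False, float("inf"), f"{len(errors)} of {len(prompts)} requests failed"

    measured_ppl = math.exp(sum(nlls) / len(nlls))
    passed = measured_ppl <= baseline_ppl * (1 + tolerance)
    return passed, measured_ppl, "; ".join(errors[:3])   # keep a few error details for the run result
```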
src/libkernelbot/run_eval.py
Outdated
```python
except Exception:
    continue
```
Copilot AI · Feb 10, 2026
The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.
```python
def compute_score(result: FullResult, task: LeaderboardTask, submission_id: int) -> float:
    if task.ranking_by == RankCriterion.CUSTOM:
        ranking_metric = task.config.ranking_metric
```
Copilot AI · Feb 10, 2026
RankCriterion.CUSTOM implicitly assumes task.config has ranking_metric, but LeaderboardTask.config can also be CudaTaskData/PythonTaskData, which don’t define it. Enforce CUSTOM only for Language.Model (e.g., in LeaderboardTask.__post_init__) or store ranking_metric at the task level so this doesn’t depend on a specific config dataclass.
Suggested change:

```python
# Some task configurations (e.g., CudaTaskData/PythonTaskData) may not
# define a `ranking_metric` attribute. Guard against that here so we
# don't rely on a specific config dataclass shape.
config = getattr(task, "config", None)
if config is None or not hasattr(config, "ranking_metric"):
    raise KernelBotError(
        "RankCriterion.CUSTOM requires task.config to define a 'ranking_metric' "
        f"attribute; got config type '{type(config).__name__}' instead."
    )
ranking_metric = getattr(config, "ranking_metric")
```
```python
    return passed, measured_ppl


def run_model_benchmark(config: dict) -> FullResult:  # noqa: C901
```
Copilot AI · Feb 10, 2026
The new run_model_benchmark() path (install, server startup/timeout handling, perplexity pass/fail, benchmark parsing, and cleanup) introduces substantial logic but isn’t covered by unit tests. Since the repo already has pytest coverage (e.g., tests/test_task.py), add focused tests that mock subprocess.run / subprocess.Popen and urllib.request.urlopen to deterministically validate success and failure modes.
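A hedged sketch of one such test follows; the config keys passed to `run_model_benchmark()` and the `FullResult` attribute asserted on are assumptions based on the PR description, not the actual API.

```python
# Sketch: mock subprocess in the run_eval module so the install-failure path
# can be exercised deterministically, without touching pip or a GPU.
from unittest import mock

from libkernelbot import run_eval


def test_model_benchmark_reports_install_failure(monkeypatch):
    # pip install fails -> the pipeline should stop before launching the server
    failed_install = mock.Mock(returncode=1, stdout="", stderr="build error")
    monkeypatch.setattr(run_eval.subprocess, "run", mock.Mock(return_value=failed_install))
    popen_spy = mock.Mock()
    monkeypatch.setattr(run_eval.subprocess, "Popen", popen_spy)

    result = run_eval.run_model_benchmark({"archive": b"", "model": "dummy"})  # assumed config keys

    assert not result.success          # assumed FullResult field
    popen_spy.assert_not_called()      # the vLLM server was never started
```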
Follow-up fixes addressing the review:

- Fix path traversal vulnerability in tar/zip extraction (validate members)
- Fix metrics overwritten across shapes (namespace by shape index)
- Fix vLLM server stdout/stderr PIPE blocking (redirect to DEVNULL)
- Fix perplexity check silently swallowing errors (require >50% success)
- Remove dead cmd assignment in benchmark runner
- Add hasattr guard for CUSTOM ranking_metric in compute_score
- Remove docs/model-competitions-reuse.md
- Fix lang_name KeyError crash for model submissions in GitHub launcher
- Upload model archives as Git blobs to bypass workflow dispatch size limits
- Add nvidia_model_workflow.yml with 60-min timeout for model benchmarking
- Update github-runner.py to download blob archives before running
- Add model-specific timeout computation from model_config
- Add expected run name pattern for model workflow dispatch
- Block model competitions on AMD GPUs (NVIDIA only for now)
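For reference, the blob upload mentioned above maps onto the GitHub REST endpoint `POST /repos/{owner}/{repo}/git/blobs`. A minimal sketch follows; the token handling and how the returned SHA reaches the workflow dispatch are assumptions, not this PR's implementation.

```python
# Sketch: store a large archive as a Git blob and return its SHA, so only the
# small SHA needs to travel through workflow_dispatch inputs.
import base64
import os

import requests


def upload_archive_as_blob(owner: str, repo: str, archive_bytes: bytes) -> str:
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/git/blobs",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",   # assumed token source
            "Accept": "application/vnd.github+json",
        },
        json={
            "content": base64.b64encode(archive_bytes).decode(),
            "encoding": "base64",
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["sha"]   # the runner later fetches the blob content by this SHA
```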
Isolates model benchmark dependencies in a venv instead of polluting the runner's system Python. Falls back to pip if uv is not available.
- Persistent venv at /opt/model-venv with torch + vLLM deps pre-cached (mirrors the Modal model_image pattern: install vllm for deps, then uninstall)
- Set SETUPTOOLS_SCM_PRETEND_VERSION for tarball submissions without .git
- Pin Python 3.10 in the venv, add sccache for CUDA compilation caching
Drop /opt persistent venv (permission issues on containerized runners). Bootstrap fresh venv each run with torch + vllm deps. Optimize later.
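A minimal sketch of the bootstrap with the uv-to-pip fallback described above; the venv location and package list are illustrative, not the runner's actual script.

```python
# Sketch: create an isolated venv for the benchmark run, preferring uv when
# available and falling back to the stdlib venv module plus pip otherwise.
import shutil
import subprocess
import sys


def bootstrap_venv(venv_dir: str, packages: list[str]) -> str:
    if shutil.which("uv"):
        subprocess.run(["uv", "venv", venv_dir, "--python", "3.10"], check=True)
        subprocess.run(
            ["uv", "pip", "install", "--python", f"{venv_dir}/bin/python", *packages],
            check=True,
        )
    else:
        subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
        subprocess.run([f"{venv_dir}/bin/pip", "install", *packages], check=True)
    return f"{venv_dir}/bin/python"   # interpreter to use for the benchmark run
```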
- Only use --download-dir /models if the path exists (Modal volume); on GitHub runners, fall back to the default HF cache.
- Capture server stdout/stderr to a log file instead of DEVNULL.
- Include the server log in the result on startup failure for debugging.
Summary
End-to-end support for model competitions where users submit vLLM forks and are benchmarked on serving throughput/latency. This mirrors the existing kernel submission flow but for full model inference serving.
- `Language.Model` type with `ModelTaskData` config (model name, tensor parallel, benchmark shapes, perplexity baseline)
- `run_model_benchmark()`: 5-phase pipeline: extract archive → pip install fork → start vLLM server → perplexity check → benchmark serving
- GitHub Actions workflow (`nvidia_model_workflow.yml`) for B200 self-hosted runners
- `score_ascending` field for higher-is-better metrics (e.g., throughput)

E2E Testing Status (GitHub Actions route)
Tested against B200 self-hosted runner (`l-bgx-01`). Pipeline validated through multiple iterations:

- pip install failed
- pip install failed
- `/opt/model-venv` permission denied
- vLLM server failed to start (`--download-dir /models` doesn't exist on GH runners)
- vLLM server failed to start

Current state: The full pipeline works up to model weight download. The vLLM fork compiles successfully on B200 (sm_100), the server launches, but fails because the runner node lacks an `HF_TOKEN` for gated models.

Remaining work / sources of overhead to eliminate
Blockers for full E2E
- Needs an `HF_TOKEN` to download gated models like Llama-3.1-8B. Add as a repo secret and pass via workflow env.

Performance (40 min cold start)
- Fresh venv bootstrap each run (`/opt` is read-only)
- Model weight download each run (no equivalent of the Modal `model_weights` volume)

Nice to have
- Persistent venv at `/opt/model-venv`, but the runner is containerized with no write access outside the workspace. Need to coordinate with infra.
- Use a small model (`facebook/opt-125m`) for CI smoke tests

Test plan
- Unit tests (`test_backend.py`, `test_task.py`)