
Add lm-eval benchmark runner for InferenceX evals#12

Open
Oseltamivir wants to merge 11 commits into NVIDIA:sa-submission-q2-2026 from Oseltamivir:nvidia-pr

Conversation


@Oseltamivir Oseltamivir commented Apr 7, 2026

Summary

Add InferenceX multi-node eval support through an lm-eval benchmark runner and an eval-only orchestration path. This lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

Copied from ishandhanani/srt-slurm#245

How

  • Add an lm-eval benchmark runner that sources InferenceX's benchmarks/benchmark_lib.sh from a mounted /infmax-workspace.
  • Mount INFMAX_WORKSPACE into the container as /infmax-workspace when provided.
  • Add EVAL_ONLY=true handling in do_sweep.py so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch lm-eval directly.
  • Keep RUN_EVAL=true behavior as a post-benchmark eval path for normal throughput jobs.
  • Pass model/framework/topology metadata into the eval container, including the served MODEL_NAME, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
  • Map the srt-slurm PREFILL_DP_ATTN / DECODE_DP_ATTN env vars to the InferenceX PREFILL_DP_ATTENTION / DECODE_DP_ATTENTION names expected by append_lm_eval_summary.
  • Copy eval outputs (meta_env.json, results*.json, sample*.jsonl) into /logs/eval_results/ for launcher-side artifact pickup.
  • Preserve partial eval artifacts on lm-eval failure while still returning the original eval failure code.
  • Document the InferenceX lm-eval integration in docs/accuracy.md.
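The PREFILL_DP_ATTN / DECODE_DP_ATTN rename above can be sketched as follows. This is a minimal illustration, not the actual srtctl code: the `SRT_TO_INFMAX` dict and `build_eval_env` helper are hypothetical names; only the four env var names come from the PR.

```python
# Hypothetical sketch of the srt-slurm -> InferenceX env-var rename.
# Only the variable names are from the PR; the helper is illustrative.
SRT_TO_INFMAX = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}

def build_eval_env(base_env):
    """Return a copy of base_env with the InferenceX-style aliases added."""
    env = dict(base_env)
    for srt_name, infmax_name in SRT_TO_INFMAX.items():
        if srt_name in env:
            env[infmax_name] = env[srt_name]
    return env

eval_env = build_eval_env({"PREFILL_DP_ATTN": "8", "DECODE_DP_ATTN": "4"})
```

The original names are kept alongside the aliases, so scripts expecting either convention keep working.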

What

For EVAL_ONLY=true:

  • srt-slurm still starts the normal deployment topology.
  • The throughput benchmark runner is skipped.
  • wait_for_model() verifies the configured prefill/decode or aggregated worker counts.
  • lm-eval runs against the OpenAI-compatible endpoint.
  • Eval failure is fatal.
  • A low score also leads to failure.
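The stage selection described above can be sketched as a small planner. Function and stage names here are illustrative, not the actual do_sweep.py API; only the EVAL_ONLY / RUN_EVAL semantics come from the PR description.

```python
def plan_stages(env):
    """Return the ordered job stages implied by EVAL_ONLY / RUN_EVAL.

    Hypothetical sketch: stage names are illustrative, not srtctl's.
    """
    eval_only = env.get("EVAL_ONLY", "false").lower() == "true"
    run_eval = env.get("RUN_EVAL", "false").lower() == "true"
    # Both modes bring up the deployment and run the full health check.
    stages = ["start_deployment", "wait_for_model"]
    if eval_only:
        stages.append("lm_eval")           # throughput stage is skipped
    else:
        stages.append("throughput")
        if run_eval:
            stages.append("lm_eval")       # post-benchmark eval step
    return stages
```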

For RUN_EVAL=true without EVAL_ONLY=true:

  • The normal benchmark runs first.
  • lm-eval runs as a post-step if throughput succeeds.
  • Eval failure is non-fatal to the benchmark result.
  • A low score still leads to failure.
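The differing failure semantics of the two modes can be sketched as below. This is a hypothetical helper, not srtctl code: a nonzero `lm_eval` code models a crashed eval run, and a nonzero `score_check` code models a below-threshold score.

```python
def job_exit_code(rcs, eval_only):
    """Combine stage return codes per the failure rules above.

    rcs maps stage name -> return code; missing stages count as 0.
    Hypothetical sketch; stage names are illustrative.
    """
    if eval_only:
        # EVAL_ONLY: a crashed eval run or a low score is fatal.
        return rcs.get("lm_eval", 0) or rcs.get("score_check", 0)
    # Normal jobs: the throughput result dominates; a crashed lm-eval
    # run is non-fatal, but a low score still fails the job.
    return rcs.get("throughput", 0) or rcs.get("score_check", 0)
```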

Validation run

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771

InferenceX PR

SemiAnalysisAI/InferenceX#1000

Oseltamivir and others added 7 commits April 7, 2026 16:20
Adds support for running lm-eval accuracy evaluations as a post-benchmark
step, leveraging the InferenceX benchmark_lib.sh harness.

- New LMEvalRunner registered as "lm-eval" benchmark type
- bench.sh script sources benchmark_lib.sh and calls run_eval/append_lm_eval_summary
- Post-benchmark eval hook in SweepOrchestrator.run() triggered by RUN_EVAL=true
- Auto-mount INFMAX_WORKSPACE into container when env var is set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In eval-only mode the benchmark stage is skipped, which also skips
its model health check. The 30s port check in _run_post_eval is
insufficient — workers are still loading. Use wait_for_model() with
the full health check config (same as benchmark stage) when
EVAL_ONLY=true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of capping eval examples with --limit to avoid timeouts,
use the highest benchmark concurrency for eval requests. This runs
the full eval set faster by matching the throughput the server was
already benchmarked at.

do_sweep.py computes max(config.benchmark.concurrencies) and passes
it as EVAL_CONC to the lm-eval bench script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@functionstackx

Hi @kedarpotdar-nv, can we get a review on this PR? Thanks!

@functionstackx

@Ankur-singh

@codecov-commenter

Codecov Report

❌ Patch coverage is 20.68966% with 69 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (sa-submission-q2-2026@8294e64). Learn more about missing BASE report.

Files with missing lines             Patch %   Lines
src/srtctl/cli/do_sweep.py             3.17%   61 Missing ⚠️
src/srtctl/benchmarks/lm_eval.py      65.00%    7 Missing ⚠️
src/srtctl/core/runtime.py            66.66%    1 Missing ⚠️
Additional details and impacted files
@@                   Coverage Diff                    @@
##             sa-submission-q2-2026      #12   +/-   ##
========================================================
  Coverage                         ?   57.86%           
========================================================
  Files                            ?       48           
  Lines                            ?     4122           
  Branches                         ?        0           
========================================================
  Hits                             ?     2385           
  Misses                           ?     1737           
  Partials                         ?        0           

☔ View full report in Codecov by Sentry.
Collaborator

@xinli-sw xinli-sw left a comment


Thanks for the contributions, LGTM!

We currently do not have CI to check this; do you have a sample InferenceX run with this change to confirm it works?

@Oseltamivir
Author

Oseltamivir commented Apr 17, 2026

Oseltamivir and others added 4 commits April 16, 2026 18:26
Covers Codecov gaps: lm_eval.py (100%), do_sweep.py eval paths, runtime.py INFMAX mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	src/srtctl/benchmarks/lm_eval.py
#	src/srtctl/benchmarks/scripts/lm-eval/bench.sh
@Oseltamivir
Author

@xinli-sw

I also added the relevant tests

@functionstackx

@Oseltamivir can you open a PR to merge this into the master branch too, so that we don't need to keep fixing this in the sa-submission-q2-2026 branch?

@Oseltamivir Oseltamivir force-pushed the nvidia-pr branch 2 times, most recently from 27d5209 to ed8c1df on April 17, 2026 at 02:29