Add lm-eval benchmark runner for InferenceX evals#12
Oseltamivir wants to merge 11 commits into NVIDIA:sa-submission-q2-2026 from
Conversation
Adds support for running lm-eval accuracy evaluations as a post-benchmark step, leveraging the InferenceX `benchmark_lib.sh` harness.

- New `LMEvalRunner` registered as the "lm-eval" benchmark type
- `bench.sh` script sources `benchmark_lib.sh` and calls `run_eval`/`append_lm_eval_summary`
- Post-benchmark eval hook in `SweepOrchestrator.run()` triggered by `RUN_EVAL=true`
- Auto-mount `INFMAX_WORKSPACE` into the container when the env var is set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
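The runner-registration pattern described in this commit can be sketched as follows. This is a minimal illustration, not the actual srtctl API: `BENCHMARK_RUNNERS` and `register()` are hypothetical names, and the real `LMEvalRunner` does more than build a command line.

```python
# Hypothetical sketch of registering a benchmark runner under a type string.
# BENCHMARK_RUNNERS and register() are illustrative, not srtctl's real API.
BENCHMARK_RUNNERS = {}

def register(name):
    """Class decorator mapping a benchmark-type string to a runner class."""
    def wrap(cls):
        BENCHMARK_RUNNERS[name] = cls
        return cls
    return wrap

@register("lm-eval")
class LMEvalRunner:
    # bench.sh sources InferenceX's benchmark_lib.sh at runtime
    script = "benchmarks/scripts/lm-eval/bench.sh"

    def command(self, env):
        # The real runner would launch bench.sh inside the container with
        # the eval env vars (RUN_EVAL, EVAL_CONC, ...) populated.
        return ["bash", self.script], dict(env)

cmd, env = BENCHMARK_RUNNERS["lm-eval"]().command({"RUN_EVAL": "true"})
```

The orchestrator can then look up the runner by the benchmark type from config, which is why a simple string-keyed registry suffices here.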
In eval-only mode the benchmark stage is skipped, which also skips its model health check. The 30s port check in `_run_post_eval` is insufficient: workers are still loading. Use `wait_for_model()` with the full health-check config (same as the benchmark stage) when `EVAL_ONLY=true`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
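The distinction this commit makes can be sketched as below: a bare port check accepts the server as soon as the frontend binds its port, while a full health check polls until the configured worker counts are actually ready. `check_health` and the `sleep` hook are illustrative stand-ins; the real `wait_for_model()` lives in srtctl's benchmark stage and reads the health-check config.

```python
# Minimal sketch of readiness polling vs. a bare 30s port check.
# check_health is a hypothetical callable (e.g. GET /health, parse worker
# counts); the real wait_for_model() in srtctl takes the full config.
import time

def wait_for_model(check_health, expected_workers,
                   timeout_s=1800, poll_s=10, sleep=time.sleep):
    """Poll until the server reports all workers ready, not merely an open port."""
    ready = 0
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        ready = check_health()
        if ready >= expected_workers:
            return True
        sleep(poll_s)
    raise TimeoutError(f"only {ready}/{expected_workers} workers ready")

# Simulate workers coming up over successive polls (0 -> 2 -> 4 ready);
# a port check would have passed at the first poll, mid-load.
states = iter([0, 2, 4])
ok = wait_for_model(lambda: next(states), expected_workers=4, sleep=lambda s: None)
```

Sending eval requests before all prefill/decode workers finish loading weights would either fail outright or measure a half-initialized deployment, which is why the full check matters in `EVAL_ONLY` mode.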
Instead of capping eval examples with `--limit` to avoid timeouts, use the highest benchmark concurrency for eval requests. This runs the full eval set faster by matching the throughput the server was already benchmarked at. `do_sweep.py` computes `max(config.benchmark.concurrencies)` and passes it as `EVAL_CONC` to the lm-eval bench script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
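The `EVAL_CONC` derivation is simple enough to sketch directly. The dataclasses below are stand-ins for the real srtctl config types; only the `max(config.benchmark.concurrencies)` expression comes from the commit itself.

```python
# Sketch of deriving EVAL_CONC from the sweep config, per the commit above.
# SweepConfig/BenchmarkConfig are illustrative stand-ins, not srtctl types.
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    concurrencies: list  # concurrency levels swept during throughput runs

@dataclass
class SweepConfig:
    benchmark: BenchmarkConfig

def eval_env(config):
    # Run the full eval set at the highest concurrency the server already
    # sustained during benchmarking, instead of truncating with --limit.
    return {"EVAL_CONC": str(max(config.benchmark.concurrencies))}

env = eval_env(SweepConfig(BenchmarkConfig([8, 32, 128, 64])))
```

Reusing a concurrency the server was already benchmarked at means the eval stage never pushes the deployment past a load it has demonstrably handled.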
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hi @kedarpotdar-nv, can we get a review on this PR? thanks!
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##   sa-submission-q2-2026   #12  +/- ##
========================================
  Coverage         ?     57.86%
========================================
  Files            ?         48
  Lines            ?       4122
  Branches         ?          0
========================================
  Hits             ?       2385
  Misses           ?       1737
  Partials         ?          0
```

☔ View full report in Codecov by Sentry.
xinli-sw left a comment
Thanks for the contributions, LGTM!
We currently do not have CI to check this; do you have a sample InferenceX run with the change to make sure it works?
Hi @xinli-sw, https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771 should suffice. All runs at https://github.com/SemiAnalysisAI/InferenceX/actions?query=branch%3Amultinode_eval use this fork.
Covers Codecov gaps: lm_eval.py (100%), do_sweep.py eval paths, runtime.py INFMAX mount. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
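The INFMAX mount test mentioned for `runtime.py` can be sketched in this shape. `build_mounts` is an illustrative helper name, not the actual function in `runtime.py`; the behavior tested (mount `/infmax-workspace` only when `INFMAX_WORKSPACE` is set) is the one described in the commits above.

```python
# Hedged sketch of a unit test for the INFMAX_WORKSPACE auto-mount path.
# build_mounts is a hypothetical helper, not runtime.py's real function.
def build_mounts(env):
    """Return container mounts, adding /infmax-workspace when the env var is set."""
    mounts = []
    workspace = env.get("INFMAX_WORKSPACE")
    if workspace:
        mounts.append((workspace, "/infmax-workspace"))
    return mounts

def test_infmax_mount_added_when_env_set():
    assert build_mounts({"INFMAX_WORKSPACE": "/data/infmax"}) == [
        ("/data/infmax", "/infmax-workspace")
    ]

def test_no_mount_without_env():
    assert build_mounts({}) == []

test_infmax_mount_added_when_env_set()
test_no_mount_without_env()
```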
# Conflicts:
#	src/srtctl/benchmarks/lm_eval.py
#	src/srtctl/benchmarks/scripts/lm-eval/bench.sh
I also added the relevant tests
@Oseltamivir can u open up a PR to merge this into the master branch too, so that we don't need to constantly fix this in
Force-pushed from 27d5209 to ed8c1df
Summary
Add InferenceX multi-node eval support through an `lm-eval` benchmark runner and an eval-only orchestration path. Lets InferenceX run accuracy-only jobs against existing srt-slurm multi-node disaggregated recipes without running the throughput benchmark stage.

Copied from ishandhanani/srt-slurm#245
How
- New `lm-eval` benchmark runner that sources InferenceX's `benchmarks/benchmark_lib.sh` from a mounted `/infmax-workspace`.
- Mounts `INFMAX_WORKSPACE` into the container as `/infmax-workspace` when provided.
- `EVAL_ONLY=true` handling in `do_sweep.py` so eval-only jobs start infra/workers/frontend, run the full model health check, skip throughput, and launch `lm-eval` directly.
- `RUN_EVAL=true` behavior as a post-benchmark eval path for normal throughput jobs.
- Passes `MODEL_NAME`, prefill/decode TP/EP/DPA/worker counts, sequence length, precision, runner type, and eval concurrency.
- Maps the `PREFILL_DP_ATTN`/`DECODE_DP_ATTN` env vars to the InferenceX `PREFILL_DP_ATTENTION`/`DECODE_DP_ATTENTION` names expected by `append_lm_eval_summary`.
- Copies eval artifacts (`meta_env.json`, `results*.json`, `sample*.jsonl`) into `/logs/eval_results/` for launcher-side artifact pickup.
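The env-var rename described above amounts to a small translation table when handing variables from srtctl to the InferenceX script. The mapping dict below is illustrative and covers only the two names the description mentions, not necessarily the exhaustive set.

```python
# Sketch of forwarding srtctl env vars under the names that InferenceX's
# append_lm_eval_summary expects. ENV_RENAMES/remap_env are illustrative;
# only the two DP-attention names are taken from the PR description.
ENV_RENAMES = {
    "PREFILL_DP_ATTN": "PREFILL_DP_ATTENTION",
    "DECODE_DP_ATTN": "DECODE_DP_ATTENTION",
}

def remap_env(env):
    # Rename known keys, pass everything else (MODEL_NAME, etc.) through.
    return {ENV_RENAMES.get(k, k): v for k, v in env.items()}

mapped = remap_env({"PREFILL_DP_ATTN": "4", "MODEL_NAME": "m"})
```

Keeping the rename in one table makes it obvious which srtctl names diverge from the InferenceX harness, and leaves all other variables untouched.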
- Documentation in `docs/accuracy.md`.

What
For `EVAL_ONLY=true`:

- `wait_for_model()` verifies the configured prefill/decode or aggregated worker counts.
- `lm-eval` runs against the OpenAI-compatible endpoint.

For `RUN_EVAL=true` without `EVAL_ONLY=true`:

- `lm-eval` runs as a post-step if throughput succeeds.

Validation run
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24059388771
InferenceX PR
SemiAnalysisAI/InferenceX#1000