# LLM Bench — Accuracy • Speed • Memory

Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).

This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.

## Quick start (local, no external model)

1. Create and activate a Python virtualenv. The project supports Python 3.10+.
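
For example, a standard `venv` workflow (any environment manager works):

```bash
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
```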

2. (Optional) Install dev dependencies for testing and plotting:

```bash
python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]
```

3. Run the example harness (uses an in-repo MockProvider):

```bash
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '')
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY
```

Output files are written under the `reports/` prefix declared in `bench_config.yaml`: per-sample JSONL, a CSV summary, a resources timeline, and a compact Markdown report.

## Configuration (`bench_config.yaml`)

Key fields:

- `provider`: select `kind: mock | ollama | openai` plus provider-specific connection options.
- `io.dataset_path`: path to the JSONL dataset.
- `io.output_prefix`: prefix for the output artifacts in `reports/`.
- `prompt.system` and `prompt.template`: system message and per-sample template using `{input}` and other fields from the dataset.
- `load.concurrency` and `load.batch_size`: concurrency and batch settings.
- `limits.max_samples`: cap the number of samples for fast experiments.
- `metrics.normalization`: optional normalization (e.g., `lower_strip`) applied before computing accuracy metrics.

An example config is included as `bench_config.yaml`.
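
For orientation, a config might look roughly like the sketch below. Values are illustrative and the dataset filename is hypothetical; the field names follow the list above, but defer to the shipped `bench_config.yaml` for the authoritative layout.

```yaml
provider:
  kind: mock                               # mock | ollama | openai
io:
  dataset_path: datasets/qa_tiny.jsonl     # hypothetical dataset file
  output_prefix: reports/mock_run
prompt:
  system: "You are a concise assistant."
  template: "Q: {input}\nA:"
load:
  concurrency: 4
  batch_size: 1
limits:
  max_samples: 20
metrics:
  normalization: lower_strip
```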

## Dataset formats

- Free-text QA JSONL (one object per line):

```json
{"id":"1","input":"Capital of France?","target":"Paris"}
```

- Multiple-choice JSONL:

```json
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}
```

## Providers

Implement a `Provider` with two async methods:

- `generate(prompt, system=None, options=None) -> dict` — returns at least `output` and may provide `latency_s`, `ttft_s`, `prompt_eval_count`, `eval_count`.
- `tokenize(text) -> int` — optional but helpful for token counts.

Included adapters:

- `OllamaProvider` (calls `/api/generate` and `/api/tokenize`)
- `OpenAIStyleProvider` (calls `/v1/chat/completions`)
- `MockProvider` (local, for testing and CI)

Add your provider implementation to `benches/providers.py` and register it in `_load_provider` in `benches/harness.py`.
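
A minimal custom adapter might look like the sketch below. Only the `generate`/`tokenize` contract above comes from this project; the class name and echo behaviour are placeholders, so check `benches/providers.py` and `_load_provider` for the real base class and registration details.

```python
# Hypothetical adapter sketch; only the generate()/tokenize() contract comes
# from the docs above, the rest is placeholder behaviour.
import time


class EchoProvider:
    """Toy provider that echoes the prompt back; useful as a starting template."""

    async def generate(self, prompt, system=None, options=None) -> dict:
        start = time.perf_counter()
        output = f"echo: {prompt}"  # replace with a real API call
        return {
            "output": output,                           # required
            "latency_s": time.perf_counter() - start,   # optional timing field
            "eval_count": len(output.split()),          # optional token stats
        }

    async def tokenize(self, text: str) -> int:
        # Optional: whitespace approximation when no real tokenizer is available.
        return len(text.split())
```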

## Metrics

- Exact Match (EM), token-level F1, and multiple-choice accuracy are implemented in `benches/metrics.py` (see the F1 sketch below).
- BLEU and ROUGE-L are optional; they require `sacrebleu` and `rouge-score`, respectively.
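
For intuition, token-level F1 is the harmonic mean of token precision and recall between prediction and target. The sketch below mirrors the common SQuAD-style definition and is illustrative only; the actual implementation (including any `lower_strip` normalization) lives in `benches/metrics.py`.

```python
# Illustrative token-level F1; benches/metrics.py is authoritative.
from collections import Counter


def token_f1(prediction: str, target: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = target.split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```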

## Resource monitoring

`benches/monitor.py` samples process CPU/RAM (via `psutil`) and optionally GPU stats via NVML.

- GPU sampling is optional; set the environment variable `LLM_BENCH_SKIP_GPU=1` to disable it (CI sets this by default).
- GPU support is available via the optional package extra `gpu` (recommended package `nvidia-ml-py`; a fallback to `pynvml` is supported).

Install the GPU extra locally with:

```bash
python -m pip install -e .[gpu]
```

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets `LLM_BENCH_SKIP_GPU=1`.
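
As a rough illustration of what this kind of sampling involves (a standalone sketch using `psutil` directly, not the actual API of `benches/monitor.py`):

```python
# Standalone illustration of a CPU/RAM sampling loop; benches/monitor.py is
# the real implementation and may differ in structure and field names.
import os
import time

import psutil


def sample_resources(duration_s: float = 5.0, interval_s: float = 0.5) -> list[dict]:
    proc = psutil.Process()
    skip_gpu = os.environ.get("LLM_BENCH_SKIP_GPU") == "1"
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append({
            "t": time.time(),
            "cpu_percent": proc.cpu_percent(interval=None),
            "rss_mb": proc.memory_info().rss / 1e6,
            # When skip_gpu is False, NVML-based GPU stats would be appended here.
        })
        time.sleep(interval_s)
    return samples
```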

## Outputs

- `*.jsonl` — per-sample detailed results
- `*_summary.csv` — single-row summary (latency percentiles, accuracy means, token counts)
- `*_resources.csv` — timeline of CPU/RAM and optional GPU samples
- `*_report.md` — compact human-readable report

## Tests & CI

- Unit and integration tests live in `tests/`.
- Run tests locally with `pytest` or `make test` (example below).
- CI (`.github/workflows/ci.yml`) runs the tests and sets `LLM_BENCH_SKIP_GPU=1` so GPU sampling is skipped on GitHub runners.
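
For example:

```bash
pytest -q
# or, mirroring CI (GPU sampling skipped):
LLM_BENCH_SKIP_GPU=1 make test
```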

## Examples

- `examples/run_mock.py` — programmatic example that runs the harness against the `MockProvider` (see the invocation sketch below).
- `benches/plot.py` — helper to plot the `*_resources.csv` timeline (requires `matplotlib` and `pandas`).
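
A hedged invocation sketch (the plotting script's command-line shape is a guess; check `benches/plot.py` for its actual arguments):

```bash
python examples/run_mock.py
# Plotting the resource timeline (argument shape is an assumption):
# python benches/plot.py reports/<prefix>_resources.csv
```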

## Extending

- Add a provider: implement `Provider.generate()` and `tokenize()`, and register it in `_load_provider`.
- Add a metric: implement it in `benches/metrics.py` and wire it into `benches/harness.py`.
- Throughput sweeps: write a wrapper that modifies the `bench_config.yaml` concurrency/batch settings and re-runs the harness to gather scaling data (see the sketch below).
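
A hypothetical sweep wrapper might look like this. The config keys follow the Configuration section above and `run_bench` is imported as in the quick start; the paths and concurrency levels are illustrative.

```python
# Hypothetical concurrency sweep; config keys follow the Configuration section,
# but check bench_config.yaml for the exact layout before relying on them.
import asyncio

import yaml  # pyyaml

from benches.harness import run_bench


def run_sweep(base_config: str = "bench_config.yaml", levels=(1, 2, 4, 8)) -> None:
    with open(base_config) as f:
        cfg = yaml.safe_load(f)
    for concurrency in levels:
        cfg["load"]["concurrency"] = concurrency
        cfg["io"]["output_prefix"] = f"reports/sweep_c{concurrency}"
        tmp_path = f"bench_config_c{concurrency}.yaml"
        with open(tmp_path, "w") as f:
            yaml.safe_dump(cfg, f)
        asyncio.run(run_bench(tmp_path))


if __name__ == "__main__":
    run_sweep()
```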

## License

MIT — do what you want, but please share interesting improvements.