Commit 0b1bb68

docs: README
1 parent 8f1a37d commit 0b1bb68

1 file changed: +130 -4 lines changed

README.md

Lines changed: 130 additions & 4 deletions

# LLM Bench — Accuracy • Speed • Memory

Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).

This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.

## Quick start (local, no external model)

1. Create and activate a Python virtualenv. The project supports Python 3.10+.

2. (Optional) Install dev dependencies for testing and plotting:

```bash
python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]
```

3. Run the example harness (uses the in-repo MockProvider):

```bash
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '')  # ensure the current directory (the repo root) is importable
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY
```

Output files are written to the `reports/` prefix declared in `bench_config.yaml`: per-sample JSONL, a CSV summary, a resources timeline, and a compact Markdown report.

## Configuration (`bench_config.yaml`)

Key fields:

- `provider`: select `kind: mock | ollama | openai` and provider-specific connection options.
- `io.dataset_path`: path to the JSONL dataset.
- `io.output_prefix`: prefix for the output artifacts in `reports/`.
- `prompt.system` and `prompt.template`: system message and per-sample template using `{input}` and other fields from the dataset.
- `load.concurrency` and `load.batch_size`: concurrency and batch settings.
- `limits.max_samples`: limit the number of samples for fast experiments.
- `metrics.normalization`: optional normalization (e.g., `lower_strip`) applied before computing accuracy metrics.

An example config is included as `bench_config.yaml`; an illustrative sketch of its shape follows below.
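
For orientation, here is roughly what a config of this shape might look like, embedded as a YAML string and parsed with PyYAML. The field names follow the list above, but the dataset path, prompt text, and numeric values are invented for the example; the authoritative schema is whatever `benches/harness.py` parses, so compare against the shipped `bench_config.yaml`.

```python
# Illustrative config only: field names follow the list above, while the dataset
# path, prompt text, and numeric values are invented. The real schema is defined
# by benches/harness.py, so treat the shipped bench_config.yaml as authoritative.
import yaml  # PyYAML

EXAMPLE_CONFIG = r"""
provider:
  kind: mock                                # mock | ollama | openai
io:
  dataset_path: datasets/example_qa.jsonl   # hypothetical dataset
  output_prefix: reports/example_run        # hypothetical output prefix
prompt:
  system: "You are a concise assistant."
  template: "Question: {input}\nAnswer:"
load:
  concurrency: 4
  batch_size: 1
limits:
  max_samples: 20
metrics:
  normalization: lower_strip
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["provider"]["kind"], config["limits"]["max_samples"])  # -> mock 20
```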

## Dataset formats

- Free-text QA JSONL (one object per line):

```json
{"id":"1","input":"Capital of France?","target":"Paris"}
```

- Multiple choice JSONL:

```json
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}
```
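
If you are preparing your own dataset, a small standalone check like the one below (not part of the repo) can catch malformed rows before a run. It only assumes the field names shown above.

```python
# Standalone sketch: validate that each JSONL row matches one of the two
# formats above ("input"+"target", or "input"+"choices"+"answer").
import json
import sys


def validate_jsonl(path: str) -> None:
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            free_text = "target" in row
            multiple_choice = "choices" in row and "answer" in row
            if "input" not in row or not (free_text or multiple_choice):
                raise ValueError(
                    f"{path}:{lineno}: expected 'input' plus 'target' or "
                    f"'choices'+'answer', got keys {sorted(row)}"
                )


if __name__ == "__main__":
    validate_jsonl(sys.argv[1])
    print("dataset looks OK")
```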

## Providers

Implement a `Provider` with two async methods:

- `generate(prompt, system=None, options=None) -> dict` — returns at least `output` and may provide `latency_s`, `ttft_s`, `prompt_eval_count`, `eval_count`.
- `tokenize(text) -> int` — optional but helpful for token counts.

Included adapters:

- `OllamaProvider` (calls `/api/generate` and `/api/tokenize`)
- `OpenAIStyleProvider` (calls `/v1/chat/completions`)
- `MockProvider` (local, for testing and CI)

Add your provider implementation to `benches/providers.py` and register it in `_load_provider` in `benches/harness.py`; a rough sketch follows below.
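
As an illustration, a custom provider might look roughly like the sketch below. The exact base class and return fields are defined in `benches/providers.py`; the `httpx` dependency, endpoint, and model name here are assumptions made for the example, not part of the repo.

```python
# Hypothetical provider sketch; adapt it to the actual Provider interface in
# benches/providers.py. httpx and the endpoint/model below are assumptions.
import time

import httpx


class MyHTTPProvider:
    def __init__(self, base_url: str = "http://localhost:8080", model: str = "my-model"):
        self.base_url = base_url
        self.model = model

    async def generate(self, prompt, system=None, options=None) -> dict:
        messages = [{"role": "system", "content": system}] if system else []
        messages.append({"role": "user", "content": prompt})
        start = time.perf_counter()
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                f"{self.base_url}/v1/chat/completions",
                json={"model": self.model, "messages": messages, **(options or {})},
            )
            resp.raise_for_status()
            data = resp.json()
        return {
            "output": data["choices"][0]["message"]["content"],
            "latency_s": time.perf_counter() - start,
            # ttft_s / prompt_eval_count / eval_count can be added when the API exposes them.
        }

    async def tokenize(self, text: str) -> int:
        # Crude whitespace proxy; swap in a real tokenizer if you have one.
        return len(text.split())
```

You would then register the new class under its own `kind` in `_load_provider`.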

## Metrics

- Exact Match (EM), token-level F1, and multiple-choice accuracy are implemented in `benches/metrics.py` (illustrated below).
- BLEU and ROUGE-L are optional; they require `sacrebleu` and `rouge-score` respectively.
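
For intuition, EM and token-level F1 are conventionally computed along the following lines. This is a generic sketch, not necessarily the exact code in `benches/metrics.py` (which also applies the configured normalization).

```python
# Generic EM / token-F1 sketch for intuition; benches/metrics.py may normalize
# and tokenize differently (see metrics.normalization in the config).
from collections import Counter


def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())  # rough analogue of lower_strip


def exact_match(prediction: str, target: str) -> float:
    return float(normalize(prediction) == normalize(target))


def token_f1(prediction: str, target: str) -> float:
    pred = normalize(prediction).split()
    gold = normalize(target).split()
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", " paris "))                    # 1.0
print(round(token_f1("the city of Paris", "Paris"), 2))   # 0.4
```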

## Resource monitoring

`benches/monitor.py` samples process CPU/RAM (via `psutil`) and, optionally, GPU stats via NVML.

- GPU sampling is optional; set the environment variable `LLM_BENCH_SKIP_GPU=1` to skip it (CI sets this variable by default).
- GPU support is available via the optional package extra `gpu` (the recommended package is `nvidia-ml-py`; `pynvml` is supported as a fallback).

Install the GPU extra locally with:

```bash
python -m pip install -e .[gpu]
```

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets `LLM_BENCH_SKIP_GPU=1`.
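
For a rough picture of what such a sampler does, here is a standalone CPU/RAM sampling loop using `psutil`. It is not the actual `benches/monitor.py` code, and the CSV columns are invented for the example; the real resources CSV may use different names.

```python
# Standalone sketch of a CPU/RAM sampling loop with psutil. benches/monitor.py
# has its own implementation (plus optional NVML-based GPU sampling), and its
# CSV schema may differ from the columns used here.
import csv
import time

import psutil


def sample_resources(duration_s: float = 5.0, interval_s: float = 0.5,
                     out_path: str = "resources_sample.csv") -> None:
    proc = psutil.Process()
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["t_s", "cpu_percent", "rss_mb"])
        start = time.time()
        while time.time() - start < duration_s:
            writer.writerow([
                round(time.time() - start, 2),
                proc.cpu_percent(interval=None),          # % of one core since last call
                round(proc.memory_info().rss / 1e6, 1),   # resident memory in MB
            ])
            time.sleep(interval_s)


if __name__ == "__main__":
    sample_resources()
```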

## Outputs

- `*.jsonl` — per-sample detailed results
- `*_summary.csv` — single-row summary (latency percentiles, accuracy means, token counts)
- `*_resources.csv` — timeline of CPU/RAM/(optional) GPU samples
- `*_report.md` — compact human-readable report
## Tests & CI
103+
104+
- Unit and integration tests live in `tests/`.
105+
- Run tests locally with `pytest` or `make test`.
106+
- CI (`.github/workflows/ci.yml`) runs tests and sets `LLM_BENCH_SKIP_GPU=1` so GPU sampling is skipped on GitHub runners.
107+
108+

## Examples

- `examples/run_mock.py` — programmatic example that runs the harness against the `MockProvider`.
- `benches/plot.py` — helper to plot the `*_resources.csv` timeline (requires `matplotlib` and `pandas`); a standalone plotting sketch follows below.
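
Not a substitute for `benches/plot.py`, but a minimal standalone sketch of plotting a resources CSV with `pandas` and `matplotlib`; it assumes the first column is the time axis and plots every other column against it.

```python
# Minimal sketch: plot a resources timeline CSV with pandas + matplotlib.
# Assumes the first column is the time axis; benches/plot.py knows the real schema.
import sys

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(sys.argv[1])      # e.g. the *_resources.csv produced by a run
time_col = df.columns[0]
for col in df.columns[1:]:
    plt.plot(df[time_col], df[col], label=col)
plt.xlabel(time_col)
plt.legend()
plt.tight_layout()
plt.savefig("resources.png", dpi=150)
print("wrote resources.png")
```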

## Extending

- Add a provider: implement `Provider.generate()` and `tokenize()`, and register it in `_load_provider`.
- Add a metric: implement it in `benches/metrics.py` and wire it into `benches/harness.py`.
- Throughput sweeps: write a wrapper that varies the concurrency/batch settings in `bench_config.yaml` and re-runs the harness to gather scaling data (see the sketch after this list).
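
One way such a sweep wrapper could look, assuming PyYAML is available and that `benches.harness.run_bench(config_path)` works as in the quick start; the output prefixes and concurrency values are made up for the example.

```python
# Hypothetical concurrency-sweep wrapper. Assumes PyYAML and run_bench(config_path)
# as used in the quick start; prefixes and sweep values are invented.
import asyncio
import copy

import yaml

from benches.harness import run_bench


def main() -> None:
    with open("bench_config.yaml") as fh:
        base = yaml.safe_load(fh)

    for concurrency in (1, 2, 4, 8):
        cfg = copy.deepcopy(base)
        cfg["load"]["concurrency"] = concurrency
        cfg["io"]["output_prefix"] = f"reports/sweep_c{concurrency}"
        variant_path = f"bench_config_c{concurrency}.yaml"
        with open(variant_path, "w") as fh:
            yaml.safe_dump(cfg, fh)
        asyncio.run(run_bench(variant_path))   # one full benchmark run per setting


if __name__ == "__main__":
    main()
```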

## License

MIT — do what you want, but please share interesting improvements.

## GPU (optional)

The harness can sample NVIDIA GPU stats via NVML. This is optional — GitHub Actions runners don't have GPUs, and CI skips GPU sampling by default.

To install the optional GPU dependency locally:

```bash
python -m pip install -e .[gpu]
# or: pip install nvidia-ml-py
```

CI note: the provided GitHub Actions workflow sets `LLM_BENCH_SKIP_GPU=1` so GPU sampling is disabled in CI.
