# LLM Bench — Accuracy • Speed • Memory

Lightweight, configurable benchmark harness for LLM providers. Measure accuracy (EM/F1/MC), latency/throughput, and resource timelines (CPU/RAM, optional GPU).

This repository is intentionally small and pluggable — add adapters for your provider and drop in any JSONL dataset.

## Quick start (local, no external model)

1. Create and activate a Python virtualenv. The project supports Python 3.10+.
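
For example, a standard `venv` workflow (any environment manager works):

```bash
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
```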

2. (Optional) Install dev dependencies for testing and plotting:

```bash
python -m pip install -r requirements-dev.txt
# Optional GPU support: python -m pip install -e .[gpu]
```

3. Run the example harness (uses an in-repo MockProvider):

```bash
python - <<'PY'
import asyncio, sys
sys.path.insert(0, '')
from benches.harness import run_bench
asyncio.run(run_bench('bench_config.yaml'))
PY
```

Output files are written under the `reports/` prefix declared in `bench_config.yaml`: per-sample JSONL, a CSV summary, a resources timeline, and a compact Markdown report.

## Configuration (`bench_config.yaml`)

Key fields:

- `provider`: select `kind: mock | ollama | openai` plus provider-specific connection options.
- `io.dataset_path`: path to the JSONL dataset.
- `io.output_prefix`: prefix for the output artifacts in `reports/`.
- `prompt.system` and `prompt.template`: system message and per-sample template using `{input}` and other fields from the dataset.
- `load.concurrency` and `load.batch_size`: concurrency and batch settings.
- `limits.max_samples`: cap the number of samples for fast experiments.
- `metrics.normalization`: optional normalization (e.g., `lower_strip`) applied before computing accuracy metrics.

An example config is included as `bench_config.yaml`.
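
For orientation, a config might look roughly like the sketch below. Values are illustrative and the dataset filename is hypothetical; the field names follow the list above, but defer to the shipped `bench_config.yaml` for the authoritative layout.

```yaml
provider:
  kind: mock                               # mock | ollama | openai
io:
  dataset_path: datasets/qa_tiny.jsonl     # hypothetical dataset file
  output_prefix: reports/mock_run
prompt:
  system: "You are a concise assistant."
  template: "Q: {input}\nA:"
load:
  concurrency: 4
  batch_size: 1
limits:
  max_samples: 20
metrics:
  normalization: lower_strip
```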

## Dataset formats

- Free-text QA JSONL (one object per line):

```json
{"id":"1","input":"Capital of France?","target":"Paris"}
```

- Multiple-choice JSONL:

```json
{"id":"1","input":"Capital of France?","choices":["Paris","Lyon"],"answer":"Paris"}
```

## Providers

Implement a `Provider` with two async methods:

- `generate(prompt, system=None, options=None) -> dict` — returns at least `output` and may provide `latency_s`, `ttft_s`, `prompt_eval_count`, `eval_count`.
- `tokenize(text) -> int` — optional but helpful for token counts.

Included adapters:

- `OllamaProvider` (calls `/api/generate` and `/api/tokenize`)
- `OpenAIStyleProvider` (calls `/v1/chat/completions`)
- `MockProvider` (local, for testing and CI)

Add your provider implementation to `benches/providers.py` and register it in `_load_provider` in `benches/harness.py`.
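
A minimal custom adapter might look like the sketch below. Only the `generate`/`tokenize` contract above comes from this project; the class name and echo behaviour are placeholders, so check `benches/providers.py` and `_load_provider` for the real base class and registration details.

```python
# Hypothetical adapter sketch; only the generate()/tokenize() contract comes
# from the docs above, the rest is placeholder behaviour.
import time


class EchoProvider:
    """Toy provider that echoes the prompt back; useful as a starting template."""

    async def generate(self, prompt, system=None, options=None) -> dict:
        start = time.perf_counter()
        output = f"echo: {prompt}"  # replace with a real API call
        return {
            "output": output,                           # required
            "latency_s": time.perf_counter() - start,   # optional timing field
            "eval_count": len(output.split()),          # optional token stats
        }

    async def tokenize(self, text: str) -> int:
        # Optional: whitespace approximation when no real tokenizer is available.
        return len(text.split())
```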

## Metrics

- Exact Match (EM), token-level F1, and multiple-choice accuracy are implemented in `benches/metrics.py` (see the F1 sketch below).
- BLEU and ROUGE-L are optional; they require `sacrebleu` and `rouge-score`, respectively.
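
For intuition, token-level F1 is the harmonic mean of token precision and recall between prediction and target. The sketch below mirrors the common SQuAD-style definition and is illustrative only; the actual implementation (including any `lower_strip` normalization) lives in `benches/metrics.py`.

```python
# Illustrative token-level F1; benches/metrics.py is authoritative.
from collections import Counter


def token_f1(prediction: str, target: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = target.split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```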

## Resource monitoring

`benches/monitor.py` samples process CPU/RAM (via `psutil`) and optionally GPU stats via NVML.

- GPU sampling is optional; set the environment variable `LLM_BENCH_SKIP_GPU=1` to disable it (CI sets this by default).
- GPU support is available via the optional package extra `gpu` (recommended package `nvidia-ml-py`; a fallback to `pynvml` is supported).

Install the GPU extra locally with:

```bash
python -m pip install -e .[gpu]
```

Note: GitHub-hosted runners do not provide GPUs; the CI workflow sets `LLM_BENCH_SKIP_GPU=1`.
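
As a rough illustration of what this kind of sampling involves (a standalone sketch using `psutil` directly, not the actual API of `benches/monitor.py`):

```python
# Standalone illustration of a CPU/RAM sampling loop; benches/monitor.py is
# the real implementation and may differ in structure and field names.
import os
import time

import psutil


def sample_resources(duration_s: float = 5.0, interval_s: float = 0.5) -> list[dict]:
    proc = psutil.Process()
    skip_gpu = os.environ.get("LLM_BENCH_SKIP_GPU") == "1"
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append({
            "t": time.time(),
            "cpu_percent": proc.cpu_percent(interval=None),
            "rss_mb": proc.memory_info().rss / 1e6,
            # When skip_gpu is False, NVML-based GPU stats would be appended here.
        })
        time.sleep(interval_s)
    return samples
```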

## Outputs

- `*.jsonl` — per-sample detailed results
- `*_summary.csv` — single-row summary (latency percentiles, accuracy means, token counts)
- `*_resources.csv` — timeline of CPU/RAM and optional GPU samples
- `*_report.md` — compact human-readable report

## Tests & CI

- Unit and integration tests live in `tests/`.
- Run tests locally with `pytest` or `make test` (example below).
- CI (`.github/workflows/ci.yml`) runs the tests and sets `LLM_BENCH_SKIP_GPU=1` so GPU sampling is skipped on GitHub runners.
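
For example:

```bash
pytest -q
# or, mirroring CI (GPU sampling skipped):
LLM_BENCH_SKIP_GPU=1 make test
```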

## Examples

- `examples/run_mock.py` — programmatic example that runs the harness against the `MockProvider` (see the invocation sketch below).
- `benches/plot.py` — helper to plot the `*_resources.csv` timeline (requires `matplotlib` and `pandas`).
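
A hedged invocation sketch (the plotting script's command-line shape is a guess; check `benches/plot.py` for its actual arguments):

```bash
python examples/run_mock.py
# Plotting the resource timeline (argument shape is an assumption):
# python benches/plot.py reports/<prefix>_resources.csv
```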

## Extending

- Add a provider: implement `Provider.generate()` and `tokenize()`, and register it in `_load_provider`.
- Add a metric: implement it in `benches/metrics.py` and wire it into `benches/harness.py`.
- Throughput sweeps: write a wrapper that modifies the `bench_config.yaml` concurrency/batch settings and re-runs the harness to gather scaling data (see the sketch below).
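
A hypothetical sweep wrapper might look like this. The config keys follow the Configuration section above and `run_bench` is imported as in the quick start; the paths and concurrency levels are illustrative.

```python
# Hypothetical concurrency sweep; config keys follow the Configuration section,
# but check bench_config.yaml for the exact layout before relying on them.
import asyncio

import yaml  # pyyaml

from benches.harness import run_bench


def run_sweep(base_config: str = "bench_config.yaml", levels=(1, 2, 4, 8)) -> None:
    with open(base_config) as f:
        cfg = yaml.safe_load(f)
    for concurrency in levels:
        cfg["load"]["concurrency"] = concurrency
        cfg["io"]["output_prefix"] = f"reports/sweep_c{concurrency}"
        tmp_path = f"bench_config_c{concurrency}.yaml"
        with open(tmp_path, "w") as f:
            yaml.safe_dump(cfg, f)
        asyncio.run(run_bench(tmp_path))


if __name__ == "__main__":
    run_sweep()
```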

## License

MIT — do what you want, but please share interesting improvements.