A user-friendly Streamlit UI for running various lm_eval-supported benchmarks on large language models and comparing the results with one another.
Supported Benchmarks:
- gpqa_diamond_zeroshot
- gsm8k
- winogrande
- arc_challenge
- hellaswag
- truthfulqa_mc2
- mmlu
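The task names above are lm_eval (EleutherAI's lm-evaluation-harness) task identifiers. As a rough illustration of what the app drives under the hood, here is a minimal sketch that runs one benchmark through the harness's `simple_evaluate` Python API; the model ID and sample limit are placeholders, and the actual app may invoke the harness differently:

```python
# Illustrative sketch only: runs a single benchmark via the lm-evaluation-harness Python API.
# "EleutherAI/pythia-160m" is a placeholder; substitute any Hugging Face model you want to test.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["gsm8k"],                              # any task name from the list above
    limit=20,                                     # small subset for a quick smoke test
)
print(results["results"]["gsm8k"])                # per-task metrics for the selected benchmark
```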
Clone the repo:
git clone https://github.com/TeichAI/Model-Benchmark-Suite.git
cd Model-Benchmark-Suite
Install the dependencies and start the app:
pip install -r requirements.txt
streamlit run app.py
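For orientation, the sketch below shows one hypothetical way a Streamlit front end can wrap lm_eval. It is not the repository's actual app.py; the widget labels, the benchmark selector, and the placeholder model ID are illustrative assumptions only:

```python
# minimal_app.py -- hedged sketch of a Streamlit wrapper around lm_eval (not the repo's app.py).
import streamlit as st
import lm_eval  # EleutherAI lm-evaluation-harness

BENCHMARKS = [
    "gpqa_diamond_zeroshot", "gsm8k", "winogrande",
    "arc_challenge", "hellaswag", "truthfulqa_mc2", "mmlu",
]

st.title("Model Benchmark Suite")
model_id = st.text_input("Hugging Face model ID", "EleutherAI/pythia-160m")  # placeholder default
task = st.selectbox("Benchmark", BENCHMARKS)
limit = st.number_input("Sample limit (0 = full set)", min_value=0, value=50)

if st.button("Run benchmark"):
    with st.spinner(f"Running {task} on {model_id}..."):
        results = lm_eval.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_id}",
            tasks=[task],
            limit=limit or None,   # 0 means evaluate the full test set
        )
    # results["results"] maps each task name to its metric dictionary
    st.json(results["results"][task])
```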