GIMBench is a benchmarking framework for evaluating Guided Infilling Models (GIM).
This project provides tools and benchmarks to evaluate models' ability to perform guided infilling tasks: generating text that satisfies specific constraints and patterns.
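To make the task concrete, here is a toy sketch in plain Python of what a guided infilling check looks like: the model must fill a blank so that the completed text satisfies a constraint. This is not GIMBench's API; the template, blank marker, and pattern below are invented for illustration.

```python
import re

def satisfies(template: str, fill: str, pattern: str) -> bool:
    """Insert `fill` into the template's blank and test the constraint."""
    completed = template.replace("____", fill)
    return bool(re.fullmatch(pattern, completed))

# Toy constraint: the blank must be filled with a two-digit number.
template = "Order #____ shipped."
pattern = r"Order #\d{2} shipped\."

print(satisfies(template, "42", pattern))    # True
print(satisfies(template, "many", pattern))  # False
```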
Install GIMBench using pip:

```shell
pip install gimbench
```

For development:

```shell
make install-dev
```

GIMBench provides several benchmark types:
- CV Parsing: Evaluate models on structured information extraction from CVs
- Regex Matching: Test models' ability to generate text matching specific patterns
- Multiple Choice QA: Assess guided generation in question-answering contexts
- Perplexity: Measure language modeling quality with constraints
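As a rough illustration of the Multiple Choice QA setting, the sketch below scores answers that are constrained to a fixed label set. It is standalone Python with invented data; GIMBench's actual scoring code may differ.

```python
# Toy accuracy computation for multiple-choice QA under a choice
# constraint: predictions are forced into the label set {A, B, C, D}.
CHOICES = {"A", "B", "C", "D"}

predictions = ["A", "C", "B", "D"]   # hypothetical model outputs
gold = ["A", "B", "B", "D"]          # hypothetical reference answers

# The guided-generation constraint: every output is a valid choice.
assert all(p in CHOICES for p in predictions)

accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.75
```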
Run MMLU-Pro benchmark:

```shell
python -m gimbench.mcqa.mmlu_pro \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1
```

Run GPQA Diamond benchmark:
```shell
python -m gimbench.mcqa.gpqa_diamond \
    --model_type openai \
    --model_name gpt-4 \
    --api_key YOUR_API_KEY
```

Run GIM-SFT perplexity evaluation:
```shell
python -m gimbench.ppl.gim_sft \
    --model_type vllm-offline \
    --model_name meta-llama/Llama-3.1-8B-Instruct
```

Run linting:
```shell
make lint
```

Fix linting issues automatically:

```shell
make lint-fix
```

Run pre-commit hooks:

```shell
make pre-commit
```