# GIMBench

GIMBench is a benchmarking framework for evaluating Guided Infilling Models (GIM).

## Overview

This project provides tools and benchmarks to evaluate models' ability to perform guided infilling tasks: generating text that follows specific constraints and patterns.
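As a rough illustration of what "guided infilling" means, the toy example below uses a template with named holes. Each hole carries a regex constraint, and the code checks whether a model's fills satisfy those constraints. The `Hole` type, the `{name}` placeholder syntax, and the helper functions are hypothetical. They do not reflect GIMBench's actual task format.

```python
import re
from dataclasses import dataclass


@dataclass
class Hole:
    """One gap in a template (hypothetical structure, for illustration only)."""
    name: str
    pattern: str  # regex the filled text must fully match


def render(template: str, fills: dict[str, str]) -> str:
    """Substitute each {name} placeholder with the model's fill."""
    return template.format(**fills)


def check_fills(holes: list[Hole], fills: dict[str, str]) -> dict[str, bool]:
    """Report, per hole, whether a fill exists and fully matches its constraint."""
    return {
        h.name: h.name in fills and re.fullmatch(h.pattern, fills[h.name]) is not None
        for h in holes
    }


template = "Name: {name}\nGraduation year: {year}"
holes = [Hole("name", r"[A-Z][a-z]+ [A-Z][a-z]+"), Hole("year", r"(19|20)\d{2}")]
fills = {"name": "Jane Smith", "year": "2015"}  # pretend these came from a model

print(check_fills(holes, fills))  # {'name': True, 'year': True}
print(render(template, fills))
```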

## Installation

Install GIMBench using pip:

```bash
pip install gimbench
```

For development (requires [uv](https://docs.astral.sh/uv/), since `make install-dev` runs `uv sync`):

```bash
make install-dev
```

## Usage

GIMBench provides several benchmark types:

- **CV Parsing**: Evaluate models on structured information extraction from CVs
- **Regex Matching**: Test models' ability to generate text matching specific patterns
- **Multiple Choice QA**: Assess guided generation in question-answering contexts
- **Perplexity**: Measure language modeling quality with constraints
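For intuition, the sketch below shows two generic metrics that benchmarks like these commonly use: exact-match accuracy over chosen option letters, and the fraction of outputs that fully match a target regex. The function names are hypothetical, and GIMBench's own scoring may parse answers or aggregate results differently.

```python
import re


def mcqa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predicted option letters equal to the gold letters (case-insensitive)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in zip(predictions, answers))
    return correct / len(answers)


def regex_match_rate(outputs: list[str], pattern: str) -> float:
    """Fraction of generated strings that fully match the target pattern."""
    if not outputs:
        return 0.0
    compiled = re.compile(pattern)
    return sum(compiled.fullmatch(o) is not None for o in outputs) / len(outputs)


print(mcqa_accuracy(["A", "c", "B"], ["A", "C", "D"]))  # 2 of 3 correct
print(regex_match_rate(["2024-01-31", "31/01/2024"], r"\d{4}-\d{2}-\d{2}"))  # 0.5
```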

### Example Commands

Run the MMLU-Pro benchmark against a vLLM server at `http://localhost:8000/v1`:

```bash
python -m gimbench.mcqa.mmlu_pro \
--model_type vllm \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--base_url http://localhost:8000/v1
```
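To run the same benchmark against several models, a small Python wrapper can assemble these command lines. `build_command` is a hypothetical helper, not part of GIMBench. It only uses the module names and flags shown in the examples in this README, and it prints commands rather than running them.

```python
import shlex
import sys
from typing import Optional


def build_command(module: str, model_type: str, model_name: str,
                  base_url: Optional[str] = None,
                  api_key: Optional[str] = None) -> list[str]:
    """Assemble a GIMBench CLI invocation using only the flags shown in this README."""
    cmd = [sys.executable, "-m", module,
           "--model_type", model_type,
           "--model_name", model_name]
    if base_url is not None:
        cmd += ["--base_url", base_url]
    if api_key is not None:
        cmd += ["--api_key", api_key]
    return cmd


models = ["meta-llama/Llama-3.1-8B-Instruct"]  # add more model names to sweep
for name in models:
    cmd = build_command("gimbench.mcqa.mmlu_pro", "vllm", name,
                        base_url="http://localhost:8000/v1")
    # Dry run: print the command. Pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```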

Run the GPQA Diamond benchmark:

```bash
python -m gimbench.mcqa.gpqa_diamond \
--model_type openai \
--model_name gpt-4 \
--base_url https://api.openai.com/v1 \
--api_key YOUR_API_KEY
```

Run the GIM-SFT perplexity evaluation:

```bash
python -m gimbench.ppl.gim_sft \
--model_type vllm-offline \
--model_name meta-llama/Llama-3.1-8B-Instruct
```
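Perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below computes it from a list of natural-log token probabilities. Which tokens GIMBench scores and how it obtains their log-probabilities are not covered here, so treat the input list as an assumption.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# If every token has probability 0.25, the perplexity is (up to float rounding) 4.
print(perplexity([math.log(0.25)] * 8))
```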

## Development

Run linting:

```bash
make lint
```

Fix linting issues automatically:

```bash
make lint-fix
```

Run pre-commit hooks:

```bash
make pre-commit
```