A methodological audit and extension of LABBench2 — FutureHouse's benchmark for evaluating LLMs on biology research tasks.
LABBench2 established the standard for measuring how well frontier models handle real scientific work: literature comprehension, figure interpretation, sequence analysis, protocol reasoning. But benchmarks themselves need benchmarking. LABBench2-Pro identifies and addresses specific methodological gaps in the current evaluation framework.
Our gap analysis identified five categories of issues:
- Scoring reliability — LLM-as-judge grading introduces unquantified noise. Position bias (does answer order matter?), verbosity bias (are longer answers favored?), and inter-judge disagreement are never measured.
- Statistical reporting — Results are reported as point estimates without confidence intervals. With small per-category sample sizes, two models can appear different when their CIs overlap entirely.
- Contamination risk — No probes test whether models have memorized benchmark questions from training data.
- Coverage gaps — LABBench2 tests retrieval and comprehension but not statistical reasoning, uncertainty calibration, hypothesis generation, or structural biology interpretation.
- Atomic-only evaluation — Every task is independent. Real research requires multi-step reasoning where errors compound — reading a paper, interpreting its figures, choosing the right statistical test, designing a follow-up experiment.
Tier 1 runs frontier models against the existing LABBench2 categories, then applies the rigorous statistical analysis the original benchmark lacks:
- Bootstrap CIs — BCa 95% confidence intervals on accuracy with pairwise significance tests (Bonferroni-corrected)
- IRT Analysis — 2-parameter logistic Item Response Theory to identify low-discrimination items, compute test information functions, and recommend a pruned item set
- Judge Audit — Run two LLM judges on the same responses, measure Cohen's kappa, test position bias (swap reference/response order) and verbosity bias (shorten responses, re-judge)
- Contamination Probes — Cloze completion (can the model finish a truncated question?), reverse reconstruction (can it guess the question from the answer?), temporal split (chi-squared test on pre- vs post-cutoff accuracy)
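The bootstrap step can be sketched with `scipy.stats.bootstrap`, which supports BCa intervals directly. This is a minimal sketch, not the repo's `bootstrap_ci.py`; the resample count, pairing, and correction details are assumptions:

```python
import numpy as np
from scipy.stats import bootstrap

# Binary per-item correctness for one model on one category (synthetic here)
rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.3, size=200).astype(float)

# BCa 95% confidence interval on mean accuracy
res = bootstrap((correct,), np.mean, confidence_level=0.95,
                method="BCa", n_resamples=5000, random_state=rng)
lo, hi = res.confidence_interval
print(f"accuracy {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If two models' BCa intervals overlap, the pairwise significance tests (with Bonferroni correction across comparisons) decide whether the difference is real.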
Tier 2 adds programmatically generated tasks with deterministic or rubric-graded ground truth:
| Category | Count | Verification | Source |
|---|---|---|---|
| Statistical Reasoning | 200 | Programmatic (scipy) | Synthetic gene expression data |
| Structure Analysis | 303 | Programmatic + LLM-judge | Real PDB structures (BioPython) + synthetic gel images |
| Uncertainty Calibration | 100 | LLM-judge | LABBench2 questions with critical info stripped |
| Hypothesis Generation | 100 | LLM-judge (rubric) | Real PubMed abstracts (NCBI Entrez) |
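Programmatic ground truth for the statistical-reasoning tasks might look like the following. This is a hypothetical sketch, not the repo's `gen_stats_tasks.py`; the prompt text and task schema are illustrative:

```python
import numpy as np
from scipy import stats

# Synthesize two gene-expression groups with a known mean shift
rng = np.random.default_rng(42)
control = rng.normal(loc=5.0, scale=1.0, size=12)  # log2 expression, group A
treated = rng.normal(loc=6.2, scale=1.0, size=12)  # group B, shifted mean

# The reference answer is computed, not hand-labeled
t_stat, p_value = stats.ttest_ind(control, treated)
task = {
    "prompt": "Given these two expression vectors, is the difference "
              "significant at alpha = 0.05, and which test applies?",
    "ground_truth": {"test": "two-sample t-test", "p": float(p_value),
                     "significant": bool(p_value < 0.05)},
}
```

Because the answer comes from scipy rather than a judge, grading these tasks is deterministic.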
Tier 3 runs multi-step research workflows where each step depends on the previous answer. If the model gets step 1 wrong, the error cascades through the chain. 30 hand-authored chains across 10 template types, with 3 variants each, covering different biomedical domains:
| Template | Workflow | Steps | Example Topics |
|---|---|---|---|
| Paper to Experiment | Paper finding → data interpretation → stats test → hypothesis | 4 | SHP2, JAK2 V617F, PCSK9 |
| Structure to Drug | Protein structure → binding mechanism → SAR prediction → validation | 4 | EGFR, KRAS G12C, SARS-CoV-2 Mpro |
| Stats Pipeline | Test selection → multiple testing correction → pathway interpretation | 3 | TNBC RNA-seq, T2D GWAS, scRNA-seq TILs |
| Critical Appraisal | Evaluate weak evidence → integrate conflicting data → design definitive experiment | 3 | IDH1 glioma, Lecanemab, Microbiome |
| Genetics to Therapy | Genetic finding → structural impact → therapeutic strategy | 3 | PINK1 Parkinson's, CFTR CF, SCN1A Dravet |
| Protocol Troubleshoot | Diagnose error → interpret fix → quantitative follow-up | 3 | KRAS-BRAF co-IP, ChIP-seq, CRISPR base editing |
| Paradox Resolution | Explain paradox → discriminating experiment → synthesize conclusion | 3 | ZEB1 EMT, PD-1 hyperprogression, Exercise immunosuppression |
| Sequence to Function | Identify protein → predict adaptations → design validation | 3 | Psychrophilic LDH, AmpC β-lactamase, Novel Cas effector |
| Data to Mechanism | Interpret ambiguous data → update with evidence → correct prior analysis | 3 | GAPDH confound, Imatinib resistance, Venetoclax synergy |
| Evidence Synthesis | Compare conflicting studies → meta-analysis → clinical recommendation | 3 | ctDNA lung cancer, FLT3 AML, BRAF melanoma |
All chains use real data verified against 6 databases: PDB, UniProt, ChEMBL, ClinVar, ClinicalTrials.gov, and Open Targets. See tasks/chains/LABBench2Pro_AllExamples.md for the complete chain content and tasks/chains/VERIFICATION_REPORT.md for the data verification audit (95/95 data points verified correct).
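The core chain mechanic — feeding each step's answer into the next prompt and scoring both per-step and end-to-end — can be sketched as follows. This is a hypothetical simplification; `run_chains.py` may structure steps and grading differently:

```python
def run_chain(steps, call_model, grade):
    """steps: list of {'prompt': str with '{prev}' placeholder, 'truth': str}.
    Returns per-step accuracy and whether the whole chain succeeded."""
    prev, outcomes = "", []
    for step in steps:
        answer = call_model(step["prompt"].format(prev=prev))
        outcomes.append(grade(answer, step["truth"]))
        prev = answer  # an error here propagates into every later step
    step_acc = sum(outcomes) / len(outcomes)
    end_to_end = all(outcomes)  # one wrong step fails the chain
    return step_acc, end_to_end
```

The difference between mean per-step accuracy and the fraction of fully correct chains is the error-propagation gap reported in the results.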
labbench2-pro/
├── src/
│ ├── config.py # Model registry, API keys, cost table
│ ├── models.py # Unified model caller (Anthropic/OpenAI/Google)
│ ├── db.py # Thin asyncpg wrapper
│ ├── cache.py # Redis response cache
│ ├── api.py # FastAPI REST endpoints
│ │
│ ├── tier1/ # Methodological audit
│ │ ├── run_eval.py # Run models against tasks (HF or local)
│ │ ├── grading.py # Programmatic + LLM-judge grading
│ │ ├── bootstrap_ci.py # BCa CIs + pairwise tests
│ │ ├── irt_analysis.py # 2PL IRT + test information
│ │ ├── judge_audit.py # Inter-judge agreement + bias tests
│ │ └── contamination.py # Cloze, reverse, temporal probes
│ │
│ ├── tier2/ # Task generation
│ │ ├── gen_stats_tasks.py # Statistical reasoning (scipy ground truth)
│ │ ├── gen_structure.py # PDB parsing + gel images
│ │ ├── gen_calibration.py # Uncertainty calibration
│ │ ├── gen_hypothesis.py # PubMed-based hypothesis tasks
│ │ └── validate_tasks.py # Schema validation
│ │
│ └── tier3/ # Compositional chains
│ ├── run_chains.py # Execute chains, measure error propagation
│ ├── feedback_sim.py # Re-run with correctness signal
│ ├── gen_chains.py # Auto-generate chains (scaffolding)
│ └── cost_tracker.py # Pareto frontier + cost analysis
│
├── tasks/chains/ # 30 chains (96 steps), verified data
│ ├── tasks/ # Individual step JSON files
│ ├── chain_definitions.json # Chain wiring (step order, dependencies)
│ ├── LABBench2Pro_AllExamples.md # Full chain content (readable)
│ └── VERIFICATION_REPORT.md # Data verification audit
├── db/schema.sql # PostgreSQL schema (5 tables)
├── docker-compose.yml # Postgres 16 + Redis 7
└── run_all.sh # Single-command pipeline
No LangChain, no orchestration frameworks. Direct SDK calls, raw SQL, standalone scripts.
- Docker (for Postgres + Redis)
- Python 3.11+
- API keys:
`ANTHROPIC_API_KEY` (required for judge), plus any of `OPENAI_API_KEY`, `GOOGLE_API_KEY`
# Clone
git clone https://github.com/VibeCodingScientist/LABBench2-Pro.git
cd LABBench2-Pro
# Environment
cp .env.example .env
# Edit .env — add your API keys
# Install
pip install -e .
# Start services
docker compose up -d

# Run each model (or just the ones you have API keys for)
./run_all.sh --model claude-opus-4.6
./run_all.sh --model gpt-5.2
./run_all.sh --model gemini-2.5-pro

This runs all 6 phases automatically:
- Start Postgres + Redis, apply schema
- Generate Tier 2 tasks (stats, structures, calibration, hypothesis)
- Run Tier 1 evals against LABBench2 categories from HuggingFace
- Run Tier 2 generated tasks through eval
- Run compositional chains
- Analysis: bootstrap CIs, IRT, judge audit, contamination probes, cost summary
Results are stored in PostgreSQL. Resume-safe — if interrupted, re-running skips already-completed tasks.
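The resume-safe pattern is just a lookup before each API call. A self-contained sketch, using sqlite3 as a stand-in for the Postgres results table (the real column names and schema are assumptions):

```python
import sqlite3

def already_done(conn, model: str, task_id: str) -> bool:
    """Skip tasks that already have a stored result."""
    row = conn.execute(
        "SELECT 1 FROM results WHERE model = ? AND task_id = ?",
        (model, task_id),
    ).fetchone()
    return row is not None

# Minimal demo table; the real schema lives in db/schema.sql
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, task_id TEXT, score REAL)")
conn.execute("INSERT INTO results VALUES ('claude-opus-4.6', 'LitQA2-001', 1.0)")
```

Each eval loop calls `already_done()` first, so an interrupted run only pays for the remaining tasks.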
# Single category eval
python -m src.tier1.run_eval --model claude-opus-4.6 --category LitQA2 --concurrency 5
# Bootstrap CIs for a category
python -m src.tier1.bootstrap_ci --category LitQA2
# IRT analysis across all results
python -m src.tier1.irt_analysis
# Judge audit (sample 50 responses)
python -m src.tier1.judge_audit --category LitQA2 --sample-size 50
# Contamination probes
python -m src.tier1.contamination --model claude-opus-4.6
# Generate tasks
python -m src.tier2.gen_stats_tasks --output-dir tasks/stats_reasoning --count 200
python -m src.tier2.gen_structure --output-dir tasks/structures --pdb-count 50 --gel-count 60
# Run a specific chain
python -m src.tier3.run_chains --model claude-opus-4.6 --chain chain01
# Cost summary
python -m src.tier3.cost_tracker

# Direct SQL
PGPASSWORD=dev psql -h localhost -p 5433 -U dev -d labbench2pro
# Or start the API
uvicorn src.api:app --host 0.0.0.0 --port 8000
# GET /results/eval?category=LitQA2
# GET /results/ci?category=LitQA2
# GET /status

| Model | Provider | Input $/1M | Output $/1M | Status |
|---|---|---|---|---|
| claude-opus-4.6 | Anthropic | $15.00 | $75.00 | Complete |
| claude-sonnet-4.6 | Anthropic | $3.00 | $15.00 | Complete |
| gpt-5.2 | OpenAI | $1.75 | $14.00 | Complete |
| gemini-2.5-pro | Google | $1.25 | $10.00 | Complete |
Cross-provider comparison: 9,591 eval runs, ~$132.57 combined cost. See results/RESULTS.md for complete analysis.
| Category | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 2.5 Pro | Best |
|---|---|---|---|---|---|
| CloningScenarios | 39.4% | 27.3% | 15.2% | 27.3% | Opus |
| LitQA2 | 31.2% | 28.3% | 30.2% | 23.1% | Opus |
| SeqQA | 17.8% | 12.4% | 10.8% | 15.5% | Opus |
| ProtocolQA | 15.0% | 15.7% | 18.5% | 13.9% | GPT-5.2 |
| SuppQA | 11.1% | 6.2% | 12.2% | 7.3% | GPT-5.2 |
| FigQA | 10.5% | 5.0% | 6.1% | 8.8% | Opus |
| DbQA | 4.7% | 2.2% | 1.7% | 2.1% | Opus |
| Category | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 2.5 Pro | Best |
|---|---|---|---|---|---|
| Calibration | 100% | 100% | 100% | 94.0% | Tie |
| Hypothesis Gen. | 97.0% | 97.0% | 100% | 95.0% | GPT-5.2 |
| Structure Analysis | 45.5% | 43.9% | 52.8% | 51.2% | GPT-5.2 |
| Statistical Reasoning | 23.0% | 35.0% | 19.5% | 67.5% | Gemini |
| Metric | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 2.5 Pro |
|---|---|---|---|---|
| Step-level accuracy | 83.9% | 88.8% | 81.2% | 71.9% |
| End-to-end accuracy | 60.0% | 73.3% | 36.7% | 36.7% |
| Error propagation gap | 23.9 pp | 15.5 pp | 44.5 pp | 35.2 pp |
| Total cost | $102.81 | $14.76 | ~$8.50 | ~$6.50 |
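The error propagation gap in the table above is simply step-level minus end-to-end accuracy, in percentage points:

```python
# Values from the Tier 3 results table
step_acc = {"opus-4.6": 83.9, "sonnet-4.6": 88.8, "gpt-5.2": 81.2, "gemini-2.5-pro": 71.9}
e2e_acc  = {"opus-4.6": 60.0, "sonnet-4.6": 73.3, "gpt-5.2": 36.7, "gemini-2.5-pro": 36.7}

gap_pp = {m: round(step_acc[m] - e2e_acc[m], 1) for m in step_acc}
# e.g. gap_pp["gpt-5.2"] == 44.5: competitive per-step accuracy, worst compounding
```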
- No single model dominates. Opus leads Tier 1 retrieval (5/7 categories), GPT-5.2 leads structural biology, Gemini crushes statistical reasoning (67.5%), and Sonnet wins compositional chains.
- Anthropic models dominate multi-step workflows. Sonnet 4.6 (73.3% E2E) and Opus 4.6 (60.0%) dramatically outperform GPT-5.2 and Gemini (both 36.7%) on compositional chains.
- Gemini's statistical reasoning is a standout. 67.5% — nearly 2x the next best model (Sonnet at 35.0%). The largest single-category advantage in the entire benchmark.
- Atomic task performance does not predict compositional ability. GPT-5.2 has competitive step-level accuracy (81.2%) but the worst error propagation gap (44.5pp).
- Cost varies 16x across providers. Opus costs $102.81 vs Gemini's ~$6.50 for the full benchmark. For most research teams, non-Opus models offer better value.
- 88% of benchmark items don't discriminate between models. IRT analysis recommends pruning to 303 high-discrimination items for 8x more efficient evaluations.
- Zero contamination detected. 0% cloze match rates for Anthropic models.
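The 2PL model behind the pruning recommendation scores each item by a discrimination a and a difficulty b. A minimal per-item MLE fit, for illustration only — the repo's `irt_analysis.py` presumably fits items and model abilities jointly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_2pl_item(responses, abilities):
    """MLE of (a, b) for one item: P(correct) = sigmoid(a * (theta - b))."""
    responses = np.asarray(responses, dtype=float)

    def nll(params):
        a, b = params
        p = np.clip(expit(a * (abilities - b)), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    return minimize(nll, x0=[1.0, 0.0], method="Nelder-Mead").x

# Synthetic check: a discriminating item should recover a well above zero
rng = np.random.default_rng(1)
theta = rng.normal(size=2000)                         # latent abilities
y = rng.random(2000) < expit(2.0 * (theta - 0.5))     # true a=2.0, b=0.5
a_hat, b_hat = fit_2pl_item(y, theta)
```

Items whose fitted discrimination is near zero contribute almost no test information, which is what makes the 88% of low-discrimination items safe to prune.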
| Metric | Value |
|---|---|
| Inter-judge agreement | 90.0% |
| Cohen's kappa | 0.765 |
| Position bias | 15.0% |
| Verbosity bias | +5.0% |
| IRT items analyzed | 2,522 |
| High-discrimination items | 303 (12.0%) |
| Cloze contamination | 0.0% |
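Cohen's kappa corrects raw agreement for chance. A self-contained sketch of the computation (the repo may use a library implementation instead):

```python
def cohen_kappa(a, b):
    """Chance-corrected agreement between two judges' verdict lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Two judges agreeing on 6 of 8 balanced binary verdicts -> kappa = 0.5
judge_a = [1, 1, 1, 1, 0, 0, 0, 0]
judge_b = [1, 1, 1, 0, 0, 0, 0, 1]
```

A kappa of 0.765 at 90% raw agreement (table above) means the observed agreement is well beyond what the judges' marginal label rates would produce by chance.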
- Simplest thing that works. No abstraction until it's needed twice.
- Every script runnable standalone. No hidden dependencies between modules.
- Provider-agnostic. Unified `call_model()` interface works with Anthropic, OpenAI, and Google.
- Resume-safe. Every eval checks the DB before calling the API. Interrupted runs pick up where they left off.
- Raw SQL, no ORM. Five tables, four indexes. Schema changes = drop and recreate (results are reproducible).
- Cost-aware. Every API call is tracked. Cost-accuracy Pareto frontier identifies dominated models.
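One way to get a provider-agnostic `call_model()` without a framework is prefix-based dispatch over registered SDK wrappers. A hypothetical sketch — the repo's `models.py` may route differently, and the registered function here is a dummy, not a real SDK call:

```python
from typing import Callable, Dict

_PROVIDERS: Dict[str, Callable[[str, str], str]] = {}

def register(prefix: str):
    """Map a model-name prefix (e.g. 'claude-') to a provider callable."""
    def deco(fn: Callable[[str, str], str]):
        _PROVIDERS[prefix] = fn
        return fn
    return deco

def call_model(model: str, prompt: str) -> str:
    for prefix, fn in _PROVIDERS.items():
        if model.startswith(prefix):
            return fn(model, prompt)
    raise ValueError(f"no provider registered for {model}")

@register("claude-")
def _anthropic(model: str, prompt: str) -> str:
    # The real wrapper would call the Anthropic SDK here
    return f"[anthropic:{model}]"
```

Adding a provider is one decorated function; nothing upstream changes.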
| Directory | Contents |
|---|---|
| `results/raw/` | All 9,591 eval traces, chain traces, 60 judge audits (CSV) |
| `results/figures/` | Publication-ready figures (PDF + PNG) |
| `results/tables/` | LaTeX tables + supplementary chain traces |
| `results/summary.json` | Machine-readable results summary |
| `paper/` | Draft manuscript |
| `tasks/chains/` | 30 chains with verification report |
All 30 compositional chains are authored and verified. Contributions welcome for:
- Additional chain variants (new biomedical topics using the 10 existing templates)
- New chain templates (novel multi-step reasoning patterns)
- Running the benchmark against additional models
- Running with tool-augmented models to measure skill-based improvement
@misc{labbench2pro2026,
title={LABBench2-Pro: A Methodological Audit and Extension of Scientific LLM Evaluation},
author={Weidener, Lukas},
year={2026},
url={https://github.com/VibeCodingScientist/LABBench2-Pro}
}

Research use. Not affiliated with FutureHouse.