A methodological audit and extension of LABBench2 — FutureHouse's benchmark for evaluating LLMs on biology research tasks.
LABBench2 established the standard for measuring how well frontier models handle real scientific work: literature comprehension, figure interpretation, sequence analysis, protocol reasoning. But benchmarks themselves need benchmarking. LABBench2-Pro identifies and addresses specific methodological gaps in the current evaluation framework.
Our gap analysis identified five categories of issues:
- Scoring reliability — LLM-as-judge grading introduces unquantified noise. Position bias (does answer order matter?), verbosity bias (are longer answers favored?), and inter-judge disagreement are never measured.
- Statistical reporting — Results are reported as point estimates without confidence intervals. With small per-category sample sizes, two models can appear different when their CIs overlap entirely.
- Contamination risk — No probes test whether models have memorized benchmark questions from training data.
- Coverage gaps — LABBench2 tests retrieval and comprehension but not statistical reasoning, uncertainty calibration, hypothesis generation, or structural biology interpretation.
- Atomic-only evaluation — Every task is independent. Real research requires multi-step reasoning where errors compound — reading a paper, interpreting its figures, choosing the right statistical test, designing a follow-up experiment.
Tier 1 runs frontier models against the existing LABBench2 categories, then applies the rigorous statistical analysis the original benchmark lacks:
- Bootstrap CIs — BCa 95% confidence intervals on accuracy with pairwise significance tests (Bonferroni-corrected)
- IRT Analysis — 2-parameter logistic Item Response Theory to identify low-discrimination items, compute test information functions, and recommend a pruned item set
- Judge Audit — Run two LLM judges on the same responses, measure Cohen's kappa, test position bias (swap reference/response order) and verbosity bias (shorten responses, re-judge)
- Contamination Probes — Cloze completion (can the model finish a truncated question?), reverse reconstruction (can it guess the question from the answer?), temporal split (chi-squared test on pre- vs post-cutoff accuracy)
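The bootstrap step can be sketched with `scipy.stats.bootstrap`, which supports BCa intervals directly. This is a minimal sketch, not the repo's `bootstrap_ci.py`; the resample count, pairing, and correction details are assumptions:

```python
import numpy as np
from scipy.stats import bootstrap

# Binary per-item correctness for one model on one category (synthetic here)
rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.3, size=200).astype(float)

# BCa 95% confidence interval on mean accuracy
res = bootstrap((correct,), np.mean, confidence_level=0.95,
                method="BCa", n_resamples=5000, random_state=rng)
lo, hi = res.confidence_interval
print(f"accuracy {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If two models' BCa intervals overlap, the pairwise significance tests (with Bonferroni correction across comparisons) decide whether the difference is real.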
Tier 2 adds programmatically generated tasks with deterministic or rubric-graded ground truth:
| Category | Count | Verification | Source |
|---|---|---|---|
| Statistical Reasoning | 200 | Programmatic (scipy) | Synthetic gene expression data |
| Structure Analysis | 303 | Programmatic + LLM-judge | Real PDB structures (BioPython) + synthetic gel images |
| Uncertainty Calibration | 100 | LLM-judge | LABBench2 questions with critical info stripped |
| Hypothesis Generation | 100 | LLM-judge (rubric) | Real PubMed abstracts (NCBI Entrez) |
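Programmatic ground truth for the statistical-reasoning tasks might look like the following. This is a hypothetical sketch, not the repo's `gen_stats_tasks.py`; the prompt text and task schema are illustrative:

```python
import numpy as np
from scipy import stats

# Synthesize two gene-expression groups with a known mean shift
rng = np.random.default_rng(42)
control = rng.normal(loc=5.0, scale=1.0, size=12)  # log2 expression, group A
treated = rng.normal(loc=6.2, scale=1.0, size=12)  # group B, shifted mean

# The reference answer is computed, not hand-labeled
t_stat, p_value = stats.ttest_ind(control, treated)
task = {
    "prompt": "Given these two expression vectors, is the difference "
              "significant at alpha = 0.05, and which test applies?",
    "ground_truth": {"test": "two-sample t-test", "p": float(p_value),
                     "significant": bool(p_value < 0.05)},
}
```

Because the answer comes from scipy rather than a judge, grading these tasks is deterministic.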
Tier 3 runs multi-step research workflows where each step depends on the previous answer. If the model gets step 1 wrong, the error cascades through the chain. 30 hand-authored chains across 10 template types, with 3 variants each, covering different biomedical domains:
| Template | Workflow | Steps | Example Topics |
|---|---|---|---|
| Paper to Experiment | Paper finding → data interpretation → stats test → hypothesis | 4 | SHP2, JAK2 V617F, PCSK9 |
| Structure to Drug | Protein structure → binding mechanism → SAR prediction → validation | 4 | EGFR, KRAS G12C, SARS-CoV-2 Mpro |
| Stats Pipeline | Test selection → multiple testing correction → pathway interpretation | 3 | TNBC RNA-seq, T2D GWAS, scRNA-seq TILs |
| Critical Appraisal | Evaluate weak evidence → integrate conflicting data → design definitive experiment | 3 | IDH1 glioma, Lecanemab, Microbiome |
| Genetics to Therapy | Genetic finding → structural impact → therapeutic strategy | 3 | PINK1 Parkinson's, CFTR CF, SCN1A Dravet |
| Protocol Troubleshoot | Diagnose error → interpret fix → quantitative follow-up | 3 | KRAS-BRAF co-IP, ChIP-seq, CRISPR base editing |
| Paradox Resolution | Explain paradox → discriminating experiment → synthesize conclusion | 3 | ZEB1 EMT, PD-1 hyperprogression, Exercise immunosuppression |
| Sequence to Function | Identify protein → predict adaptations → design validation | 3 | Psychrophilic LDH, AmpC β-lactamase, Novel Cas effector |
| Data to Mechanism | Interpret ambiguous data → update with evidence → correct prior analysis | 3 | GAPDH confound, Imatinib resistance, Venetoclax synergy |
| Evidence Synthesis | Compare conflicting studies → meta-analysis → clinical recommendation | 3 | ctDNA lung cancer, FLT3 AML, BRAF melanoma |
All chains use real data verified against 6 databases: PDB, UniProt, ChEMBL, ClinVar, ClinicalTrials.gov, and Open Targets. See tasks/chains/LABBench2Pro_AllExamples.md for the complete chain content and tasks/chains/VERIFICATION_REPORT.md for the data verification audit (95/95 data points verified correct).
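The core chain mechanic — feeding each step's answer into the next prompt and scoring both per-step and end-to-end — can be sketched as follows. This is a hypothetical simplification; `run_chains.py` may structure steps and grading differently:

```python
def run_chain(steps, call_model, grade):
    """steps: list of {'prompt': str with '{prev}' placeholder, 'truth': str}.
    Returns per-step accuracy and whether the whole chain succeeded."""
    prev, outcomes = "", []
    for step in steps:
        answer = call_model(step["prompt"].format(prev=prev))
        outcomes.append(grade(answer, step["truth"]))
        prev = answer  # an error here propagates into every later step
    step_acc = sum(outcomes) / len(outcomes)
    end_to_end = all(outcomes)  # one wrong step fails the chain
    return step_acc, end_to_end
```

The difference between mean per-step accuracy and the fraction of fully correct chains is the error-propagation gap reported in the results.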
labbench2-pro/
├── src/
│ ├── config.py # Model registry, API keys, cost table
│ ├── models.py # Unified model caller (Anthropic/OpenAI/Google)
│ ├── db.py # Thin asyncpg wrapper
│ ├── cache.py # Redis response cache
│ ├── api.py # FastAPI REST endpoints
│ │
│ ├── tier1/ # Methodological audit
│ │ ├── run_eval.py # Run models against tasks (HF or local)
│ │ ├── grading.py # Programmatic + LLM-judge grading
│ │ ├── bootstrap_ci.py # BCa CIs + pairwise tests
│ │ ├── irt_analysis.py # 2PL IRT + test information
│ │ ├── judge_audit.py # Inter-judge agreement + bias tests
│ │ └── contamination.py # Cloze, reverse, temporal probes
│ │
│ ├── tier2/ # Task generation
│ │ ├── gen_stats_tasks.py # Statistical reasoning (scipy ground truth)
│ │ ├── gen_structure.py # PDB parsing + gel images
│ │ ├── gen_calibration.py # Uncertainty calibration
│ │ ├── gen_hypothesis.py # PubMed-based hypothesis tasks
│ │ └── validate_tasks.py # Schema validation
│ │
│ └── tier3/ # Compositional chains
│ ├── run_chains.py # Execute chains, measure error propagation
│ ├── feedback_sim.py # Re-run with correctness signal
│ ├── gen_chains.py # Auto-generate chains (scaffolding)
│ └── cost_tracker.py # Pareto frontier + cost analysis
│
├── tasks/chains/ # 30 chains (96 steps), verified data
│ ├── tasks/ # Individual step JSON files
│ ├── chain_definitions.json # Chain wiring (step order, dependencies)
│ ├── LABBench2Pro_AllExamples.md # Full chain content (readable)
│ └── VERIFICATION_REPORT.md # Data verification audit
├── db/schema.sql # PostgreSQL schema (5 tables)
├── docker-compose.yml # Postgres 16 + Redis 7
└── run_all.sh # Single-command pipeline
No LangChain, no orchestration frameworks. Direct SDK calls, raw SQL, standalone scripts.
- Docker (for Postgres + Redis)
- Python 3.11+
- API keys:
`ANTHROPIC_API_KEY` (required for judge), plus any of `OPENAI_API_KEY`, `GOOGLE_API_KEY`
# Clone
git clone https://github.com/VibeCodingScientist/LABBench2-Pro.git
cd LABBench2-Pro
# Environment
cp .env.example .env
# Edit .env — add your API keys
# Install
pip install -e .
# Start services
docker compose up -d

# Run each model (or just the ones you have API keys for)
./run_all.sh --model claude-opus-4.6
./run_all.sh --model gpt-5.2
./run_all.sh --model gemini-2.5-pro

This runs all 6 phases automatically:
- Start Postgres + Redis, apply schema
- Generate Tier 2 tasks (stats, structures, calibration, hypothesis)
- Run Tier 1 evals against LABBench2 categories from HuggingFace
- Run Tier 2 generated tasks through eval
- Run compositional chains
- Analysis: bootstrap CIs, IRT, judge audit, contamination probes, cost summary
Results are stored in PostgreSQL. Resume-safe — if interrupted, re-running skips already-completed tasks.
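The resume-safe pattern is just a lookup before each API call. A self-contained sketch, using sqlite3 as a stand-in for the Postgres results table (the real column names and schema are assumptions):

```python
import sqlite3

def already_done(conn, model: str, task_id: str) -> bool:
    """Skip tasks that already have a stored result."""
    row = conn.execute(
        "SELECT 1 FROM results WHERE model = ? AND task_id = ?",
        (model, task_id),
    ).fetchone()
    return row is not None

# Minimal demo table; the real schema lives in db/schema.sql
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, task_id TEXT, score REAL)")
conn.execute("INSERT INTO results VALUES ('claude-opus-4.6', 'LitQA2-001', 1.0)")
```

Each eval loop calls `already_done()` first, so an interrupted run only pays for the remaining tasks.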
# Single category eval
python -m src.tier1.run_eval --model claude-opus-4.6 --category LitQA2 --concurrency 5
# Bootstrap CIs for a category
python -m src.tier1.bootstrap_ci --category LitQA2
# IRT analysis across all results
python -m src.tier1.irt_analysis
# Judge audit (sample 50 responses)
python -m src.tier1.judge_audit --category LitQA2 --sample-size 50
# Contamination probes
python -m src.tier1.contamination --model claude-opus-4.6
# Generate tasks
python -m src.tier2.gen_stats_tasks --output-dir tasks/stats_reasoning --count 200
python -m src.tier2.gen_structure --output-dir tasks/structures --pdb-count 50 --gel-count 60
# Run a specific chain
python -m src.tier3.run_chains --model claude-opus-4.6 --chain chain01
# Cost summary
python -m src.tier3.cost_tracker

# Direct SQL
PGPASSWORD=dev psql -h localhost -p 5433 -U dev -d labbench2pro
# Or start the API
uvicorn src.api:app --host 0.0.0.0 --port 8000
# GET /results/eval?category=LitQA2
# GET /results/ci?category=LitQA2
# GET /status

| Model | Provider | Input $/1M | Output $/1M | Status |
|---|---|---|---|---|
| claude-opus-4.6 | Anthropic | $15.00 | $75.00 | Complete |
| claude-sonnet-4.6 | Anthropic | $3.00 | $15.00 | Complete |
| gpt-5.2 | OpenAI | $1.75 | $14.00 | Complete |
| gemini-2.5-pro | Google | $1.25 | $10.00 | Complete |
Cross-provider comparison: 9,591 eval runs, ~$132.57 combined cost. See results/RESULTS.md for complete analysis.
| Category | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 2.5 Pro | Best |
|---|---|---|---|---|---|
| CloningScenarios | 39.4% | 27.3% | 15.2% | 27.3% | Opus |
| LitQA2 | 31.2% | 28.3% | 30.2% | 23.1% | Opus |
| SeqQA | 17.8% | 12.4% | 10.8% | 15.5% | Opus |
| ProtocolQA | 15.0% | 15.7% | 18.5% | 13.9% | GPT-5.2 |
| SuppQA | 11.1% | 6.2% | 12.2% | 7.3% | GPT-5.2 |
| FigQA | 10.5% | 5.0% | 6.1% | 8.8% | Opus |
| DbQA | 4.7% | 2.2% | 1.7% | 2.1% | Opus |
| Category | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 2.5 Pro | Best |
|---|---|---|---|---|---|
| Calibration | 100% | 100% | 100% | 94.0% | Tie |
| Hypothesis Gen. | 97.0% | 97.0% | 100% | 95.0% | GPT-5.2 |
| Structure Analysis | 45.5% | 43.9% | 52.8% | 51.2% | GPT-5.2 |
| Statistical Reasoning | 23.0% | 35.0% | 19.5% | 67.5% | Gemini |
| Metric | Opus 4.6 | Sonnet 4.6 | GPT-5.2 | Gemini 2.5 Pro |
|---|---|---|---|---|
| Step-level accuracy | 83.9% | 88.8% | 81.2% | 71.9% |
| End-to-end accuracy | 60.0% | 73.3% | 36.7% | 36.7% |
| Error propagation gap | 23.9 pp | 15.5 pp | 44.5 pp | 35.2 pp |
| Total cost | $102.81 | $14.76 | ~$8.50 | ~$6.50 |
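The error propagation gap in the table above is simply step-level minus end-to-end accuracy, in percentage points:

```python
# Values from the Tier 3 results table
step_acc = {"opus-4.6": 83.9, "sonnet-4.6": 88.8, "gpt-5.2": 81.2, "gemini-2.5-pro": 71.9}
e2e_acc  = {"opus-4.6": 60.0, "sonnet-4.6": 73.3, "gpt-5.2": 36.7, "gemini-2.5-pro": 36.7}

gap_pp = {m: round(step_acc[m] - e2e_acc[m], 1) for m in step_acc}
# e.g. gap_pp["gpt-5.2"] == 44.5: competitive per-step accuracy, worst compounding
```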
- No single model dominates. Opus leads Tier 1 retrieval (5/7 categories), GPT-5.2 leads structural biology, Gemini crushes statistical reasoning (67.5%), and Sonnet wins compositional chains.
- Anthropic models dominate multi-step workflows. Sonnet 4.6 (73.3% E2E) and Opus 4.6 (60.0%) dramatically outperform GPT-5.2 and Gemini (both 36.7%) on compositional chains.
- Gemini's statistical reasoning is a standout. 67.5% — nearly 2x the next best model (Sonnet at 35.0%). The largest single-category advantage in the entire benchmark.
- Atomic task performance does not predict compositional ability. GPT-5.2 has competitive step-level accuracy (81.2%) but the worst error propagation gap (44.5pp).
- Cost varies 16x across providers. Opus costs $102.81 vs Gemini's ~$6.50 for the full benchmark. For most research teams, non-Opus models offer better value.
- 88% of benchmark items don't discriminate between models. IRT analysis recommends pruning to 303 high-discrimination items for 8x more efficient evaluations.
- Zero contamination detected. 0% cloze match rates for Anthropic models.
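The 2PL model behind the pruning recommendation scores each item by a discrimination a and a difficulty b. A minimal per-item MLE fit, for illustration only — the repo's `irt_analysis.py` presumably fits items and model abilities jointly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_2pl_item(responses, abilities):
    """MLE of (a, b) for one item: P(correct) = sigmoid(a * (theta - b))."""
    responses = np.asarray(responses, dtype=float)

    def nll(params):
        a, b = params
        p = np.clip(expit(a * (abilities - b)), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    return minimize(nll, x0=[1.0, 0.0], method="Nelder-Mead").x

# Synthetic check: a discriminating item should recover a well above zero
rng = np.random.default_rng(1)
theta = rng.normal(size=2000)                         # latent abilities
y = rng.random(2000) < expit(2.0 * (theta - 0.5))     # true a=2.0, b=0.5
a_hat, b_hat = fit_2pl_item(y, theta)
```

Items whose fitted discrimination is near zero contribute almost no test information, which is what makes the 88% of low-discrimination items safe to prune.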
| Metric | Value |
|---|---|
| Inter-judge agreement | 90.0% |
| Cohen's kappa | 0.765 |
| Position bias | 15.0% |
| Verbosity bias | +5.0% |
| IRT items analyzed | 2,522 |
| High-discrimination items | 303 (12.0%) |
| Cloze contamination | 0.0% |
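Cohen's kappa corrects raw agreement for chance. A self-contained sketch of the computation (the repo may use a library implementation instead):

```python
def cohen_kappa(a, b):
    """Chance-corrected agreement between two judges' verdict lists."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Two judges agreeing on 6 of 8 balanced binary verdicts -> kappa = 0.5
judge_a = [1, 1, 1, 1, 0, 0, 0, 0]
judge_b = [1, 1, 1, 0, 0, 0, 0, 1]
```

A kappa of 0.765 at 90% raw agreement (table above) means the observed agreement is well beyond what the judges' marginal label rates would produce by chance.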
- Simplest thing that works. No abstraction until it's needed twice.
- Every script runnable standalone. No hidden dependencies between modules.
- Provider-agnostic. Unified `call_model()` interface works with Anthropic, OpenAI, and Google.
- Resume-safe. Every eval checks the DB before calling the API. Interrupted runs pick up where they left off.
- Raw SQL, no ORM. Five tables, four indexes. Schema changes = drop and recreate (results are reproducible).
- Cost-aware. Every API call is tracked. Cost-accuracy Pareto frontier identifies dominated models.
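One way to get a provider-agnostic `call_model()` without a framework is prefix-based dispatch over registered SDK wrappers. A hypothetical sketch — the repo's `models.py` may route differently, and the registered function here is a dummy, not a real SDK call:

```python
from typing import Callable, Dict

_PROVIDERS: Dict[str, Callable[[str, str], str]] = {}

def register(prefix: str):
    """Map a model-name prefix (e.g. 'claude-') to a provider callable."""
    def deco(fn: Callable[[str, str], str]):
        _PROVIDERS[prefix] = fn
        return fn
    return deco

def call_model(model: str, prompt: str) -> str:
    for prefix, fn in _PROVIDERS.items():
        if model.startswith(prefix):
            return fn(model, prompt)
    raise ValueError(f"no provider registered for {model}")

@register("claude-")
def _anthropic(model: str, prompt: str) -> str:
    # The real wrapper would call the Anthropic SDK here
    return f"[anthropic:{model}]"
```

Adding a provider is one decorated function; nothing upstream changes.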
| Directory | Contents |
|---|---|
| `results/raw/` | All 9,591 eval traces, chain traces, 60 judge audits (CSV) |
| `results/figures/` | Publication-ready figures (PDF + PNG) |
| `results/tables/` | LaTeX tables + supplementary chain traces |
| `results/summary.json` | Machine-readable results summary |
| `paper/` | Draft manuscript |
| `tasks/chains/` | 30 chains with verification report |
All 30 compositional chains are authored and verified. Contributions welcome for:
- Additional chain variants (new biomedical topics using the 10 existing templates)
- New chain templates (novel multi-step reasoning patterns)
- Running the benchmark against additional models
- Running with tool-augmented models to measure skill-based improvement
@misc{labbench2pro2026,
title={LABBench2-Pro: A Methodological Audit and Extension of Scientific LLM Evaluation},
author={Weidener, Lukas},
year={2026},
url={https://github.com/VibeCodingScientist/LABBench2-Pro}
}

Research use. Not affiliated with FutureHouse.