A benchmark for evaluating how well LLMs can retrieve, assess, and determine novelty of biomedical patent prior art.
BioPAT-NovEx comprises 300 expert-curated, patent-derived technical statements and evaluates LLM performance across three tiers: document retrieval, relevance grading, and novelty determination. It includes a 164K dual corpus of scientific papers and patents, 5,352 graded relevance judgments, and ground-truth novelty labels from a 3-LLM consensus protocol.
| Stat | Value |
|---|---|
| Statements | 300 patent-derived technical claims |
| Dual corpus | 164,649 documents (158,850 papers + 5,799 patents) |
| Relevance judgments | 5,352 graded qrels (0-3 scale) |
| Novelty labels | 300 (ANTICIPATED 70% / PARTIAL 24% / NOVEL 6%) |
| Domains | A61 (medical, 121) / C07 (organic chemistry, 88) / C12 (biochemistry, 91) |
| LLMs evaluated | GPT-5.2, Claude Sonnet 4.6, Gemini 3 Pro |
| Evaluation runs | 27 (T1: 7 methods, T2: 3 models, T3: 6 conditions + 15 ablation) |
| Format | BEIR-compatible (corpus.jsonl, queries.jsonl, qrels/*.tsv) |
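Because the release follows the BEIR layout, it can be loaded with a few lines of stdlib Python. A minimal sketch — the `_id` key and the `query-id`/`corpus-id`/`score` qrels columns follow the usual BEIR convention and are an assumption about this release, not a documented guarantee:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line (BEIR corpus.jsonl / queries.jsonl)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_qrels(path):
    """Read a TREC-style qrels TSV into {query_id: {doc_id: grade}}."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts[0] == "query-id":  # optional BEIR header row
                continue
            qid, did, grade = parts[0], parts[1], int(parts[-1])
            qrels.setdefault(qid, {})[did] = grade
    return qrels
```

The same two helpers cover the dual corpus, the NovEx queries, and both tier qrels files.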
Ground-truth novelty labels are derived from a 3-LLM consensus annotation protocol (Fleiss' kappa = 0.342): 59% of statements were labeled by unanimous agreement, 33% by majority vote, and the remaining 8% by override rules.
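The vote-resolution step can be sketched as below. The actual override rules in `novex/curate.py` are more involved; the `override` fallback here is a placeholder assumption, not the repo's logic:

```python
from collections import Counter

def consensus_label(votes, override=None):
    """Resolve a 3-annotator vote: unanimous > majority > override rule.

    `votes` is a list of three labels; `override` is a caller-supplied
    fallback applied when no two annotators agree (a simplification of
    the benchmark's actual override rules).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    if n == 3:
        return label, "unanimous"
    if n == 2:
        return label, "majority"
    return override, "override"
```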
```
src/biopat/
├── pipeline.py                # Phase 1: 8-step benchmark construction with checkpoint/resume
├── pipeline_novelty.py        # End-to-end novelty assessment pipeline
├── config.py                  # Configuration (paths, APIs, LLM, evaluation)
├── compat.py                  # Polars version compatibility layer
├── reproducibility.py         # Checksums and audit logging
│
├── ingestion/                 # Data acquisition
│   ├── ros.py                 # Reliance on Science patent-paper citations
│   ├── patentsview.py         # USPTO PatentsView API (claims, metadata)
│   └── openalex.py            # OpenAlex API (paper metadata, abstracts)
│
├── processing/                # Data transformation
│   ├── patents.py             # Patent filtering by IPC, claim extraction
│   ├── papers.py              # Paper metadata cleaning, abstract validation
│   ├── linking.py             # Patent-paper citation linking
│   ├── patent_ids.py          # Patent ID normalization (US/EP/WO)
│   ├── chemical_index.py      # Morgan fingerprints, FAISS chemical search
│   ├── sequence_index.py      # BLAST+ sequence alignment search
│   ├── prior_patents.py       # Prior patent reference filtering
│   ├── international_patents.py  # EP/WO patent support
│   ├── claim_mapper.py        # Claim-to-citation mapping
│   └── npl_parser.py          # Non-patent literature reference parsing
│
├── groundtruth/               # Relevance judgment creation
│   ├── relevance.py           # Graded relevance assignment (0-3 scale)
│   ├── temporal.py            # Temporal validation (paper before patent)
│   ├── stratification.py      # Domain-stratified sampling by IPC
│   └── ep_citations.py        # EP search report category mapping
│
├── benchmark/                 # Benchmark formatting
│   ├── beir_format.py         # BEIR output (corpus.jsonl, queries.jsonl, qrels)
│   ├── splits.py              # Train/dev/test splitting (stratified)
│   └── sampling.py            # Uniform and stratified sampling
│
├── evaluation/                # Retrieval evaluation
│   ├── llm_evaluator.py       # LLMBenchmarkRunner: orchestrates all experiments
│   ├── bm25.py                # BM25 baseline
│   ├── dense.py               # Dense retrieval (sentence-transformers, FAISS)
│   ├── hybrid.py              # BM25 + dense fusion (RRF, linear)
│   ├── reranker.py            # Cross-encoder and LLM listwise reranking
│   ├── metrics.py             # IR metrics: NDCG, MAP, MRR, P@k, R@k
│   ├── trimodal_retrieval.py  # Text + chemical + sequence fusion
│   ├── agent_retrieval.py     # Agentic dual-corpus retrieval
│   ├── agent_metrics.py       # Agent-specific metrics and refinement curves
│   ├── dual_qrels.py          # Dual corpus builder, qrel inversion
│   ├── ablation.py            # Ablation studies
│   ├── error_analysis.py      # Failure categorization and vocabulary analysis
│   ├── statistical_tests.py   # Significance testing (bootstrap, paired)
│   └── publication.py         # Report generation
│
├── retrieval/                 # Retrieval methods
│   ├── dense.py               # Dense retrieval with domain-specific models
│   ├── hybrid.py              # BM25 + dense hybrid fusion
│   ├── reranker.py            # Cross-encoder + LLM reranking
│   ├── hyde.py                # HyDE query expansion via LLM
│   ├── molecular.py           # Chemical fingerprint retrieval (RDKit)
│   ├── sequence.py            # Sequence alignment retrieval (BLAST+)
│   ├── splade.py              # Learned sparse retrieval (SPLADE)
│   └── colbert.py             # Late-interaction retrieval (ColBERT)
│
├── reasoning/                 # LLM-based novelty reasoning
│   ├── claim_parser.py        # Decompose claims into elements via LLM
│   ├── novelty_reasoner.py    # Map prior art to claims, assess novelty
│   └── explanation_generator.py  # Generate novelty reports
│
├── novex/                     # NovEx 3-tier benchmark
│   ├── curate.py              # Statement curation (8-step pipeline, 3-LLM consensus)
│   ├── annotation.py          # Multi-LLM annotation protocol
│   ├── benchmark.py           # BEIR-compatible loader with domain/category filters
│   ├── evaluator.py           # Tier 1/2/3 evaluation harness (context_k, checkpointing)
│   ├── analysis.py            # 24 analyses: tables, correlation, confusion, ablation
│   └── figures.py             # 10 publication figures (PDF + PNG)
│
└── llm/                       # Unified LLM provider interface
    ├── providers.py           # OpenAI, Anthropic, Google with consistent API
    └── cost_tracker.py        # Token tracking, cost estimation, budget enforcement
```
| Tier | Task | Input | Output | Primary Metric |
|---|---|---|---|---|
| 1 | Retrieval | Technical statement | Ranked documents from 164K dual corpus | Recall@k, NDCG@10 |
| 2 | Relevance | Statement + 50 candidates | 0-3 relevance grade per document | Weighted Cohen's kappa |
| 3 | Novelty | Statement + top-k prior art | NOVEL / PARTIAL / ANTICIPATED | Accuracy, balanced accuracy, macro-F1 |
| Method | Model | Recall@10 | NDCG@10 | Recall@100 | MAP |
|---|---|---|---|---|---|
| BM25 | — | 0.481 | 0.675 | 1.000 | 0.719 |
| BM25 + Rerank | GPT-5.2 | 0.564 | 0.775 | 0.814 | 0.661 |
| BM25 + Rerank | Claude Sonnet 4.6 | 0.563 | 0.772 | 0.814 | 0.662 |
| BM25 + Rerank | Gemini 3 Pro | 0.481 | 0.675 | 0.814 | 0.600 |
| Agent | Claude Sonnet 4.6 | 0.289 | 0.464 | 0.711 | 0.353 |
| Agent | Gemini 3 Pro | 0.396 | 0.595 | 0.587 | 0.426 |
| Agent | GPT-5.2 | 0.186 | 0.313 | 0.533 | 0.224 |
BM25 achieves perfect Recall@100. LLM reranking improves top-of-list precision (NDCG@10 up by ~15%). Agent retrieval explores the dual corpus iteratively but sacrifices recall for broader, self-directed search.
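For orientation, the BM25 baseline can be reproduced in miniature. The repo itself uses the `rank-bm25` package; the from-scratch scorer below is only an illustrative sketch of the same Okapi BM25 formula, with conventional defaults (k1=1.5, b=0.75) rather than the repo's settings:

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank whitespace-tokenized docs against a query with Okapi BM25."""
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()
    for t in toks:
        df.update(set(t))  # document frequency per term

    def idf(w):
        return math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1.0)

    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w in tf:
                # standard BM25 term saturation and length normalization
                s += idf(w) * tf[w] * (k1 + 1) / (
                    tf[w] + k1 * (1 - b + b * len(t) / avgdl)
                )
        scores.append(s)
    return sorted(range(N), key=lambda i: scores[i], reverse=True)
```

Applied to a toy corpus, the doc sharing the most query terms ranks first, as expected.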
| Model | Accuracy | MAE | Weighted Kappa |
|---|---|---|---|
| Claude Sonnet 4.6 | 0.791 | 0.209 | 0.873 |
| GPT-5.2 | 0.767 | 0.236 | 0.853 |
| Gemini 3 Pro | 0.725 | 0.285 | 0.827 |
All models show strong agreement with ground truth (weighted kappa > 0.82). Grade distribution is heavily skewed (43% grade-1, 25% grade-2, 32% grade-3), explaining the high raw agreement but moderate unweighted kappa.
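The weighted kappa reported above discounts near-miss grades (e.g. predicting 2 where the truth is 3) less than large disagreements. A minimal stdlib sketch with linear weights — the repo may use a different weighting scheme, so this is illustrative:

```python
def weighted_kappa(a, b, n_grades=4):
    """Cohen's kappa with linear disagreement weights on a 0..n_grades-1 scale."""
    N = len(a)
    # joint distribution of (rater-a grade, rater-b grade)
    obs = [[0.0] * n_grades for _ in range(n_grades)]
    for x, y in zip(a, b):
        obs[x][y] += 1 / N
    pa = [sum(obs[i][j] for j in range(n_grades)) for i in range(n_grades)]
    pb = [sum(obs[i][j] for i in range(n_grades)) for j in range(n_grades)]
    w = lambda i, j: abs(i - j) / (n_grades - 1)  # linear disagreement weight
    d_obs = sum(w(i, j) * obs[i][j] for i in range(n_grades) for j in range(n_grades))
    d_exp = sum(w(i, j) * pa[i] * pb[j] for i in range(n_grades) for j in range(n_grades))
    return 1 - d_obs / d_exp
```

Perfect agreement yields 1.0; agreement no better than chance yields 0.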
| Model | Mode | Accuracy | Bal. Acc. | Macro-F1 | F1-NOVEL | F1-PARTIAL | F1-ANTIC |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | ctx (k=10) | 0.773 | 0.643 | 0.650 | 0.529 | 0.552 | 0.869 |
| GPT-5.2 | ctx (k=10) | 0.753 | 0.678 | 0.661 | 0.615 | 0.521 | 0.848 |
| Gemini 3 Pro | ctx (k=10) | 0.753 | 0.674 | 0.550 | 0.515 | 0.250 | 0.886 |
| Claude Sonnet 4.6 | zero-shot | 0.515 | 0.507 | 0.426 | 0.170 | 0.444 | 0.663 |
| GPT-5.2 | zero-shot | 0.193 | 0.387 | 0.195 | 0.119 | 0.190 | 0.274 |
| Gemini 3 Pro | zero-shot | 0.064 | 0.335 | 0.041 | 0.114 | 0.000 | 0.010 |
Context (providing top-k prior art documents) is essential — all models collapse without it. Claude leads accuracy (77.3%), GPT-5.2 leads macro-F1 (0.661) due to stronger NOVEL class detection.
| Model | k=1 | k=3 | k=5 | k=10 | k=20 |
|---|---|---|---|---|---|
| GPT-5.2 | 0.563 | 0.703 | 0.747 | 0.757 | 0.753 |
| Claude Sonnet 4.6 | 0.706 | 0.746 | 0.759 | 0.756 | 0.756 |
| Gemini 3 Pro | 0.693 | 0.747 | 0.739 | 0.728 | 0.745 |
Performance plateaus around k=5 for all models. GPT-5.2 shows the largest gain from k=1 to k=3 (p < 0.001 by bootstrap test).
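The bootstrap test can be sketched as paired resampling over per-statement scores. The resample count and one-sided formulation below are illustrative choices; the repo's exact procedure lives in `evaluation/statistical_tests.py`:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_boot=2000, seed=0):
    """One-sided paired bootstrap p-value for 'condition A beats condition B'.

    Resamples per-item score differences with replacement and counts how
    often the mean A-minus-B gain vanishes or reverses.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    hits = 0
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:  # gain disappeared in this resample
            hits += 1
    return hits / n_boot
```

Per-statement correctness indicators (0/1) at k=1 vs k=3 would be a natural input here.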
| Metric | Value |
|---|---|
| Fleiss' kappa (3 LLMs) | 0.342 (fair agreement) |
| GPT-5.2 ↔ Claude kappa | 0.574 (moderate) |
| GPT-5.2 ↔ Gemini kappa | 0.203 (fair) |
| Claude ↔ Gemini kappa | 0.289 (fair) |
| Unanimous agreement | 178/300 (59%) |
| Majority agreement | 98/300 (33%) |
| Override required | 13/300 (4%) |
| No consensus | 11/300 (4%) |
Models reach ~88% accuracy on items with unanimous ground truth, ~65-70% on majority-vote items, and ~60% on contested items, confirming that GT agreement level tracks prediction difficulty.
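The Fleiss' kappa figures above can be reproduced from per-item category counts with a short stdlib function (a standard textbook formulation, not the repo's code):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for n items, each rated by the same number of annotators.

    `ratings` is a list of per-item category-count rows, e.g. [2, 1, 0]
    means two annotators chose category 0 and one chose category 1.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # observed per-item agreement
    P_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    P_bar = sum(P_i) / n_items
    # chance agreement from marginal category proportions
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```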
| Model | KL Divergence (pred vs GT) | Assessment |
|---|---|---|
| Claude Sonnet 4.6 | 0.0005 | Well calibrated |
| GPT-5.2 | 0.0011 | Well calibrated |
| Gemini 3 Pro | 0.1487 | Over-predicts NOVEL, under-predicts PARTIAL |
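The calibration column compares label distributions via KL divergence. A sketch of KL(pred ‖ GT) over the three novelty classes — the epsilon smoothing here is an assumption, since the repo's exact handling of zero counts is not documented in this README:

```python
import math
from collections import Counter

def label_kl(pred_labels, gt_labels,
             classes=("NOVEL", "PARTIAL", "ANTICIPATED"), eps=1e-9):
    """KL divergence between predicted and ground-truth label distributions."""
    def dist(labels):
        c = Counter(labels)
        n = len(labels)
        # floor each probability at eps to avoid log(0)
        return [max(c[k] / n, eps) for k in classes]
    p, q = dist(pred_labels), dist(gt_labels)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical distributions give 0; a model that over-predicts one class, as in the Gemini row, gives a visibly larger value.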
| Phase | Cost |
|---|---|
| Benchmark construction (curation + grading + novelty) | $38.56 |
| 3-tier evaluation (27 runs) | ~$346 |
| Total | $384.19 |
All analysis results are stored in `data/novex/analysis/`:

| File | Description |
|---|---|
| `summary.json` | Benchmark overview and best results per tier |
| `tier1_table.json` | Tier 1 retrieval metrics for all methods |
| `tier2_table.json` | Tier 2 relevance grading results |
| `tier2_grade_distribution.json` | Grade distribution explaining kappa paradox |
| `tier3_table.json` | Tier 3 novelty results with balanced accuracy |
| `context_ablation.json` | Accuracy at k=1,3,5,10,20 for all models |
| `ablation_significance.json` | Bootstrap paired significance tests for k-values |
| `confusion_matrices.json` | 3x3 confusion matrices (6 conditions) |
| `error_analysis.json` | Per-class, per-domain error breakdowns |
| `tier_correlation.json` | BM25 R@10 vs T3 accuracy correlation |
| `difficulty_stratification.json` | Performance by retrieval difficulty bucket |
| `cost_performance.json` | Cost vs metric Pareto analysis |
| `inter_model_agreement.json` | Pairwise model agreement on T2 |
| `vocabulary_gap.json` | Patent-paper vocabulary divergence |
| `cross_domain.json` | Per-domain performance breakdown |
| `per_domain.json` | Domain-level statistics |
| `doc_type_split.json` | Paper vs patent retrieval breakdown |
| `significance_tests.json` | Statistical significance across methods |
| `fleiss_kappa_gt.json` | Inter-annotator agreement (3 LLMs) |
| `gt_agreement_stratification.json` | T3 accuracy by GT agreement level |
| `dataset_statistics.json` | Corpus and statement descriptive statistics |
| `qualitative_examples.json` | Context-sensitive example predictions |
| `calibration_note.json` | Prediction distribution vs GT distribution |
| `full_analysis.json` | Combined analysis output |
Located in `data/novex/analysis/figures/`:

| Figure | Description |
|---|---|
| `fig1_tier1_recall` | Tier 1 recall@k curves by method |
| `fig2_tier2_comparison` | Tier 2 model comparison |
| `fig3_tier3_context_vs_zs` | Context vs zero-shot novelty accuracy |
| `fig4_tier3_per_class_f1` | Per-class F1 breakdown |
| `fig5_paper_vs_patent` | Paper vs patent retrieval analysis |
| `fig6_agreement_heatmap` | Inter-model agreement heatmap |
| `fig7_tier_correlation` | BM25 retrieval quality vs T3 accuracy |
| `fig8_cost_pareto` | Cost-performance Pareto frontier |
| `fig9_confusion_matrices` | T3 confusion matrices (3 models) |
| `fig10_context_ablation` | Accuracy vs context quantity (k) |
An expert validation package is included for human review of 100 strategically sampled statements:
- `data/novex/expert_validation.xlsx` — Annotatable Excel workbook (3 tabs: Instructions, Annotation, Selection Criteria)
- `data/novex/expert_validation.csv` — Flat CSV version
- `data/novex/expert_validation_guidelines.txt` — Detailed annotation guidelines
- `data/novex/expert_validation_selection.json` — Selection strategy and IDs
The 100 statements are stratified to maximize validation value: all 11 no-consensus items, all 13 override items, 46 randomly sampled majority items, and 30 unanimous controls.
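The quota scheme above can be sketched as a small sampler. The dictionary layout and fixed seed are illustrative assumptions; the repo's actual selection logic and IDs are recorded in `expert_validation_selection.json`:

```python
import random

def build_validation_sample(items, seed=42):
    """Assemble a stratified 100-statement expert sample.

    `items` maps agreement level ('no_consensus', 'override', 'majority',
    'unanimous') to statement-ID lists; quotas follow the text above:
    all contested items, plus 46 majority and 30 unanimous samples.
    """
    rng = random.Random(seed)
    sample = list(items["no_consensus"]) + list(items["override"])
    sample += rng.sample(items["majority"], 46)
    sample += rng.sample(items["unanimous"], 30)
    return sample
```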
```bash
git clone https://github.com/VibeCodingScientist/BioPAT-Benchmark.git
cd BioPAT-Benchmark
python -m venv venv
source venv/bin/activate

# Core only
pip install -e .

# With evaluation (sentence-transformers, FAISS, PyTorch)
pip install -e ".[evaluation]"

# With LLM experiments (OpenAI, Anthropic, Google)
pip install -e ".[llm]"

# Everything
pip install -e ".[all]"
```

```bash
# LLM providers
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export GOOGLE_API_KEY=your_key

# Data source APIs (for benchmark construction only)
export PATENTSVIEW_API_KEY=your_key
export NCBI_API_KEY=your_key
```

```bash
# Run 3-tier evaluation
PYTHONPATH=src python scripts/run_novex.py --tier 1 --method bm25
PYTHONPATH=src python scripts/run_novex.py --tier 2 --model claude-sonnet-4-6
PYTHONPATH=src python scripts/run_novex.py --tier 3 --context-k 10

# Analyze results
PYTHONPATH=src python scripts/analyze_novex.py --analysis all

# Generate figures
PYTHONPATH=src python scripts/generate_figures.py --format png

# Cost estimate without API calls
PYTHONPATH=src python scripts/run_novex.py --dry-run
```

```bash
docker compose build
docker compose run --rm benchmark
```

| Config file | Purpose |
|---|---|
| `configs/default.yaml` | Phase 1 pipeline (data paths, API settings, IPC filters) |
| `configs/experiments.yaml` | Full experiment suite (7 experiment types) |
| `configs/experiments_agent.yaml` | Agent dual-retrieval experiment only |
| `configs/novex.yaml` | NovEx benchmark (curation, annotation, 3-tier evaluation) |
Three providers are exposed through a unified interface (`LLMProvider.generate()`):
| Provider | Models | Cost tracking |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5.2 | Per-token input/output |
| Anthropic | Claude Opus 4.6, Sonnet 4.5, Haiku 4.5 | Per-token input/output |
| Google | Gemini 3 Pro, Gemini 2.5 Pro | Per-token input/output |
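The unified interface makes providers interchangeable in the evaluation harness. A minimal sketch of the pattern — the actual class lives in `src/biopat/llm/providers.py` and its exact signature is assumed here, with a stand-in provider for offline testing:

```python
from dataclasses import dataclass, field
from typing import Protocol

class LLMProvider(Protocol):
    """Structural interface: any object with a matching generate() qualifies.

    The real providers wrap the OpenAI, Anthropic, and Google SDKs behind
    this shape; the keyword argument shown is a hypothetical example.
    """
    def generate(self, prompt: str, *, max_tokens: int = 1024) -> str: ...

@dataclass
class EchoProvider:
    """Stand-in provider for offline tests: records calls, echoes prompts."""
    calls: list = field(default_factory=list)

    def generate(self, prompt: str, *, max_tokens: int = 1024) -> str:
        self.calls.append(prompt)
        return prompt[:max_tokens]
```

Swapping models then means constructing a different provider, with no change to evaluation code.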
| Script | Purpose |
|---|---|
| `run_novex.py` | Run NovEx 3-tier evaluation (T1/T2/T3 with checkpointing) |
| `analyze_novex.py` | Generate analysis JSON files from evaluation checkpoints |
| `generate_figures.py` | Generate all 10 publication figures (PDF + PNG) |
| `build_expert_sheet.py` | Build expert validation Excel workbook |
| `run_phase2.py` | Phase 2 API evaluation runs |
| `run_phase3.py` | Phase 3 pre-writing analyses (ablation significance, calibration) |
| `run_phase4.py` | Phase 4 gap-filling (re-runs, Fleiss' kappa, figure regen) |
| `curate_statements.py` | Statement curation pipeline (3-LLM extraction) |
| `scale_benchmark.py` | Scale benchmark from 100 to 300 statements |
| `fix_outliers.py` | Remove bad entries and replace from unused pool |
| `run_experiments.py` | Run Phase 1 retrieval experiments |
| `generate_paper_outputs.py` | Generate LaTeX tables and paper outputs |
```
data/
├── benchmark/                   # Phase 1 BEIR-format benchmark
│   ├── corpus.jsonl             # 158,850 scientific papers
│   ├── queries.jsonl            # 1,984 patent claims
│   ├── qrels/                   # 842K relevance judgments (train/dev/test)
│   └── results/                 # Retrieval experiment results
│
└── novex/                       # NovEx benchmark
    ├── statements.jsonl         # 300 curated statements
    ├── queries.jsonl            # 300 NovEx queries (BEIR format)
    ├── dual_corpus.jsonl        # 164,649 papers + patents
    ├── qrels/                   # tier1.tsv (relevance), tier3.tsv (novelty)
    ├── results/                 # Evaluation checkpoints (all 27 runs)
    ├── analysis/                # 24 analysis JSON files
    │   └── figures/             # 10 figures (PDF + PNG, 20 files)
    ├── reverse/                 # 8-step curation pipeline data
    │   └── checkpoints/         # Pipeline checkpoints (step1-step8)
    ├── expert_validation.xlsx   # Expert review workbook
    ├── expert_validation.csv    # Flat CSV version
    ├── expert_validation_guidelines.txt
    └── expert_validation_selection.json
```
Core: `httpx`, `polars`, `pyyaml`, `pydantic`, `diskcache`, `tqdm`, `rank-bm25`
Evaluation: `sentence-transformers`, `faiss-cpu`, `torch`
LLM: `openai`, `anthropic`, `google-genai`, `scipy`
Advanced: `rdkit` (chemical structures)
Python 3.11-3.12 required.
```bibtex
@software{biopat2026,
  author    = {BioPAT Contributors},
  title     = {{BioPAT-NovEx}: A 3-Tier Benchmark for LLM-Based Patent Prior Art Discovery},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/VibeCodingScientist/BioPAT-Benchmark}
}
```

MIT License — see LICENSE for details.