BioPAT-NovEx: A 3-Tier Benchmark for LLM-Based Patent Prior Art Discovery

A benchmark for evaluating how well LLMs can retrieve, assess, and determine novelty of biomedical patent prior art.

BioPAT-NovEx constructs 300 expert-curated patent-derived technical statements and evaluates LLM performance across three tiers: document retrieval, relevance grading, and novelty determination. It includes a 164K dual corpus of scientific papers and patents, 5,352 graded relevance judgments, and ground truth novelty labels from a 3-LLM consensus protocol.


Benchmark at a Glance

| Stat | Value |
| --- | --- |
| Statements | 300 patent-derived technical claims |
| Dual corpus | 164,649 documents (158,850 papers + 5,799 patents) |
| Relevance judgments | 5,352 graded qrels (0-3 scale) |
| Novelty labels | 300 (ANTICIPATED 70% / PARTIAL 24% / NOVEL 6%) |
| Domains | A61 (medical, 121) / C07 (organic chemistry, 88) / C12 (biochemistry, 91) |
| LLMs evaluated | GPT-5.2, Claude Sonnet 4.6, Gemini 3 Pro |
| Evaluation runs | 27 (T1: 7 methods, T2: 3 models, T3: 6 conditions + 15 ablation) |
| Format | BEIR-compatible (corpus.jsonl, queries.jsonl, qrels/*.tsv) |

Ground truth novelty labels are derived from a 3-LLM consensus annotation protocol (Fleiss' kappa = 0.342): 59% of statements were labeled unanimously, 33% by majority vote, and the remaining 8% were resolved by override rules (13 override cases and 11 no-consensus cases).
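
For orientation, Fleiss' kappa over three annotators' categorical labels can be computed with a short helper. This is a generic sketch of the standard formula, not the repo's annotation code:

```python
from collections import Counter

def fleiss_kappa(ratings, categories=("NOVEL", "PARTIAL", "ANTICIPATED")):
    """Fleiss' kappa for a fixed number of raters labeling the same items.

    ratings: list of per-item label lists, e.g. [["NOVEL", "NOVEL", "PARTIAL"], ...]
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    mean_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs on this item
        mean_agreement += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    mean_agreement /= n_items
    total = n_items * n_raters
    # Chance agreement from the marginal category distribution
    expected = sum((category_totals[c] / total) ** 2 for c in categories)
    return (mean_agreement - expected) / (1 - expected)
```

On perfectly unanimous data this returns 1.0; values in the 0.2-0.4 range, as reported above, are conventionally read as "fair" agreement.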

Architecture

src/biopat/
├── pipeline.py              # Phase 1: 8-step benchmark construction with checkpoint/resume
├── pipeline_novelty.py      # End-to-end novelty assessment pipeline
├── config.py                # Configuration (paths, APIs, LLM, evaluation)
├── compat.py                # Polars version compatibility layer
├── reproducibility.py       # Checksums and audit logging
│
├── ingestion/               # Data acquisition
│   ├── ros.py               #   Reliance on Science patent-paper citations
│   ├── patentsview.py       #   USPTO PatentsView API (claims, metadata)
│   └── openalex.py          #   OpenAlex API (paper metadata, abstracts)
│
├── processing/              # Data transformation
│   ├── patents.py           #   Patent filtering by IPC, claim extraction
│   ├── papers.py            #   Paper metadata cleaning, abstract validation
│   ├── linking.py           #   Patent-paper citation linking
│   ├── patent_ids.py        #   Patent ID normalization (US/EP/WO)
│   ├── chemical_index.py    #   Morgan fingerprints, FAISS chemical search
│   ├── sequence_index.py    #   BLAST+ sequence alignment search
│   ├── prior_patents.py     #   Prior patent reference filtering
│   ├── international_patents.py  # EP/WO patent support
│   ├── claim_mapper.py      #   Claim-to-citation mapping
│   └── npl_parser.py        #   Non-patent literature reference parsing
│
├── groundtruth/             # Relevance judgment creation
│   ├── relevance.py         #   Graded relevance assignment (0-3 scale)
│   ├── temporal.py          #   Temporal validation (paper before patent)
│   ├── stratification.py    #   Domain-stratified sampling by IPC
│   └── ep_citations.py      #   EP search report category mapping
│
├── benchmark/               # Benchmark formatting
│   ├── beir_format.py       #   BEIR output (corpus.jsonl, queries.jsonl, qrels)
│   ├── splits.py            #   Train/dev/test splitting (stratified)
│   └── sampling.py          #   Uniform and stratified sampling
│
├── evaluation/              # Retrieval evaluation
│   ├── llm_evaluator.py     #   LLMBenchmarkRunner: orchestrates all experiments
│   ├── bm25.py              #   BM25 baseline
│   ├── dense.py             #   Dense retrieval (sentence-transformers, FAISS)
│   ├── hybrid.py            #   BM25 + dense fusion (RRF, linear)
│   ├── reranker.py          #   Cross-encoder and LLM listwise reranking
│   ├── metrics.py           #   IR metrics: NDCG, MAP, MRR, P@k, R@k
│   ├── trimodal_retrieval.py#   Text + chemical + sequence fusion
│   ├── agent_retrieval.py   #   Agentic dual-corpus retrieval
│   ├── agent_metrics.py     #   Agent-specific metrics and refinement curves
│   ├── dual_qrels.py        #   Dual corpus builder, qrel inversion
│   ├── ablation.py          #   Ablation studies
│   ├── error_analysis.py    #   Failure categorization and vocabulary analysis
│   ├── statistical_tests.py #   Significance testing (bootstrap, paired)
│   └── publication.py       #   Report generation
│
├── retrieval/               # Retrieval methods
│   ├── dense.py             #   Dense retrieval with domain-specific models
│   ├── hybrid.py            #   BM25 + dense hybrid fusion
│   ├── reranker.py          #   Cross-encoder + LLM reranking
│   ├── hyde.py              #   HyDE query expansion via LLM
│   ├── molecular.py         #   Chemical fingerprint retrieval (RDKit)
│   ├── sequence.py          #   Sequence alignment retrieval (BLAST+)
│   ├── splade.py            #   Learned sparse retrieval (SPLADE)
│   └── colbert.py           #   Late-interaction retrieval (ColBERT)
│
├── reasoning/               # LLM-based novelty reasoning
│   ├── claim_parser.py      #   Decompose claims into elements via LLM
│   ├── novelty_reasoner.py  #   Map prior art to claims, assess novelty
│   └── explanation_generator.py  # Generate novelty reports
│
├── novex/                   # NovEx 3-tier benchmark
│   ├── curate.py            #   Statement curation (8-step pipeline, 3-LLM consensus)
│   ├── annotation.py        #   Multi-LLM annotation protocol
│   ├── benchmark.py         #   BEIR-compatible loader with domain/category filters
│   ├── evaluator.py         #   Tier 1/2/3 evaluation harness (context_k, checkpointing)
│   ├── analysis.py          #   24 analyses: tables, correlation, confusion, ablation
│   └── figures.py           #   10 publication figures (PDF + PNG)
│
└── llm/                     # Unified LLM provider interface
    ├── providers.py         #   OpenAI, Anthropic, Google with consistent API
    └── cost_tracker.py      #   Token tracking, cost estimation, budget enforcement

3-Tier Evaluation Framework

| Tier | Task | Input | Output | Primary Metric |
| --- | --- | --- | --- | --- |
| 1 | Retrieval | Technical statement | Ranked documents from 164K dual corpus | Recall@k, NDCG@10 |
| 2 | Relevance | Statement + 50 candidates | 0-3 relevance grade per document | Weighted Cohen's kappa |
| 3 | Novelty | Statement + top-k prior art | NOVEL / PARTIAL / ANTICIPATED | Accuracy, balanced accuracy, macro-F1 |
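
As a reference point for the metric columns, a minimal graded NDCG@k can be written in a few lines. Note the linear-gain convention here; some TREC tooling uses exponential gain (2^rel - 1), so exact values may differ from the repo's metrics module:

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k with graded relevance. qrels maps doc_id -> grade (0-3)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal ordering: all judged documents sorted by grade, descending
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```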

Results

Tier 1: Document Retrieval

| Method | Model | Recall@10 | NDCG@10 | Recall@100 | MAP |
| --- | --- | --- | --- | --- | --- |
| BM25 | | 0.481 | 0.675 | 1.000 | 0.719 |
| BM25 + Rerank | GPT-5.2 | 0.564 | 0.775 | 0.814 | 0.661 |
| BM25 + Rerank | Claude Sonnet 4.6 | 0.563 | 0.772 | 0.814 | 0.662 |
| BM25 + Rerank | Gemini 3 Pro | 0.481 | 0.675 | 0.814 | 0.600 |
| Agent | Claude Sonnet 4.6 | 0.289 | 0.464 | 0.711 | 0.353 |
| Agent | Gemini 3 Pro | 0.396 | 0.595 | 0.587 | 0.426 |
| Agent | GPT-5.2 | 0.186 | 0.313 | 0.533 | 0.224 |

BM25 achieves perfect Recall@100. LLM reranking improves top-of-list precision (NDCG@10 +15%). Agent retrieval explores the dual corpus iteratively but trades recall for breadth.
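
The hybrid fusion named in the architecture (RRF in retrieval/hybrid.py) follows a standard formula; the sketch below is the generic technique, not the repo's implementation, and the constant k=60 is the commonly used default rather than a value taken from this codebase:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over multiple ranked lists of doc ids.

    rankings: list of ranked doc-id lists, best first (e.g. BM25 and dense).
    Returns all doc ids sorted by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k damps the top ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both retrievers (like "b" below, ranked 2nd and 1st) overtakes one that is 1st in only a single list.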

Tier 2: Relevance Grading (5,352 pairs)

| Model | Accuracy | MAE | Weighted Kappa |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | 0.791 | 0.209 | 0.873 |
| GPT-5.2 | 0.767 | 0.236 | 0.853 |
| Gemini 3 Pro | 0.725 | 0.285 | 0.827 |

All models show strong agreement with ground truth (weighted kappa > 0.82). Grade distribution is heavily skewed (43% grade-1, 25% grade-2, 32% grade-3), explaining the high raw agreement but moderate unweighted kappa.
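
The primary Tier 2 metric, quadratic-weighted Cohen's kappa, rewards near-misses on the ordinal 0-3 scale. A generic implementation (not the repo's evaluation code) looks like:

```python
def weighted_kappa(y_true, y_pred, n_grades=4):
    """Quadratic-weighted Cohen's kappa for ordinal grades 0..n_grades-1."""
    n = len(y_true)
    obs = [[0.0] * n_grades for _ in range(n_grades)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(n_grades)) for j in range(n_grades)]
    num = den = 0.0
    for i in range(n_grades):
        for j in range(n_grades):
            # Quadratic penalty: off-by-one costs 1/9, off-by-three costs 1
            w = (i - j) ** 2 / (n_grades - 1) ** 2
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den
```

Because disagreements between adjacent grades are penalized lightly, a skewed grade distribution can yield high weighted kappa even when unweighted kappa is only moderate, which is the "kappa paradox" the grade-distribution analysis file addresses.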

Tier 3: Novelty Determination (300 statements)

| Model | Mode | Accuracy | Bal. Acc. | Macro-F1 | F1-NOVEL | F1-PARTIAL | F1-ANTIC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | ctx (k=10) | 0.773 | 0.643 | 0.650 | 0.529 | 0.552 | 0.869 |
| GPT-5.2 | ctx (k=10) | 0.753 | 0.678 | 0.661 | 0.615 | 0.521 | 0.848 |
| Gemini 3 Pro | ctx (k=10) | 0.753 | 0.674 | 0.550 | 0.515 | 0.250 | 0.886 |
| Claude Sonnet 4.6 | zero-shot | 0.515 | 0.507 | 0.426 | 0.170 | 0.444 | 0.663 |
| GPT-5.2 | zero-shot | 0.193 | 0.387 | 0.195 | 0.119 | 0.190 | 0.274 |
| Gemini 3 Pro | zero-shot | 0.064 | 0.335 | 0.041 | 0.114 | 0.000 | 0.010 |

Context (providing top-k prior art documents) is essential — all models collapse without it. Claude leads accuracy (77.3%), GPT-5.2 leads macro-F1 (0.661) due to stronger NOVEL class detection.

Context Ablation (k = number of prior art docs)

| Model | k=1 | k=3 | k=5 | k=10 | k=20 |
| --- | --- | --- | --- | --- | --- |
| GPT-5.2 | 0.563 | 0.703 | 0.747 | 0.757 | 0.753 |
| Claude Sonnet 4.6 | 0.706 | 0.746 | 0.759 | 0.756 | 0.756 |
| Gemini 3 Pro | 0.693 | 0.747 | 0.739 | 0.728 | 0.745 |

Performance plateaus around k=5 for all models. GPT-5.2 shows the largest gain from k=1 to k=3 (p < 0.001 by bootstrap test).
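
Significance claims like the one above are typically backed by a paired bootstrap over per-item scores. The sketch below is a generic sign-flip bootstrap, not necessarily the exact procedure in statistical_tests.py:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Crude two-sided paired bootstrap p-value on per-item score differences.

    Resamples the paired differences with replacement and counts how often
    the resampled mean difference flips sign relative to the observed mean.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    flips = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if (sum(sample) / len(sample)) * observed <= 0:
            flips += 1
    return min(1.0, 2 * flips / n_resamples)
```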

Ground Truth Quality

| Metric | Value |
| --- | --- |
| Fleiss' kappa (3 LLMs) | 0.342 (fair agreement) |
| GPT-5.2 ↔ Claude kappa | 0.574 (moderate) |
| GPT-5.2 ↔ Gemini kappa | 0.203 (fair) |
| Claude ↔ Gemini kappa | 0.289 (fair) |
| Unanimous agreement | 178/300 (59%) |
| Majority agreement | 98/300 (33%) |
| Override required | 13/300 (4%) |
| No consensus | 11/300 (4%) |

Models reach ~88% accuracy on items with unanimous ground truth, ~65-70% on majority-vote items, and ~60% on contested items, confirming that GT agreement level correlates with prediction difficulty.

Calibration

| Model | KL Divergence (pred vs GT) | Assessment |
| --- | --- | --- |
| Claude Sonnet 4.6 | 0.0005 | Well calibrated |
| GPT-5.2 | 0.0011 | Well calibrated |
| Gemini 3 Pro | 0.1487 | Over-predicts NOVEL, under-predicts PARTIAL |
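
The calibration numbers compare a model's predicted label distribution against the ground-truth distribution. A minimal helper is shown below; the direction KL(pred ‖ GT) is an assumption, since the table does not specify which argument comes first:

```python
import math

def kl_divergence(pred, gt, eps=1e-9):
    """KL(pred || gt) between two label distributions over the same classes.

    pred, gt: sequences of probabilities in the same class order,
    e.g. (NOVEL, PARTIAL, ANTICIPATED). eps guards against log(0).
    """
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(pred, gt))
```

Identical distributions give ~0; a model that shifts mass toward NOVEL relative to the GT split (70/24/6) gets a positive penalty that grows with the shift.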

Cost

| Phase | Cost |
| --- | --- |
| Benchmark construction (curation + grading + novelty) | $38.56 |
| 3-tier evaluation (27 runs) | ~$346 |
| Total | $384.19 |

Analysis Outputs

All analysis results are stored in data/novex/analysis/:

| File | Description |
| --- | --- |
| summary.json | Benchmark overview and best results per tier |
| tier1_table.json | Tier 1 retrieval metrics for all methods |
| tier2_table.json | Tier 2 relevance grading results |
| tier2_grade_distribution.json | Grade distribution explaining kappa paradox |
| tier3_table.json | Tier 3 novelty results with balanced accuracy |
| context_ablation.json | Accuracy at k=1,3,5,10,20 for all models |
| ablation_significance.json | Bootstrap paired significance tests for k-values |
| confusion_matrices.json | 3x3 confusion matrices (6 conditions) |
| error_analysis.json | Per-class, per-domain error breakdowns |
| tier_correlation.json | BM25 R@10 vs T3 accuracy correlation |
| difficulty_stratification.json | Performance by retrieval difficulty bucket |
| cost_performance.json | Cost vs metric Pareto analysis |
| inter_model_agreement.json | Pairwise model agreement on T2 |
| vocabulary_gap.json | Patent-paper vocabulary divergence |
| cross_domain.json | Per-domain performance breakdown |
| per_domain.json | Domain-level statistics |
| doc_type_split.json | Paper vs patent retrieval breakdown |
| significance_tests.json | Statistical significance across methods |
| fleiss_kappa_gt.json | Inter-annotator agreement (3 LLMs) |
| gt_agreement_stratification.json | T3 accuracy by GT agreement level |
| dataset_statistics.json | Corpus and statement descriptive statistics |
| qualitative_examples.json | Context-sensitive example predictions |
| calibration_note.json | Prediction distribution vs GT distribution |
| full_analysis.json | Combined analysis output |

Figures (10 figures, PDF + PNG)

Located in data/novex/analysis/figures/:

| Figure | Description |
| --- | --- |
| fig1_tier1_recall | Tier 1 recall@k curves by method |
| fig2_tier2_comparison | Tier 2 model comparison |
| fig3_tier3_context_vs_zs | Context vs zero-shot novelty accuracy |
| fig4_tier3_per_class_f1 | Per-class F1 breakdown |
| fig5_paper_vs_patent | Paper vs patent retrieval analysis |
| fig6_agreement_heatmap | Inter-model agreement heatmap |
| fig7_tier_correlation | BM25 retrieval quality vs T3 accuracy |
| fig8_cost_pareto | Cost-performance Pareto frontier |
| fig9_confusion_matrices | T3 confusion matrices (3 models) |
| fig10_context_ablation | Accuracy vs context quantity (k) |

Expert Validation

An expert validation package is included for human review of 100 strategically sampled statements:

  • data/novex/expert_validation.xlsx — Annotatable Excel workbook (3 tabs: Instructions, Annotation, Selection Criteria)
  • data/novex/expert_validation.csv — Flat CSV version
  • data/novex/expert_validation_guidelines.txt — Detailed annotation guidelines
  • data/novex/expert_validation_selection.json — Selection strategy and IDs

The 100 statements are stratified to maximize validation value: all 11 no-consensus items, all 13 override items, 46 randomly sampled majority items, and 30 unanimous controls.

Quick Start

Installation

git clone https://github.com/VibeCodingScientist/BioPAT-Benchmark.git
cd BioPAT-Benchmark

python -m venv venv
source venv/bin/activate

# Core only
pip install -e .

# With evaluation (sentence-transformers, FAISS, PyTorch)
pip install -e ".[evaluation]"

# With LLM experiments (OpenAI, Anthropic, Google)
pip install -e ".[llm]"

# Everything
pip install -e ".[all]"

Environment Variables

# LLM providers
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export GOOGLE_API_KEY=your_key

# Data source APIs (for benchmark construction only)
export PATENTSVIEW_API_KEY=your_key
export NCBI_API_KEY=your_key

Running NovEx Evaluation

# Run 3-tier evaluation
PYTHONPATH=src python scripts/run_novex.py --tier 1 --method bm25
PYTHONPATH=src python scripts/run_novex.py --tier 2 --model claude-sonnet-4-6
PYTHONPATH=src python scripts/run_novex.py --tier 3 --context-k 10

# Analyze results
PYTHONPATH=src python scripts/analyze_novex.py --analysis all

# Generate figures
PYTHONPATH=src python scripts/generate_figures.py --format png

# Cost estimate without API calls
PYTHONPATH=src python scripts/run_novex.py --dry-run

Docker

docker compose build
docker compose run --rm benchmark

Configuration

| Config file | Purpose |
| --- | --- |
| configs/default.yaml | Phase 1 pipeline (data paths, API settings, IPC filters) |
| configs/experiments.yaml | Full experiment suite (7 experiment types) |
| configs/experiments_agent.yaml | Agent dual-retrieval experiment only |
| configs/novex.yaml | NovEx benchmark (curation, annotation, 3-tier evaluation) |

LLM Support

Three providers with a unified interface (LLMProvider.generate()):

| Provider | Models | Cost tracking |
| --- | --- | --- |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5.2 | Per-token input/output |
| Anthropic | Claude Opus 4.6, Sonnet 4.5, Haiku 4.5 | Per-token input/output |
| Google | Gemini 3 Pro, Gemini 2.5 Pro | Per-token input/output |
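
Only the method name LLMProvider.generate() is documented above. The sketch below shows one way such a unified interface plus per-token cost tracking could be structured; every class name, field, and price here is illustrative, not taken from the repo's src/biopat/llm code:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class CostTracker:
    """Accumulates token counts; prices are passed per 1M tokens."""
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, n_in: int, n_out: int) -> None:
        self.input_tokens += n_in
        self.output_tokens += n_out

    def cost(self, price_in: float, price_out: float) -> float:
        return (self.input_tokens * price_in + self.output_tokens * price_out) / 1e6

class LLMProvider(Protocol):
    def generate(self, prompt: str, max_tokens: int = 1024) -> str: ...

class EchoProvider:
    """Offline stand-in; real providers would wrap openai/anthropic/google-genai."""
    def __init__(self, tracker: CostTracker):
        self.tracker = tracker

    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        reply = f"echo: {prompt}"
        # Whitespace token counts stand in for real tokenizer usage stats
        self.tracker.record(len(prompt.split()), len(reply.split()))
        return reply
```

Code that targets the Protocol can swap providers (or a budget-enforcing wrapper) without changes, which is the point of a unified interface.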

Scripts

| Script | Purpose |
| --- | --- |
| run_novex.py | Run NovEx 3-tier evaluation (T1/T2/T3 with checkpointing) |
| analyze_novex.py | Generate analysis JSON files from evaluation checkpoints |
| generate_figures.py | Generate all 10 publication figures (PDF + PNG) |
| build_expert_sheet.py | Build expert validation Excel workbook |
| run_phase2.py | Phase 2 API evaluation runs |
| run_phase3.py | Phase 3 pre-writing analyses (ablation significance, calibration) |
| run_phase4.py | Phase 4 gap-filling (re-runs, Fleiss' kappa, figure regen) |
| curate_statements.py | Statement curation pipeline (3-LLM extraction) |
| scale_benchmark.py | Scale benchmark from 100 to 300 statements |
| fix_outliers.py | Remove bad entries and replace from unused pool |
| run_experiments.py | Run Phase 1 retrieval experiments |
| generate_paper_outputs.py | Generate LaTeX tables and paper outputs |

Data Structure

data/
├── benchmark/              # Phase 1 BEIR-format benchmark
│   ├── corpus.jsonl        #   158,850 scientific papers
│   ├── queries.jsonl       #   1,984 patent claims
│   ├── qrels/              #   842K relevance judgments (train/dev/test)
│   └── results/            #   Retrieval experiment results
│
└── novex/                  # NovEx benchmark
    ├── statements.jsonl    #   300 curated statements
    ├── queries.jsonl       #   300 NovEx queries (BEIR format)
    ├── dual_corpus.jsonl   #   164,649 papers + patents
    ├── qrels/              #   tier1.tsv (relevance), tier3.tsv (novelty)
    ├── results/            #   Evaluation checkpoints (all 27 runs)
    ├── analysis/           #   24 analysis JSON files
    │   └── figures/        #   10 figures (PDF + PNG, 20 files)
    ├── reverse/            #   8-step curation pipeline data
    │   └── checkpoints/    #   Pipeline checkpoints (step1-step8)
    ├── expert_validation.xlsx      # Expert review workbook
    ├── expert_validation.csv       # Flat CSV version
    ├── expert_validation_guidelines.txt
    └── expert_validation_selection.json
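
A minimal loader for a BEIR-style layout like the one above might look as follows. The field names (_id, title, text) and the qrels header follow the BEIR convention; the exact schema shipped in this repo may differ:

```python
import json
from pathlib import Path

def load_beir(data_dir):
    """Load corpus.jsonl, queries.jsonl, and qrels/*.tsv from a BEIR-style dir."""
    data_dir = Path(data_dir)
    corpus = {}
    with open(data_dir / "corpus.jsonl") as f:
        for line in f:
            doc = json.loads(line)
            corpus[doc["_id"]] = doc
    queries = {}
    with open(data_dir / "queries.jsonl") as f:
        for line in f:
            q = json.loads(line)
            queries[q["_id"]] = q["text"]
    qrels = {}
    for tsv in sorted((data_dir / "qrels").glob("*.tsv")):
        with open(tsv) as f:
            next(f)  # skip header: query-id <TAB> corpus-id <TAB> score
            for line in f:
                qid, did, score = line.rstrip("\n").split("\t")
                qrels.setdefault(qid, {})[did] = int(score)
    return corpus, queries, qrels
```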

Dependencies

Core: httpx, polars, pyyaml, pydantic, diskcache, tqdm, rank-bm25

Evaluation: sentence-transformers, faiss-cpu, torch

LLM: openai, anthropic, google-genai, scipy

Advanced: rdkit (chemical structures)

Python 3.11-3.12 required.

Citation

@software{biopat2026,
  author    = {BioPAT Contributors},
  title     = {{BioPAT-NovEx}: A 3-Tier Benchmark for LLM-Based Patent Prior Art Discovery},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/VibeCodingScientist/BioPAT-Benchmark}
}

License

MIT License — see LICENSE for details.
