A benchmark for evaluating how well LLMs can retrieve, assess, and determine novelty of biomedical patent prior art.
BioPAT-NovEx comprises 300 expert-curated, patent-derived technical statements and evaluates LLM performance across three tiers: document retrieval, relevance grading, and novelty determination. It includes a 164K dual corpus of scientific papers and patents, 5,352 graded relevance judgments, and ground-truth novelty labels from a 3-LLM consensus protocol.
| Stat | Value |
|---|---|
| Statements | 300 patent-derived technical claims |
| Dual corpus | 164,649 documents (158,850 papers + 5,799 patents) |
| Relevance judgments | 5,352 graded qrels (0-3 scale) |
| Novelty labels | 300 (ANTICIPATED 70% / PARTIAL 24% / NOVEL 6%) |
| Domains | A61 (medical, 121) / C07 (organic chemistry, 88) / C12 (biochemistry, 91) |
| LLMs evaluated | GPT-5.2, Claude Sonnet 4.6, Gemini 3 Pro |
| Evaluation runs | 27 (T1: 7 methods, T2: 3 models, T3: 6 conditions + 15 ablation) |
| Format | BEIR-compatible (corpus.jsonl, queries.jsonl, qrels/*.tsv) |
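Because the release follows the BEIR layout, it can be loaded with a few lines of stdlib Python. A minimal sketch — the `_id` key and the `query-id`/`corpus-id`/`score` qrels columns follow the usual BEIR convention and are an assumption about this release, not a documented guarantee:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line (BEIR corpus.jsonl / queries.jsonl)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_qrels(path):
    """Read a TREC-style qrels TSV into {query_id: {doc_id: grade}}."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts[0] == "query-id":  # optional BEIR header row
                continue
            qid, did, grade = parts[0], parts[1], int(parts[-1])
            qrels.setdefault(qid, {})[did] = grade
    return qrels
```

The same two helpers cover the dual corpus, the NovEx queries, and both tier qrels files.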
Ground-truth novelty labels are derived from a 3-LLM consensus annotation protocol (Fleiss' kappa = 0.342): 59% of statements were labeled by unanimous agreement, 33% by majority vote, and the remaining 8% by override rules.
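The vote-resolution step can be sketched as below. The actual override rules in `novex/curate.py` are more involved; the `override` fallback here is a placeholder assumption, not the repo's logic:

```python
from collections import Counter

def consensus_label(votes, override=None):
    """Resolve a 3-annotator vote: unanimous > majority > override rule.

    `votes` is a list of three labels; `override` is a caller-supplied
    fallback applied when no two annotators agree (a simplification of
    the benchmark's actual override rules).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    if n == 3:
        return label, "unanimous"
    if n == 2:
        return label, "majority"
    return override, "override"
```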
```
src/biopat/
├── pipeline.py                # Phase 1: 8-step benchmark construction with checkpoint/resume
├── pipeline_novelty.py        # End-to-end novelty assessment pipeline
├── config.py                  # Configuration (paths, APIs, LLM, evaluation)
├── compat.py                  # Polars version compatibility layer
├── reproducibility.py         # Checksums and audit logging
│
├── ingestion/                 # Data acquisition
│   ├── ros.py                 # Reliance on Science patent-paper citations
│   ├── patentsview.py         # USPTO PatentsView API (claims, metadata)
│   └── openalex.py            # OpenAlex API (paper metadata, abstracts)
│
├── processing/                # Data transformation
│   ├── patents.py             # Patent filtering by IPC, claim extraction
│   ├── papers.py              # Paper metadata cleaning, abstract validation
│   ├── linking.py             # Patent-paper citation linking
│   ├── patent_ids.py          # Patent ID normalization (US/EP/WO)
│   ├── chemical_index.py      # Morgan fingerprints, FAISS chemical search
│   ├── sequence_index.py      # BLAST+ sequence alignment search
│   ├── prior_patents.py       # Prior patent reference filtering
│   ├── international_patents.py  # EP/WO patent support
│   ├── claim_mapper.py        # Claim-to-citation mapping
│   └── npl_parser.py          # Non-patent literature reference parsing
│
├── groundtruth/               # Relevance judgment creation
│   ├── relevance.py           # Graded relevance assignment (0-3 scale)
│   ├── temporal.py            # Temporal validation (paper before patent)
│   ├── stratification.py      # Domain-stratified sampling by IPC
│   └── ep_citations.py        # EP search report category mapping
│
├── benchmark/                 # Benchmark formatting
│   ├── beir_format.py         # BEIR output (corpus.jsonl, queries.jsonl, qrels)
│   ├── splits.py              # Train/dev/test splitting (stratified)
│   └── sampling.py            # Uniform and stratified sampling
│
├── evaluation/                # Retrieval evaluation
│   ├── llm_evaluator.py       # LLMBenchmarkRunner: orchestrates all experiments
│   ├── bm25.py                # BM25 baseline
│   ├── dense.py               # Dense retrieval (sentence-transformers, FAISS)
│   ├── hybrid.py              # BM25 + dense fusion (RRF, linear)
│   ├── reranker.py            # Cross-encoder and LLM listwise reranking
│   ├── metrics.py             # IR metrics: NDCG, MAP, MRR, P@k, R@k
│   ├── trimodal_retrieval.py  # Text + chemical + sequence fusion
│   ├── agent_retrieval.py     # Agentic dual-corpus retrieval
│   ├── agent_metrics.py       # Agent-specific metrics and refinement curves
│   ├── dual_qrels.py          # Dual corpus builder, qrel inversion
│   ├── ablation.py            # Ablation studies
│   ├── error_analysis.py      # Failure categorization and vocabulary analysis
│   ├── statistical_tests.py   # Significance testing (bootstrap, paired)
│   └── publication.py         # Report generation
│
├── retrieval/                 # Retrieval methods
│   ├── dense.py               # Dense retrieval with domain-specific models
│   ├── hybrid.py              # BM25 + dense hybrid fusion
│   ├── reranker.py            # Cross-encoder + LLM reranking
│   ├── hyde.py                # HyDE query expansion via LLM
│   ├── molecular.py           # Chemical fingerprint retrieval (RDKit)
│   ├── sequence.py            # Sequence alignment retrieval (BLAST+)
│   ├── splade.py              # Learned sparse retrieval (SPLADE)
│   └── colbert.py             # Late-interaction retrieval (ColBERT)
│
├── reasoning/                 # LLM-based novelty reasoning
│   ├── claim_parser.py        # Decompose claims into elements via LLM
│   ├── novelty_reasoner.py    # Map prior art to claims, assess novelty
│   └── explanation_generator.py  # Generate novelty reports
│
├── novex/                     # NovEx 3-tier benchmark
│   ├── curate.py              # Statement curation (8-step pipeline, 3-LLM consensus)
│   ├── annotation.py          # Multi-LLM annotation protocol
│   ├── benchmark.py           # BEIR-compatible loader with domain/category filters
│   ├── evaluator.py           # Tier 1/2/3 evaluation harness (context_k, checkpointing)
│   ├── analysis.py            # 24 analyses: tables, correlation, confusion, ablation
│   └── figures.py             # 10 publication figures (PDF + PNG)
│
└── llm/                       # Unified LLM provider interface
    ├── providers.py           # OpenAI, Anthropic, Google with consistent API
    └── cost_tracker.py        # Token tracking, cost estimation, budget enforcement
```
| Tier | Task | Input | Output | Primary Metric |
|---|---|---|---|---|
| 1 | Retrieval | Technical statement | Ranked documents from 164K dual corpus | Recall@k, NDCG@10 |
| 2 | Relevance | Statement + 50 candidates | 0-3 relevance grade per document | Weighted Cohen's kappa |
| 3 | Novelty | Statement + top-k prior art | NOVEL / PARTIAL / ANTICIPATED | Accuracy, balanced accuracy, macro-F1 |
| Method | Model | Recall@10 | NDCG@10 | Recall@100 | MAP |
|---|---|---|---|---|---|
| BM25 | — | 0.481 | 0.675 | 1.000 | 0.719 |
| BM25 + Rerank | GPT-5.2 | 0.564 | 0.775 | 0.814 | 0.661 |
| BM25 + Rerank | Claude Sonnet 4.6 | 0.563 | 0.772 | 0.814 | 0.662 |
| BM25 + Rerank | Gemini 3 Pro | 0.481 | 0.675 | 0.814 | 0.600 |
| Agent | Claude Sonnet 4.6 | 0.289 | 0.464 | 0.711 | 0.353 |
| Agent | Gemini 3 Pro | 0.396 | 0.595 | 0.587 | 0.426 |
| Agent | GPT-5.2 | 0.186 | 0.313 | 0.533 | 0.224 |
BM25 achieves perfect Recall@100. LLM reranking improves top-of-list precision (NDCG@10 up by ~15%). Agent retrieval explores the dual corpus iteratively but sacrifices recall for broader, self-directed search.
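For orientation, the BM25 baseline can be reproduced in miniature. The repo itself uses the `rank-bm25` package; the from-scratch scorer below is only an illustrative sketch of the same Okapi BM25 formula, with conventional defaults (k1=1.5, b=0.75) rather than the repo's settings:

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank whitespace-tokenized docs against a query with Okapi BM25."""
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()
    for t in toks:
        df.update(set(t))  # document frequency per term

    def idf(w):
        return math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1.0)

    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w in tf:
                # standard BM25 term saturation and length normalization
                s += idf(w) * tf[w] * (k1 + 1) / (
                    tf[w] + k1 * (1 - b + b * len(t) / avgdl)
                )
        scores.append(s)
    return sorted(range(N), key=lambda i: scores[i], reverse=True)
```

Applied to a toy corpus, the doc sharing the most query terms ranks first, as expected.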
| Model | Accuracy | MAE | Weighted Kappa |
|---|---|---|---|
| Claude Sonnet 4.6 | 0.791 | 0.209 | 0.873 |
| GPT-5.2 | 0.767 | 0.236 | 0.853 |
| Gemini 3 Pro | 0.725 | 0.285 | 0.827 |
All models show strong agreement with ground truth (weighted kappa > 0.82). Grade distribution is heavily skewed (43% grade-1, 25% grade-2, 32% grade-3), explaining the high raw agreement but moderate unweighted kappa.
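The weighted kappa reported above discounts near-miss grades (e.g. predicting 2 where the truth is 3) less than large disagreements. A minimal stdlib sketch with linear weights — the repo may use a different weighting scheme, so this is illustrative:

```python
def weighted_kappa(a, b, n_grades=4):
    """Cohen's kappa with linear disagreement weights on a 0..n_grades-1 scale."""
    N = len(a)
    # joint distribution of (rater-a grade, rater-b grade)
    obs = [[0.0] * n_grades for _ in range(n_grades)]
    for x, y in zip(a, b):
        obs[x][y] += 1 / N
    pa = [sum(obs[i][j] for j in range(n_grades)) for i in range(n_grades)]
    pb = [sum(obs[i][j] for i in range(n_grades)) for j in range(n_grades)]
    w = lambda i, j: abs(i - j) / (n_grades - 1)  # linear disagreement weight
    d_obs = sum(w(i, j) * obs[i][j] for i in range(n_grades) for j in range(n_grades))
    d_exp = sum(w(i, j) * pa[i] * pb[j] for i in range(n_grades) for j in range(n_grades))
    return 1 - d_obs / d_exp
```

Perfect agreement yields 1.0; agreement no better than chance yields 0.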
| Model | Mode | Accuracy | Bal. Acc. | Macro-F1 | F1-NOVEL | F1-PARTIAL | F1-ANTIC |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | ctx (k=10) | 0.773 | 0.643 | 0.650 | 0.529 | 0.552 | 0.869 |
| GPT-5.2 | ctx (k=10) | 0.753 | 0.678 | 0.661 | 0.615 | 0.521 | 0.848 |
| Gemini 3 Pro | ctx (k=10) | 0.753 | 0.674 | 0.550 | 0.515 | 0.250 | 0.886 |
| Claude Sonnet 4.6 | zero-shot | 0.515 | 0.507 | 0.426 | 0.170 | 0.444 | 0.663 |
| GPT-5.2 | zero-shot | 0.193 | 0.387 | 0.195 | 0.119 | 0.190 | 0.274 |
| Gemini 3 Pro | zero-shot | 0.064 | 0.335 | 0.041 | 0.114 | 0.000 | 0.010 |
Context (providing top-k prior art documents) is essential — all models collapse without it. Claude leads accuracy (77.3%), GPT-5.2 leads macro-F1 (0.661) due to stronger NOVEL class detection.
| Model | k=1 | k=3 | k=5 | k=10 | k=20 |
|---|---|---|---|---|---|
| GPT-5.2 | 0.563 | 0.703 | 0.747 | 0.757 | 0.753 |
| Claude Sonnet 4.6 | 0.706 | 0.746 | 0.759 | 0.756 | 0.756 |
| Gemini 3 Pro | 0.693 | 0.747 | 0.739 | 0.728 | 0.745 |
Performance plateaus around k=5 for all models. GPT-5.2 shows the largest gain from k=1 to k=3 (p < 0.001 by bootstrap test).
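The bootstrap test can be sketched as paired resampling over per-statement scores. The resample count and one-sided formulation below are illustrative choices; the repo's exact procedure lives in `evaluation/statistical_tests.py`:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_boot=2000, seed=0):
    """One-sided paired bootstrap p-value for 'condition A beats condition B'.

    Resamples per-item score differences with replacement and counts how
    often the mean A-minus-B gain vanishes or reverses.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    hits = 0
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:  # gain disappeared in this resample
            hits += 1
    return hits / n_boot
```

Per-statement correctness indicators (0/1) at k=1 vs k=3 would be a natural input here.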
| Metric | Value |
|---|---|
| Fleiss' kappa (3 LLMs) | 0.342 (fair agreement) |
| GPT-5.2 ↔ Claude kappa | 0.574 (moderate) |
| GPT-5.2 ↔ Gemini kappa | 0.203 (fair) |
| Claude ↔ Gemini kappa | 0.289 (fair) |
| Unanimous agreement | 178/300 (59%) |
| Majority agreement | 98/300 (33%) |
| Override required | 13/300 (4%) |
| No consensus | 11/300 (4%) |
Models reach ~88% accuracy on items with unanimous ground truth, ~65-70% on majority-vote items, and ~60% on contested items, confirming that GT agreement level tracks prediction difficulty.
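The Fleiss' kappa figures above can be reproduced from per-item category counts with a short stdlib function (a standard textbook formulation, not the repo's code):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for n items, each rated by the same number of annotators.

    `ratings` is a list of per-item category-count rows, e.g. [2, 1, 0]
    means two annotators chose category 0 and one chose category 1.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # observed per-item agreement
    P_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    P_bar = sum(P_i) / n_items
    # chance agreement from marginal category proportions
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```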
| Model | KL Divergence (pred vs GT) | Assessment |
|---|---|---|
| Claude Sonnet 4.6 | 0.0005 | Well calibrated |
| GPT-5.2 | 0.0011 | Well calibrated |
| Gemini 3 Pro | 0.1487 | Over-predicts NOVEL, under-predicts PARTIAL |
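The calibration column compares label distributions via KL divergence. A sketch of KL(pred ‖ GT) over the three novelty classes — the epsilon smoothing here is an assumption, since the repo's exact handling of zero counts is not documented in this README:

```python
import math
from collections import Counter

def label_kl(pred_labels, gt_labels,
             classes=("NOVEL", "PARTIAL", "ANTICIPATED"), eps=1e-9):
    """KL divergence between predicted and ground-truth label distributions."""
    def dist(labels):
        c = Counter(labels)
        n = len(labels)
        # floor each probability at eps to avoid log(0)
        return [max(c[k] / n, eps) for k in classes]
    p, q = dist(pred_labels), dist(gt_labels)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical distributions give 0; a model that over-predicts one class, as in the Gemini row, gives a visibly larger value.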
| Phase | Cost |
|---|---|
| Benchmark construction (curation + grading + novelty) | $38.56 |
| 3-tier evaluation (27 runs) | ~$346 |
| Total | $384.19 |
All analysis results are stored in `data/novex/analysis/`:

| File | Description |
|---|---|
| `summary.json` | Benchmark overview and best results per tier |
| `tier1_table.json` | Tier 1 retrieval metrics for all methods |
| `tier2_table.json` | Tier 2 relevance grading results |
| `tier2_grade_distribution.json` | Grade distribution explaining kappa paradox |
| `tier3_table.json` | Tier 3 novelty results with balanced accuracy |
| `context_ablation.json` | Accuracy at k=1,3,5,10,20 for all models |
| `ablation_significance.json` | Bootstrap paired significance tests for k-values |
| `confusion_matrices.json` | 3x3 confusion matrices (6 conditions) |
| `error_analysis.json` | Per-class, per-domain error breakdowns |
| `tier_correlation.json` | BM25 R@10 vs T3 accuracy correlation |
| `difficulty_stratification.json` | Performance by retrieval difficulty bucket |
| `cost_performance.json` | Cost vs metric Pareto analysis |
| `inter_model_agreement.json` | Pairwise model agreement on T2 |
| `vocabulary_gap.json` | Patent-paper vocabulary divergence |
| `cross_domain.json` | Per-domain performance breakdown |
| `per_domain.json` | Domain-level statistics |
| `doc_type_split.json` | Paper vs patent retrieval breakdown |
| `significance_tests.json` | Statistical significance across methods |
| `fleiss_kappa_gt.json` | Inter-annotator agreement (3 LLMs) |
| `gt_agreement_stratification.json` | T3 accuracy by GT agreement level |
| `dataset_statistics.json` | Corpus and statement descriptive statistics |
| `qualitative_examples.json` | Context-sensitive example predictions |
| `calibration_note.json` | Prediction distribution vs GT distribution |
| `full_analysis.json` | Combined analysis output |
Located in `data/novex/analysis/figures/`:

| Figure | Description |
|---|---|
| `fig1_tier1_recall` | Tier 1 recall@k curves by method |
| `fig2_tier2_comparison` | Tier 2 model comparison |
| `fig3_tier3_context_vs_zs` | Context vs zero-shot novelty accuracy |
| `fig4_tier3_per_class_f1` | Per-class F1 breakdown |
| `fig5_paper_vs_patent` | Paper vs patent retrieval analysis |
| `fig6_agreement_heatmap` | Inter-model agreement heatmap |
| `fig7_tier_correlation` | BM25 retrieval quality vs T3 accuracy |
| `fig8_cost_pareto` | Cost-performance Pareto frontier |
| `fig9_confusion_matrices` | T3 confusion matrices (3 models) |
| `fig10_context_ablation` | Accuracy vs context quantity (k) |
An expert validation package is included for human review of 100 strategically sampled statements:
- `data/novex/expert_validation.xlsx` — Annotatable Excel workbook (3 tabs: Instructions, Annotation, Selection Criteria)
- `data/novex/expert_validation.csv` — Flat CSV version
- `data/novex/expert_validation_guidelines.txt` — Detailed annotation guidelines
- `data/novex/expert_validation_selection.json` — Selection strategy and IDs
The 100 statements are stratified to maximize validation value: all 11 no-consensus items, all 13 override items, 46 randomly sampled majority items, and 30 unanimous controls.
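The quota scheme above can be sketched as a small sampler. The dictionary layout and fixed seed are illustrative assumptions; the repo's actual selection logic and IDs are recorded in `expert_validation_selection.json`:

```python
import random

def build_validation_sample(items, seed=42):
    """Assemble a stratified 100-statement expert sample.

    `items` maps agreement level ('no_consensus', 'override', 'majority',
    'unanimous') to statement-ID lists; quotas follow the text above:
    all contested items, plus 46 majority and 30 unanimous samples.
    """
    rng = random.Random(seed)
    sample = list(items["no_consensus"]) + list(items["override"])
    sample += rng.sample(items["majority"], 46)
    sample += rng.sample(items["unanimous"], 30)
    return sample
```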
```bash
git clone https://github.com/VibeCodingScientist/BioPAT-Benchmark.git
cd BioPAT-Benchmark
python -m venv venv
source venv/bin/activate

# Core only
pip install -e .

# With evaluation (sentence-transformers, FAISS, PyTorch)
pip install -e ".[evaluation]"

# With LLM experiments (OpenAI, Anthropic, Google)
pip install -e ".[llm]"

# Everything
pip install -e ".[all]"
```

```bash
# LLM providers
export OPENAI_API_KEY=your_key
export ANTHROPIC_API_KEY=your_key
export GOOGLE_API_KEY=your_key

# Data source APIs (for benchmark construction only)
export PATENTSVIEW_API_KEY=your_key
export NCBI_API_KEY=your_key
```

```bash
# Run 3-tier evaluation
PYTHONPATH=src python scripts/run_novex.py --tier 1 --method bm25
PYTHONPATH=src python scripts/run_novex.py --tier 2 --model claude-sonnet-4-6
PYTHONPATH=src python scripts/run_novex.py --tier 3 --context-k 10

# Analyze results
PYTHONPATH=src python scripts/analyze_novex.py --analysis all

# Generate figures
PYTHONPATH=src python scripts/generate_figures.py --format png

# Cost estimate without API calls
PYTHONPATH=src python scripts/run_novex.py --dry-run
```

```bash
docker compose build
docker compose run --rm benchmark
```

| Config file | Purpose |
|---|---|
| `configs/default.yaml` | Phase 1 pipeline (data paths, API settings, IPC filters) |
| `configs/experiments.yaml` | Full experiment suite (7 experiment types) |
| `configs/experiments_agent.yaml` | Agent dual-retrieval experiment only |
| `configs/novex.yaml` | NovEx benchmark (curation, annotation, 3-tier evaluation) |
Three providers are exposed through a unified interface (`LLMProvider.generate()`):
| Provider | Models | Cost tracking |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5.2 | Per-token input/output |
| Anthropic | Claude Opus 4.6, Sonnet 4.5, Haiku 4.5 | Per-token input/output |
| Google | Gemini 3 Pro, Gemini 2.5 Pro | Per-token input/output |
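The unified interface makes providers interchangeable in the evaluation harness. A minimal sketch of the pattern — the actual class lives in `src/biopat/llm/providers.py` and its exact signature is assumed here, with a stand-in provider for offline testing:

```python
from dataclasses import dataclass, field
from typing import Protocol

class LLMProvider(Protocol):
    """Structural interface: any object with a matching generate() qualifies.

    The real providers wrap the OpenAI, Anthropic, and Google SDKs behind
    this shape; the keyword argument shown is a hypothetical example.
    """
    def generate(self, prompt: str, *, max_tokens: int = 1024) -> str: ...

@dataclass
class EchoProvider:
    """Stand-in provider for offline tests: records calls, echoes prompts."""
    calls: list = field(default_factory=list)

    def generate(self, prompt: str, *, max_tokens: int = 1024) -> str:
        self.calls.append(prompt)
        return prompt[:max_tokens]
```

Swapping models then means constructing a different provider, with no change to evaluation code.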
| Script | Purpose |
|---|---|
| `run_novex.py` | Run NovEx 3-tier evaluation (T1/T2/T3 with checkpointing) |
| `analyze_novex.py` | Generate analysis JSON files from evaluation checkpoints |
| `generate_figures.py` | Generate all 10 publication figures (PDF + PNG) |
| `build_expert_sheet.py` | Build expert validation Excel workbook |
| `run_phase2.py` | Phase 2 API evaluation runs |
| `run_phase3.py` | Phase 3 pre-writing analyses (ablation significance, calibration) |
| `run_phase4.py` | Phase 4 gap-filling (re-runs, Fleiss' kappa, figure regen) |
| `curate_statements.py` | Statement curation pipeline (3-LLM extraction) |
| `scale_benchmark.py` | Scale benchmark from 100 to 300 statements |
| `fix_outliers.py` | Remove bad entries and replace from unused pool |
| `run_experiments.py` | Run Phase 1 retrieval experiments |
| `generate_paper_outputs.py` | Generate LaTeX tables and paper outputs |
```
data/
├── benchmark/                   # Phase 1 BEIR-format benchmark
│   ├── corpus.jsonl             # 158,850 scientific papers
│   ├── queries.jsonl            # 1,984 patent claims
│   ├── qrels/                   # 842K relevance judgments (train/dev/test)
│   └── results/                 # Retrieval experiment results
│
└── novex/                       # NovEx benchmark
    ├── statements.jsonl         # 300 curated statements
    ├── queries.jsonl            # 300 NovEx queries (BEIR format)
    ├── dual_corpus.jsonl        # 164,649 papers + patents
    ├── qrels/                   # tier1.tsv (relevance), tier3.tsv (novelty)
    ├── results/                 # Evaluation checkpoints (all 27 runs)
    ├── analysis/                # 24 analysis JSON files
    │   └── figures/             # 10 figures (PDF + PNG, 20 files)
    ├── reverse/                 # 8-step curation pipeline data
    │   └── checkpoints/         # Pipeline checkpoints (step1-step8)
    ├── expert_validation.xlsx   # Expert review workbook
    ├── expert_validation.csv    # Flat CSV version
    ├── expert_validation_guidelines.txt
    └── expert_validation_selection.json
```
Core: `httpx`, `polars`, `pyyaml`, `pydantic`, `diskcache`, `tqdm`, `rank-bm25`
Evaluation: `sentence-transformers`, `faiss-cpu`, `torch`
LLM: `openai`, `anthropic`, `google-genai`, `scipy`
Advanced: `rdkit` (chemical structures)
Python 3.11-3.12 required.
```bibtex
@software{biopat2026,
  author    = {BioPAT Contributors},
  title     = {{BioPAT-NovEx}: A 3-Tier Benchmark for LLM-Based Patent Prior Art Discovery},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/VibeCodingScientist/BioPAT-Benchmark}
}
```

MIT License — see LICENSE for details.