
feat: expand verification engine with 8 new capabilities#37

Merged
VibeCodingScientist merged 2 commits into main from feat/expanded-verification-engine
Feb 16, 2026
Conversation

@VibeCodingScientist (Owner)

Summary

Expands the verification engine from 5 domain adapters to 7 domain adapters + 4 cross-cutting meta-verifiers that enhance every domain. This is a large but self-contained change -- all new code lives under backend/verification/, backend/services/, and tests/test_verification/.

41 files changed, ~5,800 lines added.


What's New

Cross-Cutting Meta-Verifiers (apply to ANY domain result)

These are a new architectural concept -- unlike domain adapters (which run for one domain), cross-cutting verifiers run on any task result that contains the relevant data. They share 30% of the final score, with the domain adapter keeping 70%.

| Verifier | Weight | Triggers when result contains | What it checks |
|---|---|---|---|
| Citation & Reference | 0.15 | `citations`, `references`, `papers`, `bibliography` | DOI resolution via CrossRef, metadata match via OpenAlex + Semantic Scholar, claim-abstract Jaccard similarity, freshness penalties for fast-moving fields |
| Statistical Forensics | 0.10 | `statistical_claims`, `means`, `p_values`, `metrics` | GRIM test (mean plausibility), SPRITE test (mean+SD achievability), Benford's law (first-digit distribution), p-curve analysis (p-hacking detection) |
| Reproducibility Executor | 0.15 | `code_repo` AND `code_commit` | Git clone, dependency file detection, Docker sandbox execution, output comparison against claimed results |
| Data Integrity | 0.10 | `data`, `dataset`, `raw_data`, `results_summary`, `output_checksums` | Schema consistency, exact duplicate detection, z-score outlier flagging (>3 sigma), SHA-256 hash verification |
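
The forensics checks above are mostly cheap arithmetic. As one illustration, the GRIM idea (a mean of n integer-valued scores must equal some integer total divided by n) can be sketched as follows -- this is a minimal standalone sketch, not the repo's actual implementation, and the function name is made up:

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM test sketch: for integer-valued data, the reported mean must
    be reachable as (integer total) / n at the reported precision."""
    target = reported_mean * n
    # Check the two integer totals bracketing mean * n.
    for total in (int(target), int(target) + 1):
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

# A mean of 3.17 over 12 integer scores is achievable (38/12 ~ 3.17),
# but 5.19 over 10 scores is not (51/10 = 5.1, 52/10 = 5.2).
```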

New Domain Adapters

| Adapter | Claim Types | Key Dependencies |
|---|---|---|
| Chemistry | `reaction_mechanism`, `molecular_property`, `retrosynthesis` | rdkit (SMILES parsing, stoichiometry, feasibility), PubChem API, ChEMBL API |
| Physics | `numerical_simulation`, `analytical_derivation`, `dimensional_analysis` | pint (dimensional analysis), sympy (symbolic math), numpy |
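
The physics adapter delegates dimensional analysis to pint in production. As a dependency-free sketch of the underlying idea (the unit table and `combine` helper here are illustrative, not project code), units reduce to exponent vectors over base dimensions, and an equation is dimensionally sound when the exponents on both sides match:

```python
# Dimensions as exponent tuples: (length, mass, time)
DIMENSIONS = {
    "m": (1, 0, 0), "kg": (0, 1, 0), "s": (0, 0, 1),
    "N": (1, 1, -2),   # newton = kg * m / s^2
    "J": (2, 1, -2),   # joule  = N * m
}

def combine(*terms):
    """Multiply unit dimensions by summing exponent vectors.
    Each term is a (unit, power) pair."""
    out = [0, 0, 0]
    for unit, power in terms:
        for i, exponent in enumerate(DIMENSIONS[unit]):
            out[i] += exponent * power
    return tuple(out)

# F = m * a: kg * m / s^2 must reduce to the dimensions of a newton.
assert combine(("kg", 1), ("m", 1), ("s", -2)) == DIMENSIONS["N"]
```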

Enhanced Existing Adapters

| Adapter | Enhancement |
|---|---|
| Math (Lean4) | Multi-prover: added Coq (`coqc` in Docker) and Isabelle (`isabelle build` in Docker) alongside Lean 4. Routed by the `proof_system` field. |
| ML/AI | New `benchmark_live` claim type: generates an inference script, runs it in the `clawdlab/ml-inference` Docker container, loads the HuggingFace model, runs samples, and compares live accuracy to the claimed metrics. |
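The final step of a `benchmark_live` run -- comparing measured metrics against claimed ones -- might look roughly like this (the `compare_metrics` helper and its 0.02 tolerance are illustrative assumptions, not the adapter's actual code or thresholds):

```python
def compare_metrics(claimed: dict, measured: dict, abs_tol: float = 0.02) -> dict:
    """Per-metric pass/fail for a live benchmark run.
    A claimed metric with no corresponding live measurement fails."""
    report = {}
    for name, value in claimed.items():
        live = measured.get(name)
        report[name] = live is not None and abs(live - value) <= abs_tol
    return report
```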

Infrastructure

  • Redis-backed async verification queue with distributed semaphores (2 Docker / 4 API concurrent slots)
  • Cross-cutting runner with asyncio.gather() concurrency and weighted score merging
  • 4 new Docker images: Coq 8.18 + MathComp, Isabelle 2024, reproducibility sandbox (Python + sci-packages), ML inference (transformers + torch-cpu)
  • 3 new pip dependencies: rdkit-pypi>=2024.3.1, pint>=0.23, sympy>=1.12
  • Claim-type-level Docker routing: requires_docker_for() on ML adapter enables Docker semaphore only for benchmark_live
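
The claim-type-level routing in the last bullet can be sketched as below; `requires_docker_for` is from the PR, but the class shape, constant, and `pick_semaphore` helper are illustrative assumptions:

```python
class MLAdapter:
    # Only live inference needs a sandbox; static benchmark checks do not.
    DOCKER_CLAIM_TYPES = {"benchmark_live"}

    def requires_docker_for(self, claim_type: str) -> bool:
        return claim_type in self.DOCKER_CLAIM_TYPES

def pick_semaphore(adapter, claim_type, docker_sem, api_sem):
    """Route a claim to the scarce Docker slots (2) or the API slots (4)."""
    return docker_sem if adapter.requires_docker_for(claim_type) else api_sem
```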

Architecture

Task Result
    |
    +---> Domain Adapter (70% weight)
    |     - mathematics (Lean4 / Coq / Isabelle)
    |     - ml_ai (benchmark_result / benchmark_live / ml_experiment / architecture)
    |     - computational_biology
    |     - materials_science
    |     - bioinformatics
    |     - chemistry (NEW)
    |     - physics (NEW)
    |
    +---> Cross-Cutting Verifiers (30% weight, shared)
          - citation_reference (0.15)
          - statistical_forensics (0.10)
          - reproducibility (0.15)
          - data_integrity (0.10)

Score merging: final = 0.70 * domain_score + 0.30 * weighted_cc_score

Cross-cutting verifiers that crash are caught, logged, and scored 0.0 -- they never take down the domain verification.
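
A minimal sketch of the runner's gather-merge-and-crash-handling behavior (the `weight` and `verify` attribute names are assumptions, not the repo's actual API). With a domain score of 0.8 and a single citation verifier scoring 0.6, this reproduces the ~0.74 figure from the test plan:

```python
import asyncio

async def run_cross_cutting(verifiers, result) -> float:
    """Run all triggered verifiers concurrently; a crash scores 0.0
    but never takes down the overall verification."""
    async def safe(v):
        try:
            return v.weight, await v.verify(result)
        except Exception:
            # In the real runner this failure would be logged.
            return v.weight, 0.0

    scored = await asyncio.gather(*(safe(v) for v in verifiers))
    total_weight = sum(w for w, _ in scored)
    # Normalize by the weights of the verifiers that actually ran.
    return sum(w * s for w, s in scored) / total_weight if total_weight else 0.0

def merge(domain_score: float, cc_score: float) -> float:
    return 0.70 * domain_score + 0.30 * cc_score
```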


New Files

| File | Lines | Purpose |
|---|---|---|
| `backend/verification/cross_cutting_base.py` | 37 | Base class + `CrossCuttingResult` dataclass |
| `backend/verification/cross_cutting_runner.py` | 180 | Registry, concurrent execution, score merging |
| `backend/verification/citation_verifier.py` | 323 | DOI/OpenAlex/Semantic Scholar/freshness |
| `backend/verification/statistical_forensics.py` | 437 | GRIM/SPRITE/Benford/p-curve |
| `backend/verification/reproducibility_executor.py` | 317 | Git clone + Docker sandbox + output compare |
| `backend/verification/data_integrity.py` | 308 | Schema/duplicates/outliers/hashes |
| `backend/verification/chemistry_adapter.py` | 602 | rdkit + PubChem + ChEMBL |
| `backend/verification/physics_adapter.py` | 600 | Conservation + pint + sympy |
| `containers/coq.Dockerfile` | 9 | Coq 8.18 + MathComp |
| `containers/isabelle.Dockerfile` | 11 | Isabelle 2024 |
| `containers/reproducibility.Dockerfile` | 14 | Python + sci-packages sandbox |
| `containers/ml-inference.Dockerfile` | 11 | transformers + torch-cpu |
| 10 test files | ~1,600 | Full coverage for all new code |

Test Plan

  • Verify all existing tests still pass: `pytest tests/test_verification/ -v`
  • Run new cross-cutting tests: `pytest tests/test_verification/test_cross_cutting*.py tests/test_verification/test_citation*.py tests/test_verification/test_statistical*.py tests/test_verification/test_reproducibility*.py tests/test_verification/test_data_integrity.py -v`
  • Run new adapter tests: `pytest tests/test_verification/test_chemistry*.py tests/test_verification/test_physics*.py tests/test_verification/test_ml_live*.py -v`
  • Verify the dispatcher registers 7 adapters (was 5)
  • Verify merge math: domain score 0.8 + citation score 0.6 (weight 0.15) -> final ~ 0.74
  • Build Docker images on the server: `./build.sh all`
  • Integration: submit a task with citations -> cross-cutting citation verifier runs
  • Chemistry: submit a task with a SMILES string -> PubChem cross-reference works
  • Math: submit a task with `proof_system: "coq"` -> Coq container runs

Generated with Claude Code

@VibeCodingScientist force-pushed the feat/expanded-verification-engine branch from 2646cab to 3b126d9 on February 16, 2026 at 12:03
VibeCodingScientist and others added 2 commits February 16, 2026 16:38
Add cross-cutting meta-verifiers, new domain adapters, and enhanced
existing adapters to significantly broaden verification coverage.

Cross-cutting verifiers (apply to any domain):
- Citation and Reference: DOI resolution, OpenAlex/Semantic Scholar
  metadata match, claim support via Jaccard similarity, freshness
- Statistical Forensics: GRIM test, SPRITE test, Benford's law,
  p-curve analysis for detecting fabricated statistics
- Reproducibility Executor: git clone, dependency detection, Docker
  sandbox execution, output comparison against claimed results
- Data Integrity: schema validation, duplicate detection, z-score
  outlier flagging, SHA-256 hash verification

New domain adapters:
- Chemistry: rdkit SMILES validation, stoichiometry balancing,
  PubChem + ChEMBL cross-reference, retrosynthesis route checks
- Physics: conservation law checks, stability/divergence detection,
  convergence analysis, dimensional analysis (pint), symbolic math (sympy)

Enhanced existing adapters:
- Math multi-prover: Coq and Isabelle support alongside Lean 4
- ML live inference: benchmark_live claim type runs models in Docker
  sandbox against HuggingFace benchmarks

Infrastructure:
- Redis-backed async verification queue with distributed semaphores
- Cross-cutting runner with weighted score merging (70/30 domain/CC)
- 4 new Docker images (Coq, Isabelle, reproducibility, ML inference)
- New dependencies: rdkit-pypi, pint, sympy
- Comprehensive test suite (10 new test files, ~1600 lines of tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ty, skill docs

Security:
- Add auth (get_current_agent) to all verification polling endpoints
- Add lab membership check for job polling and verification history
- Sanitize Docker inputs: regex-validate entry points, dependency names,
  theory names, and model IDs before subprocess/Docker execution
- Use full UUID job IDs instead of truncated hex

Reliability:
- Increase HTTP timeouts (citation 15→30s, chemistry 20→30s)
- Add exponential backoff on verification retries (MAX_RETRIES 1→2)
- Add asyncio.wait_for timeout (120s) around cross-cutting gather
- Add 300s timeout around cross-cutting runner in queue worker

Correctness:
- Call validate_task_result() before enqueuing verification in tasks.py
- Add configurable per-domain scoring weights (math 90%, ML 65%, etc.)

Dockerfiles:
- Pin pip dependencies with version ranges in compbio, ml-inference, reproducibility
- Pin opam packages to 2.2.0 in coq.Dockerfile
- Remove || true from Isabelle HOL build (fail loudly on errors)

Enhancements:
- Add GET /api/verification/labs/{slug}/history endpoint
- Add comprehensive Section 9 (Verification Engine) to skill.md
- Add chemistry + physics to skill.md domains list
- Add verification endpoints to skill.md API reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@VibeCodingScientist force-pushed the feat/expanded-verification-engine branch from 3b126d9 to d3b3ee3 on February 16, 2026 at 15:46
@VibeCodingScientist merged commit 30625b2 into main on Feb 16, 2026
0 of 3 checks passed
@VibeCodingScientist deleted the feat/expanded-verification-engine branch on February 16, 2026 at 15:46