
feat: expand verification engine with 8 new capabilities#37

Merged
VibeCodingScientist merged 2 commits into main from feat/expanded-verification-engine
Feb 16, 2026
Conversation

@VibeCodingScientist (Owner)

Summary

Expands the verification engine from 5 domain adapters to 7 domain adapters + 4 cross-cutting meta-verifiers that enhance every domain. This is a large but self-contained change -- all new code lives under backend/verification/, backend/services/, and tests/test_verification/.

41 files changed, ~5,800 lines added.


What's New

Cross-Cutting Meta-Verifiers (apply to ANY domain result)

These are a new architectural concept -- unlike domain adapters (which run for one domain), cross-cutting verifiers run on any task result that contains the relevant data. They share 30% of the final score, with the domain adapter keeping 70%.

| Verifier | Weight | Triggers when result contains | What it checks |
|---|---|---|---|
| Citation & Reference | 0.15 | `citations`, `references`, `papers`, `bibliography` | DOI resolution via CrossRef, metadata match via OpenAlex + Semantic Scholar, claim-abstract Jaccard similarity, freshness penalties for fast-moving fields |
| Statistical Forensics | 0.10 | `statistical_claims`, `means`, `p_values`, `metrics` | GRIM test (mean plausibility), SPRITE test (mean+SD achievability), Benford's law (first-digit distribution), p-curve analysis (p-hacking detection) |
| Reproducibility Executor | 0.15 | `code_repo` AND `code_commit` | Git clone, dependency file detection, Docker sandbox execution, output comparison against claimed results |
| Data Integrity | 0.10 | `data`, `dataset`, `raw_data`, `results_summary`, `output_checksums` | Schema consistency, exact duplicate detection, z-score outlier flagging (>3 sigma), SHA-256 hash verification |
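
The forensics checks above are mostly cheap arithmetic. As one illustration, the GRIM idea (a mean of n integer-valued scores must equal some integer total divided by n) can be sketched as follows -- this is a minimal standalone sketch, not the repo's actual implementation, and the function name is made up:

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM test sketch: for integer-valued data, the reported mean must
    be reachable as (integer total) / n at the reported precision."""
    target = reported_mean * n
    # Check the two integer totals bracketing mean * n.
    for total in (int(target), int(target) + 1):
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

# A mean of 3.17 over 12 integer scores is achievable (38/12 ~ 3.17),
# but 5.19 over 10 scores is not (51/10 = 5.1, 52/10 = 5.2).
```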

New Domain Adapters

| Adapter | Claim Types | Key Dependencies |
|---|---|---|
| Chemistry | `reaction_mechanism`, `molecular_property`, `retrosynthesis` | rdkit (SMILES parsing, stoichiometry, feasibility), PubChem API, ChEMBL API |
| Physics | `numerical_simulation`, `analytical_derivation`, `dimensional_analysis` | pint (dimensional analysis), sympy (symbolic math), numpy |
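
The physics adapter delegates dimensional analysis to pint in production. As a dependency-free sketch of the underlying idea (the unit table and `combine` helper here are illustrative, not project code), units reduce to exponent vectors over base dimensions, and an equation is dimensionally sound when the exponents on both sides match:

```python
# Dimensions as exponent tuples: (length, mass, time)
DIMENSIONS = {
    "m": (1, 0, 0), "kg": (0, 1, 0), "s": (0, 0, 1),
    "N": (1, 1, -2),   # newton = kg * m / s^2
    "J": (2, 1, -2),   # joule  = N * m
}

def combine(*terms):
    """Multiply unit dimensions by summing exponent vectors.
    Each term is a (unit, power) pair."""
    out = [0, 0, 0]
    for unit, power in terms:
        for i, exponent in enumerate(DIMENSIONS[unit]):
            out[i] += exponent * power
    return tuple(out)

# F = m * a: kg * m / s^2 must reduce to the dimensions of a newton.
assert combine(("kg", 1), ("m", 1), ("s", -2)) == DIMENSIONS["N"]
```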

Enhanced Existing Adapters

| Adapter | Enhancement |
|---|---|
| Math (Lean4) | Multi-prover: added Coq (`coqc` in Docker) and Isabelle (`isabelle build` in Docker) alongside Lean 4. Routed by the `proof_system` field. |
| ML/AI | New `benchmark_live` claim type: generates an inference script, runs it in the `clawdlab/ml-inference` Docker container, loads the HuggingFace model, runs samples, and compares live accuracy to the claimed metrics. |
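The final step of a `benchmark_live` run -- comparing measured metrics against claimed ones -- might look roughly like this (the `compare_metrics` helper and its 0.02 tolerance are illustrative assumptions, not the adapter's actual code or thresholds):

```python
def compare_metrics(claimed: dict, measured: dict, abs_tol: float = 0.02) -> dict:
    """Per-metric pass/fail for a live benchmark run.
    A claimed metric with no corresponding live measurement fails."""
    report = {}
    for name, value in claimed.items():
        live = measured.get(name)
        report[name] = live is not None and abs(live - value) <= abs_tol
    return report
```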

Infrastructure

  • Redis-backed async verification queue with distributed semaphores (2 Docker / 4 API concurrent slots)
  • Cross-cutting runner with asyncio.gather() concurrency and weighted score merging
  • 4 new Docker images: Coq 8.18 + MathComp, Isabelle 2024, reproducibility sandbox (Python + sci-packages), ML inference (transformers + torch-cpu)
  • 3 new pip dependencies: rdkit-pypi>=2024.3.1, pint>=0.23, sympy>=1.12
  • Claim-type-level Docker routing: requires_docker_for() on ML adapter enables Docker semaphore only for benchmark_live
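
The claim-type-level routing in the last bullet can be sketched as below; `requires_docker_for` is from the PR, but the class shape, constant, and `pick_semaphore` helper are illustrative assumptions:

```python
class MLAdapter:
    # Only live inference needs a sandbox; static benchmark checks do not.
    DOCKER_CLAIM_TYPES = {"benchmark_live"}

    def requires_docker_for(self, claim_type: str) -> bool:
        return claim_type in self.DOCKER_CLAIM_TYPES

def pick_semaphore(adapter, claim_type, docker_sem, api_sem):
    """Route a claim to the scarce Docker slots (2) or the API slots (4)."""
    return docker_sem if adapter.requires_docker_for(claim_type) else api_sem
```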

Architecture

Task Result
    |
    +---> Domain Adapter (70% weight)
    |     - mathematics (Lean4 / Coq / Isabelle)
    |     - ml_ai (benchmark_result / benchmark_live / ml_experiment / architecture)
    |     - computational_biology
    |     - materials_science
    |     - bioinformatics
    |     - chemistry (NEW)
    |     - physics (NEW)
    |
    +---> Cross-Cutting Verifiers (30% weight, shared)
          - citation_reference (0.15)
          - statistical_forensics (0.10)
          - reproducibility (0.15)
          - data_integrity (0.10)

Score merging: final = 0.70 * domain_score + 0.30 * weighted_cc_score

Cross-cutting verifiers that crash are caught, logged, and scored 0.0 -- they never take down the domain verification.
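
A minimal sketch of the runner's gather-merge-and-crash-handling behavior (the `weight` and `verify` attribute names are assumptions, not the repo's actual API). With a domain score of 0.8 and a single citation verifier scoring 0.6, this reproduces the ~0.74 figure from the test plan:

```python
import asyncio

async def run_cross_cutting(verifiers, result) -> float:
    """Run all triggered verifiers concurrently; a crash scores 0.0
    but never takes down the overall verification."""
    async def safe(v):
        try:
            return v.weight, await v.verify(result)
        except Exception:
            # In the real runner this failure would be logged.
            return v.weight, 0.0

    scored = await asyncio.gather(*(safe(v) for v in verifiers))
    total_weight = sum(w for w, _ in scored)
    # Normalize by the weights of the verifiers that actually ran.
    return sum(w * s for w, s in scored) / total_weight if total_weight else 0.0

def merge(domain_score: float, cc_score: float) -> float:
    return 0.70 * domain_score + 0.30 * cc_score
```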


New Files

| File | Lines | Purpose |
|---|---|---|
| `backend/verification/cross_cutting_base.py` | 37 | Base class + `CrossCuttingResult` dataclass |
| `backend/verification/cross_cutting_runner.py` | 180 | Registry, concurrent execution, score merging |
| `backend/verification/citation_verifier.py` | 323 | DOI/OpenAlex/Semantic Scholar/freshness |
| `backend/verification/statistical_forensics.py` | 437 | GRIM/SPRITE/Benford/p-curve |
| `backend/verification/reproducibility_executor.py` | 317 | Git clone + Docker sandbox + output compare |
| `backend/verification/data_integrity.py` | 308 | Schema/duplicates/outliers/hashes |
| `backend/verification/chemistry_adapter.py` | 602 | rdkit + PubChem + ChEMBL |
| `backend/verification/physics_adapter.py` | 600 | Conservation + pint + sympy |
| `containers/coq.Dockerfile` | 9 | Coq 8.18 + MathComp |
| `containers/isabelle.Dockerfile` | 11 | Isabelle 2024 |
| `containers/reproducibility.Dockerfile` | 14 | Python + sci-packages sandbox |
| `containers/ml-inference.Dockerfile` | 11 | transformers + torch-cpu |
| 10 test files | ~1,600 | Full coverage for all new code |

Test Plan

  • Verify all existing tests still pass: `pytest tests/test_verification/ -v`
  • Run new cross-cutting tests: `pytest tests/test_verification/test_cross_cutting*.py tests/test_verification/test_citation*.py tests/test_verification/test_statistical*.py tests/test_verification/test_reproducibility*.py tests/test_verification/test_data_integrity.py -v`
  • Run new adapter tests: `pytest tests/test_verification/test_chemistry*.py tests/test_verification/test_physics*.py tests/test_verification/test_ml_live*.py -v`
  • Verify the dispatcher registers 7 adapters (was 5)
  • Verify merge math: domain score 0.8 + citation score 0.6 (weight 0.15) -> final ~ 0.74
  • Build Docker images on the server: `./build.sh all`
  • Integration: submit a task with citations -> cross-cutting citation verifier runs
  • Chemistry: submit a task with a SMILES string -> PubChem cross-reference works
  • Math: submit a task with `proof_system: "coq"` -> Coq container runs

Generated with Claude Code

@VibeCodingScientist force-pushed the feat/expanded-verification-engine branch from 2646cab to 3b126d9 on February 16, 2026 at 12:03
VibeCodingScientist and others added 2 commits February 16, 2026 16:38
Add cross-cutting meta-verifiers, new domain adapters, and enhanced
existing adapters to significantly broaden verification coverage.

Cross-cutting verifiers (apply to any domain):
- Citation and Reference: DOI resolution, OpenAlex/Semantic Scholar
  metadata match, claim support via Jaccard similarity, freshness
- Statistical Forensics: GRIM test, SPRITE test, Benford's law,
  p-curve analysis for detecting fabricated statistics
- Reproducibility Executor: git clone, dependency detection, Docker
  sandbox execution, output comparison against claimed results
- Data Integrity: schema validation, duplicate detection, z-score
  outlier flagging, SHA-256 hash verification

New domain adapters:
- Chemistry: rdkit SMILES validation, stoichiometry balancing,
  PubChem + ChEMBL cross-reference, retrosynthesis route checks
- Physics: conservation law checks, stability/divergence detection,
  convergence analysis, dimensional analysis (pint), symbolic math (sympy)

Enhanced existing adapters:
- Math multi-prover: Coq and Isabelle support alongside Lean 4
- ML live inference: benchmark_live claim type runs models in Docker
  sandbox against HuggingFace benchmarks

Infrastructure:
- Redis-backed async verification queue with distributed semaphores
- Cross-cutting runner with weighted score merging (70/30 domain/CC)
- 4 new Docker images (Coq, Isabelle, reproducibility, ML inference)
- New dependencies: rdkit-pypi, pint, sympy
- Comprehensive test suite (10 new test files, ~1600 lines of tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ty, skill docs

Security:
- Add auth (get_current_agent) to all verification polling endpoints
- Add lab membership check for job polling and verification history
- Sanitize Docker inputs: regex-validate entry points, dependency names,
  theory names, and model IDs before subprocess/Docker execution
- Use full UUID job IDs instead of truncated hex

Reliability:
- Increase HTTP timeouts (citation 15→30s, chemistry 20→30s)
- Add exponential backoff on verification retries (MAX_RETRIES 1→2)
- Add asyncio.wait_for timeout (120s) around cross-cutting gather
- Add 300s timeout around cross-cutting runner in queue worker

Correctness:
- Call validate_task_result() before enqueuing verification in tasks.py
- Add configurable per-domain scoring weights (math 90%, ML 65%, etc.)

Dockerfiles:
- Pin pip dependencies with version ranges in compbio, ml-inference, reproducibility
- Pin opam packages to 2.2.0 in coq.Dockerfile
- Remove || true from Isabelle HOL build (fail loudly on errors)

Enhancements:
- Add GET /api/verification/labs/{slug}/history endpoint
- Add comprehensive Section 9 (Verification Engine) to skill.md
- Add chemistry + physics to skill.md domains list
- Add verification endpoints to skill.md API reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@VibeCodingScientist force-pushed the feat/expanded-verification-engine branch from 3b126d9 to d3b3ee3 on February 16, 2026 at 15:46
@VibeCodingScientist merged commit 30625b2 into main on Feb 16, 2026
0 of 3 checks passed
@VibeCodingScientist deleted the feat/expanded-verification-engine branch on February 16, 2026 at 15:46