
feat(adr-102): add concept embedding-dimension check to offline validator #492

Merged
aaronsb merged 2 commits into main from feat/adr-102-backup-validator-dim-check
Jun 1, 2026


@aaronsb aaronsb commented Jun 1, 2026

What

The offline validator (lint_backup) checked that embedding-profile references resolve — index in range (E_CONCEPT_PROFILE_RANGE), cascade resolves (E_NO_PROFILE_CASCADE), identity string shape (E_PROFILE_IDENTITY) — but never verified a concept's actual embedding vector length against its resolved profile's @dims.

That's a real gap: a backup could mis-tag a 768-dim concept as @1536 and pass clean, then mis-attach or fail at restore (especially integration mode, which keys on embedding-space identity). Surfaced while inspecting a real backup whose data is all local:nomic@768 while openai@1536 sits declared-but-unused.

The check

New E_CONCEPT_EMBEDDING_DIM: for each concept carrying an embedding vector whose profile resolves with a parseable @dims, assert len(embedding) == dims (spec §3.2). Only fires when the vector is present and the profile resolves cleanly — bad index / malformed identity are flagged by the existing codes, so no double-reporting.

  • New _profile_dims() helper parses the @dims suffix from the resolved profile identity.
  • _validate_bulk now binds the profiles list (previously only n_profiles) to look up the identity.
  • --selftest gains a pass case (3-dim vector under @3) and a fail case (2 != @3).

Fixtures made dimension-honest

test_id_remap / test_kg_backup_v2 / test_restore_modes declared @1536/@768 profiles but carried 1–2-element placeholder vectors — they were never dimension-honest (the new check correctly flagged them). Switched to synthetic identities (test:embed@1 / @2) matching their vector lengths. No assertion depended on the old identity string (only _BACKUP_IDENTITY, updated in lockstep). A real full export (3994 concepts, nomic@768) validates clean.
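One way to keep test fixtures dimension-honest is to derive the synthetic identity from the placeholder vector itself. The helper name `make_concept_fixture` and the record layout below are hypothetical; only the `test:embed@N` identity pattern comes from this PR.

```python
def make_concept_fixture(vector):
    """Build a (profiles, concept) pair whose declared @dims equals len(vector).

    Hypothetical layout: profiles is a list of identity strings and the
    concept references its profile by index.
    """
    identity = f"test:embed@{len(vector)}"
    profiles = [identity]
    concept = {"embedding": list(vector), "embedding_profile": 0}
    return profiles, concept


# A 2-element placeholder vector gets a matching @2 identity.
profiles, concept = make_concept_fixture([0.1, 0.2])
```

Because the identity is computed from the vector, shortening or lengthening a placeholder cannot silently drift away from its declared dimension.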

Tests

  • --selftest → PASS
  • kg-backup/restore slice (kg_backup_v2 / reader / id_remap / restore_modes / restore_worker_epoch / kg_backup_v2_restore / backup_integrity) → 67 passed
  • The real 16 MB archive validates OK: no issues found.

Note (flagging, not fixing here)

lint_backup.py is now 830 lines (>800). A package split is contraindicated: pytest loads it as a single standalone file via importlib.util.spec_from_file_location (the Track-D no-deps oracle), so splitting into a package would break the test-loading mechanism and its zero-dependency design. Flagging for a possible future task rather than forcing it here.
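For context, loading a single standalone file with `importlib.util.spec_from_file_location` usually looks like the sketch below. The helper name `load_standalone_module` is hypothetical and the project's actual test loader may differ; the importlib calls themselves are standard library.

```python
import importlib.util


def load_standalone_module(path, name="lint_backup"):
    """Load one .py file as a module without installing or importing a package."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    # Executes the file's top-level code in the new module's namespace.
    spec.loader.exec_module(module)
    return module
```

A module loaded this way has no package context, so relative imports of sibling files would fail. That is consistent with the concern above about splitting the file into a package.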

🤖 Generated with Claude Code

aaronsb added 2 commits June 1, 2026 18:48
…ator

The offline validator checked that profile *references* resolve (index in range,
cascade resolves, identity string shape) but never verified a concept's actual
embedding vector length against its resolved profile's @dims. A backup could
mis-tag a 768-dim concept as @1536 and pass clean — then mis-attach or fail at
restore (esp. integration mode, which keys on embedding-space identity).

Adds E_CONCEPT_EMBEDDING_DIM: for each concept carrying an embedding vector whose
profile resolves with a parseable @dims, assert len(embedding) == dims (spec §3.2).
Only fires when the vector is present and the profile resolves cleanly (bad index /
malformed identity are flagged separately), so it never double-reports.

- New _profile_dims() helper parses the @dims suffix from the resolved profile identity.
- _validate_bulk now binds profiles (was n_profiles only) to look up the identity.
- selftest gains a pass case (3-dim vector under @3) and a fail case (2 != @3).

Test fixtures fixed: test_id_remap / test_kg_backup_v2 / test_restore_modes declared
@1536/@768 profiles but carried 1-2 element placeholder vectors — never dimension-
honest. Switched them to synthetic identities (test:embed@1 / @2) matching their
vector lengths. Real backups (e.g. a full 3994-concept nomic@768 export) validate clean.

NOTE: lint_backup.py is now 830 lines (>800). A package split is contraindicated —
it's loaded by pytest as a single standalone file via importlib spec_from_file_location
(the Track-D no-deps oracle); splitting would break that. Flagging for a future task.

67 passed across the kg-backup/restore slice; selftest PASS.

Add a selftest case (review nit): an out-of-range record-level embedding_profile
with a present embedding must yield E_CONCEPT_PROFILE_RANGE only, NOT also
E_CONCEPT_EMBEDDING_DIM. Guards the decline-when-unresolved behavior against a
future refactor that might move the dim check before the range/identity guards.
@aaronsb aaronsb merged commit 2aa906a into main Jun 1, 2026
4 checks passed
@aaronsb aaronsb deleted the feat/adr-102-backup-validator-dim-check branch June 1, 2026 23:53