feat(adr-102): add concept embedding-dimension check to offline validator#492
Merged
Conversation
…ator The offline validator checked that profile *references* resolve (index in range, cascade resolves, identity string shape) but never verified a concept's actual embedding vector length against its resolved profile's @dims. A backup could mis-tag a 768-dim concept as @1536 and pass clean — then mis-attach or fail at restore (esp. integration mode, which keys on embedding-space identity). Adds E_CONCEPT_EMBEDDING_DIM: for each concept carrying an embedding vector whose profile resolves with a parseable @dims, assert len(embedding) == dims (spec §3.2). Only fires when the vector is present and the profile resolves cleanly (bad index / malformed identity are flagged separately), so it never double-reports. - New _profile_dims() helper parses the @dims suffix from the resolved profile identity. - _validate_bulk now binds (was n_profiles only) to look up the identity. - selftest gains a pass case (3-dim vector under @3) and a fail case (2 != @3). Test fixtures fixed: test_id_remap / test_kg_backup_v2 / test_restore_modes declared @1536/@768 profiles but carried 1-2 element placeholder vectors — never dimension- honest. Switched them to synthetic identities (test:embed@1 / @2) matching their vector lengths. Real backups (e.g. a full 3994-concept nomic@768 export) validate clean. NOTE: lint_backup.py is now 830 lines (>800). A package split is contraindicated — it's loaded by pytest as a single standalone file via importlib spec_from_file_location (the Track-D no-deps oracle); splitting would break that. Flagging for a future task. 67 passed across the kg-backup/restore slice; selftest PASS.
Add a selftest case (review nit): an out-of-range record-level embedding_profile with a present embedding must yield E_CONCEPT_PROFILE_RANGE only, NOT also E_CONCEPT_EMBEDDING_DIM. Guards the decline-when-unresolved behavior against a future refactor that might move the dim check before the range/identity guards.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
The offline validator (
lint_backup) checked that embedding-profile references resolve — index in range (E_CONCEPT_PROFILE_RANGE), cascade resolves (E_NO_PROFILE_CASCADE), identity string shape (E_PROFILE_IDENTITY) — but never verified a concept's actual embedding vector length against its resolved profile's@dims.That's a real gap: a backup could mis-tag a 768-dim concept as
@1536and pass clean, then mis-attach or fail at restore (especiallyintegrationmode, which keys on embedding-space identity). Surfaced while inspecting a real backup whose data is alllocal:nomic@768whileopenai@1536sits declared-but-unused.The check
New
E_CONCEPT_EMBEDDING_DIM: for each concept carrying an embedding vector whose profile resolves with a parseable@dims, assertlen(embedding) == dims(spec §3.2). Only fires when the vector is present and the profile resolves cleanly — bad index / malformed identity are flagged by the existing codes, so no double-reporting._profile_dims()helper parses the@dimssuffix from the resolved profile identity._validate_bulknow binds theprofileslist (previously onlyn_profiles) to look up the identity.--selftestgains a pass case (3-dim vector under@3) and a fail case (2 != @3).Fixtures made dimension-honest
test_id_remap/test_kg_backup_v2/test_restore_modesdeclared@1536/@768profiles but carried 1–2-element placeholder vectors — they were never dimension-honest (the new check correctly flagged them). Switched to synthetic identities (test:embed@1/@2) matching their vector lengths. No assertion depended on the old identity string (only_BACKUP_IDENTITY, updated in lockstep). A real full export (3994 concepts,nomic@768) validates clean.Tests
--selftest→ PASSOK: no issues found.Note (flagging, not fixing here)
lint_backup.pyis now 830 lines (>800). A package split is contraindicated: pytest loads it as a single standalone file viaimportlib.util.spec_from_file_location(the Track-D no-deps oracle), so splitting into a package would break the test-loading mechanism and its zero-dependency design. Flagging for a possible future task rather than forcing it here.🤖 Generated with Claude Code