Summary
The per-(class, property) flake counts in the index stats
(IndexStats.graphs[g].classes[*].properties[*].datatypes) are not
current-state-accurate after an incremental refresh. A commit that only
retracts a property flake (with no new assertion of that property on the
subject) computes the correct decrement but never applies it, so the count stays
too high. The property-level counts (graphs[g].properties[*].count) decrement
correctly; only the class-scoped per-property counts are affected.
Root cause
- The stats hook records per-subject datatype deltas with signed values, so
  retractions decrement there: id_hook.rs (subject_prop_dts ... += delta).
- But the per-subject property presence set is assert-only: id_hook.rs
  (the branch gated on rec.op).
- The incremental class-stat merge in build/incremental.rs only applies a
  class's class_prop_dts deltas when that class appears in class_properties,
  which is derived from the assert-only presence set. For a retraction-only
  change the class is never revisited, so its decrement delta is dropped.
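The interaction described above can be reproduced in a small toy model. This is a sketch only: CommitDeltas, record, and merge_presence_driven are hypothetical names, the datatype dimension is collapsed into a single signed count, and none of it is the actual code in id_hook.rs or build/incremental.rs.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified stand-ins for the indexer's structures.
type Class = &'static str;
type Prop = &'static str;
type Counts = BTreeMap<Class, BTreeMap<Prop, i64>>;

/// Per-commit output of the stats hook in this toy model.
#[derive(Default)]
struct CommitDeltas {
    /// Signed count deltas per class and property; retractions are negative.
    class_prop_dts: Counts,
    /// Presence set per class; only assertions add entries.
    class_properties: BTreeMap<Class, BTreeSet<Prop>>,
}

impl CommitDeltas {
    /// Record one flake on a subject of `class`; `is_assert == false` is a retraction.
    fn record(&mut self, class: Class, prop: Prop, is_assert: bool) {
        let delta = if is_assert { 1 } else { -1 };
        *self.class_prop_dts.entry(class).or_default().entry(prop).or_insert(0) += delta;
        if is_assert {
            self.class_properties.entry(class).or_default().insert(prop);
        }
    }
}

/// Merge that visits only classes found in the assert-only presence set.
fn merge_presence_driven(prior: &mut Counts, commit: &CommitDeltas) {
    for class in commit.class_properties.keys() {
        if let Some(deltas) = commit.class_prop_dts.get(class) {
            let props = prior.entry(*class).or_default();
            for (prop, delta) in deltas {
                *props.entry(*prop).or_insert(0) += *delta;
            }
        }
    }
}

/// Stored count for (Person, name) after a retraction-only commit.
fn stale_count_after_retraction() -> i64 {
    let mut stats = Counts::new();
    stats.entry("Person").or_default().insert("name", 2);

    let mut commit = CommitDeltas::default();
    commit.record("Person", "name", false); // delta of -1 is computed here

    merge_presence_driven(&mut stats, &commit); // Person is never visited
    stats["Person"]["name"]
}
```

In this model stale_count_after_retraction() returns 2 even though one of the two flakes was retracted: the -1 delta exists in class_prop_dts, but the loop never reaches Person because the presence set is empty.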
Where it shows up
A fast path answers COUNT(*) of ?s rdf:type ?o1 . ?s P ?o2 directly from these
stats as Σ_C Σ_dt classStat[C][P].count. Each P-flake on a subject with k types
is attributed once per class (k times in total), which equals the join's
product-sum, so no scan is needed. With stale counts the fast path over-counts
on any ledger that has had an indexed retraction touching P on a typed subject.
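The equivalence between the fold and the join count can be checked on toy data. The sketch below is illustrative only: Subject, join_count, class_stats, fold_count, and sample are hypothetical names, the Σ_dt dimension is collapsed into one count per class, and each subject's class list is assumed to contain no duplicates.

```rust
use std::collections::BTreeMap;

/// Toy subject: its rdf:type classes and how many P-flakes it holds.
struct Subject {
    classes: Vec<&'static str>,
    p_flakes: u64,
}

/// Row count of `?s rdf:type ?o1 . ?s P ?o2`, computed by scanning subjects:
/// each subject contributes (number of types) x (number of P-flakes) rows.
fn join_count(subjects: &[Subject]) -> u64 {
    subjects
        .iter()
        .map(|s| s.classes.len() as u64 * s.p_flakes)
        .sum()
}

/// Per-class counts for P: every P-flake is attributed once to each class
/// of its subject.
fn class_stats(subjects: &[Subject]) -> BTreeMap<&'static str, u64> {
    let mut stats = BTreeMap::new();
    for s in subjects {
        for c in &s.classes {
            *stats.entry(*c).or_insert(0) += s.p_flakes;
        }
    }
    stats
}

/// The fold: sum the per-class counts without touching any subject.
fn fold_count(stats: &BTreeMap<&'static str, u64>) -> u64 {
    stats.values().sum()
}

/// Three subjects: two types, one type, and untyped.
fn sample() -> Vec<Subject> {
    vec![
        Subject { classes: vec!["Person", "Employee"], p_flakes: 2 },
        Subject { classes: vec!["Person"], p_flakes: 1 },
        Subject { classes: vec![], p_flakes: 3 },
    ]
}
```

For this sample both join_count(&sample()) and fold_count(&class_stats(&sample())) evaluate to 5. The untyped subject contributes nothing to either side, and the two-typed subject contributes its two flakes twice. If a per-class count fails to decrement after a retraction, only the fold drifts upward, which is the over-count described above.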
Current mitigation (stopgap, not ideal)
The fold (the fast-path sum above) is gated on store.lex_sorted_string_ids(),
which is set only by bulk import (import.rs) and cleared by any incremental
refresh (incremental_root.rs). That makes it a reliable proxy for "class stats
untouched by incremental drift." So the fold fires only on pure bulk-import
indexes (where the full build produces exact counts) and defers to the
always-correct merge everywhere else. This enables the optimization for
bulk-imported datasets but disables it for any incrementally updated ledger.
Proposed fix
Make the incremental class-stat merge apply class_prop_dts (and lang/ref)
deltas for every class that has such a delta, not just classes with an
assertion in the current commit:
- Drive the per-class merge by the union of the delta-source maps
  (class_properties ∪ class_prop_dts.keys ∪ class_prop_lang_deltas.keys ∪ ref_edges.keys).
- Apply the datatype deltas to the prior ClassPropertyUsage.datatypes
  regardless of presence, and drop entries that reach 0.
- Add indexer tests: an incremental refresh that retracts a property flake on a
  (base-)typed subject decrements the class-property count; full retraction
  removes it.
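A minimal sketch of the union-driven merge, under simplifying assumptions: merge_union_driven, Counts, and demo are hypothetical names, the datatype dimension is collapsed into one signed count per property, and the lang and ref delta maps are left out of the union. It is meant to illustrate the shape of the change, not the real signature in build/incremental.rs.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified stand-ins for the indexer's structures.
type Class = &'static str;
type Prop = &'static str;
type Counts = BTreeMap<Class, BTreeMap<Prop, i64>>;

/// Apply signed deltas for every class that has one, then prune empty counts.
fn merge_union_driven(
    prior: &mut Counts,
    class_properties: &BTreeMap<Class, BTreeSet<Prop>>,
    class_prop_dts: &Counts,
) {
    // Union of the delta-source keys (lang and ref maps omitted in this sketch).
    let classes: BTreeSet<Class> = class_properties
        .keys()
        .chain(class_prop_dts.keys())
        .copied()
        .collect();

    for class in classes {
        // Classes that appear only in the presence set need no count change here.
        if let Some(deltas) = class_prop_dts.get(&class) {
            let props = prior.entry(class).or_default();
            for (prop, delta) in deltas {
                *props.entry(*prop).or_insert(0) += *delta;
            }
            // Drop entries that reach zero. This sketch also drops negative
            // counts, which would indicate an inconsistency upstream.
            props.retain(|_, n| *n > 0);
        }
    }
}

/// Usage: two retraction-only commits against a prior count of 2.
fn demo() -> (Option<i64>, Option<i64>) {
    let mut prior = Counts::new();
    prior.entry("Person").or_default().insert("name", 2);

    let presence = BTreeMap::new(); // no assertions in either commit
    let mut dts = Counts::new();
    dts.entry("Person").or_default().insert("name", -1);

    merge_union_driven(&mut prior, &presence, &dts);
    let after_first = prior["Person"].get("name").copied();

    merge_union_driven(&mut prior, &presence, &dts);
    let after_second = prior["Person"].get("name").copied();

    (after_first, after_second)
}
```

Here demo() returns (Some(1), None): the first retraction decrements the count and the second removes the entry. These are the two behaviors the proposed indexer tests are meant to cover.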
Once accurate, drop the lex_sorted_string_ids gate so the fold works for all
indexes.
Related
The runtime novelty stats path (runtime_stats.rs::assemble_fast_stats) only
attributes property deltas when the subject's rdf:type is asserted in the same
batch. The fast path is gated to non-overlay so it isn't affected today, but it's
the same class of issue and worth addressing together.