
Incremental indexing leaves per-(class, property) stats stale after retractions #1266

@bplatz

Summary

The per-(class, property) flake counts in the index stats
(IndexStats.graphs[g].classes[*].properties[*].datatypes) are not
current-state-accurate after an incremental refresh. A commit that only
retracts a property flake (with no new assertion of that property on the
subject) computes the correct decrement but never applies it, so the count stays
too high. The property-level counts (graphs[g].properties[*].count) decrement
correctly; the class-scoped per-property counts do not.

Root cause

  • The stats hook records per-subject datatype deltas with signed values, so
    retractions decrement there: id_hook.rs (subject_prop_dts ... += delta).
  • But the per-subject property presence set is assert-only: id_hook.rs
    (the branch gated on rec.op).
  • The incremental class-stat merge in build/incremental.rs only applies a
    class's class_prop_dts deltas when that class appears in class_properties,
    which is derived from the assert-only presence set. For a retraction-only
    change the class is never revisited, so its decrement delta is dropped.
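
The interaction described above can be reproduced in a few lines. The sketch below is a simplified model, not the indexer's actual code: the type aliases and the function merge_presence_driven are hypothetical stand-ins, and only the names class_properties and class_prop_dts are taken from the issue. It shows how a merge that iterates the assert-only presence map skips a class whose only change is a retraction.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified ids; the real indexer types are not shown in the issue.
type ClassId = u32;
type PropId = u32;
type DtId = u32;

/// class -> property -> datatype -> count (prior stats) or signed delta.
type ClassStats = BTreeMap<ClassId, BTreeMap<PropId, BTreeMap<DtId, i64>>>;
/// class -> properties seen in assertions only (the assert-only presence set).
type ClassProperties = BTreeMap<ClassId, BTreeSet<PropId>>;

/// Models the behavior described in the issue: only classes that appear in
/// `class_properties` are visited, so deltas for other classes are dropped.
fn merge_presence_driven(
    prior: &ClassStats,
    class_properties: &ClassProperties,
    class_prop_dts: &ClassStats,
) -> ClassStats {
    let mut out = prior.clone();
    for class in class_properties.keys() {
        if let Some(props) = class_prop_dts.get(class) {
            for (p, dts) in props {
                for (dt, d) in dts {
                    *out.entry(*class)
                        .or_default()
                        .entry(*p)
                        .or_default()
                        .entry(*dt)
                        .or_insert(0) += *d;
                }
            }
        }
    }
    out
}
```

With a prior count of 3, an empty class_properties map, and a delta of -1 for the same (class, property, datatype), this function returns the count unchanged at 3, which is the stale result the issue reports.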

Where it shows up

A fast path answers COUNT(*) of ?s rdf:type ?o1 . ?s P ?o2 directly from these
stats as Σ_C Σ_dt classStat[C][P].count. Each P-flake on a subject with k types
is attributed once per class (k times in total), which equals the k rows the join
produces for that flake, so no scan is needed. With stale counts the fast path
over-counts on any ledger that has had an indexed retraction touching P on a
typed subject.
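
A small worked model may help show why the fold equals the join count, and how a stale stat breaks it. Everything here is illustrative: the Subject struct and the three functions are hypothetical, the datatype dimension is collapsed into a single count per class, and each subject's class list is assumed to contain no duplicates.

```rust
use std::collections::BTreeMap;

/// Toy subject: its rdf:type classes and how many P-flakes it currently has.
struct Subject {
    classes: Vec<u32>,
    p_flakes: u64,
}

/// Row count of `?s rdf:type ?o1 . ?s P ?o2`:
/// each subject contributes (#types x #P-flakes) rows.
fn join_count(subjects: &[Subject]) -> u64 {
    subjects
        .iter()
        .map(|s| s.classes.len() as u64 * s.p_flakes)
        .sum()
}

/// Exact per-class counts for P: every P-flake is attributed to each class
/// of its subject.
fn class_counts(subjects: &[Subject]) -> BTreeMap<u32, u64> {
    let mut counts = BTreeMap::new();
    for s in subjects {
        for c in &s.classes {
            *counts.entry(*c).or_insert(0) += s.p_flakes;
        }
    }
    counts
}

/// The fold: sum the per-class counts for P, with no scan of the data.
fn fold(stats: &BTreeMap<u32, u64>) -> u64 {
    stats.values().sum()
}
```

For a subject typed {1, 2} with 3 P-flakes and a subject typed {1} with 2 P-flakes, both join_count and the fold over exact stats give 8. If one P-flake on the first subject is retracted and the class stats are not updated, the join yields 6 while the fold over the old stats still yields 8.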

Current mitigation (stopgap, not ideal)

The fast-path fold is gated on store.lex_sorted_string_ids(), which is set only by bulk
import (import.rs) and cleared by any incremental refresh
(incremental_root.rs) — a reliable proxy for "class stats untouched by
incremental drift." So the fold fires only on pure bulk-import indexes (where the
full build produces exact counts) and defers to the always-correct merge
everywhere else. This enables the optimization for bulk-imported datasets but
disables it for any incrementally-updated ledger.

Proposed fix

Make the incremental class-stat merge apply class_prop_dts (and lang/ref)
deltas for every class that has such a delta, not just classes with an
assertion in the current commit:

  • Drive the per-class merge by the union of the delta-source maps
    (class_properties ∪ class_prop_dts.keys ∪ class_prop_lang_deltas.keys ∪ ref_edges.keys).
  • Apply the datatype deltas to the prior ClassPropertyUsage.datatypes
    regardless of presence, and drop entries that reach 0.
  • Add indexer tests: an incremental refresh that retracts a property flake on a
    (base-)typed subject decrements the class-property count; full retraction
    removes it.

Once accurate, drop the lex_sorted_string_ids gate so the fold works for all
indexes.
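
One possible shape for the proposed merge is sketched below, under stated assumptions: the type aliases and merge_union_driven are hypothetical, only the datatype deltas are modeled (the lang and ref delta maps are left out and would join the union the same way), and the real ClassPropertyUsage structure is reduced to a nested map.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified ids and shapes for illustration only.
type ClassId = u32;
type PropId = u32;
type DtId = u32;
type DtCounts = BTreeMap<DtId, i64>;
type ClassStats = BTreeMap<ClassId, BTreeMap<PropId, DtCounts>>;

/// Visits every class that has either an asserted property or a datatype
/// delta, applies the signed deltas, and drops entries that reach 0.
fn merge_union_driven(
    prior: &ClassStats,
    class_properties: &BTreeMap<ClassId, BTreeSet<PropId>>,
    class_prop_dts: &ClassStats,
) -> ClassStats {
    // Union of the delta-source maps' keys (lang/ref maps omitted here).
    let classes: BTreeSet<ClassId> = class_properties
        .keys()
        .chain(class_prop_dts.keys())
        .copied()
        .collect();

    let mut out = prior.clone();
    for class in classes {
        // A class with presence but no datatype delta needs no count change.
        let Some(props) = class_prop_dts.get(&class) else { continue };
        let class_entry = out.entry(class).or_default();
        for (p, dts) in props {
            let counts = class_entry.entry(*p).or_default();
            for (dt, d) in dts {
                *counts.entry(*dt).or_insert(0) += *d;
            }
            // Drop datatype entries whose count reached 0.
            counts.retain(|_, n| *n != 0);
        }
        // Drop properties left with no datatype counts (full retraction).
        class_entry.retain(|_, counts| !counts.is_empty());
    }
    out
}
```

With a prior count of 3 and a retraction-only delta of -1, this returns 2 even when class_properties is empty; a delta of -3 removes the property entry for that class. Whether an emptied class entry should also be removed is left open here, since the issue does not say.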

Related

The runtime novelty stats path (runtime_stats.rs::assemble_fast_stats) only
attributes property deltas when the subject's rdf:type is asserted in the same
batch. The fast path is gated to non-overlay so it isn't affected today, but it's
the same class of issue and worth addressing together.
