feat(cbo): add ANALYZE TABLE persistent column statistics by LuciferYang · Pull Request #630 · lance-format/lance-spark

LuciferYang · 2026-06-16T12:52:45Z

Summary

Adds ANALYZE TABLE … COMPUTE STATISTICS [FOR ALL COLUMNS | FOR COLUMNS …] for Lance tables, persisting per-column statistics into the Lance manifest so Spark's CBO can consume them on subsequent scans — O(1) per column, with no per-scan re-aggregation.

The implementation reuses Spark's own statistics infrastructure rather than hand-rolling aggregation or a serialization format, and adds no new SQL grammar (Spark already parses ANALYZE; an injected resolution rule rewrites it for Lance tables).

Full design, wire format, correctness guarantees, configuration, and known limitations are in #629 — this PR implements that design.

Closes #629
Fixes part of the CBO roadmap (persistent column statistics).

What's in here

Reuse Spark's stats engine. ANALYZE delegates to CommandUtils.computeColumnStats (one aggregation job: min / max / nullCount / HLL-approx distinctCount / avgLen / maxLen) and persists each column with Spark's CatalogColumnStat.toMap. The read path decodes via CatalogColumnStat.fromMap / toPlanStat and feeds the CBO directly.
No new grammar. LanceAnalyzeTableResolution rewrites Spark's native AnalyzeColumn / AnalyzeTable → LanceAnalyzeTable when the target is a Lance dataset, pre-empting the V2-table rejection thrown in DataSourceV2Strategy. Non-Lance ANALYZE is untouched.
Wire format (lance.stats.version=1): an envelope plus lance.stats.column.<name>.<suffix> per-column keys in Spark's CatalogColumnStat map form, in manifest TBLPROPERTIES.
Correctness guarantees: single-commit write ending in complete=true; SHA-256 schema-drift guard; computedAtVersion exact-equality invalidation; a readVersion+1 read-side staleness guard. Fail-safe everywhere (any decode problem → live aggregation).
Config: spark.lance.cbo.column.stats.enabled (master kill-switch, default true) and spark.lance.cbo.column.stats.allow.stale (default false).
Skipped/rejected types: complex containers, interval types, and TimestampNTZ are skipped under FOR ALL COLUMNS / rejected under FOR COLUMNS.

Key classes

read/LanceStatsKeys.java — wire-format spec, key constants, schema hash.
v2/LanceNativeColumnStats.scala — reflective bridge to CommandUtils.computeColumnStats (3.x/4.x split).
v2/LanceColumnStatCodec.scala — read-side CatalogColumnStat → V2 ColumnStatistics bridge (fail-safe).
v2/LanceAnalyzeTableResolution.scala — analyzer rule rewriting native ANALYZE for Lance tables.
v2/LanceAnalyzeTableExec.scala — ANALYZE executor (writer, single-commit ordering, orphan-key GC).
read/LanceScanBuilder.java — loadPersistedColumnStats fast path.

Test plan

Unit: LanceStatsKeysTest, LanceAnalyzeTableSchemaHashTest, LoadPersistedColumnStatsTest, LanceSparkReadOptionsSerializationTest.
End-to-end: BaseAnalyzeTableTest (run per Spark version via AnalyzeTableTest) — FOR ALL COLUMNS / FOR COLUMNS, decimal / empty / all-null / re-analyze / partial, stats-reach-scan, TimestampNTZ-skip, case-insensitivity, NOSCAN-falls-through-to-Spark, and kill-switch paths.
Verified locally: the Spark 3.5 module suite is green (base + 3.5, incl. the new AnalyzeTableTest). Spark 4.0 / 4.1 reuse the 3.5 test sources via build-helper; 3.4 compiles. The full four-version + integration matrix runs in CI.

github-actions · 2026-06-16T12:53:15Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

LuciferYang · 2026-06-17T05:10:43Z

cc @hamersaw @fangbo FYI

summaryzb · 2026-06-18T02:43:57Z

+    // after the column became all-NULL), so the reader can't serve a stale value under the fresh
+    // manifest version. (Columns not analyzed at all are cleared by staleKeys.)
+    //
+    // The histogram is intentionally dropped (when spark.sql.statistics.histogram.enabled is on,


Maybe it's better to skip histogram computing ahead, rather than skipping persist the histogram result

Makes sense, building it just to throw it away is wasted work. LanceNativeColumnStats now turns spark.sql.statistics.histogram.enabled off around the aggregation (and restores it after), so Spark never computes a histogram in the first place. Dropped the post-serialization strip too. The test that sets histogram.enabled=true still passes, now because nothing gets produced rather than produced-then-dropped.

summaryzb · 2026-06-18T02:46:14Z

+    }
+
+    // Step 4: complete-marker LAST — only valid stats payloads have lance.stats.complete=true.
+    changes += TableChange.setProperty(LanceStatsKeys.COMPLETE, "true")


Lance table atomically alterTable properties, valid stats is over-design

Agreed. The whole change set goes in via a single updateConfig, so there's no half-written window for the marker to guard. I removed lance.stats.complete end to end: the two writes, the read-side gate, and the constant. Validity now keys off version + computedAtVersion + schemaHash, which already covered everything the flag did.

summaryzb · 2026-06-18T02:57:21Z

+   */
+  public static final String COLUMN_PREFIX = "lance.stats.column.";
+
+  public static final String COLUMN_SUFFIX_VERSION = ".version";


It seems all the suffix variant only used in test cases

Yeah. The read and write paths both go through Spark's CatalogColumnStat.toMap/fromMap, so prod never references the suffixes directly. Moved them into a test-only LanceStatsTestKeys and left LanceStatsKeys with just the keys production actually reads or writes.

…'s own infra Add `ANALYZE TABLE <t> COMPUTE STATISTICS [FOR [ALL] COLUMNS]` for Lance tables. Per-column stats (min/max/nullCount/HLL-approx distinctCount/avgLen/maxLen) are persisted into the Lance manifest TBLPROPERTIES under `lance.stats.*`, tagged with the manifest version, and surfaced to Spark's CBO on subsequent scans via DSv2 `Statistics.columnStats()` — no per-scan re-aggregation. Reuses Spark's own infrastructure end-to-end: - Computation: Spark's `CommandUtils.computeColumnStats` (single aggregation job), via a reflective bridge (LanceNativeColumnStats) because the session parameter is `o.a.s.sql.SparkSession` on 3.x but `o.a.s.sql.classic.SparkSession` on 4.x; the lookup fails loud if Spark's internal signature drifts. - Wire format: Spark's `CatalogColumnStat.toMap` / `fromMap` / `toPlanStat` (LanceColumnStatCodec on the read side). min/max come back in the internal catalyst representation that `DataSourceV2Relation.transformV2Stats` copies verbatim into a catalyst ColumnStat (no CatalystTypeConverters round-trip). NDV is HLL-approximate; string/binary get no min/max (matching native ANALYZE); avgLen/maxLen are surfaced for join-size estimation. Parsing: no custom ANALYZE grammar. Spark parses ANALYZE natively into AnalyzeColumn / AnalyzeTable; an injected resolution rule (LanceAnalyzeTableResolution) rewrites those into LanceAnalyzeTable when — and only when — the target resolves to a LanceDataset, pre-empting CheckAnalysis's NOT_SUPPORTED_COMMAND_FOR_V2_TABLE rejection (which fires after the resolution batches). Non-Lance ANALYZE is left untouched for Spark, so ANALYZE on a non-Lance table in a Lance-extension session is no longer hijacked. NOSCAN and partition-scoped ANALYZE intentionally fall through to Spark's V2 rejection. Safety / correctness: - Atomicity: the entire change list commits via one `catalog.alterTable` → single `Dataset.updateConfig`; the read path consumes stats only when `lance.stats.complete=true`. - Staleness: stats are tagged with the expected post-write manifest version; the read path's exact-version check rejects stale stats (fall back to no stats), with an opt-in allow-stale config. - Schema drift: SHA-256 fingerprint of (name, dataType, nullable) guards against rename/retype/reorder; mismatch falls back to no stats. - TimestampNTZ columns are skipped (their min/max would crash Spark's CBO FilterEstimation); poisoned/unparseable per-column payloads fail-safe to no stats for that column. Config (SparkConf / per-scan): `spark.lance.cbo.column.stats.enabled`, `.max.columns`, `.allow.stale`. Master switch off = safe rollback. Verified: lint + checkstyle clean; full Spark 3.5 suite (866 tests) and full Spark 4.0 suite (888 tests) green; Spark 3.4 / 4.1 compile.

fangbo · 2026-06-23T03:19:02Z

Hi, @LuciferYang There some discussions about column statistics in lance format . See lance-format/lance#4540.

LuciferYang · 2026-06-23T07:09:56Z

Thanks @fangbo, this is the context I was missing.

Agreed that column statistics really belong in the format itself. Per-fragment collection consolidated into a mergeable, eventually default-on store is a much better long-term home than recomputing in Spark and parking the result in TBLPROPERTIES.

Since @HaochengLIU's native work (lance #5639, #5967) is already underway and covers the same ground, I'd rather not ship an interim Spark-side mechanism that we'd only have to migrate off later. So I'll put this PR on hold and wait for the native column-statistics API to land, then wire lance-spark's CBO up to Dataset.column_statistics() directly.

Happy to help review/test on the native side in the meantime, and to align on the fields Spark's CBO needs (min/max/nullCount/distinctCount/avgLen/maxLen) so they're covered when it lands.

github-actions Bot added the enhancement New feature or request label Jun 16, 2026

LuciferYang changed the title ~~feat(cbo): ANALYZE TABLE persistent column statistics~~ feat(cbo): add ANALYZE TABLE persistent column statistics Jun 16, 2026

LuciferYang force-pushed the cbo-analyze-table branch from 7cc85c8 to 1f1e46f Compare June 16, 2026 13:01

summaryzb reviewed Jun 18, 2026

View reviewed changes

LuciferYang force-pushed the cbo-analyze-table branch from 1f1e46f to 47ab288 Compare June 22, 2026 12:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(cbo): add ANALYZE TABLE persistent column statistics#630

feat(cbo): add ANALYZE TABLE persistent column statistics#630
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:cbo-analyze-table

LuciferYang commented Jun 16, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

LuciferYang commented Jun 17, 2026

Uh oh!

summaryzb Jun 18, 2026

Uh oh!

LuciferYang Jun 22, 2026

Uh oh!

summaryzb Jun 18, 2026

Uh oh!

LuciferYang Jun 22, 2026

Uh oh!

summaryzb Jun 18, 2026

Uh oh!

LuciferYang Jun 22, 2026

Uh oh!

fangbo commented Jun 23, 2026

Uh oh!

LuciferYang commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

LuciferYang commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in here

Key classes

Test plan

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

LuciferYang commented Jun 17, 2026

Uh oh!

summaryzb Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

summaryzb Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

summaryzb Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

fangbo commented Jun 23, 2026

Uh oh!

LuciferYang commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LuciferYang commented Jun 16, 2026 •

edited

Loading