docs: specify data overlay files for the table format #7381

Draft
wjones127 wants to merge 3 commits into lance-format:main from wjones127:feat-patch-files

@wjones127

Adds a specification for data overlay files: small files attached to a fragment that supply new values for a subset of (row offset, field) cells without rewriting the base data files. They make cell-level updates cheap when only a small fraction of rows and/or columns change.

This PR is spec + proto only — no read/write implementation yet.

Changes

  • protos/table.proto
    • Rework DataOverlayFile: a oneof coverage { bytes shared_offset_bitmap | FieldCoverage field_coverage } to support both dense (rectangular) and sparse overlays; add the FieldCoverage message.
    • Rename read_version → committed_version (uint64), with effective/commit-stamped semantics so overlay-vs-index ordering is correct.
    • Drop the in-file offset key column in favor of rank-based addressing off the coverage bitmap.
    • Document reader feature flag 64 (and previously-undocumented 16/32).
  • docs/src/format/table/data_overlay_file.md (new): full specification — coverage/resolution, deletion precedence, NULL-override, layout + rank addressing, dense vs. sparse, versioning, field-aware index exclusion with flat re-evaluation, the correctness invariant, both compaction modes, row lineage, a worked example (write → read → index query → sparse write → read → compaction), and a guidance stub with open questions.
  • docs/src/format/table/index.md: concise overview + link to the new spec (replacing the earlier inline sketch).
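A minimal sketch of the rank-based addressing mentioned above, under stated assumptions: the coverage bitmap is modeled as a sorted list of row offsets, and the overlay stores one value per covered offset in ascending offset order. The helper names (`rank`, `read_cell`) are hypothetical and are not taken from the spec or from the Lance codebase.

```python
from bisect import bisect_left


def rank(coverage, offset):
    """Return the position of `offset` among the covered offsets.

    `coverage` is a sorted list of distinct row offsets standing in for the
    coverage bitmap (an assumption of this sketch). Returns None when the
    offset is not covered by the overlay.
    """
    i = bisect_left(coverage, offset)
    if i < len(coverage) and coverage[i] == offset:
        return i
    return None


def read_cell(base_column, coverage, overlay_values, offset):
    """Resolve one cell: the overlay value if covered, else the base value."""
    r = rank(coverage, offset)
    if r is None:
        return base_column[offset]
    # No offset column is stored in the overlay; the rank is the index.
    return overlay_values[r]


base = ["a", "b", "c", "d", "e"]
coverage = [1, 3]             # row offsets the overlay supplies values for
overlay_values = ["B", "D"]   # stored in coverage order

resolved = [read_cell(base, coverage, overlay_values, o) for o in range(len(base))]
```

Because the position of a value is derived from the bitmap, the overlay file does not need its own offset key column, which is the point of the third `protos/table.proto` bullet.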

Out of scope / follow-ups

  • Write transaction shape (new Operation variant in transaction.proto + Rust).
  • Writer support for unequal-length columns (needed for single-file sparse overlays).
  • Coverage bitmap external spill for very large coverage.
  • Per-fragment vs. per-table overlays / LSM analogy (open question in the doc).

🤖 Generated with Claude Code

Add a specification for data overlay files: small files attached to a
fragment that supply new values for a subset of (row offset, field) cells
without rewriting the base data files, for cheap cell-level updates.

- protos/table.proto: rework DataOverlayFile with a dense/sparse coverage
  oneof (shared_offset_bitmap vs new FieldCoverage), rename read_version to
  committed_version (effective, commit-stamped), and document rank-based
  addressing with no offset column. Document reader feature flag 64.
- docs: add data_overlay_file.md (full spec, worked example, guidance stub)
  and link it from the table format overview.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions bot added the A-format (On-disk format: protos and format spec docs) label on Jun 19, 2026
@github-actions

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

github-actions bot added the documentation (Improvements or additions to documentation) label on Jun 19, 2026

Add the `DataOverlay` operation (and `DataOverlayGroup`) to attach overlay
files to fragments without rewriting their base data. Mirrors the
`DataReplacement` batch shape, appends to each fragment's `overlays` list, and
documents permissive conflict semantics: concurrent overlays, appends, deletes,
and column rewrites are compatible; row-rewrites, compaction, and overlay->base
folds conflict.
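
The conflict rules in this paragraph can be read as a small lookup. This is only an illustrative sketch: the operation labels are hypothetical strings chosen here, and a real conflict check would also consider which fragments each transaction touches, which this sketch ignores.

```python
# Operations that may commit concurrently with a DataOverlay, per the text above.
COMPATIBLE_WITH_OVERLAY = {"overlay", "append", "delete", "column_rewrite"}

# Operations that conflict with a concurrent DataOverlay, per the text above.
CONFLICTS_WITH_OVERLAY = {"row_rewrite", "compaction", "overlay_to_base_fold"}


def overlay_conflicts_with(concurrent_op):
    """Return True if a DataOverlay commit conflicts with `concurrent_op`.

    The labels are hypothetical stand-ins for the transaction operations
    named in the commit message; they are not identifiers from the proto.
    """
    if concurrent_op in CONFLICTS_WITH_OVERLAY:
        return True
    if concurrent_op in COMPATIBLE_WITH_OVERLAY:
        return False
    raise ValueError(f"operation not covered by this sketch: {concurrent_op}")
```

For example, `overlay_conflicts_with("append")` is `False`, while `overlay_conflicts_with("compaction")` is `True`.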

committed_version is left 0 by the writer and stamped at commit time.
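
A sketch of the stamping step, assuming the stamped value is the version of the manifest that the commit produces (the commit message says only that it is stamped at commit time). `OverlayFile` and `stamp_at_commit` are hypothetical names used for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OverlayFile:
    """Hypothetical stand-in for the DataOverlayFile proto message."""
    path: str
    committed_version: int = 0  # the writer leaves this at 0


def stamp_at_commit(new_overlays, new_version):
    """Return copies of the newly attached overlays with the version filled in.

    Assumption: `new_version` is the dataset version created by this commit.
    """
    return [replace(o, committed_version=new_version) for o in new_overlays]


staged = [OverlayFile("overlay-0.lance")]   # as produced by the writer
committed = stamp_at_commit(staged, 12)     # as recorded in the manifest
```

Stamping at commit rather than at write time means a retried or rebased commit records the version it actually landed in, not the version the writer happened to read.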

Proto only — Rust/Python bindings deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wjones127

See discussion: #7401

The table/transaction proto changes generate new fields and an Operation
variant. This wires the minimum needed to compile without implementing overlay
support:

- Emit empty `overlays` when converting fragments to proto.
- Reject the `DataOverlay` transaction operation with NotSupported on read.

Datasets that use overlays set reader feature flag 64, which already falls in
the unknown-flag range rejected by `can_read_dataset`, so the library refuses
them at the feature-flag layer.
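
The rejection described here can be sketched as a bitmask check. This is not the library's `can_read_dataset`; it is a simplified model that assumes every reader flag bit below 64 is understood and every bit from 64 upward is unknown.

```python
FLAG_DATA_OVERLAYS = 64  # reader feature flag named in this PR

# Assumption for this sketch: bits 1, 2, 4, 8, 16 and 32 are understood.
KNOWN_READER_FLAGS = FLAG_DATA_OVERLAYS - 1


def can_read_dataset_sketch(reader_feature_flags):
    """Return True only if no unknown reader flag bit is set."""
    unknown_bits = reader_feature_flags & ~KNOWN_READER_FLAGS
    return unknown_bits == 0
```

Under these assumptions a dataset with flags `1 | 2` is readable, while any dataset with flag 64 set is refused before the overlay code paths are reached.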

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>