feat: add lance_dataset_add_columns for SQL / all-null / stream column addition#45
Merged
Merged
Conversation
…n addition ## Summary Last of three PRs against lance-format#41. Exposes upstream's `Dataset::add_columns` through the three `NewColumnTransform` cases that translate cleanly across the C ABI: - **SQL expressions** — derive new columns from SQL over existing columns. - **All-null columns** — add nullable columns from an Arrow schema. On the modern format this is metadata-only; the legacy format can't represent it that way and returns `LANCE_ERR_NOT_SUPPORTED`. - **Stream** — splice in precomputed column data from an Arrow C stream, aligned positionally to the dataset's existing rows. Upstream's fourth variant, `BatchUDF`, is left out on purpose: it carries a Rust closure that can't cross the C ABI, and the stream variant already covers the same "bring your own computed data" use case. Each call mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-add view via the same Arc clone-on-write as the `_drop_columns` (lance-format#42) / `_alter_columns` (lance-format#44) siblings. Three focused entry points rather than one mode-tagged function, because the inputs are genuinely different shapes (name/expression pairs vs. a schema pointer vs. a stream). Upstream's `read_columns` parameter isn't exposed — it only feeds `BatchUDF`; for these three variants upstream ignores it. `batch_size` is forwarded where it does something (SQL scan, stream alignment) and omitted from the metadata-only all-null path. ## Surface ```c typedef struct LanceSqlColumn { const char* name; const char* expression; } LanceSqlColumn; int32_t lance_dataset_add_columns_sql( LanceDataset* dataset, const LanceSqlColumn* columns, size_t num_columns, uint64_t batch_size); int32_t lance_dataset_add_columns_nulls( LanceDataset* dataset, const struct ArrowSchema* schema); int32_t lance_dataset_add_columns_stream( LanceDataset* dataset, struct ArrowArrayStream* stream, uint64_t batch_size); ``` `batch_size` uses `0` for the upstream default and is range-checked to `u32`. Two error-code details are worth calling out, both matching existing behavior: a SQL expression referencing a non-existent column surfaces as `LANCE_ERR_INTERNAL` (an upstream schema error, the same path as `lance_dataset_delete`), whereas a syntax error is `LANCE_ERR_INVALID_ARGUMENT`. Because these are `unsafe extern "C"` entry points under `panic = "abort"`, the stream variant pre-validates the mandatory CADI callbacks before handing the stream to arrow-rs (which would otherwise abort on a NULL `get_schema` / `get_next`), and the all-null variant rejects an uninitialised or non-UTF-8 top-level schema `format` before arrow-rs's `assert!`/`expect` can fire. The stream is consumed (released) on every non-NULL return path. ## Tests Rust integration tests cover all three variants end to end: computed values (single and multi-column SQL, constant expressions, honored `batch_size`), all-null backfill, and multi-fragment stream alignment — plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count mismatch, invalid `batch_size`, released/missing-callback streams, non-nullable all-null fields, and the legacy-format `NOT_SUPPORTED` path). Stream-consumption is proven with a drop counter rather than a vacuous release-slot check. C and C++ smoke tests exercise the SQL happy path and each variant's argument rejections across the ABI.
jja725
approved these changes
Jun 11, 2026
Contributor
Author
|
Thank you @jja725 |
This was referenced Jun 16, 2026
jja725
pushed a commit
that referenced
this pull request
Jun 16, 2026
The schema-evolution series is fully merged, so this flips the Phase 3 roadmap row from `[ ]` to `[x]` and names the functions that cover it, matching the style of the rows around it. Covered by: - #45 — `lance_dataset_add_columns_sql/_nulls/_stream` - #44 — `lance_dataset_alter_columns` - #42 — `lance_dataset_drop_columns` Closes #41. Docs-only; no code or test changes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Last of three PRs against #41. Exposes upstream's
Dataset::add_columnsthrough the threeNewColumnTransformcases that translate cleanly across the C ABI:LANCE_ERR_NOT_SUPPORTED.Upstream's fourth variant,
BatchUDF, is left out on purpose: it carries a Rust closure that can't cross the C ABI, and the stream variant already covers the same "bring your own computed data" use case.Each call mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-add view via the same Arc clone-on-write as the
_drop_columns(#42) /_alter_columns(#44) siblings.Three focused entry points rather than one mode-tagged function, because the inputs are genuinely different shapes (name/expression pairs vs. a schema pointer vs. a stream). Upstream's
read_columnsparameter isn't exposed — it only feedsBatchUDF; for these three variants upstream ignores it.batch_sizeis forwarded where it does something (SQL scan, stream alignment) and omitted from the metadata-only all-null path.Surface
batch_sizeuses0for the upstream default and is range-checked tou32. Two error-code details are worth calling out, both matching existing behavior: a SQL expression referencing a non-existent column surfaces asLANCE_ERR_INTERNAL(an upstream schema error, the same path aslance_dataset_delete), whereas a syntax error isLANCE_ERR_INVALID_ARGUMENT.Because these are
unsafe extern "C"entry points underpanic = "abort", the stream variant pre-validates the mandatory CADI callbacks before handing the stream to arrow-rs (which would otherwise abort on a NULLget_schema/get_next), and the all-null variant rejects an uninitialised or non-UTF-8 top-level schemaformatbefore arrow-rs'sassert!/expectcan fire. The stream is consumed (released) on every non-NULL return path.Tests
Rust integration tests cover all three variants end to end: computed values (single and multi-column SQL, constant expressions, honored
batch_size), all-null backfill, and multi-fragment stream alignment — plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count mismatch, invalidbatch_size, released/missing-callback streams, non-nullable all-null fields, and the legacy-formatNOT_SUPPORTEDpath). Stream-consumption is proven with a drop counter rather than a vacuous release-slot check. C and C++ smoke tests exercise the SQL happy path and each variant's argument rejections across the ABI.