feat: add lance_dataset_add_columns for SQL / all-null / stream column addition by LuciferYang · Pull Request #45 · lance-format/lance-c

LuciferYang · 2026-06-11T07:12:02Z

Summary

Last of three PRs against #41. Exposes upstream's Dataset::add_columns through the three NewColumnTransform cases that translate cleanly across the C ABI:

SQL expressions — derive new columns from SQL over existing columns.
All-null columns — add nullable columns from an Arrow schema. On the modern format this is metadata-only; the legacy format can't represent it that way and returns LANCE_ERR_NOT_SUPPORTED.
Stream — splice in precomputed column data from an Arrow C stream, aligned positionally to the dataset's existing rows.

Upstream's fourth variant, BatchUDF, is left out on purpose: it carries a Rust closure that can't cross the C ABI, and the stream variant already covers the same "bring your own computed data" use case.

Each call mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-add view via the same Arc clone-on-write as the _drop_columns (#42) / _alter_columns (#44) siblings.

Three focused entry points rather than one mode-tagged function, because the inputs are genuinely different shapes (name/expression pairs vs. a schema pointer vs. a stream). Upstream's read_columns parameter isn't exposed — it only feeds BatchUDF; for these three variants upstream ignores it. batch_size is forwarded where it does something (SQL scan, stream alignment) and omitted from the metadata-only all-null path.

Surface

typedef struct LanceSqlColumn {
    const char* name;
    const char* expression;
} LanceSqlColumn;

int32_t lance_dataset_add_columns_sql(
    LanceDataset* dataset, const LanceSqlColumn* columns,
    size_t num_columns, uint64_t batch_size);

int32_t lance_dataset_add_columns_nulls(
    LanceDataset* dataset, const struct ArrowSchema* schema);

int32_t lance_dataset_add_columns_stream(
    LanceDataset* dataset, struct ArrowArrayStream* stream,
    uint64_t batch_size);

batch_size uses 0 for the upstream default and is range-checked to u32. Two error-code details are worth calling out, both matching existing behavior: a SQL expression referencing a non-existent column surfaces as LANCE_ERR_INTERNAL (an upstream schema error, the same path as lance_dataset_delete), whereas a syntax error is LANCE_ERR_INVALID_ARGUMENT.

Because these are unsafe extern "C" entry points under panic = "abort", the stream variant pre-validates the mandatory CADI callbacks before handing the stream to arrow-rs (which would otherwise abort on a NULL get_schema / get_next), and the all-null variant rejects an uninitialised or non-UTF-8 top-level schema format before arrow-rs's assert!/expect can fire. The stream is consumed (released) on every non-NULL return path.

Tests

Rust integration tests cover all three variants end to end: computed values (single and multi-column SQL, constant expressions, honored batch_size), all-null backfill, and multi-fragment stream alignment — plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count mismatch, invalid batch_size, released/missing-callback streams, non-nullable all-null fields, and the legacy-format NOT_SUPPORTED path). Stream-consumption is proven with a drop counter rather than a vacuous release-slot check. C and C++ smoke tests exercise the SQL happy path and each variant's argument rejections across the ABI.

…n addition ## Summary Last of three PRs against lance-format#41. Exposes upstream's `Dataset::add_columns` through the three `NewColumnTransform` cases that translate cleanly across the C ABI: - **SQL expressions** — derive new columns from SQL over existing columns. - **All-null columns** — add nullable columns from an Arrow schema. On the modern format this is metadata-only; the legacy format can't represent it that way and returns `LANCE_ERR_NOT_SUPPORTED`. - **Stream** — splice in precomputed column data from an Arrow C stream, aligned positionally to the dataset's existing rows. Upstream's fourth variant, `BatchUDF`, is left out on purpose: it carries a Rust closure that can't cross the C ABI, and the stream variant already covers the same "bring your own computed data" use case. Each call mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-add view via the same Arc clone-on-write as the `_drop_columns` (lance-format#42) / `_alter_columns` (lance-format#44) siblings. Three focused entry points rather than one mode-tagged function, because the inputs are genuinely different shapes (name/expression pairs vs. a schema pointer vs. a stream). Upstream's `read_columns` parameter isn't exposed — it only feeds `BatchUDF`; for these three variants upstream ignores it. `batch_size` is forwarded where it does something (SQL scan, stream alignment) and omitted from the metadata-only all-null path. ## Surface ```c typedef struct LanceSqlColumn { const char* name; const char* expression; } LanceSqlColumn; int32_t lance_dataset_add_columns_sql( LanceDataset* dataset, const LanceSqlColumn* columns, size_t num_columns, uint64_t batch_size); int32_t lance_dataset_add_columns_nulls( LanceDataset* dataset, const struct ArrowSchema* schema); int32_t lance_dataset_add_columns_stream( LanceDataset* dataset, struct ArrowArrayStream* stream, uint64_t batch_size); ``` `batch_size` uses `0` for the upstream default and is range-checked to `u32`. Two error-code details are worth calling out, both matching existing behavior: a SQL expression referencing a non-existent column surfaces as `LANCE_ERR_INTERNAL` (an upstream schema error, the same path as `lance_dataset_delete`), whereas a syntax error is `LANCE_ERR_INVALID_ARGUMENT`. Because these are `unsafe extern "C"` entry points under `panic = "abort"`, the stream variant pre-validates the mandatory CADI callbacks before handing the stream to arrow-rs (which would otherwise abort on a NULL `get_schema` / `get_next`), and the all-null variant rejects an uninitialised or non-UTF-8 top-level schema `format` before arrow-rs's `assert!`/`expect` can fire. The stream is consumed (released) on every non-NULL return path. ## Tests Rust integration tests cover all three variants end to end: computed values (single and multi-column SQL, constant expressions, honored `batch_size`), all-null backfill, and multi-fragment stream alignment — plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count mismatch, invalid `batch_size`, released/missing-callback streams, non-nullable all-null fields, and the legacy-format `NOT_SUPPORTED` path). Stream-consumption is proven with a drop counter rather than a vacuous release-slot check. C and C++ smoke tests exercise the SQL happy path and each variant's argument rejections across the ABI.

LuciferYang · 2026-06-12T02:29:14Z

Thank you @jja725

The schema-evolution series is fully merged, so this flips the Phase 3 roadmap row from `[ ]` to `[x]` and names the functions that cover it, matching the style of the rows around it. Covered by: - #45 — `lance_dataset_add_columns_sql/_nulls/_stream` - #44 — `lance_dataset_alter_columns` - #42 — `lance_dataset_drop_columns` Closes #41. Docs-only; no code or test changes.

jja725 approved these changes Jun 11, 2026

View reviewed changes

jja725 merged commit a4f4cef into lance-format:main Jun 11, 2026
9 checks passed

This was referenced Jun 16, 2026

docs: mark schema evolution complete in roadmap #46

Merged

Schema evolution: add/drop/alter columns #41

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add lance_dataset_add_columns for SQL / all-null / stream column addition#45

feat: add lance_dataset_add_columns for SQL / all-null / stream column addition#45
jja725 merged 1 commit into
lance-format:mainfrom
LuciferYang:feat/dataset-add-columns

LuciferYang commented Jun 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

LuciferYang commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LuciferYang commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Surface

Tests

Uh oh!

Uh oh!

LuciferYang commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LuciferYang commented Jun 11, 2026 •

edited

Loading