Skip to content

feat: add lance_dataset_add_columns for SQL / all-null / stream column addition#45

Merged
jja725 merged 1 commit into
lance-format:mainfrom
LuciferYang:feat/dataset-add-columns
Jun 11, 2026
Merged

feat: add lance_dataset_add_columns for SQL / all-null / stream column addition#45
jja725 merged 1 commit into
lance-format:mainfrom
LuciferYang:feat/dataset-add-columns

Conversation

@LuciferYang

@LuciferYang LuciferYang commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Last of three PRs against #41. Exposes upstream's Dataset::add_columns through the three NewColumnTransform cases that translate cleanly across the C ABI:

  • SQL expressions — derive new columns from SQL over existing columns.
  • All-null columns — add nullable columns from an Arrow schema. On the modern format this is metadata-only; the legacy format can't represent it that way and returns LANCE_ERR_NOT_SUPPORTED.
  • Stream — splice in precomputed column data from an Arrow C stream, aligned positionally to the dataset's existing rows.

Upstream's fourth variant, BatchUDF, is left out on purpose: it carries a Rust closure that can't cross the C ABI, and the stream variant already covers the same "bring your own computed data" use case.

Each call mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-add view via the same Arc clone-on-write as the _drop_columns (#42) / _alter_columns (#44) siblings.

Three focused entry points rather than one mode-tagged function, because the inputs are genuinely different shapes (name/expression pairs vs. a schema pointer vs. a stream). Upstream's read_columns parameter isn't exposed — it only feeds BatchUDF; for these three variants upstream ignores it. batch_size is forwarded where it does something (SQL scan, stream alignment) and omitted from the metadata-only all-null path.

Surface

typedef struct LanceSqlColumn {
    const char* name;
    const char* expression;
} LanceSqlColumn;

int32_t lance_dataset_add_columns_sql(
    LanceDataset* dataset, const LanceSqlColumn* columns,
    size_t num_columns, uint64_t batch_size);

int32_t lance_dataset_add_columns_nulls(
    LanceDataset* dataset, const struct ArrowSchema* schema);

int32_t lance_dataset_add_columns_stream(
    LanceDataset* dataset, struct ArrowArrayStream* stream,
    uint64_t batch_size);

batch_size uses 0 for the upstream default and is range-checked to u32. Two error-code details are worth calling out, both matching existing behavior: a SQL expression referencing a non-existent column surfaces as LANCE_ERR_INTERNAL (an upstream schema error, the same path as lance_dataset_delete), whereas a syntax error is LANCE_ERR_INVALID_ARGUMENT.

Because these are unsafe extern "C" entry points under panic = "abort", the stream variant pre-validates the mandatory CADI callbacks before handing the stream to arrow-rs (which would otherwise abort on a NULL get_schema / get_next), and the all-null variant rejects an uninitialised or non-UTF-8 top-level schema format before arrow-rs's assert!/expect can fire. The stream is consumed (released) on every non-NULL return path.

Tests

Rust integration tests cover all three variants end to end: computed values (single and multi-column SQL, constant expressions, honored batch_size), all-null backfill, and multi-fragment stream alignment — plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count mismatch, invalid batch_size, released/missing-callback streams, non-nullable all-null fields, and the legacy-format NOT_SUPPORTED path). Stream-consumption is proven with a drop counter rather than a vacuous release-slot check. C and C++ smoke tests exercise the SQL happy path and each variant's argument rejections across the ABI.

…n addition

## Summary

Last of three PRs against lance-format#41. Exposes upstream's `Dataset::add_columns`
through the three `NewColumnTransform` cases that translate cleanly across
the C ABI:

- **SQL expressions** — derive new columns from SQL over existing columns.
- **All-null columns** — add nullable columns from an Arrow schema. On the
  modern format this is metadata-only; the legacy format can't represent it
  that way and returns `LANCE_ERR_NOT_SUPPORTED`.
- **Stream** — splice in precomputed column data from an Arrow C stream,
  aligned positionally to the dataset's existing rows.

Upstream's fourth variant, `BatchUDF`, is left out on purpose: it carries a
Rust closure that can't cross the C ABI, and the stream variant already
covers the same "bring your own computed data" use case.

Each call mutates the dataset in place under an exclusive write lock;
scanners already in flight keep their pre-add view via the same Arc
clone-on-write as the `_drop_columns` (lance-format#42) / `_alter_columns` (lance-format#44)
siblings.

Three focused entry points rather than one mode-tagged function, because the
inputs are genuinely different shapes (name/expression pairs vs. a schema
pointer vs. a stream). Upstream's `read_columns` parameter isn't exposed —
it only feeds `BatchUDF`; for these three variants upstream ignores it.
`batch_size` is forwarded where it does something (SQL scan, stream
alignment) and omitted from the metadata-only all-null path.

## Surface

```c
typedef struct LanceSqlColumn {
    const char* name;
    const char* expression;
} LanceSqlColumn;

int32_t lance_dataset_add_columns_sql(
    LanceDataset* dataset, const LanceSqlColumn* columns,
    size_t num_columns, uint64_t batch_size);

int32_t lance_dataset_add_columns_nulls(
    LanceDataset* dataset, const struct ArrowSchema* schema);

int32_t lance_dataset_add_columns_stream(
    LanceDataset* dataset, struct ArrowArrayStream* stream,
    uint64_t batch_size);
```

`batch_size` uses `0` for the upstream default and is range-checked to
`u32`. Two error-code details are worth calling out, both matching existing
behavior: a SQL expression referencing a non-existent column surfaces as
`LANCE_ERR_INTERNAL` (an upstream schema error, the same path as
`lance_dataset_delete`), whereas a syntax error is `LANCE_ERR_INVALID_ARGUMENT`.

Because these are `unsafe extern "C"` entry points under `panic = "abort"`,
the stream variant pre-validates the mandatory CADI callbacks before handing
the stream to arrow-rs (which would otherwise abort on a NULL `get_schema` /
`get_next`), and the all-null variant rejects an uninitialised or non-UTF-8
top-level schema `format` before arrow-rs's `assert!`/`expect` can fire. The
stream is consumed (released) on every non-NULL return path.

## Tests

Rust integration tests cover all three variants end to end: computed values
(single and multi-column SQL, constant expressions, honored `batch_size`),
all-null backfill, and multi-fragment stream alignment — plus the full
rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count
mismatch, invalid `batch_size`, released/missing-callback streams, non-nullable
all-null fields, and the legacy-format `NOT_SUPPORTED` path). Stream-consumption
is proven with a drop counter rather than a vacuous release-slot check. C and
C++ smoke tests exercise the SQL happy path and each variant's argument
rejections across the ABI.
@jja725 jja725 merged commit a4f4cef into lance-format:main Jun 11, 2026
9 checks passed
@LuciferYang

Copy link
Copy Markdown
Contributor Author

Thank you @jja725

jja725 pushed a commit that referenced this pull request Jun 16, 2026
The schema-evolution series is fully merged, so this flips the Phase 3
roadmap row from `[ ]` to `[x]` and names the functions that cover it,
matching the style of the rows around it.

Covered by:
- #45 — `lance_dataset_add_columns_sql/_nulls/_stream`
- #44 — `lance_dataset_alter_columns`
- #42 — `lance_dataset_drop_columns`

Closes #41.

Docs-only; no code or test changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants