feat: add lance_dataset_drop_columns for metadata-only column removal by LuciferYang · Pull Request #42 · lance-format/lance-c

LuciferYang · 2026-05-22T06:40:31Z

Summary

First of three PRs against #41 (schema evolution). Exposes upstream's drop_columns — a metadata-only manifest commit that removes the named columns from the schema without rewriting any data files. Materializing the projection is left to a later _compact_files (and a future cleanup operation, once exposed, removes the old version's files).

Mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-drop snapshot view via the existing Arc clone-on-write, same as _delete / _update / _compact_files.

Surface

int32_t lance_dataset_drop_columns(
    LanceDataset* dataset,
    const char* const* columns,
    size_t num_columns
);

Inputs are validated up front with per-index error messages so the precise cause is observable from lance_last_error_message(). NULL handle, NULL pointer array, zero count, NULL or empty-string entries, and non-UTF-8 names all return LANCE_ERR_INVALID_ARGUMENT; upstream's own rejections (unknown column, attempt to drop every column) map to the same code.

The C++ wrapper takes const std::vector<std::string>& and follows the update / merge_insert sibling convention — passes col_ptrs.data() unconditionally. An empty vector flows through the Rust-side num_columns == 0 guard so the error message says "num_columns must be > 0" rather than the misleading "columns must not be NULL".
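
The guard ordering described above can be sketched in plain C. This is only an illustration under stated assumptions, not the Rust implementation in this PR: `sketch_validate_columns`, the `SKETCH_*` codes, and the per-index message wording are hypothetical, and the UTF-8 check is omitted. Only the two quoted messages ("num_columns must be > 0" and "columns must not be NULL") come from the description.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch only; the real LANCE_ERR_*
   values live in the lance-c header and are not reproduced here. */
enum { SKETCH_OK = 0, SKETCH_INVALID_ARGUMENT = 1 };

/* Validates a (columns, num_columns) pair: count first, then the array
   pointer, then each entry by index. On failure a message is written to
   msg and SKETCH_INVALID_ARGUMENT is returned. */
int32_t sketch_validate_columns(const char* const* columns,
                                size_t num_columns,
                                char* msg, size_t msg_len) {
    if (num_columns == 0) {
        /* Checked before the pointer so that an empty std::vector, whose
           data() may be NULL, reports the count problem instead. */
        snprintf(msg, msg_len, "num_columns must be > 0");
        return SKETCH_INVALID_ARGUMENT;
    }
    if (columns == NULL) {
        snprintf(msg, msg_len, "columns must not be NULL");
        return SKETCH_INVALID_ARGUMENT;
    }
    for (size_t i = 0; i < num_columns; i++) {
        if (columns[i] == NULL) {
            /* Message wording is an assumption; the index tag is the point. */
            snprintf(msg, msg_len, "columns[%zu] must not be NULL", i);
            return SKETCH_INVALID_ARGUMENT;
        }
        if (columns[i][0] == '\0') {
            snprintf(msg, msg_len, "columns[%zu] must not be empty", i);
            return SKETCH_INVALID_ARGUMENT;
        }
    }
    if (msg_len > 0) {
        msg[0] = '\0'; /* no error: leave an empty message */
    }
    return SKETCH_OK;
}
```

With this ordering, a call with `(NULL, 0)` reports the zero count, while `(NULL, 2)` reports the NULL array, which matches the behavior the wrapper convention relies on.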

Tests

Eleven new Rust integration tests covering single-drop, multi-drop, version bump, data preservation (downcasts the surviving Arrow columns and checks the actual values, not just shape), and the full rejection surface (NULL dataset / NULL array / zero count / NULL entry / empty-string entry / unknown column / drop-all). C and C++ smoke tests snapshot ArrowSchema.n_children pre/post drop, exercise the drop-last-column rejection path, and verify the version is unchanged when a drop fails. cargo test and cargo test --test compile_and_run_test -- --ignored both green.

Follow-ups

lance_dataset_alter_columns — rename / nullability / type change
lance_dataset_add_columns — SQL expressions / AllNulls / ArrowArrayStream

The README roadmap entry stays unticked until all three ship.

LuciferYang · 2026-05-22T07:55:16Z

The consumer-smoke-test failure on the macOS arm64 leg here is pre-existing on main (failing since #24, with the same unresolved _IO* symbols from sysinfo / objc2_io_kit) and was not introduced by this PR. I sent #43 as a focused fix that declares -framework IOKit in the CMake / pkg-config link line; once that merges, a rerun here should go green.

## Summary

The macOS arm64 consumer-smoke-test job has been failing on `main` since #24 with a long list of unresolved `_IO*` symbols (`_IOObjectRelease`, `_IOServiceMatching`, `_IOHIDEventSystemClientCreate`, `_IORegistryEntryCreateCFProperty`, …). Sample run: https://github.com/lance-format/lance-c/actions/runs/26272649710.

Root cause is plumbing, not the consumer example: `sysinfo` (pulled in transitively via the lance crates) calls IOKit on macOS for disk enumeration, CPU frequency, and thermal sensors, and `objc2_io_kit` declares the binding. Cargo's `rustc-link-lib=framework=IOKit` is honored when this repo builds, but a downstream consumer linking against the installed `liblance_c.a` via `find_package(LanceC)` (or pkg-config) only sees the frameworks we declare in our config files, and IOKit was missing.

Add `-framework IOKit` next to the existing `CoreFoundation` / `Security` / `SystemConfiguration` entries in all three mirroring places:

- `CMakeLists.txt`: build-tree `LanceC_platform_deps` interface library
- `cmake/LanceCConfig.cmake.in`: installed `find_package(LanceC)` consumers
- `CMakeLists.txt`: pkg-config `Libs.private`

## Verification

Same `cmake --install` → `examples/cmake-consumer` build path the CI runs, on arm64 macOS (15.0 SDK, AppleClang 17):

```
$ cmake --install build --prefix _install
$ cmake -S examples/cmake-consumer -B consumer-build -DCMAKE_PREFIX_PATH="$PWD/_install"
$ cmake --build consumer-build
…
[100%] Built target consumer
$ consumer-build/consumer
usage: consumer <dataset_uri>
$ echo $?
2
```

Before the patch the same sequence dies at link with `Undefined symbols for architecture arm64`. After it, the link succeeds and the binary exits 2 (usage error) as the CI step expects.

## After this lands

Unblocks the consumer-smoke macOS leg for every open PR; #42 (schema-evolution drop_columns) hits this exact failure on its CI run.

LuciferYang · 2026-05-25T03:11:02Z

All tests passed.

LuciferYang · 2026-05-26T05:16:10Z

Thank you @jja725

… changes (#44)

## Summary

Second of three PRs against #41. Exposes upstream's `Dataset::alter_columns`: rename a column, change its nullability, or change its data type, committing a new manifest. Rename and nullability-only changes are zero-copy and preserve indices on the affected column; a type change rewrites the column's data files and drops any associated indices, mirroring upstream behavior.

Mutates the dataset in place under an exclusive write lock; scanners already in flight against it keep their pre-alteration view via the existing Arc clone-on-write, same as `_delete` (#31) / `_drop_columns` (#42).

## Surface

```c
typedef enum {
    LANCE_COLUMN_NULLABLE_UNCHANGED = 0,
    LANCE_COLUMN_NULLABLE_TRUE = 1,
    LANCE_COLUMN_NULLABLE_FALSE = 2,
} LanceColumnNullableMode;

typedef struct LanceColumnAlteration {
    const char* path;                     /* required */
    const char* rename;                   /* NULL = keep */
    int32_t nullable_mode;                /* LanceColumnNullableMode discriminant */
    const struct ArrowSchema* data_type;  /* NULL = keep */
} LanceColumnAlteration;

int32_t lance_dataset_alter_columns(
    LanceDataset* dataset,
    const LanceColumnAlteration* alterations,
    size_t num_alterations
);
```

Per-alteration validation runs up front with index-tagged error messages. The struct uses sentinels for the three optional fields (`rename = NULL`, `nullable_mode = UNCHANGED`, `data_type = NULL`); at least one must request a change, so a zero-init struct with only `path` set is rejected as a no-op rather than silently consuming a manifest version.

Two design choices worth calling out:

- **`nullable_mode` is `int32_t`, not the enum directly.** The struct is read across the FFI boundary, and Rust treats a `#[repr(C)]` enum read from C with an out-of-range discriminant as UB. So the field is `int32_t` and a `LanceColumnNullableMode::from_raw(i32)` helper converts and returns `INVALID_ARGUMENT` for unknown values, the same pattern as `merge_insert`'s `WhenMatched::from_raw`.
- **`data_type` borrows an Arrow `ArrowSchema`.** The wrapper never calls its `release` callback. Before handing the pointer to arrow-rs, the wrapper checks both `release == NULL` (the Arrow CADI "released" sentinel) and `format == NULL` (catches `FFI_ArrowSchema::empty()` and other half-built structs that would otherwise hit an `assert!` in `DataType::try_from` and abort the host process under `panic = "abort"`).

The C++ wrapper takes `const std::vector<lance::ColumnAlteration>&` and uses the same direct-pass convention as the `update`/`merge_insert` siblings: `raw.data()` is passed unconditionally, and an empty vector flows through the Rust-side `num_alterations == 0` guard so the error message is precise.

## Tests

Nineteen new Rust integration tests cover the positive paths (rename, relax / tighten nullability, Int32→Int64 upcast with value round-trip, combined rename+relax, multi-alteration per call, version bump) and the full rejection surface (NULL dataset / NULL array / zero count / NULL path / empty path / empty rename / unknown column / incompatible cast / no-op alteration with schema-unchanged assertion / invalid `nullable_mode` discriminant / uninitialised `FFI_ArrowSchema` / tightening nullability when existing rows hold NULLs).

C and C++ smoke tests slot in before `test_drop_columns`, relax `id` to nullable, and verify via `ArrowSchema.flags & ARROW_FLAG_NULLABLE`. Both also exercise the NULL / zero / no-op / bad-discriminant negative paths. `cargo test` and `cargo test --test compile_and_run_test -- --ignored` both green.

## Follow-up

- `lance_dataset_add_columns`: SQL expressions / AllNulls / ArrowArrayStream

The README roadmap entry stays unticked until that lands.
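
The per-alteration checks described for #44 (raw discriminant conversion, borrowed-schema sanity checks, and no-op rejection) could look roughly like the following on the C side. This is a sketch under stated assumptions rather than the PR's Rust code: the `sketch_*` functions and `SKETCH_*` codes are hypothetical names, the types are re-declared locally so the block compiles on its own, and the `ArrowSchema` layout is the one from the Arrow C Data Interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Arrow C Data Interface schema struct, re-declared for self-containment. */
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);
    void* private_data;
};

typedef enum {
    LANCE_COLUMN_NULLABLE_UNCHANGED = 0,
    LANCE_COLUMN_NULLABLE_TRUE = 1,
    LANCE_COLUMN_NULLABLE_FALSE = 2,
} LanceColumnNullableMode;

typedef struct LanceColumnAlteration {
    const char* path;
    const char* rename;
    int32_t nullable_mode;
    const struct ArrowSchema* data_type;
} LanceColumnAlteration;

/* Hypothetical status codes for this sketch only. */
enum { SKETCH_OK = 0, SKETCH_INVALID_ARGUMENT = 1 };

/* Converts a raw int32_t to the enum; returns 0 for unknown discriminants
   so the caller never materializes an out-of-range enum value. */
int sketch_nullable_mode_from_raw(int32_t raw, LanceColumnNullableMode* out) {
    switch (raw) {
        case LANCE_COLUMN_NULLABLE_UNCHANGED:
        case LANCE_COLUMN_NULLABLE_TRUE:
        case LANCE_COLUMN_NULLABLE_FALSE:
            *out = (LanceColumnNullableMode)raw;
            return 1;
        default:
            return 0;
    }
}

/* Validates one alteration without touching or releasing data_type. */
int32_t sketch_validate_alteration(const LanceColumnAlteration* a) {
    LanceColumnNullableMode mode;
    if (a == NULL || a->path == NULL || a->path[0] == '\0') {
        return SKETCH_INVALID_ARGUMENT; /* path is required */
    }
    if (a->rename != NULL && a->rename[0] == '\0') {
        return SKETCH_INVALID_ARGUMENT; /* empty rename */
    }
    if (!sketch_nullable_mode_from_raw(a->nullable_mode, &mode)) {
        return SKETCH_INVALID_ARGUMENT; /* unknown discriminant */
    }
    if (a->data_type != NULL &&
        (a->data_type->release == NULL || a->data_type->format == NULL)) {
        return SKETCH_INVALID_ARGUMENT; /* released or half-built schema */
    }
    if (a->rename == NULL && mode == LANCE_COLUMN_NULLABLE_UNCHANGED &&
        a->data_type == NULL) {
        return SKETCH_INVALID_ARGUMENT; /* no-op: nothing requested */
    }
    return SKETCH_OK;
}
```

A zero-initialised struct with only `path` set falls into the last branch, which is the no-op rejection the description calls out.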

…n addition (#45)

## Summary

Last of three PRs against #41. Exposes upstream's `Dataset::add_columns` through the three `NewColumnTransform` cases that translate cleanly across the C ABI:

- **SQL expressions**: derive new columns from SQL over existing columns.
- **All-null columns**: add nullable columns from an Arrow schema. On the modern format this is metadata-only; the legacy format can't represent it that way and returns `LANCE_ERR_NOT_SUPPORTED`.
- **Stream**: splice in precomputed column data from an Arrow C stream, aligned positionally to the dataset's existing rows.

Upstream's fourth variant, `BatchUDF`, is left out on purpose: it carries a Rust closure that can't cross the C ABI, and the stream variant already covers the same "bring your own computed data" use case.

Each call mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-add view via the same Arc clone-on-write as the `_drop_columns` (#42) / `_alter_columns` (#44) siblings.

Three focused entry points rather than one mode-tagged function, because the inputs are genuinely different shapes (name/expression pairs vs. a schema pointer vs. a stream). Upstream's `read_columns` parameter isn't exposed: it only feeds `BatchUDF`, and for these three variants upstream ignores it. `batch_size` is forwarded where it does something (SQL scan, stream alignment) and omitted from the metadata-only all-null path.

## Surface

```c
typedef struct LanceSqlColumn {
    const char* name;
    const char* expression;
} LanceSqlColumn;

int32_t lance_dataset_add_columns_sql(
    LanceDataset* dataset,
    const LanceSqlColumn* columns,
    size_t num_columns,
    uint64_t batch_size);

int32_t lance_dataset_add_columns_nulls(
    LanceDataset* dataset,
    const struct ArrowSchema* schema);

int32_t lance_dataset_add_columns_stream(
    LanceDataset* dataset,
    struct ArrowArrayStream* stream,
    uint64_t batch_size);
```

`batch_size` uses `0` for the upstream default and is range-checked to `u32`.

Two error-code details are worth calling out, both matching existing behavior: a SQL expression referencing a non-existent column surfaces as `LANCE_ERR_INTERNAL` (an upstream schema error, the same path as `lance_dataset_delete`), whereas a syntax error is `LANCE_ERR_INVALID_ARGUMENT`.

Because these are `unsafe extern "C"` entry points under `panic = "abort"`, the stream variant pre-validates the mandatory CADI callbacks before handing the stream to arrow-rs (which would otherwise abort on a NULL `get_schema` / `get_next`), and the all-null variant rejects an uninitialised or non-UTF-8 top-level schema `format` before arrow-rs's `assert!`/`expect` can fire. The stream is consumed (released) on every non-NULL return path.

## Tests

Rust integration tests cover all three variants end to end: computed values (single and multi-column SQL, constant expressions, honored `batch_size`), all-null backfill, and multi-fragment stream alignment, plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name collisions, row-count mismatch, invalid `batch_size`, released/missing-callback streams, non-nullable all-null fields, and the legacy-format `NOT_SUPPORTED` path). Stream-consumption is proven with a drop counter rather than a vacuous release-slot check.

C and C++ smoke tests exercise the SQL happy path and each variant's argument rejections across the ABI.
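
The `batch_size` convention and the stream pre-validation described for #45 can be illustrated with a short C sketch. The `sketch_*` helpers and `SKETCH_*` codes are hypothetical and are not part of lance-c; the description does not say which error code each rejection maps to, so a single generic rejection value is used here. The `ArrowArrayStream` layout follows the Arrow C stream interface. Unlike the real entry point, this sketch only classifies the stream and does not release it.

```c
#include <stddef.h>
#include <stdint.h>

struct ArrowSchema; /* opaque in this sketch */
struct ArrowArray;  /* opaque in this sketch */

/* Arrow C stream interface struct, re-declared for self-containment. */
struct ArrowArrayStream {
    int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    void (*release)(struct ArrowArrayStream*);
    void* private_data;
};

/* Hypothetical status codes for this sketch only. */
enum { SKETCH_OK = 0, SKETCH_REJECTED = 1 };

/* Interprets batch_size: 0 selects the upstream default, any other value
   must fit in 32 bits before it is narrowed. */
int32_t sketch_check_batch_size(uint64_t batch_size,
                                int* use_default, uint32_t* out) {
    if (batch_size == 0) {
        *use_default = 1;
        *out = 0;
        return SKETCH_OK;
    }
    if (batch_size > UINT32_MAX) {
        return SKETCH_REJECTED; /* would truncate when narrowed to u32 */
    }
    *use_default = 0;
    *out = (uint32_t)batch_size;
    return SKETCH_OK;
}

/* Rejects streams that are NULL, already released, or missing a mandatory
   callback, so a consumer never calls through a NULL function pointer. */
int32_t sketch_check_stream(const struct ArrowArrayStream* s) {
    if (s == NULL) {
        return SKETCH_REJECTED;
    }
    if (s->release == NULL) {
        return SKETCH_REJECTED; /* released sentinel */
    }
    if (s->get_schema == NULL || s->get_next == NULL) {
        return SKETCH_REJECTED; /* missing mandatory callback */
    }
    return SKETCH_OK;
}
```

Running checks like these before the hand-off matters under `panic = "abort"`, because a failed assertion inside the consumer would take down the host process instead of returning an error code.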

The schema-evolution series is fully merged, so this flips the Phase 3 roadmap row from `[ ]` to `[x]` and names the functions that cover it, matching the style of the rows around it.

Covered by:

- #45: `lance_dataset_add_columns_sql/_nulls/_stream`
- #44: `lance_dataset_alter_columns`
- #42: `lance_dataset_drop_columns`

Closes #41. Docs-only; no code or test changes.

feat: add lance_dataset_drop_columns for metadata-only column removal

6e42bf6

First of three PRs covering the schema-evolution roadmap entry. Exposes upstream's `drop_columns` — a metadata-only manifest commit that removes the named columns from the schema without rewriting data files.

LuciferYang mentioned this pull request May 22, 2026

fix(cmake): add IOKit framework to APPLE link line #43

Merged

Merge branch 'lance-format:main' into feat/dataset-drop-columns

82b8828

jja725 approved these changes May 26, 2026

View reviewed changes

jja725 merged commit d5133a8 into lance-format:main May 26, 2026
9 checks passed

LuciferYang mentioned this pull request May 26, 2026

feat: add lance_dataset_alter_columns for rename / nullability / type changes #44

Merged

LuciferYang mentioned this pull request Jun 11, 2026

feat: add lance_dataset_add_columns for SQL / all-null / stream column addition #45

Merged

This was referenced Jun 16, 2026

docs: mark schema evolution complete in roadmap #46

Merged

Schema evolution: add/drop/alter columns #41

Closed

jja725 merged 2 commits into lance-format:main from LuciferYang:feat/dataset-drop-columns
