fix(fast-path): three bugs in FastPathFragmentWriter and _can_use_fast_path by FANNG1 · Pull Request #43 · daft-engine/daft-lance

FANNG1 · 2026-06-25T23:39:09Z

Fixes three independent bugs in the fast-path column-merge code path (lance_merge_column.py).

Changes

Bug 1 — Type erasure: `fixed_size_list<float32>[N]` becomes `list<float64>` (closes #40)

FastPathFragmentWriter built the new-column Arrow table via pa.array(s.to_pylist()).
Python floats are always float64 and Python lists carry no fixed-size constraint, so
fixed_size_list<item: float>[N] (e.g. an embedding UDF's output) was silently widened
to list<item: double>.

Fix: use s.to_arrow().combine_chunks() to preserve the Arrow type that daft declared
in the UDF's return_dtype.

Bug 2 — `df.collect()` in `_can_use_fast_path` exhausts one-shot BlobFile objects (closes #42)

_can_use_fast_path called len(df.collect()) to count rows. Daft caches the result in
df._result_cache. When the pipeline contains take_blobs(), the cache holds BlobFile
instances (one-shot streams). The subsequent groupby().map_groups() in _merge_fast_path
received the same exhausted objects; blob.read() returned b'', causing all downstream
UDFs to produce null or incorrect output.

Fix: replace df.collect() with df.count_rows(), which does not set _result_cache.

Bug 3 — `next_fid` collides with nested struct child field IDs (closes #41)

next_fid = max(f.id() for f in lance_schema.fields()) + 1 only scanned top-level fields.
A struct column with M children occupies M+1 Lance field IDs (parent + each child). The new
file's "fields" entry therefore collided with an existing child field, and the new column
read back as null for every row.

Fix: recurse into f.children() to find the true maximum field ID across all descendants.

Tests

tests/io/lance/test_fast_path_merge.py — new TestRegressions class with one test per bug:

test_fixed_size_list_float32_type_preserved
test_next_fid_skips_nested_child_field_ids
test_blob_pipeline_merge_produces_nonnull_results

…t_path Bug 1 — type erasure (FastPathFragmentWriter.__call__): pa.array(s.to_pylist()) drops Arrow type info from daft Series. For fixed_size_list<float32>[N], Python floats become float64 and the fixed-size list becomes a variable-length list<double>, which causes lance's stream merger to fail with a row-count mismatch (1 != N). Fix: use s.to_arrow() / combine_chunks() to preserve the declared daft return type. Bug 2 — collect() side-effect (_can_use_fast_path): len(df.collect()) materialises results into df._result_cache. One-shot Python objects (e.g. lance BlobFile) cached there are exhausted; a subsequent groupby().map_groups() that re-uses the same df object receives stale objects and produces null values. Fix: use df.count_rows(), which does not set _result_cache. Bug 3 — next_fid under-counts child field IDs (FastPathFragmentWriter): lance_schema.fields() returns only top-level fields. Nested types (struct, blob-v2, etc.) have child fields with their own IDs. max(f.id() for top-level f) skips these, so next_fid can collide with an existing child field ID. The committed fragment metadata then maps the new column file to the wrong schema field, causing lance to read null for the new column. Fix: recurse into children when computing the max field ID.

…r bugs Covers: - fixed_size_list<float32>[N] type erasure via pa.array(s.to_pylist()) - next_fid collision with nested struct child field IDs - BlobFile exhaustion caused by collect() side-effect in _can_use_fast_path

FANNG1 · 2026-06-26T01:06:53Z

Superseded by three focused PRs for easier review: #44 (type erasure), #45 (collect side-effect), #46 (next_fid child IDs).

fanng added 4 commits June 19, 2026 12:32

Remove misleading LanceDB naming

0c75d81

fix: preserve deprecated LanceDB scan names

a77d7c1

FANNG1 closed this Jun 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(fast-path): three bugs in FastPathFragmentWriter and _can_use_fast_path#43

fix(fast-path): three bugs in FastPathFragmentWriter and _can_use_fast_path#43
FANNG1 wants to merge 4 commits into
daft-engine:mainfrom
FANNG1:fix/fast-path-type-erasure-and-collect-sideeffect

FANNG1 commented Jun 25, 2026

Uh oh!

FANNG1 commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FANNG1 commented Jun 25, 2026

Changes

Bug 1 — Type erasure: fixed_size_list<float32>[N] becomes list<float64> (closes #40)

Bug 2 — df.collect() in _can_use_fast_path exhausts one-shot BlobFile objects (closes #42)

Bug 3 — next_fid collides with nested struct child field IDs (closes #41)

Tests

Uh oh!

FANNG1 commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 1 — Type erasure: `fixed_size_list<float32>[N]` becomes `list<float64>` (closes #40)

Bug 2 — `df.collect()` in `_can_use_fast_path` exhausts one-shot BlobFile objects (closes #42)

Bug 3 — `next_fid` collides with nested struct child field IDs (closes #41)