fix(fast-path): three bugs in FastPathFragmentWriter and _can_use_fast_path#43
Closed
FANNG1 wants to merge 4 commits into
Closed
Conversation
added 4 commits
June 19, 2026 12:32
…t_path Bug 1 — type erasure (FastPathFragmentWriter.__call__): pa.array(s.to_pylist()) drops Arrow type info from daft Series. For fixed_size_list<float32>[N], Python floats become float64 and the fixed-size list becomes a variable-length list<double>, which causes lance's stream merger to fail with a row-count mismatch (1 != N). Fix: use s.to_arrow() / combine_chunks() to preserve the declared daft return type. Bug 2 — collect() side-effect (_can_use_fast_path): len(df.collect()) materialises results into df._result_cache. One-shot Python objects (e.g. lance BlobFile) cached there are exhausted; a subsequent groupby().map_groups() that re-uses the same df object receives stale objects and produces null values. Fix: use df.count_rows(), which does not set _result_cache. Bug 3 — next_fid under-counts child field IDs (FastPathFragmentWriter): lance_schema.fields() returns only top-level fields. Nested types (struct, blob-v2, etc.) have child fields with their own IDs. max(f.id() for top-level f) skips these, so next_fid can collide with an existing child field ID. The committed fragment metadata then maps the new column file to the wrong schema field, causing lance to read null for the new column. Fix: recurse into children when computing the max field ID.
…r bugs Covers: - fixed_size_list<float32>[N] type erasure via pa.array(s.to_pylist()) - next_fid collision with nested struct child field IDs - BlobFile exhaustion caused by collect() side-effect in _can_use_fast_path
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes three independent bugs in the fast-path column-merge code path (
lance_merge_column.py).Changes
Bug 1 — Type erasure:
fixed_size_list<float32>[N]becomeslist<float64>(closes #40)FastPathFragmentWriterbuilt the new-column Arrow table viapa.array(s.to_pylist()).Python floats are always
float64and Python lists carry no fixed-size constraint, sofixed_size_list<item: float>[N](e.g. an embedding UDF's output) was silently widenedto
list<item: double>.Fix: use
s.to_arrow().combine_chunks()to preserve the Arrow type that daft declaredin the UDF's
return_dtype.Bug 2 —
df.collect()in_can_use_fast_pathexhausts one-shot BlobFile objects (closes #42)_can_use_fast_pathcalledlen(df.collect())to count rows. Daft caches the result indf._result_cache. When the pipeline containstake_blobs(), the cache holdsBlobFileinstances (one-shot streams). The subsequent
groupby().map_groups()in_merge_fast_pathreceived the same exhausted objects;
blob.read()returnedb'', causing all downstreamUDFs to produce null or incorrect output.
Fix: replace
df.collect()withdf.count_rows(), which does not set_result_cache.Bug 3 —
next_fidcollides with nested struct child field IDs (closes #41)next_fid = max(f.id() for f in lance_schema.fields()) + 1only scanned top-level fields.A struct column with M children occupies M+1 Lance field IDs (parent + each child). The new
file's
"fields"entry therefore collided with an existing child field, and the new columnread back as null for every row.
Fix: recurse into
f.children()to find the true maximum field ID across all descendants.Tests
tests/io/lance/test_fast_path_merge.py— newTestRegressionsclass with one test per bug:test_fixed_size_list_float32_type_preservedtest_next_fid_skips_nested_child_field_idstest_blob_pipeline_merge_produces_nonnull_results