fix(java): expose updatedFragmentOffsets on Update operation for RewriteColumns#6748
fix(java): expose updatedFragmentOffsets on Update operation for RewriteColumns#6748jerryjch wants to merge 11 commits into
Conversation
c116d01 to
456e6dc
Compare
ed07fea to
cbf1475
Compare
…iteColumns commits
cbf1475 to
1884a7d
Compare
| let mut iter = jmap.iter(env)?; | ||
| let mut offsets: HashMap<u64, RoaringBitmap> = HashMap::new(); | ||
| env.with_local_frame(32, |env| { | ||
| while let Some((key, value)) = iter.next(env)? { |
There was a problem hiding this comment.
The Java-to-Rust offset map import keeps all iterator-created JNI local references alive for the whole map. Large multi-fragment updates can exhaust the local reference table and fail the commit before Rust validation runs.
| let frag_id = | ||
| env.call_method(&key, "longValue", "()J", &[])?.j()? as u64; | ||
| let buf: Vec<u8> = env.convert_byte_array(JByteArray::from(value))?; | ||
| let bitmap = RoaringBitmap::deserialize_from(buf.as_slice())?; |
There was a problem hiding this comment.
Malformed bitmap bytes are reported as I/O failures even though they come from Java API input. Callers can misclassify bad arguments as retryable storage errors.
…UpdateResult Serialize matched_offsets once at the executor JNI boundary instead of expanding to long[]. Add getUpdatedRowOffsetBytes(); deprecate getUpdatedRowOffsets() with expandRowOffsetsFromBytes for lance-format#6650 compat. Follow-up to lance-format#6650. Pairs with Update.updatedFragmentOffsets() for lance-spark#418. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@Xuanwo Thanks for the review. I addressed your comments in commit |
| * with callers compiled against the #6650 API. | ||
| */ | ||
| @Deprecated | ||
| public FragmentUpdateResult( |
There was a problem hiding this comment.
The retained long[] compatibility path now calls native helpers from a class that never loads the JNI library. Existing callers can hit UnsatisfiedLinkError unless another Lance class happened to initialize first.
| env.call_method(&key, "longValue", "()J", &[])?.j()? as u64; | ||
| let buf: Vec<u8> = | ||
| env.convert_byte_array(JByteArray::from(value))?; | ||
| let bitmap = RoaringBitmap::deserialize_from(buf.as_slice()) |
There was a problem hiding this comment.
The commit path accepts arbitrary offset bitmaps and later expands them into a full offset vector before checking fragment bounds. A compact valid Roaring bitmap can force huge allocations during a Java RewriteColumns commit.
| this(updatedFragment, updatedFieldIds, new byte[0]); | ||
| } | ||
|
|
||
| public FragmentUpdateResult( |
There was a problem hiding this comment.
Adding a public byte[] constructor with the same arity as the retained long[] constructor makes existing source calls that pass null for offsets fail overload resolution.
| "updatedRowOffsets must be non-negative, got {offset}" | ||
| ))); | ||
| } | ||
| bitmap.insert(offset as u32); |
There was a problem hiding this comment.
The deprecated long[] encoder casts every non-negative value to u32, so offsets above u32::MAX silently become different rows and can corrupt last_updated metadata.
| Update.builder() | ||
| .removedFragmentIds(Collections.singletonList(fragmentId)) | ||
| .newFragments(Collections.singletonList(newFragment)) | ||
| .updateMode(Optional.of(UpdateMode.RewriteRows)) |
There was a problem hiding this comment.
The round-trip test uses RewriteRows, but updatedFragmentOffsets is consumed only by the RewriteColumns stable-row-id branch. The Java/Spark path can regress while this test still passes.
The perf(java) commit picked up an incidental Cargo.lock change adding "rayon" when cargo ran during the pre-commit hook after the main branch merge. Revert to the pre-commit state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add static { JniLoader.ensureLoaded(); } so deprecated native methods
do not throw UnsatisfiedLinkError when no other Lance class has been
touched first.
- Add public static create() factory as the primary bytes API; make the
byte[] constructor private to eliminate null-argument overload ambiguity
with the deprecated long[] constructor.
- Add upper bound check (offset > u32::MAX) in the deprecated
encodeRowOffsetsToBytes JNI helper to prevent silent u32 truncation.
- Update FragmentUpdateResultTest to use FragmentUpdateResult.create().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
updatedFragmentOffsets is consumed only by the RewriteColumns path in build_manifest. The test was using RewriteRows, which never exercises the offset round-trip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A compact RLE RoaringBitmap (as few as 22 bytes) can represent all u32 values. The .collect() expansion at build_manifest would allocate ~32GB. Validate bitmap len() and max() against the fragment's physical_rows from existing_fragments (manifest truth) before expanding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Add tests for cardinality exceeding physical_rows, max offset exceeding physical_rows, and exact-boundary success case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
RewriteColumns). Required by lance-spark#418.
Update.java: addsMap<Long, byte[]> updatedFragmentOffsetsfield, 7-arg constructor,accessor
updatedFragmentOffsets(), andBuilder.updatedFragmentOffsets(...)setter.Defaults to
Collections.emptyMap(). Values are portable RoaringBitmap bytes.java/lance-jni/src/transaction.rs— two JNI directions updated: FromJava deserializeseach
byte[]value into aRoaringBitmapand setsupdated_fragment_offsetson the Rustoperation; IntoJava serializes each bitmap to
byte[]and populates aHashMap<Long, byte[]>passed to the 7-arg
Updateconstructor (previously the field was ignored and the 6-argform was used).
Update.javaequals/hashCode: deep-comparesbyte[]values by content;hashCodeadded per the Java contract.
Background
PR #6650 added
updated_fragment_offsetson the RustOperation::Update(proto field 9),build_manifestpartial refresh logic, andFragmentUpdateResult.getUpdatedRowOffsets().Two gaps remained:
The Java
Updateclass had no field for these offsets andconvert_to_rust_operationalways set
updated_fragment_offsets: None, so the lance-spark commit path(UpdateColumnsBackfillBatchWrite) had no way to pass offsets to Rust and the partial
refresh in
build_manifestcould never activate from a JVM caller.convert_to_java_operation_innerstill used the old 6-arg constructor signature fornew_object. With the 6-arg constructor removed fromUpdate.java(replaced by the7-arg form), any Rust→Java materialization of
Operation::Update(e.g. reading back atransaction) would fail at runtime with
NoSuchMethodError.Implementation notes
stays O(bitmap size) rather than O(n matched rows).
with_local_frame(4, ..)per bitmap entry in IntoJava bounds local-ref growth on largeoffset maps.
JMapwas avoided inside the frame because it holds aJObjectwith theouter frame's lifetime, causing borrow-checker conflicts;
call_methodon the outerjava_mapreference is used instead.Vec<u8>buffer for each bitmap is allocated in Rust before entering the frame, soits lifetime is independent of JNI frame scope.
with_local_frame(8, ..)per iteration in FromJava bounds local-ref growth for largemulti-fragment maps.
build_manifestvalidates bitmap cardinality and max offset against the fragment'sphysical_rowsfromexisting_fragmentsbefore.collect(), preventing a compact RLEbitmap from expanding into an unbounded allocation.
UpdatedFragmentOffsetsadded to thelance::dataset::transactionimport.Why the protobuf field alone is not enough
lance-spark commits by calling
CommitBuilder.execute(transaction), which passes the JavaTransactionobject tonativeCommitToDatasetvia JNI. The JNI handler callsconvert_to_rust_transaction→convert_to_rust_operation, which reflects on the JavaUpdateobject to build the RustOperation::Updatestruct. The protobuf field (field 9)is only used when a Transaction is serialized as a proto blob; it has no effect on the
reflection-based JNI path unless the Java
Updateclass exposes the field and the JNIdeserialization reads it.
Additional change
FragmentUpdateResult(from #6650) returned matched row offsets as an expandedlong[]atthe executor JNI boundary. This PR also passes those offsets as portable RoaringBitmap bytes
so lance-spark can wire them through to
Update.updatedFragmentOffsets()without an O(n rows)expansion on the executor→driver path.
FragmentUpdateResult.getUpdatedRowOffsetBytes()— primary accessor; values are the sameportable RoaringBitmap byte format as
Update.updatedFragmentOffsets().java/lance-jni/src/fragment.rs—update_columns_with_offsetsserializesmatched_offsetsonce withRoaringBitmap::serialize_into; JNI constructs results viathe private
(FragmentMetadata, long[], byte[])constructor (JNI can access private ctors).FragmentUpdateResult.create(FragmentMetadata, long[], byte[])— public static factory;primary construction path for callers using the bytes API.
@Deprecated getUpdatedRowOffsets()— retained for backward compatibility; expands bytes viaexpandRowOffsetsFromBytesonly when called (lazy O(n rows)).@Deprecated3-arg(FragmentMetadata, long[] fieldsModified, long[] updatedRowOffsets)constructor — encodes offsets via
encodeRowOffsetsToBytesfor source compat.FragmentUpdateResultTest— round-trip bytes, deprecated constructor encode, andupdateColumns()integration asserting matched offsets{0,1,2,3}on the test fixture.Test plan
UpdateTest#testUpdatedFragmentOffsetsRoundTrip— commits anUpdatewith a non-emptyupdatedFragmentOffsetsmap throughCommitBuilder.execute(exercises the FromJava JNIpath), reads the transaction back via
Dataset.readTransaction()(exercises the IntoJavaJNI path), and asserts the offsets match. Map value is hardcoded portable RoaringBitmap
bytes encoding {1, 3, 5}; verified with
assertArrayEqualsafter the round-trip.FragmentUpdateResultTest— see Additional change.Compatibility
Update— theupdatedFragmentOffsetsfield did not exist in anyprior release. The builder setter is optional and defaults to
Collections.emptyMap(), soexisting
Update.builder()...build()call sites compile and behave identically.equals/hashCode:equalsusesoffsetMapsEqualto deep-comparebyte[]values via
Arrays.equals;hashCodeis added per the Java contract.new_objectcall is updated from the 6-arg tothe 7-arg form in the same PR. Both files must ship together; within that atomic change
there is no compatibility gap.
updated_fragment_offsetsproto field and Rust structfield were already added in fix: propagate update_columns offsets and partial last_updated for RewriteColumns #6650.
FragmentUpdateResult—@Deprecatedlong[] getter and constructor retained; new bytes getter is the supported path for new callers(see Additional change).