feat(blob-v2): configure packed blob file max size with field metadata#7322
feat(blob-v2): configure packed blob file max size with field metadata#7322jo-migo wants to merge 4 commits into
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Fixes: lance-format#7292 Problem: There's currently no way to encode a threshold for maximum packed blob sidecar file size into a dataset in the way that you can encode inline blob and dedicated blob size thresholds via `lance-encoding:blob-dedicated-size-threshold` and `lance-encoding:blob-inline-size-threshold` respectively. Solution: Add a new `lance-encoding:blob-pack-file-size-threshold` field-level metadata key which informs the blob writer to only start new packed files when current packed file reaches that threshold. It can be overridden by supplying another value via the existing `blob_pack_file_size_threshold` parameter to the `write_dataset` function.
e6b4c40 to
08c0e06
Compare
| if self | ||
| .current_blob_id | ||
| .map(|_| self.current_size + len > self.max_pack_size) | ||
| .map(|_| self.current_size + len > max_pack_size) |
There was a problem hiding this comment.
This uses the incoming field's threshold to decide whether to append to the shared pack file. In datasets with multiple blob fields and different thresholds, a file that already contains data from a smaller-threshold field can be extended under a larger threshold, so the persisted field metadata no longer reliably bounds sidecar size.
There was a problem hiding this comment.
Thanks for the review! Good point, I was not dealing with datasets containing different thresholds before. Now this case should be covered.
| input_field.name, | ||
| dataset_inline_threshold, | ||
| ))); | ||
| for (key, read_threshold) in [ |
There was a problem hiding this comment.
This append validation only checks top-level fields, even though blob threshold metadata is read recursively from nested blob fields. A nested blob append can silently ignore mismatched pack-threshold metadata and write using the persisted schema instead, so nested fields do not get the same explicit-mismatch protection as top-level blob columns.
There was a problem hiding this comment.
I made this whole check recursive.
…olumns and do recursive validation
Fixes
#7292
Problem
There's currently no way to encode a threshold for maximum packed blob sidecar file size into a dataset in the way that you can encode inline blob and dedicated blob size thresholds via
lance-encoding:blob-dedicated-size-thresholdandlance-encoding:blob-inline-size-thresholdrespectively.Solution
Add a new
lance-encoding:blob-pack-file-size-thresholdfield-level metadata key which informs the blob writer to only start new packed files when current packed file reaches that threshold.It can be overridden by supplying another value via the existing
blob_pack_file_size_thresholdparameter to thewrite_datasetfunction.