-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Arc partition values in TableSchema #19137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- I recommend Fields but otherwise 👍
| /// These columns are NOT present in the data files but are appended to each | ||
| /// row during query execution based on the file's location. | ||
| table_partition_cols: Vec<FieldRef>, | ||
| table_partition_cols: Arc<Vec<FieldRef>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we use https://docs.rs/arrow/latest/arrow/datatypes/struct.Fields.html (which is already basically Arc<Vec>) but can be constructed zero copy potentially from existing schemas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this but it:
- Is incompatible with your recommendation in https://github.com/apache/datafusion/pull/19137/files/fd79a3ba89e9c5d45796aa8fbb7782bf1413bde4#r2598642330 because we can't do
Arc::make_mut(&mut Fields)andFieldsdoesn't expose mutablility. - It would require changing the public API of
TableSchema::partition_fields()
I also am not that worried about construction cost (that happens once in FileSource, it's the mutliple copies for each FIleOpener that worry me.
In any case since the current change is fully backwards compatible API wise and is internally contained we can always come back and use Fields later, so I will leave that as a followup.
| /// You should prefer calling [`TableSchema::new`] instead of chaining [`TableSchema::from_file_schema`] | ||
| /// into [`TableSchema::with_table_partition_cols`] if you have partition columns at construction time | ||
| /// since it avoids re-computing the table schema. | ||
| pub fn with_table_partition_cols(mut self, partition_cols: Vec<FieldRef>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you used Fields you could also pass in impl Into<Fields>
| self.table_partition_cols = Arc::new(partition_cols); | ||
| } else { | ||
| // Append to existing partition columns | ||
| self.table_partition_cols = Arc::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend Arc::make_mut here to avoid the allocation if possible
76a5ca5 to
0200a22
Compare
0200a22 to
b981959
Compare
This should avoid some clones.
I also fixed a bug when
with_partition_valuesis called multiple times on the same instance; it now appends the partition values.