Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Dec 6, 2025

This should avoid some clones.

I also fixed a bug when with_partition_values is called multiple times on the same instance; it now appends the partition values.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- I recommend Fields but otherwise 👍

/// These columns are NOT present in the data files but are appended to each
/// row during query execution based on the file's location.
table_partition_cols: Vec<FieldRef>,
table_partition_cols: Arc<Vec<FieldRef>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we use https://docs.rs/arrow/latest/arrow/datatypes/struct.Fields.html (which is already basically Arc<Vec>) but can be constructed zero copy potentially from existing schemas

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this but it:

  1. Is incompatible with your recommendation in https://github.com/apache/datafusion/pull/19137/files/fd79a3ba89e9c5d45796aa8fbb7782bf1413bde4#r2598642330 because we can't do Arc::make_mut(&mut Fields) and Fields doesn't expose mutablility.
  2. It would require changing the public API of TableSchema::partition_fields()

I also am not that worried about construction cost (that happens once in FileSource, it's the mutliple copies for each FIleOpener that worry me.

In any case since the current change is fully backwards compatible API wise and is internally contained we can always come back and use Fields later, so I will leave that as a followup.

/// You should prefer calling [`TableSchema::new`] instead of chaining [`TableSchema::from_file_schema`]
/// into [`TableSchema::with_table_partition_cols`] if you have partition columns at construction time
/// since it avoids re-computing the table schema.
pub fn with_table_partition_cols(mut self, partition_cols: Vec<FieldRef>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you used Fields you could also pass in impl Into<Fields>

self.table_partition_cols = Arc::new(partition_cols);
} else {
// Append to existing partition columns
self.table_partition_cols = Arc::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend Arc::make_mut here to avoid the allocation if possible

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 8, 2025
@github-actions github-actions bot removed the physical-expr Changes to the physical-expr crates label Dec 8, 2025
@adriangb adriangb added this pull request to the merge queue Dec 8, 2025
Merged via the queue into apache:main with commit adaed42 Dec 8, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants