Description
Is your feature request related to a problem or challenge?
In #19094 we are going to fix incorrect `total_byte_size` calculations for `Statistics` by making them Inexact / Absent when we cannot actually calculate the size of the data. While this is more correct, it would be nice if we could calculate scan sizes in more scenarios. In particular, we cannot calculate the scan size of a variable-length column (e.g. Utf8) from just the type and the number of rows.
To address this, I propose we add a field such as `ColumnStatistics { scan_byte_size: Precision<usize>, ... }`, which the file format can populate, e.g. when we know that the in-memory Arrow size equals the uncompressed Parquet column size (as for Utf8View). I don't know in how many cases we will be able to derive this information without reading the data, but in some cases we should be able to.
Once we have this, we can track the total scan size through projections, limits, etc.
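A minimal sketch of the idea, using simplified stand-ins for DataFusion's `Precision` and `ColumnStatistics` types (the real types differ; `scan_byte_size` and `projected_scan_size` are hypothetical names): each column carries a per-column scan size, and a projection's total scan size is the sum over its projected columns, degrading to `Inexact` / `Absent` when any input is imprecise.

```rust
// Sketch only: simplified stand-ins for DataFusion's `Precision` and
// `ColumnStatistics`; `scan_byte_size` is the hypothetical new field.
#[derive(Debug, Clone, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Add two sizes, degrading precision: Exact + Exact = Exact,
    /// any Inexact input makes the sum Inexact, Absent poisons the sum.
    fn add(&self, other: &Precision) -> Precision {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(a + b),
            (Exact(a), Inexact(b))
            | (Inexact(a), Exact(b))
            | (Inexact(a), Inexact(b)) => Inexact(a + b),
            _ => Absent,
        }
    }
}

struct ColumnStatistics {
    /// Hypothetical new field: estimated in-memory size of this column when scanned.
    scan_byte_size: Precision,
}

/// Hypothetical helper: total scan size of a projection is the sum over
/// the projected columns' `scan_byte_size` values.
fn projected_scan_size(cols: &[ColumnStatistics], projection: &[usize]) -> Precision {
    projection
        .iter()
        .fold(Precision::Exact(0), |acc, &i| acc.add(&cols[i].scan_byte_size))
}

fn main() {
    let cols = vec![
        // e.g. a fixed-width Int64 column: 8 bytes per row, exactly known from row count
        ColumnStatistics { scan_byte_size: Precision::Exact(800) },
        // e.g. a Utf8View column sized from the uncompressed Parquet column size
        ColumnStatistics { scan_byte_size: Precision::Inexact(10_000) },
    ];
    assert_eq!(projected_scan_size(&cols, &[0]), Precision::Exact(800));
    assert_eq!(projected_scan_size(&cols, &[0, 1]), Precision::Inexact(10_800));
    println!("ok");
}
```

This also illustrates why a per-column size helps beyond a single `total_byte_size`: after a projection drops columns, only the surviving columns contribute, and precision degrades per column rather than for the whole table at once.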
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response