
Track individual column sizes in Statistics #19098

@adriangb

Is your feature request related to a problem or challenge?

In #19094 we are going to fix incorrect total_byte_size calculations for Statistics by making them Inexact / Absent when we can't actually calculate the size of the data. While this is more correct, it would be nice if we could calculate scan sizes, etc. in more scenarios. In particular, we cannot calculate the scan sizes of variable length columns (e.g. Utf8) from just the type and number of rows.
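To illustrate the limitation, here is a minimal sketch of a size estimate that uses only the type and the row count. The `Precision` enum is a simplified local stand-in for DataFusion's type of the same name, and `ColumnType`, `fixed_width`, and `estimate_byte_size` are hypothetical names for this example only. Validity bitmaps and offsets are ignored.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>` (assumption: only
/// the three variants mentioned in this issue are modelled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical, trimmed-down set of column types for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

impl ColumnType {
    /// Bytes per value for fixed-width types, `None` for variable-length ones.
    pub fn fixed_width(self) -> Option<usize> {
        match self {
            ColumnType::Int32 => Some(4),
            ColumnType::Int64 | ColumnType::Float64 => Some(8),
            ColumnType::Utf8 => None,
        }
    }
}

/// Estimate the value-buffer size from type and row count alone.
/// Variable-length columns (and overflow) yield `Absent`.
pub fn estimate_byte_size(ty: ColumnType, num_rows: Precision<usize>) -> Precision<usize> {
    match (ty.fixed_width(), num_rows) {
        (Some(w), Precision::Exact(n)) => {
            w.checked_mul(n).map_or(Precision::Absent, Precision::Exact)
        }
        (Some(w), Precision::Inexact(n)) => {
            w.checked_mul(n).map_or(Precision::Absent, Precision::Inexact)
        }
        // Utf8: the row count says nothing about how long the strings are.
        _ => Precision::Absent,
    }
}
```

With this sketch, `estimate_byte_size(ColumnType::Int64, Precision::Exact(100))` gives `Precision::Exact(800)`, while any `Utf8` column gives `Precision::Absent`.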

To address this I propose we add ColumnStatistics { scan_byte_size: Precision<usize>, ... }, which the file format can populate, e.g. when we know that the in-memory Arrow size of a Utf8View column is the same as the uncompressed size of the corresponding Parquet column. I don't know in how many cases we'll be able to derive this information without reading the data, but I think in some cases we should be able to.
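A rough sketch of what this could look like, not the actual DataFusion definition: `Precision` is again a simplified local stand-in, `ColumnStatistics` is reduced to the proposed field, and `ColumnChunkMeta`, `matches_in_memory`, and `column_stats_from_chunks` are hypothetical names. Whether a chunk's uncompressed size really matches the in-memory size is left as an input the file format would have to decide.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Trimmed-down column statistics showing only the proposed field;
/// the real struct carries more fields (the `...` in the proposal).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnStatistics {
    pub scan_byte_size: Precision<usize>,
}

/// Hypothetical per-chunk metadata a file format reader might expose.
pub struct ColumnChunkMeta {
    /// Uncompressed size of the column chunk as recorded in file metadata.
    pub uncompressed_size: usize,
    /// Assumption: the reader knows whether that size equals the
    /// in-memory size for this column's type.
    pub matches_in_memory: bool,
}

/// Populate `scan_byte_size` from file metadata without reading any data.
pub fn column_stats_from_chunks(chunks: &[ColumnChunkMeta]) -> ColumnStatistics {
    let mut total: usize = 0;
    for chunk in chunks {
        // If any chunk's size cannot be trusted, report nothing.
        if !chunk.matches_in_memory {
            return ColumnStatistics { scan_byte_size: Precision::Absent };
        }
        total = match total.checked_add(chunk.uncompressed_size) {
            Some(t) => t,
            None => return ColumnStatistics { scan_byte_size: Precision::Absent },
        };
    }
    ColumnStatistics { scan_byte_size: Precision::Exact(total) }
}
```

Falling back to `Absent` instead of guessing keeps the statistic honest, in line with the fix in #19094.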

Then once we have this we can track the total scan size through projections, limits, etc.
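As a sketch of how per-column sizes might flow through a plan: a projection sums the sizes of the selected columns, and a limit scales the total by the fraction of rows kept. Everything here is an assumption for illustration (the simplified `Precision`, the helper names `add`, `project_scan_size`, `limit_scan_size`, and the scaling heuristic); it is not how DataFusion currently propagates statistics.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

use Precision::{Absent, Exact, Inexact};

/// Add two sizes: exact only if both are exact, absent if either is absent.
pub fn add(a: Precision<usize>, b: Precision<usize>) -> Precision<usize> {
    match (a, b) {
        (Exact(x), Exact(y)) => x.checked_add(y).map_or(Absent, Exact),
        (Exact(x) | Inexact(x), Exact(y) | Inexact(y)) => {
            x.checked_add(y).map_or(Absent, Inexact)
        }
        _ => Absent,
    }
}

/// Scan size after projecting the given column indices.
pub fn project_scan_size(columns: &[Precision<usize>], projection: &[usize]) -> Precision<usize> {
    projection.iter().fold(Exact(0), |acc, &i| {
        // An out-of-range index is treated as an unknown size.
        add(acc, columns.get(i).copied().unwrap_or(Absent))
    })
}

/// Scan size after `LIMIT fetch`, given the input row count.
pub fn limit_scan_size(
    size: Precision<usize>,
    num_rows: Precision<usize>,
    fetch: usize,
) -> Precision<usize> {
    let bytes = match size {
        Exact(b) | Inexact(b) => b,
        Absent => return Absent,
    };
    match num_rows {
        // The limit keeps every row, so the size is unchanged.
        Exact(n) if fetch >= n => size,
        // Rows are dropped: scale proportionally. Only an estimate,
        // because variable-length rows need not be equally sized.
        Exact(n) | Inexact(n) if fetch < n => {
            Inexact((bytes as u128 * fetch as u128 / n as u128) as usize)
        }
        // Row count unknown or unreliable: the input size is an upper bound.
        _ => Inexact(bytes),
    }
}
```

The limit case demotes the result to `Inexact` whenever rows are actually dropped, since a proportional estimate cannot be exact for columns like Utf8.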


Labels: enhancement (New feature or request)
