Skip to content

DecimalArray Logical and Physical type mismatch #5820

@connortsui20

Description

@connortsui20

We have this documentation on DecimalArray:

/// These are just the maximal ranges for each scalar type, but it is perfectly legal to store
/// values with precision that does not match this exactly. For example, a valid DecimalArray with
/// precision=39 may store its values in an `i8` if all of the actual values fit into it.
///
/// Similarly, a `DecimalArray` can be built that stores a set of precision=2 values in a
/// `Buffer<i256>`.

This presents a problem (which is the cause of many of the recent fuzzer crashes): If we try to patch values or use fill_null with a decimal value that is valid according to the logical DType, but is not compatible with the storage DType, then we just panic!

This is highly concerning! The physical type has leaked and now governs what we are allowed to do instead of the logical type.

I see 3 ways forward:

  1. Keep this behavior. I think this is a bad idea, as it has completely different behavior from how
    PrimitiveArray works with BitPackedArray, for example.
  2. Reallocate the entire array whenever we run into this so that the new storage type can fit the new values.
    I also think this is a bad idea in terms of performance.
  3. We instead require that the storage type must always be able to fit any value in the domain of the logical
    DecimalDType.

For 3, by requiring this constraint we no longer need to worry about checking every single value during validation, we can just do a single check in the constructor that the DType fits the storage.


We are about to do a huge migration to the new Vortex operator world (where everything is lazily computed / executed), which means we don't necessarily need to do this migration in the old world: We can just make sure we implement the correct behavior in the new world.

In the new world, we need to make sure that the DecimalVector the array tree executes into holds a DVector that can store every possible value given by the DecmialDType's PrecisionScale.

In practice, this will mean finding the smallest integer storage type for the given PrecisionScale. Note we don't necessarily need to use the minimal storage type, though that will be the most helpful for performance.

The more important thing is that we do not allow executing a DecimalArray into a DecimalVector that cannot fit all possible values according to the PrecisionScale.

And also note that we need to handle old files that we were written in the old world with this bad logic.


Fuzzer Crashes

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions