-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3116: Implement the Variant binary encoding #3117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @gene-db! I left some comments, but this is looking good
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
| if (index < 0 || index >= size) { | ||
| throw malformedVariant(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.
parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java
Outdated
Show resolved
Hide resolved
| // If the value doesn't fit any integer type, parse it as decimal or floating instead. | ||
| parseAndAppendFloatingPoint(parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is lossy, and I'd rather raise an exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be fine for an engine, but a format should not be lossy. I think that it is fine to parse integers that are too large as a decimal(scale=0) but not as a floating point number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to throw an exception if it doesn't fit into int or decimal.
parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java
Show resolved
Hide resolved
| * Builder for creating Variant value and metadata. | ||
| */ | ||
| public class VariantBuilder { | ||
| public VariantBuilder(boolean allowDuplicateKeys) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we allow this? This isn't allowed by the spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.
I added a comment to clarify.
parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
| return Arrays.copyOfRange(value, pos, pos + size); | ||
| } | ||
|
|
||
| public byte[] getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).
A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.
The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we assume that
metadataandvalueare in the same array? I don't think we are making that assumption.
I was referring to the possible values and intent for the pos argument and trying to understand your intent from this code. But that isn't the point I was trying to make.
The point here is that it is more common in Java to pass byte arrays with offset and length, rather than requiring that arrays are copied before passing them in. I think the use of 0-offset byte arrays is limiting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure what the proposal is. Is this saying we should not return a byte[], but something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is not that you're returning byte[] here. It is that the class works with byte[] and assumes content in both byte arrays starts at offset 0. That's limiting for anyone that wants to work with this because it requires copying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the class to also take in an offset for the byte arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use ByteBuffer in the interface for Variant, rather than (byte[], pos) pairs? There's also a Binary class in parquet-java, although I'm not quite sure what its intended use cases are, or what the pros and cons would be compared to ByteBuffer.
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java
Outdated
Show resolved
Hide resolved
gene-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
| return Arrays.copyOfRange(value, pos, pos + size); | ||
| } | ||
|
|
||
| public byte[] getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.
The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
| // If the value doesn't fit any integer type, parse it as decimal or floating instead. | ||
| parseAndAppendFloatingPoint(parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
| * @return the JSON representation of the variant | ||
| * @throws MalformedVariantException if the variant is malformed | ||
| */ | ||
| public String toJson(ZoneId zoneId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the toJson() which defaults to +00:00. The options are there for engines to choose the behavior, while sharing the same implementation.
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java
Outdated
Show resolved
Hide resolved
| * An exception indicating that the Variant is malformed. | ||
| */ | ||
| public class MalformedVariantException extends RuntimeException { | ||
| public MalformedVariantException() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? I genearally consider no-arg constructors for exception classes to be an anti-pattern because people use them without thinking about what helpful error message should be included.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
| ~ specific language governing permissions and limitations | ||
| ~ under the License. | ||
| --> | ||
| <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @gene-db to drive the reference implementation.
I have a general question on the requirement: we implement mostly Parse_Json() in this PR. Are we required to construct variant with richer type - date, timestamp, etc.? May be out of scope for this PR. I have the implementation in Iceberg (apache/iceberg#11857 to add the full support. As I talked to @rdblue, that may not be required for Iceberg but I can include such implementation in Parquet after this PR if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think parse_json should be trying to determine what type a particular JSON string is supposed to be. The JSON spec doesn't have the richer types, so parse_json will not try to guess what the strings might be. It might be error-prone and would be costly in terms of performance. Therefore, parse_json will only use a subset of the variant types.
This PR also supports the variant builder, which supports creating variant values with all of the variant types.
| int basicType = value[pos] & BASIC_TYPE_MASK; | ||
| int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & PRIMITIVE_TYPE_MASK; | ||
| if (basicType != ARRAY) { | ||
| throw unexpectedType(Type.ARRAY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will throw MalformedVariantException, but it is the fault of the caller that called handleArray and not the data. I think that this should be an IllegalArgumentException. The error message is fine (Expected type to be __).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to IllegalArgumentException.
| int idStart = pos + 1 + sizeBytes; | ||
| int offsetStart = idStart + numElements * idSize; | ||
| int dataStart = offsetStart + (numElements + 1) * offsetSize; | ||
| return new ObjectInfo(numElements, idSize, offsetSize, idStart, offsetStart, dataStart); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why you'd build pos into the offsets passed back, but returning the absolute position in the buffer means that the object info (or similarly, array info) is not applicable to values that are returned by Variant#getValue even though the bytes are the same. The same bytes at a different offset produce different ObjectInfo instances. This isn't a huge problem, but it seems like it could cause some bugs if callers decide to reuse any values they take from the object info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I removed adding the pos, and made these "offsets from beginning of object/array" and not absolute positions.
| switch (typeInfo) { | ||
| case DECIMAL4: | ||
| result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale); | ||
| checkDecimal(result, MAX_DECIMAL4_PRECISION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decimal was just read from 4 bytes. What's the value of this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these checks.
| int offsetSize = ((metadata[0] >> 6) & 0x3) + 1; | ||
| int dictSize = readUnsigned(metadata, 1, offsetSize); | ||
| if (id >= dictSize) { | ||
| throw new MalformedVariantException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a problem with the ID, not the variant. It should be IllegalArgumentException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| } | ||
| // There are a header byte, a `dictSize` with `offsetSize` bytes, and `(dictSize + 1)` offsets | ||
| // before the string data. | ||
| int stringStart = 1 + (dictSize + 2) * offsetSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to read this if you also used offsetListOffset to capture 1 + dictSize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| throw new MalformedVariantException( | ||
| String.format("Invalid offset: %d. next offset: %d", offset, nextOffset)); | ||
| } | ||
| checkIndex(stringStart + nextOffset - 1, metadata.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, I would rename stringStart to dataOffset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
| * An interface for the Variant object handler. | ||
| * @param <T> The return type of the handler | ||
| */ | ||
| public interface ObjectHandlerException<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better solution than this? It's not ideal that this library accounts for a specific use of ObjectHandler by creating a secondary handler that passes an IOException through. I would prefer using a getObjectInfo(byte[], int) for those cases instead of adding new handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. It becomes much simpler with something like getObjectInfo and getArrayInfo. Introducing ObjectInfo and ArrayInfo made this refactor easier. Now, there are no more handlers. Thanks!
| * Check the validity of an array index `pos`. | ||
| * @param pos The index to check | ||
| * @param length The length of the array | ||
| * @throws MalformedVariantException if the index is out of bound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| result |= unsignedByteValue << (8 * i); | ||
| } | ||
| if (result < 0) { | ||
| throw new MalformedVariantException(String.format("Failed to read unsigned int. numBytes: %d", numBytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with other places, it doesn't always make sense for this to throw MalformedVariantException because that assumes how this is called. In order to throw MalformedVariantException, this check should be in the calling code that is decoding an offset, rather than here. With the call here, this is violating some other expectation of the method -- that the value will fit in an unsigned int -- even though there is no restriction on numBytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to IllegalArgumentException.
| /** | ||
| * @return the primitive type id from a variant value | ||
| */ | ||
| public int getPrimitiveTypeId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why expose this instead of the Type enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not really needed, since we have getType() below. Removed.
|
|
||
| /** | ||
| * The value type of Variant value. It is determined by the header byte but not a 1:1 mapping | ||
| * (for example, INT1/2/4/8 all maps to `Type.LONG`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer true because it returns BYTE, SHORT, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| value, | ||
| pos, | ||
| (info) -> info.dataStart | ||
| - pos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pointed out earlier that adding pos to the offsets may be confusing. I think this is a good example, where in order to calculate the size of the object this has to account for pos being added in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated these fields to be offsets, and not absolute.
| this.pos = pos; | ||
| // There is currently only one allowed version. | ||
| if (metadata.length < 1 || (metadata[0] & VariantUtil.VERSION_MASK) != VariantUtil.VERSION) { | ||
| throw new MalformedVariantException(String.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be UnsupportedOperationException rather than MalformedVariantException? The variant may not be malformed if the version is newer. It is just not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. Updated.
gene-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue Thanks! I updated the PR. I removed all of the "flexible" to JSON conversion, and exposed an interface an engine can use to convert scalars differently if desired.
| * Check the validity of an array index `pos`. | ||
| * @param pos The index to check | ||
| * @param length The length of the array | ||
| * @throws MalformedVariantException if the index is out of bound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| case INT2: | ||
| case INT4: | ||
| case INT8: | ||
| return Type.LONG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| /** | ||
| * The value type of Variant value. It is determined by the header byte but not a 1:1 mapping | ||
| * (for example, INT1/2/4/8 all maps to `Type.LONG`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
| result |= unsignedByteValue << (8 * i); | ||
| } | ||
| if (result < 0) { | ||
| throw new MalformedVariantException(String.format("Failed to read unsigned int. numBytes: %d", numBytes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to IllegalArgumentException.
| switch (typeInfo) { | ||
| case DECIMAL4: | ||
| result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale); | ||
| checkDecimal(result, MAX_DECIMAL4_PRECISION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these checks.
| */ | ||
| public class VariantSizeLimitException extends RuntimeException { | ||
| public VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) { | ||
| super(String.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we wanted to avoid materializing the full value if it is already going exceeding the size, but maybe this is not a big issue. Removed.
| this.pos = pos; | ||
| // There is currently only one allowed version. | ||
| if (metadata.length < 1 || (metadata[0] & VariantUtil.VERSION_MASK) != VariantUtil.VERSION) { | ||
| throw new MalformedVariantException(String.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. Updated.
| /** | ||
| * @return the primitive type id from a variant value | ||
| */ | ||
| public int getPrimitiveTypeId() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not really needed, since we have getType() below. Removed.
| // If the value doesn't fit any integer type, parse it as decimal or floating instead. | ||
| parseAndAppendFloatingPoint(parser); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to throw an exception if it doesn't fit into int or decimal.
| return Arrays.copyOfRange(value, pos, pos + size); | ||
| } | ||
|
|
||
| public byte[] getMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the class to also take in an offset for the byte arrays.
|
@cashmand I updated this to use |
|
This will be split up into multiple smaller PRs. The decode functionality is in #3197 |
|
Is this PR still relevant? |
|
Nope, this PR is no longer needed. |
Rationale for this change
This is a reference implementation for the Variant binary format.
What changes are included in this PR?
A new module for encoding/decoding the Variant binary format.
Are these changes tested?
Added unit tests
Are there any user-facing changes?
No
Closes #3116