Skip to content

Conversation

@gene-db
Copy link
Contributor

@gene-db gene-db commented Jan 7, 2025

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @gene-db! I left some comments, but this is looking good

Comment on lines 238 to 240
if (index < 0 || index >= size) {
throw malformedVariant();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.

Comment on lines 551 to 552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is lossy, and I'd rather raise an exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be fine for an engine, but a format should not be lossy. I think that it is fine to parse integers that are too large as a decimal(scale=0) but not as a floating point number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to throw an exception if it doesn't fit into int or decimal.

* Builder for creating Variant value and metadata.
*/
public class VariantBuilder {
public VariantBuilder(boolean allowDuplicateKeys) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we allow this? This isn't allowed by the spec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.

I added a comment to clarify.

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

I was referring to the possible values and intent for the pos argument and trying to understand your intent from this code. But that isn't the point I was trying to make.

The point here is that it is more common in Java to pass byte arrays with offset and length, rather than requiring that arrays are copied before passing them in. I think the use of 0-offset byte arrays is limiting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure what the proposal is. Is this saying we should not return a byte[], but something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is not that you're returning byte[] here. It is that the class works with byte[] and assumes content in both byte arrays starts at offset 0. That's limiting for anyone that wants to work with this because it requires copying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the class to also take in an offset for the byte arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use ByteBuffer in the interface for Variant, rather than (byte[], pos) pairs? There's also a Binary class in parquet-java, although I'm not quite sure what its intended use cases are, or what the pros and cons would be compared to ByteBuffer.

Copy link
Contributor Author

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko @rdblue Thanks for the reviews! I updated the PR.

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

Comment on lines 551 to 552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

* @return the JSON representation of the variant
* @throws MalformedVariantException if the variant is malformed
*/
public String toJson(ZoneId zoneId) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the toJson() which defaults to +00:00. The options are there for engines to choose the behavior, while sharing the same implementation.

@gene-db gene-db requested review from Fokko and rdblue February 6, 2025 03:05
@gene-db gene-db requested a review from cashmand February 13, 2025 18:59
* An exception indicating that the Variant is malformed.
*/
public class MalformedVariantException extends RuntimeException {
public MalformedVariantException() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? I genearally consider no-arg constructors for exception classes to be an anti-pattern because people use them without thinking about what helpful error message should be included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

~ specific language governing permissions and limitations
~ under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
Copy link
Contributor

@aihuaxu aihuaxu Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @gene-db to drive the reference implementation.

I have a general question on the requirement: we implement mostly Parse_Json() in this PR. Are we required to construct variant with richer type - date, timestamp, etc.? May be out of scope for this PR. I have the implementation in Iceberg (apache/iceberg#11857 to add the full support. As I talked to @rdblue, that may not be required for Iceberg but I can include such implementation in Parquet after this PR if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think parse_json should be trying to determine what type a particular JSON string is supposed to be. The JSON spec doesn't have the richer types, so parse_json will not try to guess what the strings might be. It might be error-prone and would be costly in terms of performance. Therefore, parse_json will only use a subset of the variant types.

This PR also supports the variant builder, which supports creating variant values with all of the variant types.

@gene-db
Copy link
Contributor Author

gene-db commented Mar 17, 2025

@Fokko @rdblue I updated the PR. Could you please take another look? Thanks!

int basicType = value[pos] & BASIC_TYPE_MASK;
int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & PRIMITIVE_TYPE_MASK;
if (basicType != ARRAY) {
throw unexpectedType(Type.ARRAY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will throw MalformedVariantException, but it is the fault of the caller that called handleArray and not the data. I think that this should be an IllegalArgumentException. The error message is fine (Expected type to be __).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to IllegalArgumentException.

int idStart = pos + 1 + sizeBytes;
int offsetStart = idStart + numElements * idSize;
int dataStart = offsetStart + (numElements + 1) * offsetSize;
return new ObjectInfo(numElements, idSize, offsetSize, idStart, offsetStart, dataStart);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you'd build pos into the offsets passed back, but returning the absolute position in the buffer means that the object info (or similarly, array info) is not applicable to values that are returned by Variant#getValue even though the bytes are the same. The same bytes at a different offset produce different ObjectInfo instances. This isn't a huge problem, but it seems like it could cause some bugs if callers decide to reuse any values they take from the object info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I removed adding the pos, and made these "offsets from beginning of object/array" and not absolute positions.

switch (typeInfo) {
case DECIMAL4:
result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
checkDecimal(result, MAX_DECIMAL4_PRECISION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This decimal was just read from 4 bytes. What's the value of this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these checks.

int offsetSize = ((metadata[0] >> 6) & 0x3) + 1;
int dictSize = readUnsigned(metadata, 1, offsetSize);
if (id >= dictSize) {
throw new MalformedVariantException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a problem with the ID, not the variant. It should be IllegalArgumentException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

}
// There are a header byte, a `dictSize` with `offsetSize` bytes, and `(dictSize + 1)` offsets
// before the string data.
int stringStart = 1 + (dictSize + 2) * offsetSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to read this if you also used offsetListOffset to capture 1 + dictSize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

throw new MalformedVariantException(
String.format("Invalid offset: %d. next offset: %d", offset, nextOffset));
}
checkIndex(stringStart + nextOffset - 1, metadata.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, I would rename stringStart to dataOffset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

* An interface for the Variant object handler.
* @param <T> The return type of the handler
*/
public interface ObjectHandlerException<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better solution than this? It's not ideal that this library accounts for a specific use of ObjectHandler by creating a secondary handler that passes an IOException through. I would prefer using a getObjectInfo(byte[], int) for those cases instead of adding new handlers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. It becomes much simpler with something like getObjectInfo and getArrayInfo. Introducing ObjectInfo and ArrayInfo made this refactor easier. Now, there are no more handlers. Thanks!

* Check the validity of an array index `pos`.
* @param pos The index to check
* @param length The length of the array
* @throws MalformedVariantException if the index is out of bound
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't thrown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

result |= unsignedByteValue << (8 * i);
}
if (result < 0) {
throw new MalformedVariantException(String.format("Failed to read unsigned int. numBytes: %d", numBytes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with other places, it doesn't always make sense for this to throw MalformedVariantException because that assumes how this is called. In order to throw MalformedVariantException, this check should be in the calling code that is decoding an offset, rather than here. With the call here, this is violating some other expectation of the method -- that the value will fit in an unsigned int -- even though there is no restriction on numBytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to IllegalArgumentException.

/**
* @return the primitive type id from a variant value
*/
public int getPrimitiveTypeId() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why expose this instead of the Type enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not really needed, since we have getType() below. Removed.


/**
* The value type of Variant value. It is determined by the header byte but not a 1:1 mapping
* (for example, INT1/2/4/8 all maps to `Type.LONG`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer true because it returns BYTE, SHORT, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

value,
pos,
(info) -> info.dataStart
- pos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pointed out earlier that adding pos to the offsets may be confusing. I think this is a good example, where in order to calculate the size of the object this has to account for pos being added in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated these fields to be offsets, and not absolute.

this.pos = pos;
// There is currently only one allowed version.
if (metadata.length < 1 || (metadata[0] & VariantUtil.VERSION_MASK) != VariantUtil.VERSION) {
throw new MalformedVariantException(String.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be UnsupportedOperationException rather than MalformedVariantException? The variant may not be malformed if the version is newer. It is just not supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. Updated.

Copy link
Contributor Author

@gene-db gene-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue Thanks! I updated the PR. I removed all of the "flexible" to JSON conversion, and exposed an interface an engine can use to convert scalars differently if desired.

* Check the validity of an array index `pos`.
* @param pos The index to check
* @param length The length of the array
* @throws MalformedVariantException if the index is out of bound
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

case INT2:
case INT4:
case INT8:
return Type.LONG;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/**
* The value type of Variant value. It is determined by the header byte but not a 1:1 mapping
* (for example, INT1/2/4/8 all maps to `Type.LONG`).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

result |= unsignedByteValue << (8 * i);
}
if (result < 0) {
throw new MalformedVariantException(String.format("Failed to read unsigned int. numBytes: %d", numBytes));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to IllegalArgumentException.

switch (typeInfo) {
case DECIMAL4:
result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
checkDecimal(result, MAX_DECIMAL4_PRECISION);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these checks.

*/
public class VariantSizeLimitException extends RuntimeException {
public VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
super(String.format(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we wanted to avoid materializing the full value if it is already going exceeding the size, but maybe this is not a big issue. Removed.

this.pos = pos;
// There is currently only one allowed version.
if (metadata.length < 1 || (metadata[0] & VariantUtil.VERSION_MASK) != VariantUtil.VERSION) {
throw new MalformedVariantException(String.format(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. Updated.

/**
* @return the primitive type id from a variant value
*/
public int getPrimitiveTypeId() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not really needed, since we have getType() below. Removed.

Comment on lines 551 to 552
// If the value doesn't fit any integer type, parse it as decimal or floating instead.
parseAndAppendFloatingPoint(parser);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to throw an exception if it doesn't fit into int or decimal.

return Arrays.copyOfRange(value, pos, pos + size);
}

public byte[] getMetadata() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the class to also take in an offset for the byte arrays.

@gene-db gene-db requested a review from rdblue March 27, 2025 05:08
cashmand added a commit to cashmand/parquet-java that referenced this pull request Mar 28, 2025
@gene-db
Copy link
Contributor Author

gene-db commented Apr 6, 2025

@cashmand I updated this to use ByteBuffer. Will this be easier to integrate with the shredding support?

@gene-db
Copy link
Contributor Author

gene-db commented Apr 18, 2025

This will be split up into multiple smaller PRs. The decode functionality is in #3197

cashmand added a commit to cashmand/parquet-java that referenced this pull request Apr 30, 2025
@emkornfield
Copy link
Contributor

Is this PR still relevant?

@gene-db
Copy link
Contributor Author

gene-db commented Jun 11, 2025

Nope, this PR is no longer needed.

@gene-db gene-db closed this Jun 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement the Variant binary encoding

6 participants