GH-3116: Implement the Variant binary encoding #3117

gene-db · 2025-01-07T21:14:58Z

Rationale for this change

This is a reference implementation for the Variant binary format.

What changes are included in this PR?

A new module for encoding/decoding the Variant binary format.

Are these changes tested?

Added unit tests

Are there any user-facing changes?

No

Closes #3116

Fokko

Thanks for working on this @gene-db! I left some comments, but this is looking good

parquet-variant/pom.xml

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

Fokko · 2025-01-20T15:56:17Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+      if (index < 0 || index >= size) {
+        throw malformedVariant();
+      }


This looks inconsistent with the getFieldAtIndex where we return a null. Let's raise an exception at line 220 as well.

getFieldAtIndex is a little bit different, since if a field doesn't exist in a variant value, that doesn't mean the variant value is malformed. This dictionary case is different because we are expecting an id in the dictionary to exist, but it doesn't.

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

Fokko · 2025-01-23T10:46:16Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


I think this is lossy, and I'd rather raise an exception

Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

This may be fine for an engine, but a format should not be lossy. I think that it is fine to parse integers that are too large as a decimal(scale=0) but not as a floating point number.

Updated to throw an exception if it doesn't fit into int or decimal.
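For readers following this thread, here is a minimal sketch of the lossless fallback suggested above: try a 64-bit integer first, then a decimal with scale 0, and fail otherwise. The class name, method name, and the 38-digit precision limit are assumptions made for illustration; this is not the code from the PR.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class JsonIntegerSketch {
  // Assumption: the widest decimal type holds at most 38 digits of precision.
  static final int MAX_DECIMAL_PRECISION = 38;

  /** Returns a Long when the literal fits in 64 bits, otherwise a scale-0 BigDecimal. */
  static Object parseInteger(String literal) {
    BigInteger parsed = new BigInteger(literal);
    if (parsed.bitLength() <= 63) {
      // bitLength() excludes the sign bit, so this value fits in a signed long.
      return parsed.longValueExact();
    }
    // Scale is 0, so every digit of the integer is preserved.
    BigDecimal decimal = new BigDecimal(parsed);
    if (decimal.precision() > MAX_DECIMAL_PRECISION) {
      throw new IllegalArgumentException("Integer does not fit in long or decimal: " + literal);
    }
    return decimal;
  }
}
```

Unlike a floating-point fallback, both successful branches round-trip the original digits exactly.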

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

Fokko · 2025-01-23T12:57:11Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+ * Builder for creating Variant value and metadata.
+ */
+public class VariantBuilder {
+  public VariantBuilder(boolean allowDuplicateKeys) {


Why would we allow this? This isn't allowed by the spec

This is not for writing duplicate keys in the Variant value itself, but for parsing JSON strings. JSON strings might have duplicate keys, and this flag controls the behavior when encountering duplicate keys.

I added a comment to clarify.
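As an illustration of what such a flag might control while parsing a JSON object, here is a hypothetical helper. The names are invented, and the "last value wins" policy is an assumption about the lenient mode, not something this thread confirms.

```java
import java.util.Map;

final class DuplicateKeySketch {
  /**
   * Records one parsed JSON field. On a repeated key, either rejects the input
   * or lets the later value replace the earlier one (assumed policy).
   */
  static void putField(
      Map<String, Object> fields, String key, Object value, boolean allowDuplicateKeys) {
    if (fields.containsKey(key) && !allowDuplicateKeys) {
      throw new IllegalArgumentException("Duplicate key in JSON object: " + key);
    }
    fields.put(key, value); // replaces the earlier value when duplicates are allowed
  }
}
```

Either way, the resulting Variant object holds each key at most once, which is the constraint the spec imposes.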

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

rdblue · 2025-01-23T23:46:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


The use of byte[] seems awkward given the assumptions that are made. It looks like the intent is for value and metadata to either be two separate arrays starting at offset 0, or a single array with metadata coming first followed by value at pos (but in this case, the array is passed to the constructor twice).

A more common pattern would be to specify each array along with an offset and a length, so that there are no implicit assumptions about the array contents.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

I was referring to the possible values and intent for the pos argument and trying to understand your intent from this code. But that isn't the point I was trying to make.

The point here is that it is more common in Java to pass byte arrays with offset and length, rather than requiring that arrays are copied before passing them in. I think the use of 0-offset byte arrays is limiting.

I'm not entirely sure what the proposal is. Is this saying we should not return a byte[], but something else?

My point is not that you're returning byte[] here. It is that the class works with byte[] and assumes content in both byte arrays starts at offset 0. That's limiting for anyone that wants to work with this because it requires copying.

I updated the class to also take in an offset for the byte arrays.

Would it be better to use ByteBuffer in the interface for Variant, rather than (byte[], pos) pairs? There's also a Binary class in parquet-java, although I'm not quite sure what its intended use cases are, or what the pros and cons would be compared to ByteBuffer.
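To make the copying concern concrete, here is a small sketch of how a region of a larger array can be exposed without a copy using java.nio.ByteBuffer. The helper name is hypothetical and this is not the interface the PR ended up with.

```java
import java.nio.ByteBuffer;

final class SliceSketch {
  /**
   * Returns a read-only view of bytes[offset, offset + length).
   * Index 0 of the view maps to bytes[offset]; no bytes are copied.
   */
  static ByteBuffer view(byte[] bytes, int offset, int length) {
    // wrap() sets position=offset and limit=offset+length; slice() rebases that window to 0.
    return ByteBuffer.wrap(bytes, offset, length).slice().asReadOnlyBuffer();
  }
}
```

A caller holding metadata and value in one backing array could create two such views over it, which avoids the 0-offset restriction raised above.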

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

gene-db

@Fokko @rdblue Thanks for the reviews! I updated the PR.

parquet-variant/pom.xml

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

gene-db · 2025-02-03T18:12:44Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


Where do we assume that metadata and value are in the same array? I don't think we are making that assumption.

The pos part in getValue() is not assuming the metadata is in the same array, but is for getting a "sub-variant" value from a variant value.

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

gene-db · 2025-02-04T18:40:00Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


Yeah, this is a tricky situation. We decided to allow parsing this type of valid JSON and not return an error, since the JSON is technically valid. It is not ideal that a valid JSON string hits an error. This behavior is similar to how Snowflake's variant parses JSON.

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

gene-db · 2025-02-05T22:22:42Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+   * @return the JSON representation of the variant
+   * @throws MalformedVariantException if the variant is malformed
+   */
+  public String toJson(ZoneId zoneId) {


I added the toJson() which defaults to +00:00. The options are there for engines to choose the behavior, while sharing the same implementation.

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

rdblue · 2025-02-24T23:01:06Z

parquet-variant/src/main/java/org/apache/parquet/variant/MalformedVariantException.java

+ * An exception indicating that the Variant is malformed.
+ */
+public class MalformedVariantException extends RuntimeException {
+  public MalformedVariantException() {


Is this necessary? I generally consider no-arg constructors for exception classes to be an anti-pattern because people use them without thinking about what helpful error message should be included.
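One way to follow this advice is to offer only constructors that require a message. The class below is a renamed sketch for illustration, not the exception class from the PR.

```java
/** Sketch of an exception type that cannot be created without an error message. */
final class MalformedVariantSketchException extends RuntimeException {
  MalformedVariantSketchException(String message) {
    super(message);
  }

  MalformedVariantSketchException(String message, Throwable cause) {
    super(message, cause);
  }
}
```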

aihuaxu · 2025-03-03T18:50:57Z

parquet-variant/pom.xml

+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">


Thanks a lot @gene-db for driving the reference implementation.

I have a general question about the requirements: this PR mostly implements Parse_Json(). Are we required to construct variants with richer types (date, timestamp, etc.)? That may be out of scope for this PR. I have an implementation in Iceberg (apache/iceberg#11857) that adds the full support. As I discussed with @rdblue, that may not be required for Iceberg, but I can include such an implementation in Parquet after this PR if needed.

I don't think parse_json should be trying to determine what type a particular JSON string is supposed to be. The JSON spec doesn't have the richer types, so parse_json will not try to guess what the strings might be. It might be error-prone and would be costly in terms of performance. Therefore, parse_json will only use a subset of the variant types.

This PR also supports the variant builder, which supports creating variant values with all of the variant types.

gene-db · 2025-03-17T20:37:17Z

@Fokko @rdblue I updated the PR. Could you please take another look? Thanks!

rdblue · 2025-03-21T23:12:39Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    int basicType = value[pos] & BASIC_TYPE_MASK;
+    int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & PRIMITIVE_TYPE_MASK;
+    if (basicType != ARRAY) {
+      throw unexpectedType(Type.ARRAY);


This will throw MalformedVariantException, but it is the fault of the caller that called handleArray and not the data. I think that this should be an IllegalArgumentException. The error message is fine (Expected type to be __).

Updated to IllegalArgumentException.
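A minimal sketch of the distinction drawn here: when a caller asks to read an array from a value whose header says otherwise, the argument is at fault, not the data. The constants are assumptions based on the snippet above (basic type in the low two bits of the header, with 3 meaning array), and the method name is hypothetical.

```java
final class ErrorKindSketch {
  // Assumptions for illustration: basic type lives in the low two bits, and 3 means ARRAY.
  static final int BASIC_TYPE_BITS = 2;
  static final int BASIC_TYPE_MASK = 0x3;
  static final int TYPE_INFO_MASK = 0x3F;
  static final int ARRAY = 3;

  /** Returns the type-info bits of an array header, rejecting non-array values. */
  static int arrayTypeInfo(byte[] value, int pos) {
    int basicType = value[pos] & BASIC_TYPE_MASK;
    if (basicType != ARRAY) {
      // The bytes may be perfectly valid; the caller picked the wrong accessor.
      throw new IllegalArgumentException("Expected type to be ARRAY");
    }
    return (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
  }
}
```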

rdblue · 2025-03-21T23:21:13Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    int idStart = pos + 1 + sizeBytes;
+    int offsetStart = idStart + numElements * idSize;
+    int dataStart = offsetStart + (numElements + 1) * offsetSize;
+    return new ObjectInfo(numElements, idSize, offsetSize, idStart, offsetStart, dataStart);


I understand why you'd build pos into the offsets passed back, but returning the absolute position in the buffer means that the object info (or similarly, array info) is not applicable to values that are returned by Variant#getValue even though the bytes are the same. The same bytes at a different offset produce different ObjectInfo instances. This isn't a huge problem, but it seems like it could cause some bugs if callers decide to reuse any values they take from the object info.

I see. I removed adding the pos, and made these "offsets from beginning of object/array" and not absolute positions.
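The following sketch shows the shape of that change under stated assumptions: the info object stores offsets relative to the first byte of the object, so the same bytes yield the same info wherever they sit in a buffer. Field and method names are hypothetical and the class is trimmed to the fields needed for the arithmetic.

```java
final class ObjectInfoSketch {
  final int numElements;
  final int idStartOffset;     // relative to the object's header byte
  final int offsetStartOffset; // relative to the object's header byte
  final int dataStartOffset;   // relative to the object's header byte

  private ObjectInfoSketch(int numElements, int idStart, int offsetStart, int dataStart) {
    this.numElements = numElements;
    this.idStartOffset = idStart;
    this.offsetStartOffset = offsetStart;
    this.dataStartOffset = dataStart;
  }

  /** Same arithmetic as the snippet above, but without adding `pos`. */
  static ObjectInfoSketch of(int numElements, int sizeBytes, int idSize, int offsetSize) {
    int idStart = 1 + sizeBytes;
    int offsetStart = idStart + numElements * idSize;
    int dataStart = offsetStart + (numElements + 1) * offsetSize;
    return new ObjectInfoSketch(numElements, idStart, offsetStart, dataStart);
  }

  /** Callers add the buffer position only when they actually index into a buffer. */
  int absoluteDataStart(int pos) {
    return pos + dataStartOffset;
  }
}
```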

rdblue · 2025-03-21T23:22:21Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    switch (typeInfo) {
+      case DECIMAL4:
+        result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
+        checkDecimal(result, MAX_DECIMAL4_PRECISION);


This decimal was just read from 4 bytes. What's the value of this check?

Removed these checks.

rdblue · 2025-03-21T23:23:20Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    int offsetSize = ((metadata[0] >> 6) & 0x3) + 1;
+    int dictSize = readUnsigned(metadata, 1, offsetSize);
+    if (id >= dictSize) {
+      throw new MalformedVariantException(


This is a problem with the ID, not the variant. It should be IllegalArgumentException.

rdblue · 2025-03-21T23:24:50Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    }
+    // There are a header byte, a `dictSize` with `offsetSize` bytes, and `(dictSize + 1)` offsets
+    // before the string data.
+    int stringStart = 1 + (dictSize + 2) * offsetSize;


I think it would be easier to read this if you also used offsetListOffset to capture 1 + dictSize.

rdblue · 2025-03-21T23:26:21Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+      throw new MalformedVariantException(
+          String.format("Invalid offset: %d. next offset: %d", offset, nextOffset));
+    }
+    checkIndex(stringStart + nextOffset - 1, metadata.length);


For consistency, I would rename stringStart to dataOffset.
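A sketch of how the two naming suggestions could fit together, assuming the layout described in the code comment above (one header byte, a `dictSize` field of `offsetSize` bytes, then `dictSize + 1` offsets). Here `offsetListOffset` is taken to mean the position of the first offset entry, which is an interpretation rather than something the thread spells out.

```java
final class MetadataLayoutSketch {
  /** Position of the first offset entry: one header byte plus the dictSize field. */
  static int offsetListOffset(int offsetSize) {
    return 1 + offsetSize;
  }

  /** Position of the string data: the offset list start plus (dictSize + 1) offsets. */
  static int dataOffset(int dictSize, int offsetSize) {
    return offsetListOffset(offsetSize) + (dictSize + 1) * offsetSize;
  }
}
```

This is algebraically the same as `1 + (dictSize + 2) * offsetSize`, with the intermediate step named.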

rdblue · 2025-03-21T23:28:34Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+   * An interface for the Variant object handler.
+   * @param <T> The return type of the handler
+   */
+  public interface ObjectHandlerException<T> {


Is there a better solution than this? It's not ideal that this library accounts for a specific use of ObjectHandler by creating a secondary handler that passes an IOException through. I would prefer using a getObjectInfo(byte[], int) for those cases instead of adding new handlers.

Good call. It becomes much simpler with something like getObjectInfo and getArrayInfo. Introducing ObjectInfo and ArrayInfo made this refactor easier. Now, there are no more handlers. Thanks!

rdblue · 2025-03-21T23:29:30Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+   * Check the validity of an array index `pos`.
+   * @param pos The index to check
+   * @param length The length of the array
+   * @throws MalformedVariantException if the index is out of bound


This isn't thrown.

rdblue · 2025-03-24T16:54:02Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+      result |= unsignedByteValue << (8 * i);
+    }
+    if (result < 0) {
+      throw new MalformedVariantException(String.format("Failed to read unsigned int. numBytes: %d", numBytes));


As with other places, it doesn't always make sense for this to throw MalformedVariantException because that assumes how this is called. In order to throw MalformedVariantException, this check should be in the calling code that is decoding an offset, rather than here. With the call here, this is violating some other expectation of the method -- that the value will fit in an unsigned int -- even though there is no restriction on numBytes.

Updated to IllegalArgumentException.
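For context, a self-contained sketch of a little-endian unsigned read with the check discussed here. The loop mirrors the snippet above; the range check on `numBytes` and the class name are additions for illustration, not taken from the PR.

```java
final class ReadUnsignedSketch {
  /** Reads `numBytes` little-endian bytes at `pos` as a non-negative int. */
  static int readUnsigned(byte[] bytes, int pos, int numBytes) {
    if (numBytes < 1 || numBytes > 4) {
      throw new IllegalArgumentException("numBytes must be between 1 and 4: " + numBytes);
    }
    int result = 0;
    for (int i = 0; i < numBytes; i++) {
      int unsignedByteValue = bytes[pos + i] & 0xFF;
      result |= unsignedByteValue << (8 * i);
    }
    if (result < 0) {
      // Only possible with 4 bytes and the top bit set; the value does not fit in an int.
      throw new IllegalArgumentException(
          String.format("Failed to read unsigned int. numBytes: %d", numBytes));
    }
    return result;
  }
}
```

Code that decodes an offset from untrusted data can catch this and rethrow a format-specific exception if it wants to report the variant as malformed.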

rdblue · 2025-03-24T16:55:17Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  /**
+   * @return the primitive type id from a variant value
+   */
+  public int getPrimitiveTypeId() {


Why expose this instead of the Type enum?

It is not really needed, since we have getType() below. Removed.

rdblue · 2025-03-24T16:57:00Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+
+  /**
+   * The value type of Variant value. It is determined by the header byte but not a 1:1 mapping
+   * (for example, INT1/2/4/8 all maps to `Type.LONG`).


This is no longer true because it returns BYTE, SHORT, etc.

rdblue · 2025-03-24T16:59:26Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+            value,
+            pos,
+            (info) -> info.dataStart
+                - pos


I pointed out earlier that adding pos to the offsets may be confusing. I think this is a good example, where in order to calculate the size of the object this has to account for pos being added in.

Updated these fields to be offsets, and not absolute.

rdblue · 2025-03-24T17:01:23Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    this.pos = pos;
+    // There is currently only one allowed version.
+    if (metadata.length < 1 || (metadata[0] & VariantUtil.VERSION_MASK) != VariantUtil.VERSION) {
+      throw new MalformedVariantException(String.format(


Shouldn't this be UnsupportedOperationException rather than MalformedVariantException? The variant may not be malformed if the version is newer. It is just not supported.

Yeah, good point. Updated.
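A sketch of what the updated check might look like. The constant values (version 1, stored in the low four bits of the first metadata byte) and the handling of empty metadata are assumptions for illustration.

```java
final class VersionCheckSketch {
  // Assumed values: version 1, stored in the low four bits of the first metadata byte.
  static final int VERSION = 1;
  static final int VERSION_MASK = 0x0F;

  static void checkVersion(byte[] metadata) {
    if (metadata.length < 1) {
      // Assumption: treat missing metadata as a bad argument in this sketch.
      throw new IllegalArgumentException("Metadata is empty");
    }
    int version = metadata[0] & VERSION_MASK;
    if (version != VERSION) {
      // A newer version is not necessarily malformed; this reader just cannot handle it.
      throw new UnsupportedOperationException(
          String.format("Unsupported variant metadata version: %d", version));
    }
  }
}
```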

gene-db

@rdblue Thanks! I updated the PR. I removed all of the "flexible" to-JSON conversion, and exposed an interface that an engine can use to convert scalars differently if desired.

gene-db · 2025-03-24T18:26:04Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+   * Check the validity of an array index `pos`.
+   * @param pos The index to check
+   * @param length The length of the array
+   * @throws MalformedVariantException if the index is out of bound


gene-db · 2025-03-24T18:47:03Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+          case INT2:
+          case INT4:
+          case INT8:
+            return Type.LONG;


gene-db · 2025-03-24T18:47:10Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+
+  /**
+   * The value type of Variant value. It is determined by the header byte but not a 1:1 mapping
+   * (for example, INT1/2/4/8 all maps to `Type.LONG`).


gene-db · 2025-03-24T18:49:20Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+      result |= unsignedByteValue << (8 * i);
+    }
+    if (result < 0) {
+      throw new MalformedVariantException(String.format("Failed to read unsigned int. numBytes: %d", numBytes));


Updated to IllegalArgumentException.

gene-db · 2025-03-24T18:53:45Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantUtil.java

+    switch (typeInfo) {
+      case DECIMAL4:
+        result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
+        checkDecimal(result, MAX_DECIMAL4_PRECISION);


Removed these checks.

gene-db · 2025-03-24T20:33:44Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantSizeLimitException.java

+ */
+public class VariantSizeLimitException extends RuntimeException {
+  public VariantSizeLimitException(long sizeLimitBytes, long estimatedSizeBytes) {
+    super(String.format(


Yeah, we wanted to avoid materializing the full value if it is already going to exceed the size limit, but maybe this is not a big issue. Removed.

gene-db · 2025-03-24T20:38:52Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    this.pos = pos;
+    // There is currently only one allowed version.
+    if (metadata.length < 1 || (metadata[0] & VariantUtil.VERSION_MASK) != VariantUtil.VERSION) {
+      throw new MalformedVariantException(String.format(


Yeah, good point. Updated.

gene-db · 2025-03-24T20:46:45Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+  /**
+   * @return the primitive type id from a variant value
+   */
+  public int getPrimitiveTypeId() {


It is not really needed, since we have getType() below. Removed.

gene-db · 2025-03-24T20:59:24Z

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java

+          // If the value doesn't fit any integer type, parse it as decimal or floating instead.
+          parseAndAppendFloatingPoint(parser);


Updated to throw an exception if it doesn't fit into int or decimal.

gene-db · 2025-03-27T01:14:59Z

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java

+    return Arrays.copyOfRange(value, pos, pos + size);
+  }
+
+  public byte[] getMetadata() {


I updated the class to also take in an offset for the byte arrays.

gene-db · 2025-04-06T17:54:11Z

@cashmand I updated this to use ByteBuffer. Will this be easier to integrate with the shredding support?

gene-db · 2025-04-18T23:12:55Z

This will be split up into multiple smaller PRs. The decode functionality is in #3197

emkornfield · 2025-06-03T16:46:09Z

Is this PR still relevant?

gene-db · 2025-06-11T00:44:53Z

Nope, this PR is no longer needed.

gene-db added 8 commits January 6, 2025 13:21

Implement Variant encoding

c3c71b7

remove optional

c5d19e6

split test

0086b34

cleanup

5af337f

cleanup comment

5997732

Run mvn spotless:apply

de96bac

Fix dependencies

848ddcb

Fix tests for older jdk versions

1a448ea

Fokko reviewed Jan 23, 2025

rdblue reviewed Jan 23, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Jan 23, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Jan 23, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Jan 23, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Jan 23, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Jan 24, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Jan 24, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java (outdated)

gene-db added 2 commits February 5, 2025 15:05

Address PR comments

2056297

Add new variant types

1ea911c

gene-db commented Feb 5, 2025

gene-db requested review from Fokko and rdblue February 6, 2025 03:05

Fix tests for older JDK versions

cb954a6

cashmand reviewed Feb 11, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

Return UUID

db6b98e

gene-db requested a review from cashmand February 13, 2025 18:59

cashmand suggested changes Feb 13, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

Return java.util.UUID

c220c3c

Fokko reviewed Feb 20, 2025

parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java (outdated)

rdblue reviewed Feb 24, 2025

aihuaxu reviewed Mar 3, 2025

sfc-gh-mbojanczyk mentioned this pull request Mar 11, 2025

[Parquet] Support Variant Encoding for Parquet apache/arrow-go#310

Closed

rdblue reviewed Mar 21, 2025

rdblue reviewed Mar 24, 2025

gene-db added 4 commits March 24, 2025 14:01

Cleanup/improve apis

7f2cd6e

cleanup unused constructor/member

f310080

Update api to use byte-array + offset

1040ae8

Use ScalarToJson interface

ba905c8

gene-db commented Mar 27, 2025

gene-db requested a review from rdblue March 27, 2025 05:08

cashmand added a commit to cashmand/parquet-java that referenced this pull request Mar 28, 2025

Variant support from apache#3117

b871082

Use ByteByffer instead

968e9a1

cashmand added a commit to cashmand/parquet-java that referenced this pull request Apr 30, 2025

Variant support from apache#3117

a8c8997

gene-db closed this Jun 11, 2025
