
[SPARK-55568][SQL] Separate schema construction from field stats collection#54343

Open
qlong wants to merge 3 commits into apache:master from qlong:SPARK-55568-optimize-variant-schema-inference

Conversation


@qlong qlong commented Feb 17, 2026

Why are the changes needed?

Variant shredding schema inference is expensive and can take over 100ms per file. Replace fold-based schema merging with deferred schema construction using single-pass field statistics collection.

Previous approach:

  • Used foldLeft to build and merge complete schemas for each row
  • Merged schemas repeatedly across 4096 rows
  • High allocation overhead from recursive schema construction

New approach:

  • Separates schema construction from field statistics collection to avoid excessive intermediate allocations and repeated merges
  • Performs a single-pass field traversal with a field statistics tree that tracks field types and row counts
  • Uses lastSeenRow for deduplication
  • Defers schema construction until all rows are processed
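The stats-collection step above can be sketched in isolation (a minimal, hypothetical Scala sketch; FieldNode, lastSeenRow, and getOrCreateChild mirror names from the PR, but this standalone version uses plain type-name strings instead of Spark DataTypes, and the type merge is simplified):

```scala
import scala.collection.mutable

// One node per field path in the statistics tree.
case class FieldNode(
    var dataType: String = "null",  // merged type summary (simplified here)
    var rowCount: Int = 0,          // number of distinct rows containing this path
    var lastSeenRow: Int = -1,      // last row index that incremented rowCount
    children: mutable.Map[String, FieldNode] = mutable.Map.empty) {

  def getOrCreateChild(name: String): FieldNode =
    children.getOrElseUpdate(name, FieldNode())

  // Count this node at most once per row, even if the same path is
  // visited multiple times within that row (e.g. inside arrays).
  def touch(rowIdx: Int): Unit =
    if (lastSeenRow != rowIdx) {
      rowCount += 1
      lastSeenRow = rowIdx
    }
}

// Single pass over all rows: record statistics only; no schema objects
// are allocated or merged here.
def collectStats(rows: Seq[Map[String, String]], root: FieldNode): Unit =
  rows.zipWithIndex.foreach { case (row, rowIdx) =>
    row.foreach { case (field, tpe) =>
      val node = root.getOrCreateChild(field)
      node.touch(rowIdx)
      node.dataType = tpe // the real code merges types; simplified here
    }
  }
```

Schema construction then happens once, after the pass, by walking the finished tree instead of merging per-row schemas.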

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Functional tests:

  • All existing unit tests pass

Performance vs master:

  • Tested scenarios with different field counts, array sizes, and batch sizes (1-4096 rows, 10-200 fields, varying nesting depths and sparsity patterns).
  • 1.7x to 2.4x speedup across test scenarios
  • Consistent performance across multiple runs
  • 96% of tests show improvement

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Sonnet 4.5

@qlong
Author

qlong commented Feb 25, 2026

@cloud-fan @cashmand Can you help review, as you are the authors of the original implementation? Thanks

Contributor

@cashmand cashmand left a comment


Thanks, it looks good overall, but I posted a couple of concerns. I wonder if we can make the schemaRegistry a nested structure to avoid these problems, but still get the performance benefit you're seeing.

@qlong qlong force-pushed the SPARK-55568-optimize-variant-schema-inference branch from 1ac24a8 to e89782e on February 26, 2026 16:42
@qlong
Author

qlong commented Feb 26, 2026

Thanks, it looks good overall, but I posted a couple of concerns. I wonder if we can make the schemaRegistry a nested structure to avoid these problems, but still get the performance benefit you're seeing.

Thanks for the review. Switched to a tree to track field stats, which gives a 1.3x to 1.5x improvement over the flat map. Overall, the new implementation is a 1.7x to 2.4x improvement.

@qlong qlong force-pushed the SPARK-55568-optimize-variant-schema-inference branch from e89782e to b0e9307 on February 26, 2026 20:21
@qlong qlong requested a review from cashmand February 26, 2026 22:16
@qlong
Author

qlong commented Mar 6, 2026

Hi @cashmand, I addressed your comments in the latest version. Can you review? Thanks

Contributor

@cashmand cashmand left a comment


Thanks, just a few questions.

inArrayContext: Boolean = false): DataType = {

// Check if this node represents an array (has "[]" child)
val arrayChild = currentNode.children.get("[]")
Contributor


In a pathological case, can [] be used as a field name? If that happened, would we incorrectly build an array instead of a struct here? If so, could we avoid this check by passing another bool to buildSchemaFromStats to indicate whether currentNode is an array?

Author


Thanks for the review. Good callout on "[]" as the type marker; I agree it is fragile. I added a dedicated arrayElementNode in FieldNode to track arrays, so we no longer need to rely on any marker. Also added test cases for "[]" as a field name.

case Type.UUID => VariantType
// Node for tree-based field tracking
private case class FieldNode(
var dataType: DataType,
Contributor


Can you comment on what dataType will be for structs and arrays?

Author


dataType is the type summary:

  • For structs, it is StructType(Seq.empty); the actual schema comes from children
  • For arrays, it is ArrayType(NullType); the actual schema comes from the new arrayElementNode
  • For primitives, dataType is the merged scalar type
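As a standalone illustration of that summary (a hedged sketch, not the PR's actual code: types are rendered as plain strings rather than Spark DataTypes, though the node and field names mirror the PR):

```scala
import scala.collection.mutable

// Simplified stats node: dataType is a placeholder for containers and the
// merged type summary for primitives.
case class FieldNode(
    var dataType: String = "null",
    children: mutable.Map[String, FieldNode] = mutable.Map.empty,
    var arrayElementNode: Option[FieldNode] = None)

def buildSchemaFromStats(node: FieldNode): String = node.arrayElementNode match {
  case Some(elem) =>
    // Array: the node's dataType is only a placeholder (ArrayType(NullType)
    // in the PR); the element schema comes from the dedicated arrayElementNode.
    s"array<${buildSchemaFromStats(elem)}>"
  case None if node.children.nonEmpty =>
    // Struct: dataType is a placeholder (StructType(Seq.empty) in the PR);
    // the real schema is assembled from the children.
    node.children.toSeq.sortBy(_._1)
      .map { case (name, child) => s"$name: ${buildSchemaFromStats(child)}" }
      .mkString("struct<", ", ", ">")
  case None =>
    node.dataType // primitive: the merged type summary itself
}
```

For example, a node with a primitive child "a" and an array child "b" of strings renders as struct<a: long, b: array<string>>.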

// Use "[]" as special child key for array elements
val arrayNode = currentNode.getOrCreateChild("[]")

// Track distinct row count for the array field itself
Contributor


Why is this check for arrayNode.lastSeenRow != rowIdx needed? If we're not in an array context, what's the case where we'd see the same rowIdx twice?

Author


The check for arrayNode is needed for a row with a nested array like [[1], [2]]. The outer array node will be visited multiple times while we iterate over the elements at line 429. The check prevents inflating rowCount.
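A tiny standalone illustration of this point (a hypothetical walk, not the PR's code):

```scala
// Walking the nested array [[1], [2]] reaches the inner-array stats node
// once per outer element, i.e. twice within the same row. The lastSeenRow
// guard keeps rowCount per-row, not per-visit.
final class Node {
  var rowCount = 0
  var lastSeenRow = -1
  def touch(rowIdx: Int): Unit =
    if (lastSeenRow != rowIdx) { rowCount += 1; lastSeenRow = rowIdx }
}

val innerArrayNode = new Node
val rowIdx = 0
Seq(Seq(1), Seq(2)).foreach(_ => innerArrayNode.touch(rowIdx)) // two visits, one row
assert(innerArrayNode.rowCount == 1) // counted once despite two visits
```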


testWithTempDir("infer shredding key as data") { dir =>
// The first 10 fields in each object include the row ID in the field name, so they'll be
// unique. Because we impose a 1000-field limit when building up the schema, we'll end up
Contributor


Is the 1000 field limit no longer enforced at all? The intent of the limit was to avoid building up a huge intermediate schema if all of the variant values have distinct fields. I think the new approach could still result in a fairly large statistics tree in this situation, right?

I don't think this is necessarily critical - the overall memory and time should still be bounded by the size of the variants, which is enforced elsewhere to not get too large. But if we're going to remove this limit, I want to make sure it's a conscious decision.

Author


You are right that the tree can grow beyond 1000 fields during stats collection. I changed it to ensure that the selected fields are the true top N by cardinality. The memory footprint of the stats tree is proportional to the number of unique fields, so it should be much smaller than the variant data itself.
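The selection described here can be sketched as follows (an illustrative helper; selectTopFields and the sample counts are invented for this example and are not the PR's actual code):

```scala
// Sort candidates by cardinality descending (ties broken by name), take the
// top N, then sort the survivors alphabetically so the resulting schema is
// deterministic across runs.
def selectTopFields(counts: Map[String, Long], limit: Int): Seq[String] =
  counts.toSeq
    .sortBy { case (name, count) => (-count, name) }
    .take(limit)
    .map(_._1)
    .sorted

// With a limit of 2, the low-cardinality field is dropped regardless of
// alphabetical order:
// selectTopFields(Map("rare" -> 3L, "common" -> 100L, "mid" -> 40L), 2)
//   == Seq("common", "mid")
```

Sorting by cardinality first is what guarantees the true top N; the final alphabetical sort only affects field ordering, not which fields survive.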

qlong added 3 commits March 12, 2026 15:21
…ection

Variant shredding schema inference is expensive and can take over 100ms
per file. Replace fold-based schema merging with deferred schema
construction using single-pass field statistics collection.

Previous approach:
- Used foldLeft to build and merge complete schemas for each row
- Merged schemas repeatedly across 4096 rows
- High allocation overhead from recursive schema construction

New approach:
- Separate schema construction from field statistics collection to avoid
  excessive intermediate allocations and repeated merges.
- Single-pass field traversal with flat statistics registry to track
  field types and row counts
- Using lastSeenRow for deduplication
- Defers schema construction until after all rows processed

Performance vs master:
- Tested scenarios with different field counts, array sizes, and
  batch sizes (1-4096 rows, 10-200 fields, varying nesting depths and
  sparsity patterns).
- Average 1.5x speedup across test scenarios
- 1.5x-1.6x faster on array-heavy workloads
- 11.5x faster on sparse data (10% field presence)
- Consistent performance across multiple runs
- 96% of tests show improvement

All existing unit tests pass.

Issue: https://issues.apache.org/jira/browse/SPARK-55568
…ection

Switch to tree structure for tracking field stats as suggested by
@cashmand.

Performance improvements
- 1.3x to 1.5x faster compared to flat map
- 1.7x to 2.4x faster compared to the original implementation

Other changes:
- Ensure top cardinality fields are included in the schema by sorting
  by cardinality first before taking the top N.
- Add special-character tests, including tests for mixed
  special characters, as suggested by @cashmand.
…ection

Address PR feedback:
- Make array-element tracking explicit with a dedicated arrayElementNode
  to avoid ambiguity with "[]" field names
- Add tests for "[]" as a field name
@qlong qlong force-pushed the SPARK-55568-optimize-variant-schema-inference branch from e580e01 to 282bc8f on March 12, 2026 19:21

var lastSeenRow: Int = -1, // Last row index that incremented rowCount
var arrayElementCount: Long = 0, // Total occurrences across all array elements
children: mutable.Map[String, FieldNode] = mutable.Map.empty,
var arrayElementNode: Option[FieldNode] = None
Author


Added a dedicated field to track arrays, instead of relying on the "[]" marker.



// Get all direct children, filter by cardinality, sort by cardinality descending,
// take top N, then sort alphabetically for determinism.
val maxStructSize = Math.min(1000, maxShreddedFieldsPerFile)
Author


This limits the candidates per struct node; the final file-level limit is enforced in finalizeSimpleSchema.

@qlong qlong requested a review from cashmand March 13, 2026 01:55