
[CORE] Optimize Iceberg schema field matching#12233

Open
wankunde wants to merge 1 commit into apache:main from wankunde:IcebergScan_schema_check


wankunde commented Jun 4, 2026

Contributor

What changes are proposed in this pull request?

Why is this PR needed?

In IcebergScanTransformer.typesMatch(), the struct type matching logic creates temporary Iceberg Schema objects for every Spark field:

new Schema(currentType.fields()).findField(...)
new Schema(iceberg.fields()).findField(...)

This repeatedly rebuilds Iceberg schema indexes while checking historical schemas, which can become expensive for wide schemas or tables with many schema versions. In production thread dumps, this shows up in Schema / IndexByName / HashMap initialization during Iceberg scan planning.
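To make the cost concrete, here is a minimal, self-contained sketch of the two lookup patterns. The classes `Field` and `NameIndex` and the counter `indexBuilds` are hypothetical stand-ins invented for illustration; they are not the Iceberg `Schema` or `IndexByName` classes, and the real code may build its index lazily on first lookup rather than in the constructor. Either way, the cost is paid once per temporary object.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the Iceberg classes.
class IndexCostSketch {
    static final class Field {
        final int id;
        final String name;

        Field(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Counts how many times a name index has been built.
    static int indexBuilds = 0;

    // Mimics a wrapper object that indexes every field by name when created.
    static final class NameIndex {
        private final Map<String, Field> byName = new HashMap<>();

        NameIndex(List<Field> fields) {
            indexBuilds++;
            for (Field f : fields) {
                byName.put(f.name, f);
            }
        }

        Field find(String name) {
            return byName.get(name);
        }
    }

    // Pattern described above: a fresh index for every field that is looked up.
    static int lookupRebuildingEachTime(List<Field> fields, List<String> names) {
        int found = 0;
        for (String name : names) {
            if (new NameIndex(fields).find(name) != null) {
                found++;
            }
        }
        return found;
    }

    // Alternative: build the index once and reuse it for every lookup.
    static int lookupWithSharedIndex(List<Field> fields, List<String> names) {
        NameIndex index = new NameIndex(fields);
        int found = 0;
        for (String name : names) {
            if (index.find(name) != null) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Field> fields =
            List.of(new Field(1, "id"), new Field(2, "name"), new Field(3, "ts"));
        List<String> names = List.of("id", "name", "ts");

        indexBuilds = 0;
        lookupRebuildingEachTime(fields, names);
        System.out.println("rebuilding per field: " + indexBuilds + " index builds");

        indexBuilds = 0;
        lookupWithSharedIndex(fields, names);
        System.out.println("shared index: " + indexBuilds + " index build(s)");
    }
}
```

With n fields, the first pattern builds n indexes of n entries each, so the work grows quadratically with schema width and is repeated for every historical schema that is checked.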

Changes in this PR:

This change uses Types.StructType.field(name) and Types.StructType.field(id) directly when matching nested struct fields.

Types.StructType already provides field lookup by name and id, so this avoids constructing temporary Schema objects inside the field loop while preserving the existing matching behavior:

  • find the current field by Spark field name
  • find the old schema field by Iceberg field id
  • keep allowing added columns
  • keep detecting renamed columns by comparing field names
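The four rules above can be sketched roughly as follows. This is not the patch itself: `Field`, `Struct`, and `fieldsMatch` are hypothetical stand-ins that only mirror the shape of a struct type offering lookup by name and by id. Recursion into nested field types is omitted, and treating a Spark field that is missing from the current schema as a mismatch is an assumption, since the description does not say how that case is handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; NOT the Iceberg or Gluten classes.
class StructMatchSketch {
    static final class Field {
        final int id;
        final String name;

        Field(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // A struct that indexes its fields once and serves lookups by name and by id.
    static final class Struct {
        private final Map<String, Field> byName = new HashMap<>();
        private final Map<Integer, Field> byId = new HashMap<>();

        Struct(List<Field> fields) {
            for (Field f : fields) {
                byName.put(f.name, f);
                byId.put(f.id, f);
            }
        }

        Field field(String name) {
            return byName.get(name);
        }

        Field field(int id) {
            return byId.get(id);
        }
    }

    // Returns true when every requested Spark field is compatible with the old struct.
    static boolean fieldsMatch(List<String> sparkFieldNames, Struct current, Struct old) {
        for (String name : sparkFieldNames) {
            // 1. Find the current field by Spark field name.
            Field cur = current.field(name);
            if (cur == null) {
                return false; // assumption: an unknown field counts as a mismatch
            }
            // 2. Find the old schema field by field id.
            Field prev = old.field(cur.id);
            if (prev == null) {
                continue; // 3. Added column: absent from the old schema, still allowed.
            }
            // 4. Same id but a different name means the column was renamed.
            if (!prev.name.equals(cur.name)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a current struct `(1, id), (2, name), (3, added)` checked against an old struct `(1, id), (2, name)` matches, because field 3 is simply an added column. If field 2 is named `full_name` in the current struct, the check fails as a rename.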

How was this patch tested?

Tested with the existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5


github-actions Bot commented Jun 4, 2026


Run Gluten Clickhouse CI on x86
