[SPARK-55959][SQL] Optimize Map Key Lookup for `GetMapValue` and `ElementAt` by LuciferYang · Pull Request #54748 · apache/spark

LuciferYang · 2026-03-11T03:06:00Z

What changes were proposed in this pull request?

This PR optimizes map value retrieval for GetMapValue (e.g., map['key']) and ElementAt expressions by introducing a hash-based lookup mechanism for large maps.

Previously, looking up a key in a map involved a linear scan of the keys array (O(N)), which becomes a significant bottleneck for large maps. This PR updates GetMapValueUtil to use a hash index when the map size exceeds a configurable threshold.

Key changes:

Added spark.sql.optimizer.mapLookupHashThreshold configuration (SQLConf.MAP_LOOKUP_HASH_THRESHOLD): An internal session-scoped config (default: 1000) that controls the minimum map size for hash-based lookup. Below this threshold, linear scan is used. The threshold must be non-negative.
Interpreted path — java.util.HashMap index: Added getOrBuildIndex in GetMapValueUtil that builds and caches a java.util.HashMap[Any, Int] mapping keys to their first occurrence index. The index is reused across lookups on the same MapData instance (identity check via ne). Uses putIfAbsent to preserve first-win semantics for duplicate keys. Falls back to linear scan for key types where TypeUtils.typeWithProperEquals returns false (e.g., BinaryType, ArrayType, StructType).
Codegen path — open-addressing hash table: Added doGetValueGenCodeWithHashOpt that generates an inline open-addressing hash table with power-of-2 sizing and linear probing. The hash table is rebuilt when the key array changes (identity check). Supported key types (checked by supportsHashLookup): BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType, TimestampType, TimestampNTZType, YearMonthIntervalType, DayTimeIntervalType, and StringType with binary equality. For unsupported types, doGetValueGenCodeLinear (the original linear scan) is used.
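
The codegen-path description above can be illustrated with a small standalone sketch. This is not the code that doGetValueGenCodeWithHashOpt generates; the class name, the `long`-only key type, the load factor of 0.5, and the bit-mixing hash are all assumptions made for illustration. It only shows the general technique: a power-of-2 table, linear probing, and first-win handling of duplicate keys.

```java
import java.util.Arrays;

/** Hypothetical sketch: open-addressing index from long keys to their first position. */
final class OpenAddressingIndexSketch {
    private final long[] keys;  // the map's key array (referenced, not copied)
    private final int[] slots;  // each slot holds a position in keys, or -1 if empty
    private final int mask;     // capacity - 1, valid because capacity is a power of two

    OpenAddressingIndexSketch(long[] keys) {
        this.keys = keys;
        // Smallest power of two >= 2 * n, so at least half of the slots stay empty.
        // Assumption: n is small enough that the capacity does not overflow an int.
        int capacity = 2;
        while (capacity < 2L * keys.length) {
            capacity <<= 1;
        }
        this.slots = new int[capacity];
        Arrays.fill(slots, -1);
        this.mask = capacity - 1;
        for (int i = 0; i < keys.length; i++) {
            int pos = hash(keys[i]) & mask;
            // Linear probing: step forward until an empty slot or an equal key is found.
            while (slots[pos] != -1 && keys[slots[pos]] != keys[i]) {
                pos = (pos + 1) & mask;
            }
            // First-win: claim the slot only if no earlier equal key already did.
            if (slots[pos] == -1) {
                slots[pos] = i;
            }
        }
    }

    private static int hash(long k) {
        int h = Long.hashCode(k);
        return h ^ (h >>> 16); // fold high bits into the low bits selected by the mask
    }

    /** Returns the position of the first occurrence of key, or -1 if it is absent. */
    int indexOf(long key) {
        int pos = hash(key) & mask;
        while (slots[pos] != -1) {
            if (keys[slots[pos]] == key) {
                return slots[pos];
            }
            pos = (pos + 1) & mask;
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] keys = {7L, 42L, 7L, -1L};
        OpenAddressingIndexSketch index = new OpenAddressingIndexSketch(keys);
        System.out.println(index.indexOf(7L));   // 0: the first occurrence wins
        System.out.println(index.indexOf(-1L));  // 3
        System.out.println(index.indexOf(99L));  // -1: not present
    }
}
```

Keeping the table at most half full guarantees that every probe sequence reaches an empty slot, so both loops terminate. The mask replaces a modulo operation, which is the usual reason for power-of-2 sizing.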

Why are the changes needed?

To fix #54646.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Parameterized existing test cases in ComplexTypeSuite (GetMapValue) and CollectionExpressionsSuite (elementAt) to run under both linear lookup (threshold=Int.MaxValue) and hash lookup (threshold=0), covering: basic String/Int keys, nested maps, duplicate keys (via ArrayBasedMapData), null values, NaN keys (Double and Float), Binary keys, Array keys, Struct keys, and empty maps.
Added MapLookupBenchmark with results for JDK 17, 21, and 25 across multiple map sizes (1K, 10K, 100K, 1M), hit ratios (0%, 50%, 100%), and both GetMapValue/ElementAt in interpreted and codegen modes.
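
One reason NaN keys are worth covering under both lookup paths is that Java's primitive `==` and its boxed equality disagree on NaN. The sketch below uses only standard Java behavior; it does not show how Spark itself compares keys, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: how different equality checks treat a NaN key. */
final class NaNKeySketch {
    /** Linear scan using primitive ==, which never matches NaN. */
    static int scanWithPrimitiveEquals(double[] keys, double target) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] == target) {
                return i;
            }
        }
        return -1;
    }

    /** Linear scan using Double.compare, which treats NaN as equal to NaN. */
    static int scanWithCompare(double[] keys, double target) {
        for (int i = 0; i < keys.length; i++) {
            if (Double.compare(keys[i], target) == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] keys = {1.0, Double.NaN};
        System.out.println(scanWithPrimitiveEquals(keys, Double.NaN)); // -1
        System.out.println(scanWithCompare(keys, Double.NaN));         // 1

        // Boxed Double.equals and hashCode are consistent for NaN,
        // so a HashMap finds the key.
        Map<Double, Integer> index = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            index.putIfAbsent(keys[i], i);
        }
        System.out.println(index.get(Double.NaN));                     // 1
    }
}
```

A hash index and a linear scan can therefore disagree on NaN unless both use the same notion of equality, which is what running the same test cases under both thresholds checks.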

Was this patch authored or co-authored using generative AI tooling?

Completed with the assistance of Claude Sonnet 4.6

LuciferYang · 2026-03-11T03:07:49Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/MapLookupBenchmark.scala

+ *      Results will be written to "benchmarks/MapLookupBenchmark-results.txt".
+ * }}}
+ */
+object MapLookupBenchmark extends SqlBasedBenchmark {


Try to fix #54646. The micro-benchmark results before the fix will be posted in the PR description later.

LuciferYang · 2026-03-11T03:24:02Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala

    val nullMap = Literal.create(null, typeM)
    val nullString = Literal.create(null, StringType)

+    // 1. Basic lookup (String keys)


will add more tests for hash lookup

LuciferYang · 2026-03-11T04:43:20Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala

+   * The value 20 is chosen empirically; break-even is around 15-25
+   * elements for primitive key types.
+   */
+  private val hashLookupThreshold = 20


Is it worth turning this into a SQL config? Please let me know if it's needed.

LuciferYang · 2026-03-11T05:19:58Z

sql/core/benchmarks/MapLookupBenchmark-results.txt

+ElementAt codegen                                          20             21           1          0.5        1960.9       1.0X
+
+OpenJDK 64-Bit Server VM 17.0.18+8-LTS on Linux 6.14.0-1017-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz


I will eventually replace these results with ones obtained on an AMD EPYC 7763 64-Core Processor, since most tests in the codebase are run on that CPU model.

Kimahriman

Thanks for working on this! Definitely will be a big improvement, especially for the literal cases

Kimahriman · 2026-03-11T14:53:38Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala

+      var i = 0
+      while (i < len) {
+        val k = keys.get(i, keyType)
+        if (!hm.containsKey(k)) hm.put(k, i)


Don't maps always have unique keys? Was this just an assumption Claude made, that it had to check for duplicates?

This is because ArrayBasedMapData allows duplicate keys to exist at the physical level, and its default lookup semantics follow the "first-match principle." Therefore, when constructing a hash index, I think we should explicitly ignore subsequent duplicate keys to ensure that the results of hash lookups are fully consistent with those of linear scans.

However, it might be better to use putIfAbsent here.
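
The first-match argument above can be checked with a small standalone sketch. The names below are hypothetical and plain Java arrays stand in for ArrayBasedMapData; the point is only that an index built with `putIfAbsent` agrees with a linear scan when keys are duplicated, while a plain `put` does not.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: first-win index versus last-win index on duplicate keys. */
final class FirstWinsIndexSketch {
    /** Linear scan: returns the first position whose key equals target, or -1. */
    static int linearIndexOf(Object[] keys, Object target) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(target)) {
                return i;
            }
        }
        return -1;
    }

    /** Builds key -> position; keepFirst selects putIfAbsent instead of put. */
    static Map<Object, Integer> buildIndex(Object[] keys, boolean keepFirst) {
        Map<Object, Integer> index = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            if (keepFirst) {
                index.putIfAbsent(keys[i], i); // keeps the earliest position
            } else {
                index.put(keys[i], i);         // overwrites with the latest position
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Object[] keys = {"a", "b", "a"};
        System.out.println(linearIndexOf(keys, "a"));         // 0
        System.out.println(buildIndex(keys, true).get("a"));  // 0, agrees with the scan
        System.out.println(buildIndex(keys, false).get("a")); // 2, the last duplicate wins
    }
}
```

Because the stored positions are never null, `putIfAbsent` gives the same result as the `containsKey` check followed by `put` in the reviewed diff, with one hash lookup per key instead of two.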

Kimahriman · 2026-03-11T14:58:16Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala

    val map = value.asInstanceOf[MapData]
    val length = map.numElements()
+
+    if (length < hashLookupThreshold || !TypeUtils.typeWithProperEquals(keyType)) {


You could use the approach I have in #53468 to support all types for hashing (and help me get that merged in 😬 ). Though it doesn't do codegen yet, would need to think about how to do that

A bit tired today. Let me take a look tomorrow

I think we can submit a separate pr later and then make use of the data structures in #53468.

Kimahriman · 2026-03-11T15:00:22Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala

+   * The value 20 is chosen empirically; break-even is around 15-25
+   * elements for primitive key types.
+   */
+  private val hashLookupThreshold = 20


One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key
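
The reuse argument for a literal map can be sketched as follows. This is a minimal illustration and not the PR's getOrBuildIndex: the class, the method signature, the plain `Object[]` standing in for the key array, and the `builds` counter are all assumptions. It shows an index cached by reference identity, so a map instance that is the same for every row pays the build cost once.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: cache a key index and rebuild only when the key array changes. */
final class CachedIndexSketch {
    private Object[] lastKeys;              // key array the cached index was built from
    private Map<Object, Integer> lastIndex; // cached key -> first position
    int builds = 0;                         // counts rebuilds, for demonstration only

    Map<Object, Integer> getOrBuildIndex(Object[] keys) {
        // Reference comparison, not content comparison: a different array means rebuild.
        if (keys != lastKeys) {
            Map<Object, Integer> index = new HashMap<>();
            for (int i = 0; i < keys.length; i++) {
                index.putIfAbsent(keys[i], i);
            }
            lastKeys = keys;
            lastIndex = index;
            builds++;
        }
        return lastIndex;
    }

    public static void main(String[] args) {
        CachedIndexSketch cache = new CachedIndexSketch();
        Object[] literalKeys = {"x", "y", "z"};
        // Three "rows" look up keys in the same literal map: one build in total.
        for (int row = 0; row < 3; row++) {
            cache.getOrBuildIndex(literalKeys).get("y");
        }
        System.out.println(cache.builds); // 1
        // An equal but distinct array (a per-row map) forces a rebuild.
        cache.getOrBuildIndex(new Object[] {"x", "y", "z"});
        System.out.println(cache.builds); // 2
    }
}
```

An identity check is cheap but only safe if the key array is never mutated in place, which this sketch assumes. For maps that differ on every row the index is rebuilt each time, which is where a size threshold matters.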

LuciferYang · 2026-03-12T02:47:55Z

One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key

Let me try the test with a threshold value of 0.

LuciferYang · 2026-03-12T07:10:46Z

The test results show that with the threshold set to 0 (always constructing a hash map), each lookup incurs an additional absolute delay of 2 to 3 milliseconds. Rather than hard-coding this behavior, it might be more appropriate to make it a configurable option so that users can control it for their own scenarios.

LuciferYang added 4 commits March 10, 2026 19:32

add benchmark

72e8922

fix

bc49e32

add times

c46c933

revert collectionOperations.scala

f965a0c

LuciferYang marked this pull request as draft March 11, 2026 03:06

LuciferYang commented Mar 11, 2026

LuciferYang and others added 4 commits March 11, 2026 03:27

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 21, Scala 2.13, split 1 of 1)

4a61bc3

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 17, Scala 2.13, split 1 of 1)

bcaba7c

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 25, Scala 2.13, split 1 of 1)

8e5b795

add more tests

b5d0d38

LuciferYang commented Mar 11, 2026

LuciferYang marked this pull request as ready for review March 11, 2026 05:20

LuciferYang mentioned this pull request Mar 11, 2026

Spark map lookup is O(n) #54646

Open

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 17, Scala 2.13, split 1 of 1)

8b33dc7

Kimahriman reviewed Mar 11, 2026

LuciferYang added 6 commits March 12, 2026 15:22

use putIfAbsent

f7d29cb

add config

e3ac77e

refactor test

dcad5e5

refactor benchmark

4ac347b

fix doc

6757907

add more memory

0917346

LuciferYang marked this pull request as draft March 12, 2026 13:16

LuciferYang and others added 4 commits March 13, 2026 01:09

init

782d7a5

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 21, Scala 2.13, split 1 of 1)

9d21fae

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 25, Scala 2.13, split 1 of 1)

c721967

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 17, Scala 2.13, split 1 of 1)

3bc2c2a

LuciferYang and others added 6 commits March 13, 2026 08:55

try add 1M back

fa38f0b

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 25, Scala 2.13, split 1 of 1)

7c01c08

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 21, Scala 2.13, split 1 of 1)

2e7bf82

Benchmark results for org.apache.spark.sql.execution.benchmark.MapLookupBenchmark (JDK 17, Scala 2.13, split 1 of 1)

756ccbc

Merge branch 'upmaster' into issue-54646

85495d3

add ConfigBindingPolicy

a6eb0f3

LuciferYang marked this pull request as ready for review March 13, 2026 03:05
