Skip to content

[SPARK-55959][SQL] Optimize Map Key Lookup for GetMapValue and ElementAt#54748

Open
LuciferYang wants to merge 25 commits intoapache:masterfrom
LuciferYang:issue-54646
Open

[SPARK-55959][SQL] Optimize Map Key Lookup for GetMapValue and ElementAt#54748
LuciferYang wants to merge 25 commits intoapache:masterfrom
LuciferYang:issue-54646

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Mar 11, 2026

What changes were proposed in this pull request?

This PR optimizes map value retrieval for GetMapValue (e.g., map['key']) and ElementAt expressions by introducing a hash-based lookup mechanism for large maps.

Previously, looking up a key in a map involved a linear scan of the keys array (O(N)), which becomes a significant bottleneck for large maps. This PR updates GetMapValueUtil to use a hash index when the map size exceeds a configurable threshold.

Key changes:

  1. Added spark.sql.optimizer.mapLookupHashThreshold configuration (SQLConf.MAP_LOOKUP_HASH_THRESHOLD): An internal session-scoped config (default: 1000) that controls the minimum map size for hash-based lookup. Below this threshold, linear scan is used. The threshold must be non-negative.

  2. Interpreted path — java.util.HashMap index: Added getOrBuildIndex in GetMapValueUtil that builds and caches a java.util.HashMap[Any, Int] mapping keys to their first occurrence index. The index is reused across lookups on the same MapData instance (identity check via ne). Uses putIfAbsent to preserve first-win semantics for duplicate keys. Falls back to linear scan for key types where TypeUtils.typeWithProperEquals returns false (e.g., BinaryType, ArrayType, StructType).

  3. Codegen path — open-addressing hash table: Added doGetValueGenCodeWithHashOpt that generates an inline open-addressing hash table with power-of-2 sizing and linear probing. The hash table is rebuilt when the key array changes (identity check). Supported key types (checked by supportsHashLookup): BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType, TimestampType, TimestampNTZType, YearMonthIntervalType, DayTimeIntervalType, and StringType with binary equality. For unsupported types, doGetValueGenCodeLinear (the original linear scan) is used.

Why are the changes needed?

To fix #54646.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  1. Parameterized existing test cases in ComplexTypeSuite (GetMapValue) and CollectionExpressionsSuite (elementAt) to run under both linear lookup (threshold=Int.MaxValue) and hash lookup (threshold=0), covering: basic String/Int keys, nested maps, duplicate keys (via ArrayBasedMapData), null values, NaN keys (Double and Float), Binary keys, Array keys, Struct keys, and empty maps.
  2. Added MapLookupBenchmark with results for JDK 17, 21, and 25 across multiple map sizes (1K, 10K, 100K, 1M), hit ratios (0%, 50%, 100%), and both GetMapValue/ElementAt in interpreted and codegen modes.

Was this patch authored or co-authored using generative AI tooling?

Completed with the assistance of Claude Sonnet 4.6

@LuciferYang LuciferYang marked this pull request as draft March 11, 2026 03:06
* Results will be written to "benchmarks/MapLookupBenchmark-results.txt".
* }}}
*/
object MapLookupBenchmark extends SqlBasedBenchmark {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to fix #54646. The micro-benchmark results before the fix will be posted in the PR description later.

val nullMap = Literal.create(null, typeM)
val nullString = Literal.create(null, StringType)

// 1. Basic lookup (String keys)
Copy link
Contributor Author

@LuciferYang LuciferYang Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add more tests for hash lookup

LuciferYang and others added 4 commits March 11, 2026 03:27
* The value 20 is chosen empirically; break-even is around 15-25
* elements for primitive key types.
*/
private val hashLookupThreshold = 20
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth turning this into a SQL config? Please let me know if it's needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key

ElementAt codegen 20 21 1 0.5 1960.9 1.0X

OpenJDK 64-Bit Server VM 17.0.18+8-LTS on Linux 6.14.0-1017-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately, I will update this result with the test findings obtained using the AMD EPYC 7763 64-Core Processor, as the majority of tests in the codebase are conducted based on this CPU model.

@LuciferYang LuciferYang marked this pull request as ready for review March 11, 2026 05:20
…kupBenchmark (JDK 17, Scala 2.13, split 1 of 1)
Copy link
Contributor

@Kimahriman Kimahriman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Definitely will be a big improvement, especially for the literal cases

var i = 0
while (i < len) {
val k = keys.get(i, keyType)
if (!hm.containsKey(k)) hm.put(k, i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't maps always have unique keys? Was this just an assumption claude made that it had to check this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because ArrayBasedMapData allows duplicate keys to exist at the physical level, and its default lookup semantics follow the "first-match principle." Therefore, when constructing a hash index, I think we should explicitly ignore subsequent duplicate keys to ensure that the results of hash lookups are fully consistent with those of linear scans.

However, it might be better to use putIfAbsent here.

val map = value.asInstanceOf[MapData]
val length = map.numElements()

if (length < hashLookupThreshold || !TypeUtils.typeWithProperEquals(keyType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the approach I have in #53468 to support all types for hashing (and help me get that merged in 😬 ). Though it doesn't do codegen yet, would need to think about how to do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit tired today. Let me take a look tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can submit a separate pr later and then make use of the data structures in #53468.

* The value 20 is chosen empirically; break-even is around 15-25
* elements for primitive key types.
*/
private val hashLookupThreshold = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key

@LuciferYang
Copy link
Contributor Author

One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key

Let me try the test with a threshold value of 0.

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Mar 12, 2026

One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key

Let me try the test with a threshold value of 0.

From the test results, when the threshold is set to 0 (always constructing a hashmap), there is an additional absolute delay of 2 to 3 milliseconds per lookup. However, it might be more appropriate to make this a configurable option, so that users can control this behavior themselves according to different scenarios.

@LuciferYang LuciferYang marked this pull request as draft March 12, 2026 13:16
LuciferYang and others added 4 commits March 13, 2026 01:09
…kupBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 25, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 17, Scala 2.13, split 1 of 1)
@LuciferYang LuciferYang marked this pull request as ready for review March 13, 2026 03:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spark map lookup is O(n)

2 participants