[SPARK-55959][SQL] Optimize Map Key Lookup for GetMapValue and ElementAt#54748
[SPARK-55959][SQL] Optimize Map Key Lookup for GetMapValue and ElementAt#54748LuciferYang wants to merge 25 commits intoapache:masterfrom
GetMapValue and ElementAt#54748Conversation
| * Results will be written to "benchmarks/MapLookupBenchmark-results.txt". | ||
| * }}} | ||
| */ | ||
| object MapLookupBenchmark extends SqlBasedBenchmark { |
There was a problem hiding this comment.
Try to fix #54646. The micro-benchmark results before the fix will be posted in the PR description later.
| val nullMap = Literal.create(null, typeM) | ||
| val nullString = Literal.create(null, StringType) | ||
|
|
||
| // 1. Basic lookup (String keys) |
There was a problem hiding this comment.
will add more tests for hash lookup
…kupBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 25, Scala 2.13, split 1 of 1)
| * The value 20 is chosen empirically; break-even is around 15-25 | ||
| * elements for primitive key types. | ||
| */ | ||
| private val hashLookupThreshold = 20 |
There was a problem hiding this comment.
Is it worth turning this into a SQL config? Please let me know if it's needed.
There was a problem hiding this comment.
One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key
| ElementAt codegen 20 21 1 0.5 1960.9 1.0X | ||
|
|
||
| OpenJDK 64-Bit Server VM 17.0.18+8-LTS on Linux 6.14.0-1017-azure | ||
| Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz |
There was a problem hiding this comment.
Ultimately, I will update this result with the test findings obtained using the AMD EPYC 7763 64-Core Processor, as the majority of tests in the codebase are conducted based on this CPU model.
…kupBenchmark (JDK 17, Scala 2.13, split 1 of 1)
Kimahriman
left a comment
There was a problem hiding this comment.
Thanks for working on this! Definitely will be a big improvement, especially for the literal cases
| var i = 0 | ||
| while (i < len) { | ||
| val k = keys.get(i, keyType) | ||
| if (!hm.containsKey(k)) hm.put(k, i) |
There was a problem hiding this comment.
Don't maps always have unique keys? Was this just an assumption claude made that it had to check this?
There was a problem hiding this comment.
This is because ArrayBasedMapData allows duplicate keys to exist at the physical level, and its default lookup semantics follow the "first-match principle." Therefore, when constructing a hash index, I think we should explicitly ignore subsequent duplicate keys to ensure that the results of hash lookups are fully consistent with those of linear scans.
However, it might be better to use putIfAbsent here.
| val map = value.asInstanceOf[MapData] | ||
| val length = map.numElements() | ||
|
|
||
| if (length < hashLookupThreshold || !TypeUtils.typeWithProperEquals(keyType)) { |
There was a problem hiding this comment.
You could use the approach I have in #53468 to support all types for hashing (and help me get that merged in 😬 ). Though it doesn't do codegen yet, would need to think about how to do that
There was a problem hiding this comment.
A bit tired today. Let me take a look tomorrow
There was a problem hiding this comment.
I think we can submit a separate pr later and then make use of the data structures in #53468.
| * The value 20 is chosen empirically; break-even is around 15-25 | ||
| * elements for primitive key types. | ||
| */ | ||
| private val hashLookupThreshold = 20 |
There was a problem hiding this comment.
One exception that would be worth making is when the map is a Literal. You already have reusing the hashmap if the map itself doesn't change, which would be the case for a literal map, and the cost of building the hash map once for all rows would probably be worth it even for a single key
Let me try the test with a threshold value of 0. |
From the test results, when the threshold is set to 0 (always constructing a hashmap), there is an additional absolute delay of 2 to 3 milliseconds per lookup. However, it might be more appropriate to make this a configurable option, so that users can control this behavior themselves according to different scenarios. |
…kupBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 25, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 25, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…kupBenchmark (JDK 17, Scala 2.13, split 1 of 1)
What changes were proposed in this pull request?
This PR optimizes map value retrieval for
GetMapValue(e.g.,map['key']) andElementAtexpressions by introducing a hash-based lookup mechanism for large maps.Previously, looking up a key in a map involved a linear scan of the keys array (O(N)), which becomes a significant bottleneck for large maps. This PR updates
GetMapValueUtilto use a hash index when the map size exceeds a configurable threshold.Key changes:
Added
spark.sql.optimizer.mapLookupHashThresholdconfiguration (SQLConf.MAP_LOOKUP_HASH_THRESHOLD): An internal session-scoped config (default: 1000) that controls the minimum map size for hash-based lookup. Below this threshold, linear scan is used. The threshold must be non-negative.Interpreted path —
java.util.HashMapindex: AddedgetOrBuildIndexinGetMapValueUtilthat builds and caches ajava.util.HashMap[Any, Int]mapping keys to their first occurrence index. The index is reused across lookups on the sameMapDatainstance (identity check viane). UsesputIfAbsentto preserve first-win semantics for duplicate keys. Falls back to linear scan for key types whereTypeUtils.typeWithProperEqualsreturns false (e.g.,BinaryType,ArrayType,StructType).Codegen path — open-addressing hash table: Added
doGetValueGenCodeWithHashOptthat generates an inline open-addressing hash table with power-of-2 sizing and linear probing. The hash table is rebuilt when the key array changes (identity check). Supported key types (checked bysupportsHashLookup):BooleanType,ByteType,ShortType,IntegerType,LongType,FloatType,DoubleType,DateType,TimestampType,TimestampNTZType,YearMonthIntervalType,DayTimeIntervalType, andStringTypewith binary equality. For unsupported types,doGetValueGenCodeLinear(the original linear scan) is used.Why are the changes needed?
To fix #54646.
Does this PR introduce any user-facing change?
No
How was this patch tested?
ComplexTypeSuite(GetMapValue) andCollectionExpressionsSuite(elementAt) to run under both linear lookup (threshold=Int.MaxValue) and hash lookup (threshold=0), covering: basic String/Int keys, nested maps, duplicate keys (viaArrayBasedMapData), null values, NaN keys (Double and Float), Binary keys, Array keys, Struct keys, and empty maps.MapLookupBenchmarkwith results for JDK 17, 21, and 25 across multiple map sizes (1K, 10K, 100K, 1M), hit ratios (0%, 50%, 100%), and bothGetMapValue/ElementAtin interpreted and codegen modes.Was this patch authored or co-authored using generative AI tooling?
Completed with the assistance of Claude Sonnet 4.6