Cache Key Design for One-to-Many Logical-Physical Operator Mapping #4346
Xiao-zhen-Liu started this conversation in Ideas · Replies: 1 comment
@Xiao-zhen-Liu Thanks for the great summary. For the record, this is based on a discussion among @Xiao-zhen-Liu, @Yicong-Huang, and myself (@chenlica).
Background: Current Design
Currently, there is a one-to-one mapping from a workflow's logical plan to its physical plan. Each logical operator maps to one or more physical operators with a fixed implementation choice. For example,
`Sort` always maps to the Python-based sort implementation, and `HashJoin` always maps to `Build` + `Probe`. This has two limitations:

1. Users see `HashJoin` instead of `Join`, leaking the implementation into the user-facing layer. Ideally, logical operator names should stay logical.
2. A fixed mapping leaves no room for optimization: `Join` could be realized as `HashJoin`, `SortMergeJoin`, or other implementations, and a cost-based optimizer could choose among them.

The Cache-Key Tradeoff Under One-to-Many Mapping
If we move to a one-to-many logical-to-physical mapping, the cache-key design must decide at which level of the plan hierarchy to match cached results. There are two options, each with a tradeoff.
Option A: Cache key on the logical plan.
A cached result is associated with a logical operator's output port and its upstream logical sub-DAG. Matching uses logical-plan equality: if the logical sub-DAGs are the same, the cached result can be reused regardless of the physical implementation.
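As an illustrative sketch (hypothetical names and structures, not the existing implementation), a logical cache key can be computed by hashing a canonical encoding of an output port together with its upstream logical sub-DAG, so that equal sub-DAGs always yield equal keys:

```python
import hashlib
import json

def logical_cache_key(port, upstream_subdag):
    """Hypothetical sketch: hash a canonical encoding of an output port
    and its upstream logical sub-DAG (operators, configs, and edges)."""
    # Canonicalize: sort operators and edges so that equal sub-DAGs
    # always serialize to the same string before hashing.
    encoding = json.dumps({
        "port": port,
        "operators": sorted(
            (op_id, op_type, json.dumps(config, sort_keys=True))
            for op_id, (op_type, config) in upstream_subdag["operators"].items()
        ),
        "edges": sorted(upstream_subdag["edges"]),
    }, sort_keys=True)
    return hashlib.sha256(encoding.encode()).hexdigest()

# Two executions over the same logical sub-DAG produce the same key,
# regardless of which physical implementation of Join is later chosen.
subdag = {
    "operators": {"scanA": ("Scan", {"table": "A"}),
                  "scanB": ("Scan", {"table": "B"}),
                  "join": ("Join", {"on": "id"})},
    "edges": [("scanA", "join"), ("scanB", "join")],
}
key1 = logical_cache_key("join.o1", subdag)
key2 = logical_cache_key("join.o1", subdag)
assert key1 == key2
```

Because the key mentions only the logical operator `Join` and its configuration, it is unaffected by whether the optimizer picks `HashJoin` or `SortMergeJoin`.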
For example, if Execution 1 runs `Join` as `HashJoin` and caches the join output, Execution 2 can reuse that cached result even if the optimizer would otherwise choose `SortMergeJoin`; the cache effectively preempts the physical-plan decision for that operator. The tradeoff is that implementation-specific intermediate state is invisible: `HashJoin` produces intermediate state (e.g., the build-side hash table) that could be cached and reused by another `HashJoin`, but a logical-level cache key cannot express this because `Build` is not visible at the logical level.

Option B: Cache key on the physical plan.
A cached result is associated with a physical operator's output port and its upstream physical sub-DAG (this is the current design).
This allows caching implementation-specific intermediate results (e.g., a hash table from `Build`, a sorted run from `Sort`, a local aggregate from `LocalAgg`). The tradeoff is that reuse is tied to the implementation: a `HashJoin`'s output cannot match a `SortMergeJoin`'s output, even though they compute the same logical result. If the optimizer changes the physical plan between executions, all downstream cached results are invalidated.

Correctness requirement: Logical-level cache reuse assumes that all physical implementations of a logical operator produce equivalent output (at least bag-equal). If a downstream operator depends on output properties specific to one implementation (e.g., sorted order from `SortMergeJoin`), the optimizer must account for this when deciding whether to use a logical-level cached result.

Proposed Extension: Cache Keys at Both Levels
Under a one-to-many mapping, we can support cache keys at both the logical and physical levels to combine the benefits:
1. **Compute cache keys at both levels.** Each logical-plan output port gets a logical cache key (based on the upstream logical sub-DAG and operator configurations), and each physical-plan output port gets a physical cache key (based on the upstream physical sub-DAG), as in the current design.
2. **Pass logical-level matches to the physical-plan optimizer.** Before physical-plan optimization, the system checks which logical output ports have cached results. These matches are passed to the physical-plan optimizer as additional input, since a logical-level match may make certain physical-plan choices unnecessary or change their relative cost.
3. **The optimizer considers both sources of reuse.** Each logical-plan output port maps uniquely to a physical-plan output port, regardless of which physical plan is chosen (see the figure below). The optimizer can therefore account for cache matches from both levels: a logical-level match provides the final output for free, while a physical-level match provides a reusable intermediate result within the chosen implementation.
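The two-level lookup can be sketched as follows (hypothetical names; the port-mapping and cache-store structures are illustrative assumptions, and real keys would be sub-DAG hashes rather than port names):

```python
# Each candidate physical plan records which physical output port
# realizes each logical output port.
plans = {
    "HashJoin":      {"Join.o1": "Probe.o1"},   # Build.o1 is physical-only
    "SortMergeJoin": {"Join.o1": "Merge.o1"},
}

logical_cache = {"Join.o1": "cached-join-output"}    # from a prior execution
physical_cache = {"Probe.o1": "cached-join-output"}  # HashJoin-specific key

def lookup(plan_name, logical_port):
    """Try the logical key first (valid across implementations),
    then fall back to the plan-specific physical key."""
    physical_port = plans[plan_name][logical_port]
    if logical_port in logical_cache:          # reuse across implementations
        return logical_cache[logical_port], "logical"
    if physical_port in physical_cache:        # implementation-specific reuse
        return physical_cache[physical_port], "physical"
    return None, None

# The logical-level match succeeds under either physical plan.
assert lookup("HashJoin", "Join.o1")[1] == "logical"
assert lookup("SortMergeJoin", "Join.o1")[1] == "logical"
```

If only the physical entry existed, `HashJoin` would still hit the cache via `Probe.o1`, while `SortMergeJoin` would miss, matching the cross-implementation behavior described above.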
In the figure above, the logical `Join` operator has output port `o1`. In Physical Plan 1 (`HashJoin`), `Join.o1` maps to `Probe.o1`. In Physical Plan 2 (`SortMergeJoin`), `Join.o1` maps to `Merge.o1`. A logical cache key for `Join.o1` matches in both physical plans via this port mapping. Physical cache keys, on the other hand, do not match across implementations: `Probe.o1 ≠ Merge.o1`. Meanwhile, `Build.o1` is a physical-only port with no logical counterpart, enabling finer-grained reuse within the `HashJoin` implementation.

Relation to the Current One-to-One Design
Under the current one-to-one mapping, the physical plan has strictly more cacheable locations than the logical plan. Physical operators that are internal to a logical operator (such as `Build` within `HashJoin`) have output ports with their own cache keys, but these ports have no logical-plan counterpart. So even under one-to-one mapping, physical-plan cache keys provide finer-grained reuse than logical-plan cache keys alone.

The proposed extension adds logical-level cache keys on top of the existing physical-level keys. This only becomes useful under one-to-many logical-to-physical mapping, where the cross-implementation reuse benefit of logical keys comes into play.
Note on Reuse Policy
When a cached result matches (at either level), the system always reuses it rather than recomputing. This is a simplifying assumption; there are workloads where recomputation would be cheaper (e.g., when a cached result is large and expensive to read from storage). A cost-based reuse decision is an interesting direction but is outside the scope of this design.
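The always-reuse policy reduces to a constant predicate; a hypothetical sketch of where a cost-based decision could later plug in (not part of this design):

```python
def should_reuse(cached_read_cost: float, recompute_cost: float) -> bool:
    """Current policy: always reuse a matching cached result.
    A cost-based policy would instead compare the estimated cost of
    reading the cached result against recomputing it, e.g.
    `return cached_read_cost < recompute_cost`."""
    return True  # simplifying assumption: reuse unconditionally
```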