Cache Key Design for One-to-Many Logical-Physical Operator Mapping #4346
Xiao-zhen-Liu started this conversation in Ideas · Replies: 1 comment
@Xiao-zhen-Liu Thanks for the great summary. For the record, this is based on a discussion among @Xiao-zhen-Liu, @Yicong-Huang, and myself (@chenlica).
Background: Current Design
Currently, there is a one-to-one mapping from a workflow's logical plan to its physical plan. Each logical operator maps to one or more physical operators with a fixed implementation choice. For example,
`Sort` always maps to the Python-based sort implementation, and `HashJoin` always maps to `Build` + `Probe`. This has two limitations:

1. Users see `HashJoin` instead of `Join`, leaking the implementation into the user-facing layer. Ideally, logical operator names should stay logical.
2. A fixed mapping leaves no room for optimization: `Join` could be realized as `HashJoin`, `SortMergeJoin`, or other implementations, and a cost-based optimizer could choose among them.

The Cache-Key Tradeoff Under One-to-Many Mapping
If we move to a one-to-many logical-to-physical mapping, the cache-key design must decide at which level of the plan hierarchy to match cached results. There are two options, each with a tradeoff.
Option A: Cache key on the logical plan.
A cached result is associated with a logical operator's output port and its upstream logical sub-DAG. Matching uses logical-plan equality: if the logical sub-DAGs are the same, the cached result can be reused regardless of the physical implementation.
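As an illustrative sketch (hypothetical names and structures, not the existing implementation), a logical cache key can be computed by hashing a canonical encoding of an output port together with its upstream logical sub-DAG, so that equal sub-DAGs always yield equal keys:

```python
import hashlib
import json

def logical_cache_key(port, upstream_subdag):
    """Hypothetical sketch: hash a canonical encoding of an output port
    and its upstream logical sub-DAG (operators, configs, and edges)."""
    # Canonicalize: sort operators and edges so that equal sub-DAGs
    # always serialize to the same string before hashing.
    encoding = json.dumps({
        "port": port,
        "operators": sorted(
            (op_id, op_type, json.dumps(config, sort_keys=True))
            for op_id, (op_type, config) in upstream_subdag["operators"].items()
        ),
        "edges": sorted(upstream_subdag["edges"]),
    }, sort_keys=True)
    return hashlib.sha256(encoding.encode()).hexdigest()

# Two executions over the same logical sub-DAG produce the same key,
# regardless of which physical implementation of Join is later chosen.
subdag = {
    "operators": {"scanA": ("Scan", {"table": "A"}),
                  "scanB": ("Scan", {"table": "B"}),
                  "join": ("Join", {"on": "id"})},
    "edges": [("scanA", "join"), ("scanB", "join")],
}
key1 = logical_cache_key("join.o1", subdag)
key2 = logical_cache_key("join.o1", subdag)
assert key1 == key2
```

Because the key mentions only the logical operator `Join` and its configuration, it is unaffected by whether the optimizer picks `HashJoin` or `SortMergeJoin`.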
For example, if Execution 1 runs `Join` as `HashJoin` and caches the join output, Execution 2 can reuse that cached result even if the optimizer would otherwise choose `SortMergeJoin`; the cache effectively preempts the physical-plan decision for that operator. The tradeoff is that implementation-specific intermediate state is invisible: `HashJoin` produces intermediate state (e.g., the build-side hash table) that could be cached and reused by another `HashJoin`, but a logical-level cache key cannot express this because `Build` is not visible at the logical level.

Option B: Cache key on the physical plan.
A cached result is associated with a physical operator's output port and its upstream physical sub-DAG (this is the current design).
This allows caching implementation-specific intermediate results (e.g., a hash table from `Build`, a sorted run from `Sort`, a local aggregate from `LocalAgg`). The tradeoff is that reuse is tied to the implementation: a `HashJoin`'s output cannot match a `SortMergeJoin`'s output, even though they compute the same logical result. If the optimizer changes the physical plan between executions, all downstream cached results are invalidated.

Correctness requirement: Logical-level cache reuse assumes that all physical implementations of a logical operator produce equivalent output (at least bag-equal). If a downstream operator depends on output properties specific to one implementation (e.g., sorted order from `SortMergeJoin`), the optimizer must account for this when deciding whether to use a logical-level cached result.

Proposed Extension: Cache Keys at Both Levels
Under a one-to-many mapping, we can support cache keys at both the logical and physical levels to combine the benefits:
1. **Compute cache keys at both levels.** Each logical-plan output port gets a logical cache key (based on the upstream logical sub-DAG and operator configurations), and each physical-plan output port gets a physical cache key (based on the upstream physical sub-DAG), as in the current design.
2. **Pass logical-level matches to the physical-plan optimizer.** Before physical-plan optimization, the system checks which logical output ports have cached results. These matches are passed to the physical-plan optimizer as additional input, since a logical-level match may make certain physical-plan choices unnecessary or change their relative cost.
3. **The optimizer considers both sources of reuse.** Each logical-plan output port maps uniquely to a physical-plan output port, regardless of which physical plan is chosen (see the figure below). The optimizer can therefore account for cache matches from both levels: a logical-level match provides the final output for free, while a physical-level match provides a reusable intermediate result within the chosen implementation.
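The two-level lookup can be sketched as follows (hypothetical names; the port-mapping and cache-store structures are illustrative assumptions, and real keys would be sub-DAG hashes rather than port names):

```python
# Each candidate physical plan records which physical output port
# realizes each logical output port.
plans = {
    "HashJoin":      {"Join.o1": "Probe.o1"},   # Build.o1 is physical-only
    "SortMergeJoin": {"Join.o1": "Merge.o1"},
}

logical_cache = {"Join.o1": "cached-join-output"}    # from a prior execution
physical_cache = {"Probe.o1": "cached-join-output"}  # HashJoin-specific key

def lookup(plan_name, logical_port):
    """Try the logical key first (valid across implementations),
    then fall back to the plan-specific physical key."""
    physical_port = plans[plan_name][logical_port]
    if logical_port in logical_cache:          # reuse across implementations
        return logical_cache[logical_port], "logical"
    if physical_port in physical_cache:        # implementation-specific reuse
        return physical_cache[physical_port], "physical"
    return None, None

# The logical-level match succeeds under either physical plan.
assert lookup("HashJoin", "Join.o1")[1] == "logical"
assert lookup("SortMergeJoin", "Join.o1")[1] == "logical"
```

If only the physical entry existed, `HashJoin` would still hit the cache via `Probe.o1`, while `SortMergeJoin` would miss, matching the cross-implementation behavior described above.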
In the figure above, the logical `Join` operator has output port `o1`. In Physical Plan 1 (`HashJoin`), `Join.o1` maps to `Probe.o1`. In Physical Plan 2 (`SortMergeJoin`), `Join.o1` maps to `Merge.o1`. A logical cache key for `Join.o1` matches in both physical plans via this port mapping. Physical cache keys, on the other hand, do not match across implementations: `Probe.o1 ≠ Merge.o1`. Meanwhile, `Build.o1` is a physical-only port with no logical counterpart, enabling finer-grained reuse within the `HashJoin` implementation.

Relation to the Current One-to-One Design
Under the current one-to-one mapping, the physical plan has strictly more cacheable locations than the logical plan. Physical operators that are internal to a logical operator (such as `Build` within `HashJoin`) have output ports with their own cache keys, but these ports have no logical-plan counterpart. So even under one-to-one mapping, physical-plan cache keys provide finer-grained reuse than logical-plan cache keys alone.

The proposed extension adds logical-level cache keys on top of the existing physical-level keys. This only becomes useful under one-to-many logical-to-physical mapping, where the cross-implementation reuse benefit of logical keys comes into play.
Note on Reuse Policy
When a cached result matches (at either level), the system always reuses it rather than recomputing. This is a simplifying assumption; there are workloads where recomputation would be cheaper (e.g., when a cached result is large and expensive to read from storage). A cost-based reuse decision is an interesting direction but is outside the scope of this design.
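The always-reuse policy reduces to a constant predicate; a hypothetical sketch of where a cost-based decision could later plug in (not part of this design):

```python
def should_reuse(cached_read_cost: float, recompute_cost: float) -> bool:
    """Current policy: always reuse a matching cached result.
    A cost-based policy would instead compare the estimated cost of
    reading the cached result against recomputing it, e.g.
    `return cached_read_cost < recompute_cost`."""
    return True  # simplifying assumption: reuse unconditionally
```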