perf: share encoder/reservation across PartitionedTopKExec partition … by SubhamSinghal · Pull Request #23096 · apache/datafusion

SubhamSinghal · 2026-06-22T16:36:33Z

Which issue does this PR close?

Follow-up to #21479 (PartitionedTopKExec for ROW_NUMBER ... PARTITION BY ... LIMIT N) toward closing #6899.

Rationale for this change

PartitionedTopKExec today maintains a HashMap<OwnedRow, TopK> — one full TopK per distinct partition key. Each TopK carries its own RowConverter, MemoryReservation registered with the runtime pool, TopKMetrics, and scratch Rows buffer. With high partition cardinality every partition seen for the first time pays:

RowConverter::new (parses SortField list, allocates per-encoder state)
MemoryConsumer::register with the pool (involves a global lock)
per-counter TopKMetrics setup
scratch Rows::empty_rows allocation

For the h2o window-TopN sweep on a 10M-row CSV (id3 % N partition cardinality), this shows up as a regression at ≥10K partitions — PartitionedTopKExec is slower than the unpartitioned SortExec baseline that it's meant to replace.

What changes are included in this PR?

Adds a PartitionedTopK sibling type to topk/mod.rs that holds the shared encoder/reservation/metrics state once at the operator level and a HashMap<OwnedRow, TopKHeap> of cheap per-partition heap state. PartitionedTopKExec switches from HashMap<OwnedRow, TopK> to one PartitionedTopK.

Bench results

Today's default (main, flag-off) vs this PR (flag-on)

Partitions	main flag-off	this PR flag-on	Delta
100	282 ms	105 ms	2.7x faster
1,000	247 ms	110 ms	2.2x faster
10,000	250 ms	137 ms	1.8x faster
100,000	222 ms	320 ms	1.4x slower

h2o id3 % N sweep, 10M-row CSV, 3 iterations per query, release build, enable_window_topn=true on both sides:

Partitions	`main`	this PR	Speedup
100	110 ms	105 ms	~1.0×
1,000	117 ms	110 ms	~1.0×
10,000	640 ms	137 ms	4.7×
100,000	4,327 ms	320 ms	13.5×

10K is the inflection point: on main it's a regression vs the sort baseline (640 ms vs 234 ms); after this PR it's a win (137 ms vs 234 ms — 1.7× faster than sort). 100K nearly catches up to the sort baseline (320 ms vs 238 ms).

enable_window_topn default stays false per the #21479 discussion — 100K+ remains slower than sort on average, so this PR doesn't motivate flipping the default. It's the prerequisite for further optimizations that would attack the residual 100K+ cliff.

Are these changes tested?

Yes

Are there any user-facing changes?

No public API changes.

…heaps

perf: share encoder/reservation across PartitionedTopKExec partition …

d654a0f

…heaps

github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: share encoder/reservation across PartitionedTopKExec partition …#23096

perf: share encoder/reservation across PartitionedTopKExec partition …#23096
SubhamSinghal wants to merge 1 commit into
apache:mainfrom
SubhamSinghal:partitioned-topk-shared-state

SubhamSinghal commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SubhamSinghal commented Jun 22, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Bench results

Today's default (main, flag-off) vs this PR (flag-on)

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant