[inductor] Save producer partials for downstream sums #1

Draft
eellison wants to merge 1 commit into main from issue21-producer-sum-partials-codex


@eellison eellison commented May 29, 2026

Submitted by my agent.

This is a default-off prototype for better-benchmark issue pytorch#21, narrowed to the simple landable shape: existing mixed-order producers can attach compatible downstream producer-body sum consumers as extra saved partials. It does not relax normal mixed-order producer/consumer fusion and does not include the marginal simple-producer path.

Mechanically: while the mixed-order producer materializes a large fp32 full output that is still needed, it also writes compact saved sum partials for values already computed in that producer body. The wrapper then final-reduces the compact workspace instead of rereading the full materialized output/body values.
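The mechanism above can be pictured with a small pure-Python model. This is only an illustrative sketch, not Inductor's generated code: the producer body (`v * 2.0 + 1.0`), the block size, and the function names `producer_with_partials` and `final_reduce` are all hypothetical, chosen to show one pass that writes both the full output and one compact partial per block.

```python
# Illustrative model only; names and the producer body are hypothetical.

def producer_with_partials(x, block):
    """Simulate one producer pass over x in blocks of `block` elements.

    Returns the full materialized output plus one saved partial sum per block.
    """
    full_out = []
    partials = []
    for start in range(0, len(x), block):
        # Hypothetical producer body, computed once per element.
        tile = [v * 2.0 + 1.0 for v in x[start:start + block]]
        full_out.extend(tile)       # the full output is still needed downstream
        partials.append(sum(tile))  # compact partial written in the same pass
    return full_out, partials


def final_reduce(partials):
    """Wrapper-side reduction that reads only the compact workspace."""
    return sum(partials)


x = [float(i) for i in range(1024)]
full_out, partials = producer_with_partials(x, block=128)
total = final_reduce(partials)
```

In this model the final reduction reads 8 partials instead of rereading 1024 output values, which is the traffic the PR aims to avoid.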

Current head: c774b530fc8.

Validation:

  • Static checks on clean origin/main worktree: py_compile, git diff --check, ruff check.
  • Focused GPU tests on built reviewable tree: test/inductor/test_producer_sum_reduction_accumulation.py -v, 4 tests passed.
  • 30-candidate B200 sweep artifact: better-benchmark/scratch_issue21_guarded_validation_simplified_64m_20260529.

Why the 64MiB gate:

  • A 30-candidate sweep with the earlier 32MiB gate selected one candidate that was a real, if small, regression: sum_sum_sum_69414585b76b went from 3 kernels to 1 but slowed from 0.085ms to 0.090ms.
  • Raising the default-off min-size gate to 64MiB rejects that 50MiB case and the old 32MiB boundary case.
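A minimal sketch of how such a minimum-size gate could be expressed, under stated assumptions: that the gate compares the byte size of the materialized fp32 output against the threshold, and that the comparison is inclusive. The name `passes_min_size_gate` is hypothetical and is not the config knob or helper used in the PR.

```python
MIB = 1024 * 1024
MIN_OUTPUT_BYTES = 64 * MIB  # gate value quoted in the description


def passes_min_size_gate(numel, itemsize=4, min_bytes=MIN_OUTPUT_BYTES):
    """Return True if an output of `numel` elements meets the size gate.

    Assumptions: itemsize=4 models fp32, and the comparison is inclusive.
    """
    return numel * itemsize >= min_bytes


# Element counts for fp32 outputs of the sizes mentioned above.
numel_32mib = 32 * MIB // 4
numel_50mib = 50 * MIB // 4
numel_64mib = 64 * MIB // 4

gate_32 = passes_min_size_gate(numel_32mib)  # old boundary case, rejected
gate_50 = passes_min_size_gate(numel_50mib)  # the regressing case, rejected
gate_64 = passes_min_size_gate(numel_64mib)  # accepted under the inclusive assumption
```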

Key 64MiB sweep results on B200:

  • sum_sum_sum_6107a2f54029: 0.8357ms -> 0.1793ms, kernels 4 -> 1.
  • sum_sum_sum_dc96c4651516: 0.8098ms -> 0.1766ms, kernels 4 -> 1.
  • sum_sum_sum_e2d4961f3571: 0.0788ms -> 0.0739ms, kernels 3 -> 1.
  • sum_sum_sum_59e022113eeb: 0.1115ms -> 0.1066ms, kernels 3 -> 1.
  • sum_sum_sum_97999676281e: 0.0933ms -> 0.0907ms, kernels 3 -> 1.
  • sum_sum_sum_64f701d26f0a: 0.1725ms -> 0.1679ms, kernels 6 -> 4.
  • 30/30 correct, 6 selected, no selected regressions.
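For readers who want ratios rather than raw timings, the speedups can be derived from the numbers listed above. The snippet below only restates the reported before/after latencies; it does not rerun anything, and the variable names are illustrative.

```python
# (before_ms, after_ms) copied from the 64MiB sweep results above.
results = {
    "sum_sum_sum_6107a2f54029": (0.8357, 0.1793),
    "sum_sum_sum_dc96c4651516": (0.8098, 0.1766),
    "sum_sum_sum_e2d4961f3571": (0.0788, 0.0739),
    "sum_sum_sum_59e022113eeb": (0.1115, 0.1066),
    "sum_sum_sum_97999676281e": (0.0933, 0.0907),
    "sum_sum_sum_64f701d26f0a": (0.1725, 0.1679),
}

# Speedup > 1.0 means the selected candidate got faster.
speedups = {name: before / after for name, (before, after) in results.items()}

# The case rejected by the 64MiB gate, for comparison (a ratio below 1.0).
rejected_ratio = 0.085 / 0.090

for name, ratio in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ratio:.2f}x")
```

The two 4 -> 1 kernel cases dominate the gain; the remaining four selected candidates improve by only a few percent, which is consistent with keeping the feature default-off.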

This should stay default-off until the profitability model is stronger.

@eellison eellison force-pushed the issue21-producer-sum-partials-codex branch from 2e283fc to c774b53 Compare May 29, 2026 21:42