[Feature] Support alloc global workspace #1940
Conversation
📝 Walkthrough
This pull request introduces global buffer allocation support in tilelang: a new `T.alloc_global` language API and a `HoistGlobalBufferAllocations` transform that moves global-scope buffer allocations out of device blocks and into the host main block.
Sequence Diagram

```mermaid
sequenceDiagram
    participant IR as TIR Module
    participant Host as Host Main Block
    participant Device as Device Block
    participant Hoister as GlobalBufferAllocationsHoister
    participant Result as Hoisted IR
    IR->>Hoister: Apply mutator
    Hoister->>Device: Visit device block
    Device->>Hoister: Query alloc_buffers
    Hoister->>Hoister: Filter: collect global buffers<br/>keep only non-global
    Device->>Result: Update with filtered alloc_buffers
    Hoister->>Host: Visit host main block<br/>(IsHostMainBlock check)
    Host->>Hoister: Query alloc_buffers
    Hoister->>Hoister: Append accumulated<br/>global buffers
    Host->>Result: Update with<br/>combined alloc_buffers
    Result-->>IR: Return mutated module
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Force-pushed from b80f614 to d7c5f07
Actionable comments posted: 4
🧹 Nitpick comments (1)
tilelang/language/allocate.py (1)
329-347: Consider enforcing `scope="global"` in `alloc_global`.

Because hoisting/handling is keyed on global scope, allowing an arbitrary `scope` here can silently defeat the intended workspace path.

💡 Suggested guard

```diff
 def alloc_global(shape: ShapeType, dtype: DType, scope="global") -> Buffer:
 @@
-    return T.alloc_buffer(shape, dtype, scope=scope)
+    if scope != "global":
+        raise ValueError("alloc_global only supports scope='global'.")
+    return T.alloc_buffer(shape, dtype, scope=scope)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/language/allocate.py` around lines 329 - 347, The alloc_global function currently accepts a scope parameter which can be set to non-"global" and silently bypass the intended global workspace path; update alloc_global (the function defined at top of file) to enforce scope="global" by either ignoring the passed scope and always calling T.alloc_buffer(..., scope="global") or by validating the incoming scope and raising an error if it's not "global"; ensure the change references alloc_global and the call to T.alloc_buffer so callers can't silently provide a different scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 53-60: The loop over op->alloc_buffers currently assigns
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers) inside each
iteration which clobbers previously collected non-global buffers; move that
assignment out of the for loop so you accumulate all non-global buffers into
new_alloc_buffers and then perform a single move-assignment to
node.CopyOnWrite()->alloc_buffers after the loop, preserving global_buffers_
updates and using IsGlobalBuffer(buf) to decide where each buf goes.
In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 234-236: The test function test_alloc_global is missing the CUDA
requirement decorator; add the `@tilelang.testing.requires_cuda` decorator above
the test_alloc_global definition so the test is skipped on non-CUDA
systems—apply the same decorator pattern used on other CUDA-dependent tests (see
functions/run calls run_alloc_global and run_alloc_global_eagerjit) to ensure
the test only runs when CUDA is available.
- Around line 239-241: Restore the module-level test runner by replacing the
manual call to test_alloc_global() with a call to tilelang.testing.main() so
running the file executes the full test suite; specifically, in the if __name__
== "__main__": block remove or uncomment the commented-out
tilelang.testing.main() and remove the direct test_alloc_global() invocation
(leave test functions like test_alloc_global defined but invoked by
tilelang.testing.main()).
In `@tilelang/engine/phase.py`:
- Around line 243-250: Remove the leftover debug prints that dump IR to stdout:
delete the unconditional print(mod) calls found near the transform pipeline
(after tilelang.transform.IfStmtBinding, after
PlanAndUpdateBufferAllocationLocation and the final print before/after
HoistGlobalBufferAllocations). Keep the sequence of transforms (IfStmtBinding,
PlanAndUpdateBufferAllocationLocation, PipelinePlanning, InjectSoftwarePipeline,
HoistGlobalBufferAllocations) unchanged; if you need retained debug visibility
use a conditional logger.debug call instead of print().
---
Nitpick comments:
In `@tilelang/language/allocate.py`:
- Around line 329-347: The alloc_global function currently accepts a scope
parameter which can be set to non-"global" and silently bypass the intended
global workspace path; update alloc_global (the function defined at top of file)
to enforce scope="global" by either ignoring the passed scope and always calling
T.alloc_buffer(..., scope="global") or by validating the incoming scope and
raising an error if it's not "global"; ensure the change references alloc_global
and the call to T.alloc_buffer so callers can't silently provide a different
scope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d0be493e-915e-43a3-b0da-31476784ea8c
📒 Files selected for processing (12)
- src/ir.cc
- src/target/codegen_c_host.cc
- src/transform/common/attr.h
- src/transform/hoist_global_buffer_allocations.cc
- src/transform/inject_assumes.cc
- src/transform/inject_tma_barrier.cc
- src/transform/plan_update_buffer_allocation_location.cc
- testing/python/language/test_tilelang_language_alloc.py
- tilelang/engine/phase.py
- tilelang/language/__init__.py
- tilelang/language/allocate.py
- tilelang/transform/__init__.py
Actionable comments posted: 3
♻️ Duplicate comments (4)
src/transform/hoist_global_buffer_allocations.cc (1)
53-60: ⚠️ Potential issue | 🔴 Critical

Move alloc-buffer assignment out of the loop.

At line 59, assigning `alloc_buffers` inside each iteration repeatedly overwrites state and can drop earlier non-global buffers. Assign once after the loop finishes collecting.

🛠️ Proposed fix

```diff
 ffi::Array<Buffer> new_alloc_buffers;
 for (const auto &buf : op->alloc_buffers) {
   if (IsGlobalBuffer(buf)) {
     global_buffers_.push_back(buf);
   } else {
     new_alloc_buffers.push_back(buf);
   }
-  node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
 }
+node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/hoist_global_buffer_allocations.cc` around lines 53 - 60, The loop currently assigns node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers) inside each iteration, which overwrites prior state and can drop previously collected non-global buffers; instead, iterate over op->alloc_buffers, push globals into global_buffers_ and non-globals into new_alloc_buffers (using IsGlobalBuffer, global_buffers_, new_alloc_buffers), then after the loop completes perform a single assignment node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers) so allocation buffer replacement happens once.

testing/python/language/test_tilelang_language_alloc.py (2)
239-241: ⚠️ Potential issue | 🟡 Minor

Restore full module test runner in `__main__`.

Running only `test_alloc_global()` hides failures in other tests when executing this file directly.

🔧 Proposed fix

```diff
 if __name__ == "__main__":
-    # tilelang.testing.main()
-    test_alloc_global()
+    tilelang.testing.main()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/language/test_tilelang_language_alloc.py` around lines 239 - 241, Restore the module-level test runner: replace the direct call to test_alloc_global() in the __main__ block with a call to tilelang.testing.main() so running the file executes the full test suite; specifically, remove or stop calling test_alloc_global() directly and uncomment or re-enable tilelang.testing.main() in the if __name__ == "__main__": block.
234-236: ⚠️ Potential issue | 🟡 Minor

Add CUDA guard on the new global-allocation test.

`run_alloc_global` and `run_alloc_global_eagerjit` allocate CUDA tensors, so this test should be skipped on non-CUDA environments.

🔧 Proposed fix

```diff
+@tilelang.testing.requires_cuda
 def test_alloc_global():
     run_alloc_global(1024, 128, T.float16)
     run_alloc_global_eagerjit(1024, 128, T.float16)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/language/test_tilelang_language_alloc.py` around lines 234 - 236, The test test_alloc_global calls run_alloc_global and run_alloc_global_eagerjit which allocate CUDA tensors and must be skipped when CUDA isn't available; add a CUDA guard by decorating test_alloc_global with a pytest skipif checking CUDA availability (e.g., pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")) or perform an early runtime check inside test_alloc_global and call pytest.skip when CUDA is absent, so the test only runs in CUDA environments.

tilelang/engine/phase.py (1)
249-255: ⚠️ Potential issue | 🟠 Major

Remove unconditional IR debug prints from the optimization pipeline.

Lines 249, 251, and 255 print full IR modules on normal execution and will spam stdout and slow compilation workflows.

🔧 Proposed fix

```diff
 else:
     mod = tilelang.transform.LowerSharedBarrier()(mod)
 mod = tilelang.transform.IfStmtBinding()(mod)
-print(mod)
 mod = tilelang.transform.PlanAndUpdateBufferAllocationLocation()(mod)
-print(mod)
 mod = tilelang.transform.PipelinePlanning()(mod)
 mod = tilelang.transform.InjectSoftwarePipeline()(mod)
 mod = tilelang.transform.HoistGlobalBufferAllocations()(mod)
-print(mod)
 mod = tilelang.transform.LowerOpaqueBlock()(mod)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/engine/phase.py` around lines 249 - 255, Remove the unconditional print calls that dump IR around the optimization pipeline: the prints immediately before/after PlanAndUpdateBufferAllocationLocation, after PipelinePlanning/InjectSoftwarePipeline, and after HoistGlobalBufferAllocations; instead either delete them or gate them behind a verbose/debug flag or logger check (e.g., an opt-level or tilelang.debug flag) so IR is only printed when explicit debugging is enabled; locate the prints in the block where tilelang.transform.PlanAndUpdateBufferAllocationLocation, tilelang.transform.PipelinePlanning, tilelang.transform.InjectSoftwarePipeline, and tilelang.transform.HoistGlobalBufferAllocations are invoked and remove/guard those print(mod) calls.
🧹 Nitpick comments (4)
src/transform/inject_tma_barrier.cc (4)
142: Missing `final` specifier on virtual method override.

`VisitStmt_(const IfThenElseNode *op)` should have the `final` keyword, consistent with other overrides in this file.

Proposed fix

```diff
-  Stmt VisitStmt_(const IfThenElseNode *op) {
+  Stmt VisitStmt_(const IfThenElseNode *op) final {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tma_barrier.cc` at line 142, The override VisitStmt_(const IfThenElseNode *op) is missing the final specifier; update its declaration to mark it as final (e.g., change the method signature in the class/visitor implementation to include final) so it matches the other overridden VisitStmt_ methods in this file and prevents further overrides.
175: Missing `final` specifier on virtual method override.

`VisitExpr_(const CallNode *op)` should have the `final` keyword.

Proposed fix

```diff
-  PrimExpr VisitExpr_(const CallNode *op) {
+  PrimExpr VisitExpr_(const CallNode *op) final {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tma_barrier.cc` at line 175, The override of the virtual method VisitExpr_(const CallNode *op) must be marked final to prevent further overrides; locate the VisitExpr_(const CallNode *op) implementation in inject_tma_barrier.cc (the class that defines this visitor) and add the final specifier to its declaration/definition so it reads as an override marked final (i.e., VisitExpr_(const CallNode *op) final), ensuring the signature matches the base virtual method and keeping any existing override keyword if present.
200-208: Redundant assignment after equality check.

Line 205 always executes the same `Set` operation regardless of whether the extent matches or differs. Since line 203 ICHECKs for equality, the assignment on line 205 is redundant when the entry exists.

Proposed simplification

```diff
 void UpdateBarrierRange(const PrimExpr &barrier_id, const IntImm &extent) {
   if (barrier_id_to_range_.count(barrier_id)) {
     auto old_extent = barrier_id_to_range_[barrier_id];
     ICHECK_EQ(old_extent->value, extent->value)
         << "barrier_id: " << barrier_id << " has different extent";
-    barrier_id_to_range_.Set(barrier_id, extent);
+    // Already exists with the same extent; no update needed.
   } else {
     barrier_id_to_range_.Set(barrier_id, extent);
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tma_barrier.cc` around lines 200 - 208, The UpdateBarrierRange function currently calls barrier_id_to_range_.Set(barrier_id, extent) both inside the if branch after ICHECK_EQ and again in the else branch, making the Set in the if branch redundant; update UpdateBarrierRange to only perform the Set when the key is absent (keep the ICHECK_EQ to assert equality when the key exists) so that barrier_id_to_range_.Set is only called in the else path, referencing the UpdateBarrierRange method and the barrier_id_to_range_ map to locate the change.
654-658: Return value type could be more explicit.

Returning `0` (integer literal) works via implicit conversion to `PrimExpr`, but returning an explicit `IntImm` or using `make_zero()` would be clearer and more consistent with TVM IR conventions.

Proposed fix

```diff
 if (clear_arrive_) {
   clear_arrive_ = false;
-  return 0;
+  return IntImm(DataType::Int(32), 0);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/inject_tma_barrier.cc` around lines 654 - 658, The code returns a plain integer literal (return 0;) when handling the ptx_arrive_barrier case; replace this with an explicit PrimExpr zero to follow TVM IR conventions—use make_zero(op->dtype) (or construct an IntImm with the correct dtype) in place of 0 so the return is an explicit PrimExpr, e.g., change the return in the block guarded by op->op.same_as(builtin::ptx_arrive_barrier()) and clear_arrive_ to return make_zero(op->dtype).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/transform/inject_tma_barrier.cc`:
- Around line 20-23: Update the file-level comment header that currently says
"tma_barrier_rewriter.cc" to the correct filename "inject_tma_barrier.cc" so the
documentation matches the actual file; locate the top-of-file comment block (the
/*! ... */ header) and modify the filename token and any related brief
description if necessary to reference inject_tma_barrier.cc and ensure the file
header and brief line are consistent with the current source (e.g., the opening
comment that includes "tma_barrier_rewriter.cc" and "Rewrite TMA barriers for
cuda GPU (sm90+)").
- Around line 93-96: The local variable name `old_loop_evtents` is a typo and
should be `old_loop_extents`; update the declaration and the restoration
assignment so the code stores `loop_extents` into `old_loop_extents` before
calling `StmtExprVisitor::VisitStmt_(op)` and then assigns `loop_extents =
old_loop_extents` afterward, making the change where `loop_extents` is
saved/restored in the block around the VisitStmt_ call.
- Around line 149-170: When visited_tma_load_ is true, you must use the mutated
branches from ret rather than the original op->then_case/op->else_case so child
mutations (from VisitExpr_ that set barrier IDs) are preserved; extract the
IfThenElse node from ret (result of IRMutatorWithAnalyzer::VisitStmt_(op)) and
use its then_case/else_case with TmaTraitsCollector and makeExpectTX/SeqStmt to
construct the replacement IfThenElse, and still reset visited_tma_load_ = false
before returning.
---
Duplicate comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 53-60: The loop currently assigns
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers) inside each
iteration, which overwrites prior state and can drop previously collected
non-global buffers; instead, iterate over op->alloc_buffers, push globals into
global_buffers_ and non-globals into new_alloc_buffers (using IsGlobalBuffer,
global_buffers_, new_alloc_buffers), then after the loop completes perform a
single assignment node.CopyOnWrite()->alloc_buffers =
std::move(new_alloc_buffers) so allocation buffer replacement happens once.
In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 239-241: Restore the module-level test runner: replace the direct
call to test_alloc_global() in the __main__ block with a call to
tilelang.testing.main() so running the file executes the full test suite;
specifically, remove or stop calling test_alloc_global() directly and uncomment
or re-enable tilelang.testing.main() in the if __name__ == "__main__": block.
- Around line 234-236: The test test_alloc_global calls run_alloc_global and
run_alloc_global_eagerjit which allocate CUDA tensors and must be skipped when
CUDA isn't available; add a CUDA guard by decorating test_alloc_global with a
pytest skipif checking CUDA availability (e.g., pytest.mark.skipif(not
torch.cuda.is_available(), reason="requires CUDA")) or perform an early runtime
check inside test_alloc_global and call pytest.skip when CUDA is absent, so the
test only runs in CUDA environments.
In `@tilelang/engine/phase.py`:
- Around line 249-255: Remove the unconditional print calls that dump IR around
the optimization pipeline: the prints immediately before/after
PlanAndUpdateBufferAllocationLocation, after
PipelinePlanning/InjectSoftwarePipeline, and after HoistGlobalBufferAllocations;
instead either delete them or gate them behind a verbose/debug flag or logger
check (e.g., an opt-level or tilelang.debug flag) so IR is only printed when
explicit debugging is enabled; locate the prints in the block where
tilelang.transform.PlanAndUpdateBufferAllocationLocation,
tilelang.transform.PipelinePlanning, tilelang.transform.InjectSoftwarePipeline,
and tilelang.transform.HoistGlobalBufferAllocations are invoked and remove/guard
those print(mod) calls.
---
Nitpick comments:
In `@src/transform/inject_tma_barrier.cc`:
- Line 142: The override VisitStmt_(const IfThenElseNode *op) is missing the
final specifier; update its declaration to mark it as final (e.g., change the
method signature in the class/visitor implementation to include final) so it
matches the other overridden VisitStmt_ methods in this file and prevents
further overrides.
- Line 175: The override of the virtual method VisitExpr_(const CallNode *op)
must be marked final to prevent further overrides; locate the VisitExpr_(const
CallNode *op) implementation in inject_tma_barrier.cc (the class that defines
this visitor) and add the final specifier to its declaration/definition so it
reads as an override marked final (i.e., VisitExpr_(const CallNode *op) final),
ensuring the signature matches the base virtual method and keeping any existing
override keyword if present.
- Around line 200-208: The UpdateBarrierRange function currently calls
barrier_id_to_range_.Set(barrier_id, extent) both inside the if branch after
ICHECK_EQ and again in the else branch, making the Set in the if branch
redundant; update UpdateBarrierRange to only perform the Set when the key is
absent (keep the ICHECK_EQ to assert equality when the key exists) so that
barrier_id_to_range_.Set is only called in the else path, referencing the
UpdateBarrierRange method and the barrier_id_to_range_ map to locate the change.
- Around line 654-658: The code returns a plain integer literal (return 0;) when
handling the ptx_arrive_barrier case; replace this with an explicit PrimExpr
zero to follow TVM IR conventions—use make_zero(op->dtype) (or construct an
IntImm with the correct dtype) in place of 0 so the return is an explicit
PrimExpr, e.g., change the return in the block guarded by
op->op.same_as(builtin::ptx_arrive_barrier()) and clear_arrive_ to return
make_zero(op->dtype).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e47f74fc-c588-4f44-b0b9-1e9b4d695e6a
📒 Files selected for processing (12)
- src/ir.cc
- src/target/codegen_c_host.cc
- src/transform/common/attr.h
- src/transform/hoist_global_buffer_allocations.cc
- src/transform/inject_assumes.cc
- src/transform/inject_tma_barrier.cc
- src/transform/plan_update_buffer_allocation_location.cc
- testing/python/language/test_tilelang_language_alloc.py
- tilelang/engine/phase.py
- tilelang/language/__init__.py
- tilelang/language/allocate.py
- tilelang/transform/__init__.py
✅ Files skipped from review due to trivial changes (1)
- src/target/codegen_c_host.cc
🚧 Files skipped from review as they are similar to previous changes (7)
- src/transform/inject_assumes.cc
- src/ir.cc
- tilelang/language/__init__.py
- src/transform/plan_update_buffer_allocation_location.cc
- tilelang/language/allocate.py
- tilelang/transform/__init__.py
- src/transform/common/attr.h
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tilelang/engine/phase.py (1)
243-253: ⚠️ Potential issue | 🟡 Minor

Document or remove the redundant `LowerOpaqueBlock()` call in the TMA path.

`LowerOpaqueBlock()` is called at line 243 inside the `allow_tma_lower` branch and again at line 253 unconditionally. Since `LowerOpaqueBlock` is idempotent (it converts `BlockRealize` nodes and subsequent passes don't reintroduce them), the second call at line 253 is a no-op for the TMA path, creating an asymmetry where:

- the TMA path applies the pass twice (redundant)
- the non-TMA path applies it once

Either add a clarifying comment explaining why the TMA path requires an additional pre-pass application, or move line 253 into the `else` block to match the non-TMA structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/engine/phase.py` around lines 243 - 253, The LowerOpaqueBlock() pass is applied twice on the TMA path (once inside the allow_tma_lower/is_hopper branch and again unconditionally), which is redundant; remove the earlier call to tilelang.transform.LowerOpaqueBlock() inside the allow_tma_lower / is_hopper branch so the pass runs exactly once (the remaining unconditional tilelang.transform.LowerOpaqueBlock()(mod) stays), or if the TMA path truly requires a pre-application, instead keep both but add a clear comment next to the in-branch call explaining why the extra pre-pass is required; reference the LowerOpaqueBlock pass and the is_hopper/allow_tma_lower branch to locate the change.
♻️ Duplicate comments (1)
src/transform/hoist_global_buffer_allocations.cc (1)
51-61: ⚠️ Potential issue | 🔴 Critical

Critical: `std::move` inside the loop causes buffer loss.

The assignment at line 59 is inside the `for` loop. After the first `std::move`, `new_alloc_buffers` is in a moved-from state (empty). Subsequent iterations push buffers to this empty vector, then move again, losing all previously collected non-global buffers. Only the buffer from the last iteration will be retained in `alloc_buffers`.

🐛 Proposed fix

```diff
 } else {
   ffi::Array<Buffer> new_alloc_buffers;
   for (const auto &buf : op->alloc_buffers) {
     if (IsGlobalBuffer(buf)) {
       global_buffers_.push_back(buf);
     } else {
       new_alloc_buffers.push_back(buf);
     }
-    node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
   }
+  node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/hoist_global_buffer_allocations.cc` around lines 51 - 61, The assignment to node.CopyOnWrite()->alloc_buffers uses std::move(new_alloc_buffers) inside the for loop which moves away the vector on the first iteration and causes loss of previously collected buffers; change the logic so you only build new_alloc_buffers by iterating op->alloc_buffers and pushing non-global buffers (using IsGlobalBuffer and pushing globals into global_buffers_), then after the loop perform a single node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers); to preserve all non-global buffers.
🧹 Nitpick comments (1)
src/transform/hoist_global_buffer_allocations.cc (1)
69-74: Consider returning `fptr` or using `func` directly.

`HoistGlobalBufferAllocations` calls `func.CopyOnWrite()` (obtaining `fptr`), mutates via `fptr->body`, but returns `func`. This works because `CopyOnWrite()` mutates `func` in-place when it has a unique reference. However, for clarity, consider returning `fptr` wrapped back into a `PrimFunc`, or simply returning `func` after the mutation without storing `fptr` if it's not needed elsewhere.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/transform/hoist_global_buffer_allocations.cc` around lines 69 - 74, The function HoistGlobalBufferAllocations creates a copy-on-write pointer fptr via func.CopyOnWrite(), mutates fptr->body using GlobalBufferAllocationsHoister, but then returns the original func which is confusing; either return the modified fptr wrapped as a PrimFunc or avoid creating fptr at all and operate on func directly. Fix by: if you need the copy-on-write pointer, return fptr (converted to PrimFunc as in the surrounding code style) after assigning fptr->body, or if you never use fptr elsewhere, remove fptr and call func.CopyOnWrite() inline or mutate func directly so the return value clearly reflects the mutated object (update HoistGlobalBufferAllocations to return the same object you mutated).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tilelang/engine/phase.py`:
- Around line 243-253: The LowerOpaqueBlock() pass is applied twice on the TMA
path (once inside the allow_tma_lower/is_hopper branch and again
unconditionally), which is redundant; remove the earlier call to
tilelang.transform.LowerOpaqueBlock() inside the allow_tma_lower / is_hopper
branch so the pass runs exactly once (the remaining unconditional
tilelang.transform.LowerOpaqueBlock()(mod) stays), or if the TMA path truly
requires a pre-application, instead keep both but add a clear comment next to
the in-branch call explaining why the extra pre-pass is required; reference the
LowerOpaqueBlock pass and the is_hopper/allow_tma_lower branch to locate the
change.
---
Duplicate comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 51-61: The assignment to node.CopyOnWrite()->alloc_buffers uses
std::move(new_alloc_buffers) inside the for loop which moves away the vector on
the first iteration and causes loss of previously collected buffers; change the
logic so you only build new_alloc_buffers by iterating op->alloc_buffers and
pushing non-global buffers (using IsGlobalBuffer and pushing globals into
global_buffers_), then after the loop perform a single
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers); to preserve
all non-global buffers.
---
Nitpick comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 69-74: The function HoistGlobalBufferAllocations creates a
copy-on-write pointer fptr via func.CopyOnWrite(), mutates fptr->body using
GlobalBufferAllocationsHoister, but then returns the original func which is
confusing; either return the modified fptr wrapped as a PrimFunc or avoid
creating fptr at all and operate on func directly. Fix by: if you need the
copy-on-write pointer, return fptr (converted to PrimFunc as in the surrounding
code style) after assigning fptr->body, or if you never use fptr elsewhere,
remove fptr and call func.CopyOnWrite() inline or mutate func directly so the
return value clearly reflects the mutated object (update
HoistGlobalBufferAllocations to return the same object you mutated).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: db7c9cc0-e416-45e2-b575-d3676505af77
📒 Files selected for processing (4)
- src/transform/common/attr.h
- src/transform/hoist_global_buffer_allocations.cc
- src/transform/inject_assumes.cc
- tilelang/engine/phase.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/transform/common/attr.h
♻️ Duplicate comments (1)
testing/python/language/test_tilelang_language_alloc.py (1)
234-236: ⚠️ Potential issue | 🟡 Minor

Add `@tilelang.testing.requires_cuda` decorator.

This test creates CUDA tensors (`device="cuda"`) but lacks the `@tilelang.testing.requires_cuda` decorator that other CUDA-dependent tests in this file use (e.g., lines 120-121, 161-162). The test will fail on systems without CUDA.

🐛 Proposed fix

```diff
+@tilelang.testing.requires_cuda
 def test_alloc_global():
     run_alloc_global(1024, 128, T.float16)
     run_alloc_global_eagerjit(1024, 128, T.float16)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/language/test_tilelang_language_alloc.py` around lines 234 - 236, The test function test_alloc_global creates CUDA tensors but lacks the CUDA-only guard; add the same decorator used by other CUDA-dependent tests (tilelang.testing.requires_cuda) above test_alloc_global to skip it on non-CUDA systems so calls like run_alloc_global(...) and run_alloc_global_eagerjit(...) only run when CUDA is available.
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_alloc.py (1)
191-197: Consider removing commented-out debug code.

Lines 192-194 contain commented-out print statements that appear to be leftover from development. Consider removing them before merging to keep the codebase clean.

♻️ Proposed cleanup

```diff
 kernel = tilelang.compile(program, out_idx=[1])
-# print(kernel.get_host_source())
-# code = kernel.get_kernel_source()
-# print(code)
 A = torch.randn(N, device="cuda", dtype=getattr(torch, dtype))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@testing/python/language/test_tilelang_language_alloc.py` around lines 191 - 197, Remove the leftover commented-out debug prints in the test: delete or uncomment the three print lines referencing kernel.get_host_source() and kernel.get_kernel_source() in the test_tilelang_language_alloc.py block around the kernel creation and invocation (symbols: kernel.get_host_source, kernel.get_kernel_source, print). If you want to preserve them for future debugging, wrap them behind a conditional debug flag or a pytest -s option instead of leaving them commented out.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 234-236: The test function test_alloc_global creates CUDA tensors
but lacks the CUDA-only guard; add the same decorator used by other
CUDA-dependent tests (tilelang.testing.requires_cuda) above test_alloc_global to
skip it on non-CUDA systems so calls like run_alloc_global(...) and
run_alloc_global_eagerjit(...) only run when CUDA is available.
---
Nitpick comments:
In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 191-197: Remove the leftover commented-out debug prints in the
test: delete or uncomment the three print lines referencing
kernel.get_host_source() and kernel.get_kernel_source() in the
test_tilelang_language_alloc.py block around the kernel creation and invocation
(symbols: kernel.get_host_source, kernel.get_kernel_source, print). If you want
to preserve them for future debugging, wrap them behind a conditional debug flag
or a pytest -s option instead of leaving them commented out.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c632d36b-dba9-4d73-965b-ac4e201b2a7e
📒 Files selected for processing (2)
- src/transform/hoist_global_buffer_allocations.cc
- testing/python/language/test_tilelang_language_alloc.py
This PR supports `T.alloc_global`, which can be used in either Legacy JIT or EagerJIT to allocate global memory as a workspace for a TileLang kernel. The API directly calls TVM's alloc buffer and will be lowered into `TVMBackendAllocWorkspace` in host code.

To achieve this, some passes (like Planning Allocation) need to be modified to avoid mistakenly handling global buffer allocations.
Summary by CodeRabbit
New Features
Tests