
[Feature] Support alloc global workspace#1940

Merged
SiriusNEO merged 6 commits into tile-ai:main from SiriusNEO:chaofan/global_0310
Mar 23, 2026

Conversation

@SiriusNEO
Collaborator

@SiriusNEO SiriusNEO commented Mar 17, 2026

This PR supports T.alloc_global, which can be used in either Legacy JIT or EagerJIT to allocate global memory as a workspace for a TileLang kernel. The API directly calls TVM's alloc buffer and is lowered into TVMBackendAllocWorkspace in host code.

To achieve this, some passes (like Planning Allocation) need to be modified to avoid mistakenly handling global buffer allocations.

Summary by CodeRabbit

  • New Features

    • Added global memory allocation support for allocating buffers directly in global memory scope.
    • Implemented automatic hoisting optimization for global buffer allocations to improve memory management.
  • Tests

    • Added comprehensive test coverage for global memory allocation with both compiled and eager JIT execution modes.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This pull request introduces global buffer allocation support in tilelang. A new HoistGlobalBufferAllocations TIR transformation pass hoists global buffer allocations to the host main block. Supporting changes include separate main-block name constants for host/device, a new alloc_global language API, and updated infrastructure throughout the transformation pipeline.

Changes

Cohort / File(s) Summary
Block Name Infrastructure
src/transform/common/attr.h, src/ir.cc, src/transform/inject_assumes.cc
Replaces single MainBlockName constant with HostMainBlockName ("root") and DeviceMainBlockName ("tilelang_root"). Adds inline predicates IsHostMainBlock and IsDeviceMainBlock. Updates block name references in kernel launch and assume injection.
Global Buffer Hoisting
src/transform/hoist_global_buffer_allocations.cc, src/transform/plan_update_buffer_allocation_location.cc
Introduces GlobalBufferAllocationsHoister mutator that collects global buffers from non-host blocks and appends them to the host main block. Adds filtering in buffer allocation recording to skip argument-bound and unmanaged buffers.
Language API
tilelang/language/allocate.py, tilelang/language/__init__.py
Adds alloc_global(shape, dtype, scope) function for global buffer allocation. Adds documentation to empty() for eager-JIT context. Re-exports alloc_global through language module.
Transform Pipeline
tilelang/transform/__init__.py, tilelang/engine/phase.py
Adds Python wrapper for HoistGlobalBufferAllocations pass. Moves pass invocation to always run (not conditional) in OptimizeForTarget.
Test Coverage
testing/python/language/test_tilelang_language_alloc.py
Introduces comprehensive tests for alloc_global in both compiled and eager-JIT modes, validating global buffer allocation and data transfer flows.

Sequence Diagram

sequenceDiagram
    participant IR as TIR Module
    participant Host as Host Main Block
    participant Device as Device Block
    participant Hoister as GlobalBufferAllocationsHoister
    participant Result as Hoisted IR

    IR->>Hoister: Apply mutator
    Hoister->>Device: Visit device block
    Device->>Hoister: Query alloc_buffers
    Hoister->>Hoister: Filter: collect global buffers<br/>keep only non-global
    Device->>Result: Update with filtered alloc_buffers
    Hoister->>Host: Visit host main block<br/>(IsHostMainBlock check)
    Host->>Hoister: Query alloc_buffers
    Hoister->>Hoister: Append accumulated<br/>global buffers
    Host->>Result: Update with<br/>combined alloc_buffers
    Result-->>IR: Return mutated module
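The hoisting flow in the diagram above can be sketched in plain Python. This is a simplified model of the pass, not the actual C++ mutator: the dict-based blocks are stand-ins for TIR nodes, while the block names mirror the constants introduced in this PR ("root" for host, "tilelang_root" for device).

```python
# Simplified model of HoistGlobalBufferAllocations: strip "global"-scope
# buffers out of non-host blocks and append them to the host main block.
HOST_MAIN_BLOCK_NAME = "root"
DEVICE_MAIN_BLOCK_NAME = "tilelang_root"

def hoist_global_allocations(blocks):
    """Return new blocks with global-scope buffers moved to the host block."""
    hoisted, rewritten = [], []
    for block in blocks:
        if block["name"] == HOST_MAIN_BLOCK_NAME:
            rewritten.append(dict(block))
            continue
        kept = [b for b in block["alloc_buffers"] if b["scope"] != "global"]
        hoisted += [b for b in block["alloc_buffers"] if b["scope"] == "global"]
        rewritten.append({**block, "alloc_buffers": kept})
    for block in rewritten:
        if block["name"] == HOST_MAIN_BLOCK_NAME:
            # Single append after collection, matching the reviewed fix below.
            block["alloc_buffers"] = list(block["alloc_buffers"]) + hoisted
    return rewritten
```

On a module with one host block and one device block, the device block keeps only its non-global buffers and the host block picks up the hoisted workspace.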

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related Issues

Suggested Reviewers

  • LeiWang1999

Poem

🐰 A hopper's delight in allocation's might!
Global buffers dance from left to right,
Host and device now clearly named,
With eager JIT, the goal is claimed!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 5.80%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title '[Feature] Support alloc global workspace' directly and clearly describes the main feature addition, global memory allocation support via T.alloc_global.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@SiriusNEO SiriusNEO marked this pull request as ready for review March 23, 2026 08:26
@SiriusNEO SiriusNEO force-pushed the chaofan/global_0310 branch from b80f614 to d7c5f07 on March 23, 2026 08:31
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tilelang/language/allocate.py (1)

329-347: Consider enforcing scope="global" in alloc_global.

Because hoisting/handling is keyed on global scope, allowing arbitrary scope here can silently defeat the intended workspace path.

💡 Suggested guard
 def alloc_global(shape: ShapeType, dtype: DType, scope="global") -> Buffer:
@@
-    return T.alloc_buffer(shape, dtype, scope=scope)
+    if scope != "global":
+        raise ValueError("alloc_global only supports scope='global'.")
+    return T.alloc_buffer(shape, dtype, scope=scope)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/language/allocate.py` around lines 329 - 347, The alloc_global
function currently accepts a scope parameter which can be set to non-"global"
and silently bypass the intended global workspace path; update alloc_global (the
function defined at top of file) to enforce scope="global" by either ignoring
the passed scope and always calling T.alloc_buffer(..., scope="global") or by
validating the incoming scope and raising an error if it's not "global"; ensure
the change references alloc_global and the call to T.alloc_buffer so callers
can't silently provide a different scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 53-60: The loop over op->alloc_buffers currently assigns
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers) inside each
iteration which clobbers previously collected non-global buffers; move that
assignment out of the for loop so you accumulate all non-global buffers into
new_alloc_buffers and then perform a single move-assignment to
node.CopyOnWrite()->alloc_buffers after the loop, preserving global_buffers_
updates and using IsGlobalBuffer(buf) to decide where each buf goes.

In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 234-236: The test function test_alloc_global is missing the CUDA
requirement decorator; add the `@tilelang.testing.requires_cuda` decorator above
the test_alloc_global definition so the test is skipped on non-CUDA
systems—apply the same decorator pattern used on other CUDA-dependent tests (see
functions/run calls run_alloc_global and run_alloc_global_eagerjit) to ensure
the test only runs when CUDA is available.
- Around line 239-241: Restore the module-level test runner by replacing the
manual call to test_alloc_global() with a call to tilelang.testing.main() so
running the file executes the full test suite; specifically, in the if __name__
== "__main__": block remove or uncomment the commented-out
tilelang.testing.main() and remove the direct test_alloc_global() invocation
(leave test functions like test_alloc_global defined but invoked by
tilelang.testing.main()).

In `@tilelang/engine/phase.py`:
- Around line 243-250: Remove the leftover debug prints that dump IR to stdout:
delete the unconditional print(mod) calls found near the transform pipeline
(after tilelang.transform.IfStmtBinding, after
PlanAndUpdateBufferAllocationLocation and the final print before/after
HoistGlobalBufferAllocations). Keep the sequence of transforms (IfStmtBinding,
PlanAndUpdateBufferAllocationLocation, PipelinePlanning, InjectSoftwarePipeline,
HoistGlobalBufferAllocations) unchanged; if you need retained debug visibility
use a conditional logger.debug call instead of print().


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0be493e-915e-43a3-b0da-31476784ea8c

📥 Commits

Reviewing files that changed from the base of the PR and between fee041c and b80f614.

📒 Files selected for processing (12)
  • src/ir.cc
  • src/target/codegen_c_host.cc
  • src/transform/common/attr.h
  • src/transform/hoist_global_buffer_allocations.cc
  • src/transform/inject_assumes.cc
  • src/transform/inject_tma_barrier.cc
  • src/transform/plan_update_buffer_allocation_location.cc
  • testing/python/language/test_tilelang_language_alloc.py
  • tilelang/engine/phase.py
  • tilelang/language/__init__.py
  • tilelang/language/allocate.py
  • tilelang/transform/__init__.py

Comment thread src/transform/hoist_global_buffer_allocations.cc
Comment thread testing/python/language/test_tilelang_language_alloc.py
Comment thread testing/python/language/test_tilelang_language_alloc.py Outdated
Comment thread tilelang/engine/phase.py Outdated
@SiriusNEO SiriusNEO changed the title from [WIP][Feature] Support alloc global workspace to [Feature] Support alloc global workspace on Mar 23, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (4)
src/transform/hoist_global_buffer_allocations.cc (1)

53-60: ⚠️ Potential issue | 🔴 Critical

Move alloc-buffer assignment out of the loop.

At Line [59], assigning alloc_buffers inside each iteration repeatedly overwrites state and can drop earlier non-global buffers. Assign once after the loop finishes collecting.

🛠️ Proposed fix
       ffi::Array<Buffer> new_alloc_buffers;
       for (const auto &buf : op->alloc_buffers) {
         if (IsGlobalBuffer(buf)) {
           global_buffers_.push_back(buf);
         } else {
           new_alloc_buffers.push_back(buf);
         }
-        node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
       }
+      node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/hoist_global_buffer_allocations.cc` around lines 53 - 60, The
loop currently assigns node.CopyOnWrite()->alloc_buffers =
std::move(new_alloc_buffers) inside each iteration, which overwrites prior state
and can drop previously collected non-global buffers; instead, iterate over
op->alloc_buffers, push globals into global_buffers_ and non-globals into
new_alloc_buffers (using IsGlobalBuffer, global_buffers_, new_alloc_buffers),
then after the loop completes perform a single assignment
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers) so allocation
buffer replacement happens once.
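The effect of the in-loop move can be reproduced in a few lines of Python by emulating the moved-from (emptied) vector: partition_buggy keeps only the last non-global entry, while partition_fixed assigns once after the loop and keeps them all. Names here are illustrative, not the actual pass code.

```python
def partition_buggy(bufs, is_global):
    kept, globals_, result = [], [], []
    for b in bufs:
        (globals_ if is_global(b) else kept).append(b)
        # Emulates `alloc_buffers = std::move(new_alloc_buffers)` inside
        # the loop: the destination takes the list, the source is emptied.
        result, kept = kept, []
    return result, globals_

def partition_fixed(bufs, is_global):
    kept, globals_ = [], []
    for b in bufs:
        (globals_ if is_global(b) else kept).append(b)
    # Single hand-off after the loop, as in the proposed fix.
    return kept, globals_
```

With buffers ["a_local", "b_global", "c_local"], the buggy version retains only "c_local", dropping "a_local" exactly as the review describes.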
testing/python/language/test_tilelang_language_alloc.py (2)

239-241: ⚠️ Potential issue | 🟡 Minor

Restore full module test runner in __main__.

Running only test_alloc_global() hides failures in other tests when executing this file directly.

🔧 Proposed fix
 if __name__ == "__main__":
-    # tilelang.testing.main()
-    test_alloc_global()
+    tilelang.testing.main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/language/test_tilelang_language_alloc.py` around lines 239 -
241, Restore the module-level test runner: replace the direct call to
test_alloc_global() in the __main__ block with a call to tilelang.testing.main()
so running the file executes the full test suite; specifically, remove or stop
calling test_alloc_global() directly and uncomment or re-enable
tilelang.testing.main() in the if __name__ == "__main__": block.

234-236: ⚠️ Potential issue | 🟡 Minor

Add CUDA guard on the new global-allocation test.

run_alloc_global and run_alloc_global_eagerjit allocate CUDA tensors, so this test should be skipped on non-CUDA environments.

🔧 Proposed fix
+@tilelang.testing.requires_cuda
 def test_alloc_global():
     run_alloc_global(1024, 128, T.float16)
     run_alloc_global_eagerjit(1024, 128, T.float16)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/language/test_tilelang_language_alloc.py` around lines 234 -
236, The test test_alloc_global calls run_alloc_global and
run_alloc_global_eagerjit which allocate CUDA tensors and must be skipped when
CUDA isn't available; add a CUDA guard by decorating test_alloc_global with a
pytest skipif checking CUDA availability (e.g., pytest.mark.skipif(not
torch.cuda.is_available(), reason="requires CUDA")) or perform an early runtime
check inside test_alloc_global and call pytest.skip when CUDA is absent, so the
test only runs in CUDA environments.
tilelang/engine/phase.py (1)

249-255: ⚠️ Potential issue | 🟠 Major

Remove unconditional IR debug prints from the optimization pipeline.

Lines [249], [251], and [255] print full IR modules on normal execution and will spam stdout and slow compilation workflows.

🔧 Proposed fix
     else:
         mod = tilelang.transform.LowerSharedBarrier()(mod)
         mod = tilelang.transform.IfStmtBinding()(mod)
-        print(mod)
         mod = tilelang.transform.PlanAndUpdateBufferAllocationLocation()(mod)
-        print(mod)
         mod = tilelang.transform.PipelinePlanning()(mod)
         mod = tilelang.transform.InjectSoftwarePipeline()(mod)
     mod = tilelang.transform.HoistGlobalBufferAllocations()(mod)
-    print(mod)
     mod = tilelang.transform.LowerOpaqueBlock()(mod)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/engine/phase.py` around lines 249 - 255, Remove the unconditional
print calls that dump IR around the optimization pipeline: the prints
immediately before/after PlanAndUpdateBufferAllocationLocation, after
PipelinePlanning/InjectSoftwarePipeline, and after HoistGlobalBufferAllocations;
instead either delete them or gate them behind a verbose/debug flag or logger
check (e.g., an opt-level or tilelang.debug flag) so IR is only printed when
explicit debugging is enabled; locate the prints in the block where
tilelang.transform.PlanAndUpdateBufferAllocationLocation,
tilelang.transform.PipelinePlanning, tilelang.transform.InjectSoftwarePipeline,
and tilelang.transform.HoistGlobalBufferAllocations are invoked and remove/guard
those print(mod) calls.
🧹 Nitpick comments (4)
src/transform/inject_tma_barrier.cc (4)

142-142: Missing final specifier on virtual method override.

VisitStmt_(const IfThenElseNode *op) should have final keyword consistent with other overrides in this file.

Proposed fix
-  Stmt VisitStmt_(const IfThenElseNode *op) {
+  Stmt VisitStmt_(const IfThenElseNode *op) final {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tma_barrier.cc` at line 142, The override
VisitStmt_(const IfThenElseNode *op) is missing the final specifier; update its
declaration to mark it as final (e.g., change the method signature in the
class/visitor implementation to include final) so it matches the other
overridden VisitStmt_ methods in this file and prevents further overrides.

175-175: Missing final specifier on virtual method override.

VisitExpr_(const CallNode *op) should have final keyword.

Proposed fix
-  PrimExpr VisitExpr_(const CallNode *op) {
+  PrimExpr VisitExpr_(const CallNode *op) final {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tma_barrier.cc` at line 175, The override of the virtual
method VisitExpr_(const CallNode *op) must be marked final to prevent further
overrides; locate the VisitExpr_(const CallNode *op) implementation in
inject_tma_barrier.cc (the class that defines this visitor) and add the final
specifier to its declaration/definition so it reads as an override marked final
(i.e., VisitExpr_(const CallNode *op) final), ensuring the signature matches the
base virtual method and keeping any existing override keyword if present.

200-208: Redundant assignment after equality check.

Line 205 always executes the same Set operation regardless of whether the extent matches or differs. Since line 203 ICHECKs for equality, the assignment on line 205 is redundant when the entry exists.

Proposed simplification
   void UpdateBarrierRange(const PrimExpr &barrier_id, const IntImm &extent) {
     if (barrier_id_to_range_.count(barrier_id)) {
       auto old_extent = barrier_id_to_range_[barrier_id];
       ICHECK_EQ(old_extent->value, extent->value)
           << "barrier_id: " << barrier_id << " has different extent";
-      barrier_id_to_range_.Set(barrier_id, extent);
+      // Already exists with same extent, no update needed
     } else {
       barrier_id_to_range_.Set(barrier_id, extent);
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tma_barrier.cc` around lines 200 - 208, The
UpdateBarrierRange function currently calls barrier_id_to_range_.Set(barrier_id,
extent) both inside the if branch after ICHECK_EQ and again in the else branch,
making the Set in the if branch redundant; update UpdateBarrierRange to only
perform the Set when the key is absent (keep the ICHECK_EQ to assert equality
when the key exists) so that barrier_id_to_range_.Set is only called in the else
path, referencing the UpdateBarrierRange method and the barrier_id_to_range_ map
to locate the change.
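The control flow the comment asks for is a plain insert-or-check: set only when the key is absent, and only assert equality when it already exists. A Python stand-in with a dict in place of the TVM Map:

```python
def update_barrier_range(ranges, barrier_id, extent):
    # Insert-or-check: Set only when the key is absent; when it exists,
    # assert the recorded extent matches instead of redundantly re-setting it.
    if barrier_id in ranges:
        if ranges[barrier_id] != extent:
            raise AssertionError(
                f"barrier_id: {barrier_id} has different extent")
    else:
        ranges[barrier_id] = extent
    return ranges
```

Repeated updates with the same extent are no-ops, and a conflicting extent fails the equality check, mirroring the ICHECK_EQ in the C++ code.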

654-658: Return value type could be more explicit.

Returning 0 (integer literal) works via implicit conversion to PrimExpr, but returning an explicit IntImm or using make_zero() would be clearer and more consistent with TVM IR conventions.

Proposed fix
       if (clear_arrive_) {
         clear_arrive_ = false;
-        return 0;
+        return IntImm(DataType::Int(32), 0);
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/inject_tma_barrier.cc` around lines 654 - 658, The code returns
a plain integer literal (return 0;) when handling the ptx_arrive_barrier case;
replace this with an explicit PrimExpr zero to follow TVM IR conventions—use
make_zero(op->dtype) (or construct an IntImm with the correct dtype) in place of
0 so the return is an explicit PrimExpr, e.g., change the return in the block
guarded by op->op.same_as(builtin::ptx_arrive_barrier()) and clear_arrive_ to
return make_zero(op->dtype).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/transform/inject_tma_barrier.cc`:
- Around line 20-23: Update the file-level comment header that currently says
"tma_barrier_rewriter.cc" to the correct filename "inject_tma_barrier.cc" so the
documentation matches the actual file; locate the top-of-file comment block (the
/*! ... */ header) and modify the filename token and any related brief
description if necessary to reference inject_tma_barrier.cc and ensure the file
header and brief line are consistent with the current source (e.g., the opening
comment that includes "tma_barrier_rewriter.cc" and "Rewrite TMA barriers for
cuda GPU (sm90+)").
- Around line 93-96: The local variable name `old_loop_evtents` is a typo and
should be `old_loop_extents`; update the declaration and the restoration
assignment so the code stores `loop_extents` into `old_loop_extents` before
calling `StmtExprVisitor::VisitStmt_(op)` and then assigns `loop_extents =
old_loop_extents` afterward, making the change where `loop_extents` is
saved/restored in the block around the VisitStmt_ call.
- Around line 149-170: When visited_tma_load_ is true, you must use the mutated
branches from ret rather than the original op->then_case/op->else_case so child
mutations (from VisitExpr_ that set barrier IDs) are preserved; extract the
IfThenElse node from ret (result of IRMutatorWithAnalyzer::VisitStmt_(op)) and
use its then_case/else_case with TmaTraitsCollector and makeExpectTX/SeqStmt to
construct the replacement IfThenElse, and still reset visited_tma_load_ = false
before returning.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e47f74fc-c588-4f44-b0b9-1e9b4d695e6a

📥 Commits

Reviewing files that changed from the base of the PR and between b80f614 and d7c5f07.

📒 Files selected for processing (12)
  • src/ir.cc
  • src/target/codegen_c_host.cc
  • src/transform/common/attr.h
  • src/transform/hoist_global_buffer_allocations.cc
  • src/transform/inject_assumes.cc
  • src/transform/inject_tma_barrier.cc
  • src/transform/plan_update_buffer_allocation_location.cc
  • testing/python/language/test_tilelang_language_alloc.py
  • tilelang/engine/phase.py
  • tilelang/language/__init__.py
  • tilelang/language/allocate.py
  • tilelang/transform/__init__.py
✅ Files skipped from review due to trivial changes (1)
  • src/target/codegen_c_host.cc
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/transform/inject_assumes.cc
  • src/ir.cc
  • tilelang/language/__init__.py
  • src/transform/plan_update_buffer_allocation_location.cc
  • tilelang/language/allocate.py
  • tilelang/transform/__init__.py
  • src/transform/common/attr.h

Comment thread src/transform/inject_tma_barrier.cc Outdated
Comment thread src/transform/inject_tma_barrier.cc Outdated
Comment thread src/transform/inject_tma_barrier.cc Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/engine/phase.py (1)

243-253: ⚠️ Potential issue | 🟡 Minor

Document or remove the redundant LowerOpaqueBlock() call in the TMA path.

LowerOpaqueBlock() is called at line 243 inside the allow_tma_lower branch and again at line 253 unconditionally. Since LowerOpaqueBlock is idempotent (it converts BlockRealize nodes and subsequent passes don't reintroduce them), the second call at line 253 is a no-op for the TMA path, creating an asymmetry where:

  • TMA path applies the pass twice (redundant)
  • Non-TMA path applies it once

Either add a clarifying comment explaining why the TMA path requires an additional pre-pass application, or move line 253 into the else block to match the non-TMA structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/engine/phase.py` around lines 243 - 253, The LowerOpaqueBlock() pass
is applied twice on the TMA path (once inside the allow_tma_lower/is_hopper
branch and again unconditionally), which is redundant; remove the earlier call
to tilelang.transform.LowerOpaqueBlock() inside the allow_tma_lower / is_hopper
branch so the pass runs exactly once (the remaining unconditional
tilelang.transform.LowerOpaqueBlock()(mod) stays), or if the TMA path truly
requires a pre-application, instead keep both but add a clear comment next to
the in-branch call explaining why the extra pre-pass is required; reference the
LowerOpaqueBlock pass and the is_hopper/allow_tma_lower branch to locate the
change.
♻️ Duplicate comments (1)
src/transform/hoist_global_buffer_allocations.cc (1)

51-61: ⚠️ Potential issue | 🔴 Critical

Critical: std::move inside the loop causes buffer loss.

The assignment at line 59 is inside the for loop. After the first std::move, new_alloc_buffers is in a moved-from state (empty). Subsequent iterations push buffers to this empty vector, then move again—losing all previously collected non-global buffers.

Only the buffer from the last iteration will be retained in alloc_buffers.

🐛 Proposed fix
     } else {
       ffi::Array<Buffer> new_alloc_buffers;
       for (const auto &buf : op->alloc_buffers) {
         if (IsGlobalBuffer(buf)) {
           global_buffers_.push_back(buf);
         } else {
           new_alloc_buffers.push_back(buf);
         }
-        node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
       }
+      node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/hoist_global_buffer_allocations.cc` around lines 51 - 61, The
assignment to node.CopyOnWrite()->alloc_buffers uses
std::move(new_alloc_buffers) inside the for loop which moves away the vector on
the first iteration and causes loss of previously collected buffers; change the
logic so you only build new_alloc_buffers by iterating op->alloc_buffers and
pushing non-global buffers (using IsGlobalBuffer and pushing globals into
global_buffers_), then after the loop perform a single
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers); to preserve
all non-global buffers.
🧹 Nitpick comments (1)
src/transform/hoist_global_buffer_allocations.cc (1)

69-74: Consider returning fptr or using func directly.

HoistGlobalBufferAllocations calls func.CopyOnWrite() (obtaining fptr), mutates via fptr->body, but returns func. This works because CopyOnWrite() mutates func in-place when it has a unique reference. However, for clarity, consider returning fptr wrapped back into a PrimFunc or simply returning func after the mutation without storing fptr if it's not needed elsewhere.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/transform/hoist_global_buffer_allocations.cc` around lines 69 - 74, The
function HoistGlobalBufferAllocations creates a copy-on-write pointer fptr via
func.CopyOnWrite(), mutates fptr->body using GlobalBufferAllocationsHoister, but
then returns the original func which is confusing; either return the modified
fptr wrapped as a PrimFunc or avoid creating fptr at all and operate on func
directly. Fix by: if you need the copy-on-write pointer, return fptr (converted
to PrimFunc as in the surrounding code style) after assigning fptr->body, or if
you never use fptr elsewhere, remove fptr and call func.CopyOnWrite() inline or
mutate func directly so the return value clearly reflects the mutated object
(update HoistGlobalBufferAllocations to return the same object you mutated).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tilelang/engine/phase.py`:
- Around line 243-253: The LowerOpaqueBlock() pass is applied twice on the TMA
path (once inside the allow_tma_lower/is_hopper branch and again
unconditionally), which is redundant; remove the earlier call to
tilelang.transform.LowerOpaqueBlock() inside the allow_tma_lower / is_hopper
branch so the pass runs exactly once (the remaining unconditional
tilelang.transform.LowerOpaqueBlock()(mod) stays), or if the TMA path truly
requires a pre-application, instead keep both but add a clear comment next to
the in-branch call explaining why the extra pre-pass is required; reference the
LowerOpaqueBlock pass and the is_hopper/allow_tma_lower branch to locate the
change.

---

Duplicate comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 51-61: The assignment to node.CopyOnWrite()->alloc_buffers uses
std::move(new_alloc_buffers) inside the for loop which moves away the vector on
the first iteration and causes loss of previously collected buffers; change the
logic so you only build new_alloc_buffers by iterating op->alloc_buffers and
pushing non-global buffers (using IsGlobalBuffer and pushing globals into
global_buffers_), then after the loop perform a single
node.CopyOnWrite()->alloc_buffers = std::move(new_alloc_buffers); to preserve
all non-global buffers.

---

Nitpick comments:
In `@src/transform/hoist_global_buffer_allocations.cc`:
- Around line 69-74: The function HoistGlobalBufferAllocations creates a
copy-on-write pointer fptr via func.CopyOnWrite(), mutates fptr->body using
GlobalBufferAllocationsHoister, but then returns the original func which is
confusing; either return the modified fptr wrapped as a PrimFunc or avoid
creating fptr at all and operate on func directly. Fix by: if you need the
copy-on-write pointer, return fptr (converted to PrimFunc as in the surrounding
code style) after assigning fptr->body, or if you never use fptr elsewhere,
remove fptr and call func.CopyOnWrite() inline or mutate func directly so the
return value clearly reflects the mutated object (update
HoistGlobalBufferAllocations to return the same object you mutated).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db7c9cc0-e416-45e2-b575-d3676505af77

📥 Commits

Reviewing files that changed from the base of the PR and between d7c5f07 and bf782b6.

📒 Files selected for processing (4)
  • src/transform/common/attr.h
  • src/transform/hoist_global_buffer_allocations.cc
  • src/transform/inject_assumes.cc
  • tilelang/engine/phase.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/transform/common/attr.h

Contributor

@coderabbitai coderabbitai Bot left a comment
♻️ Duplicate comments (1)
testing/python/language/test_tilelang_language_alloc.py (1)

234-236: ⚠️ Potential issue | 🟡 Minor

Add @tilelang.testing.requires_cuda decorator.

This test creates CUDA tensors (device="cuda") but lacks the @tilelang.testing.requires_cuda decorator that other CUDA-dependent tests in this file use (e.g., lines 120-121, 161-162). The test will fail on systems without CUDA.

🐛 Proposed fix
```diff
+@tilelang.testing.requires_cuda
 def test_alloc_global():
     run_alloc_global(1024, 128, T.float16)
     run_alloc_global_eagerjit(1024, 128, T.float16)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/language/test_tilelang_language_alloc.py` around lines 234 -
236, The test function test_alloc_global creates CUDA tensors but lacks the
CUDA-only guard; add the same decorator used by other CUDA-dependent tests
(tilelang.testing.requires_cuda) above test_alloc_global to skip it on non-CUDA
systems so calls like run_alloc_global(...) and run_alloc_global_eagerjit(...)
only run when CUDA is available.
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_alloc.py (1)

191-197: Consider removing commented-out debug code.

Lines 192-194 contain commented-out print statements that appear to be leftover from development. Consider removing them before merging to keep the codebase clean.

♻️ Proposed cleanup
```diff
     kernel = tilelang.compile(program, out_idx=[1])
-    # print(kernel.get_host_source())
-    # code = kernel.get_kernel_source()
-    # print(code)
     A = torch.randn(N, device="cuda", dtype=getattr(torch, dtype))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@testing/python/language/test_tilelang_language_alloc.py` around lines 191 -
197, Remove the leftover commented-out debug prints in the test: delete or
uncomment the three print lines referencing kernel.get_host_source() and
kernel.get_kernel_source() in the test_tilelang_language_alloc.py block around
the kernel creation and invocation (symbols: kernel.get_host_source,
kernel.get_kernel_source, print). If you want to preserve them for future
debugging, wrap them behind a conditional debug flag or a pytest -s option
instead of leaving them commented out.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 234-236: The test function test_alloc_global creates CUDA tensors
but lacks the CUDA-only guard; add the same decorator used by other
CUDA-dependent tests (tilelang.testing.requires_cuda) above test_alloc_global to
skip it on non-CUDA systems so calls like run_alloc_global(...) and
run_alloc_global_eagerjit(...) only run when CUDA is available.

---

Nitpick comments:
In `@testing/python/language/test_tilelang_language_alloc.py`:
- Around line 191-197: Remove the leftover commented-out debug prints in the
test: delete or uncomment the three print lines referencing
kernel.get_host_source() and kernel.get_kernel_source() in the
test_tilelang_language_alloc.py block around the kernel creation and invocation
(symbols: kernel.get_host_source, kernel.get_kernel_source, print). If you want
to preserve them for future debugging, wrap them behind a conditional debug flag
or a pytest -s option instead of leaving them commented out.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c632d36b-dba9-4d73-965b-ac4e201b2a7e

📥 Commits

Reviewing files that changed from the base of the PR and between bf782b6 and 35c353a.

📒 Files selected for processing (2)
  • src/transform/hoist_global_buffer_allocations.cc
  • testing/python/language/test_tilelang_language_alloc.py

@SiriusNEO SiriusNEO merged commit d0090b4 into tile-ai:main Mar 23, 2026
8 of 10 checks passed