[tx] Implement cutlass kernel for ragged_dot with group_offset #896
base: main
Conversation
Code Review
This pull request introduces a custom CUTLASS kernel for ragged_dot with group_offset to improve performance, which is a great addition. The implementation includes the CUDA kernel, Python FFI bindings, and integration into the existing ragged_dot utility.
My review has found a few issues:
- A critical bug in the CUDA kernel's pointer arithmetic that will lead to incorrect results.
- A potential thread-safety issue in the device property caching function.
- An error in the build instructions in the README file.
I've provided detailed comments and suggestions for each of these points. Once these are addressed, this will be a solid performance enhancement.
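For context, here is a rough reference of what ragged_dot with group_offset computes, written as a plain C++ loop. The shapes (lhs [m_total, k], rhs [g_local, k, n], both row-major, with group_offset naming the first global group covered by the local rhs shard) and the output row indexing are inferred from the review comments below rather than spelled out by the PR, and the function and parameter names are illustrative only.

```
#include <cstdint>

// Assumed reference semantics (not taken verbatim from the PR): each global
// group g owns group_sizes[g] consecutive rows of lhs; the local rhs shard
// holds the [k, n] matrices for groups group_offset .. group_offset+g_local-1.
void ragged_dot_reference(const float* lhs, const float* rhs,
                          const int32_t* group_sizes, int32_t group_offset,
                          int32_t g_local, int64_t k, int64_t n, float* out) {
  int64_t row = 0;
  for (int32_t g = 0; g < group_offset; ++g) row += group_sizes[g];  // skip groups before the shard
  for (int32_t g = 0; g < g_local; ++g) {
    const int64_t m = group_sizes[group_offset + g];
    const float* A = lhs + row * k;                           // this group's rows of lhs
    const float* B = rhs + static_cast<int64_t>(g) * k * n;   // this group's [k, n] matrix
    for (int64_t i = 0; i < m; ++i)
      for (int64_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (int64_t kk = 0; kk < k; ++kk) acc += A[i * k + kk] * B[kk * n + j];
        out[(row + i) * n + j] = acc;
      }
    row += m;
  }
}
```

Under that reading, the kernel's A_ptrs/B_ptrs arrays hold one pointer per local group (the start of the group's lhs row block and of its rhs matrix), which is the usual way a CUTLASS grouped GEMM is driven, replacing the loops above with a single launch.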
skyrl-tx/tx/ffi/ragged_dot_ffi.cu
Outdated
```
int32_t m = group_offsets_cumsum[global] - start;

A_ptrs[tid] = A + static_cast<int64_t>(start) * k;
B_ptrs[tid] = B + static_cast<int64_t>(tid) * n * k;
```
There appears to be a bug in the pointer arithmetic for B_ptrs. The rhs tensor has dimensions [g_local, k, n]. To get the pointer to the tid-th matrix, the offset should be tid * k * n. The current code uses tid * n * k, which will lead to incorrect memory access and wrong results.
```
B_ptrs[tid] = B + static_cast<int64_t>(tid) * k * n;
```
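For reference, a minimal sketch of the row-major indexing this comment relies on; `rhs_offset` is an illustrative helper, not part of the PR:

```
#include <cstdint>

// For a row-major tensor of shape [g_local, k, n], element rhs[g][i][j] sits
// at this linear offset, so the g-th [k, n] matrix starts g * k * n elements
// past the base pointer.
inline int64_t rhs_offset(int64_t g, int64_t i, int64_t j, int64_t k, int64_t n) {
  return g * k * n + i * n + j;
}
```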
skyrl-tx/tx/ffi/README.md
Outdated
1) Build the shared library (requires CUDA nvcc with C++17 support):
```
export CUTLASS_DIR=/path/to/cutlass
export NVCC_ARCH=sm_90a # for H100, adjust if needed
```
The instruction for setting NVCC_ARCH is inconsistent with how it's used in the build script. The script prepends sm_ to the NVCC_ARCH variable. Therefore, you should set NVCC_ARCH to the architecture number (e.g., 90a) instead of the full sm_ string (e.g., sm_90a).
```diff
-export NVCC_ARCH=sm_90a # for H100, adjust if needed
+export NVCC_ARCH=90a # for H100, adjust if needed
```
```
static int get_sm_count() {
  int device = 0;
  if (cudaGetDevice(&device) != cudaSuccess || device < 0) {
    return 0;
  }
  if (static_cast<size_t>(device) >= g_device_props.size()) {
    g_device_props.resize(device + 1);
  }
  cudaDeviceProp& props = g_device_props[device];
  if (!props.multiProcessorCount) {
    cudaGetDeviceProperties(&props, device);
  }
  return props.multiProcessorCount;
}
```
The get_sm_count function uses a global static std::vector (g_device_props) to cache device properties. This is not thread-safe. If multiple host threads call this function concurrently, it could lead to a race condition when resizing the vector. Using a static local mutex will make it thread-safe.
Please also add #include <mutex> at the top of the file.
```
static int get_sm_count() {
  static std::mutex mtx;
  std::lock_guard<std::mutex> lock(mtx);
  int device = 0;
  if (cudaGetDevice(&device) != cudaSuccess || device < 0) {
    return 0;
  }
  if (static_cast<size_t>(device) >= g_device_props.size()) {
    g_device_props.resize(device + 1);
  }
  cudaDeviceProp& props = g_device_props[device];
  if (!props.multiProcessorCount) {
    cudaGetDeviceProperties(&props, device);
  }
  return props.multiProcessorCount;
}
```
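A lighter-weight alternative, sketched here only as a design note rather than a suggested change: if the FFI call only ever queries the device that is current the first time it runs, the SM count can be cached in a function-local static, whose initialization C++11 guarantees to be thread-safe, so no per-call lock is needed. `get_sm_count_once` is a hypothetical name.

```
static int get_sm_count_once() {
  // C++11 "magic statics": the lambda runs exactly once, even with
  // concurrent callers; later calls just read the cached value.
  static const int sm_count = [] {
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess || device < 0) return 0;
    cudaDeviceProp props{};
    if (cudaGetDeviceProperties(&props, device) != cudaSuccess) return 0;
    return props.multiProcessorCount;
  }();
  return sm_count;
}
```

The trade-off is that this bakes in whichever device is current on first use, whereas the mutex version above handles multiple devices.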
This brings down the step time from 40s to 30s.