Add grouped linear layer with strided BMM optimization #263

z52527 wants to merge 8 commits into NVIDIA:main
Conversation
Greptile Summary

This PR adds a high-performance grouped linear layer implementation using strided batched matrix multiplication (BMM). The optimization achieves a 1.46x forward speedup and a 1.41x forward+backward speedup on H100 by fusing multiple group-wise GEMM operations into a single cuBLAS kernel call.

Key changes: a simple grouped linear path and a gated grouped MLP path (gate/up projections plus a fused SiLU*up Triton kernel), as laid out in the sequence diagram below.

Technical approach: reshape the input to (B, N, D_in) and run all group-wise GEMMs through one cuBLAS strided batched GEMM, writing directly into the target memory layout via strided output.

Critical issue: hardcoded absolute sys.path entries in the test files (see the review comments below).

Confidence Score: 4/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Input as Input Tensor<br/>(B*N, D_in)
    participant Reshape as Reshape Layer
    participant BMM as Strided BMM
    participant Triton as Triton Kernel<br/>(SiLU*Up)
    participant Output as Output Tensor<br/>(B*N, D_out)

    Note over Input,Output: Grouped Linear (Simple)
    Input->>Reshape: (B*N, D_in)
    Reshape->>BMM: (B, N, D_in)
    BMM->>BMM: x @ W[n] for each group
    Note over BMM: Single cuBLAS kernel<br/>with strided output
    BMM->>Output: (B*N, D_out)

    Note over Input,Output: Grouped MLP (Gated)
    Input->>Reshape: (B*N, D_in)
    Reshape->>BMM: (B, N, D_in)
    par Parallel Gate & Up Projections
        BMM->>BMM: gate = x @ gate_W[n]
    and
        BMM->>BMM: up = x @ up_W[n]
    end
    BMM->>Triton: gate, up tensors
    Triton->>Triton: silu(gate) * up
    Note over Triton: Fused activation<br/>in single kernel
    Triton->>BMM: hidden
    BMM->>BMM: output = hidden @ down_W[n]
    BMM->>Output: (B*N, D_out)
```
Last reviewed commit: e2bcf47
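The silu(gate) * up step in the diagram is a single elementwise Triton kernel. The PR's kernel is not reproduced in this excerpt; below is a minimal sketch of that fusion, with all names my assumptions, assuming contiguous gate/up tensors of equal shape:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _silu_mul_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    g = tl.load(gate_ptr + offs, mask=mask).to(tl.float32)
    u = tl.load(up_ptr + offs, mask=mask).to(tl.float32)
    y = g * tl.sigmoid(g) * u  # silu(gate) * up in one pass, no intermediate tensor
    tl.store(out_ptr + offs, y.to(out_ptr.dtype.element_ty), mask=mask)

def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(gate)
    n = gate.numel()
    _silu_mul_kernel[(triton.cdiv(n, 1024),)](gate, up, out, n, BLOCK=1024)
    return out
```

Compared with `torch.nn.functional.silu(gate) * up`, this avoids materializing the activation before the multiply, which is the "fused activation in single kernel" step in the diagram.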
Review comment (on the flagged lines):

```python
sys.path.insert(
    0, "/home/scratch.runchuz_gpu/repos-github/recsys-examples/examples/hstu"
)
```

Hardcoded absolute path to the user's home directory will break for other users or environments.

Suggested change:

```diff
-sys.path.insert(
-    0, "/home/scratch.runchuz_gpu/repos-github/recsys-examples/examples/hstu"
-)
+# Calculate relative path from this file to examples/hstu
+import os
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "../../hstu"))
```
Problem
Apply num_groups different linear transformations to corresponding slices of input:
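Concretely, with shape names that are my assumptions from the sequence diagram (N = num_groups, one contiguous slice of the input per group), the semantics can be stated in one einsum:

```python
import torch

B, N, D_in, D_out = 4, 12, 1024, 3072   # N = num_groups (assumed names)
x = torch.randn(B, N, D_in)             # one (B, D_in) slice per group
W = torch.randn(N, D_in, D_out)         # one weight matrix per group

# For every group n: out[:, n, :] = x[:, n, :] @ W[n]
out = torch.einsum("bnd,nde->bne", x, W)  # (B, N, D_out)
```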
Reference Implementation
The straightforward approach uses a loop over groups:
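The PR's reference code is not shown in this excerpt; a minimal sketch of the looped baseline under the same assumed shapes, costing one GEMM kernel launch per group:

```python
import torch

def grouped_linear_ref(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: (B, N, D_in), weight: (N, D_in, D_out); N = num_groups
    B, N, D_in = x.shape
    out = x.new_empty(B, N, weight.shape[-1])
    for n in range(N):  # N separate cuBLAS calls
        out[:, n, :] = x[:, n, :] @ weight[n]
    return out
```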
Optimized Implementation
Use torch.bmm with strided output to fuse all GEMMs into one kernel:
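A minimal sketch of the idea, as an assumption about the PR's internals rather than a copy of them. The transposed views are stride tricks, not copies; they carry the strides that cuBLAS consumes as leading-dimension and batch-stride parameters. If a given PyTorch build declines the strided `out=` view, it falls back to an extra copy rather than producing wrong results:

```python
import torch

def grouped_linear_bmm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: (B, N, D_in) contiguous, weight: (N, D_in, D_out); N = num_groups
    B, N, D_in = x.shape
    out = x.new_empty(B, N, weight.shape[-1])  # desired final layout
    # Batch over groups via views:
    #   (N, B, D_in) @ (N, D_in, D_out) -> (N, B, D_out),
    # written straight into the transposed view of `out`, i.e. one
    # strided batched GEMM instead of N separate kernel launches.
    torch.bmm(x.transpose(0, 1), weight, out=out.transpose(0, 1))
    return out
```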
Key feature: cuBLAS strided batched GEMM supports strided output via ldc/strideC parameters, allowing direct write to the transposed memory layout.
Performance Results

Config: batch_size=2560, num_groups=12, input_dim=1024, output_dim=3072, dtype=bf16

Device: NVIDIA H100: 1.46x forward speedup, 1.41x forward+backward speedup (per the review summary above)
Device: NVIDIA A100
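A hypothetical harness for reproducing the comparison (not the PR's benchmark script): it reuses the grouped_linear_ref / grouped_linear_bmm sketches above and assumes batch_size maps to B in the shapes used there:

```python
import torch
from torch.utils.benchmark import Timer

B, N, D_in, D_out = 2560, 12, 1024, 3072  # config from above
x = torch.randn(B, N, D_in, device="cuda", dtype=torch.bfloat16)
W = torch.randn(N, D_in, D_out, device="cuda", dtype=torch.bfloat16)

for fn in (grouped_linear_ref, grouped_linear_bmm):
    t = Timer("fn(x, W)", globals={"fn": fn, "x": x, "W": W})
    print(fn.__name__, t.timeit(100))  # Timer handles CUDA synchronization
```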