HPC-Ops

HPC-Ops is a production-grade, high-performance, and easy-to-use operator library for LLM inference, developed by the Tencent Hunyuan AI Infra team.

Key Features

  • SOTA Performance & Production-Proven: Deeply optimized kernels tailored for NVIDIA H20 GPUs, delivering SOTA performance with up to 2.22x speedup. Powering large-scale production inference at Tencent.
  • Easy to Integrate: A clean API designed for seamless integration into popular inference frameworks like vLLM and SGLang.
  • Rich Precision Support: Native support for multiple data types including BF16 and FP8 with different quantization schemes.
  • A Modern CUDA Tutorial: Hands-on examples of building SOTA kernels with CuTe and CUTLASS in just a few hundred lines of code.

Performance

Maximum observed speedup per operator

| Operator         | Baseline                           | Prefill | Decode |
| ---------------- | ---------------------------------- | ------- | ------ |
| Attention (bf16) | FlashInfer, FA2, FA3, TensorRT-LLM | 1.33x   | 2.22x  |
| Attention (fp8)  | FlashInfer, FA3, TensorRT-LLM      | 1.12x   | 2.0x   |
| FusedMoE (fp8)   | TensorRT-LLM, vLLM                 | 1.49x   | 1.14x  |
| GroupGEMM (fp8)  | DeepGEMM                           | 1.1x    | 1.88x  |

We report the maximum speedup to highlight optimization potential; actual performance varies substantially across workloads and shapes.

Supported Kernels

Attention

  • Decode, Prefill: Optimized kernels for all attention phases, including paged attention.
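
A paged KV cache stores keys and values in fixed-size physical blocks and uses a per-sequence block table to map logical token positions to those blocks. The snippet below is a minimal PyTorch sketch of that indexing scheme only; the tensor names, block size, and layout are illustrative assumptions, not the hpc paged-attention interface (see the tests/ directory for the real call).

import torch

# Illustrative paged-KV indexing (NOT the hpc API): token t of a sequence lives in
# physical block block_table[t // block_size], at slot t % block_size.
block_size = 16                                   # tokens per physical block (assumed)
num_blocks, num_heads, head_dim = 64, 8, 128
k_cache = torch.zeros(num_blocks, block_size, num_heads, head_dim, dtype=torch.bfloat16, device="cuda")
block_table = torch.tensor([3, 17, 42], device="cuda")   # physical blocks owned by one sequence

def lookup_key(token_pos: int) -> torch.Tensor:
    """Return the cached key vectors for one logical token position."""
    block_id = block_table[token_pos // block_size]
    slot = token_pos % block_size
    return k_cache[block_id, slot]                # shape: [num_heads, head_dim]

print(lookup_key(20).shape)                       # token 20 -> block 17, slot 4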

Grouped GEMM

  • Quantized Grouped GEMM: FP8 weights with block-wise or per-tensor scaling
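
The two schemes differ only in the granularity of the scale tensors. As a rough sketch of the layouts (the 128-element block size and the shapes below are assumptions; the test files in tests/ define the exact layout the kernels expect): per-tensor scaling carries one FP32 scale per group, while block-wise scaling carries one scale per weight block and per activation tile.

import torch

num_group, n, k = 8, 4096, 4096
num_tokens, block = 1024, 128   # 128 is a common block size; assumed here, not confirmed

# Per-tensor scaling: a single FP32 scale per group (as in the Quick Start example below).
w_scale_pertensor = torch.full((num_group,), 1.0, dtype=torch.float, device="cuda")

# Block-wise scaling (illustrative layout): one scale per 128x128 weight block,
# and one per (token, 128-channel) activation tile.
w_scale_blockwise = torch.ones(num_group, n // block, k // block, dtype=torch.float, device="cuda")
x_scale_blockwise = torch.ones(num_tokens, k // block, dtype=torch.float, device="cuda")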

Fused MoE

  • Quantized Fused MoE: FP8 expert weights with block-wise or per-tensor scaling
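
Conceptually, a fused MoE kernel collapses the usual routing pipeline (top-k expert selection, gathering the tokens routed to each expert, a per-expert GEMM, and a weighted scatter-add) into a single launch. The pure-PyTorch reference below sketches that data flow for an unquantized, ReLU-activated MLP; it only illustrates the computation, not the hpc interface.

import torch

num_tokens, hidden, ffn, num_experts, top_k = 64, 256, 512, 8, 2
x = torch.randn(num_tokens, hidden, device="cuda")
w_up = torch.randn(num_experts, ffn, hidden, device="cuda")
w_down = torch.randn(num_experts, hidden, ffn, device="cuda")
router_logits = torch.randn(num_tokens, num_experts, device="cuda")

# Top-k routing: each token picks its k best experts and a weight per chosen expert.
topk_weight, topk_id = router_logits.softmax(dim=-1).topk(top_k, dim=-1)

out = torch.zeros_like(x)
for e in range(num_experts):
    token_id, slot = (topk_id == e).nonzero(as_tuple=True)       # tokens routed to expert e
    if token_id.numel() == 0:
        continue
    h = torch.relu(x[token_id] @ w_up[e].t()) @ w_down[e].t()    # per-expert MLP
    out.index_add_(0, token_id, h * topk_weight[token_id, slot].unsqueeze(1))  # weighted scatter-add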

Quick Start

Requirements

  • NVIDIA SM90 architecture GPU
  • Python 3.8 or higher
  • A compiler with C++17 support
  • CUDA Toolkit 12.8 or higher

You can set up the environment by installing the packages listed in requirements-dev.txt.
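
A quick pre-build sanity check for these requirements, assuming a CUDA-enabled PyTorch is already installed:

import sys
import torch

# SM90 (Hopper, e.g. H20) reports compute capability (9, 0).
assert sys.version_info >= (3, 8), "Python 3.8 or higher is required"
assert torch.cuda.is_available(), "no CUDA device visible"
assert torch.cuda.get_device_capability() == (9, 0), "an SM90 GPU is required"
print("CUDA version seen by torch:", torch.version.cuda)   # should report 12.8 or higher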

Install from Source

git clone https://github.com/Tencent/hpc-ops.git
cd hpc-ops

# build the wheel package
make wheel
python3 -m pip install dist/*.whl

Basic Usage

Example: per-tensor FP8 GroupGEMM kernel usage

import torch
import hpc

num_tokens = 1024
num_group, n, k = 8, 4096, 4096
# FP8 activations [num_tokens, k] and grouped FP8 weights [num_group, n, k].
x = torch.randn((num_tokens, k), dtype=torch.float, device="cuda").to(torch.float8_e4m3fn)
w = torch.randn((num_group, n, k), dtype=torch.float, device="cuda").to(torch.float8_e4m3fn)
# One per-tensor scale per group.
scale = torch.full((num_group,), 1.0, dtype=torch.float, device="cuda")
# Tokens are laid out contiguously per group; the per-group counts must sum to num_tokens.
num_tokens_per_group = torch.full((num_group,), num_tokens // num_group, dtype=torch.int32, device="cuda")
# Exclusive prefix sum of the counts: [0, 128, 256, ..., 1024].
cu_num_tokens_per_group = torch.cumsum(
    torch.cat([torch.tensor([0], dtype=torch.int32, device="cuda"), num_tokens_per_group]), dim=0
).to(torch.int32)

output = hpc.group_gemm_pertensor_fp8(
    x, w, num_tokens_per_group, cu_num_tokens_per_group, scale,
)
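
To sanity-check the result, each group's GEMM can be recomputed with plain PyTorch on upcasted inputs. This assumes the kernel computes out[rows of group g] = (x[rows of group g] @ w[g].t()) * scale[g] over contiguously laid-out rows and returns a higher-precision output; if the actual semantics differ, the test files under tests/ are the authoritative reference.

# Hedged reference check (assumed semantics, see note above).
ref = torch.empty_like(output, dtype=torch.float)
bounds = cu_num_tokens_per_group.tolist()
for g in range(num_group):
    lo, hi = bounds[g], bounds[g + 1]
    ref[lo:hi] = (x[lo:hi].float() @ w[g].float().t()) * scale[g]
print(torch.allclose(output.float(), ref, rtol=2e-2, atol=2e-2))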

For the usage of other operators, please refer to the corresponding test files in the tests/ directory.

Roadmap

  • Sparse Attention Kernels: Optimized for long-context LLMs, these kernels boost throughput for memory-bound workloads.
  • Extended Quantization Support: Flexible quantization strategies (including 4-bit/8-bit mixed precision) and kernel optimizations for quantized attention and GEMM that balance speed and accuracy.
  • Compute-Communication Boundary-Breaking Kernels: Overlapped computation and inter-GPU communication to minimize overhead in multi-node/multi-GPU distributed inference.

We welcome targeted, high-impact contributions. Whether you are fixing edge-case kernel bugs or submitting optimizations for niche LLM inference scenarios, your PRs will help refine this toolkit for production use.

Star this repo to follow our progress; we are continuously improving performance to make your LLM inference faster and more efficient.