Add CUDA GPU backend for NVIDIA acceleration by 0xSufi · Pull Request #12 · antirez/voxtral.c

0xSufi · 2026-02-11T23:13:54Z

Summary

Add CUDA/cuBLAS GPU backend for NVIDIA GPUs (SM 8.0+), bringing inference from ~40 sec to ~1 sec on RTX 3090
Implements BF16 weight caching, custom CUDA kernels (attention, RoPE, norms, activations), and monolithic GPU step functions that run all transformer layers with a single CPU↔GPU sync
New make cuda build target; CUDA and Metal are mutually exclusive at compile time

Performance (RTX 3090, 3.6s test audio)

Backend	Encoder	Decoder (per step)	Decoder (total)
CPU (OpenBLAS)	10,871 ms	~500 ms	28,446 ms
CUDA	225 ms (48x)	12.3 ms (40x)	782 ms (36x)

Architecture

cuBLAS matmul: BF16 weights cached on GPU after first use, F32→BF16 activation conversion on-device, TF32 tensor ops on Ampere+
Custom CUDA kernels: rms_norm (shared-memory reduction), silu, gelu, add/mul, apply_rope (paired rotation), causal_attention (online softmax, GQA, sliding window), ada_scale, bias_add
Monolithic step functions: Decoder (26 layers + logits) and encoder (32 layers + final norm) execute entirely on GPU with one cudaStreamSynchronize, with no per-layer CPU round-trips
Unified memory KV cache: cudaMallocManaged for zero-copy CPU↔GPU KV caches; GPU writes KV entries via device→managed copy to avoid page thrashing
Persistent GPU buffers: Single cudaMalloc per component (decoder/encoder), pointer arithmetic for sub-buffers, reused across tokens
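The "single allocation plus pointer arithmetic" idea from the last item can be sketched on the host as below. This is only an illustration, not the PR's code: `arena_t`, `arena_init`, `arena_floats`, `arena_reset` and `arena_free` are hypothetical names, the 256-byte alignment is an assumption, and `malloc` stands in for `cudaMalloc` so the sketch runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical arena: one large allocation, sub-buffers are offsets into it.
 * malloc() stands in for cudaMalloc() so this runs on the host. */
typedef struct {
    uint8_t *base;  /* start of the single allocation */
    size_t   size;  /* total bytes available */
    size_t   used;  /* bytes handed out so far */
} arena_t;

int arena_init(arena_t *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Hand out room for n floats at an offset that is a multiple of 256 bytes
 * from the base (assumed alignment). Returns NULL if it does not fit. */
float *arena_floats(arena_t *a, size_t n) {
    assert(a != NULL && a->base != NULL);
    size_t off = (a->used + 255) & ~(size_t)255;
    size_t bytes = n * sizeof(float);
    if (off > a->size || bytes > a->size - off) return NULL;
    a->used = off + bytes;
    return (float *)(a->base + off);
}

/* Reuse the same memory for the next token: no free, no new allocation. */
void arena_reset(arena_t *a) { a->used = 0; }

void arena_free(arena_t *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

Carving the sub-buffers once and reusing them per token keeps allocator calls out of the decoding loop, which is the benefit the item above describes.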

New files

voxtral_cuda.h — C header with CUDA backend API
voxtral_cuda.cu — Full CUDA implementation (~1100 lines)

Modified files

Makefile — make cuda target (nvcc + gcc, links cudart/cublas)
main.c — CUDA init/shutdown lifecycle
voxtral.c — BF16 + F32 weight warmup, unified memory KV cache allocation
voxtral_kernels.c — #ifdef USE_CUDA dispatch for bf16 matmul functions
voxtral_decoder.c — CUDA monolithic decoder step dispatch, unified memory KV cache
voxtral_encoder.c — CUDA monolithic encoder step dispatch, unified memory KV cache

Test plan

make cuda builds without errors
Short audio transcription correct ("Hello, this is a test...")
Long audio transcription works (JFK 11s, MLK 3min)
Falls back to CPU path gracefully if CUDA step returns -1
Verify on different SM 8.x GPUs (tested on SM 8.6 / RTX 3090)
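The CPU fallback item in the test plan could be exercised with a dispatch shaped roughly like the following. This is a sketch under assumptions: `step_fn`, `run_step`, `gpu_step_unavailable` and `cpu_step` are hypothetical names, and the only detail taken from the PR is that a CUDA step returning -1 means "use the CPU path".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step signature: returns 0 on success, -1 on failure. */
typedef int (*step_fn)(const float *in, float *out, size_t n);

/* Stand-in for a GPU step that cannot run (for example, no device present). */
int gpu_step_unavailable(const float *in, float *out, size_t n) {
    (void)in; (void)out; (void)n;
    return -1;
}

/* Stand-in CPU step: doubles each element so the result is easy to check. */
int cpu_step(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 2.0f * in[i];
    return 0;
}

/* Try the GPU step first; if it is missing or reports failure, run the CPU
 * step instead. Returns 1 if the GPU handled the step, 0 if the CPU did. */
int run_step(step_fn gpu, step_fn cpu, const float *in, float *out, size_t n) {
    assert(cpu != NULL);
    if (gpu != NULL && gpu(in, out, n) == 0) return 1;
    cpu(in, out, n);
    return 0;
}
```

Calling `run_step(gpu_step_unavailable, cpu_step, in, out, n)` returns 0 and leaves the CPU result in `out`, which is the behavior the test plan item asks to verify.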

🤖 Generated with Claude Code

Add cuBLAS-accelerated matrix multiplication with BF16 weight caching, giving ~13x overall speedup on RTX 3090 vs CPU-only OpenBLAS (encoder 5.5x, decoder prefill 51x, decoder per-step 25x).

New files:
- voxtral_cuda.h: C header with CUDA backend API
- voxtral_cuda.cu: cuBLAS GEMM with BF16 weight cache, F32→BF16 activation conversion kernel, activation buffer pool, and cudaMallocManaged for KV caches

Modified files:
- voxtral_kernels.c: #ifdef USE_CUDA dispatch for bf16 matmul functions
- voxtral_encoder.c: CUDA unified memory for encoder KV cache
- voxtral_decoder.c: CUDA unified memory for decoder KV cache + grow
- voxtral.c: CUDA weight warmup at load time, shared memory free
- main.c: vox_cuda_init() / vox_cuda_shutdown()
- Makefile: `make cuda` target (requires CUDA toolkit + OpenBLAS, SM 8.0+)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
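For readers unfamiliar with the F32→BF16 activation conversion this commit mentions, here is a host-side sketch of one common way to do it (round to nearest, ties to even). Whether the PR's kernel rounds the same way is not stated, so treat the rounding mode as an assumption; `f32_to_bf16`, `bf16_to_f32` and `f32_to_bf16_array` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* BF16 keeps the top 16 bits of an IEEE-754 float: sign, 8 exponent bits,
 * 7 mantissa bits. This version rounds to nearest, ties to even. */
uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u)      /* NaN: keep it a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7fffu + ((bits >> 16) & 1u);       /* rounding bias, ties to even */
    return (uint16_t)(bits >> 16);
}

/* Widening back is exact: place the 16 bits in the high half. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Convert a whole activation vector; a GPU kernel would assign one element
 * (or a few) per thread instead of looping. */
void f32_to_bf16_array(const float *src, uint16_t *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) dst[i] = f32_to_bf16(src[i]);
}
```

Halving the element size halves the bytes moved per matmul operand, at the cost of keeping only 7 mantissa bits.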

Custom CUDA kernel with online softmax, GQA support, and sliding window masking. One block per (head, query_pos) with warp shuffle + shared memory dot product reduction. Only dispatched for the encoder incremental path where KV cache is in managed memory; the decoder stays on CPU since single-token attention is faster without kernel launch + page migration overhead.

Encoder: 1879ms → 634ms (2.97x), decoder unchanged at ~16ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 3 of CUDA backend: keep activations on GPU between operations, eliminating per-matmul CPU↔GPU round-trips. Adds custom CUDA kernels (rms_norm, silu, gelu, add/mul, RoPE, ada_scale, bias_add), a device-pointer cuBLAS helper, persistent GPU buffers, and monolithic decoder (26 layers) and encoder (32 layers) step functions that execute with a single cudaStreamSynchronize.

Performance on RTX 3090 (3.6s test audio):
- Encoder: 225ms (was 634ms in Phase 2, 1879ms in Phase 1)
- Decoder: 12.3 ms/step (was 16.8ms in Phase 2, 16.7ms in Phase 1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
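As an illustration of one of the element-wise kernels listed here, the sketch below applies RoPE as a rotation of value pairs. It assumes "paired rotation" means adjacent pairs `(x[2i], x[2i+1])` and that the cosine and sine tables are precomputed by the caller; neither detail is confirmed by the PR text, and `rope_pairs` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Rotate each adjacent pair (x[2i], x[2i+1]) by the angle whose cosine and
 * sine are cos_tab[i] and sin_tab[i]. The tables hold head_dim / 2 entries
 * for the current position and are assumed to be precomputed. */
void rope_pairs(float *x, const float *cos_tab, const float *sin_tab,
                size_t head_dim) {
    assert(head_dim % 2 == 0);
    for (size_t i = 0; i < head_dim / 2; i++) {
        float a = x[2 * i];
        float b = x[2 * i + 1];
        x[2 * i]     = a * cos_tab[i] - b * sin_tab[i];
        x[2 * i + 1] = a * sin_tab[i] + b * cos_tab[i];
    }
}
```

Each pair is independent, so on a GPU this maps naturally to one thread per pair, and the rotation happens in place on the buffer that the next operation reads.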

0xSufi · 2026-02-12T01:50:27Z

Hi Salvatore, it looks to me like Opus 4.6 did a good job here, though maybe it is too Linux-specific?
The performance is not bad.

You're great!

Danmoreng · 2026-02-13T10:27:40Z

How is this different from the existing CUDA PR #7?

0xSufi · 2026-02-14T02:05:55Z

How is this different from the existing CUDA PR #7?

Well, first of all, compare the lines of code this PR adds with the other one: they are substantially different. Also, this PR was admittedly created entirely with Claude Code (Opus 4.6); I am not sure about #7. I have only skimmed it, and it looks Windows-specific, whereas this PR should be Linux-specific (I asked antirez about it).

Edit: Here's Opus 4.6's answer to your question:

● Both PRs add CUDA GPU backends for NVIDIA acceleration to voxtral.c, but they differ significantly in scope and approach:

  PR #12 — "Add CUDA GPU backend for NVIDIA acceleration"

  - Size: ~1,436 additions, 8 files changed
  - Scope: Focused, clean CUDA backend
  - Implementation: Single .cu file (~1100 lines), cuBLAS for matmul, custom kernels (RMSNorm, SiLU, GeLU, RoPE, causal attention, etc.)
  - Architecture: Monolithic GPU step functions (one cudaStreamSynchronize per step), unified memory (cudaMallocManaged) for KV caches, BF16 weight caching
  - Performance: RTX 3090, 3.6s audio — encoder 48x faster, decoder 40x faster (~1s total vs ~40s CPU)
  - Platform: Linux only
  - State: Open

  PR #7 — "CUDA backend (NVIDIA/WSL2) + faster-than-real-time STT"

  - Size: ~12,156 additions, 30 files changed
  - Scope: Much larger — CUDA backend + Windows support + many optimizations
  - Implementation: Separate voxtral_cuda.c (~6382 lines) + voxtral_cuda_kernels.cu (~2927 lines), uses CUDA Driver API + cuBLAS/cuBLASLt
  - Extra features:
    - Many tuning knobs (VOX_CUDA_FAST, VOX_CUDA_PIPELINE_FULL, VOX_CUDA_LOGITS_INT8, cuBLASLt autotune, etc.)
    - Fused kernels (RMSNorm + ada_scale + BF16 cast in one kernel)
    - CUDA graphs for decoder loop
    - INT8-quantized LM head (optional)
    - Pre-compiled cubin blob (avoids PTX JIT issues on WSL2)
    - Native Windows build (PowerShell scripts, WASAPI microphone capture)
  - Performance: RTX 3080 Ti, 180s audio — 5.5x real-time with CUDA fast + INT8
  - Platform: Linux, WSL2, and Windows
  - State: Open

  Key Differences

  ┌────────────────────┬─────────────────┬────────────────────────────────────┐
  │       Aspect       │     PR #12      │               PR #7                │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Lines added        │ ~1,400          │ ~12,000                            │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Files changed      │ 8               │ 30                                 │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ CUDA API           │ Runtime API     │ Driver API + cuBLASLt              │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ KV cache           │ Unified memory  │ Device memory + lazy host download │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Windows support    │ No              │ Yes (WASAPI mic, PowerShell build) │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Optimization knobs │ Minimal         │ Extensive (graphs, INT8, autotune) │
  ├────────────────────┼─────────────────┼────────────────────────────────────┤
  │ Complexity         │ Simple, focused │ Feature-rich, many options         │
  └────────────────────┴─────────────────┴────────────────────────────────────┘

  In short: PR #12 is a clean, minimal CUDA backend; PR #7 is a comprehensive, heavily optimized CUDA backend with Windows support and many tuning options.

HorizonXP · 2026-06-09T13:24:26Z

Hey, just saw this. @0xSufi thanks for opening this. I think it's fantastic that you took a stab at it. I'll take a closer look to see what the delta actually is; I'll bet there's lots we can merge here to make it better.

0xSufi and others added 3 commits February 11, 2026 18:03

0xSufi wants to merge 3 commits into antirez:main from 0xSufi:cuda-backend
