Add CUDA GPU backend for NVIDIA acceleration #12
Add cuBLAS-accelerated matrix multiplication with BF16 weight caching, giving ~13x overall speedup on RTX 3090 vs CPU-only OpenBLAS (encoder 5.5x, decoder prefill 51x, decoder per-step 25x).

New files:
- voxtral_cuda.h: C header with CUDA backend API
- voxtral_cuda.cu: cuBLAS GEMM with BF16 weight cache, F32→BF16 activation conversion kernel, activation buffer pool, and cudaMallocManaged for KV caches

Modified files:
- voxtral_kernels.c: #ifdef USE_CUDA dispatch for bf16 matmul functions
- voxtral_encoder.c: CUDA unified memory for encoder KV cache
- voxtral_decoder.c: CUDA unified memory for decoder KV cache + grow
- voxtral.c: CUDA weight warmup at load time, shared memory free
- main.c: vox_cuda_init() / vox_cuda_shutdown()
- Makefile: `make cuda` target (requires CUDA toolkit + OpenBLAS, SM 8.0+)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
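To make the F32→BF16 activation conversion concrete, here is a minimal CPU reference in plain C. It is a sketch only: the function names are hypothetical, and round-to-nearest-even is an assumption, since the commit message does not say which rounding mode the CUDA kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one float to bfloat16 bits (the top 16 bits of the float).
 * Assumption: round-to-nearest-even; the PR's kernel may round differently. */
uint16_t f32_to_bf16(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* bit copy, no aliasing UB */
    if ((bits & 0x7fffffffu) > 0x7f800000u)    /* NaN: keep it a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    uint32_t lsb = (bits >> 16) & 1u;          /* lowest bit that survives */
    bits += 0x7fffu + lsb;                     /* nearest, ties to even */
    return (uint16_t)(bits >> 16);
}

/* Widen bfloat16 bits back to float; this direction is exact. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Convert a whole activation buffer, one element at a time. */
void f32_to_bf16_array(const float *src, uint16_t *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = f32_to_bf16(src[i]);
}
```

On the GPU the loop body would typically run once per thread. The weights are cached in BF16 once, so only the activations need this conversion before each GEMM.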
Custom CUDA kernel with online softmax, GQA support, and sliding window masking. One block per (head, query_pos) with warp shuffle + shared memory dot product reduction. Only dispatched for the encoder incremental path, where the KV cache is in managed memory; the decoder stays on CPU since single-token attention is faster without kernel launch + page migration overhead.

Encoder: 1879ms → 634ms (2.97x), decoder unchanged at ~16ms/step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 3 of CUDA backend: keep activations on GPU between operations, eliminating per-matmul CPU↔GPU round-trips. Adds custom CUDA kernels (rms_norm, silu, gelu, add/mul, RoPE, ada_scale, bias_add), a device-pointer cuBLAS helper, persistent GPU buffers, and monolithic decoder (26 layers) and encoder (32 layers) step functions that execute with a single cudaStreamSynchronize.

Performance on RTX 3090 (3.6s test audio):
- Encoder: 225ms (was 634ms in Phase 2, 1879ms in Phase 1)
- Decoder: 12.3 ms/step (was 16.8ms in Phase 2, 16.7ms in Phase 1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
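The PR description lists a `cudaMalloc` per component with pointer arithmetic for sub-buffers as the way these persistent buffers are laid out. A rough sketch of that pattern in plain C follows, with `malloc` standing in for `cudaMalloc` so it runs without a GPU. The `arena_t` type, the function names, and the 256-byte offset rounding are hypothetical and not taken from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump arena: one backing allocation, sub-buffers by offset. */
typedef struct {
    unsigned char *base;  /* start of the single backing allocation */
    size_t capacity;      /* total bytes available */
    size_t used;          /* bytes handed out so far */
} arena_t;

/* Allocate the backing store once; returns 1 on success, 0 on failure. */
int arena_init(arena_t *a, size_t capacity) {
    assert(a != NULL);
    a->base = (unsigned char *)malloc(capacity);  /* stand-in for cudaMalloc */
    a->capacity = capacity;
    a->used = 0;
    return a->base != NULL;
}

/* Carve out n_floats floats at an offset rounded up to 256 bytes
 * (relative to the arena base). Returns NULL when the arena is full. */
float *arena_floats(arena_t *a, size_t n_floats) {
    const size_t align = 256;
    size_t offset = (a->used + align - 1) / align * align;
    size_t bytes = n_floats * sizeof(float);
    if (offset > a->capacity || bytes > a->capacity - offset)
        return NULL;
    a->used = offset + bytes;
    return (float *)(a->base + offset);
}

/* Release everything with a single free at shutdown. */
void arena_free(arena_t *a) {
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

Because the sub-buffers are carved once and reused across tokens, the per-step path performs no allocation at all, only kernel launches against fixed pointers.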
Hi Salvatore, it looks to me like Opus 4.6 did a good job here, though maybe it's too Linux-specific? You're great!
How is this different from the existing CUDA PR, #7?
Well, first of all, look at the lines of code this PR added vs. the other one; they are substantially different. Also, this PR was admittedly created entirely with Claude Code (Opus 4.6); I am not sure about #7. I skimmed it briefly and it looks Windows-specific, whereas this PR should be Linux-specific (I asked antirez about it). Edit: Here's Opus 4.6's answer to your question:
Hey, just saw this. @0xSufi thanks for opening this. I think it's fantastic that you took a stab at it. I'll take a closer look to see what the delta actually is; I'll bet there's lots we can merge here to make it better.
Summary
- `make cuda` build target; CUDA and Metal are mutually exclusive at compile time
Performance (RTX 3090, 3.6s test audio)
Architecture
- `rms_norm` (shared-memory reduction), `silu`, `gelu`, `add`/`mul`, `apply_rope` (paired rotation), `causal_attention` (online softmax, GQA, sliding window), `ada_scale`, `bias_add`
- A single `cudaStreamSynchronize` per step: no per-layer CPU round-trips
- `cudaMallocManaged` for zero-copy CPU↔GPU KV caches; GPU writes KV entries via device→managed copy to avoid page thrashing
- `cudaMalloc` per component (decoder/encoder), pointer arithmetic for sub-buffers, reused across tokens
New files
- `voxtral_cuda.h`: C header with CUDA backend API
- `voxtral_cuda.cu`: Full CUDA implementation (~1100 lines)
Modified files
- `Makefile`: `make cuda` target (nvcc + gcc, links cudart/cublas)
- `main.c`: CUDA init/shutdown lifecycle
- `voxtral.c`: BF16 + F32 weight warmup, unified memory KV cache allocation
- `voxtral_kernels.c`: `#ifdef USE_CUDA` dispatch for bf16 matmul functions
- `voxtral_decoder.c`: CUDA monolithic decoder step dispatch, unified memory KV cache
- `voxtral_encoder.c`: CUDA monolithic encoder step dispatch, unified memory KV cache
Test plan
- `make cuda` builds without errors
🤖 Generated with Claude Code