
Full GPU acceleration for the entire PQ pipeline: encoder training, PQ encoding, cluster training, and label assignment#5

Merged
afloresep merged 5 commits into master from gpu on Apr 7, 2026
Conversation


afloresep (Owner) commented Apr 7, 2026

  • Triton kernel for PQ cluster assignment (chelombus/clustering/_gpu_predict.py) — a custom kernel that tiles over centers with an online argmin, never materializing the N×K distance matrix. It uses tl.static_range(M) for compile-time unrolling, supporting any number of subvectors (not just M=6), and adaptive BLOCK_K sizing to manage register pressure as M grows.
  • GPU encoder training (chelombus/encoder/encoder.py)
  • GPU cluster training (chelombus/clustering/PyQKmeans.py) — Triton assignment plus a CPU centroid update loop, with early stopping based on a tolerance.
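The online-argmin tiling that the Triton kernel implements can be sketched in plain NumPy (a minimal CPU sketch with hypothetical names, not the actual kernel in chelombus/clustering/_gpu_predict.py): each pass processes one tile of BLOCK_K centers and updates a running best distance and index, so only an (N, BLOCK_K) distance tile ever exists, never the full (N, K) matrix.

```python
import numpy as np

def assign_tiled(x, centers, block_k=1024):
    """Assign each row of x to its nearest center, tiling over centers.

    Mirrors the kernel's online argmin: only an (N, block_k) distance
    tile is materialized at any time, never the full (N, K) matrix.
    """
    n = x.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    for start in range(0, centers.shape[0], block_k):
        tile = centers[start:start + block_k]              # (bk, D)
        # squared Euclidean distances to this tile of centers only
        d = ((x[:, None, :] - tile[None, :, :]) ** 2).sum(axis=2)
        tile_idx = d.argmin(axis=1)
        tile_min = d[np.arange(n), tile_idx]
        better = tile_min < best_dist                      # online argmin update
        best_dist[better] = tile_min[better]
        best_idx[better] = start + tile_idx[better]
    return best_idx
```

In the real kernel the per-row running minimum lives in registers, which is why BLOCK_K has to shrink as M grows.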

Benchmarks

1B Enamine REAL molecules (RTX 4070 Ti SUPER 16GB):


  ┌────────────────────────────────────────┬──────────┐
  │                 Stage                  │   Time   │
  ├────────────────────────────────────────┼──────────┤
  │ Encoder training (50M sample)          │  1.8 min │
  ├────────────────────────────────────────┼──────────┤
  │ PQ encoding (1B)                       │  3.7 min │
  ├────────────────────────────────────────┼──────────┤
  │ Cluster training (1B, K=100K, 5 iters) │  2.3 hrs │
  ├────────────────────────────────────────┼──────────┤
  │ Label assignment (1B)                  │ 26.2 min │
  ├────────────────────────────────────────┼──────────┤
  │ Total                                  │  2.9 hrs │
  └────────────────────────────────────────┴──────────┘

k-selection on 100M molecules: GPU fit times range from 1.7 min (k=10K) to 17.6 min (k=200K), versus 1.3 h to 26.4 h on CPU — a 45-90x speedup.
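The quoted speedup range follows directly from the timings above (a quick arithmetic check, not a new measurement):

```python
# CPU/GPU fit-time ratios at the two ends of the k range
low = (1.3 * 60) / 1.7     # k=10K:  78 min CPU vs 1.7 min GPU
high = (26.4 * 60) / 17.6  # k=200K: 1584 min CPU vs 17.6 min GPU
print(round(low), round(high))  # -> 46 90
```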

- Fix GPU tensor cache to use content comparison instead of memory
  address, preventing stale tensor reuse after Python frees/reuses
  the same address
- Add CPU transform fallback using codewords when sklearn models
  aren't available (e.g. loaded encoder without pq_trained)
- Add GPU-accelerated KMeans fit with batched GEMM assignment
- Add _update_centers for PQ centroid recomputation with empty
  cluster preservation
- Add early stopping with tolerance and oscillation detection
- Add device validation and GPU support gating (m=6, k<=256)
- Remove stray debug print in ImportError handler
- Clean up comment artifacts and unused variables
- All 53 tests pass (CPU + GPU paths)
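The "batched GEMM assignment" in the KMeans fit can be sketched with the standard expansion ||x − c||² = ||x||² − 2·x·c + ||c||², where the ||x||² term is constant per row and drops out of the argmin (a NumPy sketch under assumed shapes; the GPU version would run the matmul on device):

```python
import numpy as np

def gemm_assign(x, centers, batch=4096):
    """Nearest-center assignment via a GEMM instead of explicit distances.

    ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; per-row ||x||^2 is constant,
    so argmin over centers only needs ||c||^2 - 2 x @ C^T, computed in
    row batches to bound memory.
    """
    c_sq = (centers ** 2).sum(axis=1)                  # (K,)
    out = np.empty(x.shape[0], dtype=np.int64)
    for i in range(0, x.shape[0], batch):
        xb = x[i:i + batch]
        scores = c_sq[None, :] - 2.0 * xb @ centers.T  # the batched GEMM
        out[i:i + batch] = scores.argmin(axis=1)
    return out
```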
- Replace hand-unrolled M=6 kernel with tl.static_range(M) loop,
  supporting any number of subvectors via compile-time unrolling
- Add adaptive BLOCK_K sizing based on M to manage register pressure
- Remove m=6 guard from PQKMeans GPU support check
- Update README with 1B GPU benchmark results on real Enamine data
  (2.9 hrs total for full pipeline on RTX 4070 Ti SUPER)
- Add reproducible benchmark script (scripts/benchmark_1B_pipeline.py)
  that streams fingerprint chunks from disk
- Update k-selection table with both GPU and CPU fit times on 100M
  Enamine molecules (GPU: 1.7 min to 17.6 min vs CPU: 1.3h to 26.4h)
- Add scripts/k_selection_gpu.py for reproducible GPU k-selection
- Remove outdated CPU-only scaling note
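One way the adaptive BLOCK_K sizing could look (a hypothetical heuristic for illustration — the budget constant and clamps are assumptions, not the values in the actual kernel): per-program register use grows roughly with M × BLOCK_K, so the center tile shrinks as the subvector count M grows, rounded down to a power of two as Triton tile sizes require.

```python
def adaptive_block_k(m, budget=4096, min_block=128, max_block=1024):
    """Hypothetical heuristic: shrink the center tile as M grows so that
    M * BLOCK_K stays near a fixed register budget.

    Clamps to [min_block, max_block] and rounds down to a power of two.
    """
    raw = max(min_block, min(max_block, budget // max(m, 1)))
    p = 1
    while p * 2 <= raw:  # round down to a power of two
        p *= 2
    return p
```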
afloresep changed the title from "Gpu" to "Full GPU acceleration for the entire PQ pipeline: encoder training, PQ encoding, cluster training, and label assignment" on Apr 7, 2026
- Change MQN fingerprint dtype from uint8 to int16 (values can
  exceed 255 for large molecules)
- Guard torch.cuda monkeypatch with importorskip for CI without CUDA
- Update test assertions to match correct dtypes
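The dtype change is easy to motivate: MQN counts above 255 silently wrap when cast to uint8, while int16 preserves them (a minimal NumPy demonstration; the example values are illustrative):

```python
import numpy as np

counts = np.array([300, 7, 1000])  # MQN-style counts; some exceed 255
wrapped = counts.astype(np.uint8)  # C-style cast: silently wraps modulo 256
safe = counts.astype(np.int16)     # preserves the true values
print(wrapped.tolist())  # -> [44, 7, 232]
print(safe.tolist())     # -> [300, 7, 1000]
```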
afloresep merged commit 403134f into master on Apr 7, 2026
4 checks passed
