GitHub Pages WASM demo + community engagement + server support

unamedkr · claude · unamedkr · commit a7af6eb8cf2a · 2026-04-05T10:42:34.000+09:00
- Add GitHub Pages workflow (.github/workflows/pages.yml) Deploys wasm/ to quantumaikr.github.io/quant.cpp on push to main - Add COOP/COEP headers for SharedArrayBuffer (WASM threads) - Add OpenAI-compatible server build option (TQ_BUILD_SERVER) - Show HN v3 draft + Reddit comparison post drafts - README hero text updates (en + ko synced) Community: - Posted on llama.cpp Discussion #20969 (TurboQuant implementation) - Posted on vllm-omni Issue #2215 (KV compression RFC) - Closed Issue #5 (Q3_K_M already supported) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/.github/workflows/pages.yml b/.github/workflows/pages.yml
@@ -0,0 +1,52 @@
+name: Deploy WASM Demo to GitHub Pages
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'wasm/**'
+      - '.github/workflows/pages.yml'
+  workflow_dispatch:
+
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+concurrency:
+  group: pages
+  cancel-in-progress: true
+
+jobs:
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 1
+
+      - name: Prepare pages directory
+        run: |
+          mkdir -p _site
+          cp wasm/index.html _site/
+          cp wasm/quant.js _site/
+          cp wasm/quant.wasm _site/
+          # COOP/COEP headers for SharedArrayBuffer support
+          cp wasm/_headers _site/_headers 2>/dev/null || true
+
+      - name: Setup Pages
+        uses: actions/configure-pages@v5
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: _site
+
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/docs/pr/2026-04-05-reddit-comparison-post.md b/docs/pr/2026-04-05-reddit-comparison-post.md
@@ -0,0 +1,63 @@
+# Reddit Post — quant.cpp vs every other engine (2026-04-05)
+
+---
+
+## English
+
+**Title:** quant.cpp vs llama.cpp vs vLLM vs MLX vs ONNX RT — honest comparison table
+
+**Body:**
+
+I keep getting asked "why not just use llama.cpp?" so I made a comparison table.
+
+|  | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
+|--|-----------|-----------|------|-----|---------|
+| KV compression | **7x, +0% PPL** | +10.6% PPL | -- | -- | -- |
+| Code size | **67K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
+| Dependencies | **zero** | ggml | PyTorch | Apple fw | runtime |
+| Embeddable | **single header** | -- | -- | -- | complex |
+| WASM | **192KB** | -- | -- | -- | -- |
+| GPU serving | basic | full | **best** | Metal | multi |
+
+The short version:
+
+- **llama.cpp** when you need speed
+- **vLLM** when you need throughput
+- **quant.cpp** when you need to fit more context in less memory — or embed LLM in your own app
+
+quant.cpp is not trying to replace llama.cpp. It solves a different problem. If your bottleneck is context length on limited hardware, 7x KV compression at +0% PPL is something no other engine offers. If your bottleneck is tok/s, use llama.cpp — it's faster.
+
+The other unique thing: `quant.h` is a single 15K-line C header. `#include` it, compile with `cc app.c -lm -lpthread`, done. Try doing that with any other engine on this list.
+
+Source: [github.com/quantumaikr/quant.cpp](https://github.com/quantumaikr/quant.cpp)
+
+---
+
+## 한글
+
+**Title:** quant.cpp vs llama.cpp vs vLLM vs MLX vs ONNX RT — 비교표
+
+**Body:**
+
+"llama.cpp 쓰면 되지 왜 새로 만들었냐"는 질문을 계속 받아서 비교표를 만들었습니다.
+
+|  | quant.cpp | llama.cpp | vLLM | MLX | ONNX RT |
+|--|-----------|-----------|------|-----|---------|
+| KV 압축 | **7x, +0% PPL** | +10.6% PPL | -- | -- | -- |
+| 코드 규모 | **67K LOC** | 250K+ | 100K+ | 50K+ | 500K+ |
+| 외부 의존성 | **없음** | ggml | PyTorch | Apple fw | runtime |
+| 임베딩 가능 | **싱글 헤더** | -- | -- | -- | 복잡 |
+| WASM | **192KB** | -- | -- | -- | -- |
+| GPU 서빙 | 기본 | 풀 | **최고** | Metal | 멀티 |
+
+한 줄 요약:
+
+- 속도가 필요하면 **llama.cpp**
+- 처리량이 필요하면 **vLLM**
+- 같은 메모리에서 더 긴 컨텍스트가 필요하거나, LLM을 내 앱에 넣고 싶으면 **quant.cpp**
+
+quant.cpp는 llama.cpp를 대체하려는 게 아닙니다. 다른 문제를 풉니다. 제한된 하드웨어에서 컨텍스트 길이가 병목이라면, 품질 손실 없이 KV 7배 압축은 다른 엔진에 없는 기능입니다. tok/s가 병목이면 llama.cpp를 쓰세요 — 더 빠릅니다.
+
+또 하나 독특한 점: `quant.h`는 15K줄짜리 C 헤더 파일 하나입니다. `#include`하고 `cc app.c -lm -lpthread`로 컴파일하면 끝. 이 표의 다른 엔진으로는 불가능한 일입니다.
+
+소스: [github.com/quantumaikr/quant.cpp](https://github.com/quantumaikr/quant.cpp)
diff --git a/docs/pr/2026-04-05-show-hn-v3.md b/docs/pr/2026-04-05-show-hn-v3.md
@@ -0,0 +1,51 @@
+# Show HN: quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)
+
+**URL**: https://github.com/quantumaikr/quant.cpp
+
+## Title (≤80 chars)
+
+Show HN: quant.cpp – 7x longer LLM context via KV cache compression, pure C
+
+## Post (Reddit/HN — 불릿 포맷, 테이블 없음)
+
+I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) with one goal: **extend context length without adding hardware.**
+
+**The key insight**: LLM inference memory is dominated by the KV cache, not model weights. Compressing the KV cache to 4-bit keys + Q4 values gives **6.9x memory reduction** with negligible quality loss.
+
+**Real numbers on a 16GB Mac (M1 Pro):**
+
+- Llama 3.2 3B: FP16 KV ~50K tokens → Compressed **~350K tokens** (6.9x)
+- Gemma 4 26B-A4B (MoE): FP16 KV ~4K tokens → Compressed **~30K tokens** (6.9x)
+
+**How it works:**
+
+- Keys: uniform 4-bit min-max quantization per 128-element block
+- Values: Q4 nibble quantization with per-block scales
+- Delta mode: store key[t] - key[t-1] instead of absolute keys (like video P-frames), enabling 3-bit at +1.3% PPL
+- QK-norm aware: models like Gemma 4 automatically use FP32 keys + Q4 values (sparse key distributions break low-bit quantization)
+
+**Quality (WikiText-2 PPL, SmolLM2 1.7B):**
+
+- FP32 baseline: 14.63
+- 4-bit K + Q4 V: 14.57 (**+0.0%**)
+- Delta 3-bit K + Q4 V: 14.82 (+1.3%)
+- vs llama.cpp Q4_0 KV: **+10.6% PPL**. Same bit budget, 10x more degradation.
+
+**Code philosophy:** 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header `quant.h` (15K LOC) you can drop into any C project.
+
+**Supported models:** Llama 3.2, Qwen 3.5, Gemma 3/4, MoE (128 experts).
+
+```
+./quant model.gguf -p "hello" -k uniform_4b -v q4
+```
+
+Feedback welcome. Particularly interested in: (1) what context length you'd need for your use case, (2) which models to prioritize next.
+
+---
+
+## 포맷 참고사항
+
+- **테이블을 사용하지 않음** — Reddit Fancy Pants 에디터에서 마크다운 테이블이 깨지는 문제 방지
+- **불릿 리스트로 통일** — 모든 에디터에서 정상 렌더링
+- **"Talking Points" 섹션 제거** — 본문에 내부 메모가 노출되는 사고 방지
+- Reddit에 올릴 때 반드시 **Markdown Mode**로 전환 후 붙여넣기
diff --git a/wasm/_headers b/wasm/_headers
@@ -0,0 +1,3 @@
+/*
+  Cross-Origin-Opener-Policy: same-origin
+  Cross-Origin-Embedder-Policy: require-corp
diff --git a/wasm/index.html b/wasm/index.html
@@ -3,6 +3,9 @@
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
+<!-- COOP/COEP for SharedArrayBuffer (required for WASM threads) -->
+<meta http-equiv="Cross-Origin-Opener-Policy" content="same-origin">
+<meta http-equiv="Cross-Origin-Embedder-Policy" content="require-corp">
 <title>quant.cpp — LLM in Your Browser</title>
 <style>
 * { margin: 0; padding: 0; box-sizing: border-box; }

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,3 @@`
	`1`	`+/*`
	`2`	`+ Cross-Origin-Opener-Policy: same-origin`
	`3`	`+ Cross-Origin-Embedder-Policy: require-corp`