Add ONNX inference documentation #49

base: main
```diff
@@ -158,4 +158,5 @@ Thumbs.db
 *.swp
 *.swo
 *~
 .vscode/
+.onnx-tests/
```
This file was deleted.

docs/inference/onnx.mdx (new file, +251 lines):

---
title: "ONNX"
description: "ONNX provides cross-platform inference for LFM models across CPUs, GPUs, NPUs, and browsers via WebGPU."
---

<Tip>
Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
</Tip>

ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU, making them ideal for edge deployment and web applications.

Many LFM models are available as pre-exported ONNX packages on Hugging Face. For models not yet available, use the [LiquidONNX](#liquidonnx-export-tool) tool to export any LFM to ONNX.

## Pre-exported Models

Pre-exported ONNX models are available from LiquidAI and the [onnx-community](https://huggingface.co/onnx-community). Check the [Model Library](/docs/models/complete-library) for a complete list of available formats.

> **Review comment:** I'd add a link on "from LiquidAI" that points to our HF.

### Quantization Options

Each ONNX export includes multiple precision levels. **Q4** is recommended for most deployments and supports WebGPU, CPU, and GPU. **FP16** offers higher quality and works on WebGPU and GPU. **Q8** provides a quality/size balance but is server-only (CPU/GPU). **FP32** is the full-precision baseline.

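To check which precision variants a given package actually ships, you can list its ONNX files. A minimal sketch using `huggingface_hub` (the repository ID is the example used in the Python section below; exact filenames may vary by package):

```python
from huggingface_hub import list_repo_files

# Example repository from the Python section below; substitute your target package.
repo_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"

# Each precision variant is a separate file under onnx/ (e.g. onnx/model_q4.onnx).
for path in list_repo_files(repo_id):
    if path.startswith("onnx/") and path.endswith(".onnx"):
        print(path)
```
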
## Python Inference

### Installation

```bash
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
```

> **Review comment on lines +25 to +33:** Maybe instead point to the onnx-export repo directly, or use it.

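To confirm the runtime can see your hardware, you can print the available execution providers. This is plain `onnxruntime` API, shown here as an optional sanity check rather than a required step:

```python
import onnxruntime as ort

# Lists providers such as CPUExecutionProvider, plus CUDAExecutionProvider
# when onnxruntime-gpu is installed with a working CUDA setup.
print(ort.get_available_providers())
```
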
### Basic Usage

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
```
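
The loop above decodes greedily with `argmax`. If you want varied output instead, a minimal sketch that swaps in temperature sampling for the token-selection line (same logits layout as above; the temperature value is only an example):

```python
# Replace `next_token = int(np.argmax(outputs[0][0, -1]))` with sampling:
temperature = 0.8  # example value, tune to taste
logits = outputs[0][0, -1].astype(np.float64) / temperature
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()
next_token = int(np.random.choice(len(probs), p=probs))
```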

## WebGPU Inference

ONNX models run in browsers via [Transformers.js](https://huggingface.co/docs/transformers.js) with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

### Setup

1. Install Transformers.js:

   ```bash
   npm install @huggingface/transformers
   ```

2. Enable WebGPU in your browser:
   - **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable it, and restart
   - **Verify**: Check `chrome://gpu` for WebGPU status

### Usage

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

<Note>
WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.
</Note>

## LiquidONNX Export Tool

[LiquidONNX](https://github.com/Liquid4All/onnx-export) is the official tool for exporting LFM models to ONNX. Use it to export models not yet available as pre-built packages, or to customize export settings.

### Installation

```bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu
```

### Supported Models

| Family | Quantization Formats |
|--------|---------------------|
| LFM2.5, LFM2 (text) | fp32, fp16, q4, q8 |
| LFM2.5-VL, LFM2-VL (vision) | fp32, fp16, q4, q8 |
| LFM2-MoE | fp32, fp16, q4, q4f16 |
| LFM2.5-Audio | fp32, fp16, q4, q8 |

### Export Commands

**Text models:**
```bash
# Export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Export specific precisions
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision fp16 q4
```

**Vision-language models:**
```bash
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# Alternative vision format for specific runtimes
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --vision-format conv2d
```

**MoE models:**
```bash
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision
```

**Audio models:**
```bash
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision
```

### Export Options

| Flag | Description |
|------|-------------|
| `--precision` | Output formats: `fp16`, `q4`, `q8`, or omit args for all |
| `--output-dir` | Output base directory (default: current directory) |
| `--skip-export` | Skip FP32 export, only run quantization on existing export |
| `--block-size` | Block size for quantization (default: 32) |
| `--q4-asymmetric` | Use asymmetric Q4 (default is symmetric for WebGPU) |
| `--split-data` | Split external data into chunks in GB (default: 2.0) |

### Inference with LiquidONNX

LiquidONNX includes inference commands for testing exported models:

```bash
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
  --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
  --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
  --prompt "Hello, how are you?" --output speech.wav --precision q4
```

For complete documentation and advanced options, see the [LiquidONNX GitHub repository](https://github.com/Liquid4All/onnx-export).