3 changes: 2 additions & 1 deletion .gitignore
@@ -158,4 +158,5 @@ Thumbs.db
*.swp
*.swo
*~
.vscode/
.vscode/
.onnx-tests/
1 change: 1 addition & 0 deletions docs.json
@@ -92,6 +92,7 @@
"docs/inference/vllm",
"docs/inference/mlx",
"docs/inference/ollama",
"docs/inference/onnx",
{
"group": "Other Frameworks",
"icon": "server",
4 changes: 2 additions & 2 deletions docs/help/faqs.mdx
@@ -20,7 +20,7 @@ LFM models are compatible with:
- [vLLM](/docs/inference/vllm) - For high-throughput production serving
- [MLX](/docs/inference/mlx) - For Apple Silicon optimization
- [Ollama](/docs/inference/ollama) - For easy local deployment
- [LEAP](/leap/index) - For edge and mobile deployment
- [LEAP](/leap/edge-sdk/overview) - For edge and mobile deployment
</Accordion>

## Model Selection
@@ -49,7 +49,7 @@ LFM2.5 models are updated versions with improved training that deliver higher pe
## Deployment

<Accordion title="Can I run LFM models on mobile devices?">
Yes! Use the [LEAP SDK](/leap/index) to deploy models on iOS and Android devices. LEAP provides optimized inference for edge deployment with support for quantized models.
Yes! Use the [LEAP SDK](/leap/edge-sdk/overview) to deploy models on iOS and Android devices. LEAP provides optimized inference for edge deployment with support for quantized models.
</Accordion>

<Accordion title="What quantization formats are available?">
8 changes: 0 additions & 8 deletions docs/index.mdx

This file was deleted.

251 changes: 251 additions & 0 deletions docs/inference/onnx.mdx
@@ -0,0 +1,251 @@
---
title: "ONNX"
description: "ONNX provides cross-platform inference for LFM models across CPUs, GPUs, NPUs, and browsers via WebGPU."
---

<Tip>
Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
</Tip>

ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications.

Many LFM models are available as pre-exported ONNX packages on Hugging Face. For models not yet available, use the [LiquidONNX](#liquidonnx-export-tool) tool to export any LFM to ONNX.

## Pre-exported Models

Pre-exported ONNX models are available from LiquidAI and the [onnx-community](https://huggingface.co/onnx-community). Check the [Model Library](/docs/models/complete-library) for a complete list of available formats.

I'd add a link on "from LiquidAI" that points to our HF


### Quantization Options

Each ONNX export includes multiple precision levels:

- **Q4** (recommended for most deployments): supports WebGPU, CPU, and GPU
- **FP16**: higher quality; works on WebGPU and GPU
- **Q8**: quality/size balance; server-only (CPU/GPU)
- **FP32**: full precision baseline
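
To check which precision variants a given repository actually ships, you can list its `onnx/` folder with `huggingface_hub`. A minimal sketch, assuming the file-naming pattern (`model_q4.onnx`, `model_fp16.onnx`, and so on) used by the pre-exported repos:

```python
from huggingface_hub import list_repo_files

# Inspect a pre-exported repo for its ONNX precision variants.
# The repo id below is the one used in the examples on this page;
# swap in any other LFM ONNX repo to inspect it.
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"

onnx_files = [
    f for f in list_repo_files(model_id)
    if f.startswith("onnx/") and f.endswith(".onnx")
]
print("Available precision files:", onnx_files)
```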

## Python Inference

### Installation

```bash
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
```
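
After installing, you can confirm which execution providers ONNX Runtime can see; `CUDAExecutionProvider` should only appear with the `onnxruntime-gpu` package and a working CUDA setup. A quick sanity check:

```python
import onnxruntime as ort

# The CPU wheel prints ['CPUExecutionProvider'];
# onnxruntime-gpu adds 'CUDAExecutionProvider' when CUDA is available.
print(ort.__version__)
print(ort.get_available_providers())
```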

Comment on lines +25 to +33

maybe instead point to the onnx-export repo directly or use it

git clone ...
uv sync
uv run ...

### Basic Usage

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
```
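
If you installed `onnxruntime-gpu`, you can request the CUDA execution provider when creating the session; ONNX Runtime falls back to the next provider in the list if one is unavailable. A small, assumed variation on the session creation above:

```python
import onnxruntime as ort

# Prefer CUDA when the GPU wheel is installed, otherwise fall back to CPU.
session = ort.InferenceSession(
    model_path,  # path from the download step above
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```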

## WebGPU Inference

ONNX models run in browsers via [Transformers.js](https://huggingface.co/docs/transformers.js) with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

### Setup

1. Install Transformers.js:
```bash
npm install @huggingface/transformers
```

2. Make sure WebGPU is available in your browser:
- **Chrome/Edge**: WebGPU is enabled by default in recent versions; if it is unavailable on your platform, enable `chrome://flags/#enable-unsafe-webgpu` and restart
- **Verify**: Check `chrome://gpu` for WebGPU status

### Usage

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

<Note>
WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.
</Note>

## LiquidONNX Export Tool

[LiquidONNX](https://github.com/Liquid4All/onnx-export) is the official tool for exporting LFM models to ONNX. Use it to export models not yet available as pre-built packages, or to customize export settings.

### Installation

```bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu
```

### Supported Models

| Family | Quantization Formats |
|--------|---------------------|
| LFM2.5, LFM2 (text) | fp32, fp16, q4, q8 |
| LFM2.5-VL, LFM2-VL (vision) | fp32, fp16, q4, q8 |
| LFM2-MoE | fp32, fp16, q4, q4f16 |
| LFM2.5-Audio | fp32, fp16, q4, q8 |

### Export Commands

**Text models:**
```bash
# Export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Export specific precisions
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision fp16 q4
```

**Vision-language models:**
```bash
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# Alternative vision format for specific runtimes
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --vision-format conv2d
```

**MoE models:**
```bash
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision
```

**Audio models:**
```bash
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision
```

### Export Options

| Flag | Description |
|------|-------------|
| `--precision` | Output formats: `fp16`, `q4`, `q8`, or omit args for all |
| `--output-dir` | Output base directory (default: current directory) |
| `--skip-export` | Skip FP32 export, only run quantization on existing export |
| `--block-size` | Block size for quantization (default: 32) |
| `--q4-asymmetric` | Use asymmetric Q4 (default is symmetric for WebGPU) |
| `--split-data` | Split external data into chunks in GB (default: 2.0) |
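
If you need to export several checkpoints with the same settings, the CLI above can be driven from a short script. A sketch, assuming the repository has been cloned and `uv sync` has been run; the model list is illustrative:

```python
import subprocess

# Batch-export text models at fp16 and q4 into a shared output directory.
models = [
    "LiquidAI/LFM2.5-1.2B-Instruct",  # same model used in the examples above
]

for model in models:
    subprocess.run(
        ["uv", "run", "lfm2-export", model,
         "--precision", "fp16", "q4",
         "--output-dir", "./exports"],
        check=True,
    )
```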

### Inference with LiquidONNX

LiquidONNX includes inference commands for testing exported models:

```bash
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
--images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
--audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
--prompt "Hello, how are you?" --output speech.wav --precision q4
```

For complete documentation and advanced options, see the [LiquidONNX GitHub repository](https://github.com/Liquid4All/onnx-export).