Add ONNX inference documentation #49

base: main
```diff
@@ -158,4 +158,5 @@ Thumbs.db
 *.swp
 *.swo
 *~
 .vscode/
+.onnx-tests/
```
This file was deleted.

docs/inference/onnx.mdx (new file, +251 lines):

---
title: "ONNX"
description: "ONNX provides cross-platform inference for LFM models across CPUs, GPUs, NPUs, and browsers via WebGPU."
---

<Tip>
Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
</Tip>

ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU, making them ideal for edge deployment and web applications.

Many LFM models are available as pre-exported ONNX packages on Hugging Face. For models not yet available, use the [LiquidONNX](#liquidonnx-export-tool) tool to export any LFM to ONNX.

## Pre-exported Models

Pre-exported ONNX models are available from LiquidAI and the [onnx-community](https://huggingface.co/onnx-community). Check the [Model Library](/docs/models/complete-library) for a complete list of available formats.

> **Review comment:** I'd add a link on "from LiquidAI" that points to our HF.

### Quantization Options

Each ONNX export includes multiple precision levels. **Q4** is recommended for most deployments and supports WebGPU, CPU, and GPU. **FP16** offers higher quality and works on WebGPU and GPU. **Q8** provides a quality/size balance but is server-only (CPU/GPU). **FP32** is the full-precision baseline.

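To check which precision variants a given package actually ships, you can list its ONNX files. A minimal sketch using `huggingface_hub` (the repository ID is the example used in the Python section below; exact filenames may vary by package):

```python
from huggingface_hub import list_repo_files

# Example repository from the Python section below; substitute your target package.
repo_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"

# Each precision variant is a separate file under onnx/ (e.g. onnx/model_q4.onnx).
for path in list_repo_files(repo_id):
    if path.startswith("onnx/") and path.endswith(".onnx"):
        print(path)
```
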
## Python Inference

### Installation

```bash
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
```

> **Review comment on lines +25 to +33:** Maybe instead point to the onnx-export repo directly, or use it.

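To confirm the runtime can see your hardware, you can print the available execution providers. This is plain `onnxruntime` API, shown here as an optional sanity check rather than a required step:

```python
import onnxruntime as ort

# Lists providers such as CPUExecutionProvider, plus CUDAExecutionProvider
# when onnxruntime-gpu is installed with a working CUDA setup.
print(ort.get_available_providers())
```
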
### Basic Usage

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
```
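
The loop above decodes greedily with `argmax`. If you want varied output instead, a minimal sketch that swaps in temperature sampling for the token-selection line (same logits layout as above; the temperature value is only an example):

```python
# Replace `next_token = int(np.argmax(outputs[0][0, -1]))` with sampling:
temperature = 0.8  # example value, tune to taste
logits = outputs[0][0, -1].astype(np.float64) / temperature
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()
next_token = int(np.random.choice(len(probs), p=probs))
```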

## WebGPU Inference

ONNX models run in browsers via [Transformers.js](https://huggingface.co/docs/transformers.js) with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

### Setup

1. Install Transformers.js:

   ```bash
   npm install @huggingface/transformers
   ```

2. Enable WebGPU in your browser:
   - **Chrome/Edge**: Navigate to `chrome://flags/#enable-unsafe-webgpu`, enable it, and restart
   - **Verify**: Check `chrome://gpu` for WebGPU status

### Usage

```javascript
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
```

<Note>
WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.
</Note>

## LiquidONNX Export Tool

[LiquidONNX](https://github.com/Liquid4All/onnx-export) is the official tool for exporting LFM models to ONNX. Use it to export models not yet available as pre-built packages, or to customize export settings.

### Installation

```bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu
```

### Supported Models

| Family | Quantization Formats |
|--------|---------------------|
| LFM2.5, LFM2 (text) | fp32, fp16, q4, q8 |
| LFM2.5-VL, LFM2-VL (vision) | fp32, fp16, q4, q8 |
| LFM2-MoE | fp32, fp16, q4, q4f16 |
| LFM2.5-Audio | fp32, fp16, q4, q8 |

### Export Commands

**Text models:**
```bash
# Export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Export specific precisions
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision fp16 q4
```

**Vision-language models:**
```bash
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# Alternative vision format for specific runtimes
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --vision-format conv2d
```

**MoE models:**
```bash
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision
```

**Audio models:**
```bash
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision
```

### Export Options

| Flag | Description |
|------|-------------|
| `--precision` | Output formats: `fp16`, `q4`, `q8`, or omit args for all |
| `--output-dir` | Output base directory (default: current directory) |
| `--skip-export` | Skip FP32 export, only run quantization on existing export |
| `--block-size` | Block size for quantization (default: 32) |
| `--q4-asymmetric` | Use asymmetric Q4 (default is symmetric for WebGPU) |
| `--split-data` | Split external data into chunks in GB (default: 2.0) |

### Inference with LiquidONNX

LiquidONNX includes inference commands for testing exported models:

```bash
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
  --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
  --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
  --prompt "Hello, how are you?" --output speech.wav --precision q4
```

For complete documentation and advanced options, see the [LiquidONNX GitHub repository](https://github.com/Liquid4All/onnx-export).