diff --git a/docs.json b/docs.json
index 1e1bcbd..8dade2d 100644
--- a/docs.json
+++ b/docs.json
@@ -90,6 +90,7 @@
             "docs/inference/transformers",
             "docs/inference/llama-cpp",
             "docs/inference/vllm",
+            "docs/inference/sglang",
             "docs/inference/mlx",
             "docs/inference/ollama",
             {
diff --git a/docs/inference/sglang.mdx b/docs/inference/sglang.mdx
new file mode 100644
index 0000000..65bed78
--- /dev/null
+++ b/docs/inference/sglang.mdx
@@ -0,0 +1,403 @@
+---
+title: "SGLang"
+description: "SGLang is a fast serving framework for large language models. It features RadixAttention for efficient prefix caching, optimized CUDA kernels, and continuous batching for high-throughput, low-latency inference."
+---
+
+
+  Use SGLang for low-latency online deployments such as RAG pipelines, search engines, and real-time chat applications.
+
+
+SGLang delivers ultra-low latency and high throughput, making it well-suited for production serving scenarios with many concurrent requests. It requires a CUDA-compatible GPU. For CPU-only environments, consider using [llama.cpp](/docs/inference/llama-cpp) instead.
+
+## Installation
+
+Install SGLang following the [official installation guide](https://docs.sglang.io/get_started/install_sglang.html). The recommended method is:
+
+```bash
+pip install --upgrade pip
+pip install uv
+uv pip install "sglang"
+```
+
+For other installation methods (source, Kubernetes), refer to the [SGLang documentation](https://docs.sglang.io/get_started/install_sglang.html).
+
+### Docker
+
+
+Please use the `dev` tag for LFM model support, as no stable release of SGLang supports it yet. For CUDA 13 environments (B300/GB300), use `lmsysorg/sglang:dev-cu13`.
+
+
+You can also run SGLang using Docker:
+
+```bash
+docker run --gpus all \
+  --shm-size 32g \
+  -p 30000:30000 \
+  -v ~/.cache/huggingface:/root/.cache/huggingface \
+  --env "HF_TOKEN=" \
+  --ipc=host \
+  lmsysorg/sglang:dev \
+  python3 -m sglang.launch_server \
+    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --chunked-prefill-size -1
+```
+
+## Launching the Server
+
+Start the SGLang server with the following command:
+
+```bash
+python3 -m sglang.launch_server \
+  --model LiquidAI/LFM2.5-1.2B-Instruct \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --chunked-prefill-size -1
+```
+
+Optional parameters:
+
+* `--chunked-prefill-size -1`: Disables chunked prefill for lower latency
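+
+To confirm the server is up before wiring it into an application, you can query the OpenAI-compatible model list it exposes. This is a minimal sketch that assumes the default host and port used above:
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+# List the models the server is currently serving; the launched
+# LFM checkpoint should appear in the output.
+for model in client.models.list():
+    print(model.id)
+```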
+
+### Ultra Low Latency on Blackwell (B300)
+
+Running a 1.2B model on a B300 may sound counterintuitive, but combining `--enable-torch-compile` with Blackwell's architecture unlocks extremely low latency — ideal for latency-sensitive workloads like RAG, search, and real-time chat.
+
+
+  If your workload has concurrency under 256, we recommend using `--enable-torch-compile` for significantly lower latency. For pure throughput batch processing at very high concurrency, skip this flag.
+
+
+```bash
+python3 -m sglang.launch_server \
+  --model LiquidAI/LFM2.5-1.2B-Instruct \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --enable-torch-compile \
+  --chunked-prefill-size -1
+```
+
+On B300/CUDA 13, use the dedicated Docker image:
+
+```bash
+docker run --gpus all \
+  --shm-size 32g \
+  -p 30000:30000 \
+  -v ~/.cache/huggingface:/root/.cache/huggingface \
+  --env "HF_TOKEN=" \
+  --ipc=host \
+  lmsysorg/sglang:dev-cu13 \
+  python3 -m sglang.launch_server \
+    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --enable-torch-compile \
+    --chunked-prefill-size -1
+```
+
+With this configuration, end-to-end latency can be as low as **180ms** per request. Benchmark results on a B300 GPU with CUDA 13:
+
+```
+============ Serving Benchmark Result ============
+Backend:                                 sglang-oai-chat
+Successful requests:                     256
+Benchmark duration (s):                  49.16
+Total input tokens:                      82131
+Total generated tokens:                  54126
+Request throughput (req/s):              5.21
+Input token throughput (tok/s):          1670.54
+Output token throughput (tok/s):         1100.92
+Total token throughput (tok/s):          2771.47
+----------------End-to-End Latency----------------
+Mean E2E Latency (ms):                   191.88
+Median E2E Latency (ms):                 119.90
+P99 E2E Latency (ms):                    760.65
+---------------Time to First Token----------------
+Mean TTFT (ms):                          8.79
+Median TTFT (ms):                        8.11
+P99 TTFT (ms):                           17.45
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          0.86
+Median TPOT (ms):                        0.86
+P99 TPOT (ms):                           0.89
+==================================================
+```
+
+
+  ```bash
+  python3 -m sglang.bench_serving \
+    --backend sglang-oai-chat \
+    --num-prompts 256 \
+    --max-concurrency 1 \
+    --random-input-len 1024 \
+    --random-output-len 128 \
+    --warmup-requests 128
+  ```
+
+
+## Chat Completions
+
+Once the server is running, use the OpenAI Python client or any OpenAI-compatible tool:
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+response = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-1.2B-Instruct",
+    messages=[
+        {"role": "user", "content": "What is machine learning?"}
+    ],
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+)
+
+print(response.choices[0].message.content)
+```
+
+SGLang-specific sampling parameters such as `min_p` and `repetition_penalty` are not part of the OpenAI SDK's method signature, so they are passed through `extra_body`.
+
+### Streaming
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+stream = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-1.2B-Instruct",
+    messages=[
+        {"role": "user", "content": "Tell me a story."}
+    ],
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+    stream=True,
+)
+
+for chunk in stream:
+    if chunk.choices[0].delta.content is not None:
+        print(chunk.choices[0].delta.content, end="")
+```
+
+### Multi-turn Conversations
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+response = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-1.2B-Instruct",
+    messages=[
+        {
+            "role": "system",
+            "content": "You are a knowledgeable historian who provides concise responses.",
+        },
+        {"role": "user", "content": "Tell me about ancient Rome"},
+        {
+            "role": "assistant",
+            "content": "Ancient Rome was a civilization centered in Italy.",
+        },
+        {"role": "user", "content": "What were their major achievements?"},
+    ],
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+)
+
+print(response.choices[0].message.content)
+```
+
+
+  ```bash
+  curl http://localhost:30000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "LiquidAI/LFM2.5-1.2B-Instruct",
+      "messages": [
+        {"role": "user", "content": "What is AI?"}
+      ],
+      "temperature": 0.3,
+      "min_p": 0.15,
+      "repetition_penalty": 1.05
+    }'
+  ```
+
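+
+### Concurrent Requests
+
+SGLang's continuous batching is designed for many simultaneous requests, so client code can simply issue requests concurrently and let the server batch them. The sketch below uses `AsyncOpenAI` for illustration; the prompts and concurrency level are arbitrary placeholders.
+
+```python
+import asyncio
+
+from openai import AsyncOpenAI
+
+client = AsyncOpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+async def ask(question: str) -> str:
+    response = await client.chat.completions.create(
+        model="LiquidAI/LFM2.5-1.2B-Instruct",
+        messages=[{"role": "user", "content": question}],
+        temperature=0.3,
+        extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+    )
+    return response.choices[0].message.content
+
+async def main() -> None:
+    questions = [
+        "What is machine learning?",
+        "Explain RadixAttention in one sentence.",
+        "What is continuous batching?",
+    ]
+    # Requests are sent concurrently; SGLang batches them server-side.
+    answers = await asyncio.gather(*(ask(q) for q in questions))
+    for question, answer in zip(questions, answers):
+        print(f"Q: {question}\nA: {answer}\n")
+
+asyncio.run(main())
+```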
"Content-Type: application/json" \ + -d '{ + "model": "LiquidAI/LFM2.5-1.2B-Instruct", + "messages": [ + {"role": "user", "content": "What is AI?"} + ], + "temperature": 0.3, + "min_p": 0.15, + "repetition_penalty": 1.05 + }' + ``` + + +## Tool Calling + +SGLang supports tool calling (function calling) with LFM models via the `--tool-call-parser` flag. Launch the server with tool calling enabled: + +```bash +python3 -m sglang.launch_server \ + --model LiquidAI/LFM2.5-1.2B-Instruct \ + --host 0.0.0.0 \ + --port 30000 \ + --chunked-prefill-size -1 \ + --tool-call-parser lfm2 +``` + +Then use the OpenAI tools API: + +```python +import json +from openai import OpenAI + +client = OpenAI( + base_url="http://localhost:30000/v1", + api_key="None" +) + +tools = [ + { + "type": "function", + "function": { + "name": "get_candidate_status", + "description": "Retrieves the current status of a candidate in the recruitment process", + "parameters": { + "type": "object", + "properties": { + "candidate_id": { + "type": "string", + "description": "Unique identifier for the candidate", + } + }, + "required": ["candidate_id"], + }, + }, + } +] + +messages = [ + {"role": "user", "content": "What is the current status of candidate ID 12345?"} +] + +response = client.chat.completions.create( + model="LiquidAI/LFM2.5-1.2B-Instruct", + messages=messages, + tools=tools, + tool_choice="auto", + temperature=0.3, + min_p=0.15, + repetition_penalty=1.05, + +) + +# The model may return a tool call +tool_calls = response.choices[0].message.tool_calls +if tool_calls: + print(f"Function: {tool_calls[0].function.name}") + print(f"Arguments: {tool_calls[0].function.arguments}") +``` + +For more details on tool parsing configuration, see the [SGLang Tool Parser documentation](https://docs.sglang.io/advanced_features/tool_parser.html). + +## Vision Models + +### Installation for Vision Models + +To use LFM Vision Models with SGLang, install the required transformers version: + +```bash +pip install transformers==5.0.0 +``` + + +Transformers v5 is newly released. 
+
+## Vision Models
+
+### Installation for Vision Models
+
+To use LFM Vision Models with SGLang, install the required transformers version:
+
+```bash
+pip install transformers==5.0.0
+```
+
+
+Transformers v5 is newly released. If you encounter issues, fall back to the pinned git source:
+```bash
+pip install git+https://github.com/huggingface/transformers.git@3c2517727ce28a30f5044e01663ee204deb1cdbe
+```
+
+
+### Launching the Server
+
+Serve the vision model with `--trust-remote-code` (required because the vision processor code is loaded from the remote Hugging Face repository):
+
+```bash
+python3 -m sglang.launch_server \
+  --model LiquidAI/LFM2.5-VL-1.6B \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --trust-remote-code
+```
+
+### OpenAI-Compatible API
+
+Then use the OpenAI client with image content:
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+response = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-VL-1.6B",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe what you see in this image."},
+                {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}}
+            ]
+        }
+    ],
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+)
+
+print(response.choices[0].message.content)
+```
+
+You can also pass base64-encoded images:
+
+```python
+import base64
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:30000/v1",
+    api_key="None"
+)
+
+# Load and encode image
+with open("path/to/image.jpg", "rb") as f:
+    image_base64 = base64.b64encode(f.read()).decode()
+
+# Chat completion with image
+response = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-VL-1.6B",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe this image in detail."},
+                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
+            ]
+        }
+    ],
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+)
+
+print(response.choices[0].message.content)
+```
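+
+Because the chat completions API is stateless, follow-up questions about an image must resend the image along with the conversation history. A short sketch, reusing the `client` and `image_base64` from the previous example:
+
+```python
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Describe this image in detail."},
+            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
+        ]
+    }
+]
+
+first = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-VL-1.6B",
+    messages=messages,
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+)
+
+# Keep the assistant's answer in the history and ask a follow-up question.
+messages.append({"role": "assistant", "content": first.choices[0].message.content})
+messages.append({"role": "user", "content": "What colors stand out the most?"})
+
+followup = client.chat.completions.create(
+    model="LiquidAI/LFM2.5-VL-1.6B",
+    messages=messages,
+    temperature=0.3,
+    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
+)
+
+print(followup.choices[0].message.content)
+```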