From b9d8bb5aeab35085aeaa2c8321e7f23e0e1a8a4e Mon Sep 17 00:00:00 2001
From: Dariusz Trawinski
Date: Mon, 9 Mar 2026 10:57:12 +0100
Subject: [PATCH 1/9] draft of LLM demo update

---
 demos/continuous_batching/README.md | 228 ++++------------------------
 1 file changed, 32 insertions(+), 196 deletions(-)

diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md
index b6a2196205..e424fca18b 100644
--- a/demos/continuous_batching/README.md
+++ b/demos/continuous_batching/README.md
@@ -15,70 +15,14 @@ This demo shows how to deploy LLM models in the OpenVINO Model Server using cont
 Text generation use case is exposed via OpenAI API `chat/completions` and `completions` endpoints.
 That makes it easy to use and efficient especially on Intel® Xeon® processors and ARC GPUs.
 
-> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu22/24, RedHat8/9 and Windows11.
+> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors and Intel® Core Ultra Series on Ubuntu24 and Windows11.
 
 ## Prerequisites
 
-**Model preparation**: Python 3.9 or higher with pip and HuggingFace account
-
 **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
 
-**(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app
-
-
-## Model preparation
-Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized.
-That ensures faster initialization time, better performance and lower memory consumption.
-LLM engine parameters will be defined inside the `graph.pbtxt` file.
- -Download export script, install it's dependencies and create directory for the models: -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -mkdir models -``` - -Run `export_model.py` script to download and quantize the model: - -> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page. Issue the following command and enter the authentication token. Authenticate via `huggingface-cli login`. -> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub. - -**CPU** -```console -python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - -**GPU** -```console -python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - -> **Note:** Change the `--weight-format` to quantize the model to `int8` or `int4` precision to reduce memory consumption and improve performance. - -> **Note:** You can change the model used in the demo out of any topology [tested](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models) with OpenVINO. 
- -You should have a model folder like below: -``` -tree models -models -├── config.json -└── meta-llama - └── Meta-Llama-3-8B-Instruct - ├── config.json - ├── generation_config.json - ├── graph.pbtxt - ├── openvino_detokenizer.bin - ├── openvino_detokenizer.xml - ├── openvino_model.bin - ├── openvino_model.xml - ├── openvino_tokenizer.bin - ├── openvino_tokenizer.xml - ├── special_tokens_map.json - ├── tokenizer_config.json - └── tokenizer.json -``` +**(Optional) Client**: Git and Python for using OpenAI client package and vLLM benchmark app -The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options. ## Server Deployment @@ -86,58 +30,41 @@ The default configuration should work in most cases but the parameters can be tu Select deployment option depending on how you prepared models in the previous step. -**CPU** +**CPU Docker on Ubuntu24** Running this command starts the container with CPU only target device: ```bash -docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json +docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov ``` -**GPU** +**GPU baremetal on Windows11** In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` to `docker run` command, use the image with GPU support. 
Export the models with precision matching the GPU capacity and adjust pipeline configuration. It can be applied using the commands below: -```bash -docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json -``` -::: - -:::{dropdown} **Deploying on Bare Metal** - -Assuming you have unpacked model server package, make sure to: - -- **On Windows**: run `setupvars` script -- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables - -as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. - -Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. - ```bat -ovms --rest_port 8000 --config_path ./models/config.json +set MOE_USE_MICRO_GEMM_PREFILL=0 +ovms.exe --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov ``` ::: + ## Readiness Check Wait for the model to load. 
You can check the status with a simple command: ```console -curl http://localhost:8000/v1/config +curl http://localhost:8000/v3/models ``` ```json { - "meta-llama/Meta-Llama-3-8B-Instruct": { - "model_version_status": [ - { - "version": "1", - "state": "AVAILABLE", - "status": { - "error_code": "OK", - "error_message": "OK" - } - } - ] + "object": "list", + "data": [ + { + "id": "Qwen3-30B-A3B-Instruct-2507-int4-ov", + "object": "model", + "created": 1772928358, + "owned_by": "OVMS" } + ] } ``` @@ -156,8 +83,8 @@ Completion endpoint should be used to pass the prompt directly by the client and curl http://localhost:8000/v3/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "max_tokens":30, + "model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", + "max_completion_tokens":300, "stream":false, "messages": [ { @@ -166,7 +93,7 @@ curl http://localhost:8000/v3/chat/completions \ }, { "role": "user", - "content": "What is OpenVINO?" + "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?" } ] }'| jq . @@ -195,84 +122,29 @@ curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/ { "choices": [ { - "finish_reason": "length", + "finish_reason": "stop", "index": 0, "logprobs": null, "message": { - "content": "OpenVINO is an open-source software framework developed by Intel for optimizing and deploying computer vision, machine learning, and deep learning models on various devices,", - "role": "assistant" + "content": "We are given a pattern:\n\n- 1 = 3 \n- 2 = 3 \n- 3 = 5 \n- 4 = 4 \n- 5 = 4 \n- 6 = ?\n\nWe need to find what **6** equals based on this pattern.\n\nLet’s analyze the pattern.\n\nAt first glance, it's not a mathematical operation like addition or multiplication. 
Let's look at the **number of letters** in the **English word** for each number.\n\nTry that:\n\n- 1 → \"one\" → 3 letters → matches 1 = 3 ✅ \n- 2 → \"two\" → 3 letters → matches 2 = 3 ✅ \n- 3 → \"three\" → 5 letters → matches 3 = 5 ✅ \n- 4 → \"four\" → 4 letters → matches 4 = 4 ✅ \n- 5 → \"five\" → 4 letters → matches 5 = 4 ✅ \n- 6 → \"six\" → 3 letters → So, 6 = 3?\n\nWait — let’s double-check:\n\n- \"six\" has 3 letters → so 6 = 3?\n\nBut let's confirm the pattern again.\n\nYes! The pattern is: \n**The number on the left equals the number of letters in the English word for that number.**\n\nSo:\n\n| Number | Word | Letters |\n|--------|----------|---------|\n| 1 | one | 3 |\n| 2 | two | 3 |\n| 3 | three | 5 |\n| 4 | four | 4 |\n| 5 | five | 4 |\n| 6 | six | 3 |\n\nSo, **6 = 3**\n\n### ✅ Final Answer: **3**", + "role": "assistant", + "tool_calls": [] } } ], - "created": 1724405301, - "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "created": 1772929186, + "model": "ovms-model", "object": "chat.completion", "usage": { - "prompt_tokens": 27, - "completion_tokens": 30, - "total_tokens": 57 + "prompt_tokens": 45, + "completion_tokens": 394, + "total_tokens": 439 } } ``` -::: -### Unary calls to completions endpoint using cURL -A similar call can be made with a `completion` endpoint: -::::{tab-set} -:::{tab-item} Linux -```bash -curl http://localhost:8000/v3/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "max_tokens":30, - "stream":false, - "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>" - }'| jq . 
-``` -::: - -:::{tab-item} Windows -Windows Powershell -```powershell -(Invoke-WebRequest -Uri "http://localhost:8000/v3/completions" ` - -Method POST ` - -Headers @{ "Content-Type" = "application/json" } ` - -Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"}').Content -``` - -Windows Command Prompt -```bat -curl -s http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"prompt\":\"^<^|begin_of_text^|^>^<^|start_header_id^|^>system^<^|end_header_id^|^>\n\nYou are assistant^<^|eot_id^|^>^<^|start_header_id^|^>user^<^|end_header_id^|^>\n\nWhat is OpenVINO?^<^|eot_id^|^>^<^|start_header_id^|^>assistant^<^|end_header_id^|^>\"}" -``` -::: - -:::: - -:::{dropdown} Expected Response -```json -{ - "choices": [ - { - "finish_reason": "length", - "index": 0, - "logprobs": null, - "text": "\n\nOpenVINO is an open-source computer vision platform developed by Intel for deploying and optimizing computer vision, machine learning, and autonomous driving applications. 
It" - } - ], - "created": 1724405354, - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "object": "text_completion", - "usage": { - "prompt_tokens": 23, - "completion_tokens": 30, - "total_tokens": 53 - } -} -``` -::: -### Streaming call with OpenAI Python package +### OpenAI Python package The endpoints `chat/completions` and `completions` are compatible with OpenAI client so it can be easily used to generate code also in streaming mode: @@ -283,7 +155,7 @@ pip3 install openai ::::{tab-set} -:::{tab-item} Chat completions +:::{tab-item} Chat completions with streaming ```python from openai import OpenAI @@ -309,7 +181,7 @@ It looks like you're testing me! ::: -:::{tab-item} Completions +:::{tab-item} Chat completions with unary response ```console pip3 install openai @@ -340,41 +212,6 @@ It looks like you're testing me! :::: -## Benchmarking text generation with high concurrency - -OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients. 
-It can be demonstrated using benchmarking app from vLLM repository: -```console -git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm -cd vllm -pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu -cd benchmarks -curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset -python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf - -Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None) - -Traffic request rate: inf -100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it] -============ Serving Benchmark Result ============ -Successful requests: 1000 -Benchmark duration (s): 447.62 -Total input tokens: 215201 -Total generated tokens: 198588 -Request throughput (req/s): 2.23 -Output token 
throughput (tok/s): 443.65 -Total Token throughput (tok/s): 924.41 ----------------Time to First Token---------------- -Mean TTFT (ms): 171999.94 -Median TTFT (ms): 170699.21 -P99 TTFT (ms): 360941.40 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 211.31 -Median TPOT (ms): 223.79 -P99 TPOT (ms): 246.48 -================================================== -``` - ## RAG with Model Server The service deployed above can be used in RAG chain using `langchain` library with OpenAI endpoint as the LLM engine. @@ -385,7 +222,6 @@ Check the example in the [RAG notebook](https://github.com/openvinotoolkit/model Check this simple [text generation scaling demo](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/scaling/README.md). - ## Testing the model accuracy over serving API Check the [guide of using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md) From 1ed8c3e6fba1567fdab8c9d4d5e1d7a3046ee8f4 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Mon, 9 Mar 2026 14:07:15 +0100 Subject: [PATCH 2/9] menu cleanup --- demos/README.md | 11 +-- demos/continuous_batching/README.md | 96 ++++++++++++++----- .../speculative_decoding/README.md | 2 +- demos/continuous_batching/vlm/README.md | 14 ++- 4 files changed, 90 insertions(+), 33 deletions(-) diff --git a/demos/README.md b/demos/README.md index 6c8de87f7b..c8a808fa1c 100644 --- a/demos/README.md +++ b/demos/README.md @@ -5,16 +5,13 @@ maxdepth: 1 hidden: --- -ovms_demos_continuous_batching_agent +ovms_demos_continuous_batching ovms_demos_integration_with_open_webui +ovms_demos_code_completion_vsc +ovms_demos_audio ovms_demos_rerank ovms_demos_embeddings -ovms_demos_continuous_batching -ovms_demo_long_context ovms_demos_continuous_batching_vlm -ovms_demos_llm_npu -ovms_demos_vlm_npu -ovms_demos_code_completion_vsc ovms_demos_image_generation ovms_demo_clip_image_classification 
ovms_demo_age_gender_guide @@ -40,10 +37,8 @@ ovms_demo_real_time_stream_analysis ovms_demo_using_paddlepaddle_model ovms_demo_bert ovms_demo_universal-sentence-encoder -ovms_demo_benchmark_client ovms_string_output_model_demo ovms_demos_gguf -ovms_demos_audio ``` diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md index e424fca18b..57413dfbf1 100644 --- a/demos/continuous_batching/README.md +++ b/demos/continuous_batching/README.md @@ -1,14 +1,18 @@ -# How to serve LLM models with Continuous Batching via OpenAI API {#ovms_demos_continuous_batching} +# LLM models via OpenAI API {#ovms_demos_continuous_batching} ```{toctree} --- maxdepth: 1 hidden: --- -ovms_demos_continuous_batching_accuracy +ovms_demos_continuous_batching_agent ovms_demos_continuous_batching_rag ovms_demos_continuous_batching_scaling ovms_demos_continuous_batching_speculative_decoding +ovms_structured_output +ovms_demo_long_context +ovms_demos_llm_npu +ovms_demos_continuous_batching_accuracy ``` This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms. 
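The readiness check in this demo polls the OpenAI-compatible `/v3/models` endpoint until the served model is advertised. As a small illustration, the helper below parses a response of that shape and verifies that the expected model id is present. This is a sketch, not part of the demo itself: the sample body mirrors the response shown in the readiness check, and the model id is the one used throughout this demo (adjust both for your own deployment).

```python
import json

def served_model_ids(body: str) -> list:
    """Extract model ids from an OpenAI-compatible /v3/models response body."""
    return [entry["id"] for entry in json.loads(body)["data"]]

# Sample body shaped like the readiness-check response shown in this demo.
sample = """
{
  "object": "list",
  "data": [
    {
      "id": "Qwen3-30B-A3B-Instruct-2507-int4-ov",
      "object": "model",
      "created": 1772928358,
      "owned_by": "OVMS"
    }
  ]
}
"""

# The deployment is ready once the expected model id appears in the list.
print("Qwen3-30B-A3B-Instruct-2507-int4-ov" in served_model_ids(sample))
# prints: True
```

In a client, the same check would be run against `http://localhost:8000/v3/models` in a retry loop before sending generation requests.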
@@ -106,12 +110,12 @@ Windows Powershell (Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" ` -Method POST ` -Headers @{ "Content-Type" = "application/json" } ` - -Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content + -Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content ``` Windows Command Prompt ```bat -curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What are the 3 main tourist attractions in Paris?\"}]}" +curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}" ``` ::: @@ -165,8 +169,8 @@ client = OpenAI( ) stream = client.chat.completions.create( - model="meta-llama/Meta-Llama-3-8B-Instruct", - messages=[{"role": "user", "content": "Say this is a test"}], + model="Qwen3-30B-A3B-Instruct-2507-int4-ov", + messages=[{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}], stream=True, ) for chunk in stream: @@ -176,7 +180,31 @@ for chunk in stream: Output: ``` -It looks like you're testing me! 
+We are given a pattern: + +- 1 = 3 +- 2 = 3 +- 3 = 5 +- 4 = 4 +- 5 = 4 +- 6 = ? + +We need to find the value for 6. + +Let’s look at the pattern. The numbers on the left are integers, and the values on the right seem to represent something about the number itself. + +Let’s consider: **the number of letters in the English word for the number.** + +Check: + +- **1** → "one" → 3 letters → matches 3 ✅ +- **2** → "two" → 3 letters → matches 3 ✅ +- **3** → "three" → 5 letters → matches 5 ✅ +- **4** → "four" → 4 letters → matches 4 ✅ +- **5** → "five" → 4 letters → matches 4 ✅ +- **6** → "six" → 3 letters → so **6 = 3** + +### ✅ Answer: **3** ``` ::: @@ -194,24 +222,50 @@ client = OpenAI( api_key="unused" ) -stream = client.completions.create( - model="meta-llama/Meta-Llama-3-8B-Instruct", - prompt="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nSay this is a test.<|eot_id|><|start_header_id|>assistant<|end_header_id|>", - stream=True, +response = client.chat.completions.create( + model="Qwen3-30B-A3B-Instruct-2507-int4-ov", + messages=[{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}], + stream=False, ) -for chunk in stream: - if chunk.choices[0].text is not None: - print(chunk.choices[0].text, end="", flush=True) +print(response.choices[0].message.content) ``` Output: ``` -It looks like you're testing me! +We are given a pattern: + +- 1 = 3 +- 2 = 3 +- 3 = 5 +- 4 = 4 +- 5 = 4 +- 6 = ? + +We need to find the value for 6. + +Let’s look at the pattern. The numbers on the left are integers, and the values on the right seem to represent something about the **number of letters** when the number is written out in English. + +Let’s check: + +- **1** = "one" → 3 letters → matches 3 ✅ +- **2** = "two" → 3 letters → matches 3 ✅ +- **3** = "three" → 5 letters → matches 5 ✅ +- **4** = "four" → 4 letters → matches 4 ✅ +- **5** = "five" → 4 letters → matches 4 ✅ +- **6** = "six" → 3 letters → so 6 = **3** + +### ✅ Answer: **3** + +So, **6 = 3**. 
``` ::: :::: +## Check how to use AI agents with MCP servers and language models + +Check the demo [AI agent with MCP server and OpenVINO acceleration](./agentic_ai/README.md) + ## RAG with Model Server The service deployed above can be used in RAG chain using `langchain` library with OpenAI endpoint as the LLM engine. @@ -222,10 +276,6 @@ Check the example in the [RAG notebook](https://github.com/openvinotoolkit/model Check this simple [text generation scaling demo](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/scaling/README.md). -## Testing the model accuracy over serving API - -Check the [guide of using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md) - ## Use Speculative Decoding Check the [guide for speculative decoding](./speculative_decoding/README.md) @@ -234,14 +284,14 @@ Check the [guide for speculative decoding](./speculative_decoding/README.md) Check the demo [text generation with visual model](./vlm/README.md) -## Check how to use AI agents with MCP servers and language models - -Check the demo [AI agent with MCP server and OpenVINO acceleration](./agentic_ai/README.md) - ## Use structured output with json schema guided generation Check the demo [structured output](./structured_output/README.md) +## Testing the model accuracy over serving API + +Check the [guide of using lm-evaluation-harness](./accuracy/README.md) + ## References - [Chat Completions API](../../docs/model_server_rest_api_chat.md) - [Completions API](../../docs/model_server_rest_api_completions.md) diff --git a/demos/continuous_batching/speculative_decoding/README.md b/demos/continuous_batching/speculative_decoding/README.md index 6b24efc169..9b3398677e 100644 --- a/demos/continuous_batching/speculative_decoding/README.md +++ b/demos/continuous_batching/speculative_decoding/README.md @@ -1,4 +1,4 @@ -# How to serve LLM Models in Speculative Decoding 
Pipeline{#ovms_demos_continuous_batching_speculative_decoding}
+# LLM Models in Speculative Decoding Pipeline {#ovms_demos_continuous_batching_speculative_decoding}
 
 Following [OpenVINO GenAI docs](https://docs.openvino.ai/2026/openvino-workflow-generative/inference-with-genai.html#efficient-text-generation-via-speculative-decoding):
 > Speculative decoding (or assisted-generation) enables faster token generation when an additional smaller draft model is used alongside the main model. This reduces the number of infer requests to the main model, increasing performance.
 
diff --git a/demos/continuous_batching/vlm/README.md b/demos/continuous_batching/vlm/README.md
index b0664eb268..b4d9176dde 100644
--- a/demos/continuous_batching/vlm/README.md
+++ b/demos/continuous_batching/vlm/README.md
@@ -1,4 +1,12 @@
-# How to serve VLM models via OpenAI API {#ovms_demos_continuous_batching_vlm}
+# VLM models via OpenAI API {#ovms_demos_continuous_batching_vlm}
+
+```{toctree}
+---
+maxdepth: 1
+hidden:
+---
+ovms_demos_vlm_npu
+```
 
 This demo shows how to deploy Vision Language Models in the OpenVINO Model Server.
Text generation use case is exposed via OpenAI API `chat/completions` endpoint.
@@ -367,6 +375,10 @@ P99 TPOT (ms): 8.02 Check the [guide of using lm-evaluation-harness](../accuracy/README.md) +## VLM models deployment with NPU acceleration + +Check [VLM usage with NPU acceleration](../../vlm_npu/README.md) + ## References - [Chat Completions API](../../../docs/model_server_rest_api_chat.md) From fe6ae45496f737626c251623f9ca07c582842017 Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Mon, 9 Mar 2026 14:10:42 +0100 Subject: [PATCH 3/9] fix --- demos/continuous_batching/README.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md index 57413dfbf1..2712de1d38 100644 --- a/demos/continuous_batching/README.md +++ b/demos/continuous_batching/README.md @@ -30,10 +30,6 @@ That makes it easy to use and efficient especially on on Intel® Xeon® processo ## Server Deployment -:::{dropdown} **Deploying with Docker** - -Select deployment option depending on how you prepared models in the previous step. 
-
 **CPU Docker on Ubuntu24**
 
 Running this command starts the container with CPU only target device:

From 95b5c02b84b1d7675397e43461d38a550d97badc Mon Sep 17 00:00:00 2001
From: Dariusz Trawinski
Date: Mon, 9 Mar 2026 15:57:37 +0100
Subject: [PATCH 4/9] fix

---
 demos/continuous_batching/README.md        | 2 +-
 demos/integration_with_OpenWebUI/README.md | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md
index 2712de1d38..bd8a8ae6b0 100644
--- a/demos/continuous_batching/README.md
+++ b/demos/continuous_batching/README.md
@@ -45,7 +45,6 @@ It can be applied using the commands below:
 set MOE_USE_MICRO_GEMM_PREFILL=0
 ovms.exe --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
 ```
-:::
 
 ## Readiness Check

diff --git a/demos/integration_with_OpenWebUI/README.md b/demos/integration_with_OpenWebUI/README.md
index 615807976f..674e5394b9 100644
--- a/demos/integration_with_OpenWebUI/README.md
+++ b/demos/integration_with_OpenWebUI/README.md
@@ -1,4 +1,4 @@
-# Demonstrating integration of Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}
+# Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}
 
 ## Description
 
@@ -70,7 +70,9 @@ Go to [http://localhost:8080](http://localhost:8080) and create admin account to
 
 ![get started with Open WebUI](./get_started_with_Open_WebUI.png)
 
-### Reference
+> **Important Note**: When using an NPU device for acceleration, or the gpt-oss-20b model on GPU, it is recommended to disable `Follow-Up Auto-Generation` in the `Settings > Interface` menu. It will improve response time and avoid queuing requests.
For the gpt-oss model it also avoids concurrent execution, which has an accuracy issue in version 2026.0.
+
+### References
 [https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html](https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html#model-preparation)
 [https://docs.openwebui.com](https://docs.openwebui.com/#installation-with-pip)

From 2ba0f218d74e6c2b0a1189bd Mon Sep 17 00:00:00 2001
From: Dariusz Trawinski
Date: Mon, 9 Mar 2026 16:18:26 +0100
Subject: [PATCH 5/9] fix

---
 demos/continuous_batching/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md
index bd8a8ae6b0..7dbacb1a47 100644
--- a/demos/continuous_batching/README.md
+++ b/demos/continuous_batching/README.md
@@ -43,7 +43,7 @@ It can be applied using the commands below:
 ```bat
 set MOE_USE_MICRO_GEMM_PREFILL=0
-ovms.exe --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
+ovms.exe --model_repository_path c:\models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
 ```

From 5a0bd6131062ac282a0aba58f5b6365873d8d50f Mon Sep 17 00:00:00 2001
From: Dariusz Trawinski
Date: Mon, 9 Mar 2026 23:18:59 +0100
Subject: [PATCH 6/9] simplify vlm and fixes

---
 demos/continuous_batching/README.md | 24 +--
 demos/continuous_batching/vlm/README.md | 191 +++---------------------
 demos/embeddings/README.md | 2 +-
 demos/rerank/README.md | 4 +-
 demos/vlm_npu/README.md | 6 +-
 5 files changed, 43 insertions(+), 184 deletions(-)

diff --git a/demos/continuous_batching/README.md 
b/demos/continuous_batching/README.md
index 7dbacb1a47..b01869ae40 100644
--- a/demos/continuous_batching/README.md
+++ b/demos/continuous_batching/README.md
@@ -30,17 +30,20 @@ That makes it easy to use and efficient especially on on Intel® Xeon® processo
 
 ## Server Deployment
 
-**CPU Docker on Ubuntu24**
+**Container on Linux and CPU target device**
 
 Running this command starts the container with CPU only target device:
 ```bash
-docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
+docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
 ```
 
+> **Note:** In case you want to use a GPU target device, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
+to the `docker run` command. The parameter `--target_device` should also be updated to `GPU`.
+
+
+**Binary package on Windows 11 with GPU target device**
+
+After OVMS is installed according to the steps in the [baremetal deployment guide](../../docs/deploying_server_baremetal.md), run the following command:
 
-In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
-to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration.
-It can be applied using the commands below:
 ```bat
 set MOE_USE_MICRO_GEMM_PREFILL=0
 ovms.exe --model_repository_path c:\models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
 ```
@@ -69,9 +72,9 @@ curl http://localhost:8000/v3/models
 
 ## Request Generation
 
-A single servable exposes both `chat/completions` and `completions` endpoints with and without stream capabilities.
+The model exposes both `chat/completions` and `completions` endpoints with and without stream capabilities.
 Chat endpoint is expected to be used for scenarios where conversation context should be passed by the client and the model prompt is created by the server based on the jinja model template.
-Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template.
+Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template. This demo uses the model `Qwen/Qwen3-30B-A3B-Instruct-2507` in int4 precision.
It has chat capability so `chat/completions` endpoint will be employed: ### Unary calls to chat/completions endpoint using cURL @@ -110,7 +113,7 @@ Windows Powershell Windows Command Prompt ```bat -curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}" +curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen3-30B-A3B-Instruct-2507-int4-ov\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}" ``` ::: @@ -289,6 +292,9 @@ Check the demo [structured output](./structured_output/README.md) Check the [guide of using lm-evaluation-harness](./accuracy/README.md) ## References +- [Export models to OpenVINO format](../common/export_models/README.md) +- [Supported LLM models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#large-language-models-llms) +- [Official OpenVINO LLM models in HuggingFace](https://huggingface.co/collections/OpenVINO/llm) - [Chat Completions API](../../docs/model_server_rest_api_chat.md) - [Completions API](../../docs/model_server_rest_api_completions.md) - [Writing client code](../../docs/clients_genai.md) diff --git a/demos/continuous_batching/vlm/README.md b/demos/continuous_batching/vlm/README.md index b4d9176dde..0de93e27b8 100644 --- a/demos/continuous_batching/vlm/README.md +++ b/demos/continuous_batching/vlm/README.md @@ -11,13 +11,11 @@ ovms_demos_vlm_npu This demo shows how to deploy Vision Language Models in the OpenVINO Model Server. 
Text generation use case is exposed via OpenAI API `chat/completions` endpoint. -> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu22/24, RedHat8/9 and Windows11. +> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Core Ultra Series on Ubuntu24, RedHat9 and Windows11. ## Prerequisites -**OVMS version 2025.1** This demo require version 2025.1 or newer. - -**Model preparation**: Python 3.9 or higher with pip and HuggingFace account +**Model preparation**: Python 3.10 or higher with pip and HuggingFace account **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../../docs/deploying_server_baremetal.md) @@ -38,7 +36,7 @@ Select deployment option depending on how you prepared models in the previous st Running this command starts the container with CPU only target device: ```bash mkdir -p models -docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path /models --model_name OpenGVLab/InternVL2-2B --task text_generation --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com +docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path /models --task text_generation --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com ``` **GPU** @@ -47,7 +45,7 @@ to `docker run` command, use the image with GPU support. 
It can be applied using the commands below: ```bash mkdir -p models -docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --model_name OpenGVLab/InternVL2-2B --task text_generation --target_device GPU --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com +docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --task text_generation --target_device GPU --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com ``` ::: @@ -57,146 +55,31 @@ If you run on GPU make sure to have appropriate drivers installed, so the device ```bat mkdir models -ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --model_name OpenGVLab/InternVL2-2B --task text_generation --pipeline_type VLM --target_device CPU --allowed_media_domains raw.githubusercontent.com +ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --task text_generation --pipeline_type VLM --target_device CPU --allowed_media_domains raw.githubusercontent.com ``` or ```bat -ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --model_name OpenGVLab/InternVL2-2B --task text_generation --pipeline_type VLM --target_device GPU --allowed_media_domains raw.githubusercontent.com -``` -::: - - - -## Model preparation -Use this step for models outside of OpenVINO organization. 
- -Specific OVMS pull mode example for models outside of OpenVINO organization is described in section `## Pulling models outside of OpenVINO organization` in the [Ovms pull mode](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_hf_models.md) - -Or you can use the python export_model.py script described below. - -Here, the original VLM model and its auxiliary models (tokenizer, vision encoder, embeddings model etc.) will be converted to IR format and optionally quantized. -That ensures faster initialization time, better performance and lower memory consumption. -Execution parameters will be defined inside the `graph.pbtxt` file. - -Download export script, install it's dependencies and create directory for the models: -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -mkdir models -``` - -Run `export_model.py` script to download and quantize the model: - -> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub. 
- -**CPU** -```console -python export_model.py text_generation --source_model OpenGVLab/InternVL2-2B --weight-format int4 --pipeline_type VLM --model_name OpenGVLab/InternVL2-2B --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - -**GPU** -```console -python export_model.py text_generation --source_model OpenGVLab/InternVL2-2B --weight-format int4 --pipeline_type VLM --model_name OpenGVLab/InternVL2-2B --config_file_path models/config.json --model_repository_path models --overwrite_models --target_device GPU -``` - -> **Note:** You can change the model used in the demo out of any topology [tested](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms) with OpenVINO. -Be aware that QwenVL models executed on GPU might experience execution errors with very high resolution images. In case of such behavior, it is recommended to reduce the parameter `max_pixels` in `preprocessor_config.json`. - - -You should have a model folder like below: -``` -models/ -├── config.json -└── OpenGVLab - └── InternVL2 - ├── added_tokens.json - ├── config.json - ├── configuration_internlm2.py - ├── configuration_intern_vit.py - ├── configuration_internvl_chat.py - ├── generation_config.json - ├── graph.pbtxt - ├── openvino_config.json - ├── openvino_detokenizer.bin - ├── openvino_detokenizer.xml - ├── openvino_language_model.bin - ├── openvino_language_model.xml - ├── openvino_text_embeddings_model.bin - ├── openvino_text_embeddings_model.xml - ├── openvino_tokenizer.bin - ├── openvino_tokenizer.xml - ├── openvino_vision_embeddings_model.bin - ├── openvino_vision_embeddings_model.xml - ├── preprocessor_config.json - ├── special_tokens_map.json - ├── tokenization_internlm2.py - ├── tokenizer_config.json - └── tokenizer.model -``` - -The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. 
Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../../docs/llm/reference.md) to learn more about configuration options. - -## Server Deployment with locally stored models - -:::{dropdown} **Deploying with Docker** - -Select deployment option depending on how you prepared models in the previous step. - -**CPU** - -Running this command starts the container with CPU only target device: -```bash -docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models:ro openvino/model_server:latest --rest_port 8000 --model_name OpenGVLab/InternVL2-2B --model_path /models/OpenGVLab/InternVL2-2B --allowed_media_domains raw.githubusercontent.com -``` -**GPU** - -In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` -to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration. -It can be applied using the commands below: -```bash -docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:ro openvino/model_server:latest-gpu --rest_port 8000 --model_name OpenGVLab/InternVL2-2B --model_path /models/OpenGVLab/InternVL2-2B --allowed_media_domains raw.githubusercontent.com -``` -::: - -:::{dropdown} **Deploying on Bare Metal** - -Assuming you have unpacked model server package, make sure to: - -- **On Windows**: run `setupvars` script -- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables - -as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. - -Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). 
If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.
-
 ```bat
-ovms --rest_port 8000 --model_name OpenGVLab/InternVL2-2B --model_path /models/OpenGVLab/InternVL2-2B
+ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --task text_generation --pipeline_type VLM --target_device GPU --allowed_media_domains raw.githubusercontent.com
 ```
 :::
 
-> **Note:** VLM models can be enabled also with continuous batching pipeline. It that case the export_model.py or the ovms deployment from HuggingFace source model should have parameter `--pipeline_type VLM_CB`. It has, however, a defect related to accuracy for models Qwen2, Qwen2.5 and Phi3.5.
-The pipeline with continuous batching will give better throughput especially if there are many requests with text only in the requests.
-
 ## Readiness Check
 Wait for the model to load. You can check the status with a simple command:
 ```console
-curl http://localhost:8000/v1/config
+curl http://localhost:8000/v3/models
 ```
 ```json
 {
-  "OpenGVLab/InternVL2-2B": {
-    "model_version_status": [
-      {
-        "version": "1",
-        "state": "AVAILABLE",
-        "status": {
-          "error_code": "OK",
-          "error_message": "OK"
-        }
-      }
-    ]
+  "object": "list",
+  "data": [
+    {
+      "id": "OpenVINO/InternVL2-2B-int4-ov",
+      "object": "model",
+      "created": 1772928358,
+      "owned_by": "OVMS"
    }
+  ]
 }
```

@@ -209,7 +92,7 @@ Let's send a request with text and an image in the messages context.
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md) ```bash -curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is one the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}" +curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenVINO/InternVL2-2B-int4-ov\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is one the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}" ``` ```json { @@ -225,7 +108,7 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js } ], "created": 1741731554, - "model": "OpenGVLab/InternVL2-2B", + "model": "OpenVINO/InternVL2-2B-int4-ov", "object": "chat.completion", "usage": { "prompt_tokens": 19, @@ -246,7 +129,7 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/m import requests import base64 base_url='http://127.0.0.1:8000/v3' -model_name = "OpenGVLab/InternVL2-2B" +model_name = "OpenVINO/InternVL2-2B-int4-ov" def convert_image(Image): with open(Image,'rb' ) as file: @@ -254,7 +137,7 @@ def convert_image(Image): return base64_image import requests -payload = {"model": "OpenGVLab/InternVL2-2B", +payload = {"model": "OpenVINO/InternVL2-2B-int4-ov", "messages": [ { "role": "user", @@ -284,7 +167,7 @@ print(response.text) } ], "created": 1741731554, - "model": "OpenGVLab/InternVL2-2B", + 
"model": "OpenVINO/InternVL2-2B-int4-ov", "object": "chat.completion", "usage": { "prompt_tokens": 19, @@ -306,7 +189,7 @@ pip3 install openai from openai import OpenAI import base64 base_url='http://localhost:8080/v3' -model_name = "OpenGVLab/InternVL2-2B" +model_name = "OpenVINO/InternVL2-2B-int4-ov" client = OpenAI(api_key='unused', base_url=base_url) @@ -340,36 +223,6 @@ The picture features a zebra standing in a grassy area. The zebra is characteriz ::: -## Benchmarking text generation with high concurrency - -OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients. -It can be demonstrated using benchmarking app from vLLM repository: -```console -git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm -cd vllm -pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu -cd benchmarks -python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenGVLab/InternVL2-2B --endpoint /v3/chat/completions --max-concurrency 1 --num-prompts 100 --trust-remote-code - -Burstiness factor: 1.0 (Poisson process) -Maximum request concurrency: None -============ Serving Benchmark Result ============ -Successful requests: 100 -Benchmark duration (s): 287.81 -Total input tokens: 15381 -Total generated tokens: 20109 -Request throughput (req/s): 0.35 -Output token throughput (tok/s): 69.87 -Total Token throughput (tok/s): 123.31 ----------------Time to First Token---------------- -Mean TTFT (ms): 1513.96 -Median TTFT (ms): 1368.93 -P99 TTFT (ms): 2647.45 ------Time per Output Token (excl. 
1st token)------
-Mean TPOT (ms):                          6.68
-Median TPOT (ms):                        6.68
-P99 TPOT (ms):                           8.02
-```
 
 ## Testing the model accuracy over serving API
 
@@ -381,6 +234,8 @@ Check [VLM usage with NPU acceleration](../../vlm_npu/README.md)
 
 ## References
+- [Export models to OpenVINO format](../../common/export_models/README.md)
+- [Supported VLM models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms)
 - [Chat Completions API](../../../docs/model_server_rest_api_chat.md)
 - [Writing client code](../../../docs/clients_genai.md)
 - [LLM calculator reference](../../../docs/llm/reference.md)
diff --git a/demos/embeddings/README.md b/demos/embeddings/README.md
index bcb81813ed..bcb30d8dab 100644
--- a/demos/embeddings/README.md
+++ b/demos/embeddings/README.md
@@ -1,4 +1,4 @@
-# How to serve Embeddings models via OpenAI API {#ovms_demos_embeddings}
+# Text Embeddings models via OpenAI API {#ovms_demos_embeddings}
 
 This demo shows how to deploy embeddings models in the OpenVINO Model Server for text feature extraction.
 Text embeddings use case is exposed via OpenAI API `embeddings` endpoint.
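The `embeddings` endpoint mentioned above follows the OpenAI API shape. A minimal client-side sketch of building a request body for it; the model name below is a placeholder for whatever `--model_name` the server was started with, and the `/v3/embeddings` path is an assumption based on the `/v3` endpoints used elsewhere in these demos:

```python
import json

def build_embeddings_request(texts, model="your-embeddings-model"):
    """Build the JSON body for an OpenAI-compatible embeddings endpoint.

    The model name must match the server's --model_name;
    "your-embeddings-model" here is a placeholder.
    """
    return json.dumps({"model": model, "input": texts})

# Body to POST to e.g. http://localhost:8000/v3/embeddings (path assumed)
body = build_embeddings_request(["OpenVINO Model Server", "text feature extraction"])
```

The `input` field accepts either a single string or a list of strings, mirroring the OpenAI embeddings request schema.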
diff --git a/demos/rerank/README.md b/demos/rerank/README.md
index 924f87a29e..32379ddcaf 100644
--- a/demos/rerank/README.md
+++ b/demos/rerank/README.md
@@ -1,8 +1,8 @@
-# How to serve Rerank models via Cohere API {#ovms_demos_rerank}
+# Document Reranking via Cohere API {#ovms_demos_rerank}
 
 ## Prerequisites
 
-**Model preparation**: Python 3.9 or higher with pip
+**Model preparation**: Python 3.10 or higher with pip
 
 **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
 
diff --git a/demos/vlm_npu/README.md b/demos/vlm_npu/README.md
index 96736c80dc..200fe36ece 100644
--- a/demos/vlm_npu/README.md
+++ b/demos/vlm_npu/README.md
@@ -1,4 +1,4 @@
-# Serving for Text generation with Visual Language Models with NPU acceleration {#ovms_demos_vlm_npu}
+# NPU for Visual Language Models {#ovms_demos_vlm_npu}
 
 This demo shows how to deploy VLM models in the OpenVINO Model Server with NPU acceleration.
 
@@ -11,9 +11,7 @@ It is targeted on client machines equipped with NPU accelerator.
## Prerequisites -**OVMS 2025.1 or higher** - -**Model preparation**: Python 3.9 or higher with pip and HuggingFace account +**Model preparation**: Python 3.10 or higher with pip and HuggingFace account **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md) From c9e82b7fb0a428ea81329a85515ca4dfb5620ecc Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Tue, 10 Mar 2026 00:01:28 +0100 Subject: [PATCH 7/9] fix --- demos/continuous_batching/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md index b01869ae40..f6a934e716 100644 --- a/demos/continuous_batching/README.md +++ b/demos/continuous_batching/README.md @@ -34,6 +34,7 @@ That makes it easy to use and efficient especially on on Intel® Xeon® processo Running this command starts the container with CPU only target device: ```bash +mkdir -p models docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov ``` > **Note:** In case you want to use GPU target device, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` @@ -86,7 +87,7 @@ curl http://localhost:8000/v3/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", - "max_completion_tokens":300, + "max_completion_tokens":500, "stream":false, "messages": [ { @@ -108,12 +109,12 @@ Windows Powershell (Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" ` -Method POST ` -Headers @{ "Content-Type" = "application/json" } ` - 
-Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content + -Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_completion_tokens": 500, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content ``` Windows Command Prompt ```bat -curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen3-30B-A3B-Instruct-2507-int4-ov\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}" +curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen3-30B-A3B-Instruct-2507-int4-ov\", \"max_completion_tokens\": 500, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}" ``` ::: @@ -170,6 +171,7 @@ client = OpenAI( stream = client.chat.completions.create( model="Qwen3-30B-A3B-Instruct-2507-int4-ov", messages=[{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}], + temperature=0, stream=True, ) for chunk in stream: @@ -224,6 +226,7 @@ client = OpenAI( response = client.chat.completions.create( model="Qwen3-30B-A3B-Instruct-2507-int4-ov", messages=[{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}], + temperature=0, stream=False, ) print(response.choices[0].message.content) From 4982b5482a7144e5ffc939b91fff4ce374ca763b Mon Sep 17 00:00:00 2001 From: Dariusz Trawinski Date: Thu, 12 Mar 2026 14:50:37 +0100 Subject: [PATCH 
8/9] change linux folder

---
 demos/continuous_batching/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md
index f6a934e716..79576dc503 100644
--- a/demos/continuous_batching/README.md
+++ b/demos/continuous_batching/README.md
@@ -34,8 +34,8 @@
 Running this command starts the container with CPU only target device:
 ```bash
-mkdir -p models
-docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v $(pwd)/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
+mkdir -p ${HOME}/models
+docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v ${HOME}/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov
 ```
 > **Note:** If you want to use a GPU target device, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)`
 to the `docker run` command and use the image with GPU support. The `--target_device` parameter should also be updated to `GPU`.
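Because the first `docker run` above also downloads the model into the mounted `models` directory, the server can take a while before it responds. A small client-side sketch (not part of the patch; the base URL and model name follow the commands above, adjust them to your deployment) that polls the model list until the served model appears:

```python
import json
import time
import urllib.request

def wait_until_served(base_url="http://localhost:8000",
                      model_id="Qwen3-30B-A3B-Instruct-2507-int4-ov",
                      timeout_s=600, interval_s=5):
    """Poll the OpenAI-compatible /v3/models listing until model_id appears.

    Returns True once the model is listed, False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v3/models") as resp:
                listing = json.load(resp)
            if any(m.get("id") == model_id for m in listing.get("data", [])):
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(interval_s)
    return False
```

With `timeout_s=0` the helper returns immediately without any network call, which is handy as a one-shot check in scripts.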
From 554af38b00ae683f65b1e854779b4b6e4ecc4df3 Mon Sep 17 00:00:00 2001
From: "Trawinski, Dariusz"
Date: Thu, 12 Mar 2026 15:37:29 +0100
Subject: [PATCH 9/9] Apply suggestions from code review

Co-authored-by: ngrozae <104074686+ngrozae@users.noreply.github.com>
---
 demos/integration_with_OpenWebUI/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/demos/integration_with_OpenWebUI/README.md b/demos/integration_with_OpenWebUI/README.md
index 674e5394b9..38420b27de 100644
--- a/demos/integration_with_OpenWebUI/README.md
+++ b/demos/integration_with_OpenWebUI/README.md
@@ -70,7 +70,7 @@ Go to [http://localhost:8080](http://localhost:8080) and create admin account to
 ![get started with Open WebUI](./get_started_with_Open_WebUI.png)
 
-> **Important Note**: While using NPU device for acceleration and model gpt-oss-20b with GPU, it is recommended to disable `Follow-Up Auto-Generation` in `Settings > Interface` menu. It will improve response time and avoid queuing requests. For gpt-oss model it will avoid concurrent execution which in version 2026.0 has an accuracy issue.
+> **Important Note**: While using NPU device for acceleration or model gpt-oss-20b with GPU, it is recommended to disable `Follow-Up Auto-Generation` in the `Settings > Interface` menu. It will improve response time and avoid queuing requests. For the gpt-oss model it also avoids concurrent execution, which has an accuracy issue in version 2026.0.
 
### References
[https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html](https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html#model-preparation)
[https://docs.openwebui.com](https://docs.openwebui.com/#installation-with-pip)
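The note above avoids concurrent generation by disabling a UI feature; the same precaution can be applied programmatically on the client side. A minimal sketch (not part of the demos, purely an illustration) that serializes generation calls within one process so auxiliary requests never overlap the main one:

```python
import threading

_generation_lock = threading.Lock()

def serialized(call, *args, **kwargs):
    """Run a generation call while holding a process-wide lock, so side
    requests (e.g. follow-up suggestions) never run concurrently with the
    main request from this process."""
    with _generation_lock:
        return call(*args, **kwargs)
```

For example, `serialized(client.chat.completions.create, model=..., messages=...)` with the OpenAI client shown earlier in these demos would guarantee one in-flight request at a time.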