diff --git a/demos/README.md b/demos/README.md index 6c8de87f7b..c8a808fa1c 100644 --- a/demos/README.md +++ b/demos/README.md @@ -5,16 +5,13 @@ maxdepth: 1 hidden: --- -ovms_demos_continuous_batching_agent +ovms_demos_continuous_batching ovms_demos_integration_with_open_webui +ovms_demos_code_completion_vsc +ovms_demos_audio ovms_demos_rerank ovms_demos_embeddings -ovms_demos_continuous_batching -ovms_demo_long_context ovms_demos_continuous_batching_vlm -ovms_demos_llm_npu -ovms_demos_vlm_npu -ovms_demos_code_completion_vsc ovms_demos_image_generation ovms_demo_clip_image_classification ovms_demo_age_gender_guide @@ -40,10 +37,8 @@ ovms_demo_real_time_stream_analysis ovms_demo_using_paddlepaddle_model ovms_demo_bert ovms_demo_universal-sentence-encoder -ovms_demo_benchmark_client ovms_string_output_model_demo ovms_demos_gguf -ovms_demos_audio ``` diff --git a/demos/continuous_batching/README.md b/demos/continuous_batching/README.md index b6a2196205..79576dc503 100644 --- a/demos/continuous_batching/README.md +++ b/demos/continuous_batching/README.md @@ -1,151 +1,81 @@ -# How to serve LLM models with Continuous Batching via OpenAI API {#ovms_demos_continuous_batching} +# LLM models via OpenAI API {#ovms_demos_continuous_batching} ```{toctree} --- maxdepth: 1 hidden: --- -ovms_demos_continuous_batching_accuracy +ovms_demos_continuous_batching_agent ovms_demos_continuous_batching_rag ovms_demos_continuous_batching_scaling ovms_demos_continuous_batching_speculative_decoding +ovms_structured_output +ovms_demo_long_context +ovms_demos_llm_npu +ovms_demos_continuous_batching_accuracy ``` This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms. The text generation use case is exposed via the OpenAI API `chat/completions` and `completions` endpoints. That makes it easy to use and efficient, especially on Intel® Xeon® processors and Arc™ GPUs. 
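As a quick orientation, the two endpoints differ only in how the prompt is supplied: the chat endpoint sends a message list and lets the server apply the model's chat template, while the completions endpoint sends a ready prompt string. A minimal sketch of the corresponding request bodies (the model name matches the deployment used later in this demo; the question text is only an illustration):

```python
import json

# Model name from the deployment command shown later in this demo.
MODEL = "Qwen3-30B-A3B-Instruct-2507-int4-ov"

# chat/completions: the server renders the final prompt from the model's chat template.
chat_body = {
    "model": MODEL,
    "max_completion_tokens": 100,
    "messages": [{"role": "user", "content": "What is continuous batching?"}],
}

# completions: the client supplies the fully formatted prompt itself.
completion_body = {
    "model": MODEL,
    "max_tokens": 100,
    "prompt": "What is continuous batching?",
}

# POST these to /v3/chat/completions and /v3/completions respectively.
print(json.dumps(chat_body))
```

Full working cURL and OpenAI-client examples follow in the sections below.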
-> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu22/24, RedHat8/9 and Windows11. +> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, and Intel® Core Ultra Series on Ubuntu24 and Windows11. ## Prerequisites -**Model preparation**: Python 3.9 or higher with pip and HuggingFace account - **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md) -**(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app - +**(Optional) Client**: Git and Python for using OpenAI client package and vLLM benchmark app -## Model preparation -Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized. -That ensures faster initialization time, better performance and lower memory consumption. -LLM engine parameters will be defined inside the `graph.pbtxt` file. - -Download export script, install it's dependencies and create directory for the models: -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -mkdir models -``` - -Run `export_model.py` script to download and quantize the model: - -> **Note:** Before downloading the model, access must be requested. Follow the instructions on the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) to request access. When access is granted, create an authentication token in the HuggingFace account -> Settings -> Access Tokens page. Issue the following command and enter the authentication token. Authenticate via `huggingface-cli login`. 
-> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub. - -**CPU** -```console -python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format fp16 --kv_cache_precision u8 --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - -**GPU** -```console -python export_model.py text_generation --source_model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int4 --target_device GPU --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - -> **Note:** Change the `--weight-format` to quantize the model to `int8` or `int4` precision to reduce memory consumption and improve performance. - -> **Note:** You can change the model used in the demo out of any topology [tested](https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/python_tests/models/real_models) with OpenVINO. - -You should have a model folder like below: -``` -tree models -models -├── config.json -└── meta-llama - └── Meta-Llama-3-8B-Instruct - ├── config.json - ├── generation_config.json - ├── graph.pbtxt - ├── openvino_detokenizer.bin - ├── openvino_detokenizer.xml - ├── openvino_model.bin - ├── openvino_model.xml - ├── openvino_tokenizer.bin - ├── openvino_tokenizer.xml - ├── special_tokens_map.json - ├── tokenizer_config.json - └── tokenizer.json -``` - -The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options. ## Server Deployment -:::{dropdown} **Deploying with Docker** - -Select deployment option depending on how you prepared models in the previous step. 
- -**CPU** +**Container on Linux and CPU target device** Running this command starts the container with CPU only target device: ```bash -docker run -d --rm -p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest --rest_port 8000 --config_path /workspace/config.json -``` -**GPU** - -In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` -to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration. -It can be applied using the commands below: -```bash -docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json +mkdir -p ${HOME}/models +docker run -it -p 8000:8000 --rm -e MOE_USE_MICRO_GEMM_PREFILL=0 --user $(id -u):$(id -g) -v ${HOME}/models:/models/:rw openvino/model_server:weekly --model_repository_path /models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device CPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov ``` -::: - -:::{dropdown} **Deploying on Bare Metal** - -Assuming you have unpacked model server package, make sure to: +> **Note:** To use the GPU target device, add the extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` +to the `docker run` command. The parameter `--target_device` should also be updated to `GPU`. -- **On Windows**: run `setupvars` script -- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables -as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. 
+**Binary package on Windows 11 with GPU target device** -Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. +After OVMS is installed according to the steps in the [baremetal deployment guide](../../docs/deploying_server_baremetal.md), run the following command: ```bat -ovms --rest_port 8000 --config_path ./models/config.json +set MOE_USE_MICRO_GEMM_PREFILL=0 +ovms.exe --model_repository_path c:\models --source_model OpenVINO/Qwen3-30B-A3B-Instruct-2507-int4-ov --task text_generation --target_device GPU --tool_parser hermes3 --rest_port 8000 --model_name Qwen3-30B-A3B-Instruct-2507-int4-ov ``` -::: + ## Readiness Check Wait for the model to load. You can check the status with a simple command: ```console -curl http://localhost:8000/v1/config +curl http://localhost:8000/v3/models ``` ```json { - "meta-llama/Meta-Llama-3-8B-Instruct": { - "model_version_status": [ - { - "version": "1", - "state": "AVAILABLE", - "status": { - "error_code": "OK", - "error_message": "OK" - } - } - ] + "object": "list", + "data": [ + { + "id": "Qwen3-30B-A3B-Instruct-2507-int4-ov", + "object": "model", + "created": 1772928358, + "owned_by": "OVMS" } + ] } ``` ## Request Generation -A single servable exposes both `chat/completions` and `completions` endpoints with and without stream capabilities. +The model exposes both `chat/completions` and `completions` endpoints with and without stream capabilities. Chat endpoint is expected to be used for scenarios where conversation context should be passed by the client and the model prompt is created by the server based on the jinja model template. -Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template. 
+Completion endpoint should be used to pass the prompt directly by the client and for models without the jinja template. This demo uses the model `Qwen/Qwen3-30B-A3B-Instruct-2507` in int4 precision. Since it has chat capability, the `chat/completions` endpoint will be used: ### Unary calls to chat/completions endpoint using cURL @@ -156,8 +86,8 @@ Completion endpoint should be used to pass the prompt directly by the client and curl http://localhost:8000/v3/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "max_tokens":30, + "model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", + "max_completion_tokens":500, "stream":false, "messages": [ { @@ -166,7 +96,7 @@ curl http://localhost:8000/v3/chat/completions \ }, { "role": "user", - "content": "What is OpenVINO?" + "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?" } ] }'| jq . @@ -179,12 +109,12 @@ Windows Powershell (Invoke-WebRequest -Uri "http://localhost:8000/v3/chat/completions" ` -Method POST ` -Headers @{ "Content-Type" = "application/json" } ` - -Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}').Content + -Body '{"model": "Qwen3-30B-A3B-Instruct-2507-int4-ov", "max_completion_tokens": 500, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}]}').Content ``` Windows Command Prompt ```bat -curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What are the 3 main 
tourist attractions in Paris?\"}]}" +curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen3-30B-A3B-Instruct-2507-int4-ov\", \"max_completion_tokens\": 500, \"temperature\": 0, \"stream\": false, \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"If 1=3 2=3 3=5 4=4 5=4 Then, 6=?\"}]}" ``` ::: @@ -195,84 +125,30 @@ curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/ { "choices": [ { - "finish_reason": "length", + "finish_reason": "stop", "index": 0, "logprobs": null, "message": { - "content": "OpenVINO is an open-source software framework developed by Intel for optimizing and deploying computer vision, machine learning, and deep learning models on various devices,", - "role": "assistant" + "content": "We are given a pattern:\n\n- 1 = 3 \n- 2 = 3 \n- 3 = 5 \n- 4 = 4 \n- 5 = 4 \n- 6 = ?\n\nWe need to find what **6** equals based on this pattern.\n\nLet’s analyze the pattern.\n\nAt first glance, it's not a mathematical operation like addition or multiplication. Let's look at the **number of letters** in the **English word** for each number.\n\nTry that:\n\n- 1 → \"one\" → 3 letters → matches 1 = 3 ✅ \n- 2 → \"two\" → 3 letters → matches 2 = 3 ✅ \n- 3 → \"three\" → 5 letters → matches 3 = 5 ✅ \n- 4 → \"four\" → 4 letters → matches 4 = 4 ✅ \n- 5 → \"five\" → 4 letters → matches 5 = 4 ✅ \n- 6 → \"six\" → 3 letters → So, 6 = 3?\n\nWait — let’s double-check:\n\n- \"six\" has 3 letters → so 6 = 3?\n\nBut let's confirm the pattern again.\n\nYes! 
The pattern is: \n**The number on the left equals the number of letters in the English word for that number.**\n\nSo:\n\n| Number | Word | Letters |\n|--------|----------|---------|\n| 1 | one | 3 |\n| 2 | two | 3 |\n| 3 | three | 5 |\n| 4 | four | 4 |\n| 5 | five | 4 |\n| 6 | six | 3 |\n\nSo, **6 = 3**\n\n### ✅ Final Answer: **3**", + "role": "assistant", + "tool_calls": [] } } ], - "created": 1724405301, - "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "created": 1772929186, + "model": "ovms-model", "object": "chat.completion", "usage": { - "prompt_tokens": 27, - "completion_tokens": 30, - "total_tokens": 57 + "prompt_tokens": 45, + "completion_tokens": 394, + "total_tokens": 439 } } ``` ::: -### Unary calls to completions endpoint using cURL -A similar call can be made with a `completion` endpoint: -::::{tab-set} - -:::{tab-item} Linux -```bash -curl http://localhost:8000/v3/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "max_tokens":30, - "stream":false, - "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>" - }'| jq . 
-``` -::: - -:::{tab-item} Windows -Windows Powershell -```powershell -(Invoke-WebRequest -Uri "http://localhost:8000/v3/completions" ` - -Method POST ` - -Headers @{ "Content-Type" = "application/json" } ` - -Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_tokens": 30, "temperature": 0, "stream": false, "prompt":"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is OpenVINO?<|eot_id|><|start_header_id|>assistant<|end_header_id|>"}').Content -``` - -Windows Command Prompt -```bat -curl -s http://localhost:8000/v3/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_tokens\": 30, \"temperature\": 0, \"stream\": false, \"prompt\":\"^<^|begin_of_text^|^>^<^|start_header_id^|^>system^<^|end_header_id^|^>\n\nYou are assistant^<^|eot_id^|^>^<^|start_header_id^|^>user^<^|end_header_id^|^>\n\nWhat is OpenVINO?^<^|eot_id^|^>^<^|start_header_id^|^>assistant^<^|end_header_id^|^>\"}" -``` -::: -:::: -:::{dropdown} Expected Response -```json -{ - "choices": [ - { - "finish_reason": "length", - "index": 0, - "logprobs": null, - "text": "\n\nOpenVINO is an open-source computer vision platform developed by Intel for deploying and optimizing computer vision, machine learning, and autonomous driving applications. 
It" - } - ], - "created": 1724405354, - "model": "meta-llama/Meta-Llama-3-8B-Instruct", - "object": "text_completion", - "usage": { - "prompt_tokens": 23, - "completion_tokens": 30, - "total_tokens": 53 - } -} -``` -::: - -### Streaming call with OpenAI Python package +### OpenAI Python package The endpoints `chat/completions` and `completions` are compatible with OpenAI client so it can be easily used to generate code also in streaming mode: @@ -283,7 +159,7 @@ pip3 install openai ::::{tab-set} -:::{tab-item} Chat completions +:::{tab-item} Chat completions with streaming ```python from openai import OpenAI @@ -293,8 +169,9 @@ client = OpenAI( ) stream = client.chat.completions.create( - model="meta-llama/Meta-Llama-3-8B-Instruct", - messages=[{"role": "user", "content": "Say this is a test"}], + model="Qwen3-30B-A3B-Instruct-2507-int4-ov", + messages=[{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}], + temperature=0, stream=True, ) for chunk in stream: @@ -304,12 +181,36 @@ for chunk in stream: Output: ``` -It looks like you're testing me! +We are given a pattern: + +- 1 = 3 +- 2 = 3 +- 3 = 5 +- 4 = 4 +- 5 = 4 +- 6 = ? + +We need to find the value for 6. + +Let’s look at the pattern. The numbers on the left are integers, and the values on the right seem to represent something about the number itself. 
+ +Let’s consider: **the number of letters in the English word for the number.** + +Check: + +- **1** → "one" → 3 letters → matches 3 ✅ +- **2** → "two" → 3 letters → matches 3 ✅ +- **3** → "three" → 5 letters → matches 5 ✅ +- **4** → "four" → 4 letters → matches 4 ✅ +- **5** → "five" → 4 letters → matches 4 ✅ +- **6** → "six" → 3 letters → so **6 = 3** + +### ✅ Answer: **3** ``` ::: -:::{tab-item} Completions +:::{tab-item} Chat completions with unary response ```console pip3 install openai @@ -322,58 +223,50 @@ client = OpenAI( api_key="unused" ) -stream = client.completions.create( - model="meta-llama/Meta-Llama-3-8B-Instruct", - prompt="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nSay this is a test.<|eot_id|><|start_header_id|>assistant<|end_header_id|>", - stream=True, +response = client.chat.completions.create( + model="Qwen3-30B-A3B-Instruct-2507-int4-ov", + messages=[{"role": "user", "content": "If 1=3 2=3 3=5 4=4 5=4 Then, 6=?"}], + temperature=0, + stream=False, ) -for chunk in stream: - if chunk.choices[0].text is not None: - print(chunk.choices[0].text, end="", flush=True) +print(response.choices[0].message.content) ``` Output: ``` -It looks like you're testing me! +We are given a pattern: + +- 1 = 3 +- 2 = 3 +- 3 = 5 +- 4 = 4 +- 5 = 4 +- 6 = ? + +We need to find the value for 6. + +Let’s look at the pattern. The numbers on the left are integers, and the values on the right seem to represent something about the **number of letters** when the number is written out in English. + +Let’s check: + +- **1** = "one" → 3 letters → matches 3 ✅ +- **2** = "two" → 3 letters → matches 3 ✅ +- **3** = "three" → 5 letters → matches 5 ✅ +- **4** = "four" → 4 letters → matches 4 ✅ +- **5** = "five" → 4 letters → matches 4 ✅ +- **6** = "six" → 3 letters → so 6 = **3** + +### ✅ Answer: **3** + +So, **6 = 3**. 
``` ::: :::: -## Benchmarking text generation with high concurrency +## Check how to use AI agents with MCP servers and language models -OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients. -It can be demonstrated using benchmarking app from vLLM repository: -```console -git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm -cd vllm -pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu -cd benchmarks -curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset -python benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model meta-llama/Meta-Llama-3-8B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf - -Namespace(backend='openai-chat', base_url=None, host='localhost', port=8000, endpoint='/v3/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None) - -Traffic request rate: inf 
-100%|██████████████████████████████████████████████████| 1000/1000 [17:17<00:00, 1.04s/it] -============ Serving Benchmark Result ============ -Successful requests: 1000 -Benchmark duration (s): 447.62 -Total input tokens: 215201 -Total generated tokens: 198588 -Request throughput (req/s): 2.23 -Output token throughput (tok/s): 443.65 -Total Token throughput (tok/s): 924.41 ----------------Time to First Token---------------- -Mean TTFT (ms): 171999.94 -Median TTFT (ms): 170699.21 -P99 TTFT (ms): 360941.40 ------Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 211.31 -Median TPOT (ms): 223.79 -P99 TPOT (ms): 246.48 -================================================== -``` +Check the demo [AI agent with MCP server and OpenVINO acceleration](./agentic_ai/README.md) ## RAG with Model Server @@ -385,11 +278,6 @@ Check the example in the [RAG notebook](https://github.com/openvinotoolkit/model Check this simple [text generation scaling demo](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/scaling/README.md). 
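Since the deployment in this demo starts the server with `--tool_parser hermes3`, the chat endpoint also accepts OpenAI-style `tools` definitions, and parsed function calls are returned in the response `tool_calls` field (the agentic demo linked above covers the full flow). A minimal sketch of such a request body, using a hypothetical `get_weather` tool:

```python
import json

def build_tool_request(model: str, user_prompt: str) -> dict:
    """Build a chat/completions body with a hypothetical get_weather tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

body = build_tool_request("Qwen3-30B-A3B-Instruct-2507-int4-ov",
                          "What is the weather in Paris?")
# POST this body to http://localhost:8000/v3/chat/completions like the
# earlier cURL examples; the server's tool parser fills message.tool_calls.
print(json.dumps(body, indent=2))
```

This is only a request-shaping sketch; executing the tool and returning its result to the model in a follow-up `role: "tool"` message is up to the client, as shown in the agentic demo.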
- -## Testing the model accuracy over serving API - -Check the [guide of using lm-evaluation-harness](https://github.com/openvinotoolkit/model_server/blob/main/demos/continuous_batching/accuracy/README.md) - ## Use Speculative Decoding Check the [guide for speculative decoding](./speculative_decoding/README.md) @@ -398,15 +286,18 @@ Check the [guide for speculative decoding](./speculative_decoding/README.md) Check the demo [text generation with visual model](./vlm/README.md) -## Check how to use AI agents with MCP servers and language models - -Check the demo [AI agent with MCP server and OpenVINO acceleration](./agentic_ai/README.md) - ## Use structured output with json schema guided generation Check the demo [structured output](./structured_output/README.md) +## Testing the model accuracy over serving API + +Check the [guide on using lm-evaluation-harness](./accuracy/README.md) + ## References +- [Export models to OpenVINO format](../common/export_models/README.md) +- [Supported LLM models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#large-language-models-llms) +- [Official OpenVINO LLM models in HuggingFace](https://huggingface.co/collections/OpenVINO/llm) - [Chat Completions API](../../docs/model_server_rest_api_chat.md) - [Completions API](../../docs/model_server_rest_api_completions.md) - [Writing client code](../../docs/clients_genai.md) diff --git a/demos/continuous_batching/speculative_decoding/README.md b/demos/continuous_batching/speculative_decoding/README.md index 6b24efc169..9b3398677e 100644 --- a/demos/continuous_batching/speculative_decoding/README.md +++ b/demos/continuous_batching/speculative_decoding/README.md @@ -1,4 +1,4 @@ -# How to serve LLM Models in Speculative Decoding Pipeline{#ovms_demos_continuous_batching_speculative_decoding} +# LLM Models in Speculative Decoding Pipeline{#ovms_demos_continuous_batching_speculative_decoding} Following [OpenVINO GenAI 
docs](https://docs.openvino.ai/2026/openvino-workflow-generative/inference-with-genai.html#efficient-text-generation-via-speculative-decoding): > Speculative decoding (or assisted-generation) enables faster token generation when an additional smaller draft model is used alongside the main model. This reduces the number of infer requests to the main model, increasing performance. diff --git a/demos/continuous_batching/vlm/README.md b/demos/continuous_batching/vlm/README.md index b0664eb268..0de93e27b8 100644 --- a/demos/continuous_batching/vlm/README.md +++ b/demos/continuous_batching/vlm/README.md @@ -1,15 +1,21 @@ -# How to serve VLM models via OpenAI API {#ovms_demos_continuous_batching_vlm} +# VLM models via OpenAI API {#ovms_demos_continuous_batching_vlm} + +```{toctree} +--- +maxdepth: 1 +hidden: +--- +ovms_demos_vlm_npu +``` This demo shows how to deploy Vision Language Models in the OpenVINO Model Server. Text generation use case is exposed via OpenAI API `chat/completions` endpoint. -> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Data Center GPU Series on Ubuntu22/24, RedHat8/9 and Windows11. +> **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Core Ultra Series on Ubuntu24, RedHat9 and Windows11. ## Prerequisites -**OVMS version 2025.1** This demo require version 2025.1 or newer. 
- -**Model preparation**: Python 3.9 or higher with pip and HuggingFace account +**Model preparation**: Python 3.10 or higher with pip and HuggingFace account **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../../docs/deploying_server_baremetal.md) @@ -30,7 +36,7 @@ Select deployment option depending on how you prepared models in the previous st Running this command starts the container with CPU only target device: ```bash mkdir -p models -docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path /models --model_name OpenGVLab/InternVL2-2B --task text_generation --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com +docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path /models --task text_generation --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com ``` **GPU** @@ -39,7 +45,7 @@ to `docker run` command, use the image with GPU support. 
It can be applied using the commands below: ```bash mkdir -p models -docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --model_name OpenGVLab/InternVL2-2B --task text_generation --target_device GPU --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com +docker run -d -u $(id -u):$(id -g) --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --task text_generation --target_device GPU --pipeline_type VLM --allowed_media_domains raw.githubusercontent.com ``` ::: @@ -49,146 +55,31 @@ If you run on GPU make sure to have appropriate drivers installed, so the device ```bat mkdir models -ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --model_name OpenGVLab/InternVL2-2B --task text_generation --pipeline_type VLM --target_device CPU --allowed_media_domains raw.githubusercontent.com +ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --task text_generation --pipeline_type VLM --target_device CPU --allowed_media_domains raw.githubusercontent.com ``` or ```bat -ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --model_name OpenGVLab/InternVL2-2B --task text_generation --pipeline_type VLM --target_device GPU --allowed_media_domains raw.githubusercontent.com -``` -::: - - - -## Model preparation -Use this step for models outside of OpenVINO organization. 
- -Specific OVMS pull mode example for models outside of OpenVINO organization is described in section `## Pulling models outside of OpenVINO organization` in the [Ovms pull mode](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_hf_models.md) - -Or you can use the python export_model.py script described below. - -Here, the original VLM model and its auxiliary models (tokenizer, vision encoder, embeddings model etc.) will be converted to IR format and optionally quantized. -That ensures faster initialization time, better performance and lower memory consumption. -Execution parameters will be defined inside the `graph.pbtxt` file. - -Download export script, install it's dependencies and create directory for the models: -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -mkdir models -``` - -Run `export_model.py` script to download and quantize the model: - -> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub. 
- -**CPU** -```console -python export_model.py text_generation --source_model OpenGVLab/InternVL2-2B --weight-format int4 --pipeline_type VLM --model_name OpenGVLab/InternVL2-2B --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - -**GPU** -```console -python export_model.py text_generation --source_model OpenGVLab/InternVL2-2B --weight-format int4 --pipeline_type VLM --model_name OpenGVLab/InternVL2-2B --config_file_path models/config.json --model_repository_path models --overwrite_models --target_device GPU -``` - -> **Note:** You can change the model used in the demo out of any topology [tested](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms) with OpenVINO. -Be aware that QwenVL models executed on GPU might experience execution errors with very high resolution images. In case of such behavior, it is recommended to reduce the parameter `max_pixels` in `preprocessor_config.json`. - - -You should have a model folder like below: -``` -models/ -├── config.json -└── OpenGVLab - └── InternVL2 - ├── added_tokens.json - ├── config.json - ├── configuration_internlm2.py - ├── configuration_intern_vit.py - ├── configuration_internvl_chat.py - ├── generation_config.json - ├── graph.pbtxt - ├── openvino_config.json - ├── openvino_detokenizer.bin - ├── openvino_detokenizer.xml - ├── openvino_language_model.bin - ├── openvino_language_model.xml - ├── openvino_text_embeddings_model.bin - ├── openvino_text_embeddings_model.xml - ├── openvino_tokenizer.bin - ├── openvino_tokenizer.xml - ├── openvino_vision_embeddings_model.bin - ├── openvino_vision_embeddings_model.xml - ├── preprocessor_config.json - ├── special_tokens_map.json - ├── tokenization_internlm2.py - ├── tokenizer_config.json - └── tokenizer.model -``` - -The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. 
Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../../docs/llm/reference.md) to learn more about configuration options. - -## Server Deployment with locally stored models - -:::{dropdown} **Deploying with Docker** - -Select deployment option depending on how you prepared models in the previous step. - -**CPU** - -Running this command starts the container with CPU only target device: -```bash -docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models:ro openvino/model_server:latest --rest_port 8000 --model_name OpenGVLab/InternVL2-2B --model_path /models/OpenGVLab/InternVL2-2B --allowed_media_domains raw.githubusercontent.com -``` -**GPU** - -In case you want to use GPU device to run the generation, add extra docker parameters `--device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1)` -to `docker run` command, use the image with GPU support. Export the models with precision matching the GPU capacity and adjust pipeline configuration. -It can be applied using the commands below: -```bash -docker run -d --rm -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $(pwd)/models:/models:ro openvino/model_server:latest-gpu --rest_port 8000 --model_name OpenGVLab/InternVL2-2B --model_path /models/OpenGVLab/InternVL2-2B --allowed_media_domains raw.githubusercontent.com -``` -::: - -:::{dropdown} **Deploying on Bare Metal** - -Assuming you have unpacked model server package, make sure to: - -- **On Windows**: run `setupvars` script -- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables - -as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. - -Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `graph.pbtxt`). 
If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. - -```bat -ovms --rest_port 8000 --model_name OpenGVLab/InternVL2-2B --model_path /models/OpenGVLab/InternVL2-2B +ovms --rest_port 8000 --source_model OpenVINO/InternVL2-2B-int4-ov --model_repository_path models --task text_generation --pipeline_type VLM --target_device GPU --allowed_media_domains raw.githubusercontent.com ``` ::: -> **Note:** VLM models can be enabled also with continuous batching pipeline. It that case the export_model.py or the ovms deployment from HuggingFace source model should have parameter `--pipeline_type VLM_CB`. It has, however, a defect related to accuracy for models Qwen2, Qwen2.5 and Phi3.5. -The pipeline with continuous batching will give better throughput especially if there are many requests with text only in the requests. - ## Readiness Check Wait for the model to load. You can check the status with a simple command: ```console -curl http://localhost:8000/v1/config +curl http://localhost:8000/v3/models ``` ```json { - "OpenGVLab/InternVL2-2B": { - "model_version_status": [ - { - "version": "1", - "state": "AVAILABLE", - "status": { - "error_code": "OK", - "error_message": "OK" - } - } - ] + "object": "list", + "data": [ + { + "id": "OpenVINO/InternVL2-2B-int4-ov", + "object": "model", + "created": 1772928358, + "owned_by": "OVMS" } + ] } ``` @@ -201,7 +92,7 @@ Let's send a request with text an image in the messages context. 
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)

```bash
-curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is one the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}"
+curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenVINO/InternVL2-2B-int4-ov\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}"
```
```json
{
@@ -217,7 +108,7 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js
   }
  ],
 "created": 1741731554,
- "model": "OpenGVLab/InternVL2-2B",
+ "model": "OpenVINO/InternVL2-2B-int4-ov",
 "object": "chat.completion",
 "usage": {
 "prompt_tokens": 19,
@@ -238,7 +129,7 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/m
 import requests
 import base64
 base_url='http://127.0.0.1:8000/v3'
-model_name = "OpenGVLab/InternVL2-2B"
+model_name = "OpenVINO/InternVL2-2B-int4-ov"
 
 def convert_image(Image):
     with open(Image,'rb' ) as file:
@@ -246,7 +137,7 @@ def convert_image(Image):
     return base64_image
 
 import requests
-payload = {"model": "OpenGVLab/InternVL2-2B",
+payload = {"model": "OpenVINO/InternVL2-2B-int4-ov",
 "messages": [
 {
 "role": "user",
@@ -276,7 +167,7 @@ print(response.text)
 }
 ],
 "created": 1741731554,
- "model": "OpenGVLab/InternVL2-2B",
+ 
"model": "OpenVINO/InternVL2-2B-int4-ov", "object": "chat.completion", "usage": { "prompt_tokens": 19, @@ -298,7 +189,7 @@ pip3 install openai from openai import OpenAI import base64 base_url='http://localhost:8080/v3' -model_name = "OpenGVLab/InternVL2-2B" +model_name = "OpenVINO/InternVL2-2B-int4-ov" client = OpenAI(api_key='unused', base_url=base_url) @@ -332,43 +223,19 @@ The picture features a zebra standing in a grassy area. The zebra is characteriz ::: -## Benchmarking text generation with high concurrency - -OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text also in high concurrency in the environment shared by multiple clients. -It can be demonstrated using benchmarking app from vLLM repository: -```console -git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm -cd vllm -pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu -cd benchmarks -python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenGVLab/InternVL2-2B --endpoint /v3/chat/completions --max-concurrency 1 --num-prompts 100 --trust-remote-code - -Burstiness factor: 1.0 (Poisson process) -Maximum request concurrency: None -============ Serving Benchmark Result ============ -Successful requests: 100 -Benchmark duration (s): 287.81 -Total input tokens: 15381 -Total generated tokens: 20109 -Request throughput (req/s): 0.35 -Output token throughput (tok/s): 69.87 -Total Token throughput (tok/s): 123.31 ----------------Time to First Token---------------- -Mean TTFT (ms): 1513.96 -Median TTFT (ms): 1368.93 -P99 TTFT (ms): 2647.45 ------Time per Output Token (excl. 
1st token)------ -Mean TPOT (ms): 6.68 -Median TPOT (ms): 6.68 -P99 TPOT (ms): 8.02 -``` ## Testing the model accuracy over serving API Check the [guide of using lm-evaluation-harness](../accuracy/README.md) +## VLM models deployment with NPU acceleration + +Check [VLM usage with NPU acceleration](../../vlm_npu/README.md) + ## References +- [Export models to OpenVINO format](../common/export_models/README.md) +- [Supported VLM models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms) - [Chat Completions API](../../../docs/model_server_rest_api_chat.md) - [Writing client code](../../../docs/clients_genai.md) - [LLM calculator reference](../../../docs/llm/reference.md) diff --git a/demos/embeddings/README.md b/demos/embeddings/README.md index bcb81813ed..bcb30d8dab 100644 --- a/demos/embeddings/README.md +++ b/demos/embeddings/README.md @@ -1,4 +1,4 @@ -# How to serve Embeddings models via OpenAI API {#ovms_demos_embeddings} +# Text Embeddings models via OpenAI API {#ovms_demos_embeddings} This demo shows how to deploy embeddings models in the OpenVINO Model Server for text feature extractions. Text generation use case is exposed via OpenAI API `embeddings` endpoint. 
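The embeddings README above only names the `embeddings` endpoint; a common next step on the client side is similarity scoring between returned vectors. A minimal sketch, assuming plain Python lists as vectors (the sample vectors below are illustrative, not real model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors standing in for /v3/embeddings output.
doc_vec = [0.1, 0.3, 0.5]
query_vec = [0.2, 0.1, 0.4]
score = cosine_similarity(query_vec, doc_vec)
print(score)
```

Real embedding vectors returned by the server are much longer, but the scoring logic is the same.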
diff --git a/demos/integration_with_OpenWebUI/README.md b/demos/integration_with_OpenWebUI/README.md
index 615807976f..38420b27de 100644
--- a/demos/integration_with_OpenWebUI/README.md
+++ b/demos/integration_with_OpenWebUI/README.md
@@ -1,4 +1,4 @@
-# Demonstrating integration of Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}
+# Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}
 
 ## Description
 
@@ -70,7 +70,9 @@ Go to [http://localhost:8080](http://localhost:8080) and create admin account to
 
 ![get started with Open WebUI](./get_started_with_Open_WebUI.png)
 
-### Reference
+> **Important Note**: When using the NPU device for acceleration, or the gpt-oss-20b model on GPU, it is recommended to disable `Follow-Up Auto-Generation` in the `Settings > Interface` menu. This improves response time and avoids queuing requests. For the gpt-oss model it also avoids concurrent execution, which has an accuracy issue in version 2026.0.
+
+### References
 
 [https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html](https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html#model-preparation)
 
 [https://docs.openwebui.com](https://docs.openwebui.com/#installation-with-pip)
diff --git a/demos/rerank/README.md b/demos/rerank/README.md
index 924f87a29e..32379ddcaf 100644
--- a/demos/rerank/README.md
+++ b/demos/rerank/README.md
@@ -1,8 +1,8 @@
-# How to serve Rerank models via Cohere API {#ovms_demos_rerank}
+# Documents Reranking via Cohere API {#ovms_demos_rerank}
 
 ## Prerequisites
 
-**Model preparation**: Python 3.9 or higher with pip
+**Model preparation**: Python 3.10 or higher with pip
 
 **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
diff --git a/demos/vlm_npu/README.md b/demos/vlm_npu/README.md
index 96736c80dc..200fe36ece 100644
--- a/demos/vlm_npu/README.md
+++ 
b/demos/vlm_npu/README.md @@ -1,4 +1,4 @@ -# Serving for Text generation with Visual Language Models with NPU acceleration {#ovms_demos_vlm_npu} +# NPU for Visual Language Models {#ovms_demos_vlm_npu} This demo shows how to deploy VLM models in the OpenVINO Model Server with NPU acceleration. @@ -11,9 +11,7 @@ It is targeted on client machines equipped with NPU accelerator. ## Prerequisites -**OVMS 2025.1 or higher** - -**Model preparation**: Python 3.9 or higher with pip and HuggingFace account +**Model preparation**: Python 3.10 or higher with pip and HuggingFace account **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
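
As a supplementary sketch for the readiness check introduced in the continuous_batching diff: the `/v3/models` listing can be parsed programmatically to confirm the model is served. The helper name below is our own, not part of OVMS; the sample payload matches the documented response:

```python
import json

def served_model_ids(models_response: str) -> list:
    # Extract model ids from an OpenAI-style /v3/models listing.
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]

# Sample response as documented for the readiness check.
sample = """
{
  "object": "list",
  "data": [
    {
      "id": "OpenVINO/InternVL2-2B-int4-ov",
      "object": "model",
      "created": 1772928358,
      "owned_by": "OVMS"
    }
  ]
}
"""
print(served_model_ids(sample))  # → ['OpenVINO/InternVL2-2B-int4-ov']
```

In a deployment script, the same check could be polled in a loop until the expected id appears before sending generation requests.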