[ChatQnA] Remove enforce-eager to enable HPU graphs for better vLLM perf #1210
lvliang-intel merged 4 commits into opea-project:main from
Conversation
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Could you please also help to update the GenAIComps settings? https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm
The test matrix did not include "PT_HPU_LAZY_MODE=0, enforce-eager=1" results? According to https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html:
=> Eager mode works best when there are lots of (parallel) requests (and therefore larger batches), i.e. when performance matters most. Was that tested too?
For the latest SW stack version, eager mode still has a performance gap compared with either lazy mode or TorchDynamo mode; see the execution-mode table in https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#execution-modes.
The test covers both smaller and larger numbers of concurrent requests for each set of input/output sequence lengths, and the performance ratio is the geomean across the different request counts and sequence lengths. I think the sentence you quote only compares smaller batches against larger batches, both with HPU Graphs disabled, rather than eager mode against HPU Graphs. Increasing the number of concurrent requests tends to increase throughput, while smaller request counts give relatively better latency. Regarding the maximum batch size, we use
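For reference, the mode selection described in the linked vLLM Gaudi docs can be sketched in docker compose terms as below. This is a hypothetical fragment, not this PR's diff; the service name, image, and command arguments are assumptions for illustration only.

```yaml
# Hypothetical compose fragment showing how PT_HPU_LAZY_MODE and --enforce-eager
# combine to select the vLLM execution mode on Gaudi (per the docs linked above):
#   PT_HPU_LAZY_MODE=1, no --enforce-eager  -> HPU graphs (default after this PR)
#   PT_HPU_LAZY_MODE=1, --enforce-eager     -> PyTorch lazy mode
#   PT_HPU_LAZY_MODE=0, no --enforce-eager  -> torch.compile
#   PT_HPU_LAZY_MODE=0, --enforce-eager     -> PyTorch eager mode
vllm-service:                      # assumed service name
  image: opea/vllm-gaudi:latest    # assumed image
  environment:
    PT_HPU_LAZY_MODE: "1"          # lazy PyTorch bridge backend
  # Omitting --enforce-eager lets vLLM capture HPU graphs.
  command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
```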
@wangkl2 Thanks! => I'll update those args for my vLLM enabling PR in "GenAIInfra": opea-project/GenAIInfra#610
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Description
Remove the --enforce-eager flag for the vllm-gaudi service to enable HPU graphs optimization by default. This improves both OOB latency and OOB throughput on Gaudi SW 1.18.
Referenced benchmarking results ratio of llm serve on a 7B LLM on Gaudi2 before and after this change:
Note: all other parameters are kept consistent, and the geomean is calculated on the perf results normalized against the original setting, measured over different input/output sequence lengths: 128/128, 128/1024, 1024/128, 1024/1024.
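For illustration, a hedged before/after sketch of the vllm-gaudi service command in the compose file; the exact arguments below are assumptions, not copied from this PR's diff.

```yaml
# Before: --enforce-eager disables HPU graph capture
# vllm-service:
#   command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80 --enforce-eager

# After (this change): the flag is dropped, so HPU graphs are used by default
vllm-service:
  command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
```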
Issues
n/a
Type of change
Dependencies
n/a
Tests
Benchmark with GenAIEval.