[ChatQnA] Remove enforce-eager to enable HPU graphs for better vLLM perf #1210
lvliang-intel merged 4 commits into opea-project:main from
Conversation
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Could you please also help to update the GenAIComps settings? https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm
The test matrix did not include "PT_HPU_LAZY_MODE=0, enforce-eager=1" results? According to https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html:
=> Eager mode works best when there are lots of (parallel) requests (and therefore larger batches), i.e. when performance matters most. Was that tested too?
For the latest SW stack version, eager mode still has a performance gap compared with either lazy mode or TorchDynamo mode; see the execution-mode table in https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#execution-modes.
The test covers both smaller and larger numbers of concurrent requests for each set of input/output sequence lengths, and the performance ratio is the geomean across the different request counts and sequence lengths. I think the sentence you quote only compares smaller batches against larger batches, both with HPU Graphs disabled, rather than eager mode against HPU Graphs. Increasing the number of concurrent requests tends to increase throughput, while smaller request counts give relatively better latency. Regarding the maximum batch size, we use
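For reference, the mode selection described in the linked vLLM Gaudi docs can be sketched in docker compose terms as below. This is a hypothetical fragment, not this PR's diff; the service name, image, and command arguments are assumptions for illustration only.

```yaml
# Hypothetical compose fragment showing how PT_HPU_LAZY_MODE and --enforce-eager
# combine to select the vLLM execution mode on Gaudi (per the docs linked above):
#   PT_HPU_LAZY_MODE=1, no --enforce-eager  -> HPU graphs (default after this PR)
#   PT_HPU_LAZY_MODE=1, --enforce-eager     -> PyTorch lazy mode
#   PT_HPU_LAZY_MODE=0, no --enforce-eager  -> torch.compile
#   PT_HPU_LAZY_MODE=0, --enforce-eager     -> PyTorch eager mode
vllm-service:                      # assumed service name
  image: opea/vllm-gaudi:latest    # assumed image
  environment:
    PT_HPU_LAZY_MODE: "1"          # lazy PyTorch bridge backend
  # Omitting --enforce-eager lets vLLM capture HPU graphs.
  command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
```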
@wangkl2 Thanks! => I'll update those args for my vLLM enabling PR in "GenAIInfra": opea-project/GenAIInfra#610
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Description
Remove the --enforce-eager flag for the vllm-gaudi service to enable HPU graphs optimization by default. This improves both OOB latency and OOB throughput on Gaudi SW 1.18.
Referenced benchmarking results ratio of llm serve on a 7B LLM on Gaudi2 before and after this change:
Note: all other parameters are kept consistent, and the geomean is calculated on the perf results normalized against the original setting, measured over different input/output sequence lengths: 128/128, 128/1024, 1024/128, 1024/1024.
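For illustration, a hedged before/after sketch of the vllm-gaudi service command in the compose file; the exact arguments below are assumptions, not copied from this PR's diff.

```yaml
# Before: --enforce-eager disables HPU graph capture
# vllm-service:
#   command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80 --enforce-eager

# After (this change): the flag is dropped, so HPU graphs are used by default
vllm-service:
  command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
```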
Issues
n/a
Type of change
Dependencies
n/a
Tests
Benchmark with GenAIEval.