Describe the bug
The measured output token throughput in GuideLLM v0.4.0 and later is significantly lower than in v0.3.x, despite all other metrics being near identical.
Expected behavior
Output token throughput should not differ from v0.3.x when all else is equal.
Environment
- OS: Fedora 43 container on OCP 4.19
- Python version: 3.13.9
- GuideLLM version: 0.4.0 and 0.5.0
To Reproduce
Run the following GuideLLM benchmark on v0.3.0 and v0.4.0, then compare the metrics in the output JSON (see the comparison sketch after the command):
# Ensure we use the same endpoint across versions
export GUIDELLM_REQUEST_TYPE=text_completions
export GUIDELLM_TARGET=http://localhost:8080
guidellm benchmark run \
--target="${GUIDELLM_TARGET}" \
--model=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
--processor=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
--rate-type=concurrent \
--data=prompt_tokens=1000,output_tokens=1000 \
--max-seconds=600 \
--rate=1,50,100,200,300,500,650
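
A minimal comparison sketch for the two result files, assuming the output JSON schema changed between versions: rather than hard-coding key paths, it walks the whole document and collects any numeric field whose name looks like an output-token throughput. The field-name heuristic, the file names, and the assumption that both files list benchmarks in the same order are all my own; adjust to the actual files each version produces.

# compare_throughput.py -- not part of GuideLLM; a hypothetical helper.
import json
import sys


def collect_throughputs(path):
    """Return [(key, value), ...] for throughput-like numeric fields."""
    with open(path) as f:
        data = json.load(f)
    found = []

    def walk(node):
        # Heuristic match on field names (an assumption, not the real schema).
        if isinstance(node, dict):
            for key, value in node.items():
                if ("output" in key and "per_second" in key
                        and isinstance(value, (int, float))):
                    found.append((key, value))
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return found


if __name__ == "__main__":
    # Usage: python compare_throughput.py results-v0.3.0.json results-v0.4.0.json
    baseline, candidate = sys.argv[1], sys.argv[2]
    pairs = zip(collect_throughputs(baseline), collect_throughputs(candidate))
    for (key, old_val), (_, new_val) in pairs:
        if old_val == 0:
            continue  # avoid division by zero on empty runs
        pct = 100.0 * (new_val - old_val) / old_val
        print(f"{key}: {old_val:.1f} -> {new_val:.1f} ({pct:+.1f}%)")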
Additional context
[Chart: Total output tokens per second vs. intended concurrency]
[Chart: Percent change from v0.3.0 baseline]
