
Up to 9% decrease in output tokens per second between v0.3 and v0.4 #514

@sjmonson

Description


Describe the bug

The measured output token throughput in GuideLLM v0.4.0 and later is significantly lower than in v0.3.x, even though all other metrics are near-identical.

Expected behavior

Output token throughput should not differ from v0.3.x when all else is equal.

Environment

Include all relevant environment information:

  1. OS: Fedora 43 container on OCP 4.19
  2. Python version: 3.13.9
  3. GuideLLM version: 0.4.0 and 0.5.0
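
To compare releases side by side, each version can be installed into its own virtual environment. A minimal sketch, assuming both releases are published on PyPI under the guidellm package name; the directory names are arbitrary:

# One virtual environment per GuideLLM release (paths are arbitrary examples).
python -m venv ~/venvs/guidellm-0.3.0
~/venvs/guidellm-0.3.0/bin/pip install guidellm==0.3.0

python -m venv ~/venvs/guidellm-0.4.0
~/venvs/guidellm-0.4.0/bin/pip install guidellm==0.4.0

# Activate the environment under test before running the benchmark below.
source ~/venvs/guidellm-0.4.0/bin/activate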

To Reproduce

Run the following GuideLLM test on v0.3.0 and v0.4.0, then compare the metrics in the resulting output JSON:

# Ensure we use the same endpoint across versions
export GUIDELLM_REQUEST_TYPE=text_completions
export GUIDELLM_TARGET=http://localhost:8080

guidellm benchmark run \
            --target="${GUIDELLM_TARGET}" \
            --model=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
            --processor=RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic \
            --rate-type=concurrent \
            --data=prompt_tokens=1000,output_tokens=1000 \
            --max-seconds=600 \
            --rate=1,50,100,200,300,500,650
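
For the comparison step, a minimal sketch assuming each run's JSON report was saved as v0.3.0.json and v0.4.0.json; the jq field paths are assumptions and may differ between GuideLLM versions, so adjust them to match the actual report structure:

# Hypothetical field paths: adjust '.benchmarks[]' and the metric path to
# whatever the report of each GuideLLM version actually uses.
for f in v0.3.0.json v0.4.0.json; do
    echo "== ${f} =="
    jq -r '.benchmarks[] | .metrics.output_tokens_per_second.successful.mean' "${f}"
done

With one throughput value per concurrency level from each version, the percent change against the v0.3.0 baseline can then be computed per rate.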

Additional context

Total output tokens per second over intended concurrency: [image]

Percent change from the v0.3.0 baseline: [image]
