[Reproducibility] dsr1 0528 dynamo + mtp on b200 #1691

@OrZipori

Hey,

I've been trying to reproduce the results for the following run:
ISL = 1k
OSL = 1k

B200 (Dynamo TRT, MTP)
Date: 2026-01-29
Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
Interactivity (tok/s/user): 21.338126345946833
Output Token Throughput per GPU (tok/s/gpu): 10,012.214
Total GPUs: 32
Prefill: 12 GPUs, TP: 4, EP: 4, DPA: True, Workers: 3
Decode: 20 GPUs, TP: 4, EP: 4, DPA: True, Workers: 5
Concurrency: 10860
Precision: FP4
GitHub Actions Run

I have 4 nodes of B200 SXM, and I am using K8s to deploy the same configuration you used here:
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml

No matter what I try, my results stay below 10K tok/s per GPU. My current best result is ~8.3K tok/s per decode GPU.
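To make sure I'm comparing like with like, here is a quick sketch for converting between the two normalizations. It assumes (I have not confirmed this) that the published tok/s/gpu figure divides total output throughput by all 32 GPUs, prefill included, while my ~8.3K figure divides by the 20 decode GPUs only. The helper names are my own, not part of any tool.

```python
# Hypothetical helpers for re-normalizing per-GPU throughput.
# ASSUMPTION: the published "per GPU" number = total output tok/s / all GPUs
# (prefill + decode), while my own number = total output tok/s / decode GPUs.

PREFILL_GPUS = 12  # 3 prefill workers x TP4
DECODE_GPUS = 20   # 5 decode workers x TP4
TOTAL_GPUS = PREFILL_GPUS + DECODE_GPUS  # 32


def per_all_gpus_to_per_decode_gpu(tput_per_gpu: float) -> float:
    """Convert throughput divided by all GPUs into throughput per decode GPU."""
    total_tput = tput_per_gpu * TOTAL_GPUS
    return total_tput / DECODE_GPUS


def per_decode_gpu_to_per_all_gpus(tput_per_decode_gpu: float) -> float:
    """Convert throughput per decode GPU into throughput divided by all GPUs."""
    total_tput = tput_per_decode_gpu * DECODE_GPUS
    return total_tput / TOTAL_GPUS


if __name__ == "__main__":
    published = 10_012.214  # tok/s/gpu reported for the published run
    mine = 8_300.0          # my best result, tok/s per decode GPU
    # prints "published, per decode GPU: 16,019.5"
    print(f"published, per decode GPU: {per_all_gpus_to_per_decode_gpu(published):,.1f}")
    # prints "mine, per GPU (all 32): 5,187.5"
    print(f"mine, per GPU (all 32): {per_decode_gpu_to_per_all_gpus(mine):,.1f}")
```

If that assumption holds, the published run corresponds to ~16.0K tok/s per decode GPU; if the published figure is already per decode GPU, no conversion is needed.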

I have verified that the KV-cache transfer goes over GPUDirect.

The logs for that run have already expired, so I'd like to ask whether there is a way to retrieve them, or whether you could share more details on how to reproduce the results. For example: how many frontends were deployed, and how was the system configured for performance?

Thanks
