Hey,
I've been trying to reproduce the results for the following run:
ISL = 1k
OSL = 1k
B200 (Dynamo TRT, MTP)
Date: 2026-01-29
Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1
Interactivity (tok/s/user): 21.338126345946833
Output Token Throughput per GPU (tok/s/gpu): 10,012.214
Total GPUs: 32
Prefill: 12 GPUs, TP: 4, EP: 4, DPA: True, Workers: 3
Decode: 20 GPUs, TP: 4, EP: 4, DPA: True, Workers: 5
Concurrency: 10860
Precision: FP4
GitHub Actions Run
I have 4 nodes of B200 SXM and am using K8s to deploy the same configuration as you did here:
https://github.com/NVIDIA/srt-slurm/blob/sa-submission-q2-2026/recipes/trtllm/b200-fp4/1k1k/mtp/ctx3_gen5_dep4_batch512_eplb0_mtp1.yaml
No matter what I try, my results stay below 10K TPS per GPU. My current best result is ~8.3K tok/s per decode GPU.
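To make the comparison concrete, here is a minimal sketch of the normalization arithmetic. It assumes the two numbers may be divided over different GPU counts (the reported tok/s/gpu over all 32 GPUs, my ~8.3K over the 20 decode GPUs only); I could not confirm which denominator the dashboard uses, and the function names are purely illustrative.

```python
# GPU counts taken from the run description above.
PREFILL_GPUS = 12
DECODE_GPUS = 20
TOTAL_GPUS = PREFILL_GPUS + DECODE_GPUS  # 32

def per_total_gpu(per_decode_gpu_tps: float) -> float:
    """Convert tok/s per decode GPU into tok/s per GPU over all GPUs."""
    return per_decode_gpu_tps * DECODE_GPUS / TOTAL_GPUS

def per_decode_gpu(per_total_gpu_tps: float) -> float:
    """Convert tok/s per GPU (all GPUs) into tok/s per decode GPU."""
    return per_total_gpu_tps * TOTAL_GPUS / DECODE_GPUS

my_best_per_decode_gpu = 8300.0   # my measurement (approximate)
reported_per_gpu = 10012.214      # figure from the published run

# ASSUMPTION: the published figure is normalized over all 32 GPUs.
print(f"mine, per total GPU:      {per_total_gpu(my_best_per_decode_gpu):.1f}")
print(f"reported, per decode GPU: {per_decode_gpu(reported_per_gpu):.1f}")
```

If that assumption holds, the gap is larger than the raw 8.3K vs 10K suggests; if the published number is per decode GPU instead, the two figures are directly comparable.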
I have validated that the KV transfer goes via GPUDirect.
The logs of the run have already expired, so I would like to ask whether there is a way to get them, or at least whether you could share more details on how to reproduce the results, e.g., how many frontends were deployed, or whether the system was configured for performance?
Thanks