@rka97 rka97 commented Jan 27, 2026

Migrate workloads from V100s to A100s. This PR:

  • Updates the runtime budgets for the workloads
  • Sets the default matmul precision to TF32 for JAX and PyTorch
  • Tunes the number of workers for the ImageNet PyTorch input pipeline
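For reference, a minimal sketch of how TF32 matmul defaults are typically enabled in the two frameworks (the exact call sites and flags used in this PR may differ):

```python
# Sketch only: typical framework-level switches for defaulting
# float32 matmuls to TF32 on Ampere GPUs. The PR's actual code
# may set these elsewhere or via different configuration paths.

# JAX: set the default matmul precision globally.
import jax
jax.config.update("jax_default_matmul_precision", "tensorfloat32")

# PyTorch: allow TF32 in cuBLAS matmuls and cuDNN convolutions.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Equivalent high-level switch (PyTorch >= 1.12):
torch.set_float32_matmul_precision("high")
```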

Timing comparisons:


| Workload | step_hint | JAX step time (s) | JAX runtime (h) | PyTorch TF32 step time (s) | PyTorch TF32 runtime (h) | Reference (h) |
|---|---|---|---|---|---|---|
| criteo1tb_jax | 10666 | 0.835 | 2.473 | 0.942 | 2.792 | 2.14 |
| fastmri_jax | 18094 | 0.277 | 1.395 | 0.376 | 1.891 | 1.23 |
| imagenet_resnet_jax | 195999 | 0.259 | 14.079 | 0.300 | 16.344 | 18.38 |
| imagenet_vit_jax | 167999 | 0.386 | 18.003 | 0.321 | 14.971 | 19.38 |
| librispeech_conformer_jax | 76000 | 0.599 | 12.637 | 0.418 | 8.824 | 16.12 |
| librispeech_deepspeech_jax | 38400 | 0.981 | 10.462 | 0.557 | 5.940 | 12.33 |
| lm_jax | 72000 | 0.448 | 8.958 | 0.409 | 8.182 | |
| ogbg_jax | 52000 | 0.220 | 3.178 | 0.238 | 3.442 | 3.34 |
| wmt_jax | 120000 | 0.139 | 4.645 | 0.155 | 5.178 | 12.04 |
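The runtime columns follow directly from step_hint × step time (e.g. for criteo1tb_jax, 10666 steps × 0.835 s ≈ 2.47 h, matching the JAX runtime column up to step-time rounding). A small sketch of the conversion:

```python
def runtime_hours(step_hint: int, step_time_s: float) -> float:
    """Total runtime in hours given a step budget and per-step time in seconds."""
    return step_hint * step_time_s / 3600.0

# criteo1tb_jax, JAX backend: reproduces the 2.473 h runtime column.
print(round(runtime_hours(10666, 0.835), 2))  # → 2.47
```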

priyakasimbeg and others added 11 commits November 6, 2025 03:53
- Introduced DTYPE enum to standardize data types (FLOAT32, FLOAT16, BFLOAT16) for JAX and PyTorch.
- Updated input pipelines and model definitions in CIFAR and ImageNet workloads to utilize mixed precision.
- Implemented casting policies for parameters and inputs using jmp and torch.autocast.
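The commit list introduces a DTYPE enum; a minimal sketch of what such an enum might look like (member names come from the commit message, the string values are illustrative):

```python
from enum import Enum

class DTYPE(Enum):
    """Standardized data-type names shared across JAX and PyTorch workloads.
    Sketch based on the commit message; the real enum may additionally carry
    framework-specific dtype mappings."""
    FLOAT32 = "float32"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"

print(DTYPE.BFLOAT16.value)  # → bfloat16
```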
@rka97 rka97 requested a review from a team as a code owner January 27, 2026 20:54
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@priyakasimbeg

Closing this and opening a new PR from the a100 branch, which has the changes from this source branch merged in.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 29, 2026