This issue was opened automatically by the Test Playbooks workflow after the test quick-train-full-finetuning failed on the main branch.
Failure scope
- Playbook: pytorch-finetuning
- Test id: quick-train-full-finetuning
- Device: stx
- Operating system: linux
- Runner labels: self-hosted, Linux, stx
- Runner name: xsj-aimlab-stxp-01
- Commit: 0b670a0916a72ede16f803aaad15cd1673ec0516
- Workflow run: https://github.com/amd/playbooks/actions/runs/27048931391
Hardware / OS to use to reproduce
Run the failing test on a machine that matches the runner labels above (OS = linux, device = stx). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.
How to dispatch the same test from CI
Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:
gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=pytorch-finetuning
The workflow's matrix narrows down to this (device, platform) combination automatically based on the playbook's tested_platforms.
How to run just this test locally
python .github/scripts/run_playbook_tests.py --playbook pytorch-finetuning --platform linux --device stx
The runner extracts test blocks from playbooks/*/pytorch-finetuning/README.md (the failing block starts around line 222).
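To read the failing block without running the test, a small helper along these lines could print the README lines near the reported position. This is only a sketch: the glob pattern and the line number 222 come from the text above, while the function name `show_readme_window` and the 10-line radius are illustrative assumptions and not part of the repo's tooling.

```python
from pathlib import Path


def show_readme_window(repo_root, center=222, radius=10):
    """Return (readme_path, [(lineno, text), ...]) for lines near `center`.

    Hypothetical helper: only the glob pattern is taken from this issue.
    """
    pattern = "playbooks/*/pytorch-finetuning/README.md"
    matches = sorted(Path(repo_root).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no README matching {pattern} under {repo_root}")
    readme = matches[0]
    lines = readme.read_text(encoding="utf-8").splitlines()
    # Clamp the window to the file so short files do not raise IndexError.
    start = max(center - radius, 1)
    end = min(center + radius, len(lines))
    window = [(n, lines[n - 1]) for n in range(start, end + 1)]
    return readme, window
```

Calling `show_readme_window(".")` from the repository root and printing each `(lineno, text)` pair should show the test block that the runner extracted, assuming the README has not changed since the failing commit.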
Failing test (verbatim from the README)
- Setup: source finetune-venv/bin/activate
- Timeout: 1200s
import os
import subprocess
import sys
os.environ["QUICK_TRAIN"] = "1"
os.environ["QUICK_TRAIN_MODEL"] = "unsloth/gemma-3-4b-it"
r = subprocess.run([sys.executable, "train_full_finetuning.py"], timeout=600)
sys.exit(r.returncode)
Result
stderr (last lines)
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:01<00:01, 1.10s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.13it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.09it/s]
Tokenizing train dataset: 0%| | 0/6 [00:00<?, ? examples/s]
Tokenizing train dataset: 100%|██████████| 6/6 [00:00<00:00, 508.46 examples/s]
Tokenizing eval dataset: 0%| | 0/2 [00:00<?, ? examples/s]
Tokenizing eval dataset: 100%|██████████| 2/2 [00:00<00:00, 672.60 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
0%| | 0/1 [00:00<?, ?it/s]/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/transformers/integrations/sdpa_attention.py:96: UserWarning: Mem Efficient attention on Current AMD GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at /__w/rockrel/rockrel/external-builds/pytorch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:383.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/train_full_finetuning.py", line 209, in <module>
trainer.train()
~~~~~~~~~~~~~^^
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/transformers/trainer.py", line 2325, in train
return inner_training_loop(
args=args,
...<2 lines>...
ignore_keys_for_eval=ignore_keys_for_eval,
)
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/transformers/trainer.py", line 2740, in _inner_training_loop
self.optimizer.step()
~~~~~~~~~~~~~~~~~~~^^
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/accelerate/optimizer.py", line 179, in step
self.optimizer.step(closure)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/lr_scheduler.py", line 166, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/optimizer.py", line 533, in wrapper
out = func(*args, **kwargs)
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/optimizer.py", line 81, in _use_grad
ret = func(*args, **kwargs)
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/adam.py", line 238, in step
has_complex = self._init_group(
group,
...<5 lines>...
state_steps,
)
File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/adam.py", line 178, in _init_group
state["exp_avg"] = torch.zeros_like(
~~~~~~~~~~~~~~~~^
p, memory_format=torch.preserve_format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacity of 29.24 GiB of which 26.81 MiB is free. Of the allocated memory 28.53 GiB is allocated by PyTorch, and 233.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
0%| | 0/1 [00:10<?, ?it/s]
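The allocator message above recommends `expandable_segments:True` when reserved-but-unallocated memory is large. The following sketch pulls the sizes out of that message so the suggestion can be checked against the numbers. The message text is copied from the stderr above; the regular expressions and the helper name `parse_oom` are assumptions made for illustration, not a PyTorch API.

```python
import re

# Numeric part of the OOM message, copied from the stderr above.
OOM_MESSAGE = (
    "Tried to allocate 50.00 MiB. GPU 0 has a total capacity of 29.24 GiB "
    "of which 26.81 MiB is free. Of the allocated memory 28.53 GiB is "
    "allocated by PyTorch, and 233.29 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical patterns; they match the wording of this particular message.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "capacity": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(message):
    """Return the sizes reported in the OOM message, converted to MiB."""
    sizes = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{key}' in message")
        value, unit = match.groups()
        sizes[key] = float(value) * UNIT_TO_MIB[unit]
    return sizes


stats = parse_oom(OOM_MESSAGE)
allocated_share = stats["allocated"] / stats["capacity"]
fragmented_share = stats["reserved_unallocated"] / stats["capacity"]
print(f"allocated: {allocated_share:.1%} of capacity")
print(f"reserved but unallocated: {fragmented_share:.1%} of capacity")
```

On these figures, live allocations take up about 97.6% of the device and the reserved-but-unallocated pool is under 1%. That suggests fragmentation is a minor factor here and the allocator setting alone is unlikely to recover enough memory, although this reading of the log has not been verified by a rerun.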
stdout (last lines)
Loading dataset...
QUICK_TRAIN=1: using non-gated model for smoke test: unsloth/gemma-3-4b-it
QUICK_TRAIN=1: using 1 step and a tiny dataset (smoke test).
Train samples: 6, Test samples: 2
Total selected samples: 8
Loading unsloth/gemma-3-4b-it...
Note: Model is stored as MXFP4 on Hugging Face but will be loaded as BF16 for training
(This is expected - the warning about MXFP4 is informational)
Model loaded. Weights footprint: 8.60 GB
Gradient checkpointing enabled (saves memory during backprop)
Using bf16 mixed precision.
Starting Full Fine-tuning
Model: unsloth/gemma-3-4b-it
Trainable parameters: 4,300,079,472
Effective batch size: 16
Learning rate: 2e-05
Quick smoke mode enabled: tiny dataset + max_steps=1
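The logs above contain enough information for a rough memory budget. The sketch below multiplies the reported trainable parameter count by the number of per-parameter tensors that full fine-tuning with Adam keeps resident. It assumes that weights, gradients, and both Adam moment buffers are all stored in bf16 (2 bytes per element); the traceback shows `exp_avg` being created with `torch.zeros_like(p)`, which is consistent with that assumption for the optimizer state. Activations, temporary buffers, and memory used by other processes are deliberately left out, so the result is a lower bound rather than a measurement.

```python
GIB = 1024 ** 3

# Figures reported in the logs above.
TRAINABLE_PARAMS = 4_300_079_472
GPU_CAPACITY_GIB = 29.24

# Assumption: every tensor counted below is bf16 (2 bytes per element).
BYTES_PER_ELEMENT = 2


def full_finetune_budget(num_params, bytes_per_element=BYTES_PER_ELEMENT):
    """Return a byte count per resident tensor group, plus the total."""
    per_copy = num_params * bytes_per_element
    budget = {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_exp_avg": per_copy,
        "adam_exp_avg_sq": per_copy,
    }
    budget["total"] = sum(budget.values())
    return budget


budget = full_finetune_budget(TRAINABLE_PARAMS)
print(f"weights: {budget['weights'] / 1e9:.2f} GB")
print(f"total:   {budget['total'] / GIB:.2f} GiB")
print(f"fits in {GPU_CAPACITY_GIB} GiB: {budget['total'] / GIB <= GPU_CAPACITY_GIB}")
```

The weights term reproduces the 8.60 GB footprint reported in stdout, and the four copies together come to about 32.04 GiB, which already exceeds the 29.24 GiB capacity reported in the OOM message. Under these assumptions the quick-train configuration would not fit on this device regardless of batch size, which is consistent with the failure occurring while the optimizer state was being allocated.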
This issue is opened and deduplicated by .github/scripts/create_failure_issues.py. Close it once the failure is fixed; subsequent failures with the same scope will reopen a fresh issue.