[CI] pytorch-finetuning / quick-train-full-finetuning failed on stx (linux) #349

@github-actions

This issue was opened automatically by the Test Playbooks workflow after the test quick-train-full-finetuning failed on the main branch.

Failure scope

  • Playbook: pytorch-finetuning
  • Test id: quick-train-full-finetuning
  • Device: stx
  • Operating system: linux
  • Runner labels: self-hosted, Linux, stx
  • Runner name: xsj-aimlab-stxp-01
  • Commit: 0b670a0916a72ede16f803aaad15cd1673ec0516
  • Workflow run: https://github.com/amd/playbooks/actions/runs/27048931391

Hardware / OS to use to reproduce

Run the failing test on a machine that matches the runner labels above (OS = linux, device = stx). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.

How to dispatch the same test from CI

Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:

gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=pytorch-finetuning

The workflow's matrix narrows down to this (device, platform) combination automatically based on the playbook's tested_platforms.

How to run just this test locally

python .github/scripts/run_playbook_tests.py --playbook pytorch-finetuning --platform linux --device stx

The runner extracts test blocks from playbooks/*/pytorch-finetuning/README.md (the failing block starts around line 222).

Failing test (verbatim from the README)

  • Setup: source finetune-venv/bin/activate
  • Timeout: 1200s

import os
import subprocess
import sys

os.environ["QUICK_TRAIN"] = "1"
os.environ["QUICK_TRAIN_MODEL"] = "unsloth/gemma-3-4b-it"
r = subprocess.run([sys.executable, "train_full_finetuning.py"], timeout=600)
sys.exit(r.returncode)

Result

  • Exit code: 1

stderr (last lines)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:01<00:01,  1.10s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.13it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.09it/s]

Tokenizing train dataset:   0%|          | 0/6 [00:00<?, ? examples/s]
Tokenizing train dataset: 100%|██████████| 6/6 [00:00<00:00, 508.46 examples/s]

Tokenizing eval dataset:   0%|          | 0/2 [00:00<?, ? examples/s]
Tokenizing eval dataset: 100%|██████████| 2/2 [00:00<00:00, 672.60 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.

  0%|          | 0/1 [00:00<?, ?it/s]/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/transformers/integrations/sdpa_attention.py:96: UserWarning: Mem Efficient attention on Current AMD GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at /__w/rockrel/rockrel/external-builds/pytorch/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:383.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/train_full_finetuning.py", line 209, in <module>
    trainer.train()
    ~~~~~~~~~~~~~^^
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/transformers/trainer.py", line 2325, in train
    return inner_training_loop(
        args=args,
    ...<2 lines>...
        ignore_keys_for_eval=ignore_keys_for_eval,
    )
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/transformers/trainer.py", line 2740, in _inner_training_loop
    self.optimizer.step()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/accelerate/optimizer.py", line 179, in step
    self.optimizer.step(closure)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/lr_scheduler.py", line 166, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/optimizer.py", line 533, in wrapper
    out = func(*args, **kwargs)
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/optimizer.py", line 81, in _use_grad
    ret = func(*args, **kwargs)
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/adam.py", line 238, in step
    has_complex = self._init_group(
        group,
    ...<5 lines>...
        state_steps,
    )
  File "/home/user/actions-runner/_work/playbooks/playbooks/playbooks/supplemental/pytorch-finetuning/assets/finetune-venv/lib/python3.13/site-packages/torch/optim/adam.py", line 178, in _init_group
    state["exp_avg"] = torch.zeros_like(
                       ~~~~~~~~~~~~~~~~^
        p, memory_format=torch.preserve_format
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacity of 29.24 GiB of which 26.81 MiB is free. Of the allocated memory 28.53 GiB is allocated by PyTorch, and 233.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)

  0%|          | 0/1 [00:10<?, ?it/s]
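
The OutOfMemoryError above suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. As a rough check on how much that could recover, the sketch below adds up the slack the message itself reports. All figures are copied from the error text; the variable names are illustrative only.

```python
# Slack reported by the OutOfMemoryError message, in MiB.
MIB_PER_GIB = 1024

requested_mib = 50.00              # "Tried to allocate 50.00 MiB"
capacity_gib = 29.24               # "total capacity of 29.24 GiB"
free_mib = 26.81                   # "26.81 MiB is free"
reserved_unallocated_mib = 233.29  # "reserved by PyTorch but unallocated"

# Upper bound on memory that better packing could hand back to the allocator.
slack_mib = free_mib + reserved_unallocated_mib
slack_fraction = slack_mib / (capacity_gib * MIB_PER_GIB)

print(f"requested:        {requested_mib:8.2f} MiB")
print(f"reclaimable slack:{slack_mib:8.2f} MiB")
print(f"share of capacity:{slack_fraction:8.2%}")
```

The reserved-but-unallocated pool is larger than the 50 MiB request, so fragmentation may have blocked this particular allocation. The total slack is still under 1% of the device, though, so the allocator setting alone looks unlikely to make the step fit if the optimizer needs substantially more state.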

stdout (last lines)

Loading dataset...
QUICK_TRAIN=1: using non-gated model for smoke test: unsloth/gemma-3-4b-it
QUICK_TRAIN=1: using 1 step and a tiny dataset (smoke test).
Train samples: 6, Test samples: 2
Total selected samples: 8

Loading unsloth/gemma-3-4b-it...
Note: Model is stored as MXFP4 on Hugging Face but will be loaded as BF16 for training
(This is expected - the warning about MXFP4 is informational)

Model loaded. Weights footprint: 8.60 GB
Gradient checkpointing enabled (saves memory during backprop)
Using bf16 mixed precision.
Starting Full Fine-tuning
Model: unsloth/gemma-3-4b-it
Trainable parameters: 4,300,079,472
Effective batch size: 16
Learning rate: 2e-05
Quick smoke mode enabled: tiny dataset + max_steps=1
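
A back-of-the-envelope estimate helps explain why the run dies inside Adam's state initialization. The sketch below uses the trainable parameter count and GPU capacity from the logs. It assumes 2 bytes per value (bf16) for weights, gradients, and both Adam buffers, which is consistent with the 8.60 GB weights footprint and with the `torch.zeros_like(p, ...)` call in the traceback, but is not confirmed by the training script itself. Activations and allocator overhead are ignored, so the total is a lower bound.

```python
# Lower-bound memory estimate for full fine-tuning with Adam.
GIB = 1024 ** 3

trainable_params = 4_300_079_472  # "Trainable parameters" in stdout
bytes_per_value = 2               # assumption: bf16 for params, grads, Adam state
gpu_capacity_gib = 29.24          # from the OutOfMemoryError message


def estimate_gib(n_params: int, n_bytes: int) -> dict[str, float]:
    """Return the size in GiB of each full-size tensor set kept on the GPU."""
    per_copy = n_params * n_bytes / GIB
    return {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_exp_avg": per_copy,
        "adam_exp_avg_sq": per_copy,
    }


estimate = estimate_gib(trainable_params, bytes_per_value)
total_gib = sum(estimate.values())

for name, gib in estimate.items():
    print(f"{name:>16}: {gib:6.2f} GiB")
print(f"{'total':>16}: {total_gib:6.2f} GiB (capacity {gpu_capacity_gib} GiB)")
```

Under these assumptions the four copies come to roughly 32 GiB, which already exceeds the 29.24 GiB the GPU reports before any activations are counted. That would point to a capacity problem rather than a transient one; a smaller QUICK_TRAIN_MODEL or a lighter optimizer are the obvious things to try, though neither has been verified here.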



This issue is opened and deduplicated by .github/scripts/create_failure_issues.py. Close it once the failure is fixed; subsequent failures with the same scope will open a fresh issue.
