[CI] pytorch-finetuning / quick-train-full-finetuning failed on stx (windows) #350

@github-actions

This issue was opened automatically by the Test Playbooks workflow after the test quick-train-full-finetuning failed on the main branch.

Failure scope

  • Playbook: pytorch-finetuning
  • Test id: quick-train-full-finetuning
  • Device: stx
  • Operating system: windows
  • Runner labels: self-hosted, Windows, stx
  • Runner name: xsj-aimlab-stxp-03
  • Commit: 0b670a0916a72ede16f803aaad15cd1673ec0516
  • Workflow run: https://github.com/amd/playbooks/actions/runs/27048931391

Hardware / OS to use to reproduce

Run the failing test on a machine that matches the runner labels above (OS = windows, device = stx). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.

How to dispatch the same test from CI

Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:

gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=pytorch-finetuning

The workflow's matrix narrows down to this (device, platform) combination automatically based on the playbook's tested_platforms.

How to run just this test locally

python .github/scripts/run_playbook_tests.py --playbook pytorch-finetuning --platform windows --device stx

The runner extracts test blocks from playbooks/*/pytorch-finetuning/README.md (the failing block starts around line 222).
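To look at the block the runner extracts, it can help to print the README lines around the reported position. The following is a minimal sketch: `show_lines` is a hypothetical helper that is not part of the repo, and the README path in the usage comment is an assumption inferred from the runner paths in the log below, so adjust it to your checkout.

```python
from pathlib import Path


def show_lines(path, center, radius=15):
    """Return the lines within `radius` of 1-indexed line `center`, with line numbers."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    start = max(center - radius, 1)
    end = min(center + radius, len(lines))
    return [f"{n:>5}: {lines[n - 1]}" for n in range(start, end + 1)]


# Example usage (the path is an assumption, not confirmed by this issue):
# print("\n".join(show_lines("playbooks/supplemental/pytorch-finetuning/README.md", 222)))
```

Because the issue only says the block starts "around" line 222, a generous radius is the safer default.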

Failing test (verbatim from the README)

  • Setup: finetune-venv\Scripts\activate
  • Timeout: 1200s

import os
import subprocess
import sys

os.environ["QUICK_TRAIN"] = "1"
os.environ["QUICK_TRAIN_MODEL"] = "unsloth/gemma-3-4b-it"
r = subprocess.run([sys.executable, "train_full_finetuning.py"], timeout=600)
sys.exit(r.returncode)

Result

  • Exit code: 3221226505
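
The exit code is easier to interpret in hexadecimal: 3221226505 is 0xC0000409, the NTSTATUS value STATUS_STACK_BUFFER_OVERRUN. On Windows this value is also commonly reported when a process ends through a fail-fast path such as abort(), so on its own it does not prove an actual buffer overrun. The sketch below shows the conversion; `describe_exit_code` and the small lookup table are illustrative names, not part of the repo.

```python
# A few well-known NTSTATUS values; this table is illustrative, not exhaustive.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code):
    """Format a Windows process exit code as hex plus a known NTSTATUS name."""
    code &= 0xFFFFFFFF  # normalise negative (signed 32-bit) representations
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"0x{code:08X} ({name})"


print(describe_exit_code(3221226505))  # 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN)
```

The mask also handles tools that report the same code as a signed 32-bit integer (-1073740791).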

stderr (last lines)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:03<00:03,  3.63s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.47s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.64s/it]

Tokenizing train dataset:   0%|          | 0/6 [00:00<?, ? examples/s]
Tokenizing train dataset: 100%|██████████| 6/6 [00:00<00:00, 220.04 examples/s]

Truncating train dataset:   0%|          | 0/6 [00:00<?, ? examples/s]
Truncating train dataset: 100%|██████████| 6/6 [00:00<00:00, 379.54 examples/s]

Tokenizing eval dataset:   0%|          | 0/2 [00:00<?, ? examples/s]
Tokenizing eval dataset: 100%|██████████| 2/2 [00:00<00:00, 111.06 examples/s]

Truncating eval dataset:   0%|          | 0/2 [00:00<?, ? examples/s]
Truncating eval dataset: 100%|██████████| 2/2 [00:00<00:00, 136.68 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.

  0%|          | 0/1 [00:00<?, ?it/s]C:\actions-runner\_work\playbooks\playbooks\playbooks\supplemental\pytorch-finetuning\assets\finetune-venv\Lib\site-packages\transformers\integrations\sdpa_attention.py:96: UserWarning: Mem Efficient attention on Current AMD GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at B:\src\torch\aten\src\ATen\native\transformers\hip\sdp_utils.cpp:384.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
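
The UserWarning above names an environment variable, TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL, that enables the experimental attention path. It is not known whether this warning is related to the crash, so toggling the variable is only a diagnostic experiment. The sketch below reruns the same quick-train command with extra variables set for the child process only; `build_env` and `run_quick_train` are hypothetical helpers, and the call assumes the working directory contains train_full_finetuning.py with the venv active.

```python
import os
import subprocess
import sys


def build_env(extra):
    """Return a copy of the current environment with `extra` variables applied."""
    env = os.environ.copy()
    env.update(extra)
    return env


def run_quick_train(extra_env=None, timeout=600):
    """Run train_full_finetuning.py in quick mode and return its exit code."""
    env = build_env({
        "QUICK_TRAIN": "1",
        "QUICK_TRAIN_MODEL": "unsloth/gemma-3-4b-it",
        **(extra_env or {}),
    })
    result = subprocess.run(
        [sys.executable, "train_full_finetuning.py"], env=env, timeout=timeout
    )
    return result.returncode


# Diagnostic experiment only; it is an assumption that this is worth trying:
# run_quick_train({"TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL": "1"})
```

Passing the variables through `env` rather than `os.environ` keeps the parent shell unchanged, which makes before/after comparisons easier.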

stdout (last lines)

Loading dataset...
QUICK_TRAIN=1: using non-gated model for smoke test: unsloth/gemma-3-4b-it
QUICK_TRAIN=1: using 1 step and a tiny dataset (smoke test).
Train samples: 6, Test samples: 2
Total selected samples: 8

Loading unsloth/gemma-3-4b-it...
Note: Model is stored as MXFP4 on Hugging Face but will be loaded as BF16 for training
(This is expected - the warning about MXFP4 is informational)

Model loaded. Weights footprint: 8.60 GB
Gradient checkpointing enabled (saves memory during backprop)
Using bf16 mixed precision.
Starting Full Fine-tuning
Model: unsloth/gemma-3-4b-it
Trainable parameters: 4,300,079,472
Effective batch size: 16
Learning rate: 2e-05
Quick smoke mode enabled: tiny dataset + max_steps=1

Unhandled exception caught in c10/util/AbortHandler.h
00007FF8D0AF177C00007FF8D0AE0800 torch_python.dll!initModule [<unknown file> @ <unknown line number>]
00007FF9C06F19F700007FF9C06F19E0 ucrtbase.dll!terminate [<unknown file> @ <unknown line number>]
00007FF9A90A113400007FF9A90A10F0 VCRUNTIME140.dll!_FrameUnwindFilter [<unknown file> @ <unknown line number>]
00007FF9A90B204C00007FF9A90B1A90 VCRUNTIME140.dll!memset [<unknown file> @ <unknown line number>]
00007FF9A90AE7BF00007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9C32C63DF00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317949700007FF9C3178F00 ntdll.dll!RtlLocateExtendedFeature [<unknown file> @ <unknown line number>]
00007FF9C31D073100007FF9C31D0510 ntdll.dll!RtlRaiseException [<unknown file> @ <unknown line number>]
00007FF9C0B9FE0A00007FF9C0B9FD80 KERNELBASE.dll!RaiseException [<unknown file> @ <unknown line number>]
00007FF9A90A526700007FF9A90A51D0 VCRUNTIME140.dll!CxxThrowException [<unknown file> @ <unknown line number>]
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF991CE4B8C00007FF991CE4B30 c10_hip.dll!c10::cuda::SetDevice [<unknown file> @ <unknown line number>]
00007FF991CCD2F300007FF991CA8A70 c10_hip.dll!c10::FreeCudaMemoryCallbacksRegistry [<unknown file> @ <unknown line number>]
00007FF869FA713700007FF869FA7000 torch_hip.dll!at::cuda::_foreach_add_ [<unknown file> @ <unknown line number>]
00007FF9A90AF73000007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9A90A3A8600007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AEC0400007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9A90A2C9400007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AF54100007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9C32C645F00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317570300007FF9C3175480 ntdll.dll!RtlUnwindEx [<unknown file> @ <unknown line number>]
00007FF9A90AF0CA00007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9A90A18E500007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90A1D0000007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90A2D9000007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AF54100007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9C32C63DF00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317949700007FF9C3178F00 ntdll.dll!RtlLocateExtendedFeature [<unknown file> @ <unknown line number>]
00007FF9C31D073100007FF9C31D0510 ntdll.dll!RtlRaiseException [<unknown file> @ <unknown line number>]
00007FF9C0B9FE0A00007FF9C0B9FD80 KERNELBASE.dll!RaiseException [<unknown file> @ <unknown line number>]
00007FF9A90A526700007FF9A90A51D0 VCRUNTIME140.dll!CxxThrowException [<unknown file> @ <unknown line number>]
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF86867BCD800007FF8686446F0 torch_hip.dll!at::native::foreach_tensor_clamp_min_scalar_kernel_cuda [<unknown file> @ <unknown line number>]
00007FF8686776F400007FF8686446F0 torch_hip.dll!at::native::foreach_tensor_clamp_min_scalar_kernel_cuda [<unknown file> @ <unknown line number>]
00007FF86864311600007FF8686430B0 torch_hip.dll!at::native::foreach_tensor_add_scalar_kernel_cuda_ [<unknown file> @ <unknown line number>]
00007FF869FA70BD00007FF869FA7000 torch_hip.dll!at::cuda::_foreach_add_ [<unknown file> @ <unknown line number>]
00007FF86A1315D900007FF86A052750 torch_hip.dll!at::cuda::_fused_adagrad_ [<unknown file> @ <unknown line number>]
00007FF86FCB025F00007FF86FCB01C0 torch_cpu.dll!at::_ops::_foreach_add__Scalar::redispatch [<unknown file> @ <unknown line number>]
00007FF872D7297D00007FF87258D930 torch_cpu.dll!torch::autograd::impl::JitDecompRegisterer::JitDecompRegisterer [<unknown file> @ <unknown line number>]
00007FF86FCAF93A00007FF86FCAF850 torch_cpu.dll!at::_ops::_foreach_add__Scalar::call [<unknown file> @ <unknown line number>]
00007FF8D089B98800007FF8D071E7F0 torch_python.dll!THPPointer<THPStorage>::release [<unknown file> @ <unknown line number>]
00007FF9294CAD3400007FF9294CACD0 python313.dll!PyThread_get_thread_ident [<unknown file> @ <unknown line number>]
00007FF9294E7A5F00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9294E6FA900007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9295652EE00007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9294EEB4000007FF9294EE730 python313.dll!PyObject_CallNoArgs [<unknown file> @ <unknown line number>]
00007FF92957D8EA00007FF92957D870 python313.dll!PyObject_Call [<unknown file> @ <unknown line number>]
00007FF92956653000007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9294EEA0400007FF9294EE730 python313.dll!PyObject_CallNoArgs [<unknown file> @ <unknown line number>]
00007FF92957D98B00007FF92957D870 python313.dll!PyObject_Call [<unknown file> @ <unknown line number>]
00007FF92953E64500007FF92953D728 python313.dll!PyMapping_Items [<unknown file> @ <unknown line number>]
00007FF9294FF8F900007FF9294FF878 python313.dll!PyObject_MakeTpCall [<unknown file> @ <unknown line number>]
00007FF92962F4D900007FF92962F028 python313.dll!PyMutex_Unlock [<unknown file> @ <unknown line number>]
00007FF9294E704C00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9294E6FA900007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF92956802100007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9295A80C400007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A7F7A00007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A834100007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A844C00007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295E4CB700007FF9295E4AD4 python313.dll!PyDict_PopString [<unknown file> @ <unknown line number>]
00007FF9295E480800007FF9295E4634 python313.dll!PyUnicode_EqualToUTF8AndSize [<unknown file> @ <unknown line number>]
00007FF9295E3FA500007FF9295E3EB4 python313.dll!PyEval_MakePendingCalls [<unknown file> @ <unknown line number>]
00007FF9295E3E4200007FF9295E38BC python313.dll!PyInterpreterConfig_AsDict [<unknown file> @ <unknown line number>]
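
The trace above is long, and most frames belong to the C++ runtime's exception unwinding. A small parser can reduce it to the PyTorch frames. This is a sketch under the assumption that every frame line has the shape seen here (two fused 16-digit hexadecimal values, then `module!symbol`, then a bracketed source location); `parse_frames` and `frames_in` are hypothetical helper names.

```python
import re

# Frame line shape assumed from the trace above:
# <16 hex digits><16 hex digits> module!symbol [<file> @ <line>]
FRAME_RE = re.compile(r"^([0-9A-Fa-f]{16})([0-9A-Fa-f]{16}) ([^!\s]+)!(.+?) \[")


def parse_frames(trace_text):
    """Return (module, symbol) pairs for every line that looks like a frame."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m.group(3), m.group(4)))
    return frames


def frames_in(frames, prefixes=("torch", "c10")):
    """Keep frames whose module name starts with one of `prefixes`."""
    return [f for f in frames if f[0].lower().startswith(prefixes)]
```

Applied to this trace, the filtered frames show a Python call into `_foreach_add_` followed by a failing `c10_cuda_check_implementation`. Because the build has no file or line information, some symbol names (for example `_fused_adagrad_`) may be the nearest exported symbol rather than the function that was actually executing, so treat them as approximate.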



This issue is opened and deduplicated by .github/scripts/create_failure_issues.py. Close it once the failure is fixed; any later failure with the same scope will then open a fresh issue.
