This issue was opened automatically by the Test Playbooks workflow after the test quick-train-full-finetuning failed on the main branch.
Failure scope
- Playbook: pytorch-finetuning
- Test id: quick-train-full-finetuning
- Device: stx
- Operating system: windows
- Runner labels: self-hosted, Windows, stx
- Runner name: xsj-aimlab-stxp-03
- Commit: 0b670a0916a72ede16f803aaad15cd1673ec0516
- Workflow run: https://github.com/amd/playbooks/actions/runs/27048931391
Hardware / OS to use to reproduce
Run the failing test on a machine that matches the runner labels above (OS = windows, device = stx). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.
How to dispatch the same test from CI
Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:
gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=pytorch-finetuning
The workflow's matrix narrows down to this (device, platform) combination automatically based on the playbook's tested_platforms.
How to run just this test locally
python .github/scripts/run_playbook_tests.py --playbook pytorch-finetuning --platform windows --device stx
The runner extracts test blocks from playbooks/*/pytorch-finetuning/README.md (the failing block starts around line 222).
Failing test (verbatim from the README)
- Setup: finetune-venv\Scripts\activate
- Timeout: 1200s
import os
import subprocess
import sys
os.environ["QUICK_TRAIN"] = "1"
os.environ["QUICK_TRAIN_MODEL"] = "unsloth/gemma-3-4b-it"
r = subprocess.run([sys.executable, "train_full_finetuning.py"], timeout=600)
sys.exit(r.returncode)
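The test forwards the child's return code unchanged, and this run reported 3221226505. On Windows a native crash usually surfaces as an NTSTATUS value that is easier to read in hex. The sketch below is a hypothetical helper, not part of the playbook or the test runner, that formats such a code. Its lookup table is deliberately tiny and covers only a few well-known values. 0xC0000409 is also the code commonly seen when the C runtime fast-fails a process (for example via abort), so it does not by itself prove a buffer overrun.

```python
# Hypothetical triage helper (not part of the repo): render a Windows
# process return code as hex and attach a name when it is a known NTSTATUS.

# Small, non-exhaustive lookup of well-known NTSTATUS values.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(returncode):
    """Return a human-readable string for a process return code."""
    # Mask to 32 bits so signed and unsigned representations agree.
    code = returncode & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(code)
    if name is not None:
        return f"{returncode} = 0x{code:08X} ({name})"
    return f"{returncode} = 0x{code:08X}"


print(describe_exit_code(3221226505))
# prints: 3221226505 = 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN)
```

In the test above, the same helper could be called on r.returncode just before sys.exit so the decoded value appears in the captured stdout.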
Result
stderr (last lines)
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:03<00:03, 3.63s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00, 2.47s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00, 2.64s/it]
Tokenizing train dataset: 0%| | 0/6 [00:00<?, ? examples/s]
Tokenizing train dataset: 100%|██████████| 6/6 [00:00<00:00, 220.04 examples/s]
Truncating train dataset: 0%| | 0/6 [00:00<?, ? examples/s]
Truncating train dataset: 100%|██████████| 6/6 [00:00<00:00, 379.54 examples/s]
Tokenizing eval dataset: 0%| | 0/2 [00:00<?, ? examples/s]
Tokenizing eval dataset: 100%|██████████| 2/2 [00:00<00:00, 111.06 examples/s]
Truncating eval dataset: 0%| | 0/2 [00:00<?, ? examples/s]
Truncating eval dataset: 100%|██████████| 2/2 [00:00<00:00, 136.68 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
0%| | 0/1 [00:00<?, ?it/s]C:\actions-runner\_work\playbooks\playbooks\playbooks\supplemental\pytorch-finetuning\assets\finetune-venv\Lib\site-packages\transformers\integrations\sdpa_attention.py:96: UserWarning: Mem Efficient attention on Current AMD GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at B:\src\torch\aten\src\ATen\native\transformers\hip\sdp_utils.cpp:384.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
stdout (last lines)
Loading dataset...
QUICK_TRAIN=1: using non-gated model for smoke test: unsloth/gemma-3-4b-it
QUICK_TRAIN=1: using 1 step and a tiny dataset (smoke test).
Train samples: 6, Test samples: 2
Total selected samples: 8
Loading unsloth/gemma-3-4b-it...
Note: Model is stored as MXFP4 on Hugging Face but will be loaded as BF16 for training
(This is expected - the warning about MXFP4 is informational)
Model loaded. Weights footprint: 8.60 GB
Gradient checkpointing enabled (saves memory during backprop)
Using bf16 mixed precision.
Starting Full Fine-tuning
Model: unsloth/gemma-3-4b-it
Trainable parameters: 4,300,079,472
Effective batch size: 16
Learning rate: 2e-05
Quick smoke mode enabled: tiny dataset + max_steps=1
Unhandled exception caught in c10/util/AbortHandler.h
00007FF8D0AF177C00007FF8D0AE0800 torch_python.dll!initModule [<unknown file> @ <unknown line number>]
00007FF9C06F19F700007FF9C06F19E0 ucrtbase.dll!terminate [<unknown file> @ <unknown line number>]
00007FF9A90A113400007FF9A90A10F0 VCRUNTIME140.dll!_FrameUnwindFilter [<unknown file> @ <unknown line number>]
00007FF9A90B204C00007FF9A90B1A90 VCRUNTIME140.dll!memset [<unknown file> @ <unknown line number>]
00007FF9A90AE7BF00007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9C32C63DF00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317949700007FF9C3178F00 ntdll.dll!RtlLocateExtendedFeature [<unknown file> @ <unknown line number>]
00007FF9C31D073100007FF9C31D0510 ntdll.dll!RtlRaiseException [<unknown file> @ <unknown line number>]
00007FF9C0B9FE0A00007FF9C0B9FD80 KERNELBASE.dll!RaiseException [<unknown file> @ <unknown line number>]
00007FF9A90A526700007FF9A90A51D0 VCRUNTIME140.dll!CxxThrowException [<unknown file> @ <unknown line number>]
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF991CE4B8C00007FF991CE4B30 c10_hip.dll!c10::cuda::SetDevice [<unknown file> @ <unknown line number>]
00007FF991CCD2F300007FF991CA8A70 c10_hip.dll!c10::FreeCudaMemoryCallbacksRegistry [<unknown file> @ <unknown line number>]
00007FF869FA713700007FF869FA7000 torch_hip.dll!at::cuda::_foreach_add_ [<unknown file> @ <unknown line number>]
00007FF9A90AF73000007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9A90A3A8600007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AEC0400007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9A90A2C9400007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AF54100007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9C32C645F00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317570300007FF9C3175480 ntdll.dll!RtlUnwindEx [<unknown file> @ <unknown line number>]
00007FF9A90AF0CA00007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9A90A18E500007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90A1D0000007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90A2D9000007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AF54100007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9C32C63DF00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317949700007FF9C3178F00 ntdll.dll!RtlLocateExtendedFeature [<unknown file> @ <unknown line number>]
00007FF9C31D073100007FF9C31D0510 ntdll.dll!RtlRaiseException [<unknown file> @ <unknown line number>]
00007FF9C0B9FE0A00007FF9C0B9FD80 KERNELBASE.dll!RaiseException [<unknown file> @ <unknown line number>]
00007FF9A90A526700007FF9A90A51D0 VCRUNTIME140.dll!CxxThrowException [<unknown file> @ <unknown line number>]
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF86867BCD800007FF8686446F0 torch_hip.dll!at::native::foreach_tensor_clamp_min_scalar_kernel_cuda [<unknown file> @ <unknown line number>]
00007FF8686776F400007FF8686446F0 torch_hip.dll!at::native::foreach_tensor_clamp_min_scalar_kernel_cuda [<unknown file> @ <unknown line number>]
00007FF86864311600007FF8686430B0 torch_hip.dll!at::native::foreach_tensor_add_scalar_kernel_cuda_ [<unknown file> @ <unknown line number>]
00007FF869FA70BD00007FF869FA7000 torch_hip.dll!at::cuda::_foreach_add_ [<unknown file> @ <unknown line number>]
00007FF86A1315D900007FF86A052750 torch_hip.dll!at::cuda::_fused_adagrad_ [<unknown file> @ <unknown line number>]
00007FF86FCB025F00007FF86FCB01C0 torch_cpu.dll!at::_ops::_foreach_add__Scalar::redispatch [<unknown file> @ <unknown line number>]
00007FF872D7297D00007FF87258D930 torch_cpu.dll!torch::autograd::impl::JitDecompRegisterer::JitDecompRegisterer [<unknown file> @ <unknown line number>]
00007FF86FCAF93A00007FF86FCAF850 torch_cpu.dll!at::_ops::_foreach_add__Scalar::call [<unknown file> @ <unknown line number>]
00007FF8D089B98800007FF8D071E7F0 torch_python.dll!THPPointer<THPStorage>::release [<unknown file> @ <unknown line number>]
00007FF9294CAD3400007FF9294CACD0 python313.dll!PyThread_get_thread_ident [<unknown file> @ <unknown line number>]
00007FF9294E7A5F00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9294E6FA900007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9295652EE00007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9294EEB4000007FF9294EE730 python313.dll!PyObject_CallNoArgs [<unknown file> @ <unknown line number>]
00007FF92957D8EA00007FF92957D870 python313.dll!PyObject_Call [<unknown file> @ <unknown line number>]
00007FF92956653000007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9294EEA0400007FF9294EE730 python313.dll!PyObject_CallNoArgs [<unknown file> @ <unknown line number>]
00007FF92957D98B00007FF92957D870 python313.dll!PyObject_Call [<unknown file> @ <unknown line number>]
00007FF92953E64500007FF92953D728 python313.dll!PyMapping_Items [<unknown file> @ <unknown line number>]
00007FF9294FF8F900007FF9294FF878 python313.dll!PyObject_MakeTpCall [<unknown file> @ <unknown line number>]
00007FF92962F4D900007FF92962F028 python313.dll!PyMutex_Unlock [<unknown file> @ <unknown line number>]
00007FF9294E704C00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9294E6FA900007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF92956802100007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9295A80C400007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A7F7A00007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A834100007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A844C00007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295E4CB700007FF9295E4AD4 python313.dll!PyDict_PopString [<unknown file> @ <unknown line number>]
00007FF9295E480800007FF9295E4634 python313.dll!PyUnicode_EqualToUTF8AndSize [<unknown file> @ <unknown line number>]
00007FF9295E3FA500007FF9295E3EB4 python313.dll!PyEval_MakePendingCalls [<unknown file> @ <unknown line number>]
00007FF9295E3E4200007FF9295E38BC python313.dll!PyInterpreterConfig_AsDict [<unknown file> @ <unknown line number>]
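To make the trace above easier to scan, the following sketch parses frames and keeps only those from the torch and c10 DLLs. It assumes each frame starts with two concatenated 16-digit hex numbers, the first being the frame address and the second the base address of the named symbol. That reading is an inference from the log format, not something the log documents. Because no file or line information was resolved (every frame shows unknown file), the names are probably nearest-symbol matches, so a large offset from the symbol base hints that the printed name may not be the function actually executing. The function names here are illustrative only.

```python
import re

# Assumed frame layout (inferred from the log, not documented):
# <16 hex: address><16 hex: symbol base> <module>!<symbol> [...]
FRAME_RE = re.compile(r"^([0-9A-F]{16})([0-9A-F]{16})\s+([^!\s]+)!(\S+)")


def parse_frames(trace_text):
    """Parse trace lines into dicts with module, symbol, and offset."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # Skip anything that is not a stack frame.
        address = int(match.group(1), 16)
        symbol_base = int(match.group(2), 16)
        frames.append({
            "module": match.group(3),
            "symbol": match.group(4),
            "offset": address - symbol_base,
        })
    return frames


def torch_frames(frames):
    """Keep only frames that belong to the torch_* and c10* DLLs."""
    return [f for f in frames if f["module"].startswith(("torch_", "c10"))]


# Three frames copied from the trace above.
sample = """\
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF86864311600007FF8686430B0 torch_hip.dll!at::native::foreach_tensor_add_scalar_kernel_cuda_ [<unknown file> @ <unknown line number>]
00007FF9294E7A5F00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
"""

for frame in torch_frames(parse_frames(sample)):
    print(f"{frame['module']}!{frame['symbol']} +0x{frame['offset']:X}")
```

Read this way, the innermost torch frames point at a device error check (c10_cuda_check_implementation) reached from a foreach add kernel. This is only a starting point for triage, not a diagnosis.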
- Exit code: 3221226505 (0xC0000409)
This issue is opened and deduplicated by .github/scripts/create_failure_issues.py. Close it once the failure is fixed; subsequent failures with the same scope will open a fresh issue.