[CI] pytorch-finetuning / quick-train-full-finetuning failed on stx (windows)

This issue was opened automatically by the **Test Playbooks** workflow after the test `quick-train-full-finetuning` failed on the `main` branch.

## Failure scope

- **Playbook:** `pytorch-finetuning`
- **Test id:** `quick-train-full-finetuning`
- **Device:** `stx`
- **Operating system:** `windows`
- **Runner labels:** `self-hosted`, `Windows`, `stx`
- **Runner name:** `xsj-aimlab-stxp-03`
- **Commit:** `0b670a0916a72ede16f803aaad15cd1673ec0516`
- **Workflow run:** https://github.com/amd/playbooks/actions/runs/27048931391

## Hardware / OS to use to reproduce

Run the failing test on a machine that matches the runner labels above (OS = `windows`, device = `stx`). The repo's self-hosted runners already advertise these labels; if you reproduce locally, use the same OS family and the same AMD device class.

## How to dispatch the same test from CI

Re-run only the failing playbook on the same matrix entry by triggering the workflow with the playbook id:

```bash
gh workflow run test-playbooks.yml --repo amd/playbooks -f playbook_id=pytorch-finetuning
```

The workflow's matrix narrows down to this `(device, platform)` combination automatically based on the playbook's `tested_platforms`.

## How to run just this test locally

```bash
python .github/scripts/run_playbook_tests.py --playbook pytorch-finetuning --platform windows --device stx
```

The runner extracts test blocks from `playbooks/*/pytorch-finetuning/README.md` (the failing block starts around line 222).

## Failing test (verbatim from the README)

- **Setup:** `finetune-venv\Scripts\activate`
- **Timeout:** `1200s`

```python
import os
import subprocess
import sys

os.environ["QUICK_TRAIN"] = "1"
os.environ["QUICK_TRAIN_MODEL"] = "unsloth/gemma-3-4b-it"
r = subprocess.run([sys.executable, "train_full_finetuning.py"], timeout=600)
sys.exit(r.returncode)
```

## Result

- **Exit code:** `3221226505` (`0xC0000409`)
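
The exit code is easier to search for in its hexadecimal NTSTATUS form. The sketch below converts it; `to_ntstatus_hex`, `describe_exit_code`, and the small lookup table are hypothetical helpers written for this issue, not part of the repo's scripts. `3221226505` is `0xC0000409` (`STATUS_STACK_BUFFER_OVERRUN`), a status Windows also reports for fail-fast process termination, which would be consistent with the `ucrtbase.dll!terminate` frame in the stdout trace.

```python
# Hypothetical helpers for reading Windows exit codes; not part of the repo.

# A deliberately tiny table of well-known NTSTATUS values.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def to_ntstatus_hex(exit_code: int) -> str:
    """Format an exit code as an unsigned 32-bit hex string.

    Masking with 0xFFFFFFFF also handles tools that print the same
    value as a negative signed integer.
    """
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def describe_exit_code(exit_code: int) -> str:
    """Return the hex form plus a symbolic name when the value is in the table."""
    unsigned = exit_code & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(unsigned, "unknown status")
    return f"{to_ntstatus_hex(exit_code)} ({name})"


print(describe_exit_code(3221226505))
```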

### stderr (last lines)

```

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:03<00:03,  3.63s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.47s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.64s/it]

Tokenizing train dataset:   0%|          | 0/6 [00:00<?, ? examples/s]
Tokenizing train dataset: 100%|██████████| 6/6 [00:00<00:00, 220.04 examples/s]

Truncating train dataset:   0%|          | 0/6 [00:00<?, ? examples/s]
Truncating train dataset: 100%|██████████| 6/6 [00:00<00:00, 379.54 examples/s]

Tokenizing eval dataset:   0%|          | 0/2 [00:00<?, ? examples/s]
Tokenizing eval dataset: 100%|██████████| 2/2 [00:00<00:00, 111.06 examples/s]

Truncating eval dataset:   0%|          | 0/2 [00:00<?, ? examples/s]
Truncating eval dataset: 100%|██████████| 2/2 [00:00<00:00, 136.68 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.

  0%|          | 0/1 [00:00<?, ?it/s]C:\actions-runner\_work\playbooks\playbooks\playbooks\supplemental\pytorch-finetuning\assets\finetune-venv\Lib\site-packages\transformers\integrations\sdpa_attention.py:96: UserWarning: Mem Efficient attention on Current AMD GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at B:\src\torch\aten\src\ATen\native\transformers\hip\sdp_utils.cpp:384.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(

```
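
The `UserWarning` above names an opt-in switch, `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`. Nothing in this log establishes that the warning is related to the abort, so the following is only a sketch of how one might re-run the README test with and without the switch to rule it in or out. `build_repro_env` and `run_repro` are hypothetical names; the environment variables and script name are taken from the failing test and the warning text.

```python
import os
import subprocess
import sys


def build_repro_env(base=None, enable_experimental_attention=False):
    """Return a copy of the environment with the README test's variables set."""
    env = dict(os.environ if base is None else base)
    env["QUICK_TRAIN"] = "1"
    env["QUICK_TRAIN_MODEL"] = "unsloth/gemma-3-4b-it"
    if enable_experimental_attention:
        # Switch named in the UserWarning. Whether it affects the crash
        # is an assumption to test, not a known fix.
        env["TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"] = "1"
    return env


def run_repro(enable_experimental_attention=False, timeout=600):
    """Launch train_full_finetuning.py like the README test and return its exit code."""
    env = build_repro_env(enable_experimental_attention=enable_experimental_attention)
    result = subprocess.run(
        [sys.executable, "train_full_finetuning.py"],
        env=env,
        timeout=timeout,
    )
    return result.returncode
```

Calling `run_repro(False)` and then `run_repro(True)` from the directory that contains `train_full_finetuning.py`, with `finetune-venv` activated, gives two exit codes to compare.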

### stdout (last lines)

```
Loading dataset...
QUICK_TRAIN=1: using non-gated model for smoke test: unsloth/gemma-3-4b-it
QUICK_TRAIN=1: using 1 step and a tiny dataset (smoke test).
Train samples: 6, Test samples: 2
Total selected samples: 8

Loading unsloth/gemma-3-4b-it...
Note: Model is stored as MXFP4 on Hugging Face but will be loaded as BF16 for training
(This is expected - the warning about MXFP4 is informational)

Model loaded. Weights footprint: 8.60 GB
Gradient checkpointing enabled (saves memory during backprop)
Using bf16 mixed precision.
Starting Full Fine-tuning
Model: unsloth/gemma-3-4b-it
Trainable parameters: 4,300,079,472
Effective batch size: 16
Learning rate: 2e-05
Quick smoke mode enabled: tiny dataset + max_steps=1

Unhandled exception caught in c10/util/AbortHandler.h
00007FF8D0AF177C00007FF8D0AE0800 torch_python.dll!initModule [<unknown file> @ <unknown line number>]
00007FF9C06F19F700007FF9C06F19E0 ucrtbase.dll!terminate [<unknown file> @ <unknown line number>]
00007FF9A90A113400007FF9A90A10F0 VCRUNTIME140.dll!_FrameUnwindFilter [<unknown file> @ <unknown line number>]
00007FF9A90B204C00007FF9A90B1A90 VCRUNTIME140.dll!memset [<unknown file> @ <unknown line number>]
00007FF9A90AE7BF00007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9C32C63DF00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317949700007FF9C3178F00 ntdll.dll!RtlLocateExtendedFeature [<unknown file> @ <unknown line number>]
00007FF9C31D073100007FF9C31D0510 ntdll.dll!RtlRaiseException [<unknown file> @ <unknown line number>]
00007FF9C0B9FE0A00007FF9C0B9FD80 KERNELBASE.dll!RaiseException [<unknown file> @ <unknown line number>]
00007FF9A90A526700007FF9A90A51D0 VCRUNTIME140.dll!CxxThrowException [<unknown file> @ <unknown line number>]
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF991CE4B8C00007FF991CE4B30 c10_hip.dll!c10::cuda::SetDevice [<unknown file> @ <unknown line number>]
00007FF991CCD2F300007FF991CA8A70 c10_hip.dll!c10::FreeCudaMemoryCallbacksRegistry [<unknown file> @ <unknown line number>]
00007FF869FA713700007FF869FA7000 torch_hip.dll!at::cuda::_foreach_add_ [<unknown file> @ <unknown line number>]
00007FF9A90AF73000007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9A90A3A8600007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AEC0400007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9A90A2C9400007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AF54100007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9C32C645F00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317570300007FF9C3175480 ntdll.dll!RtlUnwindEx [<unknown file> @ <unknown line number>]
00007FF9A90AF0CA00007FF9A90AE720 VCRUNTIME140.dll!_C_specific_handler [<unknown file> @ <unknown line number>]
00007FF9A90A18E500007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90A1D0000007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90A2D9000007FF9A90A1230 VCRUNTIME140.dll!is_exception_typeof [<unknown file> @ <unknown line number>]
00007FF9A90AF54100007FF9A90AF4D0 VCRUNTIME140.dll!_CxxFrameHandler3 [<unknown file> @ <unknown line number>]
00007FF9C32C63DF00007FF9C32C6340 ntdll.dll!_chkstk [<unknown file> @ <unknown line number>]
00007FF9C317949700007FF9C3178F00 ntdll.dll!RtlLocateExtendedFeature [<unknown file> @ <unknown line number>]
00007FF9C31D073100007FF9C31D0510 ntdll.dll!RtlRaiseException [<unknown file> @ <unknown line number>]
00007FF9C0B9FE0A00007FF9C0B9FD80 KERNELBASE.dll!RaiseException [<unknown file> @ <unknown line number>]
00007FF9A90A526700007FF9A90A51D0 VCRUNTIME140.dll!CxxThrowException [<unknown file> @ <unknown line number>]
00007FF991CE426200007FF991CE4100 c10_hip.dll!c10::cuda::c10_cuda_check_implementation [<unknown file> @ <unknown line number>]
00007FF86867BCD800007FF8686446F0 torch_hip.dll!at::native::foreach_tensor_clamp_min_scalar_kernel_cuda [<unknown file> @ <unknown line number>]
00007FF8686776F400007FF8686446F0 torch_hip.dll!at::native::foreach_tensor_clamp_min_scalar_kernel_cuda [<unknown file> @ <unknown line number>]
00007FF86864311600007FF8686430B0 torch_hip.dll!at::native::foreach_tensor_add_scalar_kernel_cuda_ [<unknown file> @ <unknown line number>]
00007FF869FA70BD00007FF869FA7000 torch_hip.dll!at::cuda::_foreach_add_ [<unknown file> @ <unknown line number>]
00007FF86A1315D900007FF86A052750 torch_hip.dll!at::cuda::_fused_adagrad_ [<unknown file> @ <unknown line number>]
00007FF86FCB025F00007FF86FCB01C0 torch_cpu.dll!at::_ops::_foreach_add__Scalar::redispatch [<unknown file> @ <unknown line number>]
00007FF872D7297D00007FF87258D930 torch_cpu.dll!torch::autograd::impl::JitDecompRegisterer::JitDecompRegisterer [<unknown file> @ <unknown line number>]
00007FF86FCAF93A00007FF86FCAF850 torch_cpu.dll!at::_ops::_foreach_add__Scalar::call [<unknown file> @ <unknown line number>]
00007FF8D089B98800007FF8D071E7F0 torch_python.dll!THPPointer<THPStorage>::release [<unknown file> @ <unknown line number>]
00007FF9294CAD3400007FF9294CACD0 python313.dll!PyThread_get_thread_ident [<unknown file> @ <unknown line number>]
00007FF9294E7A5F00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9294E6FA900007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9295652EE00007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9294EEB4000007FF9294EE730 python313.dll!PyObject_CallNoArgs [<unknown file> @ <unknown line number>]
00007FF92957D8EA00007FF92957D870 python313.dll!PyObject_Call [<unknown file> @ <unknown line number>]
00007FF92956653000007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9294EEA0400007FF9294EE730 python313.dll!PyObject_CallNoArgs [<unknown file> @ <unknown line number>]
00007FF92957D98B00007FF92957D870 python313.dll!PyObject_Call [<unknown file> @ <unknown line number>]
00007FF92953E64500007FF92953D728 python313.dll!PyMapping_Items [<unknown file> @ <unknown line number>]
00007FF9294FF8F900007FF9294FF878 python313.dll!PyObject_MakeTpCall [<unknown file> @ <unknown line number>]
00007FF92962F4D900007FF92962F028 python313.dll!PyMutex_Unlock [<unknown file> @ <unknown line number>]
00007FF9294E704C00007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF9294E6FA900007FF9294E6F70 python313.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF92956802100007FF9295620D0 python313.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF9295A80C400007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A7F7A00007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A834100007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295A844C00007FF9295A7EC4 python313.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF9295E4CB700007FF9295E4AD4 python313.dll!PyDict_PopString [<unknown file> @ <unknown line number>]
00007FF9295E480800007FF9295E4634 python313.dll!PyUnicode_EqualToUTF8AndSize [<unknown file> @ <unknown line number>]
00007FF9295E3FA500007FF9295E3EB4 python313.dll!PyEval_MakePendingCalls [<unknown file> @ <unknown line number>]
00007FF9295E3E4200007FF9295E38BC python313.dll!PyInterpreterConfig_AsDict [<unknown file> @ <unknown line number>]


```
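
The native trace above is long, and most of it is exception-unwinding and interpreter frames. As a reading aid, the sketch below extracts `module!symbol` pairs and keeps only the PyTorch frames. It assumes each frame line starts with two fused 16-digit hex values (taken here to be an instruction address followed by a symbol base address) and that the bracketed file/line suffix is always present; `Frame`, `parse_frames`, and `frames_in` are hypothetical names. Because no file or line information was resolved, the symbol names may be nearest-export approximations rather than the functions actually executing.

```python
import re
from typing import NamedTuple


class Frame(NamedTuple):
    module: str
    symbol: str
    offset: int  # instruction address minus symbol base (assumed meaning)


# Two fused 16-digit hex values, then "module!symbol [file @ line]".
_FRAME_RE = re.compile(
    r"^(?P<addr>[0-9A-Fa-f]{16})(?P<base>[0-9A-Fa-f]{16})\s+"
    r"(?P<module>[^!\s]+)!(?P<symbol>.+?)\s+\["
)


def parse_frames(trace: str) -> list[Frame]:
    """Parse every line that looks like a frame; silently skip the rest."""
    frames = []
    for line in trace.splitlines():
        match = _FRAME_RE.match(line.strip())
        if match is None:
            continue
        offset = int(match["addr"], 16) - int(match["base"], 16)
        frames.append(Frame(match["module"], match["symbol"], offset))
    return frames


def frames_in(frames, module_prefixes=("torch_", "c10")):
    """Keep only frames whose module name starts with one of the prefixes."""
    return [f for f in frames if f.module.startswith(module_prefixes)]
```

Pasting the stdout block into `parse_frames` and passing the result to `frames_in` leaves the `c10_hip.dll`, `torch_hip.dll`, `torch_cpu.dll`, and `torch_python.dll` frames, which is usually the part worth quoting in an upstream report.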

---
_This issue is opened and deduplicated by `.github/scripts/create_failure_issues.py`. Close it once the failure is fixed; subsequent failures with the same scope will open a fresh issue._
