RTX 5080 via Thunderbolt 5 eGPU: Hard lock on CUDA operations (nvidia-smi works at idle) #979

Description

@roger-pmta

NVIDIA Open GPU Kernel Modules Version

590.44.01 and 580.105.08, both from NVIDIA's official CUDA rhel10 repository

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Rocky Linux 10.1 (Red Quartz)

Kernel Release

6.12.0-124.13.1.el10_1.x86_64 (PREEMPT_DYNAMIC)

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 5080

Describe the bug

RTX 5080 connected via Thunderbolt 5 eGPU enclosure (Sonnet Breakaway Box 850T5) initializes correctly and is visible to nvidia-smi at idle. Any CUDA operation causes immediate system hard-lock requiring power cycle. No kernel panic, no Xid error logged, no SysRq response.

This appears related to closed issue #900 (RTX 5090 via OCuLink), which showed identical symptoms: GPU functional at idle, crash under load, GSP firmware bootstrap errors. That issue was resolved by switching OCuLink docks, but dock alternatives for Thunderbolt 5 are extremely limited.

PCIe link negotiates correctly at 16GT/s x4 (PCIe 4.0 x4), the expected ceiling for the Thunderbolt 4 host controller. BAR allocation succeeds with hotplug resource reservation parameters, and the driver loads and initializes without error.
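For reference, the negotiated link state can be read from the `LnkSta:` line of `sudo lspci -vv` for the GPU endpoint. A minimal parsing sketch follows; the sample line is assumed for illustration, not copied from the attached log:

```python
import re

def parse_lnksta(line):
    """Extract (speed_gts, width) from an lspci -vv LnkSta line."""
    m = re.search(r"Speed\s+([\d.]+)GT/s.*Width\s+x(\d+)", line)
    return (float(m.group(1)), int(m.group(2))) if m else None

# Sample line assumed for illustration; 16 GT/s x4 == PCIe 4.0 x4.
sample = "LnkSta: Speed 16GT/s (downgraded), Width x4"
print(parse_lnksta(sample))  # -> (16.0, 4)
```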

Minimal reproducer:

nvidia-smi                    # Works - GPU visible, ~2W idle
python3 -c "import torch; torch.zeros(1, device='cuda')"   # Hard lock
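To narrow down whether the lock happens at driver-context creation rather than somewhere inside PyTorch, a smaller probe can call the CUDA driver API directly via ctypes. `libcuda.so.1` and `cuInit` are standard driver-API names, but treat this as a triage sketch rather than part of the original report:

```python
import ctypes

def probe_cuinit():
    """Call cuInit(0) alone: if this also hard-locks, the failure occurs at
    driver/GSP initialization, before any allocation or kernel launch."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return "libcuda.so.1 not found"
    rc = libcuda.cuInit(0)  # CUDA driver API entry point; CUDA_SUCCESS == 0
    return f"cuInit returned {rc}"

if __name__ == "__main__":
    print(probe_cuinit())
```

If `cuInit(0)` alone reproduces the lock, PyTorch can be ruled out entirely.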

Hardware:

  • GPU: NVIDIA GeForce RTX 5080 (GB203)
  • eGPU Enclosure: Sonnet Breakaway Box 850T5 (Thunderbolt 5)
  • Host: Lenovo ThinkPad X1 Carbon Gen 11
  • Thunderbolt Controller: Intel Raptor Lake-P Thunderbolt 4 (host), USB4/TB5 (enclosure)
  • OS: Rocky Linux 10.1

Required kernel parameters: pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc

Without pcie_ports=native, GPU enters D3cold and driver fails with "Unable to change power state from D3cold to D0".
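For anyone reproducing this on a RHEL-family system, one way to persist those parameters is `grubby` (shown as an example; adapt to your bootloader setup):

```shell
# Append the workaround parameters to all installed kernels (RHEL-family).
sudo grubby --update-kernel=ALL \
  --args="pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc"

# After rebooting, confirm they took effect:
cat /proc/cmdline
```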

To Reproduce

  1. Connect RTX 5080 to Sonnet Breakaway Box 850T5 (Thunderbolt 5 eGPU enclosure)
  2. Connect enclosure to host via Thunderbolt cable
  3. Boot system with kernel parameters: pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc
  4. Confirm GPU detected: nvidia-smi (shows GPU at idle, ~2W, ~30°C)
  5. Run any CUDA operation: python3 -c "import torch; torch.zeros(1, device='cuda')"
  6. System hard-locks immediately. No kernel panic, no SysRq response. Power cycle required.
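Since no log survives the hard lock locally, any final kernel output might be captured with netconsole streamed to a second machine. The port, addresses, interface name, and MAC below are placeholders, not values from my setup:

```shell
# Stream kernel messages over UDP to a second machine so any last lines
# before the hard lock survive. All addresses below are placeholders.
sudo modprobe netconsole \
  netconsole=6666@192.168.1.10/enp0s31f6,6666@192.168.1.20/aa:bb:cc:dd:ee:ff

# On the receiving machine (192.168.1.20), listen for the stream:
nc -u -l 6666
```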

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

Also reported at https://forums.developer.nvidia.com/t/590-release-feedback-discussion/353310/53, cross-posted to https://forums.developer.nvidia.com/t/580-release-feedback-discussion/341205/898, and to linux-bugs

Metadata

Assignees: none
Labels: bug (Something isn't working)
Milestone: none