We're hitting a reproducible NIC crash (host becomes permanently unreachable) on ice 2.2.8 during AF_XDP/XSK socket setup/teardown under TX load. The crash sequence is:
- False TX watchdog timeout (transmit queue N timed out) during ice_qp_dis
- PF reset triggers NULL pointer dereference in ice_qp_dis+0x6c (ring pointer freed during reset)
- Workqueue deadlock between ice_service_task and xp_release_deferred, after which the NIC never recovers
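For anyone trying to confirm they are seeing the same sequence, here is a minimal sketch of a log scanner for the three symptoms above. The regular expressions and the `classify` helper are our own illustration, not part of the driver or any Intel tooling, and the exact kernel log wording may differ between kernel versions.

```python
import re

# Hypothetical patterns modeled on the three symptoms listed above.
# They are assumptions: kernel log wording can vary by version.
SIGNATURES = [
    ("tx_watchdog", re.compile(r"transmit queue \d+ timed out")),
    ("null_deref", re.compile(r"ice_qp_dis\+0x[0-9a-f]+")),
    ("deadlock", re.compile(r"ice_service_task|xp_release_deferred")),
]


def classify(lines):
    """Return symptom names in the order they first appear in the log."""
    seen = []
    for line in lines:
        for name, pattern in SIGNATURES:
            if name not in seen and pattern.search(line):
                seen.append(name)
    return seen
```

Feeding it a saved kernel log, for example `classify(open("dmesg.txt"))`, should return the symptoms in the order they occurred, which makes it easy to check whether a host followed the watchdog, then NULL dereference, then deadlock path.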
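The upstream Linux kernel already has 4 commits (merged Jul-Aug 2024) that fix this. They all modify ice_xsk.c (ice_qp_dis / ice_qp_ena):
- synchronize_net() before queue stop
- netif_carrier_off/on during XSK reconfig
- ICE_CFG_BUSY (never protected the queue pair)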
All of them fix the original commit 2d4238f55697 (torvalds/linux@2d4238f55697), "ice: Add support for AF_XDP".
Questions:
- Are these fixes planned for any upcoming out-of-tree release (2.5.x or later)?
- If so, what's the expected timeline?
We've validated that backporting these 4 commits onto ice 2.5.4 resolves the issue in our environment.
Environment: Intel E810, ice 2.2.8 (DKMS), kernel 5.10.253 (Amazon Linux 2), AF_XDP with 62 queues.