Fix physical data corruption in tpu_raiden by implementing CPU Cache Displacement. #190

Open: copybara-service[bot] wants to merge 1 commit into main from test_938149369
Fix physical data corruption in tpu_raiden by implementing CPU Cache Displacement.

We identified a critical CPU cache coherence failure on the sender (D2H) side that corrupted approximately 32 MB of data, matching the L3 cache slice size, under high parallelism (P=8, or P=4 with tight semaphore limits).

The high-performance host allocator uses a first-touch policy that writes zeros to the buffer, filling the CPU cache with dirty lines of zeros. When the TPU performs D2H DMA, it writes directly to DRAM (No-Snoop PCIe), bypassing the CPU cache and leaving the dirty zero lines intact. When the CPU eventually evicts these lines, it overwrites the TPU's fresh data in DRAM with zeros. Under tight semaphore limits, the allocator recycles the same buffer back-to-back, so the stale lines are still cached when the buffer is reused.

To resolve this without the 57-second penalty of a full clflush over 32 GB, we implement a hardware-portable CPU Cache Displacement mechanism. By sequentially reading a thread-local 128 MB dummy buffer, we force the CPU to evict all stale dirty lines from the L3 cache to DRAM in 2-3 milliseconds (roughly a 20,000x speedup).
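The displacement idea can be illustrated with a minimal C++ sketch. This is not the code in this change: the function name `DisplaceCpuCache`, the 64-byte line size, and the one-read-per-line stride are assumptions made for illustration, and whether a single sequential pass really evicts every stale line depends on the CPU's cache size and replacement policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache line size; typical for x86-64 but not guaranteed everywhere.
constexpr std::size_t kCacheLineBytes = 64;
// 128 MB scratch size, taken from the description above.
constexpr std::size_t kDisplacementBytes = std::size_t{128} << 20;

// Hypothetical helper: reads one byte from every cache line of a
// thread-local scratch buffer so that its lines replace whatever was
// cached before. Returns the sum of the bytes read, which equals the
// number of lines touched because the buffer is filled with 1s.
std::uint64_t DisplaceCpuCache(std::size_t bytes = kDisplacementBytes) {
  thread_local std::vector<std::uint8_t> scratch;
  if (scratch.size() < bytes) {
    scratch.assign(bytes, 1);  // One-time (per thread) allocation and fill.
  }
  assert(scratch.size() >= bytes);

  // volatile keeps the compiler from optimizing the reads away.
  const volatile std::uint8_t* p = scratch.data();
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < bytes; i += kCacheLineBytes) {
    sum += p[i];
  }
  return sum;
}
```

A caller would invoke `DisplaceCpuCache()` once per thread at the point where stale lines must be pushed out; passing a smaller size is useful mainly for testing.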

We integrate this displacement automatically into PjRtCopyFuture::Await() for all futures marked as is_d2h, transparently protecting JAX and PyTorch D2H transfers.

Additionally, we implement explicit CPU cache write-back in C++ (clwb + sfence) on the H2D path before the TPU DMA launches.
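A range write-back of this kind might look like the sketch below. The function name `WriteBackRange` and the 64-byte line size are assumptions, not names from this change. The sketch uses `_mm_clwb` only when the compiler is targeting a CPU with CLWB support (for example with `-mclwb`), substitutes `_mm_clflush` on other x86-64 builds, and does nothing on other architectures, so it illustrates the loop structure rather than a portable guarantee.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_HAVE_X86_CACHE_OPS 1
#endif

// Assumed cache line size; typical for x86-64.
constexpr std::size_t kFlushLineBytes = 64;

// Hypothetical helper: writes back every cache line overlapping
// [data, data + size) and then issues a store fence so the write-backs
// are ordered before whatever follows (e.g. launching a DMA).
// Returns the number of cache lines visited.
std::size_t WriteBackRange(const void* data, std::size_t size) {
  if (size == 0) return 0;
  const std::uintptr_t mask = ~static_cast<std::uintptr_t>(kFlushLineBytes - 1);
  const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(data) & mask;
  const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(data) + size;

  std::size_t lines = 0;
  for (std::uintptr_t p = begin; p < end; p += kFlushLineBytes) {
#if defined(SKETCH_HAVE_X86_CACHE_OPS) && defined(__CLWB__)
    _mm_clwb(reinterpret_cast<void*>(p));          // Write back, keep line cached.
#elif defined(SKETCH_HAVE_X86_CACHE_OPS)
    _mm_clflush(reinterpret_cast<const void*>(p)); // Fallback: write back and evict.
#endif
    ++lines;
  }
#if defined(SKETCH_HAVE_X86_CACHE_OPS)
  _mm_sfence();  // Order the write-backs before subsequent stores.
#endif
  return lines;
}
```

The start address is rounded down to a line boundary so that an unaligned buffer still has its first partial line written back.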

PiperOrigin-RevId: 938149369