Fix physical data corruption in tpu_raiden by implementing CPU Cache Displacement. by copybara-service[bot] · Pull Request #190 · google/tpu-raiden

copybara-service · 2026-06-25T20:05:13Z

Fix physical data corruption in tpu_raiden by implementing CPU Cache Displacement.

We identified a critical CPU cache coherence failure on the Sender (D2H) side that corrupted approximately 32 MB of data (matching the L3 cache slice size) under high parallelism (P=8, or P=4 with tight semaphore limits).

The high-performance host allocator uses a first-touch policy that writes zeroes to the buffer, filling the CPU cache with dirty, zero-filled lines. When the TPU performs a D2H DMA, it writes directly to DRAM (No-Snoop PCIe), bypassing the CPU cache and leaving those dirty zero lines intact. When the CPU eventually evicts these lines, it overwrites the TPU's fresh data in DRAM with zeroes. Under tight semaphore limits, the allocator recycles the same buffer back-to-back, guaranteeing stale cache hits.

To resolve this without the 57-second performance penalty of a full clflush over 32 GB, we implement a hardware-portable CPU Cache Displacement mechanism. By sequentially reading a thread-local 128 MB dummy buffer, we force the CPU to evict all stale/dirty lines from the L3 cache to DRAM in 2-3 milliseconds (roughly a 20,000x speedup over the clflush approach).
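The displacement idea can be illustrated with a minimal sketch. This is not the code from this change: the names `DisplaceCpuCache`, `kDisplacementBytes`, and `kCacheLineBytes` are hypothetical, and the 64-byte line size is an assumption. Only the thread-local 128 MB buffer and the sequential read come from the description above. Whether a single pass evicts every line depends on the CPU's replacement policy, so treat this as an approximation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 128 MB comes from the PR description; the 64-byte line size is an assumption.
constexpr std::size_t kDisplacementBytes = 128u * 1024 * 1024;
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical helper: reads one byte from every cache line of a thread-local
// dummy buffer so that older lines are pushed out of the cache hierarchy.
// Returns the sum of the bytes read so the loop has an observable result.
inline std::uint64_t DisplaceCpuCache() {
  // Allocated and filled with 1s once per thread, on first call.
  thread_local std::vector<unsigned char> buffer(kDisplacementBytes, 1);
  // volatile keeps the compiler from eliding the loads.
  const volatile unsigned char* p = buffer.data();
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < kDisplacementBytes; i += kCacheLineBytes) {
    sum += p[i];
  }
  return sum;
}
```

The buffer is deliberately larger than the roughly 32 MB L3 slice mentioned above, so a full pass touches more lines than the cache can hold.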

We integrate this displacement automatically into PjRtCopyFuture::Await() for all futures marked as is_d2h, transparently protecting JAX and PyTorch D2H transfers.

Additionally, we implement clean C++ CPU cache flushing (clwb + sfence) on the H2D path before TPU DMA launches.
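A write-back loop of this kind might look like the following sketch. It is an illustration under stated assumptions, not the code in this change: `FlushRangeForDma` and `kFlushLineBytes` are hypothetical names, and the 64-byte line size is assumed. `_mm_clwb` is only used when the compiler is invoked with CLWB support (for example `-mclwb`); otherwise the sketch falls back to `_mm_clflush`, which is a substitution made here so the example builds on any x86-64 toolchain.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_HAS_X86_FLUSH 1
#endif

constexpr std::size_t kFlushLineBytes = 64;  // Assumed cache-line size.

// Hypothetical helper: writes back every cache line overlapping
// [data, data + size), then fences. Returns the number of lines visited.
inline std::size_t FlushRangeForDma(const void* data, std::size_t size) {
  if (size == 0) return 0;
  const std::uintptr_t mask = ~static_cast<std::uintptr_t>(kFlushLineBytes - 1);
  const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(data) & mask;
  const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(data) + size;
  std::size_t lines = 0;
  for (std::uintptr_t p = begin; p < end; p += kFlushLineBytes, ++lines) {
#if defined(SKETCH_HAS_X86_FLUSH) && defined(__CLWB__)
    _mm_clwb(reinterpret_cast<void*>(p));     // Write back, keep line cached.
#elif defined(SKETCH_HAS_X86_FLUSH)
    _mm_clflush(reinterpret_cast<void*>(p));  // Fallback: write back and evict.
#endif
  }
#if defined(SKETCH_HAS_X86_FLUSH)
  _mm_sfence();  // Order the write-backs before the DMA is launched.
#endif
  return lines;
}
```

The start address is rounded down to a line boundary so that an unaligned buffer still has its first and last partial lines written back.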

PiperOrigin-RevId: 938149369

copybara-service Bot force-pushed the test_938149369 branch from 1f51a6c to 56767db Compare June 25, 2026 20:06

copybara-service[bot] wants to merge 1 commit into main from test_938149369
