[Proof of Concept] Precompute static parts of the deformation gradient #1225

Draft
efaulhaber wants to merge 1 commit into trixi-framework:main from efaulhaber:deformation-grad-precompute

@efaulhaber
Member

This is a proof of concept demonstrating what we could gain from precomputing the static part of the deformation gradient. On the GPU, loads of this precomputed part can be coalesced, which is what the memory layout in this PR is chosen for. On the CPU, it is faster to switch the memory layout so that the particle index comes last.
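To make the idea concrete, here is a minimal sketch in plain Python, not the code from this PR. It assumes a simplified deformation gradient of the form F_a = Σ_b V_b (x_b − x_a) ⊗ ∇W(X_a − X_b), where the factor V_b ∇W(X_a − X_b) depends only on the initial configuration and can therefore be computed once. The toy Gaussian kernel, the omission of any correction matrix, and all function names (`kernel_gradient`, `precompute_static`, `deformation_gradient_direct`, `deformation_gradient_precomputed`) are assumptions made for illustration only.

```python
import math


def kernel_gradient(dX, h=1.0):
    # Gradient of the toy kernel W = exp(-|dX|^2 / h^2) with respect to dX.
    # Hypothetical stand-in for a real SPH kernel gradient.
    r2 = sum(c * c for c in dX)
    scale = -2.0 / (h * h) * math.exp(-r2 / (h * h))
    return [scale * c for c in dX]


def precompute_static(X0, volumes, neighbors):
    # static[a][k] = V_b * grad W(X_a - X_b) for the k-th neighbor b of a.
    # Depends only on the initial positions X0, so it is computed once.
    static = []
    for a, nbrs in enumerate(neighbors):
        row = []
        for b in nbrs:
            dX = [xa - xb for xa, xb in zip(X0[a], X0[b])]
            row.append([volumes[b] * g for g in kernel_gradient(dX)])
        static.append(row)
    return static


def deformation_gradient_direct(a, x, X0, volumes, neighbors):
    # Reference version: re-evaluates the kernel gradient on every call.
    ndims = len(x[a])
    F = [[0.0] * ndims for _ in range(ndims)]
    for b in neighbors[a]:
        dX = [xa - xb for xa, xb in zip(X0[a], X0[b])]
        grad = kernel_gradient(dX)
        for i in range(ndims):
            for j in range(ndims):
                F[i][j] += volumes[b] * (x[b][i] - x[a][i]) * grad[j]
    return F


def deformation_gradient_precomputed(a, x, static, neighbors):
    # Same sum, but only loads the precomputed static vectors.
    ndims = len(x[a])
    F = [[0.0] * ndims for _ in range(ndims)]
    for k, b in enumerate(neighbors[a]):
        g = static[a][k]
        for i in range(ndims):
            for j in range(ndims):
                F[i][j] += (x[b][i] - x[a][i]) * g[j]
    return F
```

In this sketch, the per-step loop replaces a kernel evaluation with a load of one small vector per neighbor pair, so how that array is laid out in memory decides how cheap the load is. The two versions agree only up to floating-point rounding, because the multiplications are associated differently.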

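As the tables below show, the GPU speedup is significant but not massive, and mainly in 3D (roughly 1.4x on the H100, versus 1.1x to 1.2x in 2D). On the CPU in 3D, we get a massive 2x speedup (39.861 ms to 20.309 ms), but SIMD-vectorizing this version gains little more (18.901 ms), whereas the vectorized version in #1220 is even faster (14.729 ms). In 2D, this PR is slower on the CPU (6.308 ms vs. 4.855 ms).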

3D

| Machine | #1220 | This PR |
| --- | --- | --- |
| H100 FP64 | 1.358 ms | 989.258 μs |
| H100 FP32 | 855.816 μs | 615.526 μs |
| Intel Xeon w9-3475X (x36) | 39.861 ms | 20.309 ms |
| Intel Xeon w9-3475X (x36) vectorized | 14.729 ms | 18.901 ms |

2D

| Machine | #1220 | This PR |
| --- | --- | --- |
| H100 FP64 | 256.930 μs | 207.394 μs |
| H100 FP32 | 155.842 μs | 136.834 μs |
| Intel Xeon w9-3475X (x36) | 4.855 ms | 6.308 ms |

@efaulhaber efaulhaber self-assigned this May 28, 2026