SIMD LUT Problem #21

5000user5000 · 2025-04-29T14:14:19Z

5000user5000
Apr 29, 2025
Maintainer

In PR #20, my initial SIMD LUT implementation was very slow, even about five times slower than the naive GEMM approach. After some investigation, I found a few key issues and made improvements. First, the data layout was not continuous. Originally, two int4 values were packed into a uint8, and accessing them required using at() inside the inner loop, causing significant overhead. To fix this, I unpacked the int4 values into uint8s before running GEMM, making it easier for SIMD instructions to access the data. (In the future, I might switch to dequantization instead of simply extracting the lower 4 bits.)

Second, I initially used small temporary arrays to preload data in an attempt to increase parallelism, but after unpacking the data beforehand, it became more efficient to load vectorized data directly without intermediate buffers.

Finally, I realized that the SIMD operations I had split between lut_utils.hpp and matrix_ops.hpp were unnecessarily complicated. In fact, lut_utils.hpp didn't need any changes, and in matrix_ops.hpp, it was simpler and more efficient to perform batched lookups and computations together. These adjustments led to much cleaner and faster code.

5000user5000 · 2025-05-01T14:50:50Z

5000user5000
May 1, 2025
Maintainer Author

Tried 64-byte alignment via AlignedAllocator in PR #22 to potentially improve SIMD efficiency.
Turns out the speed gain is minimal or negligible.

0 replies

5000user5000 · 2025-05-07T09:32:31Z

5000user5000
May 7, 2025
Maintainer Author

By the way, in issue #30 , I want to investigate why simd lut-based gemm get similar or even worse performance than naive gemm.

5 replies

5000user5000 May 7, 2025
Maintainer Author

./build/run_benchmark
[Shape] M=500, K=600, N=500

[ naive_int ] Time: 49.53 ms
[naive_float] Time: 34.1124 ms
[ LUT ] Time: 31.7995 ms

5000user5000 May 7, 2025
Maintainer Author

sudo perf stat -e cache-misses,cache-references,instructions,cycles,branch-misses,L1-dcache-load-misses,L1-dcache-loads ./test_matrix_ops naive

<ch-misses,L1-dcache-load-misses,L1-dcache-loads ./test_matrix_ops naive
Naive float time: 239 ms

Performance counter stats for './test_matrix_ops naive':

       394,407      cache-misses                     #    0.58% of all cache refs         
    68,150,312      cache-references                                                      
 7,564,371,504      instructions                     #    6.57  insn per cycle            
 1,152,169,000      cycles                                                                
     1,075,467      branch-misses                                                         
     6,282,867      cpu_atom/L1-dcache-load-misses/  #    0.29% of all L1-dcache accesses 
     6,282,867      cpu_core/L1-dcache-load-misses/  #    0.29% of all L1-dcache accesses 
 2,154,574,723      cpu_atom/L1-dcache-loads/                                             
 2,154,574,723      cpu_core/L1-dcache-loads/                                             

   0.253112297 seconds time elapsed

   0.240999000 seconds user
   0.011999000 seconds sys

5000user5000 May 7, 2025
Maintainer Author

sudo perf stat -e cache-misses,cache-references,instructions,cycles,branch-misses,L1-dcache-load-misses,L1-dcache-loads ./test_matrix_ops lut

<-misses,L1-dcache-load-misses,L1-dcache-loads ./test_matrix_ops lut
LUT time: 290 ms

Performance counter stats for './test_matrix_ops lut':

       185,905      cache-misses                     #    1.71% of all cache refs         
    10,877,305      cache-references                                                      
 2,254,294,446      instructions                     #    1.68  insn per cycle            
 1,338,995,645      cycles                                                                
        47,413      branch-misses                                                         
   125,331,981      cpu_atom/L1-dcache-load-misses/  #   22.55% of all L1-dcache accesses 
   125,331,981      cpu_core/L1-dcache-load-misses/  #   22.55% of all L1-dcache accesses 
   555,846,210      cpu_atom/L1-dcache-loads/                                             
   555,846,210      cpu_core/L1-dcache-loads/                                             

   0.301421935 seconds time elapsed

   0.296577000 seconds user
   0.005009000 seconds sys

5000user5000 May 7, 2025
Maintainer Author

LUT Version

Execution Time: 290 ms
Instruction Count: 2,254,294,446
IPC: 1.68
L1 Cache Miss Rate: 22.55%
Cache Miss Rate: 1.71%
L1 Cache Loads: 555,846,210
L1 Cache Load Misses: 125,331,981
Branch Misses: 47,413

Naive Float Version

Execution Time: 239 ms
Instruction Count: 7,564,371,504
IPC: 6.57
L1 Cache Miss Rate: 0.29%
Cache Miss Rate: 0.58%
L1 Cache Loads: 2,154,574,723
L1 Cache Load Misses: 6,282,867
Branch Misses: 1,075,467

Key Differences:

Execution Time: The LUT version is about 21% slower than the Naive Float version.
Instruction Count: The LUT version executes only around 30% as many instructions as the Naive Float version.
IPC: The LUT version has a significantly lower IPC (1.68) compared to Naive Float (6.57).
Cache Efficiency:
- The LUT version has a much higher L1 cache miss rate (22.55% vs 0.29%).
- Its overall cache miss rate is also higher (1.71% vs 0.58%).
Memory Access:
- The LUT version performs fewer L1 cache loads (555M vs 2.15B),
- but suffers from significantly more cache load misses (125M vs 6.2M).
Branch Prediction: The LUT version has fewer branch mispredictions (47K vs 1.07M).

These results clearly show:

Although the LUT version performs less computation (fewer instructions), its poor memory access efficiency (high cache miss rates)
leads to a lower IPC, and ultimately worse performance.
This strongly supports the earlier analysis that the vpgatherdd instruction degrades memory access efficiency.

5000user5000 May 7, 2025
Maintainer Author

Summary:
Although the naive GEMM implementation lacks manual optimizations, the compiler successfully vectorizes the core floating-point operations (e.g., a[i][k] * b[k][j]) using SIMD instructions such as vmulps and vfmadd231ps. These fused multiply-add patterns are highly SIMD-friendly, resulting in a high IPC of 6.57, indicating that the CPU's execution units are fully utilized and computation is not the bottleneck.

In contrast, the LUT-based implementation avoids explicit multiplications via table lookups, but in practice, most time is spent on index computations (like a * stride + b) and SIMD gather operations like vpgatherdd. While vpgatherdd allows non-contiguous memory access in vector form, it comes with significant performance penalties due to poor cache friendliness. This is reflected in a much lower IPC of 1.68, suggesting that the CPU spends most of its time stalled on memory accesses.

Memory behavior further separates the two. Naive GEMM performs sequential, row-major accesses, benefiting from spatial locality and hardware prefetching. This results in an extremely low L1 cache miss rate of just 0.29%. On the other hand, the LUT version’s access pattern is highly irregular, causing L1 cache misses to spike to 22.5%, clearly showing that memory latency, not arithmetic, is the dominant performance bottleneck.
These observations are not just theoretical—they’re directly confirmed by hardware-level profiling using perf stat.

5000user5000 · 2025-05-07T15:06:43Z

5000user5000
May 7, 2025
Maintainer Author

At PR #31 , I have mitigated the cache miss overhead for lut-gemm.
By applying several optimizations,such as prefetching the LUT into L1 cache and tiling the matrices to avoid loading overly large chunks that cause cache misses. Also introduing multithreading to further improve performance.

3 replies

5000user5000 May 7, 2025
Maintainer Author

new data:
<isses,L1-dcache-load-misses,L1-dcache-loads ./test_matrix_ops naive
Parallel naive float time: 77 ms

Performance counter stats for './test_matrix_ops naive':

       713,225      cache-misses                     #    1.04% of all cache refs         
    68,677,955      cache-references                                                      
 7,598,204,679      instructions                     #    5.87  insn per cycle            
 1,294,892,703      cycles                                                                
     1,093,041      branch-misses                                                         
    47,122,982      cpu_atom/L1-dcache-load-misses/  #    2.18% of all L1-dcache accesses 
    47,122,982      cpu_core/L1-dcache-load-misses/  #    2.18% of all L1-dcache accesses 
 2,164,663,661      cpu_atom/L1-dcache-loads/                                             
 2,164,663,661      cpu_core/L1-dcache-loads/                                             

   0.096785267 seconds time elapsed

   0.285389000 seconds user
   0.023865000 seconds sys

5000user5000 May 7, 2025
Maintainer Author

<-misses,L1-dcache-load-misses,L1-dcache-loads ./test_matrix_ops lut
LUT time: 83 ms

Performance counter stats for 'make matrix_ops_lut':

       768,353      cache-misses                     #   25.07% of all cache refs         
     3,064,957      cache-references                                                      
 4,125,462,010      instructions                     #    2.77  insn per cycle            
 1,490,971,683      cycles                                                                
     1,036,641      branch-misses                                                         
     2,678,705      cpu_atom/L1-dcache-load-misses/  #    0.20% of all L1-dcache accesses 
     2,678,705      cpu_core/L1-dcache-load-misses/  #    0.20% of all L1-dcache accesses 
 1,310,546,913      cpu_atom/L1-dcache-loads/                                             
 1,310,546,913      cpu_core/L1-dcache-loads/                                             

   0.144292067 seconds time elapsed

   0.354808000 seconds user
   0.024556000 seconds sys

5000user5000 May 7, 2025
Maintainer Author

After applying parallelization and blocking optimizations to both naive and LUT-based GEMM implementations, I observed a significant speedup in execution time—3.1× for naive GEMM and 3.5× for LUT GEMM. Notably, the LUT version, which was previously memory-latency bound with poor cache behavior (22.5% L1 miss), now shows only 0.20% L1 D-cache miss rate. This drastic improvement suggests that SIMD blocking and data locality refinements greatly mitigated cache inefficiency. IPC for LUT also increased from 1.68 to 2.77, indicating better pipeline utilization. These results confirm that LUT GEMM can benefit substantially from proper cache-aware parallelization.

SIMD LUT Problem #21

Uh oh!

5000user5000 Apr 29, 2025 Maintainer

Replies: 3 comments · 8 replies

Uh oh!

5000user5000 May 1, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

Uh oh!

Uh oh!

5000user5000 May 7, 2025 Maintainer Author

5000user5000
Apr 29, 2025
Maintainer

Replies: 3 comments 8 replies

5000user5000
May 1, 2025
Maintainer Author

5000user5000
May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000
May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author

5000user5000 May 7, 2025
Maintainer Author