SIMD LUT Problem #21
Replies: 3 comments 8 replies
Tried 64-byte alignment via AlignedAllocator in PR #22 to potentially improve SIMD efficiency.
By the way, in issue #30, I want to investigate why the SIMD LUT-based GEMM gets similar or even worse performance than the naive GEMM.
In PR #31, I mitigated the cache-miss overhead for LUT-GEMM.
In PR #20, my initial SIMD LUT implementation was very slow, about five times slower than the naive GEMM approach. After some investigation, I found a few key issues and made improvements.

First, the data layout was not contiguous. Originally, two int4 values were packed into one uint8, and accessing them required calling `at()` inside the inner loop, causing significant overhead. To fix this, I unpacked the int4 values into uint8s before running GEMM, making the data easier for SIMD instructions to access. (In the future, I might switch to dequantization instead of simply extracting the lower 4 bits.)

Second, I initially used small temporary arrays to preload data in an attempt to increase parallelism, but after unpacking the data beforehand, it became more efficient to load vectorized data directly without intermediate buffers.
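The unpacking step can be sketched like this (function and variable names are illustrative, not the repo's actual API): each packed byte is split into its low and high nibbles ahead of the GEMM loop, so the inner loop reads a contiguous byte stream instead of shifting and masking on every access.

```cpp
#include <cstdint>
#include <vector>

// Unpack two 4-bit values per byte into one uint8_t each, done once
// before GEMM. The low nibble is emitted first, then the high nibble;
// the actual ordering in the repo may differ.
std::vector<std::uint8_t> unpack_int4(const std::vector<std::uint8_t>& packed) {
    std::vector<std::uint8_t> out;
    out.reserve(packed.size() * 2);
    for (std::uint8_t byte : packed) {
        out.push_back(byte & 0x0F);         // low nibble
        out.push_back((byte >> 4) & 0x0F);  // high nibble
    }
    return out;
}
```

This trades 2x memory for the weight buffer in exchange for branch-free, contiguous accesses in the hot loop.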
Finally, I realized that the SIMD operations I had split between `lut_utils.hpp` and `matrix_ops.hpp` were unnecessarily complicated. In fact, `lut_utils.hpp` didn't need any changes, and in `matrix_ops.hpp` it was simpler and more efficient to perform the batched lookups and computations together. These adjustments led to much cleaner and faster code.
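A scalar sketch of what "batched lookups and computations together" means (names are assumptions, not the code in `matrix_ops.hpp`): each unpacked 4-bit index selects an entry from a 16-entry table of precomputed partial products, and the result is accumulated immediately in the same loop, a pattern a vectorizing compiler or hand-written intrinsics can map to shuffle-and-add instructions.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Fused lookup + accumulation over a 16-entry LUT. Doing both in one
// pass avoids materializing an intermediate buffer of looked-up values.
float lut_dot(const std::array<float, 16>& lut,
              const std::vector<std::uint8_t>& indices) {
    float acc = 0.0f;
    for (std::uint8_t idx : indices)
        acc += lut[idx & 0x0F];  // lookup and accumulate in one step
    return acc;
}
```

On x86, the 16-entry table fits one SSE register, so the per-element lookup can become a single `_mm_shuffle_epi8` over 16 indices at a time.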