MKL is slower? #24
Replies: 2 comments
At first, I thought I had implemented MKL incorrectly, because my naive GEMM was actually faster than MKL's.

matrix shape: (400×600) × (600×400)
==== Int4 packed (SIMD LUT) ====
==== MKL (float) ====
As a follow-up, I also implemented a naive float baseline. As expected, MKL becomes significantly faster for larger matrices. But an interesting observation is that the naive float version beats the naive int version. At first glance, this might seem counterintuitive, but it makes sense on modern CPUs: floating-point operations are heavily optimized in hardware, and compilers often auto-vectorize float loops more aggressively than integer ones. So even though integers might seem simpler, in practice they can be slower. I also came across a related discussion on Reddit.

matrix shape: (400×600) × (600×400)
==== Baseline (float, naive) ====
==== Int4 packed (SIMD LUT) ====
==== MKL (float) ====
In PR #23, I added an MKL path to see how fast it is.
However, I found that MKL is the slowest.
matrix shape: (200,300) x (300,200)
==== Baseline (int, naive) ====
Naive int GEMM: 3.66808 ms
==== Int4 packed (SIMD LUT) ====
==== MKL (float) ====
MKL sgemm: 16.6933 ms