MKL is slower? #24
Replies: 2 comments
At first, I thought I had implemented MKL incorrectly, because my naive GEMM was actually faster than MKL's.

matrix shape: (400×600) × (600×400)
==== Int4 packed (SIMD LUT) ====
==== MKL (float) ====
As a follow-up, I also implemented a naive float baseline. As expected, MKL becomes significantly faster for larger matrices. But an interesting observation is that the naive float version beats the naive int version. At first glance, this might seem counterintuitive, but it makes sense on modern CPUs: floating-point operations are heavily optimized in hardware, and compilers often auto-vectorize float loops more aggressively than integer ones. So even though integers might seem simpler, in practice they can be slower. I also came across a related discussion on Reddit.

matrix shape: (400×600) × (600×400)
==== Baseline (float, naive) ====
==== Int4 packed (SIMD LUT) ====
==== MKL (float) ====
In PR #23, I added an MKL path to see how fast it is.
However, I found that MKL is the slowest.
matrix shape: (200,300) x (300,200)
==== Baseline (int, naive) ====
Naive int GEMM: 3.66808 ms
==== Int4 packed (SIMD LUT) ====
==== MKL (float) ====
MKL sgemm: 16.6933 ms