Commit f910071

Metal Q4 fast kernel: llama.cpp-inspired uint16 mask + SIMD-group

Reimplemented the GPU Q4 matmul based on llama.cpp's actual technique
(refs/llama.cpp/ggml/src/ggml-metal/ggml-metal.metal).

Key insight: llama.cpp does NOT convert Q4 to FP16. Weights stay Q4.
Speed comes from shader optimization:
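- uint16 reads: each 16-bit load covers 2 bytes (4 nibbles), isolated in place via masks (0x000F, 0x0F00, 0x00F0, 0xF000)
- Scale absorption: a masked nibble is left unshifted (e.g. 256x too large for 0x0F00), so multiplying by d/256 replaces the bit shift; the multiply is needed anyway
- sumy trick: the -8 bias is factored out of the inner loop as sumy*(-8)*d, where sumy is the sum of the block's input values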
- SIMD-group: 32 threads cooperate per output row
- float4 vectorized input loads
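
The arithmetic behind the list above can be sketched on the CPU. This is a Python illustration only, not the Metal kernel from this commit. It assumes the ggml Q4_0 layout (32 weights per block, one scale d, 16 bytes where byte i holds weight i in its low nibble and weight i+16 in its high nibble). All function names are hypothetical. The power-of-two factors are folded into the input values here rather than into d, which is algebraically equivalent. The last function emulates a 32-lane SIMD-group with a shuffle-down style reduction, the pattern Metal exposes as simd_sum.

```python
import struct

def q4_0_dot_naive(d, qs, y):
    """Reference: unpack every nibble, subtract the bias of 8, scale by d."""
    total = 0.0
    for i in range(16):
        lo = qs[i] & 0x0F      # weight i
        hi = qs[i] >> 4        # weight i + 16
        total += (lo - 8) * d * y[i]
        total += (hi - 8) * d * y[i + 16]
    return total

def q4_0_dot_masked(d, qs, y):
    """Same dot product using uint16 reads, masks without shifts, and sumy."""
    sumy = sum(y)                              # used once for the -8 bias
    words = struct.unpack("<8H", bytes(qs))    # 16 bytes -> 8 little-endian uint16
    acc = 0.0
    for k, w in enumerate(words):
        i = 2 * k
        # Each masked value is q * 2^shift; the matching 1/2^shift is
        # absorbed into the input instead of shifting the nibble down.
        acc += y[i] * (w & 0x000F)
        acc += (y[i + 1] / 256.0) * (w & 0x0F00)
        acc += (y[i + 16] / 16.0) * (w & 0x00F0)
        acc += (y[i + 17] / 4096.0) * (w & 0xF000)
    # sum((q - 8) * d * y) == d * (sum(q * y) - 8 * sum(y))
    return d * (acc - 8.0 * sumy)

def row_dot_simd(blocks, y, lanes=32):
    """Emulate one SIMD-group computing one output row.

    blocks is a list of (d, qs) pairs; lane t handles blocks t, t + lanes, ...
    lanes must be a power of two for the tree reduction below.
    """
    partial = [0.0] * lanes
    for b, (d, qs) in enumerate(blocks):
        partial[b % lanes] += q4_0_dot_masked(d, qs, y[32 * b:32 * b + 32])
    step = lanes // 2
    while step:                                # log2(lanes) reduction steps
        for t in range(step):
            partial[t] += partial[t + step]
        step //= 2
    return partial[0]
```

The two block-level functions agree up to floating-point rounding, because dividing by a power of two is exact and only the summation order differs.
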
Results (M1 Pro, 1-commit GPU graph):
- SmolLM2 135M: 27 tok/s (was 22 with naive kernel, +23%)
- Still 3.5x slower than CPU NEON (96 tok/s)
- Bottleneck: per-layer commit overhead (~0.3ms × 28 layers)
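
The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers reported in this message; the variable names are illustrative, and the per-token times are derived from the throughput figures, not measured separately.

```python
def ms_per_token(tok_per_s):
    """Convert throughput in tokens/second to latency in ms/token."""
    return 1000.0 / tok_per_s

# Numbers taken from the results list in this commit message.
gpu_new, gpu_old, cpu = 27.0, 22.0, 96.0   # tok/s
commit_ms, layers = 0.3, 28                # approx. per-layer commit cost

speedup_pct = (gpu_new / gpu_old - 1.0) * 100.0   # new kernel vs naive kernel
cpu_ratio = cpu / gpu_new                         # CPU NEON vs GPU throughput
gpu_ms = ms_per_token(gpu_new)
cpu_ms = ms_per_token(cpu)
commit_total_ms = commit_ms * layers              # commit overhead per token

print(f"speedup over naive kernel: {speedup_pct:.1f}%")
print(f"CPU/GPU throughput ratio:  {cpu_ratio:.2f}x")
print(f"GPU ms/token: {gpu_ms:.1f}, CPU ms/token: {cpu_ms:.1f}")
print(f"commit overhead per token: {commit_total_ms:.1f} ms")
```

With these inputs the commit overhead comes to about 8.4 ms out of roughly 37 ms per GPU token, against about 10.4 ms per token on the CPU path.
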
The Q4 kernel itself is now efficient. The remaining gap is
architectural: CPU NEON avoids ALL dispatch overhead. GPU needs
graph compilation (encode entire model, commit once per forward)
which requires a tensor graph IR — equivalent to building ggml.
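
The difference between committing per layer and committing once per forward pass can be illustrated with a toy model. This is a sketch under stated assumptions: FakeCommandBuffer is a hypothetical stand-in that only records operations and counts commits. It is not the Metal API and not code from this repository.

```python
from dataclasses import dataclass, field

@dataclass
class FakeCommandBuffer:
    """Hypothetical command buffer: records ops and counts commits."""
    encoded: list = field(default_factory=list)
    commits: int = 0

    def encode(self, op):
        self.encoded.append(op)

    def commit(self):
        """Submit everything encoded so far and return it in order."""
        self.commits += 1
        submitted, self.encoded = self.encoded, []
        return submitted

def run_per_layer(layers):
    """Encode and commit each layer separately (one commit per layer)."""
    cb = FakeCommandBuffer()
    executed = []
    for ops in layers:
        for op in ops:
            cb.encode(op)
        executed += cb.commit()
    return executed, cb.commits

def run_single_commit(layers):
    """Encode the whole forward pass first, then commit once."""
    cb = FakeCommandBuffer()
    for ops in layers:
        for op in ops:
            cb.encode(op)
    executed = cb.commit()
    return executed, cb.commits

if __name__ == "__main__":
    model = [[f"layer{i}.attn", f"layer{i}.mlp"] for i in range(28)]
    _, n_per_layer = run_per_layer(model)
    _, n_single = run_single_commit(model)
    print(n_per_layer, n_single)
```

Both strategies submit the same operations in the same order; only the number of commits differs. Encoding everything up front requires the full operation list to be known before any result is read back, which is the role a tensor graph IR would play.
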
GPU path disabled. CPU NEON remains optimal for batch-1 inference.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent: baa58db
3 files changed, +92 -70 lines changed
File tree: src (backend/metal, engine)