Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -21,7 +21,7 @@ jobs:
run: |
make USE_MKL=1

- name: Run main test
- name: Run benchmark
run: |
make run
# python3 scripts/benchmark.py 100
6 changes: 3 additions & 3 deletions Makefile
@@ -20,8 +20,8 @@ TEST_DIR := tests
BUILD_DIR := build

# main test
TEST_SRC := $(TEST_DIR)/main.cpp
TARGET_MAIN := $(BUILD_DIR)/main
TEST_SRC := $(TEST_DIR)/run_benchmark.cpp
TARGET_MAIN := $(BUILD_DIR)/run_benchmark

# correctness suite
CORR_SRC := $(TEST_DIR)/test_correctness.cpp # fix filename
@@ -53,7 +53,7 @@ $(TARGET_CORR): $(CORR_SRC) $(HEADERS)

# build pybind11 module
mpgemm$(PYEXT): src/bindings.cpp $(HEADERS)
$(CXX) $(CXXFLAGS) $(PYBIND11_INC) -fPIC -shared src/bindings.cpp -o $@
$(CXX) $(CXXFLAGS) $(PYBIND11_INC) -fPIC -shared src/bindings.cpp -o $@ $(LDFLAGS) $(LDLIBS)

# run pytest
pytest: all
277 changes: 116 additions & 161 deletions README.md
@@ -1,189 +1,144 @@
# LUT-based Mixed-Precision GEMM
# LUT-based Mixed-Precision GEMM (mpGEMM)

## Basic Information
## Overview

On resource-constrained devices, such as embedded systems, running deep
learning models, especially LLMs, is highly computationally expensive. Low-bit
quantization is a popular solution to reduce memory consumption. However, when
weights are quantized below 8 bits, performing matrix multiplication between
the weight matrix and an FP16 activation matrix requires dequantization to a
common precision, as hardware lacks native support for mixed-precision matrix
multiplication (mpGEMM). Some recent research suggests using lookup tables
(LUTs) to replace dequantization, further reducing computational overhead.
mpGEMM is a high-performance mixed-precision General Matrix Multiplication
(GEMM) library optimized for embedded and resource-constrained AI deployments.
It leverages precomputed Lookup Tables (LUTs) to accelerate low-bit integer
(INT4) and FP16 mixed-precision matrix multiplication, significantly improving
inference speed and reducing computational overhead.

![LUT](./img/lut.png)
## Key Features

## Problem to Solve
* **Mixed-Precision Computation**: Supports INT4 quantized weights combined
with FP16 activation matrices.
* **Lookup Table (LUT) Optimization**: Replaces runtime dequantization with
LUT lookups, greatly reducing computational complexity.
* **Multiple Backend Support**:

- Support mixed-precision matrix computation (weight INT1~4) where activation
tensors are limited to FP16.
- Implement **precomputed lookup table (LUT)-based computation** to accelerate
low-bit matrix multiplications.
- Enable the LUT to reside in the fastest on-chip memory and parallel lookup.
- Preliminary estimates suggest that vendor libraries such as MKL or
Accelerate can be **10× to 1000×** faster than naive GEMM implementations.
This project aims to evaluate how closely LUT-based methods can approach these
performance levels in mixed-precision low-bit GEMM.
  * Naive GEMM (INT and FP32)
  * SIMD-optimized LUT GEMM (AVX2)
  * Intel MKL optimized GEMM
* **Post-Processing**: Provides bias addition and activation functions (ReLU,
Sigmoid, Tanh, Linear).
* **Benchmarking Tools**: Includes tools for latency measurement across
different matrix sizes and computational backends.
* **Quantization Utilities**: Functions for INT4 quantization/dequantization.
* **Python API Integration**: Seamlessly integrates a C++ backend with Python
for ease of use.

## Prospective Users
## Installation

This project will benefit:
### Prerequisites

1. **Embedded AI Developers**: Those optimizing AI models for deployment on
edge devices.
2. **Data Scientists**: Users who need optimized inference on diverse hardware
setups.
3. **Academia & Research Labs**: Those investigating numerical optimization
and quantization techniques.
* Python 3.10 or later
* Pybind11
* Intel MKL (optional, for accelerated FP32 computations)

## System Architecture
### Build and Setup

### Lookup Table Component
```bash
# Install dependencies
sudo apt-get install python3-pybind11 intel-mkl-full

- Precomputes and stores the LUT in the on-chip memory.
- The LUT precomputes and stores the accumulation results of all possible
combinations of quantized low-bit weights (INT1~INT4) and FP16 activation
vectors.
- Accelerates LUT lookups using SIMD instructions (Intel AVX2 or AVX-512).
- Gives users the flexibility to:
- Automatically generate LUT based on provided activation vectors and
selected weight bit-width (INT1~4).
- Load and reuse previously generated LUTs for reproducibility and
efficiency.
# Clone the repository
git clone <repo_url>
cd mpGEMM

### GEMM Component
# Build the project with MKL
make USE_MKL=1

- Implements the matrix multiplication algorithm in C++.
- Supports three computation modes:
- LUT-based GEMM (using precomputed LUT)
- Naive GEMM (pure FP16 computation)
- Vendor-optimized GEMM (Intel MKL)
- Supports mixed-precision matrix computation (weight INT1~4, activation FP16).
- Generation of test matrices (weights, FP16 activations, and biases).
# Or build the project without MKL
make
```

### Post-Processing Component
## Usage

- Performs bias addition to GEMM outputs.
- Provides common neural-network activation functions:
- ReLU
- Sigmoid
- Tanh
- Linear (identity function, i.e., no activation)
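The post-processing stage described above amounts to a bias add followed by an elementwise activation. A minimal NumPy sketch (the function names here are illustrative, not the library's API):

```python
import numpy as np

def add_bias(out, bias):
    # Broadcast a length-N bias across the rows of an (M, N) output.
    return out + bias

def apply_activation(x, name):
    # The four activations listed above; "linear" is the identity.
    if name == "relu":
        return np.maximum(x, 0.0)
    if name == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    if name == "tanh":
        return np.tanh(x)
    if name == "linear":
        return x
    raise ValueError(f"unknown activation: {name}")

out = np.array([[-1.0, 2.0], [3.0, -4.0]])
bias = np.array([0.5, -0.5])
y = apply_activation(add_bias(out, bias), "relu")
```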
### Python API Example

### Accuracy & Error Analysis Component
```python
import mpgemm
import numpy as np

- Provides methods to measure numerical accuracy of computations.
- Includes methods for error analysis (e.g., Mean Squared Error, Maximum
error).
- Compares results with reference FP16 matrix multiplication for accuracy
verification.
# === Step 1: Initialize engine ===
gemm = mpgemm.Engine(backend="lut") # options: "lut", "naive", "mkl"

### Quantization & Dequantization Utility Component
# === Step 2: Prepare inputs ===
M, K, N = 4, 4, 4 # Small size for demonstration
weights = np.random.randint(0, 16, (M, K), dtype=np.uint8)
activations = np.random.randn(K, N).astype(np.float16)
bias = np.random.randn(N).astype(np.float16)

- Provides functions to quantize and dequantize FP16 values to/from INT1~4.
- **For quantization:**
  `quantized = round(fp16 / scale) + zero_point`
- **For dequantization:**
  `fp16 = (quantized - zero_point) * scale`
# === Step 3: Generate LUT for int4 × fp16 ===
gemm.generate_lut(bit_width=4)

### Benchmarking Component
# === Step 4: Matrix multiplication ===
output = gemm.matmul(weights, activations, M=M, K=K, N=N)

- The benchmarking component will compare latency between LUT-based GEMM,
vendor libraries, and naive dequantization-based methods.
# === Step 5: Optional post-processing ===
output = gemm.add_bias(output, bias)
output = gemm.apply_activation(output, "relu")

## API Description
# === Step 6: Output ===
print("Output shape:", output.shape)
print("Output values:\n", output)
```
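The quantization formulas from the design notes (`quantized = round(fp16 / scale) + zero_point` and its inverse) can be sketched as follows, with clipping added so codes stay in the INT4 range; the `scale` and `zero_point` values are illustrative:

```python
import numpy as np

def quantize_int4(x, scale, zero_point):
    # quantized = round(x / scale) + zero_point, clipped to the INT4 code range [0, 15].
    q = np.rint(x / scale) + zero_point
    return np.clip(q, 0, 15).astype(np.uint8)

def dequantize_int4(q, scale, zero_point):
    # fp = (quantized - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.35, 0.0, 0.42], dtype=np.float32)
q = quantize_int4(x, scale=0.1, zero_point=8)
x_hat = dequantize_int4(q, scale=0.1, zero_point=8)
```

Note that `np.rint` rounds halves to even, so the round-trip error is bounded by half a quantization step plus the clipping loss at the range edges.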

```python
import mpGEMM

# === Initialization ===
# Initialize GEMM engine (options: "lut", "mkl", "naive")
gemm = mpGEMM(backend="lut", use_simd=True)

# === Data Generation ===
# Generate test data: quantized weights, FP16 activations, FP16 biases
fp16_weights = gemm.generate_matrix((128, 128), dtype="fp16")
weights = gemm.quantize(fp16_weights, bit_width=4, scale=0.1, zero_point=0)
activations = gemm.generate_matrix((128, 128), dtype="fp16")
bias_vector = gemm.generate_bias(128, dtype="fp16")

# Dequantize weights from INT4 to FP16
dq_weights = gemm.dequantize(weights, bit_width=4, scale=0.1, zero_point=0)

# === LUT Management ===
# Generate and manage Lookup Table (LUT)
lut = gemm.generate_lut(bit_width=4, activations=activations)
gemm.save_lut("lut_int4_fp16.bin")
gemm.load_lut("lut_int4_fp16.bin")

# Inspect LUT details
lut_info = gemm.inspect_lut()
print(lut_info)
# Example output:
# {
# "lut_size": "64KB",
# "bit_width": 4,
# "activation_shape": [128, 128],
# "num_entries": 16
# }

# === Computation ===
# Perform LUT-based mixed-precision GEMM
output = gemm.matmul(weights, activations, weight_bit_width=4)

# Optional: add bias and activation function
biased_output = gemm.add_bias(output, bias_vector)
activated_output = gemm.activation_function(biased_output, activation="relu")

# === Benchmarking ===
# Benchmark different computation methods
gemm.benchmark(methods=["lut", "mkl", "naive"], num_runs=10)

# === Analysis ===
# Compute reference FP16 result for accuracy verification
fp16_output = gemm.matmul(fp16_weights, activations, weight_bit_width=16)

# Measure numerical error compared to FP16 reference
error = gemm.measure_error(fp16_output, output, method="mse")
print(f"Mean Squared Error: {error.mse}")
```

Full example: `scripts/example.py`

### Benchmarking

```bash
# Run built-in benchmarks
make run

# Automated benchmarking script (averaging multiple runs)
python3 scripts/benchmark.py --runs 10
```
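A minimal latency-measurement loop in the spirit of the benchmarking tools above (the `backends` dict here is a stand-in for the real "naive"/"lut"/"mkl" callables, and this is a sketch rather than the contents of `scripts/benchmark.py`):

```python
import time
import numpy as np

def bench(fn, runs=10):
    # Average wall-clock latency over several runs, after one warmup call.
    fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs

M = K = N = 128
a = np.random.randn(M, K).astype(np.float32)
b = np.random.randn(K, N).astype(np.float32)

# Stand-ins for the library's backends.
backends = {"numpy": lambda: a @ b}
for name, fn in backends.items():
    print(f"{name}: {bench(fn) * 1e6:.1f} us")
```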

## Engineering Infrastructure

- **Automated Build System:** Uses CMake to set up the C++ build system and
setuptools to build Python packages.
- **CI**: GitHub Actions for automated testing and benchmarking. The CI
pipeline includes:
- **Correctness tests**: Ensures numerical accuracy of matrix multiplication
and quantization/dequantization methods.
- **Performance benchmarks**: Compares LUT-based GEMM with traditional
dequantization-based methods and vendor libraries (MKL).
- **Testing Framework**: Uses Google Test for C++ unit tests and pytest for
Python.
- **Version Control:** Uses Git for version management, with all development
tracked in the GitHub repository.
- **Documentation**: GitHub README.md.

## Schedule


| Week | Tasks & Test Plan |
|-------|-------------------|
| **Week 1 (3/17)** | Research background, setup project repo, configure testing framework. <br>**(Testing Plan)** Choose and configure testing framework (pytest for Python, GoogleTest for C++). |
| **Week 2 (3/24)** | Implement naive GEMM, setup CI for correctness testing. <br>**(Testing Plan)** Validate GEMM correctness across multiple matrix sizes and INT bit-widths (1~4). |
| **Week 3 (3/31)** | Implement Lookup Table (LUT) component.<br>**(Testing Plan)** Validate LUT correctness (ensure lookup values match mathematical expectations). |
| **Week 4 (4/7)** | Optimize LUT lookup using SIMD (AVX2/AVX-512). <br>**(Testing Plan)** Benchmark and correctness tests: Validate performance gains and ensure no errors introduced by SIMD optimization. |
| **Week 5 (4/14)** | Implement Quantization/Dequantization Component, optimize LUT memory management. <br>**(Testing Plan)** Measure LUT memory efficiency, verify memory access performance and correctness. Verify correctness of quantization/dequantization operations and numerical accuracy. |
| **Week 6 (4/21)** | Integrate vendor libraries (MKL), implement GEMM Component. <br>**(Testing Plan)** Verify correctness across all computation modes and ensure correct interaction with vendor libraries. |
| **Week 7 (4/28)** | Implement Post-processing Component, integrate API. <br>**(Testing Plan)** Unit tests: Confirm correctness of each activation function and bias addition. Verify integration between Python API and C++ backend. |
| **Week 8 (5/5)** | Develop benchmarking component. <br>**(Testing Plan)** Benchmark tests: Measure latency for varying matrix sizes, bit-widths, and backends. |
| **Week 9 (5/12)** | Implement Accuracy & Error Analysis Component. <br>**(Testing Plan)** Measure MSE and maximum error relative to FP16 reference implementation. |
| **Week 10 (5/19)** | Final optimizations and benchmarking. <br>**(Testing Plan)** Final validation tests: Ensure all previously implemented tests pass and performance results are consistent. |
| **Week 11 (5/26)** | Finalize documentation, prepare report, and presentation. <br>**(Testing Plan)** Documentation review: Verify clarity, completeness, and accuracy of final documentation and presentation materials. |
## Project Structure

## References
```
mpGEMM/
├── src/
│ ├── matrix.hpp
│ ├── matrix_ops.hpp
│ ├── layout_policies.hpp
│ ├── storage_policies.hpp
│ ├── lut_utils.hpp
│ ├── post_processing.hpp
│ ├── quant_utils.hpp
│ ├── gemm_engine.hpp
│ └── bindings.cpp
├── tests/
│ ├── test_correctness.cpp
│ ├── test_post_process.py
│ └── run_benchmark.cpp
├── scripts/
│ └── benchmark.py
├── doc/
│ └── proposal.md
├── .github/workflows/
│ └── ci.yml
├── Makefile
└── README.md
```

## Testing and Verification

* **Correctness tests**: Ensure the numerical accuracy of matrix operations.
* **Benchmark tests**: Compare latency across naive, LUT, and MKL backends.

- **DeepGEMM:**
[Paper](https://openaccess.thecvf.com/content/CVPR2023W/ECV/papers/Ganji_DeepGEMM.pdf)
- **T-MAC:** [Paper](https://arxiv.org/html/2407.00088v1)
Run tests with:

```bash
make test
make pytest
```

## References

* [DeepGEMM](https://openaccess.thecvf.com/content/CVPR2023W/ECV/papers/Ganji_DeepGEMM.pdf)
* [T-MAC](https://arxiv.org/html/2407.00088v1)