Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -21,7 +21,7 @@ jobs:
run: |
make USE_MKL=1

- name: Run main test
- name: Run benchmark
run: |
make run
# python3 scripts/benchmark.py 100
6 changes: 3 additions & 3 deletions Makefile
@@ -20,8 +20,8 @@ TEST_DIR := tests
BUILD_DIR := build

# main test
TEST_SRC := $(TEST_DIR)/main.cpp
TARGET_MAIN := $(BUILD_DIR)/main
TEST_SRC := $(TEST_DIR)/run_benchmark.cpp
TARGET_MAIN := $(BUILD_DIR)/run_benchmark

# correctness suite
CORR_SRC := $(TEST_DIR)/test_correctness.cpp # fix filename
@@ -53,7 +53,7 @@ $(TARGET_CORR): $(CORR_SRC) $(HEADERS)

# build pybind11 module
mpgemm$(PYEXT): src/bindings.cpp $(HEADERS)
$(CXX) $(CXXFLAGS) $(PYBIND11_INC) -fPIC -shared src/bindings.cpp -o $@
$(CXX) $(CXXFLAGS) $(PYBIND11_INC) -fPIC -shared src/bindings.cpp -o $@ $(LDFLAGS) $(LDLIBS)

# run pytest
pytest: all
277 changes: 116 additions & 161 deletions README.md
@@ -1,189 +1,144 @@
# LUT-based Mixed-Precision GEMM
# LUT-based Mixed-Precision GEMM (mpGEMM)

## Basic Information
## Overview

On resource-constrained devices, such as embedded systems, running deep
learning models, especially LLMs, is highly computationally expensive. Low-bit
quantization is a popular solution to reduce memory consumption. However, when
weights are quantized below 8 bits, performing matrix multiplication between
the weight matrix and an FP16 activation matrix requires dequantization to a
common precision, as hardware lacks native support for mixed-precision matrix
multiplication (mpGEMM). Some recent research suggests using lookup tables
(LUTs) to replace dequantization, further reducing computational overhead.
mpGEMM is a high-performance mixed-precision General Matrix Multiplication
(GEMM) library optimized for embedded and resource-constrained AI deployments.
It leverages precomputed Lookup Tables (LUTs) to accelerate low-bit integer
(INT4) and FP16 mixed-precision matrix multiplication, significantly improving
inference speed and reducing computational overhead.

![LUT](./img/lut.png)
## Key Features

## Problem to Solve
* **Mixed-Precision Computation**: Supports INT4 quantized weights combined
with FP16 activation matrices.
* **Lookup Table (LUT) Optimization**: Replaces runtime dequantization with
LUT lookups, greatly reducing computational complexity.
* **Multiple Backend Support**:

- Support mixed-precision matrix computation (weight INT1~4) where activation
tensors are limited to FP16.
- Implement **precomputed lookup table (LUT)-based computation** to accelerate
low-bit matrix multiplications.
- Enable the LUT to reside in the fastest on-chip memory and parallel lookup.
- Preliminary estimates suggest that vendor libraries such as MKL or
Accelerate can be **10× to 1000×** faster than naive GEMM implementations.
This project aims to evaluate how closely LUT-based methods can approach these
performance levels in mixed-precision low-bit GEMM.
  * Naive GEMM (INT and FP32)
  * SIMD-optimized LUT GEMM (AVX2)
  * Intel MKL optimized GEMM
* **Post-Processing**: Provides bias addition and activation functions (ReLU,
Sigmoid, Tanh, Linear).
* **Benchmarking Tools**: Includes tools for latency measurement across
different matrix sizes and computational backends.
* **Quantization Utilities**: Functions for INT4 quantization/dequantization.
* **Python API Integration**: Seamlessly integrates a C++ backend with Python
for ease of use.

## Prospective Users
## Installation

This project will benefit:
### Prerequisites

1. **Embedded AI Developers**: Those optimizing AI models for deployment on
edge devices.
2. **Data Scientists**: Users who need optimized inference on diverse hardware
setups.
3. **Academia & Research Labs**: Those investigating numerical optimization
and quantization techniques.
* Python 3.10 or later
* Pybind11
* Intel MKL (optional, for accelerated FP32 computations)

## System Architecture
### Build and Setup

### Lookup Table Component
```bash
# Install dependencies
sudo apt-get install python3-pybind11 intel-mkl-full

- Precomputes and stores the LUT in the on-chip memory.
- The LUT precomputes and stores the accumulation results of all possible
combinations of quantized low-bit weights (INT1~INT4) and FP16 activation
vectors.
- Accelerates LUT lookups using SIMD instructions (Intel AVX2 or AVX-512).
- Gives users the flexibility to:
- Automatically generate LUT based on provided activation vectors and
selected weight bit-width (INT1~4).
- Load and reuse previously generated LUTs for reproducibility and
efficiency.
# Clone the repository
git clone <repo_url>
cd mpGEMM

### GEMM Component
# Build the project with MKL
make USE_MKL=1

- Implements the matrix multiplication algorithm in C++.
- Supports three computation modes:
- LUT-based GEMM (using precomputed LUT)
- Naive GEMM (pure FP16 computation)
- Vendor-optimized GEMM (Intel MKL)
- Supports mixed-precision matrix computation (weight INT1~4, activation FP16).
- Generation of test matrices (weights, FP16 activations, and biases).
# Or build the project without MKL
make
```

### Post-Processing Component
## Usage

- Performs bias addition to GEMM outputs.
- Provides common neural-network activation functions:
- ReLU
- Sigmoid
- Tanh
- Linear (identity function, i.e., no activation)
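The post-processing stage described above amounts to a bias add followed by an elementwise activation. A minimal NumPy sketch (the function names here are illustrative, not the library's API):

```python
import numpy as np

def add_bias(out, bias):
    # Broadcast a length-N bias across the rows of an (M, N) output.
    return out + bias

def apply_activation(x, name):
    # The four activations listed above; "linear" is the identity.
    if name == "relu":
        return np.maximum(x, 0.0)
    if name == "sigmoid":
        return 1.0 / (1.0 + np.exp(-x))
    if name == "tanh":
        return np.tanh(x)
    if name == "linear":
        return x
    raise ValueError(f"unknown activation: {name}")

out = np.array([[-1.0, 2.0], [3.0, -4.0]])
bias = np.array([0.5, -0.5])
y = apply_activation(add_bias(out, bias), "relu")
```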
### Python API Example

### Accuracy & Error Analysis Component
```python
import mpgemm
import numpy as np

- Provides methods to measure numerical accuracy of computations.
- Includes methods for error analysis (e.g., Mean Squared Error, Maximum
error).
- Compares results with reference FP16 matrix multiplication for accuracy
verification.
# === Step 1: Initialize engine ===
gemm = mpgemm.Engine(backend="lut") # options: "lut", "naive", "mkl"

### Quantization & Dequantization Utility Component
# === Step 2: Prepare inputs ===
M, K, N = 4, 4, 4 # Small size for demonstration
weights = np.random.randint(0, 16, (M, K), dtype=np.uint8)
activations = np.random.randn(K, N).astype(np.float16)
bias = np.random.randn(N).astype(np.float16)

- Provides functions to quantize and dequantize FP16 values to/from INT1~4.
- **For quantization:**
  `quantized = round(fp16 / scale) + zero_point`
- **For dequantization:**
  `fp16 = (quantized - zero_point) * scale`
# === Step 3: Generate LUT for int4 × fp16 ===
gemm.generate_lut(bit_width=4)

### Benchmarking Component
# === Step 4: Matrix multiplication ===
output = gemm.matmul(weights, activations, M=M, K=K, N=N)

- The benchmarking component will compare latency between LUT-based GEMM,
vendor libraries, and naive dequantization-based methods.
# === Step 5: Optional post-processing ===
output = gemm.add_bias(output, bias)
output = gemm.apply_activation(output, "relu")

## API Description
# === Step 6: Output ===
print("Output shape:", output.shape)
print("Output values:\n", output)
```
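The quantization formulas from the design notes (`quantized = round(fp16 / scale) + zero_point` and its inverse) can be sketched as follows, with clipping added so codes stay in the INT4 range; the `scale` and `zero_point` values are illustrative:

```python
import numpy as np

def quantize_int4(x, scale, zero_point):
    # quantized = round(x / scale) + zero_point, clipped to the INT4 code range [0, 15].
    q = np.rint(x / scale) + zero_point
    return np.clip(q, 0, 15).astype(np.uint8)

def dequantize_int4(q, scale, zero_point):
    # fp = (quantized - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.35, 0.0, 0.42], dtype=np.float32)
q = quantize_int4(x, scale=0.1, zero_point=8)
x_hat = dequantize_int4(q, scale=0.1, zero_point=8)
```

Note that `np.rint` rounds halves to even, so the round-trip error is bounded by half a quantization step plus the clipping loss at the range edges.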

```python
import mpGEMM

# === Initialization ===
# Initialize GEMM engine (options: "lut", "mkl", "naive")
gemm = mpGEMM(backend="lut", use_simd=True)

# === Data Generation ===
# Generate test data: quantized weights, FP16 activations, FP16 biases
fp16_weights = gemm.generate_matrix((128, 128), dtype="fp16")
weights = gemm.quantize(fp16_weights, bit_width=4, scale=0.1, zero_point=0)
activations = gemm.generate_matrix((128, 128), dtype="fp16")
bias_vector = gemm.generate_bias(128, dtype="fp16")

# Dequantize weights from INT4 to FP16
dq_weights = gemm.dequantize(weights, bit_width=4, scale=0.1, zero_point=0)

# === LUT Management ===
# Generate and manage Lookup Table (LUT)
lut = gemm.generate_lut(bit_width=4, activations=activations)
gemm.save_lut("lut_int4_fp16.bin")
gemm.load_lut("lut_int4_fp16.bin")

# Inspect LUT details
lut_info = gemm.inspect_lut()
print(lut_info)
# Example output:
# {
# "lut_size": "64KB",
# "bit_width": 4,
# "activation_shape": [128, 128],
# "num_entries": 16
# }

# === Computation ===
# Perform LUT-based mixed-precision GEMM
output = gemm.matmul(weights, activations, weight_bit_width=4)

# Optional: add bias and activation function
biased_output = gemm.add_bias(output, bias_vector)
activated_output = gemm.activation_function(biased_output, activation="relu")

# === Benchmarking ===
# Benchmark different computation methods
gemm.benchmark(methods=["lut", "mkl", "naive"], num_runs=10)

# === Analysis ===
# Compute reference FP16 result for accuracy verification
fp16_output = gemm.matmul(fp16_weights, activations, weight_bit_width=16)

# Measure numerical error compared to FP16 reference
error = gemm.measure_error(fp16_output, output, method="mse")
print(f"Mean Squared Error: {error.mse}")
```

Full example: `scripts/example.py`

### Benchmarking

```bash
# Run built-in benchmarks
make run

# Automated benchmarking script (averaging multiple runs)
python3 scripts/benchmark.py --runs 10
```
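A minimal latency-measurement loop in the spirit of the benchmarking tools above (the `backends` dict here is a stand-in for the real "naive"/"lut"/"mkl" callables, and this is a sketch rather than the contents of `scripts/benchmark.py`):

```python
import time
import numpy as np

def bench(fn, runs=10):
    # Average wall-clock latency over several runs, after one warmup call.
    fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs

M = K = N = 128
a = np.random.randn(M, K).astype(np.float32)
b = np.random.randn(K, N).astype(np.float32)

# Stand-ins for the library's backends.
backends = {"numpy": lambda: a @ b}
for name, fn in backends.items():
    print(f"{name}: {bench(fn) * 1e6:.1f} us")
```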

## Engineering Infrastructure

- **Automated Build System:** Uses CMake to set up the C++ build system and
setuptools to build Python packages.
- **CI**: GitHub Actions for automated testing and benchmarking. The CI
pipeline includes:
- **Correctness tests**: Ensures numerical accuracy of matrix multiplication
and quantization/dequantization methods.
- **Performance benchmarks**: Compares LUT-based GEMM with traditional
dequantization-based methods and vendor libraries (MKL).
- **Testing Framework**: Uses Google Test for C++ unit tests and pytest for
Python.
- **Version Control:** Uses Git for version management, with all development
tracked in the GitHub repository.
- **Documentation**: GitHub README.md.

## Schedule


| Week | Tasks & Test Plan |
|-------|-------------------|
| **Week 1 (3/17)** | Research background, setup project repo, configure testing framework. <br>**(Testing Plan)** Choose and configure testing framework (pytest for Python, GoogleTest for C++). |
| **Week 2 (3/24)** | Implement naive GEMM, setup CI for correctness testing. <br>**(Testing Plan)** Validate GEMM correctness across multiple matrix sizes and INT bit-widths (1~4). |
| **Week 3 (3/31)** | Implement Lookup Table (LUT) component.<br>**(Testing Plan)** Validate LUT correctness (ensure lookup values match mathematical expectations). |
| **Week 4 (4/7)** | Optimize LUT lookup using SIMD (AVX2/AVX-512). <br>**(Testing Plan)** Benchmark and correctness tests: Validate performance gains and ensure no errors introduced by SIMD optimization. |
| **Week 5 (4/14)** | Implement Quantization/Dequantization Component, optimize LUT memory management. <br>**(Testing Plan)** Measure LUT memory efficiency, verify memory access performance and correctness. Verify correctness of quantization/dequantization operations and numerical accuracy. |
| **Week 6 (4/21)** | Integrate vendor libraries (MKL), implement GEMM Component. <br>**(Testing Plan)** Verify correctness across all computation modes and ensure correct interaction with vendor libraries. |
| **Week 7 (4/28)** | Implement Post-processing Component, integrate API. <br>**(Testing Plan)** Unit tests: Confirm correctness of each activation function and bias addition. Verify integration between Python API and C++ backend. |
| **Week 8 (5/5)** | Develop benchmarking component. <br>**(Testing Plan)** Benchmark tests: Measure latency for varying matrix sizes, bit-widths, and backends. |
| **Week 9 (5/12)** | Implement Accuracy & Error Analysis Component. <br>**(Testing Plan)** Measure MSE and maximum error relative to FP16 reference implementation. |
| **Week 10 (5/19)** | Final optimizations and benchmarking. <br>**(Testing Plan)** Final validation tests: Ensure all previously implemented tests pass and performance results are consistent. |
| **Week 11 (5/26)** | Finalize documentation, prepare report, and presentation. <br>**(Testing Plan)** Documentation review: Verify clarity, completeness, and accuracy of final documentation and presentation materials. |
## Project Structure

## References
```
mpGEMM/
├── src/
│ ├── matrix.hpp
│ ├── matrix_ops.hpp
│ ├── layout_policies.hpp
│ ├── storage_policies.hpp
│ ├── lut_utils.hpp
│ ├── post_processing.hpp
│ ├── quant_utils.hpp
│ ├── gemm_engine.hpp
│ └── bindings.cpp
├── tests/
│ ├── test_correctness.cpp
│ ├── test_post_process.py
│ └── run_benchmark.cpp
├── scripts/
│ └── benchmark.py
├── doc/
│ └── proposal.md
├── .github/workflows/
│ └── ci.yml
├── Makefile
└── README.md
```

## Testing and Verification

* **Correctness tests**: Ensure the numerical accuracy of matrix operations.
* **Benchmark tests**: Compare latency across naive, LUT, and MKL backends.

- **DeepGEMM:**
[Paper](https://openaccess.thecvf.com/content/CVPR2023W/ECV/papers/Ganji_DeepGEMM.pdf)
- **T-MAC:** [Paper](https://arxiv.org/html/2407.00088v1)
Run tests with:

```bash
make test
make pytest
```

## References

* [DeepGEMM](https://openaccess.thecvf.com/content/CVPR2023W/ECV/papers/Ganji_DeepGEMM.pdf)
* [T-MAC](https://arxiv.org/html/2407.00088v1)