Transforming AI into a Seamless Embedded Powerhouse
A compiler for high-level ML libraries, so your models can run at the edge.
A complete pipeline for converting PyTorch neural networks to optimized C, C++, and LLVM code for deployment on embedded systems.
```bash
# 1. create and train a model
python export_model.py --model-type hybrid --epochs 5

# 2. test all converters
./test_conversion.sh

# 3. validate the entire workflow
python test_complete_workflow.py
```

export_model.py creates PyTorch models compatible with the conversion pipeline.
Supported architectures:
- linear: Fully connected layers only (784→128→64→10)
- conv: Convolutional layers + linear classifier
- hybrid: Mixed conv + linear layers (recommended)
Usage:
```bash
# train a hybrid model for 5 epochs
python export_model.py --model-type hybrid --epochs 5 --batch-size 64

# create a model without training (for testing)
python export_model.py --model-type linear --no-train

# train on gpu if available
python export_model.py --model-type conv --device cuda --epochs 10
```

Output:
- `models/mnist_hybrid_model.pth` - Complete model
- `models/mnist_hybrid_model_state_dict.pth` - State dict only
- `test_conversion.sh` - Script to test all converters
converter.py generates pure C code optimized for ARM Cortex-M4 microcontrollers.
Features:
- Static memory allocation
- Ping-pong buffer optimization
- ARM Cortex-M4 compilation
- Minimal dependencies
Usage:
```bash
python converter.py models/mnist_hybrid_model.pth
```

Output:

```
output/
├── model.h  # header with declarations
├── model.c  # implementation
└── model.o  # compiled ARM object file
```
Generated API:
```c
int predict(const float *input, int input_h, int input_w, int input_ch);
```
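For a quick host-side sanity check, the same C sources can usually be built natively as well (the troubleshooting section below suggests x86 targets for testing). This is a minimal sketch; the file name `host_test.c` and the all-zero input are illustrative:

```c
// host_test.c - smoke test for the generated model on a desktop machine
// build: clang host_test.c model.c -o host_test
#include <stdio.h>
#include "model.h"

int main(void) {
    static float input[28 * 28] = {0};    // dummy all-zero 28x28 image
    int cls = predict(input, 28, 28, 1);  // height, width, channels per the signature above
    printf("predicted class: %d\n", cls); // expect a value in [0, 9]
    return 0;
}
```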
llvm.py generates LLVM intermediate representation with advanced optimizations.

Features:
- Cross-platform target support
- Advanced optimization passes
- Loop unrolling and vectorization
- Multiple architecture support
Usage:
```bash
python llvm.py models/mnist_hybrid_model.pth
```

Output:

```
output/
├── model.ll      # llvm ir code
└── model_llvm.o  # optimized object file
```
pytoc.py generates modern C++ code with STL containers and type safety.
Features:
- Template-based architecture
- STL containers for safety
- Easy debugging and modification
- JSON configuration export
Usage:
```bash
python pytoc.py models/mnist_hybrid_model.pth
```

Output:

```
output/
├── dynamic_model.cpp  # complete c++ implementation
├── dynamic_model      # compiled executable
└── model_config.json  # architecture metadata
```
`nn.Linear(in_features, out_features, bias=True)`
- Fully connected transformation
- Optional bias terms
- Efficient matrix multiplication

`nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)`
- 2D convolution with configurable parameters
- Stride and padding support
- Boundary condition handling

`nn.ReLU()`
- Element-wise max(0, x) operation
- Hardware-optimized implementation
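To make the mapping concrete, a linear layer followed by ReLU lowers to a plain dot-product loop. The sketch below illustrates the computation, not the converter's exact output; the names `w`, `b`, `x`, `y` are placeholders:

```c
// y[o] = max(0, b[o] + sum_i w[o][i] * x[i]) for each output neuron
void linear_relu(const float *w, const float *b, const float *x,
                 float *y, int in_n, int out_n) {
    for (int o = 0; o < out_n; o++) {
        float acc = b ? b[o] : 0.0f;        // optional bias term
        for (int i = 0; i < in_n; i++) {
            acc += w[o * in_n + i] * x[i];  // weights stored [out][in]
        }
        y[o] = acc > 0.0f ? acc : 0.0f;     // fused ReLU
    }
}
```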
```c
// weights stored as [output_neurons][input_features]
const float w0[128][784] = {...};

// ping-pong buffers for layer outputs
float buf1[MAX_BUFFER_SIZE], buf2[MAX_BUFFER_SIZE];

// input indexed as [channel][height][width]
int input_idx = ic * input_h * input_w + ih * input_w + iw;

// output organized as [channel][height][width]
int output_idx = oc * out_h * out_w + oh * out_w + ow;
```

- Ping-pong buffers: Alternate between `buf1` and `buf2` for each layer
- Static allocation: No dynamic memory allocation, for embedded safety
- Size optimization: Reuse buffers across layers
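The buffer swap itself is only a pointer exchange. Here is a minimal sketch of how layers chain through the two buffers; `run_layer`, `NUM_LAYERS`, and `INPUT_SIZE` are hypothetical stand-ins for the emitted code:

```c
// run the network by bouncing activations between the two static buffers
const float *infer(const float *input) {
    float *src = buf1, *dst = buf2;
    for (int i = 0; i < INPUT_SIZE; i++) src[i] = input[i];  // load input
    for (int l = 0; l < NUM_LAYERS; l++) {
        run_layer(l, src, dst);                  // hypothetical layer dispatch
        float *tmp = src; src = dst; dst = tmp;  // swap: output feeds next layer
    }
    return src;  // after the final swap, src holds the last layer's output
}
```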
```bash
clang --target=armv7em-none-eabi \
      -mcpu=cortex-m4 \
      -mthumb \
      -mfloat-abi=hard \
      -mfpu=fpv4-sp-d16 \
      -O3
```

Modify the compiler flags in each converter for different targets:
- x86: `--target=x86_64-linux-gnu`
- ARM64: `--target=aarch64-linux-gnu`
- RISC-V: `--target=riscv32-unknown-elf`
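If the generated C itself needs small target-specific tweaks, the compilers' standard predefined macros can gate them. The macro names below are real; the unroll factors are purely illustrative:

```c
// target-specific tuning gated on predefined architecture macros
#if defined(__ARM_ARCH) && (__ARM_ARCH == 7)
  #define UNROLL 4   /* modest unrolling for Cortex-M class cores */
#elif defined(__x86_64__) || defined(__aarch64__)
  #define UNROLL 16  /* larger factors where caches and registers allow */
#else
  #define UNROLL 1   /* conservative default, e.g. riscv32 */
#endif
```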
Model accuracy:
- Linear (784→128→64→10): ~107K parameters, ~95% accuracy
- Conv (1×3×3→16×3×3→32): ~23K parameters, ~98% accuracy
- Hybrid (optimized): ~15K parameters, ~97% accuracy

Memory footprint:
- Flash (weights): 15 KB - 400 KB depending on architecture
- RAM (buffers): 2 KB - 64 KB for intermediate computations
- Stack: <1 KB for local variables

Inference time:
- Linear model: ~2 ms per inference
- Conv model: ~8 ms per inference
- Hybrid model: ~5 ms per inference
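Numbers like these can be measured on target with the Cortex-M4 cycle counter. A minimal sketch, assuming a CMSIS device header is available and the core clock is known (80 MHz here is illustrative):

```c
// time one inference with the DWT cycle counter (Cortex-M4 + CMSIS)
#include "model.h"
#include <stdint.h>
// your CMSIS device header defines CoreDebug and DWT

static uint32_t cycles_for_inference(const float *input) {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace unit
    DWT->CYCCNT = 0;                                 // reset the counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting cycles
    predict(input, 28, 28, 1);
    return DWT->CYCCNT;                              // elapsed core cycles
}
// milliseconds = cycles / (clock in kHz); at 80 MHz that is cycles / 80000.0f
```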
```python
import torch
import torch.nn as nn

# create your own sequential model
custom_model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # padding=1 keeps the 14x14 output the classifier expects
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 14 * 14, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

# export for conversion (save the complete model, not just the state dict)
torch.save(custom_model, 'custom_model.pth')
```
#include "model.h"
#include <stdio.h>
float sensor_data[784]; // input from sensors
int main() {
// collect sensor data
read_sensors(sensor_data);
// run inference
int prediction = predict(sensor_data, 28, 28, 1);
// act on prediction
handle_prediction(prediction);
return 0;
}# validate complete workflow
python test_complete_workflow.py
# compare outputs between pytorch and c
python validate_conversion.py model.pth
# profile performance
python benchmark_inference.py model.pth1. Unsupported layer types
Error: Layer type 'BatchNorm2d' not supported
Solution: Remove or replace unsupported layers. Currently supported: Linear, Conv2d, ReLU.
2. Memory buffer overflow
Error: Layer output size exceeds buffer capacity
Solution: Increase MAX_BUFFER_SIZE in the converter or reduce the model size (see the sizing sketch after this list).
3. LLVM compilation failures
Error: llvmlite not installed
Solution: pip install llvmlite or use C/C++ converters instead.
4. ARM toolchain missing
Error: clang: command not found
Solution: Install ARM GCC toolchain or use x86 targets for testing.
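For issue 2, the required capacity is simply the largest per-layer activation in the network, measured in floats. A hedged sketch of sizing it by hand; the layer shapes are examples, and only the MAX_BUFFER_SIZE name comes from the generated code shown earlier:

```c
// MAX_BUFFER_SIZE must hold the largest per-layer activation (in floats)
#define CONV1_OUT (16 * 28 * 28)  /* 16 channels of 28x28: 12544 floats, ~49 KB */
#define FC1_OUT   128             /* first fully connected layer */
// take the maximum over every layer output in your model:
#define MAX_BUFFER_SIZE CONV1_OUT
```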
For embedded deployment:
- Keep total parameters under 100K
- Avoid large convolutional layers
- Use stride > 1 to reduce spatial dimensions quickly (see the output-size helper after this list)
- Prefer ReLU over other activations
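The standard convolution output-size formula shows why stride helps. A small helper, illustrative only and not part of the generated code:

```c
// out = floor((in + 2*padding - kernel) / stride) + 1
static int conv_out_dim(int in, int kernel, int stride, int padding) {
    return (in + 2 * padding - kernel) / stride + 1;
}
// e.g. conv_out_dim(28, 3, 2, 1) == 14: a single stride-2 layer halves 28x28,
// cutting the activation buffer (and downstream compute) by roughly 4x
```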
For best converter compatibility:
- Use `nn.Sequential` models
- Avoid custom layers or complex control flow
- Keep all operations differentiable
- Save complete models, not just state dicts
Python packages:
```bash
pip install torch torchvision numpy llvmlite
```

System tools:

```bash
# ubuntu/debian
sudo apt install clang gcc-arm-none-eabi

# macos
brew install llvm arm-none-eabi-gcc

# or use conda for the python packages
conda install pytorch torchvision llvmlite
```

Next steps:
- Create your model: `python export_model.py --model-type hybrid`
- Test conversion: `./test_conversion.sh`
- Integrate with firmware: Use the generated `.o` files in your embedded project
- Optimize further: Profile and adjust the model architecture for your constraints
- Deploy: Flash to target hardware and validate real-world performance
For more advanced use cases, see the individual converter documentation and consider extending the layer support for your specific requirements.