
CuVision-Engine

  ██████╗██╗   ██╗██╗   ██╗██╗███████╗██╗ ██████╗ ███╗   ██╗
 ██╔════╝██║   ██║██║   ██║██║██╔════╝██║██╔═══██╗████╗  ██║
 ██║     ██║   ██║██║   ██║██║███████╗██║██║   ██║██╔██╗ ██║
 ██║     ██║   ██║╚██╗ ██╔╝██║╚════██║██║██║   ██║██║╚██╗██║
 ╚██████╗╚██████╔╝ ╚████╔╝ ██║███████║██║╚██████╔╝██║ ╚████║
  ╚═════╝ ╚═════╝   ╚═══╝  ╚═╝╚══════╝╚═╝ ╚═════╝ ╚═╝  ╚═══╝
                    E  N  G  I  N  E

High-Performance Native Computer Vision for the Edge.

CuVision-Engine is a low-latency Computer Vision framework written in C++ and CUDA. By targeting cuDNN and cuBLAS primitives directly, it achieves peak hardware utilisation on NVIDIA GPUs — no PyTorch, no TensorFlow, no framework overhead.


Project State (March 2026)

  Module                  Status          Technique
  ─────────────────────────────────────────────────────────────────────────
  Classification          ██████████ 100%  2-Stage CNN + BN + Dropout
  Object Detection        ████████░░  85%  RetinaNet-FPN (ResNet backbone)
  Segmentation            ████████░░  85%  Attention U-Net + ASPP Bottleneck
  ─────────────────────────────────────────────────────────────────────────
  TensorRT Integration    ░░░░░░░░░░   0%  (planned)
  Instance Segmentation   ░░░░░░░░░░   0%  (planned)

🗺️ Engine Architecture — Top-Level Overview

 ┌────────────────────────────── CuVision-Engine ─────────────────────────────┐
 │                                                                             │
 │   ┌─────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐  │
 │   │   CLASSIFICATION    │  │  OBJECT DETECTION    │  │  SEGMENTATION    │  │
 │   │                     │  │                      │  │                  │  │
 │   │  Input [B,C,32,32]  │  │  Input [B,C,300,300] │  │  Input [B,C,256  │  │
 │   │         │           │  │         │            │  │        ×256]     │  │
 │   │      2×ConvBlock    │  │  ResNet Backbone     │  │  ResNet Encoder  │  │
 │   │    (BN+ReLU+Pool)   │  │  (4 stages, stride2) │  │  (4 stages)      │  │
 │   │         │           │  │         │            │  │       │          │  │
 │   │      Dropout        │  │    FPN Neck          │  │  Dilated ASPP    │  │
 │   │         │           │  │  (P2, P3, P4 @ 256)  │  │  Bottleneck      │  │
 │   │      FC Layer       │  │         │            │  │       │          │  │
 │   │         │           │  │  Shared Det. Head    │  │  Attn U-Net      │  │
 │   │      Softmax        │  │  (cls + reg towers)  │  │  Decoder ×3      │  │
 │   │         │           │  │         │            │  │       │          │  │
 │   │  Cross-Entropy      │  │  Focal + SmoothL1    │  │  CE + Dice Loss  │  │
 │   │  Loss               │  │  Loss                │  │                  │  │
 │   └─────────────────────┘  └──────────────────────┘  └──────────────────┘  │
 │                                                                             │
 │   ┌─────────────────────────── SHARED FOUNDATION ─────────────────────────┐ │
 │   │  Momentum-SGD kernel  │  He init  │  BN (train/infer)  │  Dropout    │ │
 │   │  cuRAND augmentation  │  cuBLAS   │  cudnnConvolution  │  GpuTimer   │ │
 │   └────────────────────────────────────────────────────────────────────────┘ │
 └─────────────────────────────────────────────────────────────────────────────┘

Repository Structure

CU_NN/
│
├── README.md                         ← You are here
├── LICENSE                           MIT
│
├── classification/                   ✅  COMPLETE
│   ├── main.cu                       Training loop  (CE loss, step-LR)
│   ├── compile.ps1                   NVCC build → dnn_classifier.exe
│   ├── README.md                     Architecture + math + paper refs
│   ├── dataset/
│   │   └── prepare_dataset.py        Oxford 17-Flowers → flowers10.bin
│   └── network/
│       ├── cudnn_helper.h            Error macros (CUDA / cuDNN / cuBLAS)
│       ├── utilities.cu              GpuTimer, printDeviceInformation
│       ├── augmentation.cu           H-flip + brightness jitter kernel
│       └── network.cu                ImageClassifier class (fwd + bwd + save)
│
├── object_detection/                 🔶  IN PROGRESS (network complete)
│   ├── main.cu                       Training loop  (IoU match, cosine-LR)
│   ├── compile.ps1                   NVCC build → od_detector.exe
│   ├── README.md                     Architecture + math + paper refs
│   ├── dataset/
│   │   └── prepare_dataset.py        Pascal VOC 2007 → od_voc2007.bin
│   └── network/
│       ├── cudnn_helper.h            Error macros
│       ├── utilities.cu              Smooth-L1, Focal loss, NMS, Anchors
│       ├── augmentation.cu           H-flip (bbox-aware), jitter, noise, cutout
│       └── network.cu                ObjectDetector (backbone→FPN→head)
│
└── segmentation/                     🔶  IN PROGRESS (network complete)
    ├── main.cu                       Training loop  (mIoU tracking, poly-LR)
    ├── compile.ps1                   NVCC build → seg_unet.exe
    ├── README.md                     Architecture + math + paper refs
    ├── dataset/
    │   └── prepare_dataset.py        Oxford-IIIT Pet → seg_pets.bin
    └── network/
        ├── cudnn_helper.h            Error macros
        ├── utilities.cu              Pixel CE, Dice loss, mIoU, weight I/O
        ├── augmentation.cu           Paired H/V flip, elastic deform, noise
        └── network.cu                SegmentationNet (encoder→ASPP→attn decoder)

Model Complexity Comparison

  ── Parameters (log scale) ─────────────────────────────────────────────────

  Classification    ██░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~60 K
  Object Detection  ████████████████████░░░░░░░░░  ~21 M  (ResNet + FPN + Head)
  Segmentation      ████████████████████████░░░░░  ~31 M  (ResNet + ASPP + UNet)

  ── Input Resolution ────────────────────────────────────────────────────────

  Classification    ██░░░░░░░░░░░░░░░░░░░░░░░░░░░  32×32
  Object Detection  ████████████████████░░░░░░░░░  300×300
  Segmentation      █████████████████████████░░░░  256×256

  ── Anchors / Output Pixels ─────────────────────────────────────────────────

  Classification    ──   (single label per image)
  Object Detection  ~60,000 anchors across P2+P3+P4
  Segmentation      65,536 pixels labelled per image

CUDA Kernel Summary

  Kernel                       Module(s)         Purpose
  ─────────────────────────────────────────────────────────────────────────────
  applyMomentumSGD             All               v ← μv + lr·g;  w ← w − v
  setConst / fillConst         All               GPU buffer zero / fill
  augmentBatchKernel           Classification    H-flip + brightness  (fused)
  horizontalFlipKernel         Object Det.       Paired pixel swap (NCHW)
  colorJitterKernel            Object Det.       Brightness/contrast/saturation
  gaussianNoiseKernel          OD + Seg          Additive noise via cuRAND
  cutoutKernel                 Object Det.       Square region zeroing
  segHFlipKernel               Segmentation      H-flip: image + int mask
  segVFlipKernel               Segmentation      V-flip: image + int mask
  segColorJitterKernel         Segmentation      Color jitter (image only)
  segGaussianNoiseKernel       Segmentation      Noise (image only)
  elasticSampleKernel          Segmentation      Bilinear warp + NN mask sample
  upsample2xKernel             Segmentation      Bilinear 2× upscale
  concatKernel                 Segmentation      Channel-wise concat
  attentionScaleKernel         Segmentation      ψ ⊙ skip features
  focalLossKernel              Object Det.       −α(1−pₜ)^γ log(pₜ)
  smoothL1LossKernel           Object Det.       Huber regression loss
  pixelCrossEntropyLossKernel  Segmentation      Pixel-wise CE + softmax grad
  diceLossFwdKernel            Segmentation      |P∩G| / (|P|+|G|) via atomics
  ─────────────────────────────────────────────────────────────────────────────
  Total custom kernels: 20

Technique Comparison Across Modules

                         Classification   Detection    Segmentation
  ──────────────────────────────────────────────────────────────────
  Backbone               2-stage CNN      ResNet-4     ResNet-4
  Skip connections       ✗                ✓ (residual) ✓ (residual)
  Multi-scale output     ✗                ✓ (FPN)      ✗
  Attention mechanism    ✗                ✗            ✓ (additive)
  Dilated convolution    ✗                ✗            ✓ (ASPP ×2)
  Batch Normalisation    ✓                ✓            ✓
  Dropout                ✓ (0.5 FC)       ✗            ✗
  He Initialisation      ✓                ✓            ✓
  Momentum SGD           ✓                ✓            ✓
  Weight Decay (L2)      ✓                ✓            ✓
  LR Schedule            Step ×0.8/ep     Cosine       Polynomial^0.9
  Augmentation           flip+bright      flip+jitter  elastic+flip
                                          +noise+cut   +noise+jitter
  Loss                   Softmax CE       Focal+SmoothL1  CE+Dice
  Metric                 Accuracy         mAP (IoU)    mIoU
  Dataset                Oxford Flowers   Pascal VOC07 Oxford-IIIT Pet
  Classes                10               20           3
  ──────────────────────────────────────────────────────────────────

Learning Rate Schedules — Visual

  ── Classification: Step Decay (×0.8 / epoch) ──────────────────────

  lr  0.010 ┤━━━━━━━━━━━━┐
      0.008 ┤            └━━━━━━━━━━┐
      0.006 ┤                       └━━━━━━━━┐
      0.004 ┤                                └━━━━━━━
            └──────────────────────────────────────► epoch
             1            2          3         4   5


  ── Object Detection: Cosine Annealing ─────────────────────────────

  lr  0.001 ┤━━━━━━┐
            ┤       ╲
            ┤        ╲__
            ┤           ╲___
      0.000 ┤                ╲______━━━
            └──────────────────────────────────► epoch
             1    2    3    4    5    6   ...  10


  ── Segmentation: Polynomial Decay (power=0.9) ─────────────────────

  lr  5e-4 ┤━━━━┐
      4e-4 ┤    ╲━━━┐
      3e-4 ┤        ╲━━━━┐
      2e-4 ┤             ╲━━━━━┐
      1e-4 ┤                   ╲━━━━━━━━━┐
           ┤                             ╲━━━━━
           └──────────────────────────────────► epoch
            1    3    5    7    9   11   13   15

Augmentation Pipeline — Visual

  CLASSIFICATION
  ──────────────────────────────────────────────────────────────
  Raw batch [B, 3, 32, 32]
       │
       └──[augmentBatchKernel]── H-flip (50%) + Brightness ±0.15
                                 (1 kernel, in-place, fused)


  OBJECT DETECTION
  ──────────────────────────────────────────────────────────────
  Raw batch [B, 3, 300, 300]                    Bounding Boxes
       │                                              │
       ├──[horizontalFlipKernel]──────────────── x' = 1 − x
       ├──[colorJitterKernel]  bright/contrast/saturation
       ├──[gaussianNoiseKernel] σ ∈ [0, 0.04]  (cuRAND)
       └──[cutoutKernel]  random 15–25% square zeroed


  SEGMENTATION
  ──────────────────────────────────────────────────────────────
  Image [B, 3, 256, 256]    +    Mask [B, 256, 256] (int)
       │                              │
       ├──[segHFlipKernel]────────── pixel swap  +  label swap
       ├──[segVFlipKernel]────────── pixel swap  +  label swap
       ├──[segColorJitterKernel]──── image only
       ├──[segGaussianNoiseKernel]── image only (cuRAND)
       └──[elasticSampleKernel]───── bilinear image  +  NN mask
                                     (displacement α=8, σ=4)

Loss Functions — Visual

  Focal Loss  (Object Detection — classification head)
  ───────────────────────────────────────────────────────
  weight
  1.0 ┤                                      *
      ┤                             *
  0.8 ┤                    *
      ┤           *
  0.4 ┤    *
      ┤  *
  0.1 ┤ *  ← easy (pₜ=0.9) down-weighted to ~0.01
      └──────────────────────────────────────────► pₜ
       0.0  0.1  0.3  0.5  0.7  0.9  1.0

  (1-pₜ)^γ with γ=2.0: easy negatives → near-zero gradient


  Smooth-L1  (Object Detection — regression head)
  ───────────────────────────────────────────────────────
  loss
  2.0 ┤              /  ← linear (|δ|−0.5)
      ┤            /
  1.0 ┤          /
      ┤       ╭──╮  ← quadratic (0.5δ²)
  0.0 ┤──────╯    ╰──────
      └──────────────────────────────────────────► δ
       -2    -1    0    1    2


  Dice Loss  (Segmentation)
  ───────────────────────────────────────────────────────
  Dice = 1 − (2|P∩G|) / (|P|+|G|)

  Dice coefficient  D = 2|P∩G| / (|P|+|G|)     Dice loss = 1 − D
  D = 0.0 → loss = 1.0   (no overlap, worst)
  D = 0.5 → loss = 0.5
  D = 0.8 → loss = 0.2
  D = 1.0 → loss = 0.0   (perfect overlap)

Development Roadmap

  ✅  DONE ──────────────────────────────────────────────────────────────
  [x]  2-Stage CNN Classifier (Conv→BN→ReLU→Pool ×2 + FC)
  [x]  Batch Normalization (training + inference running stats)
  [x]  Dropout Regularization (50%, inverted)
  [x]  He (Kaiming) Normal weight initialisation
  [x]  Custom Momentum-SGD CUDA kernel (per-param velocity)
  [x]  L2 Weight Decay (fused into SGD kernel)
  [x]  GPU Timer (cudaEvent benchmark utility)
  [x]  Oxford 17-Flowers dataset loader (binary format)
  [x]  CUDA augmentation: H-flip + brightness jitter
  [x]  ResNet-style backbone (4 stages, residual blocks)
  [x]  Feature Pyramid Network (FPN) neck
  [x]  RetinaNet detection head (shared cls + reg towers)
  [x]  Sigmoid Focal Loss (α=0.25, γ=2.0)
  [x]  Smooth-L1 (Huber) regression loss
  [x]  Anchor generation (multi-scale, multi-ratio)
  [x]  IoU-based anchor matching + delta encoding
  [x]  Non-Maximum Suppression (host-side)
  [x]  Object detection augmentation (flip/jitter/noise/cutout)
  [x]  Pascal VOC 2007 dataset loader
  [x]  Attention U-Net decoder with additive attention gates
  [x]  Dilated convolution bottleneck (ASPP-style, d=2,4)
  [x]  Pixel-wise cross-entropy loss (in-kernel numerically stable)
  [x]  Dice loss (atomicAdd accumulation)
  [x]  mIoU metric (per-class intersection-over-union)
  [x]  Bilinear 2× upsample CUDA kernel
  [x]  Channel-wise concatenation CUDA kernel
  [x]  Elastic deformation augmentation (paired image+mask)
  [x]  Oxford-IIIT Pet dataset loader (trimap → class mask)
  [x]  Polynomial + cosine LR schedules
  [x]  Checkpoint saving (every N epochs)
  [x]  Full documentation (README per module, ASCII diagrams, paper refs)

  🔶  REMAINING ─────────────────────────────────────────────────────────
  [ ]  FPN lateral add + spatial upsample (GPU elementwise kernel)
  [ ]  Detection head full forward per FPN level
  [ ]  Segmentation decoder full bilinear dispatch per stage
  [ ]  mAP evaluation (mean Average Precision, PASCAL VOC protocol)
  [ ]  Inference-only mode (load weights, no grad buffers)
  [ ]  ONNX weight export for TensorRT ingestion
  [ ]  TensorRT integration (INT8 / FP16 engine for edge deployment)
  [ ]  Instance segmentation (Mask R-CNN style)
  [ ]  Multi-GPU support (NCCL all-reduce)

Getting Started

Hardware Requirements

  Minimum:      NVIDIA Pascal GPU (sm_60)   |  8 GB VRAM   |  CUDA 11+
  Recommended:  NVIDIA Ampere GPU (sm_86)   |  16 GB VRAM  |  CUDA 12+
  Edge:         NVIDIA Jetson Xavier/Orin   |  8 GB unified

Build (Windows PowerShell)

# Classification
cd classification && .\compile.ps1 && .\dnn_classifier.exe

# Object Detection
cd object_detection && .\compile.ps1 && .\od_detector.exe

# Segmentation
cd segmentation && .\compile.ps1 && .\seg_unet.exe

Dataset Preparation

# Each module includes a self-contained download script
python classification/dataset/prepare_dataset.py   # Oxford 17-Flowers  ~60 MB
python object_detection/dataset/prepare_dataset.py # Pascal VOC 2007    ~439 MB
python segmentation/dataset/prepare_dataset.py     # Oxford-IIIT Pet   ~800 MB

Reference Papers — Full Engine

  Module           Paper                         Authors                Venue         Link
  ───────────────────────────────────────────────────────────────────────────────────────────────
  All              Batch Normalization           Ioffe & Szegedy        ICML 2015     arXiv:1502.03167
  All              He (Kaiming) Initialization   He et al.              ICCV 2015     arXiv:1502.01852
  All              Momentum SGD                  Sutskever et al.       ICML 2013     ICML
  All              cuDNN: Efficient Primitives   Chetlur et al.         2014          arXiv:1410.0759
  Classification   Dropout                       Srivastava et al.      JMLR 2014     JMLR
  Classification   AlexNet                       Krizhevsky et al.      NeurIPS 2012  NIPS
  Classification   VGGNet                        Simonyan & Zisserman   ICLR 2015     arXiv:1409.1556
  Detection + Seg  Deep Residual Learning        He et al.              CVPR 2016     arXiv:1512.03385
  Detection        Feature Pyramid Networks      Lin et al.             CVPR 2017     arXiv:1612.03144
  Detection        Focal Loss / RetinaNet        Lin et al.             ICCV 2017     arXiv:1708.02002
  Detection        SSD                           Liu et al.             ECCV 2016     arXiv:1512.02325
  Detection        Fast R-CNN (Smooth-L1)        Girshick               ICCV 2015     arXiv:1504.08083
  Detection        Cutout                        DeVries & Taylor       2017          arXiv:1708.04552
  Segmentation     U-Net                         Ronneberger et al.     MICCAI 2015   arXiv:1505.04597
  Segmentation     Attention U-Net               Oktay et al.           MIDL 2018     arXiv:1804.03999
  Segmentation     DeepLab v3+                   Chen et al.            ECCV 2018     arXiv:1802.02611
  Segmentation     Dilated Convolutions          Yu & Koltun            ICLR 2016     arXiv:1511.07122
  Segmentation     V-Net / Dice Loss             Milletari et al.       3DV 2016      arXiv:1606.04797
  Segmentation     Elastic Deformation           Simard et al.          ICDAR 2003    IEEE

License

Distributed under the MIT License. See LICENSE for more information.
