Skip to content

Latest commit

 

History

History
423 lines (351 loc) · 23.8 KB

File metadata and controls

423 lines (351 loc) · 23.8 KB

CuVision-Engine

  ██████╗██╗   ██╗██╗   ██╗██╗███████╗██╗ ██████╗ ███╗   ██╗
 ██╔════╝██║   ██║██║   ██║██║██╔════╝██║██╔═══██╗████╗  ██║
 ██║     ██║   ██║██║   ██║██║███████╗██║██║   ██║██╔██╗ ██║
 ██║     ██║   ██║╚██╗ ██╔╝██║╚════██║██║██║   ██║██║╚██╗██║
 ╚██████╗╚██████╔╝ ╚████╔╝ ██║███████║██║╚██████╔╝██║ ╚████║
  ╚═════╝ ╚═════╝   ╚═══╝  ╚═╝╚══════╝╚═╝ ╚═════╝ ╚═╝  ╚═══╝
                    E  N  G  I  N  E

High-Performance Native Computer Vision for the Edge.

CuVision-Engine is a low-latency Computer Vision framework written in C++ and CUDA. By targeting cuDNN and cuBLAS primitives directly, it achieves peak hardware utilisation on NVIDIA GPUs — no PyTorch, no TensorFlow, no framework overhead.


Project State (March 2026)

  Module                  Status          Technique
  ─────────────────────────────────────────────────────────────────────────
  Classification          ██████████ 100%  2-Stage CNN + BN + Dropout
  Object Detection        ████████░░  85%  RetinaNet-FPN (ResNet backbone)
  Segmentation            ████████░░  85%  Attention U-Net + ASPP Bottleneck
  ─────────────────────────────────────────────────────────────────────────
  TensorRT Integration    ░░░░░░░░░░   0%  (planned)
  Instance Segmentation   ░░░░░░░░░░   0%  (planned)

🗺️ Engine Architecture — Top-Level Overview

 ┌────────────────────────────── CuVision-Engine ─────────────────────────────┐
 │                                                                             │
 │   ┌─────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐  │
 │   │   CLASSIFICATION    │  │  OBJECT DETECTION    │  │  SEGMENTATION    │  │
 │   │                     │  │                      │  │                  │  │
 │   │  Input [B,C,32,32]  │  │  Input [B,C,300,300] │  │  Input [B,C,256  │  │
 │   │         │           │  │         │            │  │        ×256]     │  │
 │   │      2×ConvBlock    │  │  ResNet Backbone     │  │  ResNet Encoder  │  │
 │   │    (BN+ReLU+Pool)   │  │  (4 stages, stride2) │  │  (4 stages)      │  │
 │   │         │           │  │         │            │  │       │          │  │
 │   │      Dropout        │  │    FPN Neck          │  │  Dilated ASPP    │  │
 │   │         │           │  │  (P2, P3, P4 @ 256)  │  │  Bottleneck      │  │
 │   │      FC Layer       │  │         │            │  │       │          │  │
 │   │         │           │  │  Shared Det. Head    │  │  Attn U-Net      │  │
 │   │      Softmax        │  │  (cls + reg towers)  │  │  Decoder ×3      │  │
 │   │         │           │  │         │            │  │       │          │  │
 │   │  Cross-Entropy     │  │  Focal + SmoothL1    │  │  CE + Dice Loss  │  │
 │   │  Loss              │  │  Loss                │  │                  │  │
 │   └─────────────────────┘  └──────────────────────┘  └──────────────────┘  │
 │                                                                             │
 │   ┌─────────────────────────── SHARED FOUNDATION ─────────────────────────┐ │
 │   │  Momentum-SGD kernel  │  He init  │  BN (train/infer)  │  Dropout    │ │
 │   │  cuRAND augmentation  │  cuBLAS   │  cudnnConvolution  │  GpuTimer   │ │
 │   └────────────────────────────────────────────────────────────────────────┘ │
 └─────────────────────────────────────────────────────────────────────────────┘

Repository Structure

CU_NN/
│
├── README.md                         ← You are here
├── LICENSE                           MIT
│
├── classification/                   ✅  COMPLETE
│   ├── main.cu                       Training loop  (CE loss, step-LR)
│   ├── compile.ps1                   NVCC build → dnn_classifier.exe
│   ├── README.md                     Architecture + math + paper refs
│   ├── dataset/
│   │   └── prepare_dataset.py        Oxford 17-Flowers → flowers10.bin
│   └── network/
│       ├── cudnn_helper.h            Error macros (CUDA / cuDNN / cuBLAS)
│       ├── utilities.cu              GpuTimer, printDeviceInformation
│       ├── augmentation.cu           H-flip + brightness jitter kernel
│       └── network.cu                ImageClassifier class (fwd + bwd + save)
│
├── object_detection/                 🔶  IN PROGRESS (network complete)
│   ├── main.cu                       Training loop  (IoU match, cosine-LR)
│   ├── compile.ps1                   NVCC build → od_detector.exe
│   ├── README.md                     Architecture + math + paper refs
│   ├── dataset/
│   │   └── prepare_dataset.py        Pascal VOC 2007 → od_voc2007.bin
│   └── network/
│       ├── cudnn_helper.h            Error macros
│       ├── utilities.cu              Smooth-L1, Focal loss, NMS, Anchors
│       ├── augmentation.cu           H-flip (bbox-aware), jitter, noise, cutout
│       └── network.cu                ObjectDetector (backbone→FPN→head)
│
└── segmentation/                     🔶  IN PROGRESS (network complete)
    ├── main.cu                       Training loop  (mIoU tracking, poly-LR)
    ├── compile.ps1                   NVCC build → seg_unet.exe
    ├── README.md                     Architecture + math + paper refs
    ├── dataset/
    │   └── prepare_dataset.py        Oxford-IIIT Pet → seg_pets.bin
    └── network/
        ├── cudnn_helper.h            Error macros
        ├── utilities.cu              Pixel CE, Dice loss, mIoU, weight I/O
        ├── augmentation.cu           Paired H/V flip, elastic deform, noise
        └── network.cu                SegmentationNet (encoder→ASPP→attn decoder)

Model Complexity Comparison

  ── Parameters (log scale) ─────────────────────────────────────────────────

  Classification    ██░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~60 K
  Object Detection  ████████████████████░░░░░░░░░  ~21 M  (ResNet + FPN + Head)
  Segmentation      ████████████████████████░░░░░  ~31 M  (ResNet + ASPP + UNet)

  ── Input Resolution ────────────────────────────────────────────────────────

  Classification    ██░░░░░░░░░░░░░░░░░░░░░░░░░░░  32×32
  Object Detection  ████████████████████░░░░░░░░░  300×300
  Segmentation      █████████████████████████░░░░  256×256

  ── Anchors / Output Pixels ─────────────────────────────────────────────────

  Classification    ──   (single label per image)
  Object Detection  ~60,000 anchors across P2+P3+P4
  Segmentation      65,536 pixels labelled per image

CUDA Kernel Summary

  Kernel                       Module(s)         Purpose
  ─────────────────────────────────────────────────────────────────────────────
  applyMomentumSGD             All               v ← μv + lr·g;  w ← w − v
  setConst / fillConst         All               GPU buffer zero / fill
  augmentBatchKernel           Classification    H-flip + brightness  (fused)
  horizontalFlipKernel         Object Det.       Paired pixel swap (NCHW)
  colorJitterKernel            Object Det.       Brightness/contrast/saturation
  gaussianNoiseKernel          OD + Seg          Additive noise via cuRAND
  cutoutKernel                 Object Det.       Square region zeroing
  segHFlipKernel               Segmentation      H-flip: image + int mask
  segVFlipKernel               Segmentation      V-flip: image + int mask
  segColorJitterKernel         Segmentation      Color jitter (image only)
  segGaussianNoiseKernel       Segmentation      Noise (image only)
  elasticSampleKernel          Segmentation      Bilinear warp + NN mask sample
  upsample2xKernel             Segmentation      Bilinear 2× upscale
  concatKernel                 Segmentation      Channel-wise concat
  attentionScaleKernel         Segmentation      ψ ⊙ skip features
  focalLossKernel              Object Det.       −α(1−pₜ)^γ log(pₜ)
  smoothL1LossKernel           Object Det.       Huber regression loss
  pixelCrossEntropyLossKernel  Segmentation      Pixel-wise CE + softmax grad
  diceLossFwdKernel            Segmentation      |P∩G| / (|P|+|G|) via atomics
  ─────────────────────────────────────────────────────────────────────────────
  Total custom kernels: 20

Technique Comparison Across Modules

                         Classification   Detection    Segmentation
  ──────────────────────────────────────────────────────────────────
  Backbone               2-stage CNN      ResNet-4     ResNet-4
  Skip connections       ✗                ✓ (residual) ✓ (residual)
  Multi-scale output     ✗                ✓ (FPN)      ✗
  Attention mechanism    ✗                ✗            ✓ (additive)
  Dilated convolution    ✗                ✗            ✓ (ASPP ×2)
  Batch Normalisation    ✓                ✓            ✓
  Dropout                ✓ (0.5 FC)       ✗            ✗
  He Initialisation      ✓                ✓            ✓
  Momentum SGD           ✓                ✓            ✓
  Weight Decay (L2)      ✓                ✓            ✓
  LR Schedule            Step ×0.8/ep     Cosine       Polynomial^0.9
  Augmentation           flip+bright      flip+jitter  elastic+flip
                                          +noise+cut   +noise+jitter
  Loss                   Softmax CE       Focal+SmoothL1  CE+Dice
  Metric                 Accuracy         mAP (IoU)    mIoU
  Dataset                Oxford Flowers   Pascal VOC07 Oxford-IIIT Pet
  Classes                10               20           3
  ──────────────────────────────────────────────────────────────────

Learning Rate Schedules — Visual

  ── Classification: Step Decay (×0.8 / epoch) ──────────────────────

  lr  0.010 ┤━━━━━━━━━━━━┐
      0.008 ┤            └━━━━━━━━━━┐
      0.006 ┤                       └━━━━━━━━┐
      0.004 ┤                                └━━━━━━━
            └──────────────────────────────────────► epoch
             1            2          3         4   5


  ── Object Detection: Cosine Annealing ─────────────────────────────

  lr  0.001 ┤━━━━━━┐
      0.000 ┤      ╲      ╲
            ┤       ╲      ╲__
            ┤        ╲         ╲___
            ┤         ╲             ╲______━━━
            └──────────────────────────────────► epoch
             1    2    3    4    5    6   ...  10


  ── Segmentation: Polynomial Decay (power=0.9) ─────────────────────

  lr  5e-4 ┤━━━━┐
      4e-4 ┤    ╲━━━┐
      3e-4 ┤        ╲━━━━┐
      2e-4 ┤             ╲━━━━━┐
      1e-4 ┤                   ╲━━━━━━━━━┐
           ┤                             ╲━━━━━
           └──────────────────────────────────► epoch
            1    3    5    7    9   11   13   15

Augmentation Pipeline — Visual

  CLASSIFICATION
  ──────────────────────────────────────────────────────────────
  Raw batch [B, 3, 32, 32]
       │
       └──[augmentBatchKernel]── H-flip (50%) + Brightness ±0.15
                                 (1 kernel, in-place, fused)


  OBJECT DETECTION
  ──────────────────────────────────────────────────────────────
  Raw batch [B, 3, 300, 300]                    Bounding Boxes
       │                                              │
       ├──[horizontalFlipKernel]──────────────── x' = 1 − x
       ├──[colorJitterKernel]  bright/contrast/saturation
       ├──[gaussianNoiseKernel] σ ∈ [0, 0.04]  (cuRAND)
       └──[cutoutKernel]  random 15–25% square zeroed


  SEGMENTATION
  ──────────────────────────────────────────────────────────────
  Image [B, 3, 256, 256]    +    Mask [B, 256, 256] (int)
       │                              │
       ├──[segHFlipKernel]────────── pixel swap  +  label swap
       ├──[segVFlipKernel]────────── pixel swap  +  label swap
       ├──[segColorJitterKernel]──── image only
       ├──[segGaussianNoiseKernel]── image only (cuRAND)
       └──[elasticSampleKernel]───── bilinear image  +  NN mask
                                     (displacement α=8, σ=4)

Loss Functions — Visual

  Focal Loss  (Object Detection — classification head)
  ───────────────────────────────────────────────────────
  weight
  1.0 ┤                                      *
      ┤                             *
  0.8 ┤                    *
      ┤           *
  0.4 ┤    *
      ┤  *
  0.1 ┤ *  ← easy (pₜ=0.9) down-weighted to ~0.01
      └──────────────────────────────────────────► pₜ
       0.0  0.1  0.3  0.5  0.7  0.9  1.0

  (1-pₜ)^γ with γ=2.0: easy negatives → near-zero gradient


  Smooth-L1  (Object Detection — regression head)
  ───────────────────────────────────────────────────────
  loss
  2.0 ┤              /  ← linear (|δ|−0.5)
      ┤            /
  1.0 ┤          /
      ┤       ╭──╮  ← quadratic (0.5δ²)
  0.0 ┤──────╯    ╰──────
      └──────────────────────────────────────────► δ
       -2    -1    0    1    2


  Dice Loss  (Segmentation)
  ───────────────────────────────────────────────────────
  Dice = 1 − (2|P∩G|) / (|P|+|G|)

  Overlap  |P∩G|/|P∪G|   vs   Dice score
  0.0 → Dice = 1.0   (worst)
  0.5 → Dice = 0.67
  0.8 → Dice = 0.33
  1.0 → Dice = 0.0   (perfect overlap)

Development Roadmap

  ✅  DONE ──────────────────────────────────────────────────────────────
  [x]  2-Stage CNN Classifier (Conv→BN→ReLU→Pool ×2 + FC)
  [x]  Batch Normalization (training + inference running stats)
  [x]  Dropout Regularization (50%, inverted)
  [x]  He (Kaiming) Normal weight initialisation
  [x]  Custom Momentum-SGD CUDA kernel (per-param velocity)
  [x]  L2 Weight Decay (fused into SGD kernel)
  [x]  GPU Timer (cudaEvent benchmark utility)
  [x]  Oxford 17-Flowers dataset loader (binary format)
  [x]  CUDA augmentation: H-flip + brightness jitter
  [x]  ResNet-style backbone (4 stages, residual blocks)
  [x]  Feature Pyramid Network (FPN) neck
  [x]  RetinaNet detection head (shared cls + reg towers)
  [x]  Sigmoid Focal Loss (α=0.25, γ=2.0)
  [x]  Smooth-L1 (Huber) regression loss
  [x]  Anchor generation (multi-scale, multi-ratio)
  [x]  IoU-based anchor matching + delta encoding
  [x]  Non-Maximum Suppression (host-side)
  [x]  Object detection augmentation (flip/jitter/noise/cutout)
  [x]  Pascal VOC 2007 dataset loader
  [x]  Attention U-Net decoder with additive attention gates
  [x]  Dilated convolution bottleneck (ASPP-style, d=2,4)
  [x]  Pixel-wise cross-entropy loss (in-kernel numerically stable)
  [x]  Dice loss (atomicAdd accumulation)
  [x]  mIoU metric (per-class intersection-over-union)
  [x]  Bilinear 2× upsample CUDA kernel
  [x]  Channel-wise concatenation CUDA kernel
  [x]  Elastic deformation augmentation (paired image+mask)
  [x]  Oxford-IIIT Pet dataset loader (trimap → class mask)
  [x]  Polynomial + cosine LR schedules
  [x]  Checkpoint saving (every N epochs)
  [x]  Full documentation (README per module, ASCII diagrams, paper refs)

  🔶  REMAINING ─────────────────────────────────────────────────────────
  [ ]  FPN lateral add + spatial upsample (GPU elementwise kernel)
  [ ]  Detection head full forward per FPN level
  [ ]  Segmentation decoder full bilinear dispatch per stage
  [ ]  mAP evaluation (mean Average Precision, PASCAL VOC protocol)
  [ ]  Inference-only mode (load weights, no grad buffers)
  [ ]  ONNX weight export for TensorRT ingestion
  [ ]  TensorRT integration (INT8 / FP16 engine for edge deployment)
  [ ]  Instance segmentation (Mask R-CNN style)
  [ ]  Multi-GPU support (NCCL all-reduce)

Getting Started

Hardware Requirements

  Minimum:   NVIDIA Pascal GPU (sm_60)  |  8 GB VRAM  |  CUDA 11+
  Recommend: NVIDIA Ampere  (sm_86)     |  16 GB VRAM |  CUDA 12+
  Edge:      NVIDIA Jetson Xavier/Orin  |  8 GB unified

Build (Windows PowerShell)

# Classification
cd classification && .\compile.ps1 && .\dnn_classifier.exe

# Object Detection
cd object_detection && .\compile.ps1 && .\od_detector.exe

# Segmentation
cd segmentation && .\compile.ps1 && .\seg_unet.exe

Dataset Preparation

# Each module includes a self-contained download script
python classification/dataset/prepare_dataset.py   # Oxford 17-Flowers  ~60 MB
python object_detection/dataset/prepare_dataset.py # Pascal VOC 2007    ~439 MB
python segmentation/dataset/prepare_dataset.py     # Oxford-IIIT Pet   ~800 MB

Reference Papers — Full Engine

Module Paper Authors Venue Link
All Batch Normalization Ioffe & Szegedy ICML 2015 arXiv:1502.03167
All He (Kaiming) Initialization He et al. ICCV 2015 arXiv:1502.01852
All Momentum SGD Sutskever et al. ICML 2013 ICML
All cuDNN: Efficient Primitives Chetlur et al. 2014 arXiv:1410.0759
Classification Dropout Srivastava et al. JMLR 2014 JMLR
Classification AlexNet Krizhevsky et al. NeurIPS 2012 NIPS
Classification VGGNet Simonyan & Zisserman ICLR 2015 arXiv:1409.1556
Detection + Seg Deep Residual Learning He et al. CVPR 2016 arXiv:1512.03385
Detection Feature Pyramid Networks Lin et al. CVPR 2017 arXiv:1612.03144
Detection Focal Loss / RetinaNet Lin et al. ICCV 2017 arXiv:1708.02002
Detection SSD Liu et al. ECCV 2016 arXiv:1512.02325
Detection Fast R-CNN (Smooth-L1) Girshick ICCV 2015 arXiv:1504.08083
Detection Cutout DeVries & Taylor 2017 arXiv:1708.04552
Segmentation U-Net Ronneberger et al. MICCAI 2015 arXiv:1505.04597
Segmentation Attention U-Net Oktay et al. MIDL 2018 arXiv:1804.03999
Segmentation DeepLab v3+ Chen et al. ECCV 2018 arXiv:1802.02611
Segmentation Dilated Convolutions Yu & Koltun ICLR 2016 arXiv:1511.07122
Segmentation V-Net / Dice Loss Milletari et al. 3DV 2016 arXiv:1606.04797
Segmentation Elastic Deformation Simard et al. ICDAR 2003 IEEE

License

Distributed under the MIT License. See LICENSE for more information.