Course: Sztuczna Inteligencja w Grafice Komputerowej (Artificial Intelligence in Computer Graphics)
Framework: PyTorch | Language: Python
- Project 1 - Super-Resolution & Denoising
- Project 2 - HDR Exposure Synthesis
- Project 3 - Neural Rendering (Phong)
- Project 4 - 3D Point Cloud Transformation
- Project 5 - Stick Animation (Diffusion)
Full report:
project1/SUMMARY.md
U-Net with residual blocks and PixelShuffle upsampling. Reconstructs HR images (256×256) from LR inputs at ×4 (64×64) and ×8 (32×32) scale.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Bicubic ×4 | 29.47 | 0.7554 | 0.3369 |
| SRUNet ×4 | 30.52 | 0.7906 | 0.3153 |
| Bicubic ×8 | 26.52 | 0.6301 | 0.4886 |
| SRUNet ×8 | 27.13 | 0.6565 | 0.4686 |
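
The PixelShuffle upsampling mentioned above rearranges channels into spatial resolution: a tensor of shape (C·r², H, W) becomes (C, H·r, W·r). The dependency-free sketch below illustrates that rearrangement on nested lists, using the same channel ordering as `torch.nn.PixelShuffle`. How SRUNet arranges its own layers is not stated here, so the function name and the toy input are illustrative assumptions.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):      # row offset inside each r x r output block
            for j in range(r):  # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become a single 2x2 channel (upscale factor r=2).
features = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
upsampled = pixel_shuffle(features, r=2)
```

In a PyTorch model the same operation would normally follow a convolution that expands the channel count by r².
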
Residual attention network with dilated convolutions and channel attention (EAM). Removes Gaussian noise at σ ∈ {0.01, 0.03}.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Noisy input | 33.65 | 0.8471 | 0.1509 |
| Bilateral filter | 34.07 | 0.9058 | 0.1800 |
| RIDNet | 40.80 | 0.9731 | 0.0938 |
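
For readers unfamiliar with the PSNR column in the tables above, the sketch below shows the standard definition, 10·log10(MAX² / MSE). It assumes pixel values in [0, 1] stored as flat sequences; the evaluation code actually used for these projects is not shown in this README, so details such as colour space or cropping may differ.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """PSNR in dB between two equally sized flat pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [0.0, 0.5, 1.0, 0.25]
noisy = [0.1, 0.4, 0.9, 0.35]   # every pixel is off by 0.1, so MSE is about 0.01
value = psnr(clean, noisy)      # about 20 dB
```
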
Full report:
project2/SUMMARY.md
Neural network-based HDR imaging pipeline: a ResUNet generates two additional exposures (EV −2.7 and EV +2.7) from a single LDR input, which are then merged into an HDR image using the Debevec algorithm (OpenCV). Dataset: HDR-Eye (EPFL), with 7 test scenes (C40–C46), ~28 training scenes, and 1 400 training / 350 test patches (256×256 px).
Encoder-decoder with residual blocks at every scale. Features: [32, 64, 128, 256], ~11.9M parameters. Loss: L = 0.8 · L1 + 0.2 · (1 − SSIM). Trained for 10 epochs (Adam, lr=1e-4) on Kaggle T4.
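
The loss above mixes a pixel-wise term with a structural term. A minimal sketch of the weighting follows, assuming the SSIM value has already been computed by some SSIM implementation (not shown) and that images are flat sequences in [0, 1]. The function names are illustrative, not taken from the project code.

```python
def l1_loss(pred, target):
    """Mean absolute error between two flat pixel sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def combined_loss(pred, target, ssim_value, l1_weight=0.8, ssim_weight=0.2):
    """L = 0.8 * L1 + 0.2 * (1 - SSIM), with SSIM supplied by the caller."""
    return l1_weight * l1_loss(pred, target) + ssim_weight * (1.0 - ssim_value)


pred = [0.20, 0.40, 0.60]
target = [0.25, 0.45, 0.65]
loss = combined_loss(pred, target, ssim_value=0.9)  # 0.8 * 0.05 + 0.2 * 0.1
```

Since SSIM equals 1 for identical images, the second term vanishes for a perfect reconstruction, just like the L1 term.
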
| Direction | PSNR ↑ | LPIPS ↓ |
|---|---|---|
| Underexposed | 19.66 dB | 0.3729 |
| Overexposed | 19.00 dB | 0.5608 |
Reconstructed HDR images reach ~5.8–7.6 EV dynamic range vs. 7.2–24.3 EV in the originals. The gap is inherent to the approach: only ±2.7 EV of bracketing (5.4 EV total) is available for Debevec merging.
| Scene | Original DR (EV) | Reconstructed DR (EV) |
|---|---|---|
| C40 | 20.27 | 6.22 |
| C41 | 18.00 | 6.58 |
| C42 | 8.18 | 6.94 |
| C43 | 24.30 | 7.58 |
| C44 | 7.17 | 5.78 |
| C45 | 8.39 | 7.45 |
| C46 | 14.07 | 6.99 |
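
To make the EV arithmetic concrete: one EV is a factor of two in exposure, so a ±2.7 EV bracket scales exposure by 2^±2.7. The sketch below also computes dynamic range as log2(max / min) of positive luminance values. That definition is an assumption; this README does not say how the DR figures in the table were measured (for example, whether percentiles were used instead of the raw extremes).

```python
import math


def exposure_gain(ev):
    """Linear exposure multiplier for a shift of `ev` stops."""
    return 2.0 ** ev


def dynamic_range_ev(luminances, eps=1e-6):
    """Dynamic range in EV, assumed here to be log2(max / min) over values above eps."""
    positive = [v for v in luminances if v > eps]
    return math.log2(max(positive) / min(positive))


gain_up = exposure_gain(2.7)      # roughly 6.5x brighter
gain_down = exposure_gain(-2.7)   # the reciprocal of gain_up
bracket_ratio = gain_up / gain_down          # spans 5.4 EV in total
toy_dr = dynamic_range_ev([0.01, 0.08, 0.64])  # 0.64 / 0.01 = 64, i.e. 6 EV
```
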
Full report:
project3/SUMMARY.md
Goal: approximate the Phong lighting model with a neural network. The model takes a scene parameter vector (object position, diffuse color, shininess, light position) and generates a 128×128 px rendering. Dataset: 3 000 procedurally rendered images; test set: indices 2400–2999 (600 samples).
Two architectures were evaluated: a conditional DDPM diffusion model and a conditional GAN (LSGAN).
Conditional U-Net with sinusoidal time embedding and scene parameter conditioning. Trained for 67 epochs (early stopping, patience=10) on Kaggle T4.
| Method | FLIP ↓ | LPIPS ↓ | SSIM ↑ | Hausdorff ↓ |
|---|---|---|---|---|
| Diffusion (DDPM) | 0.0211 | 0.7940 | 0.0020 | 74.94 px |
The model failed to reproduce object geometry or Phong shading: generated images resemble noisy pixel clusters rather than coherent renders.
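
The sinusoidal time embedding used by the conditional U-Net can be sketched as follows. This is the common Transformer-style formulation (sines followed by cosines over geometrically spaced frequencies); the embedding dimension, the `max_period` value, and the sin/cos ordering in the project code are not given here, so treat them as assumptions.

```python
import math


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep t to a `dim`-dimensional vector of sines and cosines."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay geometrically from 1 towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_start = sinusoidal_embedding(t=0, dim=8)    # all sines are 0, all cosines are 1
emb_late = sinusoidal_embedding(t=500, dim=16)
```

In diffusion models this vector is typically passed through a small MLP before being injected into the network blocks.
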
Conditional GAN with spectral-normalized discriminator. Generator uses transposed convolutions to upsample from an 18-dim latent vector (noise z=8 + condition c=10) to 128×128 px. A foreground mask (brightness > 0.05) applies 50× weight to sphere pixels in the L1 loss, preventing the generator from collapsing to black backgrounds.
L_G = MSE(D(x_fake, c), 1.0) + 200.0 · L_masked_L1
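
A minimal sketch of this generator objective on flat grayscale pixel lists is given below. It assumes the foreground mask is derived from the target image and that the weighted L1 term is averaged over all pixels; neither detail is stated above, and the function names are illustrative.

```python
def masked_l1(pred, target, threshold=0.05, fg_weight=50.0):
    """L1 loss where pixels brighter than `threshold` in the target count fg_weight times."""
    total = 0.0
    for p, t in zip(pred, target):
        weight = fg_weight if t > threshold else 1.0
        total += weight * abs(p - t)
    return total / len(target)  # assumption: normalise by pixel count


def generator_loss(d_fake_scores, pred, target, l1_weight=200.0):
    """LSGAN generator term MSE(D(x_fake, c), 1) plus the weighted masked L1 term."""
    adversarial = sum((s - 1.0) ** 2 for s in d_fake_scores) / len(d_fake_scores)
    return adversarial + l1_weight * masked_l1(pred, target)


target = [0.0, 0.0, 0.8, 0.6]   # two background pixels, two sphere pixels
pred = [0.0, 0.0, 0.7, 0.6]     # one sphere pixel is off by 0.1
recon = masked_l1(pred, target)                          # 50 * 0.1 / 4
loss = generator_loss([1.0], pred, target)               # discriminator fully fooled
```

With these weights the reconstruction term dominates, so an all-black output is penalised heavily wherever the sphere should be.
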
Trained for 300 epochs (~58.7 min on T4), best checkpoint at epoch 240.
| Method | FLIP ↓ | LPIPS ↓ | SSIM ↑ | Hausdorff ↓ |
|---|---|---|---|---|
| GAN | 0.0125 | 0.1303 | 0.9650 | 19.63 px |
The GAN successfully approximates the Phong model (SSIM=0.965, FLIP=0.0125), significantly outperforming the diffusion model across all metrics.
Full report:
project4/SUMMARY.ipynb
Goal: train neural networks to deform a 3D point cloud from a source shape into a target shape (teapot). Three separate models were trained, with Armadillo, Bunny, and Dragon as source objects. Generalisation is evaluated on an unseen shape, Asian Dragon.
All models predict a displacement field: for each input point x_i, the network outputs Δx_i, and the final position is x_pred = x_input + Δx. This formulation makes the network learn only the shape difference, stabilising training. Each model follows a three-block pipeline:
| Block | Operation | Output shape |
|---|---|---|
| Local encoder | Per-point shared MLP | (B, N, 128) |
| Global descriptor | Max-pool over points → MLP | (B, 512) broadcast to each point |
| Decoder | MLP on concat (local + global) → 3 | (B, N, 3) displacements |
Armadillo model (VectorFieldNet): 373 251 parameters. Input/output: (B, 2048, 3).
CD(P, Q) = (1/|P|) * Σ_{p∈P} min_{q∈Q} ||p−q||² + (1/|Q|) * Σ_{q∈Q} min_{p∈P} ||q−p||²
The symmetric formulation penalises both predicted points far from the target and target regions not covered by the prediction.
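
The Chamfer distance formula and the displacement formulation x_pred = x_input + Δx can be illustrated with a brute-force, dependency-free sketch. It runs in O(|P|·|Q|) time, which is fine for a toy example; a training loop on 2048-point clouds would use a batched tensor implementation instead. The helper names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(P, Q):
    """Symmetric Chamfer distance with squared distances, as in the formula above."""
    p_to_q = sum(min(sq_dist(p, q) for q in Q) for p in P) / len(P)
    q_to_p = sum(min(sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p


def apply_displacement(points, deltas):
    """x_pred = x_input + delta, applied per point."""
    return [tuple(x + d for x, d in zip(p, dp)) for p, dp in zip(points, deltas)]


source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
deltas = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]   # a perfect displacement field

before = chamfer_distance(source, target)                              # 1 + 1
after = chamfer_distance(apply_displacement(source, deltas), target)   # clouds coincide
```
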
All models: Adam, CosineAnnealingLR, batch size 16, 2048 points per cloud.
| Model | Epochs | LR | Notes |
|---|---|---|---|
| Bunny | 200 | 3e-4 | Single stage |
| Dragon | 200 | 3e-4 | Single stage |
| Armadillo | 100 + 200 | 1e-3 → 3e-4 | Two-stage fine-tuning; val loss: 0.003517 → 0.001164 (~67% reduction) |
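
For reference, the closed-form schedule followed by PyTorch's `CosineAnnealingLR` (without restarts) can be sketched in a few lines. The values of `T_max` and `eta_min` used in these runs are not stated above; the sketch assumes `T_max` equals the number of epochs and `eta_min` is 0, the PyTorch default.

```python
import math


def cosine_annealing_lr(step, t_max, base_lr, eta_min=0.0):
    """Learning rate after `step` scheduler steps of cosine annealing."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


# Single-stage setting from the table: 200 epochs starting at 3e-4.
lr_start = cosine_annealing_lr(0, 200, 3e-4)     # full base rate
lr_middle = cosine_annealing_lr(100, 200, 3e-4)  # half the base rate
lr_end = cosine_annealing_lr(200, 200, 3e-4)     # decays to eta_min
```
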
| Flow | IoU ↑ | Dice ↑ | Chamfer ↓ |
|---|---|---|---|
| bunny → teapot | 0.7489 | 0.8565 | 3.1016 |
| dragon → teapot | 0.7581 | 0.8624 | 3.2829 |
| armadillo → teapot | 0.7343 | 0.8468 | 3.2182 |
| asian dragon (bunny flow) | 0.7203 | 0.8374 | 3.1777 |
| asian dragon (dragon flow) | 0.7527 | 0.8589 | 3.1765 |
| asian dragon (armadillo flow) | 0.7974 | 0.8873 | 3.2282 |
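All three source-to-teapot flows achieve IoU above 0.73 and Dice above 0.84. On the unseen Asian Dragon, the bunny flow falls slightly below these levels (IoU 0.7203, Dice 0.8374), while the armadillo model generalises best (IoU 0.7974, Dice 0.8873). A plausible explanation is that the two-stage fine-tuning yielded a smoother displacement field that transfers well to new shapes.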
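
IoU and Dice are set-overlap measures, so applying them to point clouds requires turning each cloud into a set first. How that is done for the table above is not described in this README; the sketch below assumes voxel occupancy on a regular grid, and the `voxel_size` parameter is hypothetical.

```python
import math


def voxelise(points, voxel_size):
    """Set of integer voxel indices occupied by at least one point."""
    return {tuple(int(math.floor(c / voxel_size)) for c in p) for p in points}


def iou_and_dice(pred_points, target_points, voxel_size=1.0):
    """IoU and Dice between the voxel occupancy sets of two point clouds."""
    a = voxelise(pred_points, voxel_size)
    b = voxelise(target_points, voxel_size)
    inter, union = len(a & b), len(a | b)
    iou = inter / union if union else 1.0
    dice = 2.0 * inter / (len(a) + len(b)) if (a or b) else 1.0
    return iou, dice


pred = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
target = [(1.5, 0.5, 0.5), (2.5, 0.5, 0.5), (3.5, 0.5, 0.5)]
iou, dice = iou_and_dice(pred, target)   # 2 shared voxels out of 4 occupied
```

For a single pair of sets the two scores are tied by Dice = 2·IoU / (1 + IoU), which is why they rank the flows in the same order.
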
Full report:
project5/SUMMARY.md
Goal: generate stickman animations from a text prompt specifying motion type using a conditional diffusion model. Two motion classes are supported: walk and jump. The model outputs a motion tensor of shape [48, 15, 3]: 48 animation frames, 15 skeleton keypoints, and 3 spatial coordinates per point.
Motion sequences were stored as .npy files, split by class, and temporally resampled to a fixed 48-frame length. Each sequence was spatially centred relative to mean body-centre joint positions to decouple absolute location from motion dynamics. Training data was augmented with random vertical-axis rotation and optional skeleton mirroring, using 7 augmentation variants for walk and 13 for jump to balance the classes. Final splits: train.npz, test.npz, and norm_stats.npy.
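
Temporal resampling to a fixed 48-frame length could look like the sketch below. It assumes linear interpolation between neighbouring frames, with each frame flattened to a list of coordinates; the interpolation method actually used in the preprocessing is not stated above, and the function name is illustrative.

```python
def resample_sequence(frames, target_len=48):
    """Linearly interpolate a list of flat pose vectors to `target_len` frames."""
    n = len(frames)
    if n == 0:
        raise ValueError("sequence is empty")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)  # fractional index into the source
        lo = min(int(pos), n - 2)             # left neighbour, kept in range
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


short_clip = [[0.0, 0.0], [10.0, 5.0], [20.0, 10.0]]   # 3 frames, 2 values each
resampled = resample_sequence(short_clip, target_len=48)
```

The first and last frames are preserved, so the motion keeps its start and end poses regardless of the original clip length.
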
The generative model (MotionDenoiser) is a Transformer-based diffusion denoiser conditioned on motion class. It predicts the noise added to a motion sequence at a given diffusion step.
| Component | Description |
|---|---|
| Input projection | Linear layer mapping 15 joints → model hidden dim |
| Time embedding | Sinusoidal + MLP |
| Class embedding | Learned embedding for walk / jump |
| Backbone | Transformer encoder (temporal self-attention) |
| Output head | Linear projection → [48, 15, 3] displacements |
Key hyperparameters: model dim 384, attention heads 6, transformer layers 6, dropout 0.1.
Standard Gaussian Diffusion with a linear beta schedule over 1 000 steps. The forward process gradually corrupts motion sequences with Gaussian noise; the reverse process denoises iteratively to recover the animation. Classifier-free guidance is applied: class labels are randomly dropped during training, and generation uses a guidance_scale parameter to control conditioning strength.
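
The pieces named above can be sketched without any framework. The beta range (1e-4 to 0.02) is the common DDPM default and is an assumption, as is the exact form of the guidance combination; only the linear schedule, the 1 000 steps, and the `guidance_scale` parameter come from the description.

```python
import math
import random


def linear_beta_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (range is an assumed default)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    return x_t, eps


def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: move from the unconditional towards the conditional prediction."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
x_t, eps = q_sample([0.5, -0.2, 0.1], t=999, alpha_bars=alpha_bars, rng=random.Random(0))
```

With `guidance_scale` equal to 1 the guided prediction is just the conditional one; larger values push samples further towards the requested class.
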
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| Scheduler | CosineAnnealingLR |
| Batch size | 32 |
| Diffusion steps | 1 000 |
| Gradient clipping | 1.0 |
| Velocity loss weight | 0.1 |
Training ran for 5 000 epochs (phase 1, interrupted due to memory limits) followed by a fine-tuning phase. The final checkpoint from epoch 4 200 (phase 2) achieved the lowest loss and best qualitative results. Loss started at ~1.4979 and decreased systematically throughout training.
total_loss = noise_loss + 0.1 · velocity_loss
noise_loss is the standard MSE between predicted and actual noise. velocity_loss penalises differences in joint velocity between consecutive frames, promoting temporal smoothness and natural motion dynamics.
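
A dependency-free sketch of this objective follows. It assumes the velocity term compares frame-to-frame differences of the same two tensors used for the noise term; whether the project applies it to the noise prediction or to a reconstructed motion sequence is not stated above. Sequences are lists of flat per-frame vectors, and the helper names are illustrative.

```python
def mse(a, b):
    """Mean squared error between two flat sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(seq):
    return [v for frame in seq for v in frame]


def velocities(seq):
    """Differences between consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(seq, seq[1:])]


def total_loss(pred, target, velocity_weight=0.1):
    """noise_loss + 0.1 * velocity_loss."""
    noise_loss = mse(flatten(pred), flatten(target))
    velocity_loss = mse(flatten(velocities(pred)), flatten(velocities(target)))
    return noise_loss + velocity_weight * velocity_loss


target = [[0.0], [1.0], [2.0]]   # constant velocity of 1 per frame
pred = [[0.0], [1.0], [3.0]]     # last frame overshoots by 1
loss = total_loss(pred, target)  # 1/3 from the noise term, 0.1 * 0.5 from velocity
```
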
Quantitative evaluation used three metrics: FMD (Fréchet Motion Distance, distribution-level quality), MPJPE (Mean Per Joint Position Error, joint accuracy), and Var (variance across generated samples, diversity).

Training set:

| Motion | FMD ↓ | MPJPE ↓ | Var ↑ | Samples |
|---|---|---|---|---|
| walk | 31.03 | 2.71 | 9.50 | 623 |
| jump | 139.76 | 2.09 | 3.40 | 637 |

Test set:

| Motion | FMD ↓ | MPJPE ↓ | Var ↑ | Samples |
|---|---|---|---|---|
| walk | 2596.04 | 10.52 | 8.68 | 23 |
| jump | 882.17 | 7.19 | 3.59 | 12 |
The model achieves low FMD and MPJPE on the training set, with walk reaching the lowest FMD (31.03). The large increase on the test set indicates overfitting, likely driven by limited training data for more complex motions; the small test set (23 walk and 12 jump samples) also makes the test estimates noisy. Variance remained stable across both splits, which suggests diverse generation without mode collapse.
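
Of the three metrics, MPJPE is the simplest to state precisely: the mean Euclidean distance between corresponding joints, averaged over all joints and frames. The sketch below assumes sequences shaped [frames][joints][3]; how generated samples are paired with reference sequences in this evaluation is not described above.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error over all frames and joints."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for pred_joint, target_joint in zip(pred_frame, target_frame):
            total += math.dist(pred_joint, target_joint)
            count += 1
    return total / count


# One frame with two joints: the first is off by a 3-4-5 triangle, the second is exact.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]]
error = mpjpe(pred, target)   # (5 + 0) / 2
```
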