This guide covers how to evaluate layer decomposition quality using LayerD's evaluation metrics.
LayerD provides a comprehensive evaluation framework based on:
- LayersEditDist: Sequence-based edit distance metric using Dynamic Time Warping (DTW)
- Per-layer metrics: RGBL1 (color accuracy) and AlphaIoU (mask accuracy)
These metrics compare predicted layer decompositions against ground-truth layers.
To evaluate a single predicted decomposition against its ground truth:
from layerd.evaluation import LayersEditDist
from PIL import Image
# Load layers (both should be lists of PIL Images in RGBA format)
layers_pred = [Image.open(f"pred/layer_{i}.png") for i in range(3)]
layers_gt = [Image.open(f"gt/layer_{i}.png") for i in range(3)]
# Create metric
metric = LayersEditDist()
# Compute edit distance
result = metric(layers_pred, layers_gt)
print(f"Edit Distance: {result['edit_distance']}")
print(f"Normalized Edit Distance: {result['normalized_edit_distance']}")For dataset-level evaluation:
uv run python ./tools/evaluate.py \
--pred-dir /path/to/predictions \
--gt-dir /path/to/groundtruth \
--output-dir /path/to/results \
--max-edits 5

The LayersEditDist metric computes the minimum edit distance between two layer sequences:
- Insert: Adding a missing layer
- Delete: Removing an extra layer
- Modify: Changing an existing layer
The metric uses Dynamic Time Warping (DTW) to align layers and compute the optimal sequence of edit operations.
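As a toy illustration of how these operations add up (the modify costs below are made-up numbers; a unit insert cost matches the defaults listed later in this guide):

# Hypothetical numbers, only to show how operation costs accumulate into the distance.
modify_costs = [0.10, 0.25]   # cost of matching the two aligned layer pairs
insert_cost = 1.0             # one ground-truth layer has no predicted counterpart
edit_distance = sum(modify_costs) + insert_cost
print(edit_distance)          # 1.35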
Traditional per-layer metrics (like MSE or IoU) require exact correspondence between predicted and ground-truth layers. LayersEditDist handles:
- Different numbers of layers
- Misaligned layer orders
- Missing or extra layers
This makes it suitable for evaluating layer decomposition where the number and order of layers may vary.
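For example, comparing sequences of different lengths needs no special handling; the sketch below uses the same API as the quick-start example (the file paths are placeholders):

from PIL import Image
from layerd.evaluation import LayersEditDist

# A 2-layer prediction against a 4-layer ground truth: the missing layers
# simply contribute insert costs to the final distance.
layers_pred = [Image.open(f"pred/layer_{i}.png") for i in range(2)]
layers_gt = [Image.open(f"gt/layer_{i}.png") for i in range(4)]

metric = LayersEditDist()
result = metric(layers_pred, layers_gt)
print(result['edit_distance'], result['normalized_edit_distance'])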
The metric call returns a dictionary of the following form:

result = metric(layers_pred, layers_gt)
{
    'edit_distance': 2.5,               # Raw edit distance
    'normalized_edit_distance': 0.5,    # Normalized by max(len(pred), len(gt))
    'alignment': [...],                 # DTW alignment path
    'operation_costs': [...]            # Cost of each edit operation
}

To interpret the scores:

- Edit distance = 0: Perfect match (identical layers)
- Lower is better: Smaller edit distance means better quality
- Normalized range: [0, max_edits] where max_edits is typically 3-5
RGBL1 measures the L1 distance between the RGB values of predicted and ground-truth layers:
from layerd.evaluation.metrics import compute_rgbl1
rgbl1 = compute_rgbl1(layer_pred, layer_gt)
# Returns float in [0, 1], lower is better
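The exact formula is defined in layerd.evaluation.metrics; as a rough mental model only (the alpha weighting and normalization here are assumptions, not the library's code), an alpha-weighted mean absolute RGB difference looks like this:

import numpy as np
from PIL import Image

def rgbl1_sketch(layer_pred: Image.Image, layer_gt: Image.Image) -> float:
    """Illustrative L1 color error in [0, 1]; assumes both layers share one resolution."""
    pred = np.asarray(layer_pred.convert("RGBA"), dtype=np.float32) / 255.0
    gt = np.asarray(layer_gt.convert("RGBA"), dtype=np.float32) / 255.0
    alpha = gt[..., 3:4]                        # weight by where the GT layer is visible
    diff = np.abs(pred[..., :3] - gt[..., :3])  # per-channel absolute color difference
    return float((diff * alpha).sum() / (3.0 * alpha.sum() + 1e-8))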
AlphaIoU measures the Intersection over Union (IoU) of the alpha channels of predicted and ground-truth layers:

from layerd.evaluation.metrics import compute_alpha_iou
alpha_iou = compute_alpha_iou(layer_pred, layer_gt)
# Returns float in [0, 1], higher is better
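Likewise, an illustrative version of AlphaIoU (binarizing at 0.5 is an assumption; the shipped metric may work with soft alpha values instead):

import numpy as np
from PIL import Image

def alpha_iou_sketch(layer_pred: Image.Image, layer_gt: Image.Image, thresh: float = 0.5) -> float:
    """Illustrative IoU of binarized alpha masks, in [0, 1]."""
    a_pred = np.asarray(layer_pred.convert("RGBA"))[..., 3] / 255.0 > thresh
    a_gt = np.asarray(layer_gt.convert("RGBA"))[..., 3] / 255.0 > thresh
    union = np.logical_or(a_pred, a_gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(a_pred, a_gt).sum() / union)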
To compute both per-layer metrics for a single layer pair, use evaluate_layer_pair:

from layerd.evaluation.metrics import evaluate_layer_pair
metrics = evaluate_layer_pair(layer_pred, layer_gt)
print(f"RGBL1: {metrics['rgbl1']:.4f}")
print(f"AlphaIoU: {metrics['alpha_iou']:.4f}")The evaluation script expects the following structure:
predictions/
├── sample_001/
│ ├── 0000.png # Background
│ ├── 0001.png # Layer 1
│ ├── 0002.png # Layer 2
│ └── ...
├── sample_002/
│ ├── 0000.png
│ └── ...
└── ...
groundtruth/
├── sample_001/
│ ├── 0000.png
│ ├── 0001.png
│ ├── 0002.png
│ └── ...
└── ...
Each sample has its own directory with numbered layer files (0000.png, 0001.png, etc.).
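Before a long run it can help to sanity-check that the two trees line up; a small sketch (the paths are placeholders):

from pathlib import Path

pred_root = Path("/path/to/predictions")
gt_root = Path("/path/to/groundtruth")

for sample_dir in sorted(p for p in pred_root.iterdir() if p.is_dir()):
    gt_dir = gt_root / sample_dir.name
    if not gt_dir.is_dir():
        print(f"Missing ground truth for {sample_dir.name}")
        continue
    n_pred = len(list(sample_dir.glob("*.png")))
    n_gt = len(list(gt_dir.glob("*.png")))
    print(f"{sample_dir.name}: {n_pred} predicted vs. {n_gt} ground-truth layers")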
Run the evaluation script:

uv run python ./tools/evaluate.py \
--pred-dir /path/to/predictions \
--gt-dir /path/to/groundtruth \
--output-dir /path/to/results \
--max-edits 5

- --pred-dir (required): Directory with predicted layers
- --gt-dir (required): Directory with ground-truth layers
- --output-dir (required): Directory to save evaluation results
- --max-edits: Maximum edit distance for normalization (default: 5)
The script saves its results to the directory given by --output-dir:
results/
├── summary.json # Overall statistics
├── per_sample.csv # Per-sample metrics
└── alignment_viz/ # Visualization of layer alignments (optional)
summary.json contains:
{
"mean_edit_distance": 1.23,
"mean_normalized_edit_distance": 0.41,
"mean_rgbl1": 0.15,
"mean_alpha_iou": 0.87,
"num_samples": 100
}
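Both files can be consumed programmatically; a small sketch assuming the structure shown above (the per_sample.csv column name is an assumption):

import csv
import json
from pathlib import Path

results_dir = Path("/path/to/results")

summary = json.loads((results_dir / "summary.json").read_text())
print(f"Mean AlphaIoU over {summary['num_samples']} samples: {summary['mean_alpha_iou']:.3f}")

# Rank samples by edit distance to find the hardest cases.
with open(results_dir / "per_sample.csv", newline="") as f:
    rows = sorted(csv.DictReader(f), key=lambda r: float(r["edit_distance"]), reverse=True)
print(rows[:5])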
The Crello dataset prepared by generate_crello_matting.py includes a layers/ directory ready for evaluation. When generating the dataset, use --save-layers:
uv run python ./tools/generate_crello_matting.py \
--output-dir /path/to/dataset \
--inpainting \
--save-layers

This creates:
dataset/
├── train/
│ └── layers/ # Ground-truth layers for training set
├── validation/
│ └── layers/
└── test/
└── layers/ # Ground-truth layers for test set
A complete test-set evaluation then looks like:

# 1. Run inference on test set
uv run python ./tools/infer.py \
--input /path/to/dataset/test/composite/ \
--output-dir /path/to/predictions/ \
--device cuda
# 2. Evaluate predictions
uv run python ./tools/evaluate.py \
--pred-dir /path/to/predictions/ \
--gt-dir /path/to/dataset/test/layers/ \
--output-dir /path/to/eval_results/ \
--max-edits 5

Add custom metrics by extending the evaluation framework:
from layerd.evaluation.metrics import evaluate_layer_pair
from PIL import Image
import numpy as np
def custom_metric(layer_pred, layer_gt):
    """Custom evaluation metric"""
    # Convert PIL to numpy
    pred_arr = np.array(layer_pred, dtype=np.float32)
    gt_arr = np.array(layer_gt, dtype=np.float32)
    # Your metric computation -- for example, mean absolute pixel difference in [0, 1]
    score = float(np.abs(pred_arr - gt_arr).mean() / 255.0)
    return score
# Use in evaluation
layer_pred = Image.open("pred.png")
layer_gt = Image.open("gt.png")
score = custom_metric(layer_pred, layer_gt)

Visualize the DTW alignment between predicted and ground-truth layers:
from layerd.evaluation import LayersEditDist
import matplotlib.pyplot as plt
metric = LayersEditDist()
result = metric(layers_pred, layers_gt)
# Get alignment path
alignment = result['alignment']
# Visualize (example)
for i, (pred_idx, gt_idx) in enumerate(alignment):
print(f"Step {i}: Pred layer {pred_idx} <-> GT layer {gt_idx}")For large-scale evaluation:
For large-scale evaluation:

from pathlib import Path
from layerd.evaluation import LayersEditDist
from PIL import Image
pred_root = Path("/path/to/predictions")
gt_root = Path("/path/to/groundtruth")
metric = LayersEditDist()
results = []
for sample_dir in pred_root.iterdir():
    if not sample_dir.is_dir():
        continue
    sample_id = sample_dir.name
    # Load predicted layers
    layers_pred = []
    for layer_file in sorted(sample_dir.glob("*.png")):
        layers_pred.append(Image.open(layer_file))
    # Load ground-truth layers
    gt_dir = gt_root / sample_id
    layers_gt = []
    for layer_file in sorted(gt_dir.glob("*.png")):
        layers_gt.append(Image.open(layer_file))
    # Evaluate
    result = metric(layers_pred, layers_gt)
    results.append({
        'sample_id': sample_id,
        'edit_distance': result['edit_distance'],
        'normalized_edit_distance': result['normalized_edit_distance']
    })
# Compute statistics
mean_edit_dist = sum(r['edit_distance'] for r in results) / len(results)
print(f"Mean Edit Distance: {mean_edit_dist:.4f}")Problem: "Mismatched layer counts"
Solution: This is expected behavior. LayersEditDist handles different layer counts automatically. The edit distance reflects the cost of aligning mismatched sequences.
Problem: Evaluation is slow for large datasets
Solution:
- Use multiprocessing to parallelize sample evaluation (see the sketch after this list)
- Evaluate on a subset first to verify correctness
- Use smaller image resolutions if per-pixel accuracy is not critical
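A minimal multiprocessing sketch for the directory layout described earlier (the evaluate_sample helper, paths, and process count are illustrative, not part of LayerD):

from multiprocessing import Pool
from pathlib import Path

from PIL import Image
from layerd.evaluation import LayersEditDist

pred_root = Path("/path/to/predictions")
gt_root = Path("/path/to/groundtruth")

def evaluate_sample(sample_id):
    """Load one sample's predicted and ground-truth layers and score them."""
    metric = LayersEditDist()  # build per worker so no state is shared across processes
    layers_pred = [Image.open(p) for p in sorted((pred_root / sample_id).glob("*.png"))]
    layers_gt = [Image.open(p) for p in sorted((gt_root / sample_id).glob("*.png"))]
    return sample_id, metric(layers_pred, layers_gt)['edit_distance']

if __name__ == "__main__":  # guard required on spawn-based platforms
    sample_ids = [d.name for d in pred_root.iterdir() if d.is_dir()]
    with Pool(processes=8) as pool:
        scores = dict(pool.map(evaluate_sample, sample_ids))
    print(f"Mean Edit Distance: {sum(scores.values()) / len(scores):.4f}")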
Problem: High edit distance despite visually similar results
Solution:
- Check layer ordering (background should be first)
- Verify alpha channel quality (check AlphaIoU)
- Adjust edit operation costs if needed (modify metric parameters)
For more troubleshooting help, see troubleshooting.md.
The LayersEditDist metric uses the following default costs:
- Insert: 1.0 (cost of adding a missing layer)
- Delete: 1.0 (cost of removing an extra layer)
- Modify: Based on RGBL1 + (1 - AlphaIoU)
These can be customized by modifying the metric implementation.
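For example, a reweighted modify cost can be built from the documented per-layer metrics (the weights and the helper function are illustrative; hooking it into LayersEditDist depends on the implementation):

from layerd.evaluation.metrics import compute_rgbl1, compute_alpha_iou

def modify_cost(layer_pred, layer_gt, rgb_weight=1.0, alpha_weight=1.0):
    """Cost of turning one predicted layer into a ground-truth layer."""
    rgbl1 = compute_rgbl1(layer_pred, layer_gt)          # color error, lower is better
    alpha_iou = compute_alpha_iou(layer_pred, layer_gt)  # mask overlap, higher is better
    return rgb_weight * rgbl1 + alpha_weight * (1.0 - alpha_iou)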
DTW finds the optimal alignment between two sequences by:
- Computing a pairwise distance matrix between all layer pairs
- Finding the minimum-cost path through the matrix
- Returning the alignment and total cost
See src/layerd/evaluation/dtw.py for implementation details.
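For intuition, a textbook DTW over a precomputed cost matrix looks roughly like the sketch below; this is not the library's code, and the real metric additionally charges explicit insert/delete costs:

import numpy as np

def dtw(cost):
    """cost[i, j] = cost of matching predicted layer i to ground-truth layer j.

    Returns the total alignment cost and the alignment path as (i, j) pairs.
    """
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match the pair
                acc[i - 1, j],      # skip a predicted layer
                acc[i, j - 1],      # skip a ground-truth layer
            )
    # Backtrack from the bottom-right corner to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]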
Related documentation:

- Inference Guide - Generate predictions to evaluate
- Training Guide - Train models for evaluation
- Architecture - Understanding evaluation components
- Paper - Full metric description