This guide covers how to evaluate layer decomposition quality using LayerD's evaluation metrics.
LayerD provides a comprehensive evaluation framework based on:
- LayersEditDist: Sequence-based edit distance metric using Dynamic Time Warping (DTW)
- Per-layer metrics: RGBL1 (color accuracy) and AlphaIoU (mask accuracy)
These metrics compare predicted layer decompositions against ground-truth layers.
To evaluate a single predicted decomposition against its ground truth:
from layerd.evaluation import LayersEditDist
from PIL import Image
# Load layers (both should be lists of PIL Images in RGBA format)
layers_pred = [Image.open(f"pred/layer_{i}.png") for i in range(3)]
layers_gt = [Image.open(f"gt/layer_{i}.png") for i in range(3)]
# Create metric
metric = LayersEditDist()
# Compute edit distance
result = metric(layers_pred, layers_gt)
print(f"Edit Distance: {result['edit_distance']}")
print(f"Normalized Edit Distance: {result['normalized_edit_distance']}")For dataset-level evaluation:
uv run python ./tools/evaluate.py \
--pred-dir /path/to/predictions \
--gt-dir /path/to/groundtruth \
--output-dir /path/to/results \
--max-edits 5

The LayersEditDist metric computes the minimum edit distance between two layer sequences:
- Insert: Adding a missing layer
- Delete: Removing an extra layer
- Modify: Changing an existing layer
The metric uses Dynamic Time Warping (DTW) to align layers and compute the optimal sequence of edit operations.
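As a toy illustration of how these operations add up (the modify costs below are made-up numbers; a unit insert cost matches the defaults listed later in this guide):

# Hypothetical numbers, only to show how operation costs accumulate into the distance.
modify_costs = [0.10, 0.25]   # cost of matching the two aligned layer pairs
insert_cost = 1.0             # one ground-truth layer has no predicted counterpart
edit_distance = sum(modify_costs) + insert_cost
print(edit_distance)          # 1.35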
Traditional per-layer metrics (like MSE or IoU) require exact correspondence between predicted and ground-truth layers. LayersEditDist handles:
- Different numbers of layers
- Misaligned layer orders
- Missing or extra layers
This makes it suitable for evaluating layer decomposition where the number and order of layers may vary.
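For example, comparing sequences of different lengths needs no special handling; the sketch below uses the same API as the quick-start example (the file paths are placeholders):

from PIL import Image
from layerd.evaluation import LayersEditDist

# A 2-layer prediction against a 4-layer ground truth: the missing layers
# simply contribute insert costs to the final distance.
layers_pred = [Image.open(f"pred/layer_{i}.png") for i in range(2)]
layers_gt = [Image.open(f"gt/layer_{i}.png") for i in range(4)]

metric = LayersEditDist()
result = metric(layers_pred, layers_gt)
print(result['edit_distance'], result['normalized_edit_distance'])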
The metric call returns a dictionary of the following form:

result = metric(layers_pred, layers_gt)
{
    'edit_distance': 2.5,               # Raw edit distance
    'normalized_edit_distance': 0.5,    # Normalized by max(len(pred), len(gt))
    'alignment': [...],                 # DTW alignment path
    'operation_costs': [...]            # Cost of each edit operation
}

To interpret the scores:

- Edit distance = 0: Perfect match (identical layers)
- Lower is better: Smaller edit distance means better quality
- Normalized range: [0, max_edits] where max_edits is typically 3-5
RGBL1 measures the L1 distance between the RGB values of predicted and ground-truth layers:
from layerd.evaluation.metrics import compute_rgbl1
rgbl1 = compute_rgbl1(layer_pred, layer_gt)
# Returns float in [0, 1], lower is better
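The exact formula is defined in layerd.evaluation.metrics; as a rough mental model only (the alpha weighting and normalization here are assumptions, not the library's code), an alpha-weighted mean absolute RGB difference looks like this:

import numpy as np
from PIL import Image

def rgbl1_sketch(layer_pred: Image.Image, layer_gt: Image.Image) -> float:
    """Illustrative L1 color error in [0, 1]; assumes both layers share one resolution."""
    pred = np.asarray(layer_pred.convert("RGBA"), dtype=np.float32) / 255.0
    gt = np.asarray(layer_gt.convert("RGBA"), dtype=np.float32) / 255.0
    alpha = gt[..., 3:4]                        # weight by where the GT layer is visible
    diff = np.abs(pred[..., :3] - gt[..., :3])  # per-channel absolute color difference
    return float((diff * alpha).sum() / (3.0 * alpha.sum() + 1e-8))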
AlphaIoU measures the Intersection over Union (IoU) of the alpha channels of predicted and ground-truth layers:

from layerd.evaluation.metrics import compute_alpha_iou
alpha_iou = compute_alpha_iou(layer_pred, layer_gt)
# Returns float in [0, 1], higher is better
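Likewise, an illustrative version of AlphaIoU (binarizing at 0.5 is an assumption; the shipped metric may work with soft alpha values instead):

import numpy as np
from PIL import Image

def alpha_iou_sketch(layer_pred: Image.Image, layer_gt: Image.Image, thresh: float = 0.5) -> float:
    """Illustrative IoU of binarized alpha masks, in [0, 1]."""
    a_pred = np.asarray(layer_pred.convert("RGBA"))[..., 3] / 255.0 > thresh
    a_gt = np.asarray(layer_gt.convert("RGBA"))[..., 3] / 255.0 > thresh
    union = np.logical_or(a_pred, a_gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(a_pred, a_gt).sum() / union)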
To compute both per-layer metrics for a single layer pair, use evaluate_layer_pair:

from layerd.evaluation.metrics import evaluate_layer_pair
metrics = evaluate_layer_pair(layer_pred, layer_gt)
print(f"RGBL1: {metrics['rgbl1']:.4f}")
print(f"AlphaIoU: {metrics['alpha_iou']:.4f}")The evaluation script expects the following structure:
predictions/
├── sample_001/
│ ├── 0000.png # Background
│ ├── 0001.png # Layer 1
│ ├── 0002.png # Layer 2
│ └── ...
├── sample_002/
│ ├── 0000.png
│ └── ...
└── ...
groundtruth/
├── sample_001/
│ ├── 0000.png
│ ├── 0001.png
│ ├── 0002.png
│ └── ...
└── ...
Each sample has its own directory with numbered layer files (0000.png, 0001.png, etc.).
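Before a long run it can help to sanity-check that the two trees line up; a small sketch (the paths are placeholders):

from pathlib import Path

pred_root = Path("/path/to/predictions")
gt_root = Path("/path/to/groundtruth")

for sample_dir in sorted(p for p in pred_root.iterdir() if p.is_dir()):
    gt_dir = gt_root / sample_dir.name
    if not gt_dir.is_dir():
        print(f"Missing ground truth for {sample_dir.name}")
        continue
    n_pred = len(list(sample_dir.glob("*.png")))
    n_gt = len(list(gt_dir.glob("*.png")))
    print(f"{sample_dir.name}: {n_pred} predicted vs. {n_gt} ground-truth layers")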
Run the evaluation script:

uv run python ./tools/evaluate.py \
--pred-dir /path/to/predictions \
--gt-dir /path/to/groundtruth \
--output-dir /path/to/results \
--max-edits 5

- --pred-dir (required): Directory with predicted layers
- --gt-dir (required): Directory with ground-truth layers
- --output-dir (required): Directory to save evaluation results
- --max-edits: Maximum edit distance for normalization (default: 5)
The script saves its results to the directory given by --output-dir:
results/
├── summary.json # Overall statistics
├── per_sample.csv # Per-sample metrics
└── alignment_viz/ # Visualization of layer alignments (optional)
summary.json contains:
{
"mean_edit_distance": 1.23,
"mean_normalized_edit_distance": 0.41,
"mean_rgbl1": 0.15,
"mean_alpha_iou": 0.87,
"num_samples": 100
}
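Both files can be consumed programmatically; a small sketch assuming the structure shown above (the per_sample.csv column name is an assumption):

import csv
import json
from pathlib import Path

results_dir = Path("/path/to/results")

summary = json.loads((results_dir / "summary.json").read_text())
print(f"Mean AlphaIoU over {summary['num_samples']} samples: {summary['mean_alpha_iou']:.3f}")

# Rank samples by edit distance to find the hardest cases.
with open(results_dir / "per_sample.csv", newline="") as f:
    rows = sorted(csv.DictReader(f), key=lambda r: float(r["edit_distance"]), reverse=True)
print(rows[:5])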
The Crello dataset prepared by generate_crello_matting.py includes a layers/ directory ready for evaluation. When generating the dataset, use --save-layers:
uv run python ./tools/generate_crello_matting.py \
--output-dir /path/to/dataset \
--inpainting \
--save-layers

This creates:
dataset/
├── train/
│ └── layers/ # Ground-truth layers for training set
├── validation/
│ └── layers/
└── test/
└── layers/ # Ground-truth layers for test set
A complete test-set evaluation then looks like:

# 1. Run inference on test set
uv run python ./tools/infer.py \
--input /path/to/dataset/test/composite/ \
--output-dir /path/to/predictions/ \
--device cuda
# 2. Evaluate predictions
uv run python ./tools/evaluate.py \
--pred-dir /path/to/predictions/ \
--gt-dir /path/to/dataset/test/layers/ \
--output-dir /path/to/eval_results/ \
--max-edits 5

Add custom metrics by extending the evaluation framework:
from layerd.evaluation.metrics import evaluate_layer_pair
from PIL import Image
import numpy as np
def custom_metric(layer_pred, layer_gt):
    """Custom evaluation metric"""
    # Convert PIL to numpy
    pred_arr = np.array(layer_pred, dtype=np.float32)
    gt_arr = np.array(layer_gt, dtype=np.float32)
    # Your metric computation -- for example, mean absolute pixel difference in [0, 1]
    score = float(np.abs(pred_arr - gt_arr).mean() / 255.0)
    return score
# Use in evaluation
layer_pred = Image.open("pred.png")
layer_gt = Image.open("gt.png")
score = custom_metric(layer_pred, layer_gt)

Visualize the DTW alignment between predicted and ground-truth layers:
from layerd.evaluation import LayersEditDist
import matplotlib.pyplot as plt
metric = LayersEditDist()
result = metric(layers_pred, layers_gt)
# Get alignment path
alignment = result['alignment']
# Visualize (example)
for i, (pred_idx, gt_idx) in enumerate(alignment):
print(f"Step {i}: Pred layer {pred_idx} <-> GT layer {gt_idx}")For large-scale evaluation:
For large-scale evaluation:

from pathlib import Path
from layerd.evaluation import LayersEditDist
from PIL import Image
pred_root = Path("/path/to/predictions")
gt_root = Path("/path/to/groundtruth")
metric = LayersEditDist()
results = []
for sample_dir in pred_root.iterdir():
    if not sample_dir.is_dir():
        continue
    sample_id = sample_dir.name
    # Load predicted layers
    layers_pred = []
    for layer_file in sorted(sample_dir.glob("*.png")):
        layers_pred.append(Image.open(layer_file))
    # Load ground-truth layers
    gt_dir = gt_root / sample_id
    layers_gt = []
    for layer_file in sorted(gt_dir.glob("*.png")):
        layers_gt.append(Image.open(layer_file))
    # Evaluate
    result = metric(layers_pred, layers_gt)
    results.append({
        'sample_id': sample_id,
        'edit_distance': result['edit_distance'],
        'normalized_edit_distance': result['normalized_edit_distance']
    })
# Compute statistics
mean_edit_dist = sum(r['edit_distance'] for r in results) / len(results)
print(f"Mean Edit Distance: {mean_edit_dist:.4f}")Problem: "Mismatched layer counts"
Solution: This is expected behavior. LayersEditDist handles different layer counts automatically. The edit distance reflects the cost of aligning mismatched sequences.
Problem: Evaluation is slow for large datasets
Solution:
- Use multiprocessing to parallelize sample evaluation (see the sketch after this list)
- Evaluate on a subset first to verify correctness
- Use smaller image resolutions if per-pixel accuracy is not critical
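A minimal multiprocessing sketch for the directory layout described earlier (the evaluate_sample helper, paths, and process count are illustrative, not part of LayerD):

from multiprocessing import Pool
from pathlib import Path

from PIL import Image
from layerd.evaluation import LayersEditDist

pred_root = Path("/path/to/predictions")
gt_root = Path("/path/to/groundtruth")

def evaluate_sample(sample_id):
    """Load one sample's predicted and ground-truth layers and score them."""
    metric = LayersEditDist()  # build per worker so no state is shared across processes
    layers_pred = [Image.open(p) for p in sorted((pred_root / sample_id).glob("*.png"))]
    layers_gt = [Image.open(p) for p in sorted((gt_root / sample_id).glob("*.png"))]
    return sample_id, metric(layers_pred, layers_gt)['edit_distance']

if __name__ == "__main__":  # guard required on spawn-based platforms
    sample_ids = [d.name for d in pred_root.iterdir() if d.is_dir()]
    with Pool(processes=8) as pool:
        scores = dict(pool.map(evaluate_sample, sample_ids))
    print(f"Mean Edit Distance: {sum(scores.values()) / len(scores):.4f}")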
Problem: High edit distance despite visually similar results
Solution:
- Check layer ordering (background should be first)
- Verify alpha channel quality (check AlphaIoU)
- Adjust edit operation costs if needed (modify metric parameters)
For more troubleshooting help, see troubleshooting.md.
The LayersEditDist metric uses the following default costs:
- Insert: 1.0 (cost of adding a missing layer)
- Delete: 1.0 (cost of removing an extra layer)
- Modify: Based on RGBL1 + (1 - AlphaIoU)
These can be customized by modifying the metric implementation.
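For example, a reweighted modify cost can be built from the documented per-layer metrics (the weights and the helper function are illustrative; hooking it into LayersEditDist depends on the implementation):

from layerd.evaluation.metrics import compute_rgbl1, compute_alpha_iou

def modify_cost(layer_pred, layer_gt, rgb_weight=1.0, alpha_weight=1.0):
    """Cost of turning one predicted layer into a ground-truth layer."""
    rgbl1 = compute_rgbl1(layer_pred, layer_gt)          # color error, lower is better
    alpha_iou = compute_alpha_iou(layer_pred, layer_gt)  # mask overlap, higher is better
    return rgb_weight * rgbl1 + alpha_weight * (1.0 - alpha_iou)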
DTW finds the optimal alignment between two sequences by:
- Computing a pairwise distance matrix between all layer pairs
- Finding the minimum-cost path through the matrix
- Returning the alignment and total cost
See src/layerd/evaluation/dtw.py for implementation details.
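For intuition, a textbook DTW over a precomputed cost matrix looks roughly like the sketch below; this is not the library's code, and the real metric additionally charges explicit insert/delete costs:

import numpy as np

def dtw(cost):
    """cost[i, j] = cost of matching predicted layer i to ground-truth layer j.

    Returns the total alignment cost and the alignment path as (i, j) pairs.
    """
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match the pair
                acc[i - 1, j],      # skip a predicted layer
                acc[i, j - 1],      # skip a ground-truth layer
            )
    # Backtrack from the bottom-right corner to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]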
Related documentation:

- Inference Guide - Generate predictions to evaluate
- Training Guide - Train models for evaluation
- Architecture - Understanding evaluation components
- Paper - Full metric description