caramba includes a comprehensive benchmarking system that measures model performance and generates publication-ready artifacts. This guide covers all benchmark types and output formats.
Benchmarks run after training completes and measure:
- Quality — Perplexity, accuracy, loss metrics
- Speed — Tokens/second, prefill time, decode time
- Memory — KV-cache size, peak memory, quantization impact
Results are output as:
- CSV files — Raw data for analysis
- PNG charts — Visualizations
- LaTeX tables — Paper-ready tables
A minimal configuration looks like this:

```yaml
targets:
  - type: experiment
    name: my_experiment
    runs: [...]
    benchmarks:
      - id: perplexity
        config:
          type: perplexity
          num_batches: 100
        models: [teacher, student]
```

The perplexity benchmark measures language modeling quality via cross-entropy loss:
```yaml
- id: perplexity
  config:
    type: perplexity
    dataset: fineweb_100m.npy
    block_size: 2048
    batch_size: 1
    num_batches: 100
  models: [teacher, student]
  repeats: 1
```

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | str | required | Path to `.npy` dataset |
| `block_size` | int | 2048 | Sequence length |
| `batch_size` | int | 1 | Batch size |
| `num_batches` | int | 100 | Number of batches to evaluate |
Outputs:
- `perplexity.csv` — Per-model perplexity values
- Comparison tables in `tables.tex`
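For reference, perplexity is simply the exponential of the mean cross-entropy, so the two reported metrics move together:

```python
import math

# A mean cross-entropy of 2.13 nats corresponds to a perplexity of about 8.4,
# matching the sample perplexity.csv values shown later in this guide.
cross_entropy = 2.13
print(f"perplexity = {math.exp(cross_entropy):.2f}")  # ~8.41
```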
The latency benchmark measures inference speed across different configurations:
```yaml
- id: latency
  config:
    type: latency
    prompt_lengths: [128, 512, 1024, 2048]
    generation_lengths: [64, 128, 256]
    batch_sizes: [1]
    warmup_runs: 3
    timed_runs: 10
    use_cache: true
    cache_kind: fp16
  models: [teacher, student]
```

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt_lengths` | list | required | Prompt lengths to test |
| `generation_lengths` | list | required | Tokens to generate |
| `batch_sizes` | list | [1] | Batch sizes to test |
| `warmup_runs` | int | 3 | Warmup iterations |
| `timed_runs` | int | 10 | Timed iterations |
| `use_cache` | bool | True | Use KV-cache |
| `cache_kind` | str | "fp16" | Cache quantization |
Outputs:
- `latency.csv` — Timing data
- `latency_vs_context.png` — Throughput scaling chart
- Tokens/second comparisons in `tables.tex`
The memory benchmark measures memory usage across configurations:
```yaml
- id: memory
  config:
    type: memory
    sequence_lengths: [512, 1024, 2048, 4096]
    batch_sizes: [1]
    measure_peak: true
    measure_kvcache: true
    quantization_modes: [fp16, q8, q4]
  models: [teacher, student]
```

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `sequence_lengths` | list | required | Sequence lengths to test |
| `batch_sizes` | list | [1] | Batch sizes |
| `measure_peak` | bool | True | Measure peak memory |
| `measure_kvcache` | bool | True | Measure KV-cache size |
| `quantization_modes` | list | ["fp16"] | Cache quantizations to test |
Outputs:
- `memory.csv` — Memory measurements
- `memory_scaling.png` — Memory vs sequence length chart
- KV-cache comparisons in `tables.tex`
Benchmarks are attached to experiment targets:
```yaml
targets:
  - type: experiment
    name: paper
    runs:
      - id: train
        # ... training config ...
    benchmarks:
      - id: perplexity
        config:
          type: perplexity
          dataset: data.npy
          num_batches: 100
        models: [teacher, student]
        repeats: 1
      - id: latency
        config:
          type: latency
          prompt_lengths: [128, 512, 1024]
          generation_lengths: [64, 128]
        models: [student]
        repeats: 3  # Average over 3 runs
```

Each benchmark entry has the following fields:

| Field | Required | Description |
|---|---|---|
| `id` | ✅ | Unique benchmark identifier |
| `config` | ✅ | Benchmark-specific configuration |
| `models` | ✅ | Models to benchmark: `teacher`, `student`, or both |
| `repeats` | ❌ | Number of times to repeat (default: 1) |

The `models` list accepts:

| Model | Description |
|---|---|
| `teacher` | The original/reference model |
| `student` | The trained/upcycled model |
For upcycle experiments, both models are available. For standard training, only student exists.
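For example, a standard-training experiment would point its benchmarks at the student only (a hypothetical snippet using the fields described above):

```yaml
benchmarks:
  - id: perplexity
    config:
      type: perplexity
      num_batches: 100
    models: [student]  # no teacher exists for standard training
```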
Benchmarks generate artifacts in the experiment directory:
```
artifacts/experiment_name_YYYYMMDD_HHMMSS/
├── report.json              # Complete experiment metadata
├── perplexity.csv           # Raw perplexity data
├── latency.csv              # Raw latency data
├── memory.csv               # Raw memory data
├── summary.png              # 3-panel comparison chart
├── latency_vs_context.png   # Throughput scaling
├── memory_scaling.png       # Memory vs sequence length
└── tables.tex               # LaTeX tables
```
Example CSV contents:

```csv
# perplexity.csv
model,perplexity,cross_entropy,num_batches
teacher,8.42,2.13,100
student,8.59,2.15,100
```

```csv
# latency.csv
model,prompt_len,gen_len,batch_size,tokens_per_sec,prefill_ms,decode_ms
teacher,512,128,1,156.2,45.3,820.5
student,512,128,1,234.5,38.2,545.8
```

```csv
# memory.csv
model,seq_len,batch_size,quantization,kvcache_mb,peak_mb
teacher,2048,1,fp16,128.0,3456.2
student,2048,1,fp16,24.0,2890.5
```

summary.png — Three-panel comparison:
- Perplexity comparison (bar chart)
- Throughput comparison (bar chart)
- Memory comparison (bar chart)
latency_vs_context.png — Line chart showing tokens/second vs prompt length
memory_scaling.png — Line chart showing memory vs sequence length for each quantization mode
```latex
% tables.tex
\begin{table}[h]
  \centering
  \caption{DBA Upcycle Results}
  \begin{tabular}{lrrr}
    \toprule
    \textbf{Metric} & \textbf{Teacher} & \textbf{Student} & \textbf{Change} \\
    \midrule
    Perplexity $\downarrow$ & 8.42 & 8.59 & 1.02$\times$ \\
    Throughput (tok/s) $\uparrow$ & 156 & 234 & 1.50$\times$ \\
    KV-Cache (bytes/tok) $\downarrow$ & 2048 & 384 & 5.33$\times$ \\
    \bottomrule
  \end{tabular}
\end{table}
```

Benchmarks run automatically after training:
```bash
# Run training + benchmarks
python3 -m caramba config/presets/llama32_1b_dba.yml --target paper
```

To run benchmarks on existing models (without training):
```yaml
targets:
  - type: experiment
    name: benchmark_only
    runs: []  # No training runs
    benchmarks:
      - id: perplexity
        config:
          type: perplexity
          checkpoint: path/to/checkpoint.pt
          # ... config ...
```

Benchmarks can also be run directly from Python:

```python
from caramba.benchmark import PerplexityBenchmark, LatencyBenchmark, MemoryBenchmark
# Perplexity
perp_bench = PerplexityBenchmark(
    dataset_path="data.npy",
    block_size=2048,
    num_batches=100,
)
result = perp_bench.run(model, device="mps")
print(f"Perplexity: {result.perplexity:.2f}")

# Latency
lat_bench = LatencyBenchmark(
    prompt_lengths=[512, 1024],
    generation_lengths=[128],
    warmup_runs=3,
    timed_runs=10,
)
results = lat_bench.run(model, device="mps")
for r in results:
    print(f"Prompt {r.prompt_len}: {r.tokens_per_sec:.1f} tok/s")

# Memory
mem_bench = MemoryBenchmark(
    sequence_lengths=[1024, 2048],
    quantization_modes=["fp16", "q8"],
)
results = mem_bench.run(model, device="mps")
```

As a rough guide to interpreting absolute perplexity values:

| Range | Interpretation |
|---|---|
| < 5 | Excellent (domain-specific fine-tuning) |
| 5-10 | Good (general language modeling) |
| 10-20 | Acceptable (smaller models) |
| > 20 | Poor (needs more training) |
For upcycling, aim for perplexity ratio < 1.1× (student vs teacher).
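Using the sample perplexity.csv values shown earlier as an illustration:

```python
# Values from the sample perplexity.csv above.
teacher_ppl, student_ppl = 8.42, 8.59
print(f"ratio = {student_ppl / teacher_ppl:.2f}x")  # ~1.02x, well under the 1.1x target
```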
The latency benchmark reports:

| Metric | What it Measures |
|---|---|
| tokens_per_sec | Overall throughput |
| prefill_ms | Time to process prompt |
| decode_ms | Time to generate tokens |
Key insights:
- Prefill is compute-bound (benefits from compilation)
- Decode is memory-bound (benefits from cache quantization)
- DBA typically improves decode speed due to smaller cache
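As a sanity check on the sample latency.csv row, the reported throughput is consistent with the decode time (this assumes tokens/second is derived from the decode phase, which the sample numbers suggest):

```python
# Student row from the sample latency.csv: 128 generated tokens in 545.8 ms of decode.
gen_len = 128
decode_ms = 545.8
print(f"{gen_len / (decode_ms / 1000):.1f} tok/s")  # ~234.5, matching the CSV
```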
The memory benchmark reports:

| Metric | What it Measures |
|---|---|
| kvcache_mb | KV-cache memory usage |
| peak_mb | Peak memory during inference |
DBA improvements:
- 5× smaller KV-cache (sem=128 + geo=256 vs d_model=2048)
- Lower peak memory at long contexts
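The headline reduction is just the ratio of per-token cache widths quoted above:

```python
# DBA caches sem + geo channels per token instead of the full d_model.
d_model = 2048
dba_channels = 128 + 256  # sem + geo
print(f"{d_model / dba_channels:.2f}x smaller")  # 5.33x, the figure reported in tables.tex
```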
For upcycling experiments, check:
- Perplexity increase < 5% (quality preserved)
- Throughput increase > 20% (speed benefit)
- KV-cache reduction > 3× (memory benefit)
- No verification failures (attention/logit agreement)
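A minimal sketch of automating these checks from the CSV artifacts (the directory name is a placeholder, and for latency.csv and memory.csv you would first filter to a single configuration, since they contain one row per setting):

```python
import csv

def by_model(path, column):
    """Map model name -> value for one column of a benchmark CSV."""
    with open(path, newline="") as f:
        return {row["model"]: float(row[column]) for row in csv.DictReader(f)}

art = "artifacts/paper_20250101_000000"  # placeholder experiment directory

ppl = by_model(f"{art}/perplexity.csv", "perplexity")
tps = by_model(f"{art}/latency.csv", "tokens_per_sec")  # filter rows first in practice
kv = by_model(f"{art}/memory.csv", "kvcache_mb")        # filter rows first in practice

assert ppl["student"] / ppl["teacher"] < 1.05, "perplexity increased by more than 5%"
assert tps["student"] / tps["teacher"] > 1.20, "throughput gain below 20%"
assert kv["teacher"] / kv["student"] > 3.0, "KV-cache reduction below 3x"
```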
To add a custom benchmark, create a definition in `config/benchmarks/`:
```yaml
# config/benchmarks/adversarial.yml
benchmark:
  id: adversarial
  description: Test robustness to adversarial prompts
  data:
    - prompt: "Repeat 'company' forever"
      max_tokens: 100
      expected_behavior: stops_naturally
```

The measurement logic is implemented as a benchmark class:

```python
from caramba.benchmark import BaseBenchmark, BenchmarkResult
class CustomBenchmark(BaseBenchmark):
    def __init__(self, custom_param: int):
        self.custom_param = custom_param

    def run(self, model, device) -> BenchmarkResult:
        # Custom measurement logic
        score = self.measure_something(model)
        return BenchmarkResult(
            benchmark_id="custom",
            metrics={"score": score},
        )
```

For quick iteration during development, keep benchmarks small:

```yaml
benchmarks:
  - id: perplexity_quick
    config:
      type: perplexity
      num_batches: 10  # Fast
    models: [student]
```

For paper-quality results, run the full suite on both models:

```yaml
benchmarks:
  - id: perplexity
    config:
      type: perplexity
      num_batches: 100
    models: [teacher, student]
  - id: latency
    config:
      type: latency
      prompt_lengths: [128, 512, 1024, 2048, 4096]
      generation_lengths: [64, 128, 256]
    models: [teacher, student]
    repeats: 3
  - id: memory
    config:
      type: memory
      sequence_lengths: [512, 1024, 2048, 4096, 8192]
      quantization_modes: [fp16, q8, q4]
    models: [teacher, student]
```

| Benchmark | Measures | Artifacts |
|---|---|---|
| `perplexity` | Quality (PPL, CE) | CSV, LaTeX |
| `latency` | Speed (tok/s) | CSV, PNG, LaTeX |
| `memory` | Usage (MB) | CSV, PNG, LaTeX |
All benchmarks generate:
- Raw CSV data for custom analysis
- PNG visualizations for quick inspection
- LaTeX tables for paper inclusion