4 changes: 3 additions & 1 deletion .gitignore
@@ -148,5 +148,7 @@ dmypy.json
# Cython debug symbols
cython_debug/

# DS_STORE
.DS_Store

benchmark_results
349 changes: 349 additions & 0 deletions BENCHMARK.md
@@ -0,0 +1,349 @@
# PyDeequ Benchmark

Benchmark harness for comparing DuckDB and Spark engine performance.

## Design Overview

### Architecture

```
benchmark_cli.py       # CLI entry point
benchmark/
├── config.py          # Configuration dataclasses
├── experiments.py     # Experiment logic (data gen, checks, profiling)
├── worker.py          # Subprocess worker for process isolation
├── spark_server.py    # Auto Spark Connect server management
├── results.py         # Results storage and merging
├── report.py          # Markdown report generation
└── visualize.py       # PNG chart generation
```

### Process Isolation

Each engine runs in a separate subprocess (see the sketch after this list) to ensure:
- Clean JVM state for Spark
- Independent memory allocation
- No cross-contamination between engines
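
A minimal sketch of how that isolation can be driven, assuming `benchmark/worker.py` reads its experiment config from stdin and prints timings as JSON. The invocation and names here are illustrative, not the harness's actual interface:

```python
import json
import subprocess
import sys

def run_engine_isolated(engine: str, params: dict) -> dict:
    """Run one engine's benchmark in a fresh subprocess (illustrative sketch)."""
    # A new interpreter means a fresh JVM for Spark and a fresh memory
    # arena for DuckDB, so runs cannot contaminate each other.
    proc = subprocess.run(
        [sys.executable, "-m", "benchmark.worker", engine],
        input=json.dumps(params),   # assumed: config passed on stdin
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)  # assumed: worker prints timings as JSON
```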

### Data Pipeline

1. **Generate** synthetic mixed-type data (strings, floats, ints)
2. **Cache** it as Parquet files with optimized row groups
3. **Load** the same Parquet files into both engines for a fair comparison (sketched below)
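
A sketch of the generate-and-cache step, assuming pandas and pyarrow handle the Parquet write; the actual generator in `benchmark/experiments.py` may differ:

```python
import os
import numpy as np
import pandas as pd

CACHE_DIR = os.path.expanduser("~/.deequ_benchmark_data")

def cached_dataset(rows: int, cols: int) -> str:
    """Generate mixed-type data once and cache it as Parquet (illustrative)."""
    path = os.path.join(CACHE_DIR, f"data_{rows}x{cols}.parquet")
    if not os.path.exists(path):
        rng = np.random.default_rng(42)
        data = {}
        for i in range(cols):
            if i % 3 == 0:
                data[f"col{i}"] = rng.choice(["a", "b", "c"], size=rows)  # strings
            elif i % 3 == 1:
                data[f"col{i}"] = rng.normal(size=rows)                   # floats
            else:
                data[f"col{i}"] = rng.integers(0, 1_000, size=rows)       # ints
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Fixed row-group size so both engines read an identical layout.
        pd.DataFrame(data).to_parquet(path, row_group_size=1_000_000)
    return path
```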

## Experiments

### 1. Varying Rows
- Fixed: 10 columns, 16 data quality checks
- Variable: 100K to 130M rows
- Measures: Validation time scaling with data size

### 2. Varying Columns
- Fixed: 1M rows
- Variable: 10 to 80 columns (16 to 226 checks)
- Measures: Validation time scaling with schema complexity

### 3. Column Profiling
- Fixed: 10 columns
- Variable: 100K to 10M rows
- Measures: Full column profiling performance
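
Each reported time is an average over repeated runs (3 by default, see Configuration below). A minimal timing-loop sketch, with illustrative names:

```python
import statistics
import time

def time_validation(run_checks, runs: int = 3) -> float:
    """Average wall-clock seconds over `runs` executions (illustrative)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_checks()  # engine-specific callable that executes the check suite
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```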

## Results

Benchmarks were run on an Apple M3 Max (14 cores) under macOS (Darwin 25.2.0).

![Benchmark Results](imgs/benchmark_chart.png)

### Experiment 1: Varying Rows

| Rows | DuckDB (s) | Spark (s) | Speedup |
|------|------------|-----------|---------|
| 100K | 0.034 | 0.662 | **19.5x** |
| 1M | 0.071 | 1.648 | **23.2x** |
| 5M | 0.167 | 2.470 | **14.8x** |
| 10M | 0.268 | 3.239 | **12.1x** |
| 50M | 1.114 | 12.448 | **11.2x** |
| 130M | 2.752 | 28.404 | **10.3x** |

### Experiment 2: Varying Columns

| Cols | Checks | DuckDB (s) | Spark (s) | Speedup |
|------|--------|------------|-----------|---------|
| 10 | 16 | 0.076 | 1.619 | **21.3x** |
| 20 | 46 | 0.081 | 2.078 | **25.7x** |
| 40 | 106 | 0.121 | 2.781 | **23.0x** |
| 80 | 226 | 0.177 | 4.258 | **24.1x** |

### Experiment 3: Column Profiling

| Rows | DuckDB (s) | Spark (s) | Speedup |
|------|------------|-----------|---------|
| 100K | 0.045 | 0.585 | **13.0x** |
| 1M | 0.288 | 0.720 | **2.5x** |
| 5M | 1.524 | 2.351 | **1.5x** |
| 10M | 2.993 | 3.975 | **1.3x** |

### Key Takeaways

1. **DuckDB is 10-23x faster** for row-scaling validation workloads
2. **Consistent speedup across complexity** - 21-26x speedup regardless of column count
3. **Profiling converges** - at 10M rows, DuckDB is still 1.3x faster
4. **No JVM overhead** - DuckDB runs in-process via its Python bindings, so there is no JVM startup cost

## Performance Optimizations

The DuckDB engine includes several optimizations to maintain performance as check complexity increases:

### Optimization 1: Grouping Operator Batching

Grouping operators (Distinctness, Uniqueness, UniqueValueRatio) that share the same columns and WHERE clause are fused into a single query per group.

**Before**: N queries for N grouping operators on the same columns
```sql
-- Query 1: Distinctness
WITH freq AS (SELECT cols, COUNT(*) AS cnt FROM t GROUP BY cols)
SELECT COUNT(*) AS distinct_count, SUM(cnt) AS total_count FROM freq

-- Query 2: Uniqueness
WITH freq AS (SELECT cols, COUNT(*) AS cnt FROM t GROUP BY cols)
SELECT SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS unique_count, SUM(cnt) AS total_count FROM freq
```

**After**: 1 query computing all metrics
```sql
WITH freq AS (SELECT cols, COUNT(*) AS cnt FROM t GROUP BY cols)
SELECT
    COUNT(*) AS distinct_count,
    SUM(cnt) AS total_count,
    SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS unique_count
FROM freq
```

**Impact**: 20-40% improvement for checks with multiple grouping operators
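
Conceptually, the fusion buckets operators by their (columns, WHERE) key, and each bucket contributes its aggregate expressions to one shared query. A sketch assuming an illustrative operator interface (`.columns`, `.where`, `.select_exprs()`), not the engine's actual API:

```python
from collections import defaultdict

def fuse_grouping_operators(operators):
    """Emit one query per (columns, WHERE) group (illustrative sketch)."""
    groups = defaultdict(list)
    for op in operators:
        # Operators sharing columns + filter land in the same bucket.
        groups[(tuple(op.columns), op.where)].append(op)

    queries = []
    for (columns, where), ops in groups.items():
        cols = ", ".join(columns)
        filt = f" WHERE {where}" if where else ""
        # Each operator contributes its aggregates over the shared freq CTE.
        selects = ",\n    ".join(e for op in ops for e in op.select_exprs())
        queries.append(
            f"WITH freq AS (SELECT {cols}, COUNT(*) AS cnt "
            f"FROM t{filt} GROUP BY {cols})\n"
            f"SELECT\n    {selects}\nFROM freq"
        )
    return queries
```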

### Optimization 2: Multi-Column Profiling

Profile statistics for all columns are batched into 2-3 queries total, instead of 2-3 queries per column.

**Before**: 20-30 queries for 10 columns
```sql
-- Per-column queries for completeness, numeric stats, percentiles
SELECT COUNT(*), SUM(CASE WHEN col1 IS NULL...) FROM t
SELECT MIN(col1), MAX(col1), AVG(col1)... FROM t
SELECT QUANTILE_CONT(col1, 0.25)... FROM t
-- Repeated for each column
```

**After**: 3 queries total
```sql
-- Query 1: All completeness stats
SELECT COUNT(*), SUM(CASE WHEN col1 IS NULL...), SUM(CASE WHEN col2 IS NULL...)... FROM t

-- Query 2: All numeric stats
SELECT MIN(col1), MAX(col1), MIN(col2), MAX(col2)... FROM t

-- Query 3: All percentiles
SELECT QUANTILE_CONT(col1, 0.25), QUANTILE_CONT(col2, 0.25)... FROM t
```

**Impact**: 40-60% improvement for column profiling
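
The batching itself reduces to assembling per-column aggregate expressions into shared SELECT lists. A sketch, assuming column types have already been inspected (function and alias names are illustrative):

```python
def build_profile_queries(table, all_cols, numeric_cols):
    """Assemble the three batched profiling queries above (illustrative)."""
    completeness = ", ".join(
        f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}_nulls"
        for c in all_cols
    )
    numeric = ", ".join(
        f"MIN({c}) AS {c}_min, MAX({c}) AS {c}_max, AVG({c}) AS {c}_avg"
        for c in numeric_cols
    )
    percentiles = ", ".join(
        f"QUANTILE_CONT({c}, 0.25) AS {c}_q25" for c in numeric_cols
    )
    return [
        f"SELECT COUNT(*), {completeness} FROM {table}",  # all completeness
        f"SELECT {numeric} FROM {table}",                 # all numeric stats
        f"SELECT {percentiles} FROM {table}",             # all percentiles
    ]
```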

### Optimization 3: DuckDB Configuration

Configurable engine settings optimize DuckDB for analytical workloads:

```python
from pydeequ.engines.duckdb_config import DuckDBEngineConfig

config = DuckDBEngineConfig(
    threads=8,                       # Control parallelism
    memory_limit="8GB",              # Memory management
    preserve_insertion_order=False,  # Better parallel execution
    parquet_metadata_cache=True,     # Faster Parquet reads
)

engine = DuckDBEngine(con, table="test", config=config)
```

**Impact**: 5-15% improvement for large parallel scans

### Optimization 4: Constraint Batching

Scan-based constraints (Size, Completeness, Mean, etc.) and ratio-check constraints (isPositive, isContainedIn, etc.) are batched into minimal queries.

**Before**: 1 query per constraint
```sql
SELECT COUNT(*) FROM t -- Size
SELECT COUNT(*), SUM(CASE WHEN col IS NULL...) FROM t -- Completeness
SELECT AVG(col) FROM t -- Mean
```

**After**: 1 query for all scan-based constraints
```sql
SELECT
    COUNT(*) AS size,
    SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS null_count,
    AVG(col) AS mean
FROM t
```

**Impact**: 20-40% improvement for checks with many constraints
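
A sketch of how each constraint might contribute one aggregate expression to a single scan, including the ratio-check shape; the `.kind`/`.column` constraint interface is illustrative, not the real API:

```python
def batch_scan_constraints(table, constraints):
    """Fold every scan-based constraint into one SELECT (illustrative)."""
    exprs = {"size": "COUNT(*)"}  # Size comes along for free
    for c in constraints:
        if c.kind == "completeness":
            exprs[f"{c.column}_nulls"] = (
                f"SUM(CASE WHEN {c.column} IS NULL THEN 1 ELSE 0 END)")
        elif c.kind == "mean":
            exprs[f"{c.column}_mean"] = f"AVG({c.column})"
        elif c.kind == "is_positive":
            # Ratio checks become an AVG over a CASE predicate.
            exprs[f"{c.column}_pos_ratio"] = (
                f"AVG(CASE WHEN {c.column} > 0 THEN 1.0 ELSE 0.0 END)")
    select_list = ",\n    ".join(f"{e} AS {a}" for a, e in exprs.items())
    return f"SELECT\n    {select_list}\nFROM {table}"
```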

### Optimization 5: Query Profiling Infrastructure

Built-in profiling helps identify bottlenecks and verify optimizations:

```python
engine = DuckDBEngine(con, table="test", enable_profiling=True)
engine.run_checks([check])

# Get query statistics
stats = engine.get_query_stats()
print(f"Query count: {engine.get_query_count()}")
print(stats)

# Get query plan for analysis
plan = engine.explain_query("SELECT COUNT(*) FROM test")
```

### Measured Performance Improvements

Benchmark comparison: Baseline (2026-01-20) vs After Optimization (2026-01-21, 5-run average)

#### Experiment 2: Varying Columns (KEY METRIC - Speedup Degradation Fix)

| Cols | Checks | Before DuckDB | After DuckDB | Spark | Before Speedup | After Speedup |
|------|--------|---------------|--------------|-------|----------------|---------------|
| 10 | 16 | 0.118s | 0.076s | 1.619s | 14.1x | **21.3x** |
| 20 | 46 | 0.286s | 0.081s | 2.078s | 7.5x | **25.7x** |
| 40 | 106 | 0.713s | 0.121s | 2.781s | 4.0x | **23.0x** |
| 80 | 226 | 2.214s | 0.177s | 4.258s | 2.0x | **24.1x** |

**Key achievement**: the speedup degradation problem is solved.
- **Before**: speedup degraded from 14x (10 cols) down to 2x (80 cols)
- **After**: speedup is a consistent **~21-26x** across all column counts

#### DuckDB-Only Performance Gains

| Cols | Before | After | Improvement |
|------|--------|-------|-------------|
| 10 | 0.118s | 0.076s | 36% faster |
| 20 | 0.286s | 0.081s | 72% faster |
| 40 | 0.713s | 0.121s | 83% faster |
| 80 | 2.214s | 0.177s | **92% faster (~12x)** |

#### Experiment 1: Varying Rows (16 checks)

| Rows | Before | After | Improvement |
|------|--------|-------|-------------|
| 100K | 0.052s | 0.034s | 35% faster |
| 1M | 0.090s | 0.071s | 21% faster |
| 5M | 0.221s | 0.167s | 24% faster |
| 10M | 0.335s | 0.268s | 20% faster |
| 50M | 1.177s | 1.114s | 5% faster |
| 130M | 2.897s | 2.752s | 5% faster |

#### Experiment 3: Column Profiling (10 columns)

| Rows | Before | After | Change |
|------|--------|-------|--------|
| 100K | 0.086s | 0.045s | 48% faster |
| 1M | 0.388s | 0.288s | 26% faster |
| 5M | 1.470s | 1.524s | 4% slower |
| 10M | 2.659s | 2.993s | 13% slower |

Note: Profiling shows a slight regression at the highest row counts due to batched-query overhead; this is the trade-off for the much larger gains in column scaling.

## Quick Start

### Run DuckDB Only (No Spark Required)

```bash
python benchmark_cli.py run --engine duckdb
```

### Run Both Engines

```bash
python benchmark_cli.py run --engine all
```

Auto-spark is enabled by default. The harness will:
1. Start a Spark Connect server
2. Run DuckDB benchmarks
3. Run Spark benchmarks
4. Stop the server
5. Merge results
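
A sketch of that lifecycle, assuming a standard Spark distribution whose `sbin/` directory provides the Connect server scripts (Spark 3.4/3.5 may additionally need a `--packages` flag); `run_benchmarks` and `merge_results` are illustrative callables, not the harness's API:

```python
import os
import subprocess

def run_all_engines(run_benchmarks, merge_results):
    """Start Spark Connect, benchmark both engines, tear down (illustrative)."""
    sbin = os.path.join(os.environ["SPARK_HOME"], "sbin")
    subprocess.run([os.path.join(sbin, "start-connect-server.sh")], check=True)
    try:
        duckdb_results = run_benchmarks("duckdb")
        spark_results = run_benchmarks("spark")
    finally:
        # Stop the server even if a benchmark run fails.
        subprocess.run([os.path.join(sbin, "stop-connect-server.sh")], check=True)
    return merge_results(duckdb_results, spark_results)
```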

### Run with External Spark Server

```bash
# Start server manually first, then:
python benchmark_cli.py run --engine spark --no-auto-spark
```

## Output Structure

Each run creates a timestamped folder:

```
benchmark_results/
└── benchmark_2024-01-19T14-30-45/
    ├── results.json             # Raw timing data
    └── BENCHMARK_RESULTS.md     # Markdown report
```

## Visualize Results

Generate a PNG chart comparing engine performance:

```bash
# From run folder
python benchmark_cli.py visualize benchmark_results/benchmark_2024-01-19T14-30-45/

# Custom output path
python benchmark_cli.py visualize benchmark_results/benchmark_2024-01-19T14-30-45/ -o comparison.png
```

The chart shows:
- **Top row**: Time comparisons (DuckDB vs Spark) for each experiment
- **Bottom row**: Speedup ratios (how many times faster DuckDB is)

## Regenerate Report

```bash
python benchmark_cli.py report benchmark_results/benchmark_2024-01-19T14-30-45/
```

## Configuration

Default experiment parameters (see `benchmark/config.py`):

| Parameter | Default |
|-----------|---------|
| Row counts | 100K, 1M, 5M, 10M, 50M, 130M |
| Column counts | 10, 20, 40, 80 |
| Profiling rows | 100K, 1M, 5M, 10M |
| Validation runs | 3 (averaged) |
| Cache directory | `~/.deequ_benchmark_data` |
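
As a dataclass, these defaults might look like the following (field names are illustrative; the real ones live in `benchmark/config.py`):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkConfig:
    """Illustrative mirror of the defaults above (not the actual class)."""
    row_counts: list = field(default_factory=lambda: [
        100_000, 1_000_000, 5_000_000, 10_000_000, 50_000_000, 130_000_000])
    column_counts: list = field(default_factory=lambda: [10, 20, 40, 80])
    profiling_rows: list = field(default_factory=lambda: [
        100_000, 1_000_000, 5_000_000, 10_000_000])
    validation_runs: int = 3                    # each timing is a 3-run average
    cache_dir: str = "~/.deequ_benchmark_data"  # Parquet cache location
```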

## Requirements

- **DuckDB**: No additional setup
- **Spark**: Requires `SPARK_HOME` and `JAVA_HOME` environment variables (or use `--spark-home`/`--java-home` flags)

## Example Workflow

```bash
# 1. Run full benchmark
python benchmark_cli.py run --engine all

# 2. View results
cat benchmark_results/benchmark_*/BENCHMARK_RESULTS.md

# 3. Generate chart
python benchmark_cli.py visualize benchmark_results/benchmark_*/

# 4. Open chart
open benchmark_results/benchmark_*/benchmark_chart.png
```