4 changes: 3 additions & 1 deletion .gitignore
@@ -148,5 +148,7 @@ dmypy.json
# Cython debug symbols
cython_debug/

# DS_STORE
.DS_Store

benchmark_results
349 changes: 349 additions & 0 deletions BENCHMARK.md
@@ -0,0 +1,349 @@
# PyDeequ Benchmark

Benchmark harness for comparing DuckDB and Spark engine performance.

## Design Overview

### Architecture

```
benchmark_cli.py       # CLI entry point
benchmark/
├── config.py          # Configuration dataclasses
├── experiments.py     # Experiment logic (data gen, checks, profiling)
├── worker.py          # Subprocess worker for process isolation
├── spark_server.py    # Auto Spark Connect server management
├── results.py         # Results storage and merging
├── report.py          # Markdown report generation
└── visualize.py       # PNG chart generation
```

### Process Isolation

Each engine runs in a separate subprocess (see the sketch after this list) to ensure:
- Clean JVM state for Spark
- Independent memory allocation
- No cross-contamination between engines
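
A minimal sketch of how that isolation can be driven, assuming `benchmark/worker.py` reads its experiment config from stdin and prints timings as JSON. The invocation and names here are illustrative, not the harness's actual interface:

```python
import json
import subprocess
import sys

def run_engine_isolated(engine: str, params: dict) -> dict:
    """Run one engine's benchmark in a fresh subprocess (illustrative sketch)."""
    # A new interpreter means a fresh JVM for Spark and a fresh memory
    # arena for DuckDB, so runs cannot contaminate each other.
    proc = subprocess.run(
        [sys.executable, "-m", "benchmark.worker", engine],
        input=json.dumps(params),   # assumed: config passed on stdin
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)  # assumed: worker prints timings as JSON
```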

### Data Pipeline

1. **Generate** synthetic mixed-type data (strings, floats, ints)
2. **Cache** it as Parquet files with optimized row groups
3. **Load** the same Parquet files into both engines for a fair comparison (sketched below)
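
A sketch of the generate-and-cache step, assuming pandas and pyarrow handle the Parquet write; the actual generator in `benchmark/experiments.py` may differ:

```python
import os
import numpy as np
import pandas as pd

CACHE_DIR = os.path.expanduser("~/.deequ_benchmark_data")

def cached_dataset(rows: int, cols: int) -> str:
    """Generate mixed-type data once and cache it as Parquet (illustrative)."""
    path = os.path.join(CACHE_DIR, f"data_{rows}x{cols}.parquet")
    if not os.path.exists(path):
        rng = np.random.default_rng(42)
        data = {}
        for i in range(cols):
            if i % 3 == 0:
                data[f"col{i}"] = rng.choice(["a", "b", "c"], size=rows)  # strings
            elif i % 3 == 1:
                data[f"col{i}"] = rng.normal(size=rows)                   # floats
            else:
                data[f"col{i}"] = rng.integers(0, 1_000, size=rows)       # ints
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Fixed row-group size so both engines read an identical layout.
        pd.DataFrame(data).to_parquet(path, row_group_size=1_000_000)
    return path
```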

## Experiments

### 1. Varying Rows
- Fixed: 10 columns, 16 data quality checks
- Variable: 100K to 130M rows
- Measures: Validation time scaling with data size

### 2. Varying Columns
- Fixed: 1M rows
- Variable: 10 to 80 columns (16 to 226 checks)
- Measures: Validation time scaling with schema complexity

### 3. Column Profiling
- Fixed: 10 columns
- Variable: 100K to 10M rows
- Measures: Full column profiling performance
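
Each reported time is an average over repeated runs (3 by default, see Configuration below). A minimal timing-loop sketch, with illustrative names:

```python
import statistics
import time

def time_validation(run_checks, runs: int = 3) -> float:
    """Average wall-clock seconds over `runs` executions (illustrative)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_checks()  # engine-specific callable that executes the check suite
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```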

## Results

Benchmarks were run on an Apple M3 Max (14 cores) under macOS (Darwin 25.2.0).

![Benchmark Results](imgs/benchmark_chart.png)

### Experiment 1: Varying Rows

| Rows | DuckDB (s) | Spark (s) | Speedup |
|------|------------|-----------|---------|
| 100K | 0.034 | 0.662 | **19.5x** |
| 1M | 0.071 | 1.648 | **23.2x** |
| 5M | 0.167 | 2.470 | **14.8x** |
| 10M | 0.268 | 3.239 | **12.1x** |
| 50M | 1.114 | 12.448 | **11.2x** |
| 130M | 2.752 | 28.404 | **10.3x** |

### Experiment 2: Varying Columns

| Cols | Checks | DuckDB (s) | Spark (s) | Speedup |
|------|--------|------------|-----------|---------|
| 10 | 16 | 0.076 | 1.619 | **21.3x** |
| 20 | 46 | 0.081 | 2.078 | **25.7x** |
| 40 | 106 | 0.121 | 2.781 | **23.0x** |
| 80 | 226 | 0.177 | 4.258 | **24.1x** |

### Experiment 3: Column Profiling

| Rows | DuckDB (s) | Spark (s) | Speedup |
|------|------------|-----------|---------|
| 100K | 0.045 | 0.585 | **13.0x** |
| 1M | 0.288 | 0.720 | **2.5x** |
| 5M | 1.524 | 2.351 | **1.5x** |
| 10M | 2.993 | 3.975 | **1.3x** |

### Key Takeaways

1. **DuckDB is 10-23x faster** for row-scaling validation workloads
2. **Consistent speedup across complexity** - 21-26x speedup regardless of column count
3. **Profiling converges** - at 10M rows, DuckDB is still 1.3x faster
4. **No JVM overhead** - DuckDB runs in-process via its Python bindings, so there is no JVM startup cost

## Performance Optimizations

The DuckDB engine includes several optimizations to maintain performance as check complexity increases:

### Optimization 1: Grouping Operator Batching

Grouping operators (Distinctness, Uniqueness, UniqueValueRatio) that share the same columns and WHERE clause are fused into a single query per group.

**Before**: N queries for N grouping operators on the same columns
```sql
-- Query 1: Distinctness
WITH freq AS (SELECT cols, COUNT(*) AS cnt FROM t GROUP BY cols)
SELECT COUNT(*) AS distinct_count, SUM(cnt) AS total_count FROM freq

-- Query 2: Uniqueness
WITH freq AS (SELECT cols, COUNT(*) AS cnt FROM t GROUP BY cols)
SELECT SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS unique_count, SUM(cnt) AS total_count FROM freq
```

**After**: 1 query computing all metrics
```sql
WITH freq AS (SELECT cols, COUNT(*) AS cnt FROM t GROUP BY cols)
SELECT
    COUNT(*) AS distinct_count,
    SUM(cnt) AS total_count,
    SUM(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS unique_count
FROM freq
```

**Impact**: 20-40% improvement for checks with multiple grouping operators
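
Conceptually, the fusion buckets operators by their (columns, WHERE) key, and each bucket contributes its aggregate expressions to one shared query. A sketch assuming an illustrative operator interface (`.columns`, `.where`, `.select_exprs()`), not the engine's actual API:

```python
from collections import defaultdict

def fuse_grouping_operators(operators):
    """Emit one query per (columns, WHERE) group (illustrative sketch)."""
    groups = defaultdict(list)
    for op in operators:
        # Operators sharing columns + filter land in the same bucket.
        groups[(tuple(op.columns), op.where)].append(op)

    queries = []
    for (columns, where), ops in groups.items():
        cols = ", ".join(columns)
        filt = f" WHERE {where}" if where else ""
        # Each operator contributes its aggregates over the shared freq CTE.
        selects = ",\n    ".join(e for op in ops for e in op.select_exprs())
        queries.append(
            f"WITH freq AS (SELECT {cols}, COUNT(*) AS cnt "
            f"FROM t{filt} GROUP BY {cols})\n"
            f"SELECT\n    {selects}\nFROM freq"
        )
    return queries
```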

### Optimization 2: Multi-Column Profiling

Profile statistics for all columns are batched into 2-3 queries total, instead of 2-3 queries per column.

**Before**: 20-30 queries for 10 columns
```sql
-- Per-column queries for completeness, numeric stats, percentiles
SELECT COUNT(*), SUM(CASE WHEN col1 IS NULL...) FROM t
SELECT MIN(col1), MAX(col1), AVG(col1)... FROM t
SELECT QUANTILE_CONT(col1, 0.25)... FROM t
-- Repeated for each column
```

**After**: 3 queries total
```sql
-- Query 1: All completeness stats
SELECT COUNT(*), SUM(CASE WHEN col1 IS NULL...), SUM(CASE WHEN col2 IS NULL...)... FROM t

-- Query 2: All numeric stats
SELECT MIN(col1), MAX(col1), MIN(col2), MAX(col2)... FROM t

-- Query 3: All percentiles
SELECT QUANTILE_CONT(col1, 0.25), QUANTILE_CONT(col2, 0.25)... FROM t
```

**Impact**: 40-60% improvement for column profiling
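
The batching itself reduces to assembling per-column aggregate expressions into shared SELECT lists. A sketch, assuming column types have already been inspected (function and alias names are illustrative):

```python
def build_profile_queries(table, all_cols, numeric_cols):
    """Assemble the three batched profiling queries above (illustrative)."""
    completeness = ", ".join(
        f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}_nulls"
        for c in all_cols
    )
    numeric = ", ".join(
        f"MIN({c}) AS {c}_min, MAX({c}) AS {c}_max, AVG({c}) AS {c}_avg"
        for c in numeric_cols
    )
    percentiles = ", ".join(
        f"QUANTILE_CONT({c}, 0.25) AS {c}_q25" for c in numeric_cols
    )
    return [
        f"SELECT COUNT(*), {completeness} FROM {table}",  # all completeness
        f"SELECT {numeric} FROM {table}",                 # all numeric stats
        f"SELECT {percentiles} FROM {table}",             # all percentiles
    ]
```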

### Optimization 3: DuckDB Configuration

Configurable engine settings optimize DuckDB for analytical workloads:

```python
from pydeequ.engines.duckdb_config import DuckDBEngineConfig

config = DuckDBEngineConfig(
    threads=8,                       # Control parallelism
    memory_limit="8GB",              # Memory management
    preserve_insertion_order=False,  # Better parallel execution
    parquet_metadata_cache=True,     # Faster Parquet reads
)

engine = DuckDBEngine(con, table="test", config=config)
```

**Impact**: 5-15% improvement for large parallel scans

### Optimization 4: Constraint Batching

Scan-based constraints (Size, Completeness, Mean, etc.) and ratio-check constraints (isPositive, isContainedIn, etc.) are batched into minimal queries.

**Before**: 1 query per constraint
```sql
SELECT COUNT(*) FROM t -- Size
SELECT COUNT(*), SUM(CASE WHEN col IS NULL...) FROM t -- Completeness
SELECT AVG(col) FROM t -- Mean
```

**After**: 1 query for all scan-based constraints
```sql
SELECT
    COUNT(*) AS size,
    SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) AS null_count,
    AVG(col) AS mean
FROM t
```

**Impact**: 20-40% improvement for checks with many constraints
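
A sketch of how each constraint might contribute one aggregate expression to a single scan, including the ratio-check shape; the `.kind`/`.column` constraint interface is illustrative, not the real API:

```python
def batch_scan_constraints(table, constraints):
    """Fold every scan-based constraint into one SELECT (illustrative)."""
    exprs = {"size": "COUNT(*)"}  # Size comes along for free
    for c in constraints:
        if c.kind == "completeness":
            exprs[f"{c.column}_nulls"] = (
                f"SUM(CASE WHEN {c.column} IS NULL THEN 1 ELSE 0 END)")
        elif c.kind == "mean":
            exprs[f"{c.column}_mean"] = f"AVG({c.column})"
        elif c.kind == "is_positive":
            # Ratio checks become an AVG over a CASE predicate.
            exprs[f"{c.column}_pos_ratio"] = (
                f"AVG(CASE WHEN {c.column} > 0 THEN 1.0 ELSE 0.0 END)")
    select_list = ",\n    ".join(f"{e} AS {a}" for a, e in exprs.items())
    return f"SELECT\n    {select_list}\nFROM {table}"
```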

### Optimization 5: Query Profiling Infrastructure

Built-in profiling helps identify bottlenecks and verify optimizations:

```python
engine = DuckDBEngine(con, table="test", enable_profiling=True)
engine.run_checks([check])

# Get query statistics
stats = engine.get_query_stats()
print(f"Query count: {engine.get_query_count()}")
print(stats)

# Get query plan for analysis
plan = engine.explain_query("SELECT COUNT(*) FROM test")
```

### Measured Performance Improvements

Benchmark comparison: Baseline (2026-01-20) vs After Optimization (2026-01-21, 5-run average)

#### Experiment 2: Varying Columns (KEY METRIC - Speedup Degradation Fix)

| Cols | Checks | Before DuckDB | After DuckDB | Spark | Before Speedup | After Speedup |
|------|--------|---------------|--------------|-------|----------------|---------------|
| 10 | 16 | 0.118s | 0.076s | 1.619s | 14.1x | **21.3x** |
| 20 | 46 | 0.286s | 0.081s | 2.078s | 7.5x | **25.7x** |
| 40 | 106 | 0.713s | 0.121s | 2.781s | 4.0x | **23.0x** |
| 80 | 226 | 2.214s | 0.177s | 4.258s | 2.0x | **24.1x** |

**Key achievement**: the speedup degradation problem is solved.
- **Before**: speedup degraded from 14x (10 cols) down to 2x (80 cols)
- **After**: speedup is a consistent **~21-26x** across all column counts

#### DuckDB-Only Performance Gains

| Cols | Before | After | Improvement |
|------|--------|-------|-------------|
| 10 | 0.118s | 0.076s | 36% faster |
| 20 | 0.286s | 0.081s | 72% faster |
| 40 | 0.713s | 0.121s | 83% faster |
| 80 | 2.214s | 0.177s | **92% faster (~12x)** |

#### Experiment 1: Varying Rows (16 checks)

| Rows | Before | After | Improvement |
|------|--------|-------|-------------|
| 100K | 0.052s | 0.034s | 35% faster |
| 1M | 0.090s | 0.071s | 21% faster |
| 5M | 0.221s | 0.167s | 24% faster |
| 10M | 0.335s | 0.268s | 20% faster |
| 50M | 1.177s | 1.114s | 5% faster |
| 130M | 2.897s | 2.752s | 5% faster |

#### Experiment 3: Column Profiling (10 columns)

| Rows | Before | After | Change |
|------|--------|-------|--------|
| 100K | 0.086s | 0.045s | 48% faster |
| 1M | 0.388s | 0.288s | 26% faster |
| 5M | 1.470s | 1.524s | 4% slower |
| 10M | 2.659s | 2.993s | 13% slower |

Note: Profiling shows a slight regression at the highest row counts due to batched-query overhead; this is the trade-off for the much larger gains in column scaling.

## Quick Start

### Run DuckDB Only (No Spark Required)

```bash
python benchmark_cli.py run --engine duckdb
```

### Run Both Engines

```bash
python benchmark_cli.py run --engine all
```

Auto-spark is enabled by default. The harness will:
1. Start a Spark Connect server
2. Run DuckDB benchmarks
3. Run Spark benchmarks
4. Stop the server
5. Merge results
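
A sketch of that lifecycle, assuming a standard Spark distribution whose `sbin/` directory provides the Connect server scripts (Spark 3.4/3.5 may additionally need a `--packages` flag); `run_benchmarks` and `merge_results` are illustrative callables, not the harness's API:

```python
import os
import subprocess

def run_all_engines(run_benchmarks, merge_results):
    """Start Spark Connect, benchmark both engines, tear down (illustrative)."""
    sbin = os.path.join(os.environ["SPARK_HOME"], "sbin")
    subprocess.run([os.path.join(sbin, "start-connect-server.sh")], check=True)
    try:
        duckdb_results = run_benchmarks("duckdb")
        spark_results = run_benchmarks("spark")
    finally:
        # Stop the server even if a benchmark run fails.
        subprocess.run([os.path.join(sbin, "stop-connect-server.sh")], check=True)
    return merge_results(duckdb_results, spark_results)
```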

### Run with External Spark Server

```bash
# Start server manually first, then:
python benchmark_cli.py run --engine spark --no-auto-spark
```

## Output Structure

Each run creates a timestamped folder:

```
benchmark_results/
└── benchmark_2024-01-19T14-30-45/
    ├── results.json             # Raw timing data
    └── BENCHMARK_RESULTS.md     # Markdown report
```

## Visualize Results

Generate a PNG chart comparing engine performance:

```bash
# From run folder
python benchmark_cli.py visualize benchmark_results/benchmark_2024-01-19T14-30-45/

# Custom output path
python benchmark_cli.py visualize benchmark_results/benchmark_2024-01-19T14-30-45/ -o comparison.png
```

The chart shows:
- **Top row**: Time comparisons (DuckDB vs Spark) for each experiment
- **Bottom row**: Speedup ratios (how many times faster DuckDB is)

## Regenerate Report

```bash
python benchmark_cli.py report benchmark_results/benchmark_2024-01-19T14-30-45/
```

## Configuration

Default experiment parameters (see `benchmark/config.py`):

| Parameter | Default |
|-----------|---------|
| Row counts | 100K, 1M, 5M, 10M, 50M, 130M |
| Column counts | 10, 20, 40, 80 |
| Profiling rows | 100K, 1M, 5M, 10M |
| Validation runs | 3 (averaged) |
| Cache directory | `~/.deequ_benchmark_data` |
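
As a dataclass, these defaults might look like the following (field names are illustrative; the real ones live in `benchmark/config.py`):

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkConfig:
    """Illustrative mirror of the defaults above (not the actual class)."""
    row_counts: list = field(default_factory=lambda: [
        100_000, 1_000_000, 5_000_000, 10_000_000, 50_000_000, 130_000_000])
    column_counts: list = field(default_factory=lambda: [10, 20, 40, 80])
    profiling_rows: list = field(default_factory=lambda: [
        100_000, 1_000_000, 5_000_000, 10_000_000])
    validation_runs: int = 3                    # each timing is a 3-run average
    cache_dir: str = "~/.deequ_benchmark_data"  # Parquet cache location
```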

## Requirements

- **DuckDB**: No additional setup
- **Spark**: Requires `SPARK_HOME` and `JAVA_HOME` environment variables (or use `--spark-home`/`--java-home` flags)

## Example Workflow

```bash
# 1. Run full benchmark
python benchmark_cli.py run --engine all

# 2. View results
cat benchmark_results/benchmark_*/BENCHMARK_RESULTS.md

# 3. Generate chart
python benchmark_cli.py visualize benchmark_results/benchmark_*/

# 4. Open chart
open benchmark_results/benchmark_*/benchmark_chart.png
```