Advanced data validation library unifying Pydantic and Pandera with multi-backend support for pandas and Polars.
- **Unified Validation**: Single schema for both record-level (Pydantic) and DataFrame-level (Pandera) validation
- **Multi-Backend Support**: Seamlessly switch between pandas and Polars without rewriting validation rules
- **Streaming Validation**: Efficiently validate large CSV, Parquet, and JSONL files that don't fit in memory
- **Auto-Fix Suggestions**: Intelligent suggestions for common data quality issues with one-click fixes
- **Data Profiling**: Generate statistical profiles and infer validation constraints automatically
- **Rich Reporting**: Beautiful console output, interactive HTML reports, and metrics export (Prometheus, OpenTelemetry)
- **Type-Safe**: Full type hints with mypy strict mode support
- **Production Ready**: Comprehensive test suite with >90% coverage, property-based testing, and benchmarks
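To make the "single schema, two levels" idea concrete, here is a minimal, library-independent sketch in which one rule set validates both a single record (a dict, Pydantic-style) and a whole DataFrame (Pandera-style). The `RULES` mapping and helper functions are illustrative assumptions, not this library's API:

```python
import pandas as pd

# One rule set, expressed as column -> predicate (hypothetical, not the library's API)
RULES = {
    "age": lambda v: 0 <= v <= 120,
    "score": lambda v: 0.0 <= v <= 100.0,
}

def validate_record(record: dict) -> bool:
    """Record-level check: one dict at a time."""
    return all(check(record[col]) for col, check in RULES.items())

def validate_frame(df: pd.DataFrame) -> bool:
    """DataFrame-level check: whole columns at once."""
    return all(bool(df[col].map(check).all()) for col, check in RULES.items())

print(validate_record({"age": 25, "score": 85.5}))   # True
print(validate_record({"age": 150, "score": 85.5}))  # False
```

The point of a unified schema is that both paths read from the same rule definitions, so record-level and frame-level validation cannot drift apart.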
```shell
pip install pandera-unified-validator
```

With optional dependencies:

```shell
# For Parquet support
pip install "pandera-unified-validator[parquet]"

# For database validation
pip install "pandera-unified-validator[database]"

# For data profiling
pip install "pandera-unified-validator[profiling]"

# All features
pip install "pandera-unified-validator[all]"
```

```python
import pandas as pd

from pandera_unified_validator import SchemaBuilder, UnifiedValidator

# Define schema with the fluent API
schema = (
    SchemaBuilder("user_schema")
    .add_column("user_id", int, unique=True, ge=0)
    .add_column("email", str, pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
    .add_column("age", int, ge=0, le=120)
    .add_column("score", float, ge=0.0, le=100.0)
    .build()
)
```
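The email pattern used above can be exercised on its own with Python's `re` module; this is just the regex from the schema, independent of the library:

```python
import re

# Same pattern as the "email" column constraint above
EMAIL_PATTERN = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")

def looks_like_email(value: str) -> bool:
    """Return True if the value matches the schema's email pattern."""
    return EMAIL_PATTERN.match(value) is not None

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("invalid-email"))     # False
```

Note that this pattern is a pragmatic check, not a full RFC 5322 validator; adjust it if your data contains unusual but legal addresses.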
```python
# Create a validator with auto-fix enabled
validator = UnifiedValidator(schema.to_validation_schema(), auto_fix=True)

# Validate your data
data = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["user@example.com", "invalid-email", "admin@test.org"],
    "age": [25, 150, 30],  # 150 is out of range
    "score": [85.5, 92.0, 78.5],
})

result = validator.validate(data)

# Check results
print(f"Valid: {result.is_valid}")
print(f"Errors: {len(result.errors)}")
print(f"Suggestions: {len(result.suggestions)}")

# Generate reports
from pandera_unified_validator import ValidationReporter

reporter = ValidationReporter(result)
reporter.to_console(verbose=True)   # Rich console output
reporter.to_html("report.html")     # Interactive HTML report
reporter.to_json("report.json")     # JSON export
```

| Feature | pandera-unified-validator | Pydantic | Pandera |
|---|---|---|---|
| Record validation | ✅ | ✅ | ❌ |
| DataFrame validation | ✅ | ❌ | ✅ |
| Unified schema | ✅ | ❌ | ❌ |
| Multi-backend (pandas/Polars) | ✅ | ❌ | ✅ |
| Streaming validation | ✅ | ❌ | ❌ |
| Auto-fix suggestions | ✅ | ❌ | ❌ |
| Data profiling | ✅ | ❌ | ❌ |
| HTML/JSON reports | ✅ | ❌ | ❌ |
| Metrics export | ✅ | ❌ | ❌ |
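Cross-column rules such as "Books must be priced under $10,000" (used in the advanced example below) reduce to a vectorized boolean mask in pandas, where each row passes unless it violates the conditional rule. This sketch is plain pandas, not the library's API:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Books", "Books", "Electronics"],
    "price": [9.99, 12_000.0, 12_000.0],
})

# A row passes unless it is a Book priced at $10,000 or more.
# Equivalent to "if category == 'Books' then price < 10_000".
mask = (df["category"] != "Books") | (df["price"] < 10_000)
print(mask.tolist())  # [True, False, True]
```

Writing the rule as a single mask (rather than a Python `if` on a Series) keeps it vectorized and avoids pandas' ambiguous-truth-value error.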
```python
from pandera_unified_validator import SchemaBuilder, UnifiedValidator, ValidationReporter

# Define a comprehensive product schema
schema = (
    SchemaBuilder("product_catalog")
    .add_column("product_id", str, unique=True, pattern=r"^PRD-\d{6}$")
    .add_column("name", str, nullable=False)
    .add_column("price", float, ge=0.01, le=1_000_000)
    .add_column("category", str, isin=["Electronics", "Clothing", "Books", "Home"])
    .add_column("stock_quantity", int, ge=0)
    .add_column("supplier_id", str, pattern=r"^SUP-\d{4}$")
    .add_cross_column_constraint(
        "price_check",
        ["price", "category"],
        # Vectorized rule: a row passes unless it is a Book priced at $10,000 or more
        lambda df: (df["category"] != "Books") | (df["price"] < 10_000),
        error_message="Books must be priced under $10,000",
    )
    .build()
)

# Validate with auto-fix
validator = UnifiedValidator(schema.to_validation_schema(), auto_fix=True)
result = validator.validate(products_df)

# Generate a comprehensive report
reporter = ValidationReporter(result)
reporter.to_console(verbose=True)
reporter.to_html("validation_report.html")

# Apply auto-fixes
if result.suggestions:
    fixed_df = validator.apply_fixes(products_df, result)
    print(f"Fixed {len(result.suggestions)} issues automatically")
```

```python
from pandera_unified_validator import SchemaBuilder, StreamingValidator

# Validate a large CSV without loading it into memory
schema = SchemaBuilder("transactions").add_column("amount", float, ge=0).build()
validator = StreamingValidator(schema, chunk_size=10_000, error_threshold=0.05)

# Async validation with a progress callback
async def progress_callback(metrics):
    print(f"Processed {metrics.total_rows} rows, {metrics.error_rate:.2%} error rate")

result = await validator.validate_csv(
    "large_transactions.csv",
    report_callback=progress_callback,
)

print(f"Total rows: {result.metrics.total_rows}")
print(f"Invalid rows: {result.metrics.invalid_rows}")
print(f"Processing time: {result.metrics.processing_time:.2f}s")
```

- User Guide - Complete tutorial and API reference
- Examples - 9 practical examples covering common use cases
- API Documentation - Auto-generated API docs
- Contributing Guide - How to contribute to the project
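The streaming approach shown earlier can be approximated with plain pandas chunked reads: validate the file in fixed-size chunks and stop early once the error rate crosses a threshold. This sketch illustrates the idea only; the column name, chunk size, and early-abort policy are assumptions, not StreamingValidator's internals:

```python
import io
import pandas as pd

def validate_amount_chunks(csv_file, chunk_size=2, error_threshold=0.5):
    """Count rows with amount < 0, chunk by chunk; abort if the error rate exceeds the threshold."""
    total = invalid = 0
    for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
        total += len(chunk)
        invalid += int((chunk["amount"] < 0).sum())
        if invalid / total > error_threshold:
            break  # too many bad rows seen so far; stop reading early
    return total, invalid

# Small in-memory CSV standing in for a large file on disk
csv = io.StringIO("amount\n10.0\n-5.0\n3.0\n-1.0\n")
total, invalid = validate_amount_chunks(csv)
print(total, invalid)  # 4 2
```

Because `read_csv(chunksize=...)` yields an iterator of DataFrames, peak memory stays bounded by the chunk size regardless of file size.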
```shell
# Clone the repository
git clone https://github.com/ianpinto/pandera-unified-validator.git
cd pandera-unified-validator

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v --cov=src/pandera_unified_validator

# Run linting
ruff check src/ tests/

# Run type checking
mypy src/

# Run formatting
black src/ tests/

# Run all checks
ruff check src/ && black --check src/ && mypy src/ && pytest
```

Contributions are welcome! Please read our Contributing Guide for details on:
- Code of conduct
- Development setup
- Testing requirements
- Code style guidelines
- Pull request process
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on top of Pydantic and Pandera
- Inspired by the need for unified data validation in production data pipelines
- Thanks to all contributors
If you use pandera-unified-validator in your research or production systems, please cite:
```bibtex
@software{pandera_unified_validator,
  title  = {pandera-unified-validator: Advanced data validation library},
  author = {Ian Pinto},
  year   = {2025},
  url    = {https://github.com/ianpinto/pandera-unified-validator}
}
```