Skip to content

pandera-unified-validator is an advanced data validation library that unifies Pydantic record validation with Pandera-powered DataFrame checks. It ships with pluggable pandas and Polars backends so analytics teams can reuse a single set of declarative rules across their favorite engines.

License

Notifications You must be signed in to change notification settings

iAn-P1nt0/pandera-unified-validator

Repository files navigation

pandera-unified-validator

PyPI version Python 3.10+ License: MIT CI Status Coverage

Advanced data validation library unifying Pydantic and Pandera with multi-backend support for pandas and Polars.

Features

  • πŸ” Unified Validation – Single schema for both record-level (Pydantic) and DataFrame-level (Pandera) validation
  • ⚑ Multi-Backend Support – Seamlessly switch between pandas and Polars without rewriting validation rules
  • πŸ“Š Streaming Validation – Efficiently validate large CSV, Parquet, and JSONL files that don't fit in memory
  • πŸ”§ Auto-Fix Suggestions – Intelligent suggestions for common data quality issues with one-click fixes
  • πŸ“ˆ Data Profiling – Generate statistical profiles and infer validation constraints automatically
  • πŸ“ Rich Reporting – Beautiful console output, interactive HTML reports, and metrics export (Prometheus, OpenTelemetry)
  • πŸ§ͺ Type-Safe – Full type hints with mypy strict mode support
  • πŸš€ Production Ready – Comprehensive test suite with >90% coverage, property-based testing, and benchmarks

Installation

pip install pandera-unified-validator

With optional dependencies:

# For Parquet support
pip install pandera-unified-validator[parquet]

# For database validation
pip install pandera-unified-validator[database]

# For data profiling
pip install pandera-unified-validator[profiling]

# All features
pip install pandera-unified-validator[all]

Quick Start (30 seconds)

import pandas as pd
from pandera_unified_validator import SchemaBuilder, UnifiedValidator

# Define schema with fluent API
schema = (
    SchemaBuilder("user_schema")
    .add_column("user_id", int, unique=True, ge=0)
    .add_column("email", str, pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
    .add_column("age", int, ge=0, le=120)
    .add_column("score", float, ge=0.0, le=100.0)
    .build()
)

# Create validator with auto-fix enabled
validator = UnifiedValidator(schema.to_validation_schema(), auto_fix=True)

# Validate your data
data = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["user@example.com", "invalid-email", "admin@test.org"],
    "age": [25, 150, 30],  # 150 is out of range
    "score": [85.5, 92.0, 78.5]
})

result = validator.validate(data)

# Check results
print(f"Valid: {result.is_valid}")
print(f"Errors: {len(result.errors)}")
print(f"Suggestions: {len(result.suggestions)}")

# Generate beautiful reports
from pandera_unified_validator import ValidationReporter

reporter = ValidationReporter(result)
reporter.to_console(verbose=True)  # Rich console output
reporter.to_html("report.html")     # Interactive HTML report
reporter.to_json("report.json")     # JSON export

Comparison with Alternatives

Feature pandera-unified-validator Pydantic Pandera
Record validation βœ… βœ… ❌
DataFrame validation βœ… ❌ βœ…
Unified schema βœ… ❌ ❌
Multi-backend (pandas/Polars) βœ… ❌ ❌
Streaming validation βœ… ❌ ❌
Auto-fix suggestions βœ… ❌ ❌
Data profiling βœ… ❌ βœ…
HTML/JSON reports βœ… ❌ ❌
Metrics export βœ… ❌ ❌

Real-World Example: E-commerce Product Validation

from pandera_unified_validator import UnifiedValidator, SchemaBuilder, ValidationReporter

# Define comprehensive product schema
schema = (
    SchemaBuilder("product_catalog")
    .add_column("product_id", str, unique=True, pattern=r"^PRD-\d{6}$")
    .add_column("name", str, nullable=False)
    .add_column("price", float, ge=0.01, le=1_000_000)
    .add_column("category", str, isin=["Electronics", "Clothing", "Books", "Home"])
    .add_column("stock_quantity", int, ge=0)
    .add_column("supplier_id", str, pattern=r"^SUP-\d{4}$")
    .add_cross_column_constraint(
        "price_check",
        ["price", "category"],
        lambda df: df["price"] < 10000 if df["category"] == "Books" else True,
        error_message="Books must be priced under $10,000"
    )
    .build()
)

# Validate with auto-fix
validator = UnifiedValidator(schema.to_validation_schema(), auto_fix=True)
result = validator.validate(products_df)

# Generate comprehensive report
reporter = ValidationReporter(result)
reporter.to_console(verbose=True)
reporter.to_html("validation_report.html")

# Apply auto-fixes
if result.suggestions:
    fixed_df = validator.apply_fixes(products_df, result)
    print(f"Fixed {len(result.suggestions)} issues automatically")

Streaming Validation for Large Files

from pandera_unified_validator import StreamingValidator

# Validate large CSV without loading into memory
schema = SchemaBuilder("transactions").add_column("amount", float, ge=0).build()
validator = StreamingValidator(schema, chunk_size=10000, error_threshold=0.05)

# Async validation with progress callback
async def progress_callback(metrics):
    print(f"Processed {metrics.total_rows} rows, {metrics.error_rate:.2%} error rate")

result = await validator.validate_csv(
    "large_transactions.csv",
    report_callback=progress_callback
)

print(f"Total rows: {result.metrics.total_rows}")
print(f"Invalid rows: {result.metrics.invalid_rows}")
print(f"Processing time: {result.metrics.processing_time:.2f}s")

Documentation

Development

# Clone repository
git clone https://github.com/ianpinto/pandera-unified-validator.git
cd pandera-unified-validator

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v --cov=src/pandera_unified_validator

# Run linting
ruff check src/ tests/

# Run type checking
mypy src/

# Run formatting
black src/ tests/

# Run all checks
ruff check src/ && black --check src/ && mypy src/ && pytest

Contributing

Contributions are welcome! Please read our Contributing Guide for details on:

  • Code of conduct
  • Development setup
  • Testing requirements
  • Code style guidelines
  • Pull request process

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on top of Pydantic and Pandera
  • Inspired by the need for unified data validation in production data pipelines
  • Thanks to all contributors

Citation

If you use pandera-unified-validator in your research or production systems, please cite:

@software{pandera_unified_validator,
  title = {pandera-unified-validator: Advanced data validation library},
  author = {Ian Pinto},
  year = {2025},
  url = {https://github.com/ianpinto/pandera-unified-validator}
}

About

pandera-unified-validator is an advanced data validation library that unifies Pydantic record validation with Pandera-powered DataFrame checks. It ships with pluggable pandas and Polars backends so analytics teams can reuse a single set of declarative rules across their favorite engines.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published