pandera-unified-validator

Advanced data validation library unifying Pydantic and Pandera with multi-backend support for pandas and Polars.

Features

🔐 Unified Validation – Single schema for both record-level (Pydantic) and DataFrame-level (Pandera) validation
⚡ Multi-Backend Support – Seamlessly switch between pandas and Polars without rewriting validation rules
📊 Streaming Validation – Efficiently validate large CSV, Parquet, and JSONL files that don't fit in memory
🔧 Auto-Fix Suggestions – Intelligent suggestions for common data quality issues with one-click fixes
📈 Data Profiling – Generate statistical profiles and infer validation constraints automatically
📝 Rich Reporting – Beautiful console output, interactive HTML reports, and metrics export (Prometheus, OpenTelemetry)
🧪 Type-Safe – Full type hints with mypy strict mode support
🚀 Production Ready – Comprehensive test suite with >90% coverage, property-based testing, and benchmarks

Installation

pip install pandera-unified-validator

With optional dependencies:

# For Parquet support
pip install pandera-unified-validator[parquet]

# For database validation
pip install pandera-unified-validator[database]

# For data profiling
pip install pandera-unified-validator[profiling]

# All features
pip install pandera-unified-validator[all]

Quick Start (30 seconds)

import pandas as pd
from pandera_unified_validator import SchemaBuilder, UnifiedValidator

# Define schema with fluent API
schema = (
    SchemaBuilder("user_schema")
    .add_column("user_id", int, unique=True, ge=0)
    .add_column("email", str, pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")
    .add_column("age", int, ge=0, le=120)
    .add_column("score", float, ge=0.0, le=100.0)
    .build()
)

# Create validator with auto-fix enabled
validator = UnifiedValidator(schema.to_validation_schema(), auto_fix=True)

# Validate your data
data = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["user@example.com", "invalid-email", "admin@test.org"],
    "age": [25, 150, 30],  # 150 is out of range
    "score": [85.5, 92.0, 78.5]
})

result = validator.validate(data)

# Check results
print(f"Valid: {result.is_valid}")
print(f"Errors: {len(result.errors)}")
print(f"Suggestions: {len(result.suggestions)}")

# Generate beautiful reports
from pandera_unified_validator import ValidationReporter

reporter = ValidationReporter(result)
reporter.to_console(verbose=True)  # Rich console output
reporter.to_html("report.html")     # Interactive HTML report
reporter.to_json("report.json")     # JSON export

Comparison with Alternatives

Feature	pandera-unified-validator	Pydantic	Pandera
Record validation	✅	✅	❌
DataFrame validation	✅	❌	✅
Unified schema	✅	❌	❌
Multi-backend (pandas/Polars)	✅	❌	❌
Streaming validation	✅	❌	❌
Auto-fix suggestions	✅	❌	❌
Data profiling	✅	❌	✅
HTML/JSON reports	✅	❌	❌
Metrics export	✅	❌	❌

Real-World Example: E-commerce Product Validation

from pandera_unified_validator import UnifiedValidator, SchemaBuilder, ValidationReporter

# Define comprehensive product schema
schema = (
    SchemaBuilder("product_catalog")
    .add_column("product_id", str, unique=True, pattern=r"^PRD-\d{6}$")
    .add_column("name", str, nullable=False)
    .add_column("price", float, ge=0.01, le=1_000_000)
    .add_column("category", str, isin=["Electronics", "Clothing", "Books", "Home"])
    .add_column("stock_quantity", int, ge=0)
    .add_column("supplier_id", str, pattern=r"^SUP-\d{4}$")
    .add_cross_column_constraint(
        "price_check",
        ["price", "category"],
        lambda df: df["price"] < 10000 if df["category"] == "Books" else True,
        error_message="Books must be priced under $10,000"
    )
    .build()
)

# Validate with auto-fix
validator = UnifiedValidator(schema.to_validation_schema(), auto_fix=True)
result = validator.validate(products_df)

# Generate comprehensive report
reporter = ValidationReporter(result)
reporter.to_console(verbose=True)
reporter.to_html("validation_report.html")

# Apply auto-fixes
if result.suggestions:
    fixed_df = validator.apply_fixes(products_df, result)
    print(f"Fixed {len(result.suggestions)} issues automatically")

Streaming Validation for Large Files

from pandera_unified_validator import StreamingValidator

# Validate large CSV without loading into memory
schema = SchemaBuilder("transactions").add_column("amount", float, ge=0).build()
validator = StreamingValidator(schema, chunk_size=10000, error_threshold=0.05)

# Async validation with progress callback
async def progress_callback(metrics):
    print(f"Processed {metrics.total_rows} rows, {metrics.error_rate:.2%} error rate")

result = await validator.validate_csv(
    "large_transactions.csv",
    report_callback=progress_callback
)

print(f"Total rows: {result.metrics.total_rows}")
print(f"Invalid rows: {result.metrics.invalid_rows}")
print(f"Processing time: {result.metrics.processing_time:.2f}s")

Documentation

User Guide - Complete tutorial and API reference
Examples - 9 practical examples covering common use cases
API Documentation - Auto-generated API docs
Contributing Guide - How to contribute to the project

Development

# Clone repository
git clone https://github.com/ianpinto/pandera-unified-validator.git
cd pandera-unified-validator

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v --cov=src/pandera_unified_validator

# Run linting
ruff check src/ tests/

# Run type checking
mypy src/

# Run formatting
black src/ tests/

# Run all checks
ruff check src/ && black --check src/ && mypy src/ && pytest

Contributing

Contributions are welcome! Please read our Contributing Guide for details on:

Code of conduct
Development setup
Testing requirements
Code style guidelines
Pull request process

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on top of Pydantic and Pandera
Inspired by the need for unified data validation in production data pipelines
Thanks to all contributors

Citation

If you use pandera-unified-validator in your research or production systems, please cite:

@software{pandera_unified_validator,
  title = {pandera-unified-validator: Advanced data validation library},
  author = {Ian Pinto},
  year = {2025},
  url = {https://github.com/ianpinto/pandera-unified-validator}
}

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.github		.github
.hypothesis/unicode_data/16.0.0		.hypothesis/unicode_data/16.0.0
docs		docs
examples		examples
output		output
src/pandera_unified_validator		src/pandera_unified_validator
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pandera-unified-validator

Features

Installation

Quick Start (30 seconds)

Comparison with Alternatives

Real-World Example: E-commerce Product Validation

Streaming Validation for Large Files

Documentation

Development

Contributing

License

Acknowledgments

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

iAn-P1nt0/pandera-unified-validator

Folders and files

Latest commit

History

Repository files navigation

pandera-unified-validator

Features

Installation

Quick Start (30 seconds)

Comparison with Alternatives

Real-World Example: E-commerce Product Validation

Streaming Validation for Large Files

Documentation

Development

Contributing

License

Acknowledgments

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages