DataGuard is a standard-library-only Python CLI for cleaning, validating, triaging, and transforming messy real-world data.
It is designed for command-line workflows where you want one tool that can:
- clean invisible or unsafe text artifacts
- extract contacts from unstructured text
- audit password quality with offline heuristics
- parse web/server access logs and flag suspicious patterns
- repair malformed CSV files and convert them to JSON
- sanitize HTML in plain-text or safe-tag mode
- auto-detect likely input type and route it to the right module
- batch-process directories of mixed files
Important: DataGuard aims to be practical and portable, not “security magic.” Some modules intentionally use heuristic logic and best-effort parsing. For example, the HTML sanitizer is not a replacement for hardened browser-side or framework-grade sanitization.
Real-world data is messy.
You might receive:
- a text file with hidden Unicode characters
- a copied contact list with inconsistent formatting
- a password dump that needs offline quality scoring
- server logs that need quick triage
- a CSV with broken rows, duplicate headers, or mixed delimiters
- user-submitted HTML that needs cleanup before inspection or storage
DataGuard provides a single CLI with a consistent output model, shared runtime flags, optional reports, and JSON-friendly automation.
Clean text artifacts such as:
- BOM markers
- ANSI escape sequences
- control characters
- unusual Unicode whitespace
- smart quotes
- zero-width characters
- optional bidi / isolate formatting marks
Useful for copied text, generated logs, OCR output, or suspicious text payloads.
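The artifacts above can be handled with the standard library alone. The sketch below is an illustrative, simplified version of this kind of cleanup (the function name, regex, and character sets are assumptions for demonstration, not DataGuard's actual implementation):

```python
import re
import unicodedata

ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")          # common CSI escape sequences
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}   # zero-width chars + BOM
SMART_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def clean_text(text: str) -> str:
    """Best-effort removal of invisible or unsafe text artifacts."""
    text = ANSI_RE.sub("", text)
    # Drop zero-width characters and the BOM wherever they appear.
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # Replace smart quotes with plain ASCII equivalents.
    text = "".join(SMART_QUOTES.get(ch, ch) for ch in text)
    # Remove non-printable control characters except newline and tab.
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
```

A real implementation would also cover the bidi and isolate formatting marks mentioned above, ideally behind an opt-in flag so legitimate RTL text is not mangled.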
Extract and validate contact information from messy text.
Capabilities include:
- email extraction with validation rules
- US/NANP phone normalization
- international phone normalization for +-prefixed numbers
- confidence scoring
- duplicate suppression
- optional rejected-candidate reporting
Output is CSV for easy spreadsheet import.
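As a rough illustration of the extraction pipeline, here is a minimal email extractor with duplicate suppression and a toy confidence score. The regex and scoring rule are assumptions for demonstration only; DataGuard's validation rules are more involved:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list[dict]:
    found, seen = [], set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in EMAIL_RE.finditer(line):
            email = match.group().lower()
            if email in seen:          # duplicate suppression
                continue
            seen.add(email)
            local = email.split("@")[0]
            # Toy confidence rule: penalize very short local parts.
            confidence = 0.9 if len(local) > 2 else 0.5
            found.append({"email": email, "source_line": lineno,
                          "confidence_score": confidence})
    return found
```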
Audit password quality using offline heuristics.
Checks include:
- minimum length
- character-class diversity
- common-password matching
- leetspeak normalization
- weak suffix detection
- repeated characters
- ascending/descending sequences
- keyboard walk patterns
- a naive entropy estimate
Security note: exported audit data can contain full plaintext passwords. Treat those files as sensitive.
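To make the heuristics concrete, here is a sketch of a few of the checks listed above: length, leetspeak normalization against a common-password list, repeated characters, and the naive entropy estimate. The word list, leet table, and thresholds are illustrative assumptions, not DataGuard's actual rules:

```python
import math
import re

COMMON = {"password", "123456", "qwerty", "letmein"}   # tiny demo list
# Leetspeak normalization table: 0->o, 1->l, 3->e, 4->a, 5->s, 7->t, @->a, $->s
LEET = str.maketrans("013457@$", "oleastas")

def naive_entropy_bits(password: str) -> float:
    """Uniform-random estimate: length * log2(character pool size)."""
    pool = 0
    if re.search(r"[a-z]", password): pool += 26
    if re.search(r"[A-Z]", password): pool += 26
    if re.search(r"[0-9]", password): pool += 10
    if re.search(r"[^A-Za-z0-9]", password): pool += 33
    return len(password) * math.log2(pool) if pool else 0.0

def audit(password: str, min_length: int = 12) -> list[str]:
    issues = []
    if len(password) < min_length:
        issues.append("too_short")
    if password.lower().translate(LEET) in COMMON:
        issues.append("common_password")
    if re.search(r"(.)\1\1", password):        # 3+ identical chars in a row
        issues.append("repeated_characters")
    return issues
```

Note how `"P@ssw0rd"` normalizes to `"password"` and is flagged even though it mixes character classes, which is exactly why leet normalization matters.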
Parse server logs and generate quick threat-oriented summaries.
Supported behavior includes:
- Apache / Nginx / generic format detection
- request parsing
- top IP and URL summaries
- parse-failure reporting
- heuristic detection for path traversal, SQL injection probes, scanner fingerprints, brute-force bursts, and rapid-fire request spikes
Security note: this is a triage aid, not a SIEM, IDS/IPS, WAF, or formal incident analysis platform.
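A minimal sketch of this style of triage: parse Apache/Nginx common-format lines with a regex, count IPs, and flag path-traversal probes. The regex and the single threat pattern here are simplified assumptions; the real module covers more formats and heuristics:

```python
import re
from collections import Counter

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
TRAVERSAL_RE = re.compile(r"\.\./|%2e%2e", re.IGNORECASE)

def triage(lines: list[str]) -> dict:
    ips, findings, failures = Counter(), [], 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            failures += 1              # parse-failure reporting
            continue
        ips[m["ip"]] += 1
        if TRAVERSAL_RE.search(m["path"]):
            findings.append(("path_traversal", m["ip"], m["path"]))
    return {"top_ips": ips.most_common(5), "findings": findings,
            "parse_failures": failures}
```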
Repair broken CSV input and convert it to structured JSON.
Capabilities include:
- delimiter detection
- header detection / generation
- header normalization
- duplicate-header repair
- short-row padding
- long-row overflow handling
- strict-mode rejection
- optional quarantine file for rejected rows
- basic type inference and conversion
- column completeness profiling
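The core repair loop can be sketched with the standard-library `csv` module. The delimiter heuristic and overflow policy below are illustrative assumptions (truncating overflow cells is one possible policy; quarantining them, as the real module can, is another):

```python
import csv
import io

def detect_delimiter(sample: str, candidates: str = ",;|\t") -> str:
    # Pick whichever candidate delimiter appears most often in the sample line.
    return max(candidates, key=sample.count)

def repair_rows(text: str) -> list[dict]:
    delim = detect_delimiter(text.splitlines()[0])
    reader = csv.reader(io.StringIO(text), delimiter=delim)
    header = next(reader)
    rows = []
    for row in reader:
        row = row[:len(header)]                   # drop long-row overflow cells
        row += [""] * (len(header) - len(row))    # pad short rows
        rows.append(dict(zip(header, row)))
    return rows
```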
Clean and sanitize HTML in two modes:
- plain: strip markup and return plain text
- safe: preserve only allowlisted tags and allowed attributes
The module removes or neutralizes things like:
- `<script>` blocks
- dangerous URLs
- event handlers
- blocked tags such as `iframe`, `object`, `embed`, `form`, and `base`
- style-based attacks and meta refresh tricks
Security note: this is best-effort sanitization for cleanup and inspection, not a hardened security boundary.
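For a sense of how plain mode can work on the standard library alone, here is a minimal tag stripper built on `html.parser.HTMLParser`. It is a demonstration sketch only; the real module additionally implements the safe-mode allowlist and URL/attribute filtering:

```python
from html.parser import HTMLParser

class PlainText(HTMLParser):
    """Collect text content, dropping markup and <script>/<style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def to_plain(html: str) -> str:
    parser = PlainText()
    parser.feed(html)
    return "".join(parser.parts).strip()
```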
Inspect input content and route it to the most likely module.
Detection is based on content heuristics and, when available, filename extension hints.
Scan a directory, auto-detect each file’s type, and write cleaned output files plus an optional batch summary JSON report.
Read or update defaults stored in .dataguardrc.
Print example commands.
Print environment and module-health information.
DataGuard keeps the implementation intentionally lightweight:
- Language: Python 3.10+
- Packaging: `pyproject.toml` with `setuptools`
- Runtime dependencies: none beyond the Python standard library
- Test dependency: `pytest`
- CLI framework: standard-library `argparse`
- Parsing / transformation approach: standard-library modules such as `csv`, `json`, `html.parser`, `re`, `pathlib`, and related utilities
- Schemas: JSON Schema documents shipped under `schema/`
This makes the project easy to install in constrained environments and simple to audit.
```bash
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -e .
```

Then verify the CLI:

```bash
dg-clean --help
```

This repository already warns about a potential package-name collision with another package named `dataguard` in some environments.
Use `dg-clean` when possible to ensure you are executing this project's entry point.
You can still try:

```bash
python -m dataguard --help
```

But `dg-clean` is the safer default when working from a local clone or editable install.
```bash
dg-clean sanitize --input $'Hello\u200b world\ufeff'
dg-clean contacts --file contacts.txt --output contacts.csv --report
dg-clean audit --password 'TrickyPass123!' --report
dg-clean logs --file access.log --top 5 --threats-only
dg-clean csv --file broken.csv --output output.json --quarantine rejected.csv
dg-clean html --file input.html --mode safe --allow p,a,strong,img --show-diff --report
dg-clean auto --file mystery.txt --report
dg-clean batch --dir incoming --pattern '*.txt' --output-dir cleaned --batch-report batch.json
```

Many subcommands share runtime flags such as:

- `--report` to print a standardized report
- `--report-format text|json|csv`
- `--report-file <path>`
- `--pipe-format text|json|raw`
- `--no-color`
- `--quiet`
- `--verbose` / `-v`, `-vv`, `-vvv`
- `--strict`
```bash
dg-clean sanitize [--input TEXT | --file PATH | --stdin] [--output PATH] [--preserve-bidi-marks]
```

Use for cleaning invisible, unsafe, or normalization-worthy text artifacts.
```bash
dg-clean contacts --file PATH|--stdin [--output PATH] [--min-confidence FLOAT] [--show-rejected]
```

Output is CSV with columns like:

- `name_if_found`
- `email`
- `phone`
- `source_line`
- `confidence_score`
```bash
dg-clean audit [--password TEXT | --file PATH | --stdin] [--show] [--min-length N] [--no-dictionary] [--no-entropy] [--export PATH]
```

Use `--show` carefully, because it prints the real password instead of masking it.
```bash
dg-clean logs --file PATH|--stdin [--format auto|apache|nginx|generic] [--top N] [--threats-only] [--export PATH]
```

`--export` writes parsed log entry JSON, which may contain sensitive operational data.
```bash
dg-clean csv --file PATH|--stdin [--output PATH] [--delimiter auto|,|;|||tab] [--quarantine PATH] [--no-types]
```

In strict mode, malformed rows are rejected instead of repaired.
```bash
dg-clean html [--input TEXT | --file PATH | --stdin] [--mode plain|safe] [--allow TAG1,TAG2,...] [--output PATH] [--show-diff]
```

```bash
dg-clean auto --file PATH|--stdin [--output PATH] [--dry-run]
```

`--dry-run` prints the detected module, confidence, and detection notes without running the module.
```bash
dg-clean batch --dir PATH [--recursive] [--pattern GLOB] --output-dir PATH [--batch-report PATH]
```

This command:
- scans matching files
- auto-detects each file’s likely module
- runs the corresponding processor
- writes per-file cleaned output
- optionally writes a batch summary JSON file
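The steps above can be sketched with `pathlib` and a stand-in processor. The function name, the trivial "cleaning" step, and the summary filename are assumptions for illustration; the real command routes each file through its detected module:

```python
import json
from pathlib import Path

def run_batch(src: Path, dest: Path, pattern: str = "*.txt") -> dict:
    dest.mkdir(parents=True, exist_ok=True)
    summary = {"processed": [], "count": 0}
    for path in sorted(src.glob(pattern)):         # scan matching files
        cleaned = path.read_text().strip() + "\n"  # stand-in for a module run
        (dest / path.name).write_text(cleaned)     # per-file cleaned output
        summary["processed"].append(path.name)
        summary["count"] += 1
    # Optional batch summary JSON (filename is hypothetical here).
    (dest / "batch.json").write_text(json.dumps(summary, indent=2))
    return summary
```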
```bash
dg-clean config
dg-clean config --set verbosity=1 pipe_format=json
dg-clean examples
dg-clean info
```

DataGuard supports an optional `.dataguardrc` file in the current working directory.
It is stored as JSON and merged with built-in defaults.
Example:
```json
{
  "color_enabled": true,
  "strict_mode": false,
  "pipe_format": "text",
  "report_format": "text",
  "min_confidence_threshold": 0.3,
  "password_min_length": 12,
  "log_top_n": 10,
  "verbosity": 1
}
```

Supported defaults include:

- `default_output_format`
- `color_enabled`
- `verbosity`
- `strict_mode`
- `min_confidence_threshold`
- `password_min_length`
- `log_top_n`
- `pipe_format`
- `report_format`
Unknown keys are ignored with warnings, and invalid values produce clear errors.
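A minimal sketch of this merge behavior, assuming a subset of the defaults above (the type-based validation rule here is a simplifying assumption, not the project's exact check):

```python
import json
import warnings
from pathlib import Path

DEFAULTS = {"color_enabled": True, "strict_mode": False, "verbosity": 1,
            "pipe_format": "text", "report_format": "text"}

def load_config(path: Path = Path(".dataguardrc")) -> dict:
    merged = dict(DEFAULTS)
    if not path.exists():
        return merged
    for key, value in json.loads(path.read_text()).items():
        if key not in DEFAULTS:
            warnings.warn(f"ignoring unknown config key: {key}")
            continue
        if not isinstance(value, type(DEFAULTS[key])):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
        merged[key] = value
    return merged
```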
Most module runners return a structured result with common concepts such as:
- `module_name`
- `title`
- `output`
- `findings`
- `warnings`
- `errors`
- `stats`
- `metadata`
- `summary`
This makes the CLI suitable both for human use and for shell / automation pipelines.
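One plausible shape for that shared result, sketched as a dataclass. The field names follow the concepts listed above, but the exact structure and the `to_json` helper are assumptions, not the project's actual types:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ModuleResult:
    module_name: str
    title: str
    output: str = ""
    findings: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    stats: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    summary: str = ""

    def to_json(self) -> str:
        # JSON serialization keeps the result pipeline-friendly.
        return json.dumps(asdict(self))
```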
Primary output can be shaped with:
- `text`
- `json`
- `raw`
Standardized reports can be emitted as:
- `text`
- `json`
- `csv`
DataGuard uses meaningful process exit codes:
- `0` = success with no warnings
- `1` = success with warnings
- `2` = processing failure, or warnings treated as failure in strict mode
- `3` = CLI / internal / handled execution error
This makes it easy to integrate with scripts and CI-style checks.
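A caller-side sketch of the mapping from a result to codes 0-2 (code 3 is reserved for CLI-level errors raised outside normal processing; the function name and arguments are illustrative assumptions):

```python
def exit_code(errors: list, warnings: list, strict: bool = False) -> int:
    """Map a module result to DataGuard-style exit codes 0/1/2."""
    if errors:
        return 2                       # processing failure
    if warnings:
        return 2 if strict else 1      # strict mode promotes warnings
    return 0                           # clean success
```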
The auto module scores content against likely module types such as:
- logs
- CSV
- HTML
- contacts
- password lists
- plain text / sanitization fallback
It uses:
- content sampling
- regex and structural heuristics
- extension-based confidence boosts for files like `.log`, `.csv`, `.tsv`, `.html`, and `.htm`
- tie-breaking priority rules when scores are close
This is useful for mixed-input directories and quick CLI workflows.
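A toy version of this scoring approach, combining content heuristics with extension boosts. The weights, patterns, and the reduced set of module types are assumptions for illustration only:

```python
import re
from pathlib import Path

EXT_BOOST = {".log": "logs", ".csv": "csv", ".tsv": "csv",
             ".html": "html", ".htm": "html"}

def detect(sample: str, filename: str = "") -> str:
    # "text" starts at 1 so it wins as the sanitization fallback.
    scores = {"logs": 0, "csv": 0, "html": 0, "text": 1}
    if re.search(r"\[\d{2}/\w{3}/\d{4}", sample):   # Apache-style timestamp
        scores["logs"] += 3
    if sample.count(",") >= 2 and "\n" in sample:   # delimiter-heavy lines
        scores["csv"] += 2
    if re.search(r"<[a-zA-Z][^>]*>", sample):       # tag-like markup
        scores["html"] += 3
    ext = Path(filename).suffix.lower()
    if ext in EXT_BOOST:
        scores[EXT_BOOST[ext]] += 2                 # extension confidence boost
    return max(scores, key=scores.get)
```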
The schema/ directory contains machine-readable schema documents for:
- common definitions
- config payloads
- config CLI responses
- auto-detection responses
- batch summaries
- CLI info output
- unified module result shapes
- read metadata
Start with:
schema/manifest.json
This is especially useful if you want to validate DataGuard output programmatically or build tooling around it.
A simplified layout looks like this:
```
Dataguard/
├── cli.py
├── auto_detect.py
├── config.py
├── dg_clean_entry.py
├── formatter.py
├── io_utils.py
├── errors.py
├── common_passwords.py
├── modules/
│   ├── string_sanitizer.py
│   ├── contact_extractor.py
│   ├── password_checker.py
│   ├── log_parser.py
│   ├── csv_converter.py
│   └── html_sanitizer.py
├── schema/
├── tests/
├── pyproject.toml
└── requirements.txt
```
DataGuard is intentionally transparent about what it does and does not do.
The HTML module is useful for cleanup and inspection, but it is not a replacement for:
- a hardened HTML sanitizer
- trusted browser parsing
- CSP
- full application-layer XSS defense
Password scoring is heuristic. The entropy metric is a naive uniform-random estimate, not a real attacker-cost model and not NIST verification.
Threat findings are pattern-based hints for triage. They should not be treated as definitive incident conclusions.
Confidence scores help filter noisy text, but extracted results should still be reviewed before operational use.
```bash
pip install -r requirements.txt
pytest
```

Examples of good local checks before opening a PR:

```bash
dg-clean info
dg-clean examples
pytest
```

Potential next improvements for the project:
- add CI workflows for tests and linting
- publish release notes / tagged versions
- add sample input and output fixtures in the README
- add contributor guidelines
- add benchmark notes for large-file processing
- document expected JSON outputs per module with examples
- add a comparison section explaining when to use DataGuard versus dedicated security or ETL tools
Contributions are welcome.
Good contribution areas include:
- new test coverage
- more sample fixtures
- improved docs and examples
- better log-format support
- tighter HTML allowlist behavior
- more robust data-type inference for CSV conversion
- clearer machine-readable schema documentation
A useful contribution flow is:
- fork the repo
- create a feature branch
- add or update tests
- run `pytest`
- open a pull request with before/after examples
Add the project license here once the repository includes one explicitly.
If you already have a license file in the repo, update this section to reference it directly.