Uses NLP, regex, and pattern matching (via Microsoft Presidio + spaCy) to detect and redact PII from CSV and Excel files.
Disclaimer: PII detection is not 100% accurate. Always manually verify the output before treating it as fully redacted. Do not blindly trust this tool with sensitive data.
- Python 3.12 (other versions are not supported)
- `uv` package manager
# 1. Install uv (if not already installed)
pip install uv
# 2. Clone the repo
git clone <repo-url>
cd pii-redaction-lib
# 3. Install dependencies (including the spaCy model)
uv sync

`uv sync` fetches everything declared in `pyproject.toml`, including the `en_core_web_lg` spaCy model sourced directly from a GitHub release.
python -m src.replace_pii <filepath> --output <outfile> [--sheet SHEET] [--text-fields FIELD ...] [--exclude-fields FIELD ...]
# Redact a CSV file
python -m src.replace_pii data/employees.csv --output data/employees_redacted.csv
# Redact an Excel file, targeting specific free-text columns
python -m src.replace_pii data/employees.xlsx --text-fields notes bio \
--output data/employees_redacted.xlsx
# Specify a sheet and protect primary-key columns from modification
python -m src.replace_pii data/employees.xlsx --sheet "Sheet1" \
--text-fields notes bio \
--exclude-fields employee_id \
--output data/employees_redacted.xlsx
# Reference a sheet by index (0-based)
python -m src.replace_pii data/employees.xlsx --sheet 2 --text-fields comments \
--output data/employees_redacted.xlsx

| Parameter | Required | Default | Description |
|---|---|---|---|
| `filepath` | Yes | — | Path to the input file (.csv, .xls, or .xlsx) |
| `--output` | Yes | — | Path for the redacted output file (.csv, .xls, or .xlsx). Format is inferred from the extension. |
| `--sheet` | No | `0` | Sheet name or 0-based index for Excel files. Ignored for CSV. |
| `--text-fields` | No | (none) | One or more column names containing long free-form text (e.g. notes, comments). These are processed row-by-row with full NLP analysis. |
| `--exclude-fields` | No | (none) | One or more column names to leave completely untouched (e.g. primary keys, date columns). |
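
The parameters above map naturally onto Python's `argparse`. The following is a minimal sketch of what such a parser could look like, not the tool's actual source. The names `build_parser` and `parse_sheet` are hypothetical, and treating a digit-only `--sheet` value as a 0-based index is an assumption about how names and indices are told apart.

```python
import argparse


def parse_sheet(value: str):
    """Digit-only values become a 0-based index; anything else is a sheet name (assumption)."""
    return int(value) if value.isdigit() else value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="replace_pii")
    parser.add_argument("filepath", help="input file (.csv, .xls, or .xlsx)")
    parser.add_argument("--output", required=True, help="path for the redacted output file")
    parser.add_argument("--sheet", type=parse_sheet, default=0, help="sheet name or 0-based index")
    parser.add_argument("--text-fields", nargs="+", default=[], metavar="FIELD")
    parser.add_argument("--exclude-fields", nargs="+", default=[], metavar="FIELD")
    return parser


# Parse an argument list equivalent to one of the usage examples.
args = build_parser().parse_args(
    ["data/employees.xlsx", "--sheet", "2", "--text-fields", "notes", "bio",
     "--output", "data/employees_redacted.xlsx"]
)
print(args.sheet, args.text_fields, args.exclude_fields)
```

Because `--text-fields` and `--exclude-fields` accept one or more values, the positional `filepath` is safest placed before them, as in the usage examples.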
The pipeline runs two passes over the input file. Pass 1 uses Presidio's NLP pipeline to analyze free-text columns cell-by-cell, providing high accuracy for natural-language content like notes or comments. Pass 2 uses Presidio's batch StructuredEngine on all remaining columns for speed, processing entire columns at once rather than row-by-row. Both passes apply the same replacement rules.
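
To make the column routing concrete, here is a dependency-free sketch of the two-pass idea. It is an illustration under stated assumptions rather than the tool's implementation: a single IPv4 regex stands in for Presidio's analyzers, rows are plain dictionaries instead of a data frame, and `redact` and `redact_rows` are hypothetical names.

```python
import re

# Stand-in detector: one IPv4 pattern. The real tool relies on Presidio's
# analyzers; this regex exists only to keep the sketch runnable.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Replacement rule shared by both passes."""
    return IPV4.sub("REDACTED", text)


def redact_rows(rows, text_fields=(), exclude_fields=()):
    columns = list(rows[0]) if rows else []
    # Everything that is neither free text nor excluded goes to pass 2.
    structured = [c for c in columns if c not in text_fields and c not in exclude_fields]
    out = [dict(row) for row in rows]  # copy so the input stays untouched

    # Pass 1: free-text columns, one cell at a time.
    for row in out:
        for col in text_fields:
            row[col] = redact(str(row[col]))

    # Pass 2: remaining columns, one whole column at a time.
    for col in structured:
        values = [redact(str(row[col])) for row in out]
        for row, value in zip(out, values):
            row[col] = value

    return out


rows = [{"employee_id": "7", "last_ip": "10.0.0.5", "notes": "Logged in from 10.0.0.5"}]
result = redact_rows(rows, text_fields=["notes"], exclude_fields=["employee_id"])
print(result)
```

Columns listed in `exclude_fields` are never passed to `redact`, which mirrors how `--exclude-fields` protects primary keys.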
| Entity Type | Replacement |
|---|---|
| `PERSON` | Literal string `REDACTED` |
| `EMAIL_ADDRESS` | Synthetic safe email address (generated by Faker) |
| `CREDIT_CARD` | Synthetic credit card number (generated by Faker) |
| `CRYPTO` | Literal string `REDACTED` |
| `DATE_TIME` | Literal string `REDACTED` |
| `IP_ADDRESS` | Literal string `REDACTED` |
| `MAC_ADDRESS` | Literal string `REDACTED` |
| `PHONE_NUMBER` | Literal string `REDACTED` |
| `MEDICAL_LICENSE` | Literal string `REDACTED` |
Presidio supports many additional entity types. The `operators` dict in `src/replace_pii.py` can be extended to cover them.
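
As a rough illustration of what extending such a mapping involves, the sketch below keys replacement rules by entity type. The actual shape of the dict in `src/replace_pii.py` is not shown here, so everything in this block is an assumption: with Presidio the values are typically `OperatorConfig` objects from `presidio_anonymizer`, whereas this sketch uses plain callables to stay dependency-free, and `synthetic_email` is a stand-in for a Faker call. `US_SSN` and `IBAN_CODE` are entity types Presidio recognizes.

```python
import random


def redacted(_original: str) -> str:
    """Replace the detected value with the literal string REDACTED."""
    return "REDACTED"


def synthetic_email(_original: str) -> str:
    """Stand-in for a Faker-generated safe email address."""
    return f"user{random.randint(1000, 9999)}@example.com"


# Hypothetical mapping: entity type -> replacement rule.
operators = {
    "PERSON": redacted,
    "EMAIL_ADDRESS": synthetic_email,
    "PHONE_NUMBER": redacted,
}

# Extending coverage means adding one entry per new entity type.
operators["US_SSN"] = redacted
operators["IBAN_CODE"] = redacted


def apply_rule(entity_type: str, value: str) -> str:
    """Apply the rule for an entity type; unmapped types pass through (assumption)."""
    rule = operators.get(entity_type)
    return rule(value) if rule else value


print(apply_rule("US_SSN", "078-05-1120"))
print(apply_rule("EMAIL_ADDRESS", "ann@corp.test"))
```

Whether unmapped entity types should pass through unchanged or fall back to `REDACTED` is a policy decision; for a redaction tool, a conservative default is usually the safer choice.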