pii-redaction-lib

Detects and redacts PII in CSV and Excel files using NLP and regex-based pattern matching (via Microsoft Presidio and spaCy).

Disclaimer: PII detection is not 100% accurate. Always manually verify the output before treating it as fully redacted. Do not blindly trust this tool with sensitive data.

Requirements

Python 3.12 (other versions are not supported)
uv package manager

Installation

# 1. Install uv (if not already installed)
pip install uv

# 2. Clone the repo
git clone <repo-url>
cd pii-redaction-lib

# 3. Install dependencies (including the spaCy model)
uv sync

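uv sync installs everything declared in pyproject.toml, including the en_core_web_lg spaCy model, which is sourced directly from a GitHub release. By default, uv installs into a project-local virtual environment (.venv), so either activate that environment or prefix the commands below with uv run.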

Usage

Basic syntax

python -m src.replace_pii <filepath> --output <outfile> [--sheet SHEET] [--text-fields FIELD ...] [--exclude-fields FIELD ...]

Examples

# Redact a CSV file
python -m src.replace_pii data/employees.csv --output data/employees_redacted.csv

# Redact an Excel file, targeting specific free-text columns
python -m src.replace_pii data/employees.xlsx --text-fields notes bio \
    --output data/employees_redacted.xlsx

# Specify a sheet and protect primary-key columns from modification
python -m src.replace_pii data/employees.xlsx --sheet "Sheet1" \
    --text-fields notes bio \
    --exclude-fields employee_id \
    --output data/employees_redacted.xlsx

# Reference a sheet by 0-based index (2 selects the third sheet)
python -m src.replace_pii data/employees.xlsx --sheet 2 --text-fields comments \
    --output data/employees_redacted.xlsx

CLI Parameter Reference

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| `filepath` | Yes | n/a | Path to the input file (`.csv`, `.xls`, or `.xlsx`) |
| `--output` | Yes | n/a | Path for the redacted output file (`.csv`, `.xls`, or `.xlsx`). Format is inferred from the extension. |
| `--sheet` | No | `0` | Sheet name or 0-based index for Excel files. Ignored for CSV. |
| `--text-fields` | No | (none) | One or more column names containing long free-form text (e.g. notes, comments). These are processed row-by-row with full NLP analysis. |
| `--exclude-fields` | No | (none) | One or more column names to leave completely untouched (e.g. primary keys, date columns). |
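
As a rough illustration of the parameters above, here is a minimal argparse sketch. It is not the parser in src/replace_pii.py, which may differ. The helper names `parse_sheet`, `build_parser`, and `output_format` are hypothetical, as is the rule that an all-digit `--sheet` value is read as an index.

```python
import argparse
from pathlib import Path
from typing import Union


def parse_sheet(value: str) -> Union[int, str]:
    """Assumption: an all-digit value is a 0-based index, anything else a sheet name."""
    return int(value) if value.isdigit() else value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the parameters listed in the reference table."""
    parser = argparse.ArgumentParser(prog="replace_pii")
    parser.add_argument("filepath", type=Path, help="input .csv, .xls, or .xlsx file")
    parser.add_argument("--output", type=Path, required=True, help="redacted output file")
    parser.add_argument("--sheet", type=parse_sheet, default=0,
                        help="sheet name or 0-based index (Excel only)")
    parser.add_argument("--text-fields", nargs="+", default=[], metavar="FIELD")
    parser.add_argument("--exclude-fields", nargs="+", default=[], metavar="FIELD")
    return parser


def output_format(path: Path) -> str:
    """Infer the output format from the file extension (hypothetical helper)."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix in {".xls", ".xlsx"}:
        return "excel"
    raise ValueError(f"unsupported extension: {suffix!r}")


# Parse the equivalent of one of the example invocations.
args = build_parser().parse_args([
    "data/employees.xlsx", "--sheet", "2",
    "--text-fields", "notes", "bio",
    "--output", "data/employees_redacted.xlsx",
])
print(args.sheet, args.text_fields, output_format(args.output))
```

Using `nargs="+"` lets `--text-fields` and `--exclude-fields` take several column names in one flag, matching the `FIELD ...` notation in the basic syntax.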

How It Works

The pipeline makes two passes over the input file. Pass 1 runs Presidio's NLP pipeline on the free-text columns (those named in --text-fields) cell by cell, which suits natural-language content such as notes or comments. Pass 2 runs Presidio's batch StructuredEngine over all remaining columns, processing each column as a whole rather than row by row, which is faster. Columns named in --exclude-fields are left untouched. Both passes apply the same replacement rules.
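
The control flow of the two passes can be sketched with the standard library alone. This is only an illustration of the structure, not the tool's implementation. A single email regex stands in for Presidio and spaCy, every match becomes `REDACTED` for simplicity, and the names `redact_text` and `redact_rows` are hypothetical. The sketch also assumes that an excluded column stays untouched even if it is listed as a text field.

```python
import re

# Stand-in detector: one email regex. The real tool relies on Presidio + spaCy.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

Row = dict[str, str]


def redact_text(text: str) -> str:
    """Replace every detected span with a literal marker."""
    return EMAIL_RE.sub("REDACTED", text)


def redact_rows(rows: list[Row], text_fields: list[str],
                exclude_fields: list[str]) -> list[Row]:
    """Return redacted copies of the rows; the input is not modified."""
    out = [dict(row) for row in rows]
    if not out:
        return out
    excluded = set(exclude_fields)
    text_cols = [c for c in text_fields if c in out[0] and c not in excluded]

    # Pass 1: free-text columns, one cell at a time.
    for row in out:
        for col in text_cols:
            row[col] = redact_text(row[col])

    # Pass 2: every remaining column, handled as a whole column.
    remaining = [c for c in out[0] if c not in text_cols and c not in excluded]
    for col in remaining:
        redacted = [redact_text(row[col]) for row in out]
        for row, value in zip(out, redacted):
            row[col] = value
    return out


rows = [{
    "employee_id": "E-17",
    "contact": "jane@example.com",
    "notes": "Email jane@example.com for access",
}]
result = redact_rows(rows, text_fields=["notes"], exclude_fields=["employee_id"])
print(result[0])
```

In the real pipeline the two passes differ in the engine they call, not only in iteration order. Here both call the same stand-in function so that the column partitioning is the only thing on display.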

Supported Entity Types

| Entity Type | Replacement |
| --- | --- |
| `PERSON` | Literal string `REDACTED` |
| `EMAIL_ADDRESS` | Synthetic safe email address (generated by Faker) |
| `CREDIT_CARD` | Synthetic credit card number (generated by Faker) |
| `CRYPTO` | Literal string `REDACTED` |
| `DATE_TIME` | Literal string `REDACTED` |
| `IP_ADDRESS` | Literal string `REDACTED` |
| `MAC_ADDRESS` | Literal string `REDACTED` |
| `PHONE_NUMBER` | Literal string `REDACTED` |
| `MEDICAL_LICENSE` | Literal string `REDACTED` |

Presidio supports many additional entity types. The operators dict in src/replace_pii.py can be extended to cover them.
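
To make the idea of a per-entity replacement mapping concrete, here is a standard-library sketch of a dispatch table keyed by entity type. It does not use Presidio's operator API and is not the operators dict from src/replace_pii.py. The names `REPLACEMENTS`, `literal_redacted`, `synthetic_email`, and `replace` are hypothetical, and a counter stands in for Faker. Leaving unmapped entity types unchanged is an assumption of this sketch only.

```python
import itertools
from typing import Callable

_counter = itertools.count(1)


def literal_redacted(_original: str) -> str:
    """Replace the value with the literal string REDACTED."""
    return "REDACTED"


def synthetic_email(_original: str) -> str:
    """Stand-in for a Faker-generated address; example.com is reserved for documentation."""
    return f"user{next(_counter)}@example.com"


# Entity type -> replacement strategy, mirroring a few rows of the table above.
REPLACEMENTS: dict[str, Callable[[str], str]] = {
    "PERSON": literal_redacted,
    "EMAIL_ADDRESS": synthetic_email,
    "PHONE_NUMBER": literal_redacted,
}

# Covering another entity type is one more entry; US_SSN is a Presidio entity name.
REPLACEMENTS["US_SSN"] = literal_redacted


def replace(entity_type: str, original: str) -> str:
    """Apply the mapped strategy; unmapped types pass through (sketch-only assumption)."""
    handler = REPLACEMENTS.get(entity_type)
    return handler(original) if handler else original


print(replace("PERSON", "Jane Doe"))
print(replace("EMAIL_ADDRESS", "jane@corp.test"))
```

Keeping the strategies as plain callables means a literal replacement and a synthetic-value generator share one interface, so adding an entity type does not touch the calling code.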
