actionably/pii-redaction-lib

pii-redaction-lib

Uses NLP, regex, and pattern matching (via Microsoft Presidio + spaCy) to detect and redact PII from CSV and Excel files.

Disclaimer: PII detection is not 100% accurate. Always manually verify the output before treating it as fully redacted. Do not blindly trust this tool with sensitive data.


Requirements

  • Python 3.12 (other versions are not supported)
  • uv package manager

Installation

# 1. Install uv (if not already installed)
pip install uv

# 2. Clone the repo
git clone <repo-url>
cd pii-redaction-lib

# 3. Install dependencies (including the spaCy model)
uv sync

uv sync fetches everything declared in pyproject.toml, including the en_core_web_lg spaCy model sourced directly from a GitHub release.


Usage

Basic syntax

python -m src.replace_pii <filepath> --output <outfile> [--sheet SHEET] [--text-fields FIELD ...] [--exclude-fields FIELD ...]

Examples

# Redact a CSV file
python -m src.replace_pii data/employees.csv --output data/employees_redacted.csv

# Redact an Excel file, targeting specific free-text columns
python -m src.replace_pii data/employees.xlsx --text-fields notes bio \
    --output data/employees_redacted.xlsx

# Specify a sheet and protect primary-key columns from modification
python -m src.replace_pii data/employees.xlsx --sheet "Sheet1" \
    --text-fields notes bio \
    --exclude-fields employee_id \
    --output data/employees_redacted.xlsx

# Reference a sheet by 0-based index (2 selects the third sheet)
python -m src.replace_pii data/employees.xlsx --sheet 2 --text-fields comments \
    --output data/employees_redacted.xlsx

CLI Parameter Reference

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| filepath | Yes | | Path to the input file (.csv, .xls, or .xlsx) |
| --output | Yes | | Path for the redacted output file (.csv, .xls, or .xlsx). Format is inferred from the extension. |
| --sheet | No | 0 | Sheet name or 0-based index for Excel files. Ignored for CSV. |
| --text-fields | No | (none) | One or more column names containing long free-form text (e.g. notes, comments). These are processed row-by-row with full NLP analysis. |
| --exclude-fields | No | (none) | One or more column names to leave completely untouched (e.g. primary keys, date columns). |
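
The parameters above could be declared with argparse roughly as follows. This is a minimal sketch, not the contents of src/replace_pii.py: the helper names parse_sheet, build_parser, and output_format are hypothetical, and so is the rule that an all-digit --sheet value is treated as an index.

```python
import argparse
from pathlib import Path

SUPPORTED_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def parse_sheet(value: str):
    """Treat an all-digit value as a 0-based index, anything else as a sheet name.

    Assumption: a sheet literally named "2" cannot be selected by name here.
    """
    return int(value) if value.isdecimal() else value


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented CLI parameters (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="replace_pii")
    parser.add_argument("filepath", type=Path, help="input .csv, .xls, or .xlsx file")
    parser.add_argument("--output", type=Path, required=True, help="redacted output file")
    parser.add_argument("--sheet", type=parse_sheet, default=0,
                        help="sheet name or 0-based index (Excel only)")
    parser.add_argument("--text-fields", nargs="+", default=[], metavar="FIELD",
                        help="free-text columns analyzed with full NLP")
    parser.add_argument("--exclude-fields", nargs="+", default=[], metavar="FIELD",
                        help="columns to leave untouched")
    return parser


def output_format(path: Path) -> str:
    """Infer the output format from the file extension."""
    ext = path.suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext!r}")
    return "csv" if ext == ".csv" else "excel"


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["data/employees.xlsx", "--sheet", "2", "--text-fields", "notes", "bio",
     "--output", "data/employees_redacted.xlsx"]
)
print(args.sheet, args.text_fields, output_format(args.output))
```

Using nargs="+" lets each of --text-fields and --exclude-fields take several column names in one flag, which matches the FIELD ... notation in the basic syntax.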

How It Works

The pipeline makes two passes over the input file, and both passes apply the same replacement rules.

Pass 1 runs Presidio's NLP pipeline on the columns listed in --text-fields, one cell at a time. Cell-level analysis is slower, but it is more accurate on natural-language content such as notes or comments.

Pass 2 runs Presidio's batch StructuredEngine on all remaining columns. It processes each column as a whole rather than row by row, which is faster. Columns listed in --exclude-fields are left untouched.
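
The routing of columns between the two passes can be sketched in plain Python. The function split_columns is hypothetical and does not call Presidio. It also assumes that a column named in both --text-fields and --exclude-fields is excluded, and that unknown column names are rejected, neither of which the README states.

```python
def split_columns(columns, text_fields=(), exclude_fields=()):
    """Return (pass1_columns, pass2_columns) in the original column order.

    pass1: free-text columns for cell-by-cell NLP analysis.
    pass2: every other non-excluded column, for batch structured analysis.
    """
    text, excluded = set(text_fields), set(exclude_fields)

    # Assumption: a misspelled column name should fail loudly rather than be ignored.
    unknown = (text | excluded) - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")

    # Assumption: exclusion takes precedence over --text-fields.
    pass1 = [c for c in columns if c in text and c not in excluded]
    pass2 = [c for c in columns if c not in text and c not in excluded]
    return pass1, pass2


header = ["employee_id", "name", "email", "notes", "bio"]
pass1, pass2 = split_columns(header, ["notes", "bio"], ["employee_id"])
print(pass1)  # columns for the NLP pass
print(pass2)  # columns for the structured pass
```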


Supported Entity Types

| Entity Type | Replacement |
| --- | --- |
| PERSON | Literal string REDACTED |
| EMAIL_ADDRESS | Synthetic safe email address (generated by Faker) |
| CREDIT_CARD | Synthetic credit card number (generated by Faker) |
| CRYPTO | Literal string REDACTED |
| DATE_TIME | Literal string REDACTED |
| IP_ADDRESS | Literal string REDACTED |
| MAC_ADDRESS | Literal string REDACTED |
| PHONE_NUMBER | Literal string REDACTED |
| MEDICAL_LICENSE | Literal string REDACTED |

Presidio supports many additional entity types. The operators dict in src/replace_pii.py can be extended to cover them.
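
As a rough illustration of how a per-entity replacement mapping behaves and how it might be extended, here is a dependency-free sketch. It is not the operators dict from src/replace_pii.py: the names Finding, REPLACEMENTS, and apply_replacements are hypothetical, plain callables stand in for Presidio's operator configuration, and a fixed address stands in for the Faker-generated email.

```python
from typing import Callable, NamedTuple


class Finding(NamedTuple):
    """A detected span: entity type plus [start, end) character offsets."""
    entity_type: str
    start: int
    end: int


# One replacement rule per entity type; each rule receives the original text.
REPLACEMENTS: dict[str, Callable[[str], str]] = {
    "PERSON": lambda original: "REDACTED",
    "PHONE_NUMBER": lambda original: "REDACTED",
    # Stand-in for a Faker-generated safe email address.
    "EMAIL_ADDRESS": lambda original: "user@example.com",
}

# Extending coverage is one more entry, here for Presidio's US_SSN entity type.
REPLACEMENTS["US_SSN"] = lambda original: "REDACTED"


def apply_replacements(text: str, findings: list[Finding]) -> str:
    """Replace each finding that has a rule; leave other spans unchanged."""
    # Work right to left so earlier offsets stay valid as the text changes length.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        rule = REPLACEMENTS.get(f.entity_type)
        if rule is not None:
            text = text[:f.start] + rule(text[f.start:f.end]) + text[f.end:]
    return text


sample = "Call Jane Doe at 555-0100"
name_at = sample.index("Jane Doe")
phone_at = sample.index("555-0100")
print(apply_replacements(sample, [
    Finding("PERSON", name_at, name_at + len("Jane Doe")),
    Finding("PHONE_NUMBER", phone_at, phone_at + len("555-0100")),
]))
```

Entity types without a rule are left as they are in this sketch, so any new type needs an explicit entry before it is redacted.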

About

This project uses NLP, regex, and other detection patterns to find PII and scrub it automatically. As noted in the code and the README, detection will never be 100% accurate. Verify the output instead of blindly trusting this project.
