DataFog Labs

Open research and development for lightweight PII detection models. This repo contains the full training code, experiment history, and research behind DataFog's PII-NER model family.

Latest checkpoint: DataFog/pii-small-en on HuggingFace (v1.4, Apache 2.0)

PII-NER v1

A 22.7M parameter model for detecting 41 types of personally identifiable information in English text. Combines a pretrained DeBERTa-v3-xsmall backbone with a character CNN encoder, adaptive gating fusion, and CRF output layer.

Input Text
    |
[Tokenization + Word-to-Char mapping]
    |
DeBERTa-v3-xsmall (22M)  +  CharCNN (0.3M)
    |                            |
    +-------> Gating Fusion <----+
                  |
             CRF Head (0.2M)
                  |
         BIO Tag Predictions (Viterbi decode)
                  |
         Span-level PII Entities

The gating fusion dynamically weights character-level features (for structured PII like SSNs and credit cards) against contextual features (for soft PII like names and addresses) on a per-token basis.
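As a minimal sketch of this fusion step (the repo's actual layer shapes and projections may differ), a learned per-token sigmoid gate can blend the two feature streams:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gating_fusion(h_ctx, h_char, W_g, b_g):
    """Blend contextual and character features with a learned per-token gate.

    h_ctx, h_char: (num_tokens, dim) feature matrices, already projected
    to a shared width. W_g: (2 * dim, dim) gate weights, b_g: (dim,) bias.
    """
    g = sigmoid(np.concatenate([h_ctx, h_char], axis=-1) @ W_g + b_g)
    # g near 1 favors character features (structured PII like SSNs);
    # g near 0 favors contextual features (soft PII like names).
    return g * h_char + (1.0 - g) * h_ctx
```

Because the gate is computed from both streams per token, the same model can lean on character patterns for one token and sentence context for the next.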

Results

| Metric | v1.0 | v1.1 | v1.2 | v1.3 | v1.4 |
| --- | --- | --- | --- | --- | --- |
| Overall F1 | 0.904 | 0.901 | 0.901 | 0.907 | 0.889 |
| Precision | 0.907 | 0.906 | 0.905 | 0.898 | 0.870 |
| Recall | 0.902 | 0.895 | 0.896 | 0.916 | 0.910 |
| Tier 1 Recall (SSN, Credit Card, ...) | 0.722 | 0.771 | 0.841 | 0.823 | 0.814 |
| Tier 2 Recall (Person, Email, Phone, ...) | 0.934 | 0.933 | 0.936 | 0.945 | 0.937 |
| Tier 3 Recall (Username, Date, Location, ...) | 0.919 | 0.908 | 0.911 | 0.930 | 0.945 |
| Tier 4 Recall (Employee ID, IBAN, ...) | 0.866 | 0.844 | 0.845 | 0.868 | 0.937 |

v1.3 has the best overall F1 (0.907). v1.4 expanded the training data from 169K to 241K examples, adding 4 new data sources and 22K synthetic examples covering 11 entity types that previously had zero training data. v1.4 achieves the best Tier 3 and Tier 4 recall, and all previously-zero entity types now produce results. The F1 drop relative to v1.3 reflects broader entity coverage at the cost of some precision on existing types.
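As a sanity check on the table, overall F1 is the harmonic mean of precision and recall; plugging in v1.4's rounded headline numbers reproduces the reported value to within rounding:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# v1.4 headline precision/recall from the table above
print(f1(0.870, 0.910))  # ~0.8896, matching the reported 0.889
```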

Top entity F1 scores (v1.4)

| Entity | F1 | Entity | F1 |
| --- | --- | --- | --- |
| URL | 0.995 | Nationality | 0.993 |
| Religion | 1.000 | Crypto Wallet | 0.987 |
| Health Condition | 0.988 | Insurance Number | 0.986 |
| Student ID | 1.000 | Political Affiliation | 0.995 |
| Marital Status | 1.000 | Salary | 0.995 |
| Sexual Orientation | 1.000 | Criminal Record | 1.000 |
| Biometric | 0.975 | Gender | 0.957 |
| Email | 0.976 | Phone | 0.947 |

Quick start

from datafog_pii_ner.inference import PiiPipeline

pipeline = PiiPipeline.from_pretrained("DataFog/pii-small-en")
entities = pipeline("My SSN is 123-45-6789 and email is john@example.com")
# [PiiEntity(text='123-45-6789', label='SSN', start=10, end=21, tier=1),
#  PiiEntity(text='john@example.com', label='EMAIL', start=32, end=48, tier=2)]

Setup

cd pii-ner-v1
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check src/ tests/ scripts/

# Smoke test (requires GPU)
python -m scripts.smoke_test

# Full training
python scripts/train_v1.3.py --config configs/h100-v1.3.yaml

Evaluation

python scripts/eval_benchmark.py \
  --model datafog \
  --model-path DataFog/pii-small-en \
  --dataset combined \
  --split test

See eval_benchmark.md for flags and options.

Training data

| Dataset | Size | License |
| --- | --- | --- |
| AI4Privacy | ~43K examples (English) | Apache 2.0 |
| NVIDIA Nemotron-PII | ~100K examples | CC-BY-4.0 |
| Gretel Synthetic PII Finance | ~26K examples | Apache 2.0 |
| Gretel PII Masking EN v1 | ~50K examples | Apache 2.0 |
| Synthetic (generated) | ~22K examples | Apache 2.0 |

Combined: ~241K English examples after filtering and dedup. 41 canonical entity types across 4 sensitivity tiers, unified into 83 BIO labels. The synthetic data covers 11 entity types that had zero examples in the open-source datasets: NATIONALITY, RELIGION, MARITAL_STATUS, STUDENT_ID, CRYPTO_WALLET, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, and HEALTH_CONDITION.
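The 83-label figure follows directly from the BIO scheme: one O label plus a B- and I- tag for each of the 41 entity types. A quick sanity check (the entity names here are placeholders, not the repo's canonical list):

```python
NUM_ENTITY_TYPES = 41

# Placeholder names; the repo defines the actual 41 canonical types.
entity_types = [f"ENTITY_{i}" for i in range(NUM_ENTITY_TYPES)]

# One "O" label plus B-/I- variants per entity type.
bio_labels = ["O"] + [f"{prefix}-{ent}" for ent in entity_types for prefix in ("B", "I")]

print(len(bio_labels))  # → 83
```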

Documentation

| Document | Description |
| --- | --- |
| Training Chronicle | Full narrative of the ML journey: 4 NaN sources, backbone instability, tier-weighted loss, freezing experiments |
| Smoke Test Walkthrough | Why differential learning rates are essential for pretrained+CRF architectures |
| Evaluation Harness | Head-to-head model comparison on the same test split |
| Design Document | Original architecture decisions and project structure |

Research

The RESEARCH/ directory contains the pre-implementation research: 8 reports surveying 26 architectures, 9 PII datasets, and the competitive landscape. Includes a 29-slide interactive architecture guide.

Key finding: no published work combines differentiable character-level pattern recognition with contextual transformers specifically for PII detection.

Development log

| Date | Version | What changed |
| --- | --- | --- |
| 2026-02-08 | v1.4 | Full entity coverage. Added 4 new data sources (241K total), synthetic data for 11 zero-occurrence types. All 41 entity types now produce results. Best Tier 3 (0.945) and Tier 4 (0.937) recall. Backbone freeze confirmed harmful; epoch 3 best, stopped early. |
| 2026-02-07 | v1.3 | Best F1 (0.907). Early backbone freeze (epoch 3) + progressive tier weight reduction. Discovered training spikes originate in head components, not the backbone. |
| 2026-02-05 | v1.2 | Best Tier 1 recall (0.841). Backbone freezing after epoch 4. Epoch 3 identified as consistent sweet spot. |
| 2026-02-04 | v1.1 | Tier-weighted CRF loss (3x for Tier 1), rare entity oversampling, inference pipeline. Tier 1 recall +4.9 pts. |
| 2026-02-04 | | Training chronicle, entity frequency audit (323x imbalance discovered). |
| 2026-02-03 | v1.0 | First full training on A100. F1=0.904 on 360K examples. Model uploaded to HuggingFace. |
| 2026-02-03 | | NaN gauntlet: 4 distinct NaN sources identified and fixed (CRF overflow, AdamW bias-correction, BF16 mantissa, FP16 gradient scaler). |
| 2026-02-02 | | Smoke test passed (F1=0.947 on 100 examples). Differential learning rates proven essential. |
| 2026-02-01 | | Architecture design. Research phase complete (2,800+ lines across 8 reports). |

Key technical findings

  1. Differential learning rates are non-negotiable. A flat LR across pretrained backbone + random CRF head produces F1=0.000. A 50x ratio (backbone 2e-5, head 1e-3) is needed.

  2. AdamW eps=1.0 for pretrained backbones. Standard eps=1e-8 makes effective updates ~±lr regardless of gradient magnitude, causing NaN on DeBERTa with PyTorch 2.9+. Setting eps=1.0 restores gradient-proportional updates.

  3. The training spike is a head problem, not a backbone problem. v1.3 proved this definitively: the spike occurred at epoch 5 with the backbone already frozen since epoch 3. The CharCNN/GatingFusion/CRF head destabilizes under continued training.

  4. Epoch 3 is consistently the best checkpoint. Across v1.2, v1.3, and v1.4, the model peaks at epoch 3 then destabilizes. Earlier representations generalize better.

  5. Tier-weighted loss works but amplifies instability. 3x weight + 3x oversampling = ~9x gradient signal for Tier 1, which accelerates learning but accumulates damage.

  6. Backbone freeze hurts more than it helps. V1.4 confirmed: freezing the backbone after epoch 3 causes immediate F1 regression (0.889→0.806) and eval loss spike (2.4→8.3). The head cannot adapt without backbone co-training.
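Findings 1 and 2 translate directly into AdamW parameter groups. A sketch with stand-in modules (in the real model these would be the pretrained DeBERTa backbone and the CharCNN/GatingFusion/CRF head):

```python
import torch.nn as nn
from torch.optim import AdamW

# Stand-in modules; the real model pairs a pretrained DeBERTa-v3-xsmall
# backbone with a randomly initialized CharCNN/GatingFusion/CRF head.
backbone = nn.Linear(768, 768)
head = nn.Linear(768, 83)  # 83 BIO labels

optimizer = AdamW([
    # Pretrained backbone: small LR, eps=1.0 per finding 2
    {"params": backbone.parameters(), "lr": 2e-5, "eps": 1.0},
    # Randomly initialized head: 50x larger LR per finding 1
    {"params": head.parameters(), "lr": 1e-3},
])
```

Per-group overrides like `eps` apply only to that group, so the head keeps the default `eps=1e-8` while the backbone gets gradient-proportional updates.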

Open problems

  • Tier 1 recall gap: 0.814 vs 0.98 target. Passport number (0.447 F1) and PIN (0.556 F1) remain weak due to limited training examples.
  • Head instability: Backbone freeze causes immediate regression (F1 0.889→0.806, loss 2.4→8.3). Root cause is in the CharCNN/GatingFusion/CRF head. Gradient clipping, per-component LR decay, or early stopping are candidate fixes.
  • Precision vs coverage trade-off: v1.4 expanded entity coverage at cost of ~2pts F1 vs v1.3. Better synthetic data quality or curriculum learning could close this gap.
  • ONNX export: CRF Viterbi decode doesn't export cleanly; needs pure-PyTorch reimplementation.
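On the ONNX point, the Viterbi recurrence itself is simple; here it is in plain Python for clarity (function and argument names are illustrative). An exportable version would express the same max-plus recurrence with torch tensor ops so the graph can be traced:

```python
def viterbi_decode(emissions, transitions):
    """Return the max-scoring tag path under emission + transition scores.

    emissions: per-token lists of per-tag scores, shape (seq_len, num_tags).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])          # best score ending in each tag so far
    backpointers = []
    for em in emissions[1:]:
        new_score, back = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this step
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + em[j])
            back.append(best_i)
        score = new_score
        backpointers.append(back)
    # Follow backpointers from the best final tag
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for back in reversed(backpointers):
        best = back[best]
        path.append(best)
    path.reverse()
    return path
```

With zero transition scores this reduces to per-token argmax; strong negative off-diagonal transitions make the decoder prefer staying in one tag, which is the behavior a CRF uses to keep BIO spans consistent.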

Project structure

datafog-labs/
├── pii-ner-v1/
│   ├── src/datafog_pii_ner/      # Model, data pipeline, training, inference
│   ├── scripts/                   # Training runners, evaluation, data download
│   ├── configs/                   # YAML configs per GPU/version
│   ├── tests/                     # Unit + integration tests (6 modules)
│   ├── notebooks/                 # Experiment notebooks (Colab/local)
│   └── docs/                      # Training chronicle, eval docs
├── RESEARCH/                      # Pre-implementation research (8 reports)
├── docs/plans/                    # Design documents
└── .github/workflows/ci.yml      # Lint + test CI

License

Apache 2.0
