Open research and development for lightweight PII detection models. This repo contains the full training code, experiment history, and research behind DataFog's PII-NER model family.
Latest checkpoint: DataFog/pii-small-en on HuggingFace (v1.4, Apache 2.0)
A 22.7M-parameter model for detecting 41 types of personally identifiable information in English text. It combines a pretrained DeBERTa-v3-xsmall backbone with a character CNN encoder, adaptive gating fusion, and a CRF output layer.
                Input Text
                     |
   [Tokenization + Word-to-Char mapping]
         |                      |
DeBERTa-v3-xsmall (22M)   CharCNN (0.3M)
         |                      |
         +---> Gating Fusion <--+
                     |
              CRF Head (0.2M)
                     |
   BIO Tag Predictions (Viterbi decode)
                     |
          Span-level PII Entities
The gating fusion dynamically weights character-level features (for structured PII like SSNs and credit cards) against contextual features (for soft PII like names and addresses) on a per-token basis.
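A minimal sketch of that gating idea, assuming a per-token sigmoid gate over the concatenated features (module names, dimensions, and layer layout here are illustrative, not the repo's actual classes):

```python
import torch
import torch.nn as nn

class GatingFusion(nn.Module):
    """Per-token gate mixing contextual and character-level features (illustrative)."""

    def __init__(self, ctx_dim: int, char_dim: int, out_dim: int):
        super().__init__()
        self.char_proj = nn.Linear(char_dim, ctx_dim)   # align CharCNN width with the backbone
        self.gate = nn.Linear(2 * ctx_dim, ctx_dim)     # per-token, per-feature gate
        self.out = nn.Linear(ctx_dim, out_dim)

    def forward(self, ctx: torch.Tensor, char: torch.Tensor) -> torch.Tensor:
        # ctx:  (batch, seq, ctx_dim)  contextual features from DeBERTa
        # char: (batch, seq, char_dim) character features from the CharCNN
        char = self.char_proj(char)
        g = torch.sigmoid(self.gate(torch.cat([ctx, char], dim=-1)))
        # g near 1 leans on character patterns (SSNs, credit cards);
        # g near 0 leans on transformer context (names, addresses).
        fused = g * char + (1.0 - g) * ctx
        return self.out(fused)
```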
| Metric | V1.0 | V1.1 | V1.2 | V1.3 | V1.4 |
|---|---|---|---|---|---|
| Overall F1 | 0.904 | 0.901 | 0.901 | 0.907 | 0.889 |
| Precision | 0.907 | 0.906 | 0.905 | 0.898 | 0.870 |
| Recall | 0.902 | 0.895 | 0.896 | 0.916 | 0.910 |
| Tier 1 Recall (SSN, Credit Card, ...) | 0.722 | 0.771 | 0.841 | 0.823 | 0.814 |
| Tier 2 Recall (Person, Email, Phone, ...) | 0.934 | 0.933 | 0.936 | 0.945 | 0.937 |
| Tier 3 Recall (Username, Date, Location, ...) | 0.919 | 0.908 | 0.911 | 0.930 | 0.945 |
| Tier 4 Recall (Employee ID, IBAN, ...) | 0.866 | 0.844 | 0.845 | 0.868 | 0.937 |
V1.3 has the best overall F1 (0.907). V1.4 expanded training data from 169K to 241K examples with 4 new data sources and 22K synthetic examples covering 11 entity types that previously had zero training data. V1.4 achieves the best Tier 3 and Tier 4 recall — all previously-zero entity types now produce results. The F1 drop vs v1.3 reflects the broader entity coverage at the cost of some precision on existing types.
| Entity | F1 | Entity | F1 |
|---|---|---|---|
| URL | 0.995 | Nationality | 0.993 |
| Religion | 1.000 | Crypto Wallet | 0.987 |
| Health Condition | 0.988 | Insurance Number | 0.986 |
| Student ID | 1.000 | Political Affiliation | 0.995 |
| Marital Status | 1.000 | Salary | 0.995 |
| Sexual Orientation | 1.000 | Criminal Record | 1.000 |
| Biometric | 0.975 | Gender | 0.957 |
| | 0.976 | Phone | 0.947 |
from datafog_pii_ner.inference import PiiPipeline
pipeline = PiiPipeline.from_pretrained("DataFog/pii-small-en")
entities = pipeline("My SSN is 123-45-6789 and email is john@example.com")
# [PiiEntity(text='123-45-6789', label='SSN', start=10, end=21, tier=1),
#  PiiEntity(text='john@example.com', label='EMAIL', start=35, end=51, tier=2)]

cd pii-ner-v1
pip install -e ".[dev]"
# Run tests
pytest
# Lint
ruff check src/ tests/ scripts/
# Smoke test (requires GPU)
python -m scripts.smoke_test
# Full training
python scripts/train_v1.3.py --config configs/h100-v1.3.yaml

python scripts/eval_benchmark.py \
--model datafog \
--model-path DataFog/pii-small-en \
--dataset combined \
--split test

See eval_benchmark.md for flags and options.
| Dataset | Size | License |
|---|---|---|
| AI4Privacy | ~43K examples (English) | Apache 2.0 |
| NVIDIA Nemotron-PII | ~100K examples | CC-BY-4.0 |
| Gretel Synthetic PII Finance | ~26K examples | Apache 2.0 |
| Gretel PII Masking EN v1 | ~50K examples | Apache 2.0 |
| Synthetic (generated) | ~22K examples | Apache 2.0 |
Combined: ~241K English examples after filtering and dedup. 41 canonical entity types across 4 sensitivity tiers, unified into 83 BIO labels. The synthetic data covers 11 entity types that had zero examples in the open-source datasets: NATIONALITY, RELIGION, MARITAL_STATUS, STUDENT_ID, CRYPTO_WALLET, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, and HEALTH_CONDITION.
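For illustration, the BIO expansion works out as 1 + 2 x 41 = 83 labels. The entity names below are a truncated, hypothetical subset; the canonical 41-type list is defined in the repo:

```python
# Hypothetical illustration of the BIO label expansion; only a few of the
# 41 canonical entity types are shown here.
ENTITY_TYPES = ["SSN", "CREDIT_CARD", "PERSON", "EMAIL", "PHONE", "NATIONALITY"]

def build_bio_labels(entity_types):
    labels = ["O"]                        # single "outside" label
    for ent in entity_types:
        labels += [f"B-{ent}", f"I-{ent}"]  # begin/inside pair per type
    return labels

print(len(build_bio_labels(ENTITY_TYPES)))  # 13 for this truncated list; 83 with all 41 types
```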
| Document | Description |
|---|---|
| Training Chronicle | Full narrative of the ML journey: 4 NaN sources, backbone instability, tier-weighted loss, freezing experiments |
| Smoke Test Walkthrough | Why differential learning rates are essential for pretrained+CRF architectures |
| Evaluation Harness | Head-to-head model comparison on the same test split |
| Design Document | Original architecture decisions and project structure |
The RESEARCH/ directory contains the pre-implementation research: 8 reports surveying 26 architectures, 9 PII datasets, and the competitive landscape. Includes a 29-slide interactive architecture guide.
Key finding: no published work combines differentiable character-level pattern recognition with contextual transformers specifically for PII detection.
| Date | Version | What changed |
|---|---|---|
| 2026-02-08 | v1.4 | Full entity coverage. Added 4 new data sources (241K total), synthetic data for 11 zero-occurrence types. All 41 entity types now produce results. Best T3 (0.945) and T4 (0.937) recall. Backbone freeze confirmed harmful — epoch 3 best, stopped early. |
| 2026-02-07 | v1.3 | Best F1 (0.907). Early backbone freeze (epoch 3) + progressive tier weight reduction. Discovered training spikes originate in head components, not backbone. |
| 2026-02-05 | v1.2 | Best Tier 1 recall (0.841). Backbone freezing after epoch 4. Epoch 3 identified as consistent sweet spot. |
| 2026-02-04 | v1.1 | Tier-weighted CRF loss (3x for Tier 1), rare entity oversampling, inference pipeline. Tier 1 recall +4.9pts. |
| 2026-02-04 | — | Training chronicle, entity frequency audit (323x imbalance discovered). |
| 2026-02-03 | v1.0 | First full training on A100. F1=0.904 on 360K examples. Model uploaded to HuggingFace. |
| 2026-02-03 | — | NaN gauntlet: 4 distinct NaN sources identified and fixed (CRF overflow, AdamW bias-correction, BF16 mantissa, FP16 gradient scaler). |
| 2026-02-02 | — | Smoke test passed (F1=0.947 on 100 examples). Differential learning rates proven essential. |
| 2026-02-01 | — | Architecture design. Research phase complete (2,800+ lines across 8 reports). |
- Differential learning rates are non-negotiable. A flat LR across a pretrained backbone and a randomly initialized CRF head produces F1=0.000. A 50x ratio (backbone 2e-5, head 1e-3) is needed; see the optimizer sketch after this list.
- AdamW eps=1.0 for pretrained backbones. The standard eps=1e-8 makes effective updates ~±lr regardless of gradient magnitude, causing NaN on DeBERTa with PyTorch 2.9+. Setting eps=1.0 restores gradient-proportional updates.
- The training spike is a head problem, not a backbone problem. V1.3 proved this definitively: the spike occurred at epoch 5 with the backbone already frozen since epoch 3. The CharCNN/GatingFusion/CRF head destabilizes under continued training.
- Epoch 3 is consistently the best checkpoint. Across v1.2, v1.3, and v1.4, the model peaks at epoch 3 and then destabilizes. Earlier representations generalize better.
- Tier-weighted loss works but amplifies instability. 3x weight + 3x oversampling = ~9x gradient signal for Tier 1, which accelerates learning but accumulates damage.
- Backbone freezing hurts more than it helps. V1.4 confirmed it: freezing the backbone after epoch 3 causes an immediate F1 regression (0.889→0.806) and an eval loss spike (2.4→8.3). The head cannot adapt without backbone co-training.
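A minimal sketch of how the first two lessons could be wired together in PyTorch. The `backbone` attribute, the parameter-name filter, and the weight decay value are assumptions for illustration, not the repo's exact code:

```python
import torch

def build_optimizer(model):
    # Assumes `model.backbone` is the pretrained DeBERTa; everything else
    # (CharCNN, gating fusion, CRF) is treated as the head.
    head_params = [
        p for name, p in model.named_parameters()
        if not name.startswith("backbone.")
    ]
    return torch.optim.AdamW(
        [
            # Pretrained backbone: small LR, large eps so updates stay
            # proportional to gradient magnitude (avoids the NaN failure mode).
            {"params": model.backbone.parameters(), "lr": 2e-5, "eps": 1.0},
            # Randomly initialized head: ~50x higher LR, default eps.
            {"params": head_params, "lr": 1e-3},
        ],
        weight_decay=0.01,
    )
```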
- Tier 1 recall gap: 0.814 vs 0.98 target. Passport number (0.447 F1) and PIN (0.556 F1) remain weak due to limited training examples.
- Head instability: Backbone freeze causes immediate regression (F1 0.889→0.806, loss 2.4→8.3). Root cause is in the CharCNN/GatingFusion/CRF head. Gradient clipping, per-component LR decay, or early stopping are candidate fixes; see the sketch after this list.
- Precision vs coverage trade-off: v1.4 expanded entity coverage at cost of ~2pts F1 vs v1.3. Better synthetic data quality or curriculum learning could close this gap.
- ONNX export: CRF Viterbi decode doesn't export cleanly; needs pure-PyTorch reimplementation.
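As a rough illustration of the per-component clipping idea mentioned under head instability (assumed module layout, not the repo's training loop):

```python
import torch

def clip_by_component(model, backbone_max_norm=1.0, head_max_norm=0.5):
    # Assumes a `backbone` attribute; everything else (CharCNN, gating
    # fusion, CRF) is treated as the head. Clipping each group to its own
    # budget keeps a spike in the head from hiding behind a small global norm.
    backbone_params = list(model.backbone.parameters())
    head_params = [
        p for name, p in model.named_parameters()
        if not name.startswith("backbone.")
    ]
    torch.nn.utils.clip_grad_norm_(backbone_params, backbone_max_norm)
    torch.nn.utils.clip_grad_norm_(head_params, head_max_norm)

# Called between loss.backward() and optimizer.step().
```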
datafog-labs/
├── pii-ner-v1/
│ ├── src/datafog_pii_ner/ # Model, data pipeline, training, inference
│ ├── scripts/ # Training runners, evaluation, data download
│ ├── configs/ # YAML configs per GPU/version
│ ├── tests/ # Unit + integration tests (6 modules)
│ ├── notebooks/ # Experiment notebooks (Colab/local)
│ └── docs/ # Training chronicle, eval docs
├── RESEARCH/ # Pre-implementation research (8 reports)
├── docs/plans/ # Design documents
└── .github/workflows/ci.yml # Lint + test CI
Apache 2.0