DataFog Labs

Open research and development for lightweight PII detection models. This repo contains the full training code, experiment history, and research behind DataFog's PII-NER model family.

Latest checkpoint: DataFog/pii-small-en on HuggingFace (v1.4, Apache 2.0)

PII-NER v1

A 22.7M parameter model for detecting 41 types of personally identifiable information in English text. Combines a pretrained DeBERTa-v3-xsmall backbone with a character CNN encoder, adaptive gating fusion, and CRF output layer.

Input Text
    |
[Tokenization + Word-to-Char mapping]
    |
DeBERTa-v3-xsmall (22M)  +  CharCNN (0.3M)
    |                            |
    +-------> Gating Fusion <----+
                  |
             CRF Head (0.2M)
                  |
         BIO Tag Predictions (Viterbi decode)
                  |
         Span-level PII Entities

The gating fusion dynamically weights character-level features (for structured PII like SSNs and credit cards) against contextual features (for soft PII like names and addresses) on a per-token basis.
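As a minimal sketch of this fusion step (the repo's actual layer shapes and projections may differ), a learned per-token sigmoid gate can blend the two feature streams:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gating_fusion(h_ctx, h_char, W_g, b_g):
    """Blend contextual and character features with a learned per-token gate.

    h_ctx, h_char: (num_tokens, dim) feature matrices, already projected
    to a shared width. W_g: (2 * dim, dim) gate weights, b_g: (dim,) bias.
    """
    g = sigmoid(np.concatenate([h_ctx, h_char], axis=-1) @ W_g + b_g)
    # g near 1 favors character features (structured PII like SSNs);
    # g near 0 favors contextual features (soft PII like names).
    return g * h_char + (1.0 - g) * h_ctx
```

Because the gate is computed from both streams per token, the same model can lean on character patterns for one token and sentence context for the next.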

Results

| Metric | v1.0 | v1.1 | v1.2 | v1.3 | v1.4 |
| --- | --- | --- | --- | --- | --- |
| Overall F1 | 0.904 | 0.901 | 0.901 | 0.907 | 0.889 |
| Precision | 0.907 | 0.906 | 0.905 | 0.898 | 0.870 |
| Recall | 0.902 | 0.895 | 0.896 | 0.916 | 0.910 |
| Tier 1 Recall (SSN, Credit Card, ...) | 0.722 | 0.771 | 0.841 | 0.823 | 0.814 |
| Tier 2 Recall (Person, Email, Phone, ...) | 0.934 | 0.933 | 0.936 | 0.945 | 0.937 |
| Tier 3 Recall (Username, Date, Location, ...) | 0.919 | 0.908 | 0.911 | 0.930 | 0.945 |
| Tier 4 Recall (Employee ID, IBAN, ...) | 0.866 | 0.844 | 0.845 | 0.868 | 0.937 |

v1.3 has the best overall F1 (0.907). v1.4 expanded the training data from 169K to 241K examples, adding 4 new data sources and 22K synthetic examples covering 11 entity types that previously had zero training data. v1.4 achieves the best Tier 3 and Tier 4 recall, and all previously-zero entity types now produce results. The F1 drop relative to v1.3 reflects broader entity coverage at the cost of some precision on existing types.
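As a sanity check on the table, overall F1 is the harmonic mean of precision and recall; plugging in v1.4's rounded headline numbers reproduces the reported value to within rounding:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# v1.4 headline precision/recall from the table above
print(f1(0.870, 0.910))  # ~0.8896, matching the reported 0.889
```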

Top entity F1 scores (v1.4)

| Entity | F1 | Entity | F1 |
| --- | --- | --- | --- |
| URL | 0.995 | Nationality | 0.993 |
| Religion | 1.000 | Crypto Wallet | 0.987 |
| Health Condition | 0.988 | Insurance Number | 0.986 |
| Student ID | 1.000 | Political Affiliation | 0.995 |
| Marital Status | 1.000 | Salary | 0.995 |
| Sexual Orientation | 1.000 | Criminal Record | 1.000 |
| Biometric | 0.975 | Gender | 0.957 |
| Email | 0.976 | Phone | 0.947 |

Quick start

from datafog_pii_ner.inference import PiiPipeline

pipeline = PiiPipeline.from_pretrained("DataFog/pii-small-en")
entities = pipeline("My SSN is 123-45-6789 and email is john@example.com")
# [PiiEntity(text='123-45-6789', label='SSN', start=10, end=21, tier=1),
#  PiiEntity(text='john@example.com', label='EMAIL', start=32, end=48, tier=2)]

Setup

cd pii-ner-v1
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check src/ tests/ scripts/

# Smoke test (requires GPU)
python -m scripts.smoke_test

# Full training
python scripts/train_v1.3.py --config configs/h100-v1.3.yaml

Evaluation

python scripts/eval_benchmark.py \
  --model datafog \
  --model-path DataFog/pii-small-en \
  --dataset combined \
  --split test

See eval_benchmark.md for flags and options.

Training data

| Dataset | Size | License |
| --- | --- | --- |
| AI4Privacy | ~43K examples (English) | Apache 2.0 |
| NVIDIA Nemotron-PII | ~100K examples | CC-BY-4.0 |
| Gretel Synthetic PII Finance | ~26K examples | Apache 2.0 |
| Gretel PII Masking EN v1 | ~50K examples | Apache 2.0 |
| Synthetic (generated) | ~22K examples | Apache 2.0 |

Combined: ~241K English examples after filtering and dedup. 41 canonical entity types across 4 sensitivity tiers, unified into 83 BIO labels. The synthetic data covers 11 entity types that had zero examples in the open-source datasets: NATIONALITY, RELIGION, MARITAL_STATUS, STUDENT_ID, CRYPTO_WALLET, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, and HEALTH_CONDITION.
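The 83-label figure follows directly from the BIO scheme: one O label plus a B- and I- tag for each of the 41 entity types. A quick sanity check (the entity names here are placeholders, not the repo's canonical list):

```python
NUM_ENTITY_TYPES = 41

# Placeholder names; the repo defines the actual 41 canonical types.
entity_types = [f"ENTITY_{i}" for i in range(NUM_ENTITY_TYPES)]

# One "O" label plus B-/I- variants per entity type.
bio_labels = ["O"] + [f"{prefix}-{ent}" for ent in entity_types for prefix in ("B", "I")]

print(len(bio_labels))  # → 83
```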

Documentation

| Document | Description |
| --- | --- |
| Training Chronicle | Full narrative of the ML journey: 4 NaN sources, backbone instability, tier-weighted loss, freezing experiments |
| Smoke Test Walkthrough | Why differential learning rates are essential for pretrained+CRF architectures |
| Evaluation Harness | Head-to-head model comparison on the same test split |
| Design Document | Original architecture decisions and project structure |

Research

The RESEARCH/ directory contains the pre-implementation research: 8 reports surveying 26 architectures, 9 PII datasets, and the competitive landscape. Includes a 29-slide interactive architecture guide.

Key finding: no published work combines differentiable character-level pattern recognition with contextual transformers specifically for PII detection.

Development log

| Date | Version | What changed |
| --- | --- | --- |
| 2026-02-08 | v1.4 | Full entity coverage. Added 4 new data sources (241K total), synthetic data for 11 zero-occurrence types. All 41 entity types now produce results. Best Tier 3 (0.945) and Tier 4 (0.937) recall. Backbone freeze confirmed harmful; epoch 3 best, stopped early. |
| 2026-02-07 | v1.3 | Best F1 (0.907). Early backbone freeze (epoch 3) + progressive tier weight reduction. Discovered training spikes originate in head components, not the backbone. |
| 2026-02-05 | v1.2 | Best Tier 1 recall (0.841). Backbone freezing after epoch 4. Epoch 3 identified as consistent sweet spot. |
| 2026-02-04 | v1.1 | Tier-weighted CRF loss (3x for Tier 1), rare entity oversampling, inference pipeline. Tier 1 recall +4.9 pts. |
| 2026-02-04 | | Training chronicle, entity frequency audit (323x imbalance discovered). |
| 2026-02-03 | v1.0 | First full training on A100. F1=0.904 on 360K examples. Model uploaded to HuggingFace. |
| 2026-02-03 | | NaN gauntlet: 4 distinct NaN sources identified and fixed (CRF overflow, AdamW bias-correction, BF16 mantissa, FP16 gradient scaler). |
| 2026-02-02 | | Smoke test passed (F1=0.947 on 100 examples). Differential learning rates proven essential. |
| 2026-02-01 | | Architecture design. Research phase complete (2,800+ lines across 8 reports). |

Key technical findings

  1. Differential learning rates are non-negotiable. A flat LR across pretrained backbone + random CRF head produces F1=0.000. A 50x ratio (backbone 2e-5, head 1e-3) is needed.

  2. AdamW eps=1.0 for pretrained backbones. Standard eps=1e-8 makes effective updates ~±lr regardless of gradient magnitude, causing NaN on DeBERTa with PyTorch 2.9+. Setting eps=1.0 restores gradient-proportional updates.

  3. The training spike is a head problem, not a backbone problem. v1.3 proved this definitively: the spike occurred at epoch 5 with the backbone already frozen since epoch 3. The CharCNN/GatingFusion/CRF head destabilizes under continued training.

  4. Epoch 3 is consistently the best checkpoint. Across v1.2, v1.3, and v1.4, the model peaks at epoch 3 then destabilizes. Earlier representations generalize better.

  5. Tier-weighted loss works but amplifies instability. 3x weight + 3x oversampling = ~9x gradient signal for Tier 1, which accelerates learning but accumulates damage.

  6. Backbone freeze hurts more than it helps. V1.4 confirmed: freezing the backbone after epoch 3 causes immediate F1 regression (0.889→0.806) and eval loss spike (2.4→8.3). The head cannot adapt without backbone co-training.
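Findings 1 and 2 translate directly into AdamW parameter groups. A sketch with stand-in modules (in the real model these would be the pretrained DeBERTa backbone and the CharCNN/GatingFusion/CRF head):

```python
import torch.nn as nn
from torch.optim import AdamW

# Stand-in modules; the real model pairs a pretrained DeBERTa-v3-xsmall
# backbone with a randomly initialized CharCNN/GatingFusion/CRF head.
backbone = nn.Linear(768, 768)
head = nn.Linear(768, 83)  # 83 BIO labels

optimizer = AdamW([
    # Pretrained backbone: small LR, eps=1.0 per finding 2
    {"params": backbone.parameters(), "lr": 2e-5, "eps": 1.0},
    # Randomly initialized head: 50x larger LR per finding 1
    {"params": head.parameters(), "lr": 1e-3},
])
```

Per-group overrides like `eps` apply only to that group, so the head keeps the default `eps=1e-8` while the backbone gets gradient-proportional updates.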

Open problems

  • Tier 1 recall gap: 0.814 vs 0.98 target. Passport number (0.447 F1) and PIN (0.556 F1) remain weak due to limited training examples.
  • Head instability: Backbone freeze causes immediate regression (F1 0.889→0.806, loss 2.4→8.3). Root cause is in the CharCNN/GatingFusion/CRF head. Gradient clipping, per-component LR decay, or early stopping are candidate fixes.
  • Precision vs coverage trade-off: v1.4 expanded entity coverage at cost of ~2pts F1 vs v1.3. Better synthetic data quality or curriculum learning could close this gap.
  • ONNX export: CRF Viterbi decode doesn't export cleanly; needs pure-PyTorch reimplementation.
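On the ONNX point, the Viterbi recurrence itself is simple; here it is in plain Python for clarity (function and argument names are illustrative). An exportable version would express the same max-plus recurrence with torch tensor ops so the graph can be traced:

```python
def viterbi_decode(emissions, transitions):
    """Return the max-scoring tag path under emission + transition scores.

    emissions: per-token lists of per-tag scores, shape (seq_len, num_tags).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])          # best score ending in each tag so far
    backpointers = []
    for em in emissions[1:]:
        new_score, back = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this step
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + em[j])
            back.append(best_i)
        score = new_score
        backpointers.append(back)
    # Follow backpointers from the best final tag
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for back in reversed(backpointers):
        best = back[best]
        path.append(best)
    path.reverse()
    return path
```

With zero transition scores this reduces to per-token argmax; strong negative off-diagonal transitions make the decoder prefer staying in one tag, which is the behavior a CRF uses to keep BIO spans consistent.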

Project structure

datafog-labs/
├── pii-ner-v1/
│   ├── src/datafog_pii_ner/      # Model, data pipeline, training, inference
│   ├── scripts/                   # Training runners, evaluation, data download
│   ├── configs/                   # YAML configs per GPU/version
│   ├── tests/                     # Unit + integration tests (6 modules)
│   ├── notebooks/                 # Experiment notebooks (Colab/local)
│   └── docs/                      # Training chronicle, eval docs
├── RESEARCH/                      # Pre-implementation research (8 reports)
├── docs/plans/                    # Design documents
└── .github/workflows/ci.yml      # Lint + test CI

License

Apache 2.0
