feat: --placeholder-scheme indexed for distinguishable per-entity placeholders by mvanhorn · Pull Request #9 · openai/privacy-filter

mvanhorn · 2026-04-22T23:01:03Z

Summary

Add --placeholder-scheme {label,indexed} to opf redact. Default stays label (current behavior, zero breakage). In indexed mode, same surface text within a single redaction call gets the same numbered placeholder, so downstream NLP can still distinguish entities. schema_version stays at 1 (OUTPUT_SCHEMAS.md: "Additive fields may appear over time").
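As a rough illustration of the flag's shape, here is a minimal sketch of a two-choice option whose default preserves current behavior. It assumes an argparse-style parser; `build_parser` and the `opf-sketch` program name are hypothetical stand-ins, not the code under opf/_cli/.

```python
import argparse

# Mirrors the PLACEHOLDER_SCHEMES constant named in this PR.
PLACEHOLDER_SCHEMES = ("label", "indexed")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser; the real CLI wiring may differ.
    parser = argparse.ArgumentParser(prog="opf-sketch")
    parser.add_argument(
        "--placeholder-scheme",
        choices=PLACEHOLDER_SCHEMES,
        default="label",  # default keeps the existing <LABEL> output
        help="label: <LABEL>; indexed: <LABEL_N> per unique surface text",
    )
    parser.add_argument("text")
    return parser


default_args = build_parser().parse_args(["some text"])
indexed_args = build_parser().parse_args(
    ["--placeholder-scheme", "indexed", "some text"]
)
print(default_args.placeholder_scheme, indexed_args.placeholder_scheme)
```

With `choices` set, an unknown scheme is rejected at parse time, which is the kind of validation the CLI test in this PR exercises.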

Why

The README positions OPF for "high-throughput data sanitization workflows", which typically feed redacted text to downstream NLP or analysts. Today, on a realistic input:

$ opf --device cpu 'Alice Jones was born on 1990-01-02. Her email is alice@example.com. Bob called Alice and Alice replied.'
<PRIVATE_PERSON> was born on <PRIVATE_DATE>. Her email is <PRIVATE_EMAIL>. <PRIVATE_PERSON> called <PRIVATE_PERSON> and <PRIVATE_PERSON> replied.

All four person mentions (Alice Jones, Bob, Alice, Alice) collapse to the identical string, so coreference, pronoun resolution, and multi-entity relations are lost at the first step. Presidio solves this with its ReAnonymizer operator and an entity_mapping dict, so repeated values get consistent anonymized forms.

Changes

opf/_common/constants.py: new PLACEHOLDER_SCHEMES: Final[tuple[str, str]] = ("label", "indexed").
opf/_core/runtime.py: new _indexed_placeholder(label, index) (returns f"<{label.upper()}_{index}>") and _assign_indexed_placeholders(spans) (walks spans in order, assigns per-label 1-based indices keyed by trimmed surface text, returns new DetectedSpan objects with rewritten placeholders).
opf/_api.py: OPF.__init__ accepts placeholder_scheme: Literal["label","indexed"] = "label" with PLACEHOLDER_SCHEMES validation. OPF.redact applies _assign_indexed_placeholders after predict_text and before _redact_text / build_detection_summary when the scheme is indexed.
opf/_cli/common.py, opf/_cli/args.py, opf/_cli/render.py: new --placeholder-scheme {label,indexed} flag, plumbed through build_redactor_from_args.
OUTPUT_SCHEMAS.md: document the indexed value format <LABEL_N> and the redacted-mode interaction (<REDACTED_N>).
tests/test_placeholder_scheme.py: 7 tests covering default behavior (label scheme unchanged), indexed behavior (same surface reuses index, distinct surfaces get different indices, per-label namespacing, redacted output mode + indexed), Python API parity, build_redactor_from_args threading, and CLI validation rejecting unknown schemes.
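The indexing step described for opf/_core/runtime.py can be sketched as follows. This is a minimal illustration, not the PR's code: `Span` is a hypothetical stand-in for DetectedSpan with only the fields the sketch needs, and the function names drop the leading underscore to mark them as illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    # Hypothetical stand-in for the project's DetectedSpan.
    label: str
    text: str
    placeholder: str


def indexed_placeholder(label: str, index: int) -> str:
    return f"<{label.upper()}_{index}>"


def assign_indexed_placeholders(spans: list[Span]) -> list[Span]:
    # One {surface text -> index} table per label, filled in first-appearance order.
    seen: dict[str, dict[str, int]] = {}
    out = []
    for span in spans:
        per_label = seen.setdefault(span.label, {})
        key = span.text.strip()
        # Reuse the index for a known surface; otherwise take the next 1-based index.
        index = per_label.setdefault(key, len(per_label) + 1)
        out.append(replace(span, placeholder=indexed_placeholder(span.label, index)))
    return out


spans = [
    Span("private_person", "Alice Jones", "<PRIVATE_PERSON>"),
    Span("private_date", "1990-01-02", "<PRIVATE_DATE>"),
    Span("private_person", "Bob", "<PRIVATE_PERSON>"),
    Span("private_person", "Alice", "<PRIVATE_PERSON>"),
    Span("private_person", "Alice", "<PRIVATE_PERSON>"),
]
rewritten = [s.placeholder for s in assign_indexed_placeholders(spans)]
print(rewritten)
```

The input spans are left untouched and new objects are returned, matching the "returns new DetectedSpan objects" behavior described above.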

Surface-form canonicalization is span.text.strip(): an exact, case-sensitive match after the existing whitespace trim. Fuzzy matching, coreference resolution, and case folding are out of scope (those are ML problems, not output-formatting problems).
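To make the matching rule concrete, here is a small sketch of which surfaces would and would not share an index under a strip-only key. The helper name `surface_key` is hypothetical and exists only for illustration.

```python
def surface_key(span_text: str) -> str:
    # Trim surrounding whitespace only; no case folding or fuzzy matching.
    return span_text.strip()


# Whitespace differences collapse to the same key, so these share an index.
same_after_trim = surface_key(" Alice") == surface_key("Alice\n")

# Case differences and partial names stay distinct, so these get separate indices.
differs_by_case = surface_key("Alice") != surface_key("alice")
differs_by_length = surface_key("Alice") != surface_key("Alice Jones")

print(same_after_trim, differs_by_case, differs_by_length)
```

One consequence is that "Alice Jones" and a later bare "Alice" are numbered as different entities, which is the behavior shown in the demo for this PR.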

--output-mode redacted + indexed scheme: spans come back labeled "redacted", so the indexed namespace collapses to a single <REDACTED_N> sequence per unique surface. Documented and tested.
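The namespace collapse can be illustrated with a short sketch. It assumes per-label, first-appearance numbering as described in this PR; `number_surfaces` is a hypothetical helper that works on plain (label, surface) pairs rather than real span objects.

```python
def number_surfaces(labeled_surfaces):
    """Assign per-label 1-based indices by first appearance of each surface."""
    counters = {}
    result = []
    for label, surface in labeled_surfaces:
        per_label = counters.setdefault(label, {})
        index = per_label.setdefault(surface.strip(), len(per_label) + 1)
        result.append(f"<{label.upper()}_{index}>")
    return result


typed = [
    ("private_person", "Alice"),
    ("private_email", "alice@example.com"),
    ("private_person", "Bob"),
]
# In redacted output mode every span carries the single label "redacted".
redacted = [("redacted", surface) for _, surface in typed]

typed_placeholders = number_surfaces(typed)
redacted_placeholders = number_surfaces(redacted)
print(typed_placeholders)
print(redacted_placeholders)
```

With typed labels the person and email counters run independently; once every label is "redacted", one counter covers all surfaces.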

Verification

Ran against the real checkpoint (~/.opf/privacy_filter) on CPU:

$ python -m unittest tests.test_placeholder_scheme -v
test_api_rejects_invalid_placeholder_scheme ... ok
test_build_redactor_from_args_threads_placeholder_scheme ... ok
test_indexed_scheme_reuses_same_surface_form ... ok
test_indexed_scheme_uses_per_label_namespaces ... ok
test_label_scheme_keeps_current_placeholders ... ok
test_redacted_output_mode_with_indexed_scheme ... ok
test_cli_rejects_invalid_scheme ... ok

Ran 7 tests in 0.335s
OK

End-to-end smoke test matches the expected behavior:

$ opf --placeholder-scheme indexed 'John Doe called John Doe yesterday. John Doe also called Jane Smith.'
<PRIVATE_PERSON_1> called <PRIVATE_PERSON_1> <PRIVATE_DATE_1>. <PRIVATE_PERSON_1> also called <PRIVATE_PERSON_2>.

Demo

Three frames:

Default label scheme on "Alice Jones was born on 1990-01-02. Her email is alice@example.com. Bob called Alice and Alice replied." All 4 person mentions collapse to identical <PRIVATE_PERSON>.
Same input, --placeholder-scheme indexed. "Alice Jones", "Bob", and "Alice" get <PRIVATE_PERSON_1>, <PRIVATE_PERSON_2>, <PRIVATE_PERSON_3> respectively. The two "Alice" mentions share <PRIVATE_PERSON_3> because the surface form matches.
Presidio-style "John Doe called John Doe" case. All three "John Doe" mentions get <PRIVATE_PERSON_1>; "Jane Smith" gets <PRIVATE_PERSON_2>.
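For downstream consumers, the practical benefit is that entities can be counted and grouped again after redaction. The sketch below assumes the documented <LABEL_N> format; the regular expression and the `mention_counts` helper are illustrative and not part of OPF.

```python
import re
from collections import Counter

# Matches indexed placeholders such as <PRIVATE_PERSON_1>; plain <PRIVATE_PERSON> is ignored.
PLACEHOLDER_RE = re.compile(r"<([A-Z][A-Z_]*)_(\d+)>")


def mention_counts(redacted_text: str) -> Counter:
    # Count how often each indexed placeholder occurs in the text.
    return Counter(m.group(0) for m in PLACEHOLDER_RE.finditer(redacted_text))


text = (
    "<PRIVATE_PERSON_1> called <PRIVATE_PERSON_1> <PRIVATE_DATE_1>. "
    "<PRIVATE_PERSON_1> also called <PRIVATE_PERSON_2>."
)
counts = mention_counts(text)
distinct_people = {p for p in counts if p.startswith("<PRIVATE_PERSON_")}
print(counts)
print(len(distinct_people))
```

Under the default label scheme the same text would contain only <PRIVATE_PERSON>, and the two people could not be told apart.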

Scope notes

Not in this PR:

Hash-based placeholders (salt handling + security implications deferred).
Configurable placeholder templates (--placeholder-template "<{label}_{index}>").
Cross-call stable identity (requires external state).
Coreference resolution, fuzzy matching, or case folding.

Orthogonal to issue #6 (batched inference API) and PR #5 (--stdin-mode). Composes cleanly with both and with the confidence-scores PR.

Developed with AI assistance.

…ceholders

Add `--placeholder-scheme {label,indexed}` to `opf redact`. Default stays `label` (current behavior, zero breakage). In `indexed` mode, same surface text within a single `redact` call gets the same numbered placeholder, so "John Doe called John Doe" redacts to "<PRIVATE_PERSON_1> called <PRIVATE_PERSON_1>" and downstream code can still tell distinct entities apart.

Indices are per-label and 1-based (first-appearance order). `--output-mode redacted` + indexed scheme namespaces all spans under `redacted`, producing `<REDACTED_1>`, `<REDACTED_2>`, etc. per unique surface form.

Python API: `OPF(placeholder_scheme="indexed")`. Mirrors Presidio's `ReAnonymizer` + `entity_mapping` consistency pattern. schema_version stays at 1 per OUTPUT_SCHEMAS.md's additive-field clause.

mvanhorn wants to merge 1 commit into openai:main from mvanhorn:feat/placeholder-scheme-indexed
