feat: --placeholder-scheme indexed for distinguishable per-entity placeholders #9

Open
mvanhorn wants to merge 1 commit into openai:main from mvanhorn:feat/placeholder-scheme-indexed

Summary

Add --placeholder-scheme {label,indexed} to opf redact. Default stays label (current behavior, zero breakage). In indexed mode, same surface text within a single redaction call gets the same numbered placeholder, so downstream NLP can still distinguish entities. schema_version stays at 1 (OUTPUT_SCHEMAS.md: "Additive fields may appear over time").

Why

The README positions OPF for "high-throughput data sanitization workflows", which typically feed redacted text to downstream NLP or analysts. Today, on a realistic input:

$ opf --device cpu 'Alice Jones was born on 1990-01-02. Her email is alice@example.com. Bob called Alice and Alice replied.'
<PRIVATE_PERSON> was born on <PRIVATE_DATE>. Her email is <PRIVATE_EMAIL>. <PRIVATE_PERSON> called <PRIVATE_PERSON> and <PRIVATE_PERSON> replied.

All four person mentions (Alice Jones, Bob, Alice, Alice) collapse to the identical string, so coreference, pronoun resolution, and multi-entity relations are lost at the first step. Presidio solves this with its ReAnonymizer operator and an entity_mapping dict, so repeated values get consistent anonymized forms.

Changes

  • opf/_common/constants.py: new PLACEHOLDER_SCHEMES: Final[tuple[str, str]] = ("label", "indexed").
  • opf/_core/runtime.py: new _indexed_placeholder(label, index) (returns f"<{label.upper()}_{index}>") and _assign_indexed_placeholders(spans) (walks spans in order, assigns per-label 1-based indices keyed by trimmed surface text, returns new DetectedSpan objects with rewritten placeholders).
  • opf/_api.py: OPF.__init__ accepts placeholder_scheme: Literal["label","indexed"] = "label" with PLACEHOLDER_SCHEMES validation. OPF.redact applies _assign_indexed_placeholders after predict_text and before _redact_text / build_detection_summary when the scheme is indexed.
  • opf/_cli/common.py, opf/_cli/args.py, opf/_cli/render.py: new --placeholder-scheme {label,indexed} flag, plumbed through build_redactor_from_args.
  • OUTPUT_SCHEMAS.md: document the indexed value format <LABEL_N> and the redacted-mode interaction (<REDACTED_N>).
  • tests/test_placeholder_scheme.py: 7 tests covering default behavior (label scheme unchanged), indexed behavior (same surface reuses index, distinct surfaces get different indices, per-label namespacing, redacted output mode + indexed), Python API parity, build_redactor_from_args threading, and CLI validation rejecting unknown schemes.
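The validation and flag plumbing listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: `validate_placeholder_scheme` and `build_parser` are hypothetical names used for illustration, and only the `PLACEHOLDER_SCHEMES` constant and the `--placeholder-scheme` flag come from the description.

```python
import argparse
from typing import Final

# Mirrors the constant described for opf/_common/constants.py.
PLACEHOLDER_SCHEMES: Final[tuple[str, str]] = ("label", "indexed")


def validate_placeholder_scheme(scheme: str) -> str:
    """Hypothetical helper: reject anything outside PLACEHOLDER_SCHEMES."""
    if scheme not in PLACEHOLDER_SCHEMES:
        raise ValueError(
            f"placeholder_scheme must be one of {PLACEHOLDER_SCHEMES}, got {scheme!r}"
        )
    return scheme


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser carrying only the new flag."""
    parser = argparse.ArgumentParser(prog="opf")
    parser.add_argument(
        "--placeholder-scheme",
        choices=PLACEHOLDER_SCHEMES,  # argparse rejects unknown schemes itself
        default="label",              # default keeps the current behavior
    )
    return parser
```

Using `choices` lets the CLI fail early on an unknown scheme, while the explicit check covers callers that go through the Python API instead.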

Surface-form canonicalization is span.text.strip(): an exact, case-sensitive match after the existing whitespace trim. Fuzzy matching, coreference resolution, and case folding are out of scope (those are ML problems, not output-formatting problems).

--output-mode redacted + indexed scheme: spans come back labeled "redacted", so the indexed namespace collapses to a single <REDACTED_N> sequence per unique surface. Documented and tested.
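The namespace collapse can be illustrated with a small self-contained sketch. It assumes spans reduce to `(label, text)` pairs and that redacted mode relabels every span as `redacted` before indexing. `indexed_placeholders` is a hypothetical helper, not a function in this PR.

```python
def indexed_placeholders(spans):
    """spans: iterable of (label, text) pairs; returns placeholders in order."""
    seen = {}  # label -> {trimmed surface text -> 1-based index}
    result = []
    for label, text in spans:
        per_label = seen.setdefault(label, {})
        index = per_label.setdefault(text.strip(), len(per_label) + 1)
        result.append(f"<{label.upper()}_{index}>")
    return result


typed = [
    ("private_person", "John Doe"),
    ("private_date", "yesterday"),
    ("private_person", "Jane Smith"),
]
# Assumed redacted-mode behavior: every span carries the single label "redacted".
redacted = [("redacted", text) for _, text in typed]

typed_out = indexed_placeholders(typed)
redacted_out = indexed_placeholders(redacted)
```

With typed labels each label counts from 1 independently, while in the redacted case all spans share one running sequence.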

Verification

Ran against the real checkpoint (~/.opf/privacy_filter) on CPU:

$ python -m unittest tests.test_placeholder_scheme -v
test_api_rejects_invalid_placeholder_scheme ... ok
test_build_redactor_from_args_threads_placeholder_scheme ... ok
test_indexed_scheme_reuses_same_surface_form ... ok
test_indexed_scheme_uses_per_label_namespaces ... ok
test_label_scheme_keeps_current_placeholders ... ok
test_redacted_output_mode_with_indexed_scheme ... ok
test_cli_rejects_invalid_scheme ... ok

Ran 7 tests in 0.335s
OK

End-to-end smoke test matches the expected behavior:

$ opf --placeholder-scheme indexed 'John Doe called John Doe yesterday. John Doe also called Jane Smith.'
<PRIVATE_PERSON_1> called <PRIVATE_PERSON_1> <PRIVATE_DATE_1>. <PRIVATE_PERSON_1> also called <PRIVATE_PERSON_2>.

Demo

Three frames:

  1. Default label scheme on "Alice Jones was born on 1990-01-02. Her email is alice@example.com. Bob called Alice and Alice replied." All 4 person mentions collapse to identical <PRIVATE_PERSON>.
  2. Same input, --placeholder-scheme indexed. "Alice Jones", "Bob", and "Alice" get <PRIVATE_PERSON_1>, <PRIVATE_PERSON_2>, <PRIVATE_PERSON_3> respectively. The two "Alice" mentions share <PRIVATE_PERSON_3> because the surface form matches.
  3. Presidio-style "John Doe called John Doe" case. All three "John Doe" mentions get <PRIVATE_PERSON_1>; "Jane Smith" gets <PRIVATE_PERSON_2>.

Scope notes

Not in this PR:

  • Hash-based placeholders (salt handling + security implications deferred).
  • Configurable placeholder templates (--placeholder-template "<{label}_{index}>").
  • Cross-call stable identity (requires external state).
  • Coreference resolution, fuzzy matching, or case folding.

Orthogonal to issue #6 (batched inference API) and PR #5 (--stdin-mode). Composes cleanly with both and with the confidence-scores PR.

Developed with AI assistance.

…ceholders

Add `--placeholder-scheme {label,indexed}` to `opf redact`. Default stays
`label` (current behavior, zero breakage). In `indexed` mode, same surface
text within a single `redact` call gets the same numbered placeholder, so
"John Doe called John Doe" redacts to "<PRIVATE_PERSON_1> called <PRIVATE_PERSON_1>"
and downstream code can still tell distinct entities apart.

Indices are per-label and 1-based (first-appearance order). `--output-mode
redacted` + indexed scheme namespaces all spans under `redacted`, producing
`<REDACTED_1>`, `<REDACTED_2>`, etc. per unique surface form.

Python API: `OPF(placeholder_scheme="indexed")`.

Mirrors Presidio's `ReAnonymizer` + `entity_mapping` consistency pattern.
schema_version stays at 1 per OUTPUT_SCHEMAS.md's additive-field clause.