feat: --placeholder-scheme indexed for distinguishable per-entity placeholders #9
Open
mvanhorn wants to merge 1 commit into openai:main
Add `--placeholder-scheme {label,indexed}` to `opf redact`. Default stays
`label` (current behavior, zero breakage). In `indexed` mode, same surface
text within a single `redact` call gets the same numbered placeholder, so
"John Doe called John Doe" redacts to "<PRIVATE_PERSON_1> called <PRIVATE_PERSON_1>"
and downstream code can still tell distinct entities apart.
Indices are per-label and 1-based (first-appearance order). `--output-mode
redacted` + indexed scheme namespaces all spans under `redacted`, producing
`<REDACTED_1>`, `<REDACTED_2>`, etc. per unique surface form.
Python API: `OPF(placeholder_scheme="indexed")`.
Mirrors Presidio's `ReAnonymizer` + `entity_mapping` consistency pattern.
schema_version stays at 1 per OUTPUT_SCHEMAS.md's additive-field clause.
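
The indexing rule described in the commit message can be sketched as follows. This is a minimal, standalone illustration, not the PR's code: `Span` is a hypothetical stand-in for `DetectedSpan`, the helper names only loosely follow the PR's descriptions, and `private_email` is an assumed label name used purely to show per-label namespacing.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for the PR's DetectedSpan."""
    label: str
    text: str
    placeholder: str = ""


def indexed_placeholder(label: str, index: int) -> str:
    # Format described in the PR: upper-cased label plus a 1-based index.
    return f"<{label.upper()}_{index}>"


def assign_indexed_placeholders(spans: List[Span]) -> List[Span]:
    # One counter namespace per label; keys are whitespace-trimmed,
    # case-sensitive surface forms, numbered in first-appearance order.
    seen: Dict[str, Dict[str, int]] = {}
    result: List[Span] = []
    for span in spans:
        per_label = seen.setdefault(span.label, {})
        key = span.text.strip()
        index = per_label.setdefault(key, len(per_label) + 1)
        result.append(
            replace(span, placeholder=indexed_placeholder(span.label, index))
        )
    return result


# Label mode input: each label keeps its own numbering.
labelled = assign_indexed_placeholders([
    Span("private_person", "Alice Jones"),
    Span("private_email", "alice@example.com"),  # assumed label name
    Span("private_person", "Bob"),
    Span("private_person", "Alice"),
    Span("private_person", "Alice "),  # trailing space is trimmed before matching
])
print([s.placeholder for s in labelled])
# ['<PRIVATE_PERSON_1>', '<PRIVATE_EMAIL_1>', '<PRIVATE_PERSON_2>',
#  '<PRIVATE_PERSON_3>', '<PRIVATE_PERSON_3>']

# Redacted output mode: every span carries the label "redacted",
# so all surface forms share a single numbering sequence.
redacted = assign_indexed_placeholders([
    Span("redacted", "John Doe"),
    Span("redacted", "Jane Smith"),
    Span("redacted", "John Doe"),
])
print([s.placeholder for s in redacted])
# ['<REDACTED_1>', '<REDACTED_2>', '<REDACTED_1>']
```

Because matching is exact after trimming, "Alice Jones" and "Alice" receive different indices in this sketch even though a human reader would treat them as the same person.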
Summary
Add `--placeholder-scheme {label,indexed}` to `opf redact`. Default stays `label` (current behavior, zero breakage). In `indexed` mode, the same surface text within a single redaction call gets the same numbered placeholder, so downstream NLP can still distinguish entities. `schema_version` stays at 1 (`OUTPUT_SCHEMAS.md`: "Additive fields may appear over time").

Why
The README positions OPF for "high-throughput data sanitization workflows", which typically feed redacted text to downstream NLP or analysts. Today, on a realistic input:
All four person mentions (Alice Jones, Bob, Alice, Alice) collapse to the identical string. Coreference, pronoun resolution, and multi-entity relations are lost at the first step. Presidio solves this with its `ReAnonymizer` operator and an `entity_mapping` dict, so repeated values get consistent anonymized forms:

Changes
- `opf/_common/constants.py`: new `PLACEHOLDER_SCHEMES: Final[tuple[str, str]] = ("label", "indexed")`.
- `opf/_core/runtime.py`: new `_indexed_placeholder(label, index)` (returns `f"<{label.upper()}_{index}>"`) and `_assign_indexed_placeholders(spans)` (walks spans in order, assigns per-label 1-based indices keyed by trimmed surface text, returns new `DetectedSpan` objects with rewritten placeholders).
- `opf/_api.py`: `OPF.__init__` accepts `placeholder_scheme: Literal["label","indexed"] = "label"` with `PLACEHOLDER_SCHEMES` validation. `OPF.redact` applies `_assign_indexed_placeholders` after `predict_text` and before `_redact_text`/`build_detection_summary` when the scheme is indexed.
- `opf/_cli/common.py`, `opf/_cli/args.py`, `opf/_cli/render.py`: new `--placeholder-scheme {label,indexed}` flag, plumbed through `build_redactor_from_args`.
- `OUTPUT_SCHEMAS.md`: document the indexed value format `<LABEL_N>` and the redacted-mode interaction (`<REDACTED_N>`).
- `tests/test_placeholder_scheme.py`: 7 tests covering default behavior (label scheme unchanged), indexed behavior (same surface reuses index, distinct surfaces get different indices, per-label namespacing, `redacted` output mode + indexed), Python API parity, `build_redactor_from_args` threading, and CLI validation rejecting unknown schemes.

Surface-form canonicalization is `span.text.strip()`: an exact, case-sensitive match after the existing whitespace trim. Fuzzy matching, coreference resolution, and case folding are out of scope (those are ML problems, not output-formatting problems).

`--output-mode redacted` + indexed scheme: spans come back labeled `"redacted"`, so the indexed namespace collapses to a single `<REDACTED_N>` sequence per unique surface. Documented and tested.

Verification
Ran against the real checkpoint (`~/.opf/privacy_filter`) on CPU:
End-to-end smoke test matches the expected behavior:
Demo
Three frames:
1. `label` scheme on "Alice Jones was born on 1990-01-02. Her email is alice@example.com. Bob called Alice and Alice replied." All 4 person mentions collapse to identical `<PRIVATE_PERSON>`.
2. `--placeholder-scheme indexed`. "Alice Jones", "Bob", and "Alice" get `<PRIVATE_PERSON_1>`, `<PRIVATE_PERSON_2>`, `<PRIVATE_PERSON_3>` respectively. The two "Alice" mentions share `<PRIVATE_PERSON_3>` because the surface form matches.
3. … `<PRIVATE_PERSON_1>`; "Jane Smith" gets `<PRIVATE_PERSON_2>`.

Scope notes
Not in this PR:
- … `--placeholder-template "<{label}_{index}>"`).

Orthogonal to issue #6 (batched inference API) and PR #5 (`--stdin-mode`). Composes cleanly with both and with the confidence-scores PR.

Developed with AI assistance.