fix(enrich): propagate task to enricher and enforce label whitelist #31
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
✅ DCO Check Passed
Thanks @ana-daniele, all your commits are properly signed off. 🎉
ceberam
left a comment
@ana-daniele I don't have much to add on this PR, since it looks like it addresses the bug about the entity constraints in the YAML file.
I was wondering if we could add an integration test verifying the constraint enforcement.
Also, please rebase onto main, resolve any conflicts, and make sure the linting errors are addressed.
Two related bugs prevented YAML-declared entity-type constraints
from reaching NuExtract:
1. `orchestrator._ensure_enriched` hard-coded `task=""` when the
YAML used the `operations:` shortcut, dropping `task.query` on
the floor. With no task, `_infer_entity_targets` returned None,
so the prompt sent to the model never included the allowed
labels. Now `_ensure_enriched` accepts `task` and both
`_run_enrich` and `_run_rag` pass `task.query` through.
2. `enricher._generate_entities` only applied `entity_targets` as
a soft hint, and had no post-filter on returned labels. When the
model invented types (Project, Software, Algorithm, IP-Address,
…) they flowed through into the rendered output. Now the prompt
carries a HARD CONSTRAINT block listing the allowed labels, and
the parser drops any mention whose case-folded label is not in
the allowed set (with an INFO log line summarising drops per
call).
Verified on df3 with the `2026-05-22a_enrich_nuextract3_postpatch`
run that the enricher-only patch was a no-op without (1);
end-to-end re-run with both fixes is queued.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
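For readers following the second fix, here is a minimal sketch of a case-folded label post-filter with one INFO summary line per call. It is not the code in this PR: `filter_mentions_by_label` is a hypothetical name, mentions are assumed to be plain dicts with a `label` key, and an empty allowed set is assumed to mean "no filtering".

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def filter_mentions_by_label(mentions, allowed_labels):
    """Keep only mentions whose case-folded label is in the allowed set.

    Hypothetical helper: mentions are assumed to be dicts with a "label" key.
    """
    if not allowed_labels:
        # Assumption: with no whitelist declared, nothing is filtered.
        return list(mentions)

    allowed = {label.casefold() for label in allowed_labels}
    kept = []
    dropped = Counter()
    for mention in mentions:
        label = str(mention.get("label") or "")
        if label.casefold() in allowed:
            kept.append(mention)
        else:
            dropped[label] += 1

    if dropped:
        # One summary line per call rather than one line per dropped mention.
        logger.info(
            "Dropped %d mention(s) with disallowed labels: %s",
            sum(dropped.values()),
            dict(dropped),
        )
    return kept
```

Called with `allowed_labels=["MODEL", "DATASET", "KPI"]`, a mention labelled `model` is kept and one labelled `Project` is dropped and counted in the summary.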
`_validate_entity_target_spec` (and the post-validation parse on
line 688) used `find_json_dicts`, which only matches JSON wrapped
in a fenced ```json ...``` markdown block. NuExtract3 ignores that
part of the prompt and returns a bare JSON object instead:
well-formed, just unfenced. Validation then failed and
`_infer_entity_targets` returned None, so `entity_targets` reached
`_generate_entities` as None and the HARD CONSTRAINT clause from
the previous commit was silently skipped.
Introduce `_parse_spec_dict` inside `_infer_entity_targets`: try
`find_json_dicts` first (preserves existing behaviour for models
that do use the fence), and fall back to
`json.loads(content.strip())` when no fence is found. Both the
validation hook and the final parse use the same helper.
Confirmed end-to-end on df3 (`2026-06-01b_..._postpatch_v3`): the
brief now parses, `entity_targets["labels"] = ["MODEL", "DATASET",
"KPI"]`, and every per-chunk LLM REQUEST carries the HARD
CONSTRAINT block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
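The fence-first, bare-JSON-fallback idea can be sketched as follows. This is an illustration under assumptions, not the helper from this commit: `find_fenced_json_dicts` is a hypothetical stand-in for `find_json_dicts` (whose real signature is not shown here), and `parse_spec_dict` only mirrors the behaviour described above.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself be fenced.
FENCE = "`" * 3
_FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def find_fenced_json_dicts(content):
    """Hypothetical stand-in for `find_json_dicts`: dicts inside json fences."""
    found = []
    for match in _FENCED_JSON.finditer(content):
        try:
            value = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            found.append(value)
    return found


def parse_spec_dict(content):
    """Return the spec dict from fenced or bare JSON, or None if neither parses."""
    fenced = find_fenced_json_dicts(content)
    if fenced:
        # Models that do honour the fence keep the existing behaviour.
        return fenced[0]
    try:
        value = json.loads(content.strip())
    except json.JSONDecodeError:
        return None
    # Only a JSON object counts as a spec; lists and scalars are rejected.
    return value if isinstance(value, dict) else None
```

Using one helper for both the validation hook and the final parse keeps the two from disagreeing about what counts as a valid response.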
…ling-core fields
Three issues surfaced once the prompt actually carried the
HARD CONSTRAINT block and NuExtract started responding with
the expected schema.
1. Session bleed. `_detect_key_entities` reused one LM Studio
session for both `_infer_entity_targets` (which primes the
model with the spec schema `{"generic":..., "labels":...}`)
and the per-chunk extraction calls. NuExtract carried the
prior turn's schema forward and answered every chunk with
the spec dict instead of an entity array. Now the leaf
stage opens its own session via `_create_extraction_session`.
2. Entity parser too strict. The response parser required a
```json ...``` fenced block and assumed the top-level payload
was a list. NuExtract returns bare JSON, usually as
`{"entities": [...]}`. Fall back to `result.strip()` when no
fence is present and unwrap a single `entities` key into the
list before iterating; coerce other shapes to an empty list
so downstream code keeps its invariants.
3. EntityMention field names. The mention constructor used the
old docling-core argument names (`original=`, `span=`). The
current docling-core EntityMention expects `orig=` and
`charspan=`; the mismatch raised a pydantic validation error
on every successful response, which the surrounding `try`
swallowed as "Failed to parse entities JSON".
End-to-end on df3: `2026-06-01e_..._postpatch_v6` is the first
run where parsed entities are non-empty, e.g. `[EntityMention(
text='Nougat', label='MODEL', charspan=(0, 6)), ...]`, with
zero pydantic warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
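Item 2 above (accept bare JSON, unwrap a single `entities` key, coerce anything else to an empty list) can be sketched like this. It is a hedged illustration rather than the parser in this commit; `extract_entity_items` and the fence regex are hypothetical.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself be fenced.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def extract_entity_items(result):
    """Normalise a model response to a list of entity items.

    Hypothetical helper: always returns a list so callers can iterate safely.
    """
    match = _FENCED.search(result)
    # Prefer the fenced payload when present, else treat the text as bare JSON.
    raw = match.group(1) if match else result.strip()
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []

    # Unwrap {"entities": [...]} only when it is the single top-level key.
    if isinstance(payload, dict) and set(payload) == {"entities"}:
        payload = payload["entities"]

    # Any other shape (for example an echoed spec dict) becomes an empty list.
    return payload if isinstance(payload, list) else []
```

With this shape, an echoed spec dict such as `{"generic": ..., "labels": ...}` yields no entities instead of raising during iteration.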
b79db8e to
7fe67be
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Add defensive handling in `_validate_entities()` to unwrap responses, keeping validation consistent with the main parsing logic.
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
ceberam
left a comment
@ana-daniele I have addressed my observation and added regression tests for the changes. We're good to go 🚀
Summary
Fixes a six-layered bug chain that caused YAML-declared entity-type constraints to be silently ignored in `mode: enrich` runs, producing HTML output littered with arbitrary model-invented label types (`Project`, `Software`, `Algorithm`, `IP-Address`, `Person`, …) and weak recall on the labels the YAML actually asks for.
Bug chain
Each layer hid the next; nothing was visible end-to-end until all six were fixed.
1. `orchestrator._ensure_enriched` hard-coded `task=""` when the YAML used the `operations:` shortcut (the common path for `mode: enrich`). The `query:` block in the YAML was never seen by the enricher.
2. `enricher._generate_entities` prompt added the YAML labels only as a soft hint when `entity_targets` was present.
3. `enricher._generate_entities` post-filter was gated on `entity_targets["labels"]` being populated, so it filtered nothing whenever the rewrite step failed.
4. `_infer_entity_targets` rewrite parser used `find_json_dicts`, which only matches JSON inside fenced ```json ...``` blocks. NuExtract3 returns bare JSON, so the spec was discarded and `entity_targets` came back `None`.
5. `_detect_key_entities` session reuse: the same LM Studio session served the rewrite call (which primes the model with a `{"generic": …, "labels": …}` schema) and every per-chunk entity call. NuExtract carried the prior turn forward and echoed the spec dict back as the "entities" response.
6. Entity parser and `EntityMention` constructor: the parser required json fences and assumed the top-level payload was a list (NuExtract returns `{"entities": [...]}`), and the constructor used pre-rename field names (`original=`, `span=`) for a docling-core API that now expects `orig=` and `charspan=`. Both caused every successful response to be swallowed as `Failed to parse entities JSON`.
Changes
- `orchestrator.py`: `_ensure_enriched` now accepts `task`; `_run_enrich` and `_run_rag` pass `task.query` through.
- `enricher.py`
  - `_generate_entities` prompt: hard `HARD CONSTRAINT — Use ONLY these label values: […]` block when `allowed_labels` is non-empty, replacing the soft hint.
  - `_generate_entities` post-filter: drops any mention whose case-folded label is not in the allowed set, with one INFO line per call summarising the drops.
  - `_infer_entity_targets`: shared `_parse_spec_dict` helper used by both `_validate_entity_target_spec` and the post-validation parse; it tries `find_json_dicts` first and falls back to `json.loads(content.strip())` for unfenced JSON.
  - `_detect_key_entities`: opens a fresh `_create_extraction_session()` for the leaf-entity stage so the rewrite call can't bleed schema context into per-chunk extraction.
  - Entity parser: unwraps `{"entities": [...]}` to a list before iterating.
  - `_make_entity_mention`: passes `orig=` / `charspan=` to `EntityMention` instead of the renamed `original=` / `span=`.
- Tests: `test_enricher.py`, `test_orchestrator.py`, and `tests/conftest.py` for shared fixtures with automatic discovery.
Validation
End-to-end run on `2308.13418v1` (`~/model-eval/runs/2026-06-01e_enrich_nuextract3_postpatch_v6/`):
`label=None` responses)
Tested labels seen in the HTML: exactly `MODEL` / `DATASET` / `KPI`, no other strings.
Coworker-review test: (1) HTML shows only MODEL/DATASET/KPI ✓; (2) Section 4 yields dataset entities ✓.
🤖 Generated with Claude Code