Skip to content

fix(enrich): propagate task to enricher and enforce label whitelist#31

Merged
ceberam merged 6 commits into
docling-project:mainfrom
ana-daniele:fix/enrich-task-label-constraints
Jun 5, 2026
Merged

fix(enrich): propagate task to enricher and enforce label whitelist#31
ceberam merged 6 commits into
docling-project:mainfrom
ana-daniele:fix/enrich-task-label-constraints

Conversation

@ana-daniele

@ana-daniele ana-daniele commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a six-layered bug-chain that caused YAML-declared entity-type constraints to be silently ignored in mode: enrich runs, producing HTML output littered with arbitrary model-invented label types (Project, Software, Algorithm, IP-Address, Person, …) and weak recall on the labels the YAML actually asks for.

Bug chain

Each layer hid the next; nothing was visible end-to-end until all six were fixed.

  1. orchestrator._ensure_enriched hard-coded task="" when the YAML used the operations: shortcut (the common path for mode: enrich). The query: block in the YAML was never seen by the enricher.
  2. enricher._generate_entities prompt added the YAML labels only as a soft hint when entity_targets was present.
  3. enricher._generate_entities post-filter was gated on entity_targets["labels"] being populated, so it filtered nothing whenever the rewrite step failed.
  4. _infer_entity_targets rewrite parser used find_json_dicts, which only matches JSON inside json ... fences. NuExtract3 returns bare JSON, so the spec was discarded and entity_targets came back None.
  5. _detect_key_entities session reuse — the same LM Studio session served the rewrite call (which primes the model with a {"generic": …, "labels": …} schema) and every per-chunk entity call. NuExtract carried the prior turn forward and echoed the spec dict back as the "entities" response.
  6. Entity response parser + EntityMention constructor — the parser required ```json fences, assumed the top-level payload was a list (NuExtract returns {"entities": [...]}), and the constructor used pre-rename field names (`original=`, `span=`) for a docling-core API that now expects `orig=` and `charspan=`. Both swallowed every successful response as `Failed to parse entities JSON`.

Changes

  • orchestrator.py_ensure_enriched now accepts task; _run_enrich and _run_rag pass task.query through.
  • enricher.py
    • _generate_entities prompt: hard HARD CONSTRAINT — Use ONLY these label values: […] block when allowed_labels is non-empty, replacing the soft hint.
    • _generate_entities post-filter: drops any mention whose case-folded label is not in the allowed set, with one INFO line per call summarising the drops.
    • _infer_entity_targets: shared _parse_spec_dict helper used by both _validate_entity_target_spec and the post-validation parse — tries find_json_dicts first, falls back to json.loads(content.strip()) for unfenced JSON.
    • _detect_key_entities: opens a fresh _create_extraction_session() for the leaf-entity stage so the rewrite call can't bleed schema context into per-chunk extraction.
    • Entity response parser: tolerates missing ```json fences, unwraps {"entities": [...]} to a list before iterating.
    • _make_entity_mention: passes orig= / charspan= to EntityMention instead of the renamed original= / span=.
  • Regression Tests:
    • Entity Extraction Tests (test_enricher.py)
    • Task Propagation Tests (test_orchestrator.py)
    • Created tests/conftest.py for shared fixtures with automatic discovery

Validation

End-to-end run on 2308.13418v1 (~/model-eval/runs/2026-06-01e_enrich_nuextract3_postpatch_v6/):

Pre-fix (May 22 run) Post-fix (this PR)
Total entity mentions in JSON 767 757
Distinct labels ~150 (invented) 3 (MODEL, DATASET, KPI)
DATASET mentions dropped by section 108 (24 distinct, incl. ImageNet, Common Crawl, arXiv, PubLayNet, DocBank)
Post-filter drops 0 (no-op) 80 (NuExtract label=None responses)
Pydantic validation warnings many 0
Run wall-clock on single paper 33 min 9.4 min

Tested labels seen in the HTML: exactly MODEL / DATASET / KPI, no other strings. Coworker-review test (1) HTML shows only MODEL/DATASET/KPI ✓; (2) Section 4 yields dataset entities ✓.

🤖 Generated with Claude Code

@mergify

mergify Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 1

@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

DCO Check Passed

Thanks @ana-daniele, all your commits are properly signed off. 🎉

@ana-daniele ana-daniele marked this pull request as ready for review June 1, 2026 06:29

@ceberam ceberam left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ana-daniele I don't have much to add on this PR, since it looks like it addresses the bug about the entity constraints in the YAML file.

I was wondering if we could add an integration tests verifying the constraint enforcement.

Also, please rebase to main, resolve eventual conflicts, and ensure the linting errors are addressed.

Comment thread docling_agent/agent/enricher.py
@ceberam ceberam added the bug Something isn't working label Jun 4, 2026
ana-daniele and others added 3 commits June 5, 2026 11:21
Two related bugs prevented YAML-declared entity-type constraints
from reaching NuExtract:

1. `orchestrator._ensure_enriched` hard-coded `task=""` when the
   YAML used the `operations:` shortcut, dropping `task.query` on
   the floor. With no task, `_infer_entity_targets` returned None,
   so the prompt sent to the model never included the allowed
   labels. Now `_ensure_enriched` accepts `task` and both
   `_run_enrich` and `_run_rag` pass `task.query` through.

2. `enricher._generate_entities` only applied `entity_targets`
   as a soft hint, and had no post-filter on returned labels.
   When the model invented types (Project, Software, Algorithm,
   IP-Address, …) they flowed through into the rendered output.
   Now the prompt carries a HARD CONSTRAINT block listing the
   allowed labels, and the parser drops any mention whose
   case-folded label is not in the allowed set (with an INFO
   log line summarising drops per call).

Verified on df3 with the `2026-05-22a_enrich_nuextract3_postpatch`
run that the enricher-only patch was a no-op without (1); end-to-end
re-run with both fixes is queued.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
`_validate_entity_target_spec` (and the post-validation parse on
line 688) used `find_json_dicts`, which only matches JSON wrapped
in a ```json ...``` markdown block. NuExtract3 ignores that part
of the prompt and returns a bare JSON object instead — well-formed,
just unfenced. Validation then failed and `_infer_entity_targets`
returned None, so `entity_targets` reached `_generate_entities`
as None and the HARD CONSTRAINT clause from the previous commit
was silently skipped.

Introduce `_parse_spec_dict` inside `_infer_entity_targets`: try
`find_json_dicts` first (preserves existing behaviour for models
that do use the fence), and fall back to `json.loads(content.strip())`
when no fence is found. Both the validation hook and the final
parse use the same helper.

Confirmed end-to-end on df3 (`2026-06-01b_..._postpatch_v3`): the
brief now parses, `entity_targets["labels"] = ["MODEL", "DATASET",
"KPI"]`, and every per-chunk LLM REQUEST carries the HARD CONSTRAINT
block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
…ling-core fields

Three issues surfaced once the prompt actually carried the
HARD CONSTRAINT block and NuExtract started responding with
the expected schema.

1. Session bleed. `_detect_key_entities` reused one LM Studio
   session for both `_infer_entity_targets` (which primes the
   model with the spec schema `{"generic":..., "labels":...}`)
   and the per-chunk extraction calls. NuExtract carried the
   prior turn's schema forward and answered every chunk with
   the spec dict instead of an entity array. Now the leaf
   stage opens its own session via `_create_extraction_session`.

2. Entity parser too strict. The response parser required a
   ```json ...``` fenced block and assumed the top-level payload
   was a list. NuExtract returns bare JSON, usually as
   `{"entities": [...]}`. Fall back to `result.strip()` when no
   fence is present and unwrap a single `entities` key into the
   list before iterating; coerce other shapes to an empty list
   so downstream code keeps its invariants.

3. EntityMention field names. The mention constructor used the
   old docling-core argument names (`original=`, `span=`). The
   current docling-core EntityMention expects `orig=` and
   `charspan=`; the mismatch raised a pydantic validation error
   on every successful response, which the surrounding `try`
   swallowed as "Failed to parse entities JSON".

End-to-end on df3: `2026-06-01e_..._postpatch_v6` is the first
run where parsed entities are non-empty, e.g. `[EntityMention(
text='Nougat', label='MODEL', charspan=(0, 6)), ...]`, with
zero pydantic warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
@ceberam ceberam force-pushed the fix/enrich-task-label-constraints branch from b79db8e to 7fe67be Compare June 5, 2026 09:25
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_agent/agent/enricher.py 69.44% 11 Missing ⚠️
docling_agent/agent/orchestrator.py 33.33% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

ceberam added 2 commits June 5, 2026 11:58
Add defensive handling in '_validate_entities()' to unwrap responses ensuring
validation consistency with the main parsing logic.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

@ceberam ceberam left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ana-daniele I have addressed my observation and added regression tests for the changes. We're good to go 🚀

@ceberam ceberam merged commit 33cd343 into docling-project:main Jun 5, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants