fix(enrich): propagate task to enricher and enforce label whitelist by ana-daniele · Pull Request #31 · docling-project/docling-agent

ana-daniele · 2026-05-27T10:50:29Z

Summary

Fixes a six-layered bug-chain that caused YAML-declared entity-type constraints to be silently ignored in mode: enrich runs, producing HTML output littered with arbitrary model-invented label types (Project, Software, Algorithm, IP-Address, Person, …) and weak recall on the labels the YAML actually asks for.

Bug chain

Each layer hid the next; nothing was visible end-to-end until all six were fixed.

orchestrator._ensure_enriched hard-coded task="" when the YAML used the operations: shortcut (the common path for mode: enrich). The query: block in the YAML was never seen by the enricher.
enricher._generate_entities prompt added the YAML labels only as a soft hint when entity_targets was present.
enricher._generate_entities post-filter was gated on entity_targets["labels"] being populated, so it filtered nothing whenever the rewrite step failed.
_infer_entity_targets rewrite parser used find_json_dicts, which only matches JSON inside json ... fences. NuExtract3 returns bare JSON, so the spec was discarded and entity_targets came back None.
_detect_key_entities session reuse — the same LM Studio session served the rewrite call (which primes the model with a {"generic": …, "labels": …} schema) and every per-chunk entity call. NuExtract carried the prior turn forward and echoed the spec dict back as the "entities" response.
Entity response parser + EntityMention constructor — the parser required ```json fences, assumed the top-level payload was a list (NuExtract returns {"entities": [...]}), and the constructor used pre-rename field names (`original=`, `span=`) for a docling-core API that now expects `orig=` and `charspan=`. Both swallowed every successful response as `Failed to parse entities JSON`.

Changes

orchestrator.py — _ensure_enriched now accepts task; _run_enrich and _run_rag pass task.query through.
enricher.py
- _generate_entities prompt: hard HARD CONSTRAINT — Use ONLY these label values: […] block when allowed_labels is non-empty, replacing the soft hint.
- _generate_entities post-filter: drops any mention whose case-folded label is not in the allowed set, with one INFO line per call summarising the drops.
- _infer_entity_targets: shared _parse_spec_dict helper used by both _validate_entity_target_spec and the post-validation parse — tries find_json_dicts first, falls back to json.loads(content.strip()) for unfenced JSON.
- _detect_key_entities: opens a fresh _create_extraction_session() for the leaf-entity stage so the rewrite call can't bleed schema context into per-chunk extraction.
- Entity response parser: tolerates missing ```json fences, unwraps {"entities": [...]} to a list before iterating.
- _make_entity_mention: passes orig= / charspan= to EntityMention instead of the renamed original= / span=.
Regression Tests:
- Entity Extraction Tests (test_enricher.py)
- Task Propagation Tests (test_orchestrator.py)
- Created tests/conftest.py for shared fixtures with automatic discovery

Validation

End-to-end run on 2308.13418v1 (~/model-eval/runs/2026-06-01e_enrich_nuextract3_postpatch_v6/):

	Pre-fix (May 22 run)	Post-fix (this PR)
Total entity mentions in JSON	767	757
Distinct labels	~150 (invented)	3 (MODEL, DATASET, KPI)
DATASET mentions	dropped by section	108 (24 distinct, incl. ImageNet, Common Crawl, arXiv, PubLayNet, DocBank)
Post-filter drops	0 (no-op)	80 (NuExtract `label=None` responses)
Pydantic validation warnings	many	0
Run wall-clock on single paper	33 min	9.4 min

Tested labels seen in the HTML: exactly MODEL / DATASET / KPI, no other strings. Coworker-review test (1) HTML shows only MODEL/DATASET/KPI ✓; (2) Section 4 yields dataset entities ✓.

🤖 Generated with Claude Code

mergify · 2026-05-27T10:51:05Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

#approved-reviews-by >= 1

github-actions · 2026-05-27T10:58:38Z

✅ DCO Check Passed

Thanks @ana-daniele, all your commits are properly signed off. 🎉

ceberam

@ana-daniele I don't have much to add on this PR, since it looks like it addresses the bug about the entity constraints in the YAML file.

I was wondering if we could add an integration tests verifying the constraint enforcement.

Also, please rebase to main, resolve eventual conflicts, and ensure the linting errors are addressed.

Two related bugs prevented YAML-declared entity-type constraints from reaching NuExtract: 1. `orchestrator._ensure_enriched` hard-coded `task=""` when the YAML used the `operations:` shortcut, dropping `task.query` on the floor. With no task, `_infer_entity_targets` returned None, so the prompt sent to the model never included the allowed labels. Now `_ensure_enriched` accepts `task` and both `_run_enrich` and `_run_rag` pass `task.query` through. 2. `enricher._generate_entities` only applied `entity_targets` as a soft hint, and had no post-filter on returned labels. When the model invented types (Project, Software, Algorithm, IP-Address, …) they flowed through into the rendered output. Now the prompt carries a HARD CONSTRAINT block listing the allowed labels, and the parser drops any mention whose case-folded label is not in the allowed set (with an INFO log line summarising drops per call). Verified on df3 with the `2026-05-22a_enrich_nuextract3_postpatch` run that the enricher-only patch was a no-op without (1); end-to-end re-run with both fixes is queued. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ana Daniele <ana.daniele@ibm.com>

`_validate_entity_target_spec` (and the post-validation parse on line 688) used `find_json_dicts`, which only matches JSON wrapped in a ```json ...``` markdown block. NuExtract3 ignores that part of the prompt and returns a bare JSON object instead — well-formed, just unfenced. Validation then failed and `_infer_entity_targets` returned None, so `entity_targets` reached `_generate_entities` as None and the HARD CONSTRAINT clause from the previous commit was silently skipped. Introduce `_parse_spec_dict` inside `_infer_entity_targets`: try `find_json_dicts` first (preserves existing behaviour for models that do use the fence), and fall back to `json.loads(content.strip())` when no fence is found. Both the validation hook and the final parse use the same helper. Confirmed end-to-end on df3 (`2026-06-01b_..._postpatch_v3`): the brief now parses, `entity_targets["labels"] = ["MODEL", "DATASET", "KPI"]`, and every per-chunk LLM REQUEST carries the HARD CONSTRAINT block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ana Daniele <ana.daniele@ibm.com>

…ling-core fields Three issues surfaced once the prompt actually carried the HARD CONSTRAINT block and NuExtract started responding with the expected schema. 1. Session bleed. `_detect_key_entities` reused one LM Studio session for both `_infer_entity_targets` (which primes the model with the spec schema `{"generic":..., "labels":...}`) and the per-chunk extraction calls. NuExtract carried the prior turn's schema forward and answered every chunk with the spec dict instead of an entity array. Now the leaf stage opens its own session via `_create_extraction_session`. 2. Entity parser too strict. The response parser required a ```json ...``` fenced block and assumed the top-level payload was a list. NuExtract returns bare JSON, usually as `{"entities": [...]}`. Fall back to `result.strip()` when no fence is present and unwrap a single `entities` key into the list before iterating; coerce other shapes to an empty list so downstream code keeps its invariants. 3. EntityMention field names. The mention constructor used the old docling-core argument names (`original=`, `span=`). The current docling-core EntityMention expects `orig=` and `charspan=`; the mismatch raised a pydantic validation error on every successful response, which the surrounding `try` swallowed as "Failed to parse entities JSON". End-to-end on df3: `2026-06-01e_..._postpatch_v6` is the first run where parsed entities are non-empty, e.g. `[EntityMention( text='Nougat', label='MODEL', charspan=(0, 6)), ...]`, with zero pydantic warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ana Daniele <ana.daniele@ibm.com>

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

codecov · 2026-06-05T09:31:57Z

Codecov Report

❌ Patch coverage is 66.66667% with 13 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling_agent/agent/enricher.py	69.44%	11 Missing ⚠️
docling_agent/agent/orchestrator.py	33.33%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Add defensive handling in '_validate_entities()' to unwrap responses ensuring validation consistency with the main parsing logic. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

ceberam

@ana-daniele I have addressed my observation and added regression tests for the changes. We're good to go 🚀

ana-daniele marked this pull request as ready for review June 1, 2026 06:29

PeterStaar-IBM requested review from PeterStaar-IBM and ceberam June 4, 2026 12:20

ceberam requested changes Jun 4, 2026

View reviewed changes

Comment thread docling_agent/agent/enricher.py

ceberam added the bug Something isn't working label Jun 4, 2026

ana-daniele and others added 3 commits June 5, 2026 11:21

ceberam force-pushed the fix/enrich-task-label-constraints branch from b79db8e to 7fe67be Compare June 5, 2026 09:25

style: fix formatting

cbb12a4

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

ceberam added 2 commits June 5, 2026 11:58

chore(enricher): add defensive handling in '_validate_entities'

1574ee1

Add defensive handling in '_validate_entities()' to unwrap responses ensuring validation consistency with the main parsing logic. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

test: add regression tests for entity constraints

d4350a2

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

ceberam approved these changes Jun 5, 2026

View reviewed changes

ceberam merged commit 33cd343 into docling-project:main Jun 5, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(enrich): propagate task to enricher and enforce label whitelist#31

fix(enrich): propagate task to enricher and enforce label whitelist#31
ceberam merged 6 commits into
docling-project:mainfrom
ana-daniele:fix/enrich-task-label-constraints

ana-daniele commented May 27, 2026 •

edited by ceberam

Loading

Uh oh!

mergify Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

ceberam left a comment

Uh oh!

Uh oh!

codecov Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

ceberam left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ana-daniele commented May 27, 2026 • edited by ceberam Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug chain

Changes

Validation

Uh oh!

mergify Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Protections

🟢 Enforce conventional commit

🟢 Require two reviewer for test updates

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ceberam left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ceberam left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ana-daniele commented May 27, 2026 •

edited by ceberam

Loading

mergify Bot commented May 27, 2026 •

edited

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

codecov Bot commented Jun 5, 2026 •

edited

Loading