refactor(enricher): use docling-core keywords meta field and remove keyphrases style #38

Merged
ceberam merged 2 commits into main from dev/keywords-in-meta on Jun 10, 2026

ceberam commented Jun 8, 2026

Member

Breaking Changes

  • Removed style parameter from _generate_summary(), _summarize_pages(), and _generate_document_level_summary() methods
  • Summaries now always generate sentences (no more "keyphrases" style option)

Changes

Keywords Integration with docling-core

  • Migrated keywords to standard meta field: Keywords generated by DoclingEnrichingAgent are now stored in BaseMeta.keywords (using KeywordsMetaField) instead of the custom docling_agent__keywords field

Improved Keyword Generation

  • Changed format: Keywords are now returned as a semicolon-separated list instead of JSON, which is simpler for the LLM to produce
  • Enhanced prompt: Reworked from the former keyphrases prompt to focus on "concepts and facts" useful for "search and retrieval"
  • Validation: Updated to check that the semicolon-separated output contains 3-7 keywords
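
For illustration only, the semicolon format and the 3-7 item check described above could be sketched as follows. The function name `parse_keywords` and its parameters are hypothetical and are not taken from this PR; the actual validation in the enricher may differ.

```python
def parse_keywords(raw: str, min_items: int = 3, max_items: int = 7) -> list[str]:
    """Split a semicolon-separated LLM reply into keywords and check the count.

    Hypothetical helper: the name and signature are assumptions, not the PR's code.
    """
    # Trim whitespace and drop empty fragments, e.g. from a trailing ';'.
    keywords = [part.strip() for part in raw.split(";") if part.strip()]
    if not min_items <= len(keywords) <= max_items:
        raise ValueError(
            f"expected {min_items}-{max_items} keywords, got {len(keywords)}"
        )
    return keywords
```

A reply such as `"document parsing; metadata; keyword extraction"` yields three keywords, while a reply with only two items raises `ValueError`, so the caller can decide whether to retry the request.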

Summary Generation Simplification

  • Removed "keyphrases" style: The temporary keyphrases style has been removed now that keywords are properly supported via the keywords meta field
  • Simplified API: _generate_summary() now only generates sentence-based summaries
  • Updated callers: Removed style parameter from all calling methods

Tests

  • Added keyword tests:
    • test_generate_keywords(): Tests keyword generation with semicolon format
    • test_find_search_keywords(): Tests full keyword extraction workflow
  • Updated existing tests: Removed references to style parameter in summary tests

Motivation

This refactoring aligns keyword handling with how summaries and entities are managed in docling-core, providing a consistent metadata structure. The removal of the "keyphrases" style from summaries eliminates redundancy now that proper keyword support exists.

Testing

All 17 tests pass successfully ✅

ceberam added 2 commits June 8, 2026 11:11
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Leverage the new 'keywords' meta field in DoclingDocument to
populate keywords from 'generate_keywords' in the enricher agent.
Drop the 'keyphrases' style in 'generate_summary' since now
keywords are supported in DoclingDocument meta field.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
ceberam requested a review from PeterStaar-IBM June 8, 2026 16:49

github-actions Bot commented Jun 8, 2026

Contributor

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

mergify Bot commented Jun 8, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
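
As a quick check, the title condition above can be reproduced locally. This is a minimal sketch that assumes the `~=` operator behaves like a regular-expression search; the helper name `is_conventional` is hypothetical.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None


pr_title = (
    "refactor(enricher): use docling-core keywords meta field "
    "and remove keyphrases style"
)
print(is_conventional(pr_title))
```

The title of this pull request starts with `refactor(enricher):`, which is why the rule reports success.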

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 1

PeterStaar-IBM left a comment

Member

lgtm!

ceberam merged commit f7f9065 into main Jun 10, 2026
11 checks passed
ceberam deleted the dev/keywords-in-meta branch June 10, 2026 07:32