(retriever) Add .store() for persisting extracted images and text #1785

Merged
jperez999 merged 3 commits into NVIDIA:main from jioffe502:jioffe/store-extracted-images-v2
Apr 7, 2026

@jioffe502
Collaborator

Summary

Adds a .store() pipeline stage that persists extracted images and text to local or cloud storage (any fsspec-compatible URI). It is wired as a first-class ordered graph operator, following the refactor-step architecture.

  • StoreOperator — graph operator (AbstractOperator + CPUOperator) that delegates to store_extracted_images() in io/image_store.py
  • StoreParams — Pydantic config model controlling what gets stored (page images, tables, charts, infographics, text) and where
  • --store-images-uri / --store-text CLI flags on graph_pipeline.py
  • Placement: after OCR/caption, before content-reshape/embed (user controls via fluent API call order)
  • strip_base64=True by default — frees image payloads from the DataFrame after writing to storage, reducing downstream memory pressure through embed
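
As a rough illustration of the write-then-strip behaviour described above, the following is a minimal sketch and not the code in this PR. `SketchStoreParams`, `store_rows`, and the row keys (`image_b64`, `text`, `stored_image_uri`) are hypothetical names, a plain dataclass stands in for the Pydantic model, rows are plain dicts rather than a DataFrame, and `pathlib.Path` is used so the sketch only handles local directories.

```python
import base64
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class SketchStoreParams:
    """Hypothetical stand-in for StoreParams (the real model is Pydantic)."""

    storage_uri: str           # assumed field name; local directory only in this sketch
    store_text: bool = False   # text storage is opt-in, as described in the PR
    strip_base64: bool = True  # default described in the PR


def store_rows(rows: List[Dict[str, Any]], params: SketchStoreParams) -> List[Dict[str, Any]]:
    """Write each row's image (and optionally its text), then drop the payload."""
    root = Path(params.storage_uri)
    root.mkdir(parents=True, exist_ok=True)
    out: List[Dict[str, Any]] = []
    for i, row in enumerate(rows):
        row = dict(row)  # copy so the caller's row is not mutated
        b64 = row.get("image_b64")
        if b64:
            # The sketch writes the decoded bytes as-is; it does no re-encoding.
            target = root / f"page_{i}.jpeg"
            target.write_bytes(base64.b64decode(b64))
            row["stored_image_uri"] = str(target)
            if params.strip_base64:
                row["image_b64"] = None  # free the payload for downstream stages
        if params.store_text and row.get("text"):
            (root / f"page_{i}.txt").write_text(row["text"], encoding="utf-8")
        out.append(row)
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        payload = base64.b64encode(b"fake image bytes").decode("ascii")
        result = store_rows([{"image_b64": payload, "text": "hello"}],
                            SketchStoreParams(storage_uri=tmp))
        print(result[0]["stored_image_uri"], result[0]["image_b64"])
```

Stripping after the write means later stages carry only a reference to the stored file, which is the memory saving the summary describes.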

Key design decisions

  • Store is an ordered stage via _append_ordered_transform_stages(), not a special-cased insertion. Respects the stage_order recorded by the fluent API.
  • Page images written as JPEG (configurable), cropped content (tables/charts/infographics) as PNG for lossless quality.
  • Text storage is opt-in (store_text=False default) — writes .txt files for page text and structured content OCR.
  • Uses UPath (universal-pathlib) for fsspec compatibility — works with file://, s3://, gs://, etc.
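
To make the "ordered stage" decision concrete, here is a small sketch of a fluent builder that records stages in call order and runs them in that order. It is an assumption-laden toy: `SketchIngestor` and its methods are hypothetical and do not reflect the real GraphIngestor or `_append_ordered_transform_stages()` signatures; only the idea of a recorded `stage_order` comes from the description above.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Stage = Callable[[Row], Row]


class SketchIngestor:
    """Hypothetical fluent builder that records stages in the order they are called."""

    def __init__(self) -> None:
        self.stage_order: List[Tuple[str, Stage]] = []

    def _add(self, name: str, fn: Stage) -> "SketchIngestor":
        self.stage_order.append((name, fn))
        return self  # returning self is what makes the API chainable

    def ocr(self) -> "SketchIngestor":
        return self._add("ocr", lambda r: {**r, "text": "recognized text"})

    def store(self) -> "SketchIngestor":
        # The store stage can only persist what earlier stages produced.
        return self._add("store", lambda r: {**r, "stored_text": "text" in r})

    def embed(self) -> "SketchIngestor":
        return self._add("embed", lambda r: {**r, "embedding": [0.0]})

    def plan(self) -> List[str]:
        return [name for name, _ in self.stage_order]

    def run(self, row: Row) -> Row:
        for _, fn in self.stage_order:
            row = fn(row)
        return row


if __name__ == "__main__":
    pipeline = SketchIngestor().ocr().store().embed()
    print(pipeline.plan())
    print(pipeline.run({"page": 1}))
```

Because nothing is special-cased, calling `store()` before `ocr()` in this sketch would simply store a row with no OCR text, which is why placement is left to the caller.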

Test results

Unit tests: 34 passed (image store logic, params, edge cases)
Graph integration tests: 2 passed (StoreOperator in InprocessExecutor, multi-store-node pipeline)
Full test suite: 676 passed, 13 skipped, 0 failures (excluding 1 pre-existing Ray/LanceDB permission issue)

bo20 end-to-end (batch mode, 1 GPU):

  • 20 documents, 798 rows processed in 43.85s
  • Images only: 835 files (496 page JPEGs + 339 cropped PNGs)
  • Images + text: 1,597 files (835 images + 762 text files)
  • Execution plan: ReadBinary → DocToPdf → PDFSplit → PDFExtraction → PageElementDetection → OCR → StoreOperator → ExplodeContent → BatchEmbed

Test plan

  • pytest nemo_retriever/tests/test_io_image_store.py — 34 unit tests
  • pytest nemo_retriever/tests/test_store_pipeline_stages.py — graph integration tests
  • pytest nemo_retriever/tests/ — full suite regression check
  • graph_pipeline bo20 --store-images-uri — batch mode e2e (images only)
  • graph_pipeline bo20 --store-images-uri --store-text — batch mode e2e (images + text)

🤖 Generated with Claude Code

- Add StoreOperator graph node and image_store.py core logic
- Wire store as ordered stage in build_graph / GraphIngestor
- Add --store-images-uri and --store-text CLI flags to graph_pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
return None


def store_extracted_images(

Collaborator

I think we need to rename this. It is not indicative of everything it does. The name makes it seem like this is only for images but we can pull tons of things. I think if we shorten it to store_extracted it is indicative but general enough to be all encompassing, as is the function.

Collaborator

This is also a very large function, might be good to break it up to make it easier to read. I could see helper functions for image, text, and full_page that might be possible? What do you think?

Collaborator Author

Renamed store_extracted_images → store_extracted and decomposed into three focused helpers behind a _RowCtx dataclass:

  • _store_page_image — full page image direct-write
  • _store_content_items — unified handler for structured content (table/chart/infographic) and natural images (eliminates the duplicated direct-write-or-crop pattern)
  • _store_text — page text + structured content text to .txt

Main function is now a ~30-line orchestrator. 32/32 tests pass, bo20 and jp20 end-to-end output unchanged (1,597 files for bo20, recall@5=0.878 for jp20).
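
The shape of that decomposition might look roughly like the sketch below. It is illustrative only: the field names on `RowCtx`, the row keys, and the helper bodies are assumptions, and the helpers record labels instead of writing files so the control flow stays visible.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RowCtx:
    """Hypothetical per-row context shared by the helpers."""

    row: Dict[str, Any]
    index: int
    written: List[str] = field(default_factory=list)


def sketch_store_page_image(ctx: RowCtx) -> None:
    # Full-page image: written directly when present.
    if ctx.row.get("page_image"):
        ctx.written.append(f"row{ctx.index}:page_image")


def sketch_store_content_items(ctx: RowCtx) -> None:
    # One loop covers structured content and natural images alike.
    for kind in ("table", "chart", "infographic", "image"):
        for i, _item in enumerate(ctx.row.get(kind, [])):
            ctx.written.append(f"row{ctx.index}:{kind}[{i}]")


def sketch_store_text(ctx: RowCtx) -> None:
    if ctx.row.get("text"):
        ctx.written.append(f"row{ctx.index}:text")


def store_extracted_sketch(rows: List[Dict[str, Any]], store_text: bool = False) -> List[str]:
    """Orchestrator: build a context per row and delegate to the helpers."""
    written: List[str] = []
    for index, row in enumerate(rows):
        ctx = RowCtx(row=row, index=index)
        sketch_store_page_image(ctx)
        sketch_store_content_items(ctx)
        if store_text:  # text remains opt-in
            sketch_store_text(ctx)
        written.extend(ctx.written)
    return written


if __name__ == "__main__":
    demo = [{"page_image": "b64...", "table": ["t0", "t1"], "text": "page text"}]
    print(store_extracted_sketch(demo, store_text=True))
```

Passing one context object keeps each helper's signature small, and the single content-item loop is one way to avoid repeating the same write logic per content type.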

…lpers

- Rename function to reflect it handles images, text, tables, charts
- Extract _RowCtx dataclass, _store_page_image, _store_content_items, _store_text helpers
- Unify structured content and natural image storage into shared helper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
jperez999 merged commit f6dcf4b into NVIDIA:main on Apr 7, 2026
6 checks passed