(retriever) Add .store() for persisting extracted images and text by jioffe502 · Pull Request #1785 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-04-02T19:25:35Z

Summary

Adds a .store() pipeline stage that persists extracted images and text to local or cloud storage (any fsspec-compatible URI). The stage is wired as a first-class ordered graph operator, following the refactor-step architecture.

StoreOperator — graph operator (AbstractOperator + CPUOperator) that delegates to store_extracted_images() in io/image_store.py
StoreParams — Pydantic config model controlling what gets stored (page images, tables, charts, infographics, text) and where
--store-images-uri / --store-text CLI flags on graph_pipeline.py
Placement: after OCR/caption, before content-reshape/embed (user controls via fluent API call order)
strip_base64=True by default — frees image payloads from the DataFrame after writing to storage, reducing downstream memory pressure through embed
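
For illustration only, the options above can be sketched with the standard library. This is not the PR's code: a plain dataclass stands in for the Pydantic StoreParams, pathlib.Path stands in for an fsspec-backed path, and the names StoreParamsSketch, store_row, storage_uri, image_b64, and stored_image_uri are hypothetical. Only the strip_base64=True and store_text=False defaults come from the description.

```python
import base64
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict


@dataclass
class StoreParamsSketch:
    """Hypothetical stand-in for StoreParams (the real model is Pydantic)."""

    storage_uri: str           # assumed field name: where files are written
    store_text: bool = False   # text storage is opt-in per the PR description
    strip_base64: bool = True  # drop image payloads after writing, per the PR


def store_row(row: Dict[str, Any], params: StoreParamsSketch) -> Dict[str, Any]:
    """Write one row's base64 image to storage and return an updated copy."""
    out = dict(row)  # leave the caller's row untouched
    payload = row.get("image_b64")  # assumed column name
    if payload:
        target = Path(params.storage_uri) / f"page_{row['page']}.jpeg"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(base64.b64decode(payload))
        out["stored_image_uri"] = str(target)  # assumed column name
        if params.strip_base64:
            out["image_b64"] = None  # free the payload for downstream stages
    return out


with tempfile.TemporaryDirectory() as tmp:
    params = StoreParamsSketch(storage_uri=tmp)
    row = {"page": 1, "image_b64": base64.b64encode(b"fake-jpeg-bytes").decode()}
    stored = store_row(row, params)
    written = Path(stored["stored_image_uri"]).read_bytes()
```

After the call, the stored row carries a path instead of the payload, which is the memory saving that strip_base64 is described as providing ahead of embed.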

Key design decisions

Store is an ordered stage via _append_ordered_transform_stages(), not a special-cased insertion. Respects the stage_order recorded by the fluent API.
Page images written as JPEG (configurable), cropped content (tables/charts/infographics) as PNG for lossless quality.
Text storage is opt-in (store_text=False default) — writes .txt files for page text and structured content OCR.
Uses UPath (universal-pathlib) for fsspec compatibility — works with file://, s3://, gs://, etc.
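
The ordered-stage decision can be pictured with a toy fluent builder. This is a sketch under assumptions, not the GraphIngestor implementation: PipelineSketch and its methods are hypothetical, and the stages only tag strings. The point is that store lands wherever the caller chains it, with no special-cased slot.

```python
from typing import Callable, List, Tuple

Stage = Callable[[List[str]], List[str]]


class PipelineSketch:
    """Toy fluent builder that records stages in call order (hypothetical)."""

    def __init__(self) -> None:
        self.stage_order: List[Tuple[str, Stage]] = []

    def _add(self, name: str, fn: Stage) -> "PipelineSketch":
        self.stage_order.append((name, fn))
        return self  # returning self is what allows chaining

    def ocr(self) -> "PipelineSketch":
        return self._add("ocr", lambda rows: [r + "+ocr" for r in rows])

    def store(self) -> "PipelineSketch":
        return self._add("store", lambda rows: [r + "+stored" for r in rows])

    def embed(self) -> "PipelineSketch":
        return self._add("embed", lambda rows: [r + "+embedded" for r in rows])

    def run(self, rows: List[str]) -> List[str]:
        # Stages run in exactly the order they were chained.
        for _name, fn in self.stage_order:
            rows = fn(rows)
        return rows


pipeline = PipelineSketch().ocr().store().embed()
names = [name for name, _ in pipeline.stage_order]
result = pipeline.run(["page1"])
```

Chaining .store() after .ocr() and before .embed() mirrors the placement described in the summary.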

Test results

Unit tests: 34 passed (image store logic, params, edge cases)
Graph integration tests: 2 passed (StoreOperator in InprocessExecutor, multi-store-node pipeline)
Full test suite: 676 passed, 13 skipped, 0 failures (excluding 1 pre-existing Ray/LanceDB permission issue)

bo20 end-to-end (batch mode, 1 GPU):

20 documents, 798 rows processed in 43.85s
Images only: 835 files (496 page JPEGs + 339 cropped PNGs)
Images + text: 1,597 files (835 images + 762 text files)
Execution plan: ReadBinary → DocToPdf → PDFSplit → PDFExtraction → PageElementDetection → OCR → StoreOperator → ExplodeContent → BatchEmbed
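
As a quick consistency check on the bo20 numbers above, the file counts and throughput can be recomputed from the reported figures (variable names are only for illustration):

```python
# Figures copied from the bo20 results above.
page_jpegs = 496
cropped_pngs = 339
text_files = 762
rows = 798
seconds = 43.85

image_files = page_jpegs + cropped_pngs  # images-only run
all_files = image_files + text_files     # images + text run
rows_per_second = rows / seconds         # overall batch throughput
```

The sums match the reported 835 and 1,597 files, and the throughput works out to roughly 18.2 rows per second.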

Test plan

pytest nemo_retriever/tests/test_io_image_store.py — 34 unit tests
pytest nemo_retriever/tests/test_store_pipeline_stages.py — graph integration tests
pytest nemo_retriever/tests/ — full suite regression check
graph_pipeline bo20 --store-images-uri — batch mode e2e (images only)
graph_pipeline bo20 --store-images-uri --store-text — batch mode e2e (images + text)

🤖 Generated with Claude Code

- Add StoreOperator graph node and image_store.py core logic
- Wire store as ordered stage in build_graph / GraphIngestor
- Add --store-images-uri and --store-text CLI flags to graph_pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

jperez999 · 2026-04-03T17:13:13Z

nemo_retriever/src/nemo_retriever/io/image_store.py

+        return None
+
+
+def store_extracted_images(


I think we need to rename this. It is not indicative of everything it does. The name makes it seem like this is only for images, but we can pull tons of things. I think if we shorten it to store_extracted, it is indicative but general enough to be all-encompassing, as is the function.

This is also a very large function, might be good to break it up to make it easier to read. I could see helper functions for image, text, and full_page that might be possible? What do you think?

jioffe502 · 2026-04-06

Renamed store_extracted_images → store_extracted and decomposed it into three focused helpers behind a _RowCtx dataclass:

_store_page_image — full page image direct-write

_store_content_items — unified handler for structured content (table/chart/infographic) and natural images (eliminates the duplicated direct-write-or-crop pattern)

_store_text — page text + structured content text to .txt

The main function is now a ~30-line orchestrator. 32/32 tests pass, and the bo20 and jp20 end-to-end output is unchanged (1,597 files for bo20, recall@5=0.878 for jp20).
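
The shape of that decomposition might look roughly like the sketch below. It is an assumption-laden outline, not the merged code: RowCtxSketch and its fields, the row keys, and the file-naming scheme are all hypothetical, and the helpers record the filenames they would write instead of touching storage.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RowCtxSketch:
    """Hypothetical per-row context shared by the three helpers."""

    row: Dict[str, Any]
    base_name: str
    store_text: bool
    written: List[str] = field(default_factory=list)


def store_page_image(ctx: RowCtxSketch) -> None:
    # Full page image: direct write (JPEG in the PR's defaults).
    if ctx.row.get("page_image"):
        ctx.written.append(f"{ctx.base_name}.jpeg")


def store_content_items(ctx: RowCtxSketch) -> None:
    # One handler for tables, charts, infographics, and natural images.
    for i, item in enumerate(ctx.row.get("content_items", [])):
        ctx.written.append(f"{ctx.base_name}_{item['type']}_{i}.png")


def store_text_files(ctx: RowCtxSketch) -> None:
    # Text is opt-in, matching store_text=False by default.
    if ctx.store_text and ctx.row.get("text"):
        ctx.written.append(f"{ctx.base_name}.txt")


def store_extracted_sketch(rows: List[Dict[str, Any]], store_text: bool = False) -> List[str]:
    """Orchestrator: build a context per row and delegate to the helpers."""
    written: List[str] = []
    for idx, row in enumerate(rows):
        ctx = RowCtxSketch(row=row, base_name=f"doc_page{idx}", store_text=store_text)
        store_page_image(ctx)
        store_content_items(ctx)
        store_text_files(ctx)
        written.extend(ctx.written)
    return written


rows = [{
    "page_image": "<base64>",
    "content_items": [{"type": "table"}, {"type": "chart"}],
    "text": "hello",
}]
images_only = store_extracted_sketch(rows)
with_text = store_extracted_sketch(rows, store_text=True)
```

Keeping per-row state in one context object lets each helper take a single argument, which is one plausible reason for the dataclass.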

…lpers

- Rename function to reflect it handles images, text, tables, charts
- Extract _RowCtx dataclass, _store_page_image, _store_content_items, _store_text helpers
- Unify structured content and natural image storage into shared helper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

jioffe502 requested review from a team as code owners April 2, 2026 19:25

jioffe502 requested a review from ChrisJar April 2, 2026 19:25

jioffe502 mentioned this pull request Apr 2, 2026

(retriever) Add .store() task for persisting extracted images (#1675) #1714

Closed

(retriever) Fix pre-commit: black formatting and unused imports

5f09547

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

jperez999 requested changes Apr 3, 2026

jperez999 approved these changes Apr 7, 2026

jperez999 merged commit f6dcf4b into NVIDIA:main Apr 7, 2026
6 checks passed

jioffe502 mentioned this pull request Apr 7, 2026

Harness: expose store config for image/text storage #1811

Open

4 tasks

jperez999 merged 3 commits into NVIDIA:main from jioffe502:jioffe/store-extracted-images-v2
