(retriever) Add .store() for persisting extracted images and text#1785
jperez999 merged 3 commits into NVIDIA:main
- Add StoreOperator graph node and image_store.py core logic
- Wire store as ordered stage in build_graph / GraphIngestor
- Add --store-images-uri and --store-text CLI flags to graph_pipeline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Review thread on the line `def store_extracted_images(`:
I think we need to rename this. It is not indicative of everything it does. The name makes it seem like this is only for images but we can pull tons of things. I think if we shorten it to store_extracted it is indicative but general enough to be all encompassing, as is the function.
This is also a very large function, might be good to break it up to make it easier to read. I could see helper functions for image, text, and full_page that might be possible? What do you think?
Renamed `store_extracted_images` → `store_extracted` and decomposed into three focused helpers behind a `_RowCtx` dataclass:
- `_store_page_image`: full page image direct-write
- `_store_content_items`: unified handler for structured content (table/chart/infographic) and natural images (eliminates the duplicated direct-write-or-crop pattern)
- `_store_text`: page text + structured content text to `.txt`
Main function is now a ~30-line orchestrator. 32/32 tests pass, bo20 and jp20 end-to-end output unchanged (1,597 files for bo20, recall@5=0.878 for jp20).
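For readers following the thread, the shape of that refactor can be sketched roughly as below. This is an illustration only, not the code in `io/image_store.py`: the helper names mirror the ones listed in the reply, but every field of `_RowCtx`, the row keys (`page_image`, `table`, `text`, ...), the file naming, and the `write` callback are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

STRUCTURED_KINDS = ("table", "chart", "infographic")


@dataclass
class _RowCtx:
    """Per-row state shared by the helpers (fields are hypothetical)."""
    row: Dict[str, Any]
    stem: str
    store_text: bool
    write: Callable[[str, bytes], None]
    written: List[str] = field(default_factory=list)

    def save(self, name: str, payload: bytes) -> None:
        # Single write path so every helper records what it stored.
        self.write(name, payload)
        self.written.append(name)


def _store_page_image(ctx: _RowCtx) -> None:
    image = ctx.row.get("page_image")
    if image is not None:
        ctx.save(f"{ctx.stem}.png", image)


def _store_content_items(ctx: _RowCtx) -> None:
    # One loop covers structured content and natural images alike.
    for kind in STRUCTURED_KINDS + ("image",):
        for i, item in enumerate(ctx.row.get(kind, [])):
            ctx.save(f"{ctx.stem}_{kind}_{i}.png", item["image"])


def _store_text(ctx: _RowCtx) -> None:
    if not ctx.store_text:
        return
    parts = [ctx.row.get("text", "")]
    for kind in STRUCTURED_KINDS:
        parts += [item.get("text", "") for item in ctx.row.get(kind, [])]
    text = "\n".join(p for p in parts if p)
    if text:
        ctx.save(f"{ctx.stem}.txt", text.encode("utf-8"))


def store_extracted(rows, write, store_text=False):
    """Orchestrator: build a context per row and call each helper in turn."""
    written: List[str] = []
    for index, row in enumerate(rows):
        ctx = _RowCtx(row=row, stem=f"page_{index}", store_text=store_text, write=write)
        _store_page_image(ctx)
        _store_content_items(ctx)
        _store_text(ctx)
        written.extend(ctx.written)
    return written
```

Passing a plain `dict.__setitem__` as `write` is enough to exercise the sketch in memory; a real implementation would write to the configured storage URI instead.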
…lpers
- Rename function to reflect it handles images, text, tables, charts
- Extract _RowCtx dataclass, _store_page_image, _store_content_items, _store_text helpers
- Unify structured content and natural image storage into shared helper

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Summary
Adds a `.store()` pipeline stage that persists extracted images and text to local or cloud storage (any fsspec-compatible URI). Wired as a first-class ordered graph operator following the refactor-step architecture.
- `StoreOperator`: graph operator (`AbstractOperator` + `CPUOperator`) that delegates to `store_extracted_images()` in `io/image_store.py`
- `StoreParams`: Pydantic config model controlling what gets stored (page images, tables, charts, infographics, text) and where
- `--store-images-uri` / `--store-text` CLI flags on `graph_pipeline.py`
- `strip_base64=True` by default: frees image payloads from the DataFrame after writing to storage, reducing downstream memory pressure through embed

Key design decisions
- Store is wired as an ordered stage through `_append_ordered_transform_stages()`, not a special-cased insertion. Respects the stage_order recorded by the fluent API.
- Text storage is off by default (`store_text=False`); when enabled, it writes `.txt` files for page text and structured content OCR.
- `UPath` (universal-pathlib) for fsspec compatibility: works with `file://`, `s3://`, `gs://`, etc.

Test results
Unit tests: 34 passed (image store logic, params, edge cases)
Graph integration tests: 2 passed (StoreOperator in InprocessExecutor, multi-store-node pipeline)
Full test suite: 676 passed, 13 skipped, 0 failures (excluding 1 pre-existing Ray/LanceDB permission issue)
bo20 end-to-end (batch mode, 1 GPU):
ReadBinary → DocToPdf → PDFSplit → PDFExtraction → PageElementDetection → OCR → StoreOperator → ExplodeContent → BatchEmbed

Test plan
- `pytest nemo_retriever/tests/test_io_image_store.py`: 34 unit tests
- `pytest nemo_retriever/tests/test_store_pipeline_stages.py`: graph integration tests
- `pytest nemo_retriever/tests/`: full suite regression check
- `graph_pipeline bo20 --store-images-uri`: batch mode e2e (images only)
- `graph_pipeline bo20 --store-images-uri --store-text`: batch mode e2e (images + text)

🤖 Generated with Claude Code
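As a rough illustration of the write-then-strip behavior the description attributes to `strip_base64`, the sketch below decodes each row's image payload, writes it under a storage root, and then clears the payload so later stages carry only a reference. It is not the PR's implementation: the function name `store_and_strip`, the `image_b64` and `stored_uri` keys, and the file naming are all assumptions, and stdlib `pathlib.Path` stands in for `UPath` here so the example runs without extra dependencies.

```python
import base64
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def store_and_strip(
    rows: List[Dict[str, Any]],
    root: Path,
    strip_base64: bool = True,
) -> List[Dict[str, Any]]:
    """Write each row's image to `root`, then optionally drop the payload.

    `image_b64` and `stored_uri` are hypothetical column names.
    """
    root.mkdir(parents=True, exist_ok=True)
    for index, row in enumerate(rows):
        payload = row.get("image_b64")
        if not payload:
            continue  # nothing to persist for this row
        target = root / f"page_{index}.png"
        target.write_bytes(base64.b64decode(payload))
        row["stored_uri"] = str(target)
        if strip_base64:
            # Free the in-memory payload once it is safely on storage.
            row["image_b64"] = None
    return rows


with tempfile.TemporaryDirectory() as tmp:
    rows = [{"image_b64": base64.b64encode(b"fake-png").decode("ascii")}]
    result = store_and_strip(rows, Path(tmp) / "out")
    stored_bytes = Path(result[0]["stored_uri"]).read_bytes()
```

Because the function only relies on `mkdir`, `/`, and `write_bytes`, a pathlib-compatible object such as `UPath` could presumably be passed as `root` to target remote storage, though that is untested here.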