Skip to content

Harness: expose store config for image/text storage#1811

Merged
jioffe502 merged 4 commits intoNVIDIA:mainfrom
jioffe502:jioffe/harness-store-config
Apr 7, 2026
Merged

Harness: expose store config for image/text storage#1811
jioffe502 merged 4 commits intoNVIDIA:mainfrom
jioffe502:jioffe/harness-store-config

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

Summary

Wires store_extracted into the harness so storage can be configured via test_configs.yaml or --override. Addresses the store portion of the RFE for MinIO-backed citation assets (images, tables, charts).

  • Adds store_images_uri, store_text, strip_base64 to HarnessConfig with env var overrides
  • Adds _resolve_store_uri() — resolves relative URIs to artifact_dir/<name>/, matching the lancedb_uri pattern so each harness run gets isolated storage
  • Wires --store-images-uri, --store-text, --strip-base64 into _build_command() (only when URI is set)
  • Records store config in results.json payload
  • Fixes SettingWithCopyWarning in store_extracted via df.copy() (Ray passes DataFrame slices)

Artifact layout with store enabled

artifacts/bo20_20260407_165842_UTC/
├── command.txt
├── lancedb/
├── results.json
├── runtime_metrics/
└── stored_images/          ← new, only when store_images_uri is set
    ├── document_1/
    │   ├── page_1.jpeg     ← full page image (source encoding preserved)
    │   ├── page_1.txt      ← page text (when store_text=true)
    │   ├── page_1_table_0.png   ← cropped table
    │   ├── page_1_table_0.txt   ← table text
    │   ├── page_1_chart_0.png   ← cropped chart
    │   └── page_1_image_0.png   ← natural sub-page image
    └── document_2/
        └── ...

Usage

# test_configs.yaml — per-dataset
datasets:
  bo20:
    store_images_uri: stored_images   # resolves to artifact_dir/stored_images/
    store_text: true
    strip_base64: true                # default; frees memory after writing
# CLI override
retriever harness run --dataset bo20 --preset single_gpu \
  --override store_images_uri=stored_images --override store_text=true

Validated

Test plan

  • pytest tests/test_harness_config.py tests/test_harness_run.py — 46/46 pass
  • pre-commit run --all-files — clean
  • retriever harness run --dataset bo20 --override store_images_uri=stored_images --override store_text=true — 2,093 files in artifact dir
  • Reviewer: confirm results.json includes store_images_uri, store_text, strip_base64 in test_config

🤖 Generated with Claude Code

- Add store_images_uri, store_text, strip_base64 to HarnessConfig
- Resolve store URI relative to artifact dir (like lancedb_uri)
- Wire flags into _build_command and results.json payload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 requested review from a team as code owners April 7, 2026 17:04
@jioffe502 jioffe502 requested a review from nkmcalli April 7, 2026 17:04
jioffe502 and others added 2 commits April 7, 2026 17:05
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 7, 2026

Greptile Summary

Wires store_images_uri, store_text, and strip_base64 end-to-end through the harness: new HarnessConfig fields with env-var overrides, _resolve_store_uri() that correctly passes fsspec URIs through unchanged via the :// guard, CLI flag injection in _build_command(), and persistence of the resolved URI to results.json. The df.copy() fix in image_store.py resolves the SettingWithCopyWarning from Ray DataFrame slices; all production callers already use the return value so there is no functional regression.

Confidence Score: 5/5

Safe to merge; only a P2 docstring nit remains.

The previously-reported S3 URI bug is fixed with the :// guard. All remaining findings are P2 documentation issues that do not affect runtime behaviour. Test coverage is solid across config loading, command building, and edge cases.

nemo_retriever/src/nemo_retriever/io/image_store.py - docstring for store_extracted should be updated to reflect copy-return semantics.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/harness/config.py Adds store_images_uri, store_text, strip_base64 to HarnessConfig with matching env-var mappings; straightforward and correct.
nemo_retriever/src/nemo_retriever/harness/run.py Adds _resolve_store_uri() with correct s3:// passthrough guard; wires store flags into _build_command() and records resolved URI in results.json.
nemo_retriever/src/nemo_retriever/io/image_store.py df.copy() correctly fixes SettingWithCopyWarning, but docstring still says in-place / mutated input DataFrame - needs update (P2).
nemo_retriever/tests/test_harness_config.py Adds test covering the three new store config fields loaded from YAML; adequate.
nemo_retriever/tests/test_harness_run.py Adds three focused tests: store flags present, flags absent when URI is None, and no-strip-base64 variant; good edge-case coverage.
nemo_retriever/harness/test_configs.yaml Adds store_images_uri/store_text/strip_base64 defaults to the active section; consistent with new HarnessConfig fields.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HarnessConfig] --> B[_resolve_store_uri]
    B -->|None| C[Skip store flags]
    B -->|contains ://| D[Pass through unchanged]
    B -->|relative path| E[artifact_dir / path]
    D --> F[_build_command]
    E --> F
    F --> G[CLI store flags]
    G --> H[graph_pipeline subprocess]
    F --> I[results.json]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/io/image_store.py
Line: 312-353

Comment:
**Stale in-place language after df.copy() fix**

Now that `df = df.copy()` is added at the top of the function body, `store_extracted` no longer mutates the caller's DataFrame. Two docstring phrases still describe the old behaviour: line 312 says "Updates the DataFrame in-place" and line 353 says "The (mutated) input DataFrame". All production callers (`StoreOperator.process`) already use the return value so there is no functional regression - this is purely a documentation fix.

Suggested change for the Returns section:
```
Returns
-------
pd.DataFrame
    A copy of the input DataFrame with storage URIs added.
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "Update nemo_retriever/src/nemo_retriever..." | Re-trigger Greptile

@jioffe502 jioffe502 requested review from jperez999 and removed request for nkmcalli April 7, 2026 17:38
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@jioffe502 jioffe502 merged commit 6b7988b into NVIDIA:main Apr 7, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants