Harness: expose store config for image/text storage#1811
Harness: expose store config for image/text storage#1811jioffe502 merged 4 commits intoNVIDIA:mainfrom
Conversation
- Add store_images_uri, store_text, strip_base64 to HarnessConfig - Resolve store URI relative to artifact dir (like lancedb_uri) - Wire flags into _build_command and results.json payload Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Greptile SummaryWires store_images_uri, store_text, and strip_base64 end-to-end through the harness: new HarnessConfig fields with env-var overrides, _resolve_store_uri() that correctly passes fsspec URIs through unchanged via the :// guard, CLI flag injection in _build_command(), and persistence of the resolved URI to results.json. The df.copy() fix in image_store.py resolves the SettingWithCopyWarning from Ray DataFrame slices; all production callers already use the return value so there is no functional regression.
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/harness/config.py | Adds store_images_uri, store_text, strip_base64 to HarnessConfig with matching env-var mappings; straightforward and correct. |
| nemo_retriever/src/nemo_retriever/harness/run.py | Adds _resolve_store_uri() with correct s3:// passthrough guard; wires store flags into _build_command() and records resolved URI in results.json. |
| nemo_retriever/src/nemo_retriever/io/image_store.py | df.copy() correctly fixes SettingWithCopyWarning, but docstring still says in-place / mutated input DataFrame - needs update (P2). |
| nemo_retriever/tests/test_harness_config.py | Adds test covering the three new store config fields loaded from YAML; adequate. |
| nemo_retriever/tests/test_harness_run.py | Adds three focused tests: store flags present, flags absent when URI is None, and no-strip-base64 variant; good edge-case coverage. |
| nemo_retriever/harness/test_configs.yaml | Adds store_images_uri/store_text/strip_base64 defaults to the active section; consistent with new HarnessConfig fields. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[HarnessConfig] --> B[_resolve_store_uri]
B -->|None| C[Skip store flags]
B -->|contains ://| D[Pass through unchanged]
B -->|relative path| E[artifact_dir / path]
D --> F[_build_command]
E --> F
F --> G[CLI store flags]
G --> H[graph_pipeline subprocess]
F --> I[results.json]
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/io/image_store.py
Line: 312-353
Comment:
**Stale in-place language after df.copy() fix**
Now that `df = df.copy()` is added at the top of the function body, `store_extracted` no longer mutates the caller's DataFrame. Two docstring phrases still describe the old behaviour: line 312 says "Updates the DataFrame in-place" and line 353 says "The (mutated) input DataFrame". All production callers (`StoreOperator.process`) already use the return value so there is no functional regression - this is purely a documentation fix.
Suggested change for the Returns section:
```
Returns
-------
pd.DataFrame
A copy of the input DataFrame with storage URIs added.
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (2): Last reviewed commit: "Update nemo_retriever/src/nemo_retriever..." | Re-trigger Greptile
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Summary
Wires
store_extractedinto the harness so storage can be configured viatest_configs.yamlor--override. Addresses the store portion of the RFE for MinIO-backed citation assets (images, tables, charts).store_images_uri,store_text,strip_base64toHarnessConfigwith env var overrides_resolve_store_uri()— resolves relative URIs toartifact_dir/<name>/, matching thelancedb_uripattern so each harness run gets isolated storage--store-images-uri,--store-text,--strip-base64into_build_command()(only when URI is set)results.jsonpayloadSettingWithCopyWarninginstore_extractedviadf.copy()(Ray passes DataFrame slices)Artifact layout with store enabled
Usage
# CLI override retriever harness run --dataset bo20 --preset single_gpu \ --override store_images_uri=stored_images --override store_text=trueValidated
Test plan
pytest tests/test_harness_config.py tests/test_harness_run.py— 46/46 passpre-commit run --all-files— cleanretriever harness run --dataset bo20 --override store_images_uri=stored_images --override store_text=true— 2,093 files in artifact dirresults.jsonincludesstore_images_uri,store_text,strip_base64intest_config🤖 Generated with Claude Code