
QA eval pipeline for retrieval #1754

Open
KyleZheng1284 wants to merge 17 commits into NVIDIA:main from KyleZheng1284:feature/qa-harness-fullpage-pipeline

Conversation


@KyleZheng1284 KyleZheng1284 commented Mar 30, 2026

Description

  • Adds a pluggable QA evaluation harness for measuring Retrieval quality end-to-end using multi-tier scoring.

Capabilities:

  • Multi-tier scoring -- Tier 1 retrieval recall (answer-in-context), Tier 2 programmatic (exact match + token F1), and Tier 3 LLM-as-judge (1-5 rubric) run together in a single pass at zero extra retrieval cost.
  • Full-page markdown retrieval -- Reconstructs complete document pages from NeMo Retriever extraction records via to_markdown_by_page().
  • Pluggable retrieval -- Any retrieval system (vector search, agentic, hybrid, BM25) plugs in by producing a standard JSON (queries → chunks); no harness code changes required.
  • Pluggable datasets -- Any CSV with query/answer columns loads via csv:path/to/file.csv; default ground truth is data/bo767_annotations.csv (1007 Q&A pairs, all modalities).
  • Pluggable LLMs -- Generator and judge models swap via env var or YAML config using litellm prefix routing (nvidia_nim/, openai/, huggingface/).
  • Multi-model sweep -- Set GEN_MODELS to evaluate multiple generators in a single run with side-by-side score comparisons.
  • Failure classification -- Per-query categorization into correct, partial, retrieval_miss, generation_miss, no_context, thinking_truncated to pinpoint exactly where the pipeline fails.
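The Tier-1 and Tier-2 checks above can be sketched roughly as follows. This is a minimal illustration only; the real helpers live in nemo_retriever.evaluation.scoring, and their exact normalization rules may differ:

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Lowercase and split on non-alphanumerics (simplified normalization).
    return re.findall(r"[a-z0-9]+", text.lower())


def answer_in_context(answer: str, context: str) -> bool:
    """Tier-1 word-set check: every answer token appears in the retrieved context."""
    answer_words = set(_tokens(answer))
    return bool(answer_words) and answer_words <= set(_tokens(context))


def token_f1(prediction: str, reference: str) -> float:
    """Tier-2 SQuAD-style token F1 between generated and reference answers."""
    pred, ref = _tokens(prediction), _tokens(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because all three tiers consume the same retrieved context and generated answer, they can run in one pass over the results with no additional retrieval calls.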

Note: the CSV containing the Q&A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. A separate PR with subset annotations for only the bo767-specific files is up here: #1730

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners March 30, 2026 21:26
@KyleZheng1284 KyleZheng1284 requested a review from nkmcalli March 30, 2026 21:26

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")


def main() -> int:
Collaborator

Why not make this a tool we can call via import, instead of a main function.

Member Author

core evaluation logic has been moved into nemo_retriever.evaluation (importable package, pip-installable via nemo_retriever[eval])

@KyleZheng1284 KyleZheng1284 force-pushed the feature/qa-harness-fullpage-pipeline branch from d7c48fa to 9262c63 Compare April 3, 2026 21:56
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR adds a substantial pluggable QA evaluation harness for retrieval quality measurement, introducing multi-tier scoring (Tier-1 retrieval recall, Tier-2 token F1, Tier-3 LLM-as-judge), a full graph-pipeline-based execution model, and support for swappable datasets, LLMs, and retrieval backends. The architecture is well-structured with clean protocol abstractions in types.py and a FileRetriever integration point.

Two bugs in the changed files warrant attention before merging:

  • RetrievalLoaderOperator silently swallows a helpful data_dir-missing error for the bo767_infographic dataset, replacing it with a misleading FileNotFoundError.
  • recall.py passes hybrid=sparse instead of hybrid=hybrid to the Milvus retrieval call, the same class of flag-swap bug flagged in utils/qa/__init__.py.

Confidence Score: 4/5

Two P1 logic bugs need fixing before merging; the rest of the new evaluation framework is well-structured.

The retrieval_loader.py ValueError-swallowing bug silently breaks the bo767_infographic dataset path with a misleading error, and recall.py ignores the hybrid flag for all Milvus-backed recall runs. Both are straightforward one-line fixes, but they represent real behavioral defects on named code paths introduced by this PR.

nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py (broad ValueError catch) and tools/harness/src/nv_ingest_harness/utils/recall.py (hybrid=sparse swap at lines 140 and 258).

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py | New source operator that loads retrieval JSON + ground truth CSV into a DataFrame; has a P1 bug where a broad `except ValueError` swallows the actionable data_dir error for bo767_infographic. |
| tools/harness/src/nv_ingest_harness/utils/recall.py | Added get_retrieval_func helper and updated recall functions; the Milvus path in both get_recall_scores and get_recall_scores_pdf_only still passes hybrid=sparse instead of hybrid=hybrid. |
| nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py | Core QAEvalPipeline: well-structured threaded generation+judging+scoring loop; default-arg closure fix for late binding is present and correct; aggregate output format is comprehensive. |
| nemo_retriever/src/nemo_retriever/evaluation/scoring.py | Multi-tier scoring with word-set Tier-1 check (previously flagged substring bug fixed), SQuAD-style token F1, and classify_failure with a judge_error sentinel; logic is clean. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Config loading, env-var expansion, and legacy-to-new format normalization; empty evaluations guard is present; check_unresolved_env is called in the runner for API keys. |
| nemo_retriever/src/nemo_retriever/evaluation/ground_truth.py | Dataset loaders for bo767_infographic, ViDoRe v3, and generic CSV; bo767_infographic now correctly validates data_dir before calling os.path.join. |
| nemo_retriever/src/nemo_retriever/evaluation/judges.py | LLM-as-judge with JSON parsing and regex fallback; returns JudgeResult(error=...) on failure instead of raising; the empty_candidate short-circuit is correct. |
| nemo_retriever/src/nemo_retriever/evaluation/generators.py | Unified LiteLLMClient wrapping litellm; thinking_truncated sentinel when strip_think_tags returns empty; extra_params applied last, which can intentionally override call kwargs. |
| nemo_retriever/src/nemo_retriever/evaluation/runner.py | New run_eval_sweep function replaces deleted script logic; check_unresolved_env guards on API keys per eval; timestamped JSON output per run is clean. |
| tools/harness/src/nv_ingest_harness/utils/qa/__init__.py | TopKRetriever correctly passes hybrid=self.hybrid to the Milvus retrieval function (previously flagged swap is resolved); the collection-existence pre-check is a good defensive pattern. |
| tools/harness/src/nv_ingest_harness/cases/qa_eval.py | Imports _expand_env_vars directly from nemo_retriever.evaluation.config (deduplication resolved); _build_retriever correctly defers heavy harness imports to the topk branch. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as evaluation/cli.py
    participant Loader as RetrievalLoaderOperator
    participant GT as ground_truth.py
    participant FR as FileRetriever/TopKRetriever
    participant Gen as QAGenerationOperator(LiteLLMClient)
    participant Judge as JudgingOperator(LLMJudge)
    participant Scorer as ScoringOperator(scoring.py)

    User->>CLI: retriever eval run --config eval_sweep.yaml
    CLI->>Loader: process(None)
    Loader->>GT: get_qa_dataset_loader(source)(data_dir)
    GT-->>Loader: qa_pairs list[dict]
    Loader->>FR: retrieve(query, top_k) per pair
    FR-->>Loader: RetrievalResult(chunks, metadata)
    Loader-->>Gen: DataFrame(query, reference_answer, context)
    Gen->>Gen: LiteLLMClient.generate() [ThreadPoolExecutor]
    Gen-->>Judge: DataFrame + answer, gen_error cols
    Judge->>Judge: LLMJudge.judge() [ThreadPoolExecutor]
    Judge-->>Scorer: DataFrame + judge_score, judge_reasoning cols
    Scorer->>Scorer: answer_in_context(), token_f1(), classify_failure()
    Scorer-->>CLI: DataFrame with Tier1/2/3 metrics + failure_mode
    CLI->>User: JSON results written to results_dir

Comments Outside Diff (1)

  1. tools/harness/src/nv_ingest_harness/utils/recall.py, lines 137-146

    P1 hybrid=sparse should be hybrid=hybrid

    nvingest_retrieval is called with hybrid=sparse, so the hybrid parameter passed to get_recall_scores has zero effect on the Milvus path. The LanceDB path (via get_retrieval_func) correctly threads through hybrid=hybrid, making the inconsistency clear. The same fix applies to the identical call in get_recall_scores_pdf_only at line 258.

Review comments with suggested fixes:

Path: nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py
Line: 71-76
**Overly-broad `except ValueError` swallows the `data_dir` error**

Both `get_qa_dataset_loader()` and `loader_fn(self._data_dir)` are inside the same `try` block. When `source="bo767_infographic"` and `self._data_dir=None`, `loader_fn(None)` raises `ValueError("bo767_infographic dataset requires data_dir to be set.")`. That ValueError is caught here and the code falls back to `load_generic_csv("bo767_infographic")`, which then raises a confusing `FileNotFoundError: CSV file not found: bo767_infographic` — swallowing the actionable message entirely.

Split the two operations so only the "unknown dataset" branch falls back to generic CSV:

```suggestion
        try:
            loader_fn = get_qa_dataset_loader(source)
        except ValueError:
            qa_pairs = load_generic_csv(source)
        else:
            qa_pairs = loader_fn(self._data_dir)
```

---

Path: tools/harness/src/nv_ingest_harness/utils/recall.py
Line: 137-146

**`hybrid=sparse` should be `hybrid=hybrid`** (same issue as the comment outside the diff above)

```suggestion
            batch_answers = nvingest_retrieval(
                batch_queries,
                collection_name,
                hybrid=hybrid,
```

---

Path: tools/harness/src/nv_ingest_harness/utils/recall.py
Line: 256-265
**Same `hybrid=sparse` swap in `get_recall_scores_pdf_only`**

Same bug as the Milvus call in `get_recall_scores` (line 140): `hybrid=sparse` ignores the `hybrid` argument for all Milvus-backed PDF-only recall runs.

```suggestion
            batch_answers = nvingest_retrieval(
                batch_queries,
                collection_name,
                hybrid=hybrid,
```


Reviews (10): last reviewed commit "restored singular column names for test".

Collaborator

@jperez999 jperez999 left a comment

Moving in the right direction. Let's remove all the changes to the harness that are not in nemo_retriever; that will slim down the PR quite a bit. Also, unless you feel it is really helpful, let's remove all the extra tools you added and replace them with helper functions for those actions. We should refactor to make it possible to tack these operators onto the graph in graph_pipeline.py or into the Retriever object already in use, reusing as much of the objects we already have as possible. Keep in mind, everything here is a discussion; if you feel it is better the way you have it, please explain it to me.

# ---------------------------------------------------------------------------


def run_agentic_retrieval(
Collaborator

So is this something that we need to do separately from the graph_pipeline.py entry point? Can't we just add in the operators we want and use that same entry point? It would then allow us to make changes to the query file and datasets and still get the same behavior.

--output data/test_retrieval/bo767_retrieval_dense.json
"""

from __future__ import annotations
Collaborator

Why create a whole new file to do what graph_pipeline already mostly does?

Member Author

This script exists because retrieval-bench only works with HuggingFace datasets out of the box. We would need this file to load our extraction Parquets, expand chunk hits to full-page markdown, and output the FileRetriever JSON that our QA eval pipeline expects.
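The chunk-to-full-page expansion described here could be sketched as below. This is a hypothetical helper for illustration: the function name and the hit/index shapes are assumptions, not the actual harness API; the page index is assumed to be built from to_markdown_by_page() over the extraction records:

```python
def expand_to_full_pages(chunk_hits, page_index):
    """Expand retrieved chunk hits into full-page markdown contexts.

    chunk_hits: list of dicts with 'source_id' and 'page' keys (assumed shape).
    page_index: maps (source_id, page) -> full-page markdown string.
    Deduplicates so each page appears at most once, preserving hit order.
    """
    seen = set()
    pages = []
    for hit in chunk_hits:
        key = (hit["source_id"], hit["page"])
        if key in seen:
            continue
        seen.add(key)
        if key in page_index:
            pages.append(page_index[key])
    return pages
```

Deduplicating by (source_id, page) matters because several retrieved chunks often come from the same page, and repeating the full page would inflate the generator's context for no gain.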

import json
import os

from nv_ingest_harness.cases.e2e import main as e2e_main
Collaborator

Again it seems like you are creating a whole new graph specifically for this. When what I think we want is to be able to tack on these operations to any graph.

from nemo_retriever.evaluation.types import RetrievalResult


class TopKRetriever:
Collaborator

Why are you adding this in the harness. This should exist in nemo_retriever. All code changes in legacy nv-ingest can be removed unless necessary to make nemo_retriever work.

Member Author

Moving it would pull harness dependencies into nemo_retriever, right? That isn't what we want. It makes more sense in my mind if the harness consumes the nemo_retriever protocol instead of vice versa.
