
QA eval pipeline for retrieval #1754

Open
KyleZheng1284 wants to merge 17 commits into NVIDIA:main from KyleZheng1284:feature/qa-harness-fullpage-pipeline

Conversation


@KyleZheng1284 KyleZheng1284 commented Mar 30, 2026

Description

  • Adds a pluggable QA evaluation harness for measuring Retrieval quality end-to-end using multi-tier scoring.

Capabilities:

  • Multi-tier scoring -- Tier 1 retrieval recall (answer-in-context), Tier 2 programmatic (exact match + token F1), and Tier 3 LLM-as-judge (1-5 rubric) run together in a single pass at zero extra retrieval cost.
  • Full-page markdown retrieval -- Reconstructs complete document pages from NeMo Retriever extraction records via to_markdown_by_page().
  • Pluggable retrieval -- Any retrieval system (vector search, agentic, hybrid, BM25) plugs in by producing a standard JSON (queries → chunks); no harness code changes required.
  • Pluggable datasets -- Any CSV with query/answer columns loads via csv:path/to/file.csv; default ground truth is data/bo767_annotations.csv (1007 Q&A pairs, all modalities).
  • Pluggable LLMs -- Generator and judge models swap via env var or YAML config using litellm prefix routing (nvidia_nim/, openai/, huggingface/).
  • Multi-model sweep -- Set GEN_MODELS to evaluate multiple generators in a single run with side-by-side score comparisons.
  • Failure classification -- Per-query categorization into correct, partial, retrieval_miss, generation_miss, no_context, thinking_truncated to pinpoint exactly where the pipeline fails.
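The Tier-1 and Tier-2 checks above can be sketched roughly as follows. This is a minimal illustration only; the real helpers live in nemo_retriever.evaluation.scoring, and their exact normalization rules may differ:

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Lowercase and split on non-alphanumerics (simplified normalization).
    return re.findall(r"[a-z0-9]+", text.lower())


def answer_in_context(answer: str, context: str) -> bool:
    """Tier-1 word-set check: every answer token appears in the retrieved context."""
    answer_words = set(_tokens(answer))
    return bool(answer_words) and answer_words <= set(_tokens(context))


def token_f1(prediction: str, reference: str) -> float:
    """Tier-2 SQuAD-style token F1 between generated and reference answers."""
    pred, ref = _tokens(prediction), _tokens(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because all three tiers consume the same retrieved context and generated answer, they can run in one pass over the results with no additional retrieval calls.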

Note: the CSV containing the Q&A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. A separate PR with subset annotations for only the bo767-specific files is up here: #1730

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners March 30, 2026 21:26
@KyleZheng1284 KyleZheng1284 requested a review from nkmcalli March 30, 2026 21:26

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")


def main() -> int:
Collaborator

Why not make this a tool we can call via import, instead of a main function.

Member Author

core evaluation logic has been moved into nemo_retriever.evaluation (importable package, pip-installable via nemo_retriever[eval])

@KyleZheng1284 KyleZheng1284 force-pushed the feature/qa-harness-fullpage-pipeline branch from d7c48fa to 9262c63 Compare April 3, 2026 21:56
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR adds a substantial pluggable QA evaluation harness for retrieval quality measurement, introducing multi-tier scoring (Tier-1 retrieval recall, Tier-2 token F1, Tier-3 LLM-as-judge), a full graph-pipeline-based execution model, and support for swappable datasets, LLMs, and retrieval backends. The architecture is well-structured with clean protocol abstractions in types.py and a FileRetriever integration point.

Two bugs in the changed files warrant attention before merging:

  • RetrievalLoaderOperator silently swallows a helpful data_dir-missing error for the bo767_infographic dataset, replacing it with a misleading FileNotFoundError.
  • recall.py passes hybrid=sparse instead of hybrid=hybrid to the Milvus retrieval call, the same class of flag-swap bug flagged in utils/qa/__init__.py.

Confidence Score: 4/5

Two P1 logic bugs need fixing before merging; the rest of the new evaluation framework is well-structured.

The retrieval_loader.py ValueError-swallowing bug silently breaks the bo767_infographic dataset path with a misleading error, and recall.py ignores the hybrid flag for all Milvus-backed recall runs. Both are straightforward one-line fixes, but they represent real behavioral defects on named code paths introduced by this PR.

nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py (broad ValueError catch) and tools/harness/src/nv_ingest_harness/utils/recall.py (hybrid=sparse swap at lines 140 and 258).

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py | New source operator that loads retrieval JSON + ground truth CSV into a DataFrame; has a P1 bug where a broad `except ValueError` swallows the actionable data_dir error for bo767_infographic. |
| tools/harness/src/nv_ingest_harness/utils/recall.py | Added get_retrieval_func helper and updated recall functions; the Milvus path in both get_recall_scores and get_recall_scores_pdf_only still passes hybrid=sparse instead of hybrid=hybrid. |
| nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py | Core QAEvalPipeline: well-structured threaded generation+judging+scoring loop; default-arg closure fix for late binding is present and correct; aggregate output format is comprehensive. |
| nemo_retriever/src/nemo_retriever/evaluation/scoring.py | Multi-tier scoring with word-set Tier-1 check (previously flagged substring bug fixed), SQuAD-style token F1, and classify_failure with a judge_error sentinel; logic is clean. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Config loading, env-var expansion, and legacy-to-new format normalization; empty evaluations guard is present; check_unresolved_env is called in the runner for API keys. |
| nemo_retriever/src/nemo_retriever/evaluation/ground_truth.py | Dataset loaders for bo767_infographic, ViDoRe v3, and generic CSV; bo767_infographic now correctly validates data_dir before calling os.path.join. |
| nemo_retriever/src/nemo_retriever/evaluation/judges.py | LLM-as-judge with JSON parsing and regex fallback; returns JudgeResult(error=...) on failure instead of raising; the empty_candidate short-circuit is correct. |
| nemo_retriever/src/nemo_retriever/evaluation/generators.py | Unified LiteLLMClient wrapping litellm; thinking_truncated sentinel when strip_think_tags returns empty; extra_params applied last, which can intentionally override call kwargs. |
| nemo_retriever/src/nemo_retriever/evaluation/runner.py | New run_eval_sweep function replaces deleted script logic; check_unresolved_env guards on API keys per eval; timestamped JSON output per run is clean. |
| tools/harness/src/nv_ingest_harness/utils/qa/__init__.py | TopKRetriever correctly passes hybrid=self.hybrid to the Milvus retrieval function (previously flagged swap is resolved); the collection-existence pre-check is a good defensive pattern. |
| tools/harness/src/nv_ingest_harness/cases/qa_eval.py | Imports _expand_env_vars directly from nemo_retriever.evaluation.config (deduplication resolved); _build_retriever correctly defers heavy harness imports to the topk branch. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as evaluation/cli.py
    participant Loader as RetrievalLoaderOperator
    participant GT as ground_truth.py
    participant FR as FileRetriever/TopKRetriever
    participant Gen as QAGenerationOperator(LiteLLMClient)
    participant Judge as JudgingOperator(LLMJudge)
    participant Scorer as ScoringOperator(scoring.py)

    User->>CLI: retriever eval run --config eval_sweep.yaml
    CLI->>Loader: process(None)
    Loader->>GT: get_qa_dataset_loader(source)(data_dir)
    GT-->>Loader: qa_pairs list[dict]
    Loader->>FR: retrieve(query, top_k) per pair
    FR-->>Loader: RetrievalResult(chunks, metadata)
    Loader-->>Gen: DataFrame(query, reference_answer, context)
    Gen->>Gen: LiteLLMClient.generate() [ThreadPoolExecutor]
    Gen-->>Judge: DataFrame + answer, gen_error cols
    Judge->>Judge: LLMJudge.judge() [ThreadPoolExecutor]
    Judge-->>Scorer: DataFrame + judge_score, judge_reasoning cols
    Scorer->>Scorer: answer_in_context(), token_f1(), classify_failure()
    Scorer-->>CLI: DataFrame with Tier1/2/3 metrics + failure_mode
    CLI->>User: JSON results written to results_dir

Comments Outside Diff (1)

  1. tools/harness/src/nv_ingest_harness/utils/recall.py, lines 137-146

    P1 hybrid=sparse should be hybrid=hybrid

    nvingest_retrieval is called with hybrid=sparse, so the hybrid parameter passed to get_recall_scores has zero effect on the Milvus path. The LanceDB path (via get_retrieval_func) correctly threads through hybrid=hybrid, making the inconsistency clear. The same fix applies to the identical call in get_recall_scores_pdf_only at line 258.

Review comments with suggested fixes:

Path: nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py
Line: 71-76
**Overly-broad `except ValueError` swallows the `data_dir` error**

Both `get_qa_dataset_loader()` and `loader_fn(self._data_dir)` are inside the same `try` block. When `source="bo767_infographic"` and `self._data_dir=None`, `loader_fn(None)` raises `ValueError("bo767_infographic dataset requires data_dir to be set.")`. That ValueError is caught here and the code falls back to `load_generic_csv("bo767_infographic")`, which then raises a confusing `FileNotFoundError: CSV file not found: bo767_infographic` — swallowing the actionable message entirely.

Split the two operations so only the "unknown dataset" branch falls back to generic CSV:

```suggestion
        try:
            loader_fn = get_qa_dataset_loader(source)
        except ValueError:
            qa_pairs = load_generic_csv(source)
        else:
            qa_pairs = loader_fn(self._data_dir)
```

---

Path: tools/harness/src/nv_ingest_harness/utils/recall.py
Line: 137-146

**`hybrid=sparse` should be `hybrid=hybrid`** (same issue as the comment outside the diff above)

```suggestion
            batch_answers = nvingest_retrieval(
                batch_queries,
                collection_name,
                hybrid=hybrid,
```

---

Path: tools/harness/src/nv_ingest_harness/utils/recall.py
Line: 256-265
**Same `hybrid=sparse` swap in `get_recall_scores_pdf_only`**

Same bug as the Milvus call in `get_recall_scores` (line 140): `hybrid=sparse` ignores the `hybrid` argument for all Milvus-backed PDF-only recall runs.

```suggestion
            batch_answers = nvingest_retrieval(
                batch_queries,
                collection_name,
                hybrid=hybrid,
```


Reviews (10): last reviewed commit "restored singular column names for test".

Collaborator

@jperez999 jperez999 left a comment

Moving in the right direction. Let's remove all the changes to the harness that are not in nemo_retriever; that will slim down the PR quite a bit. Also, unless you feel it is really helpful, let's remove all the extra tools you added and replace them with helper functions for those actions. We should refactor to make it possible to tack these operators onto the graph in graph_pipeline.py or into the Retriever object already in use, reusing as much of the objects we already have as possible. Keep in mind, everything here is a discussion; if you feel it is better the way you have it, please explain it to me.

# ---------------------------------------------------------------------------


def run_agentic_retrieval(
Collaborator

So is this something that we need to do separately from the graph_pipeline.py entry point? Can't we just add in the operators we want and use that same entry point? It would then allow us to make changes to the query file and datasets and still get the same behavior.

--output data/test_retrieval/bo767_retrieval_dense.json
"""

from __future__ import annotations
Collaborator

Why create a whole new file to do what graph_pipeline already mostly does?

Member Author

This script exists because retrieval-bench only works with HuggingFace datasets out of the box. We would need this file to load our extraction Parquets, expand chunk hits to full-page markdown, and output the FileRetriever JSON that our QA eval pipeline expects.
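The chunk-to-full-page expansion described here could be sketched as below. This is a hypothetical helper for illustration: the function name and the hit/index shapes are assumptions, not the actual harness API; the page index is assumed to be built from to_markdown_by_page() over the extraction records:

```python
def expand_to_full_pages(chunk_hits, page_index):
    """Expand retrieved chunk hits into full-page markdown contexts.

    chunk_hits: list of dicts with 'source_id' and 'page' keys (assumed shape).
    page_index: maps (source_id, page) -> full-page markdown string.
    Deduplicates so each page appears at most once, preserving hit order.
    """
    seen = set()
    pages = []
    for hit in chunk_hits:
        key = (hit["source_id"], hit["page"])
        if key in seen:
            continue
        seen.add(key)
        if key in page_index:
            pages.append(page_index[key])
    return pages
```

Deduplicating by (source_id, page) matters because several retrieved chunks often come from the same page, and repeating the full page would inflate the generator's context for no gain.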

import json
import os

from nv_ingest_harness.cases.e2e import main as e2e_main
Collaborator

Again it seems like you are creating a whole new graph specifically for this. When what I think we want is to be able to tack on these operations to any graph.

from nemo_retriever.evaluation.types import RetrievalResult


class TopKRetriever:
Collaborator

Why are you adding this in the harness. This should exist in nemo_retriever. All code changes in legacy nv-ingest can be removed unless necessary to make nemo_retriever work.

Member Author

Moving it would pull harness dependencies into nemo_retriever, right? That isn't what we want. It makes more sense in my mind if the harness consumes the nemo_retriever protocol instead of vice versa.
