MultiHop-RAG Benchmark

A multi-hop question-answering benchmark built with the GoalSeeking blueprint from opensymbolicai-core. It answers complex questions that require reasoning across multiple documents by iteratively retrieving evidence — each retrieval "hop" is one iteration in the GoalSeeking loop.

What is MultiHopRAG?

MultiHopRAG is a benchmark for evaluating retrieval-augmented generation on questions that cannot be answered from a single document. It contains 609 news articles and 2,556 queries where the answer requires connecting facts across 2–4 articles.

Example problem:

Q: Who was the individual associated with cryptocurrency who was found guilty?

A single article mentions a "crypto executive on trial" but not the name. Another article names "Sam Bankman-Fried" in a fraud case but doesn't mention the verdict. A third article reports the guilty verdict. The system must retrieve all three, link the entities, and synthesize: Sam Bankman-Fried.

Query types span inference (connecting facts across articles), comparison (across news sources), temporal (across time periods), and null (insufficient information in the corpus).

Why GoalSeeking?

Traditional RAG retrieves documents once and generates an answer. This works for simple look-up questions but fails when the answer depends on evidence scattered across multiple articles — the retriever doesn't know what to look for until it has seen the first result.

GoalSeeking makes retrieval iterative: each hop sees the accumulated evidence and decides what to search for next, like a researcher following leads. This structured approach lets a 120B-parameter open-source model (gpt-oss-120b, Fireworks AI) outperform GPT-4 on this benchmark:

| Method | Overall |
|---|---|
| GoalSeeking + gpt-oss-120b (ours) | 82.9% |
| IRCoT + RAG (Llama-8B) | 75.0% |
| Multi-Meta-RAG + GPT-4 | 60.6% |
| GPT-4 RAG baseline | 56.0% |

The gain comes from the retrieval strategy, not model scale alone.

Prerequisites

  • Python 3.12 or 3.13
  • uv package manager
  • API key for Fireworks AI (used for both embeddings and LLM inference)

Quick Start

```bash
# 1. Clone and enter the repo
git clone https://github.com/OpenSymbolicAI/benchmark-py-MultiHopRAG.git
cd benchmark-py-MultiHopRAG

# 2. Copy .env.example and fill in your API keys
cp .env.example .env
# Edit .env with your FIREWORKS_API_KEY (required), plus optional GROQ_API_KEY, ANTHROPIC_API_KEY

# 3. Install dependencies
uv sync

# 4. Load the corpus into ChromaDB (quick: 50 articles, ~1 min)
uv run python setup_data.py --quick

# 5. Run a single query
uv run python -m multihop_rag.main --query "Who was found guilty in the crypto trial?"
```

For the full 609-article corpus (better accuracy, takes longer to embed):

```bash
uv run python setup_data.py
```

How It Works

Iterative Retrieval vs. Single-Shot RAG

Traditional RAG pipelines generate the entire retrieval plan in one shot. This is brittle for multi-hop questions because the planner must anticipate all hops before seeing any evidence.

GoalSeeking makes each hop adaptive:

```text
seek("Who is the individual linked to crypto that was found guilty?")

  Iteration 1:  retrieve("crypto individual guilty") -> extract evidence
                -> update_context: found "Sam Bankman-Fried", gap: "verdict details"
                -> evaluate: only 1 source, low confidence -> CONTINUE

  Iteration 2:  retrieve("Bankman-Fried trial verdict") -> extract evidence
                -> update_context: corroborating evidence from 2nd source
                -> evaluate: 2+ sources, sufficient -> ACHIEVED

  -> "Sam Bankman-Fried"
```

The planner sees accumulated knowledge (not raw results) and decides what to search next. The evaluator checks if enough cross-referenced evidence has been gathered.
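The trace above can be reduced to a self-contained toy. The sketch below uses a keyword "retriever" over a three-article corpus, a hand-written evaluator (stop once 2+ sources are in context), and stubbed planner/synthesizer steps where the real agent calls an LLM. It is illustrative only, not the repo's implementation:

```python
# Toy GoalSeeking loop: retrieve -> update_context -> evaluate -> plan next hop.
CORPUS = {
    "article-1": "A crypto executive went on trial for fraud.",
    "article-2": "Sam Bankman-Fried faced fraud charges in court.",
    "article-3": "A jury convicted Sam Bankman-Fried on all counts.",
}

def keywords(s):
    """Crude tokenizer: lowercase words longer than 3 chars."""
    return {w for w in s.lower().replace("?", "").replace(".", "").split() if len(w) > 3}

def retrieve(query):
    """Return (doc_id, text) pairs sharing at least one keyword with the query."""
    q = keywords(query)
    return [(doc, text) for doc, text in CORPUS.items() if q & keywords(text)]

def seek(question, max_iterations=5):
    evidence = {}  # doc_id -> text: the accumulated context
    query = question
    for iteration in range(1, max_iterations + 1):
        for doc, text in retrieve(query):
            evidence[doc] = text            # update_context: accumulate evidence
        if len(evidence) >= 2:              # evaluate: 2+ sources -> ACHIEVED
            # synthesize_answer stubbed: the real agent asks the LLM here
            return "Sam Bankman-Fried", iteration
        # plan_iteration stubbed: the real agent's LLM planner writes the
        # next query from the accumulated evidence
        query = "Bankman-Fried trial verdict"
    return None, max_iterations
```

Hop 1 only surfaces article-1 (the query shares no keyword with the other two), so the planner pivots to the bridge entity and hop 2 pulls in the corroborating sources.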

Architecture

```text
User Query
    |
    v
MultiHopRAGAgent.seek(query)
    |
    +-- create_context() -> MultiHopContext(evidence=[], entities=[], ...)
    |
    +-- LOOP (max 5 iterations):
        |
        +-- 1. plan_iteration()     <- LLM sees accumulated evidence, plans next hop
        +-- 2. execute()            <- runs primitives (retrieve, extract, synthesize)
        +-- 3. update_context()     <- INTROSPECTION BOUNDARY: raw -> structured insights
        +-- 4. evaluate()           <- @evaluator checks: sufficient + answer ready?
        +-- 5. should_continue()    <- stop if achieved or max iterations
            |
            v
        GoalSeekingResult(answer, iterations, status)
```

The Introspection Boundary

update_context() is the key architectural feature. It converts raw ExecutionResult into structured fields on MultiHopContext:

| Primitive Called | Context Updated |
|---|---|
| `retrieve` / `retrieve_by_category` | `queries_tried` -- tracks search angles used |
| `extract_evidence` | `evidence` -- accumulates `EvidencePiece` objects |
| `identify_entities` | `entities_found` -- bridge entities for cross-referencing |
| `assess_sufficiency` | `sufficient` -- flag when evidence is enough |
| `synthesize_answer` | `current_answer` + `answer_confidence` |

The planner and evaluator only see these structured fields -- never the raw execution results.

Primitives

| Primitive | Purpose |
|---|---|
| `retrieve(query, k)` | Semantic search over the news corpus |
| `retrieve_by_category(query, category, k)` | Filtered by news category (tech, sports, etc.) |
| `retrieve_by_source(query, source, k)` | Filtered by news outlet name |
| `retrieve_filtered(query, source, category, date_from, date_to, k)` | Combined metadata filters |
| `extract_evidence(context, question)` | Pull relevant facts from retrieved text |
| `identify_entities(text)` | Find named entities / bridge entities |
| `generate_next_query(question, evidence)` | Plan the next retrieval hop |
| `synthesize_answer(question, evidence)` | Combine multi-source evidence into answer |
| `assess_sufficiency(question, evidence)` | Check if evidence is enough to answer |
| `combine_contexts(documents)` | Merge documents into a context string |

Decomposition Patterns (few-shot examples)

Seven patterns teach the LLM planner how to compose primitives:

  1. Two-hop inference -- retrieve -> extract -> generate_next_query -> retrieve -> synthesize
  2. Source comparison -- retrieve_by_source(A) -> extract -> retrieve_by_source(B) -> extract -> synthesize
  3. Single retrieval with sufficiency check -- retrieve -> extract -> assess -> synthesize
  4. Consistency comparison -- retrieve_by_source(A) -> extract -> retrieve_by_source(B) -> extract -> compare
  5. Cross-source entity resolution -- retrieve_by_source(A) -> entities -> retrieve_by_source(B) -> entities -> synthesize
  6. Temporal source comparison -- retrieve_filtered(date_A) -> extract -> retrieve_filtered(date_B) -> extract -> compare
  7. Yes/No temporal consistency -- retrieve_filtered(period_A) -> extract -> retrieve_filtered(period_B) -> extract -> Yes/No
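Each pattern is just an ordered composition of the primitives listed earlier. As a toy illustration of pattern 1 (stub primitives with hypothetical return values; in the benchmark the LLM planner emits this composition itself):

```python
# Stubbed primitives: each returns a placeholder string instead of calling
# the vector store or an LLM, so only the composition is being demonstrated.
def retrieve(query):
    return f"docs({query})"

def extract_evidence(docs, question):
    # question is unused in this stub; the real primitive conditions on it
    return f"facts from {docs}"

def generate_next_query(question, evidence):
    return f"follow-up({question})"

def synthesize_answer(question, evidence):
    return f"answer({question}) given {len(evidence)} evidence sets"

def two_hop_inference(question):
    """Pattern 1: retrieve -> extract -> generate_next_query -> retrieve -> synthesize."""
    evidence = []
    evidence.append(extract_evidence(retrieve(question), question))   # hop 1
    next_q = generate_next_query(question, evidence)                  # plan hop 2
    evidence.append(extract_evidence(retrieve(next_q), question))     # hop 2
    return synthesize_answer(question, evidence)
```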

Project Structure

```text
benchmark-py-MultiHopRAG/
|-- multihop_rag/
|   |-- __init__.py          # Package exports
|   |-- agent.py             # MultiHopRAGAgent (GoalSeeking subclass)
|   |-- main.py              # CLI entry point, demo/benchmark runner
|   |-- models.py            # Data models (Document, EvidencePiece, MultiHopContext, QueryItem)
|   +-- retriever.py         # ChromaRetriever (ChromaDB + Fireworks embeddings)
|-- tests/
|   |-- test_agent.py        # Agent code sanitization, string repair, context tests
|   |-- test_models.py       # Data model unit tests
|   |-- test_retriever.py    # Retriever unit tests (mocked ChromaDB/OpenAI)
|   +-- test_setup_data.py   # Text chunking utility tests
|-- setup_data.py            # Corpus downloader and ChromaDB loader
|-- pyproject.toml           # Dependencies and build config
|-- .env.example             # Template for API keys
+-- README.md
```

Dataset: MultiHop-RAG

| Property | Value |
|---|---|
| Articles | 609 news articles (tech, sports, entertainment, business, science, health) |
| Queries | 2,556 with ground-truth answers |
| Query types | inference (32%), comparison (33%), temporal (23%), null (12%) |
| Evidence per query | 2-4 documents |
| Source | HuggingFace: yixuantt/MultiHopRAG |

Query Types

  • Inference queries -- require connecting facts across multiple articles to identify a person, event, or outcome
  • Comparison queries -- compare claims or reporting between two named news sources
  • Temporal queries -- assess consistency or change in reporting across different time periods
  • Null queries -- questions where the corpus does not contain sufficient information (expected answer: "Insufficient information")

Benchmark Results

Run on the full MultiHop-RAG dataset (2,556 queries across all four types) using GoalSeeking with iterative multi-hop retrieval. The backbone LLM is gpt-oss-120b (120B parameters, served by Fireworks AI).

| Metric | Value |
|---|---|
| Overall accuracy | 82.9% (2,118 / 2,556) |
| Goals achieved | 99.6% (2,545 / 2,556) |
| Avg iterations per query | 1.9 |
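The headline percentages follow directly from the raw counts:

```python
# Sanity check: reproduce the reported rates from the raw counts above.
correct, achieved, total = 2118, 2545, 2556

overall = round(100 * correct / total, 1)   # overall accuracy
goals = round(100 * achieved / total, 1)    # goal-achievement rate
```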

Comparison with Published Results

Caveat on cross-method comparisons: The results below come from different studies using different LLM backbones (GPT-4, Llama 3.1 8B/70B, gpt-oss-120b), different embedding models, different retrieval corpora or index configurations, and different evaluation splits. The original MultiHop-RAG paper excluded null queries from its accuracy calculation. No official leaderboard exists. These numbers provide directional context, not a controlled ablation.

| Method | Backbone | Inference | Comparison | Temporal | Null | Overall |
|---|---|---|---|---|---|---|
| GoalSeeking (ours) | gpt-oss-120b (120B) | 88.0% | 78.2% | 76.5% | 94.7% | 82.9% |
| IRCoT + RAG [1] | Llama 3.1 8B | 96.2% | 65.0% | 57.6% | 80.1% | 75.0% |
| IRCoT + GraphRAG [1] | Llama 3.1 8B | 95.0% | 65.9% | 60.4% | 69.4% | 74.3% |
| Community-GraphRAG Local [1] | Llama 3.1 70B | 92.0% | 60.2% | 49.1% | 88.7% | 71.2% |
| HippoRAG2 [1] | Llama 3.1 8B | 91.5% | 58.4% | 49.9% | 85.7% | 70.3% |
| SCMRAG (AAMAS 2025) [2] | | ~64% | ~58% | | | ~67.6% |
| Multi-Meta-RAG [3] | GPT-4 | 95.1% | 38.2% | 25.6% | 98.7% | 60.6% |
| GPT-4 RAG baseline [4] | GPT-4 | | | | excl. | 56.0% |
| GPT-4 + ground-truth chunks [4] | GPT-4 | | | | excl. | 89.0% |

Key takeaways:

  • +7.9pp overall vs. the previous best (IRCoT + RAG at 75.0%)
  • +12.2pp on comparison vs. the previous best (Community-GraphRAG Global at 66.0%)
  • +16.1pp on temporal vs. the previous best (IRCoT + GraphRAG at 60.4%)
  • The only system with balanced performance across all four query types (>76% each)

References:

  1. RAG vs. GraphRAG: A Systematic Evaluation (arXiv:2502.11371)
  2. SCMRAG — Self-Corrective Multihop RAG (AAMAS 2025)
  3. Multi-Meta-RAG (arXiv:2406.13213)
  4. MultiHop-RAG benchmark paper (arXiv:2401.15391)

CLI Reference

setup_data.py -- Corpus Loading

Downloads the MultiHop-RAG corpus from HuggingFace and loads it into a local ChromaDB vector store.

```bash
uv run python setup_data.py              # Load all 609 articles
uv run python setup_data.py --quick      # Load first 50 articles
uv run python setup_data.py --max-articles 100  # Load first 100 articles
uv run python setup_data.py --clear      # Clear existing data and reload
```

| Flag | Description | Default |
|---|---|---|
| `--quick` | Load first 50 articles only | off |
| `--max-articles N` | Limit to N articles | all |
| `--clear` | Clear existing data before loading | off |
| `--chunk-size N` | Target words per chunk | 300 |
| `--max-chunks N` | Max chunks per article | 20 |
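A minimal sketch of the word-based chunking these flags control. The actual `setup_data.py` also handles paragraph splitting and overlap (per the test suite); this simplified version ignores both:

```python
# Illustrative chunker: ~chunk_size words per chunk, at most max_chunks
# chunks per article. Defaults mirror the flags above.
def chunk_article(text, chunk_size=300, max_chunks=20):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    return chunks[:max_chunks]
```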

python -m multihop_rag.main -- Agent

```bash
# Interactive mode (default)
uv run python -m multihop_rag.main

# Single query
uv run python -m multihop_rag.main --query "Who was found guilty in the crypto trial?"

# Benchmark demo
uv run python -m multihop_rag.main --demo --type inference --num 3

# Run 10 queries across all types, 3 in parallel
uv run python -m multihop_rag.main --demo --num 10 --parallel 3

# Use a different provider/model
uv run python -m multihop_rag.main --provider ollama --model llama3.2
```
| Flag | Description | Default |
|---|---|---|
| `--model MODEL` | LLM model name | `accounts/fireworks/models/gpt-oss-120b` |
| `--provider` | `fireworks`, `ollama`, `openai`, `anthropic`, `groq` | `fireworks` |
| `--query` / `-q` | Single query mode (non-interactive) | |
| `--demo` | Run benchmark queries from the dataset | off |
| `--type` / `-t` | Filter demo queries: `inference`, `comparison`, `temporal`, `null`, `all` | `all` |
| `--num` / `-n` | Number of queries per type in demo mode | 2 |
| `--parallel` / `-p` | Concurrent queries in demo mode | 5 |
| `--max-iterations` | GoalSeeking max iterations per query | 5 |
| `--quick` | Quick corpus setup (50 articles) if DB is empty | off |
| `--reinit` | Clear and reload the knowledge base | off |

Logging

Every run creates a timestamped directory under logs/ with:

  • query_N.md -- Full trace for each query: plan code, execution steps, arguments, results, timing, and evaluation
  • summary.md -- Aggregate statistics (accuracy, iterations, per-type breakdown)
  • results.json -- Machine-readable results for programmatic analysis

Environment Variables

| Variable | Required | Description |
|---|---|---|
| `FIREWORKS_API_KEY` | Yes | Fireworks AI API key (embeddings + default LLM) |
| `GROQ_API_KEY` | No | Groq API key (if using `--provider groq`) |
| `ANTHROPIC_API_KEY` | No | Anthropic API key (if using `--provider anthropic`) |

Running Tests

```bash
uv sync --extra dev
uv run pytest tests/ -v
```

The test suite covers:

  • Models -- Document, EvidencePiece, MultiHopContext, QueryItem creation and defaults
  • Agent -- Code sanitization (markdown fences, control flow flattening, bare expression conversion, unterminated string repair)
  • Retriever -- Query, add, count, filter, lifecycle methods (all with mocked ChromaDB/OpenAI)
  • Setup data -- Text chunking, paragraph splitting, overlap, and filtering
