Add guided vector-first Arango graph retrieval #2

Draft
JonasReuter wants to merge 6 commits into main from feature/budgeted-arango-graph-retrieval

Summary

Adds a new guided, vector-first ArangoDB graph retrieval path that avoids broad k-hop expansion.

The approach follows the concept we discussed:

  • use Arango vector search to find semantic seed entities first
  • run a best-first frontier traversal instead of 1..depth broad expansion
  • expand only top-M scored edges per frontier node
  • apply hard runtime budgets (max_expansions, max_frontier_size, max_edges_per_node, max_results)
  • score paths with vector relevance, relation/query hints, edge weight, target type, hub/degree penalty, depth decay, and a lightweight PPR-like mass boost
  • optionally retrieve community reports for the best graph region
  • keep legacy traverse_neighbors() and hybrid_search() intact for compatibility
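The steps above can be sketched as a best-first traversal over an in-memory graph. This is a minimal illustration under stated assumptions, not the code in this PR: `guided_traverse`, the `top_edges` callback, and the multiplicative path score are hypothetical, and only the budget names and default values are taken from the description. The callback stands in for one bounded 1-hop top-M query.

```python
import heapq
from typing import Callable

Edge = tuple[str, float]  # (target node, edge score in [0, 1])


def guided_traverse(
    seeds: dict[str, float],
    top_edges: Callable[[str, int], list[Edge]],
    *,
    max_depth: int = 3,
    max_edges_per_node: int = 8,
    max_expansions: int = 128,
    max_frontier_size: int = 256,
    max_results: int = 80,
    min_path_score: float = 0.05,
    depth_decay: float = 0.72,
) -> tuple[list[tuple[float, list[str]]], dict[str, int]]:
    """Hypothetical best-first traversal with hard budgets."""
    # heapq is a min-heap, so scores are stored negated: (-score, path).
    frontier = [(-score, [node]) for node, score in seeds.items()]
    heapq.heapify(frontier)
    results: list[tuple[float, list[str]]] = []
    best_seen = dict(seeds)  # best score reached so far per node
    expansions = 0
    edge_reads = 0

    while frontier and expansions < max_expansions and len(results) < max_results:
        neg_score, path = heapq.heappop(frontier)
        score = -neg_score
        results.append((score, path))
        if len(path) - 1 >= max_depth:
            continue  # depth budget reached; keep the path, do not expand it

        expansions += 1
        # Only the top-M edges of this node are ever read.
        edges = top_edges(path[-1], max_edges_per_node)[:max_edges_per_node]
        edge_reads += len(edges)
        for target, edge_score in edges:
            new_score = score * edge_score * depth_decay
            if target in path or new_score < min_path_score:
                continue  # skip cycles and weak paths
            if new_score <= best_seen.get(target, 0.0):
                continue  # a better path to this node already exists
            best_seen[target] = new_score
            heapq.heappush(frontier, (-new_score, path + [target]))

        if len(frontier) > max_frontier_size:
            # Keep only the best-scoring entries (smallest negated scores).
            frontier = heapq.nsmallest(max_frontier_size, frontier)
            heapq.heapify(frontier)

    stats = {"expansions": expansions, "edge_reads": edge_reads, "paths": len(results)}
    return results, stats


# Toy graph with one high-degree hub (illustrative data only).
graph = {
    "hub": [(f"n{i}", 0.5) for i in range(1000)],
    "a": [("hub", 0.9), ("b", 0.8)],
    "b": [("c", 0.9)],
}


def top_edges(node: str, limit: int) -> list[Edge]:
    return sorted(graph.get(node, []), key=lambda edge: -edge[1])[:limit]


results, stats = guided_traverse(
    {"a": 1.0}, top_edges, max_edges_per_node=4, max_expansions=10
)
```

Even though `hub` has 1,000 outgoing edges in the toy graph, the traversal reads at most `max_expansions * max_edges_per_node` edges in total.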

Why

A fixed k-hop traversal can explode on high-degree entities. If a node has 1,000 edges and its neighbors are similarly connected, a two-hop expansion can produce ~1,000,000 candidates. This PR changes the retrieval primitive from "walk all hops" to "walk only promising paths within a strict budget".

The runtime is bounded primarily by:

seed_k + max_expansions * max_edges_per_node

instead of by:

degree ^ depth
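As a worked example, plugging the default config values from this PR into the first expression and the 1,000-edge case into the second gives the numbers below. The helper names `max_edge_reads` and `khop_candidates` are illustrative only and are not part of the PR.

```python
def max_edge_reads(seed_k: int, max_expansions: int, max_edges_per_node: int) -> int:
    """Upper bound from the PR description: seeds plus budgeted edge reads."""
    return seed_k + max_expansions * max_edges_per_node


def khop_candidates(degree: int, depth: int) -> int:
    """Worst-case candidate count for a broad k-hop expansion."""
    return degree ** depth


# Defaults: guided_seed_k=12, guided_max_expansions=128, guided_max_edges_per_node=8
budget = max_edge_reads(12, 128, 8)   # 12 + 128 * 8 = 1036
broad = khop_candidates(1000, 2)      # 1000 ** 2 = 1000000
```

The budgeted bound does not depend on node degree, which is the point of the change.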

New API

from graphrag_vectors import (
    GuidedArangoGraphRetriever,
    GuidedGraphRetrievalConfig,
    QueryGraphPlan,
)

retriever = GuidedArangoGraphRetriever(graph_store._db, graph_store.graph_name)
result = retriever.retrieve(
    query_vector=query_vector,
    query=query,
    plan=QueryGraphPlan.from_query_text(query),
    config=GuidedGraphRetrievalConfig(),
)

Also adds a query-layer helper:

from graphrag.query.input.retrieval.guided_graph import retrieve_guided_graph_context

Config additions

graph_store:
  guided_retrieval_enabled: true
  guided_seed_k: 12
  guided_max_depth: 3
  guided_max_edges_per_node: 8
  guided_max_expansions: 128
  guided_max_frontier_size: 256
  guided_max_results: 80
  guided_min_path_score: 0.05
  guided_depth_decay: 0.72
  guided_community_report_limit: 8
  guided_allow_vector_scan_fallback: false
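For readers who want to sanity-check these values before running a query, the keys can be mirrored in a small validated dataclass. This is a sketch under assumptions: `GuidedSettings` and `from_graph_store_section` are hypothetical names, the validation rules are a guess at reasonable ranges, and the PR's own `GuidedGraphRetrievalConfig` may be structured differently.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GuidedSettings:
    """Hypothetical mirror of the guided_* keys with the defaults listed above."""

    retrieval_enabled: bool = True
    seed_k: int = 12
    max_depth: int = 3
    max_edges_per_node: int = 8
    max_expansions: int = 128
    max_frontier_size: int = 256
    max_results: int = 80
    min_path_score: float = 0.05
    depth_decay: float = 0.72
    community_report_limit: int = 8
    allow_vector_scan_fallback: bool = False

    def __post_init__(self) -> None:
        # Assumed rule: every budget must be a positive integer.
        for name in ("seed_k", "max_depth", "max_edges_per_node",
                     "max_expansions", "max_frontier_size", "max_results"):
            if getattr(self, name) < 1:
                raise ValueError(f"guided_{name} must be >= 1")
        # Assumed rule: scores and decay are fractions.
        if not 0.0 <= self.min_path_score <= 1.0:
            raise ValueError("guided_min_path_score must be in [0, 1]")
        if not 0.0 < self.depth_decay <= 1.0:
            raise ValueError("guided_depth_decay must be in (0, 1]")


def from_graph_store_section(section: dict) -> GuidedSettings:
    """Build settings from an already-parsed graph_store mapping."""
    known = {field.name for field in fields(GuidedSettings)}
    values = {}
    for key, value in section.items():
        if not key.startswith("guided_"):
            continue  # ignore unrelated graph_store keys
        name = key.removeprefix("guided_")
        if name not in known:
            raise KeyError(f"unknown guided retrieval option: {key}")
        values[name] = value
    return GuidedSettings(**values)


settings = from_graph_store_section({"guided_seed_k": 4, "guided_max_depth": 2})
```

Rejecting unknown `guided_*` keys catches typos in the config file early instead of silently falling back to defaults.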

Performance notes

  • Full vector scan fallback is disabled by default to protect query latency.
  • Edge expansion is performed as repeated bounded 1-hop top-M AQL calls, not variable-length traversal.
  • Hub penalty uses entity degree/rank signals to avoid generic high-degree nodes dominating retrieval.
  • The implementation returns stats including duration, seed count, expansions, path count, and maximum possible edge reads.
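To illustrate how a hub penalty and depth decay can interact, here is one possible scoring shape. The logarithmic penalty, the `strength` parameter, and the multiplicative combination are assumptions made for this sketch; the PR's actual scoring also includes relation/query hints, target type, and a PPR-like mass boost, none of which are modeled here.

```python
import math


def hub_penalty(degree: int, strength: float = 0.15) -> float:
    """Hypothetical penalty in (0, 1]: 1.0 for degree 0, shrinking slowly with degree."""
    return 1.0 / (1.0 + strength * math.log1p(max(degree, 0)))


def edge_score(
    vector_similarity: float,
    edge_weight: float,
    target_degree: int,
    depth: int,
    depth_decay: float = 0.72,
) -> float:
    """Hypothetical combined score: relevance * weight * hub penalty * decay^depth."""
    return (
        vector_similarity
        * edge_weight
        * hub_penalty(target_degree)
        * depth_decay ** depth
    )


# Same relevance and weight, but the high-degree target scores lower.
specific = edge_score(0.9, 1.0, target_degree=10, depth=1)
generic = edge_score(0.9, 1.0, target_degree=1000, depth=1)
```

A logarithmic penalty is used here so that very large hubs are demoted without being excluded outright; other shapes would work as well.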

Testing

Not run locally from this environment. Please run:

uv run poe check
uv run poe test_unit

Recommended manual test: compare legacy hybrid_search() against guided retrieval on a high-degree seed and inspect result["stats"] for bounded expansion behavior.
