Add graph-aware ANN hybrid entity resolution #1

Draft
JonasReuter wants to merge 7 commits into main from feature/ann-hybrid-entity-resolution

@JonasReuter
Owner

Summary

Adds a new ann_hybrid entity resolution strategy designed to scale without curated alias/word arrays.

The strategy:

  • builds graph-aware entity profile texts from title, type, description, neighbouring entities, and predicates
  • embeds those profiles and uses the existing vector store for top-k ANN-style candidate blocking
  • scores candidate pairs using embedding similarity, title similarity, graph-context overlap, and type compatibility
  • auto-merges only high-confidence groups
  • sends only borderline groups to the existing LLM merge extractor
  • keeps the existing llm_context_window and embedding_search strategies intact
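
For reviewers, the flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Entity` record, `build_profile`, `jaccard`, `combined_score`, `route`, and the weights are all hypothetical names and values chosen for the example. Embedding similarity and title similarity are assumed to be computed elsewhere and passed in as floats.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Hypothetical minimal record; the real entity model may differ.
    title: str
    type: str
    description: str = ""
    neighbors: set[str] = field(default_factory=set)   # titles of neighbouring entities
    predicates: set[str] = field(default_factory=set)  # relationship labels


def build_profile(entity: Entity, neighbor_limit: int = 30) -> str:
    """Join title, type, description, and graph context into one text to embed."""
    neighbors = sorted(entity.neighbors)[:neighbor_limit]
    return "\n".join(
        [
            f"title: {entity.title}",
            f"type: {entity.type}",
            f"description: {entity.description}",
            f"neighbors: {', '.join(neighbors)}",
            f"predicates: {', '.join(sorted(entity.predicates))}",
        ]
    )


def jaccard(a: set[str], b: set[str]) -> float:
    """Graph-context overlap as |a ∩ b| / |a ∪ b|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_score(
    embedding_sim: float,
    title_sim: float,
    context_overlap: float,
    type_compatible: bool,
) -> float:
    """Weighted sum of the four signals. The weights are illustrative assumptions."""
    return (
        0.5 * embedding_sim
        + 0.2 * title_sim
        + 0.2 * context_overlap
        + 0.1 * (1.0 if type_compatible else 0.0)
    )


def route(
    score: float,
    auto_merge_threshold: float = 0.95,
    llm_review_threshold: float = 0.82,
) -> str:
    """Decide what happens to a candidate group based on its score."""
    if score >= auto_merge_threshold:
        return "auto_merge"
    if score >= llm_review_threshold:
        return "llm_review"
    return "keep_separate"
```

Under these assumptions, a group scoring 0.96 would be merged automatically, one scoring 0.90 would go to the LLM merge extractor, and anything below 0.82 would be left alone.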

Config additions

entity_resolution:
  enabled: true
  strategy: ann_hybrid
  similarity_threshold: 0.72
  top_k: 10
  auto_merge_threshold: 0.95
  llm_review_threshold: 0.82
  max_llm_groups: 5000
  profile_neighbor_limit: 30
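
A sketch of how these settings might be validated on load. The class name `AnnHybridConfig` is hypothetical and the checks are assumptions rather than behaviour confirmed by this PR. The ordering check reflects the idea that "borderline" groups sit below the auto-merge cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnHybridConfig:
    # Field names and defaults mirror the YAML keys above; the class itself is hypothetical.
    similarity_threshold: float = 0.72
    top_k: int = 10
    auto_merge_threshold: float = 0.95
    llm_review_threshold: float = 0.82
    max_llm_groups: int = 5000
    profile_neighbor_limit: int = 30

    def __post_init__(self) -> None:
        # Assumed check: every threshold is a score in [0, 1].
        for name in ("similarity_threshold", "auto_merge_threshold", "llm_review_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        # Assumed check: the LLM review band must sit below the auto-merge cutoff.
        if self.llm_review_threshold > self.auto_merge_threshold:
            raise ValueError("llm_review_threshold must not exceed auto_merge_threshold")
        # Assumed check: counts and limits are positive integers.
        for name in ("top_k", "max_llm_groups", "profile_neighbor_limit"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
```

With the values shown above, `AnnHybridConfig()` constructs cleanly, while swapping the two merge thresholds would raise a `ValueError`.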

Notes

This first version avoids adding a new dependency such as rapidfuzz; it uses stdlib title similarity plus graph context and existing embeddings/vector store infrastructure.
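
To illustrate what stdlib-only title similarity can look like, here is a small sketch built on `difflib.SequenceMatcher`. The function names and the normalization steps (case folding, punctuation stripping, token sorting) are assumptions made for this example and may not match what the PR actually implements.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def title_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] using only the standard library."""
    left, right = normalize_title(a), normalize_title(b)
    if not left or not right:
        return 0.0
    # Character-level ratio on the normalized strings.
    char_ratio = SequenceMatcher(None, left, right).ratio()
    # Ratio on alphabetically sorted tokens, so word order does not matter.
    left_sorted = " ".join(sorted(left.split()))
    right_sorted = " ".join(sorted(right.split()))
    token_ratio = SequenceMatcher(None, left_sorted, right_sorted).ratio()
    return max(char_ratio, token_ratio)
```

Taking the maximum of the two ratios means reordered titles such as "Corp Acme" and "Acme Corp" still compare as identical, at the cost of occasionally over-scoring titles that share the same words in a different sense.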

Testing

Not run locally from this environment. Please run the repository's normal checks, especially:

uv run poe check
uv run poe test_unit
