feat: improve semantic search with query expansion and keyword boosting #20

Merged: DevanshuNEU merged 1 commit into OpenCodeIntel:main from DevanshuNEU:feature/improved-semantic-search on Dec 3, 2025

@DevanshuNEU (Collaborator)

Summary

Implements Level 1 semantic search improvements. On authentication queries against the Flask repository, match accuracy rose from 39% to 64%.

Changes

  • New SearchEnhancer service - Handles query expansion, keyword scoring, and reranking
  • Rich embedding text - Extracts docstrings, parameters, return types for better embeddings
  • LLM-powered query expansion - Expands 'authentication' → 'auth login verify token jwt session...'
  • Keyword boosting - Functions whose names match the query get a score boost
  • Reranking - Combines semantic (80%) + keyword (20%) scores
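
To illustrate the "rich embedding text" bullet above, here is a minimal sketch of how docstrings, parameters, and return types could be folded into the text that gets embedded. It is not the code from this PR: `build_embedding_text` is a hypothetical helper, it relies on Python's standard `ast` module (so it only handles Python sources), and the output layout is an assumption.

```python
import ast


def build_embedding_text(source: str) -> list[str]:
    """Return one descriptive text per function definition found in `source`.

    Hypothetical helper for illustration; the field layout is an assumption.
    """
    texts = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Positional parameters only; keyword-only and *args are skipped here.
        params = [arg.arg for arg in node.args.args]
        # ast.unparse (Python 3.9+) turns the annotation node back into source.
        returns = ast.unparse(node.returns) if node.returns else "unknown"
        docstring = ast.get_docstring(node) or ""
        texts.append(
            f"function {node.name}\n"
            f"parameters: {', '.join(params) or 'none'}\n"
            f"returns: {returns}\n"
            f"docstring: {docstring}"
        )
    return texts
```

Embedding this text rather than the raw code alone gives the model natural-language signals (names, intent from the docstring) to match against a plain-English query.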

Results

| Metric | Before | After |
|---|---|---|
| Match accuracy | 39% | 64% |
| Top result | Generic function | load_logged_in_user from auth.py |
| Result relevance | Poor | Highly relevant |

Technical Details

  • Query expansion uses gpt-4o-mini for cost efficiency
  • Keyword scoring: name match = 0.5 weight, code match = 0.1 weight
  • Retrieves 3x the requested number of candidates for reranking, then returns the top N
  • Backward compatible - existing indexes work without re-indexing
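
The details above can be sketched end to end as follows. The weights (0.5, 0.1, 80/20) and the 3x multiplier come from this description; everything else is an assumption made for illustration: the `Candidate` type, the function names, summing per-term matches and capping the keyword score at 1.0, and the `complete` and `vector_search` callables, which stand in for the gpt-4o-mini call and the vector index lookup respectively. The actual `SearchEnhancer` may differ.

```python
from dataclasses import dataclass
from typing import Callable

# Weights quoted in the PR description.
NAME_WEIGHT = 0.5
CODE_WEIGHT = 0.1
SEMANTIC_WEIGHT = 0.8
KEYWORD_WEIGHT = 0.2
CANDIDATE_MULTIPLIER = 3


@dataclass
class Candidate:
    name: str
    code: str
    semantic_score: float  # similarity from the vector index, assumed in [0, 1]


def expand_query(query: str, complete: Callable[[str], str]) -> list[str]:
    """Expand a query into related terms via a prompt -> text callable (hypothetical)."""
    prompt = f"List search keywords related to '{query}', separated by spaces."
    terms = query.split() + complete(prompt).split()
    # Lowercase and de-duplicate while preserving order.
    return list(dict.fromkeys(term.lower() for term in terms))


def keyword_score(terms: list[str], cand: Candidate) -> float:
    """Sum per-term weights (assumption), capped at 1.0 to stay comparable."""
    name, code = cand.name.lower(), cand.code.lower()
    score = 0.0
    for term in terms:
        if term in name:
            score += NAME_WEIGHT
        elif term in code:
            score += CODE_WEIGHT
    return min(score, 1.0)


def rerank(terms: list[str], candidates: list[Candidate], top_n: int) -> list[Candidate]:
    """Order candidates by 80% semantic + 20% keyword score and keep the top N."""
    def combined(cand: Candidate) -> float:
        return (SEMANTIC_WEIGHT * cand.semantic_score
                + KEYWORD_WEIGHT * keyword_score(terms, cand))
    return sorted(candidates, key=combined, reverse=True)[:top_n]


def search(query: str,
           complete: Callable[[str], str],
           vector_search: Callable[[str, int], list[Candidate]],
           top_n: int) -> list[Candidate]:
    """Expand the query, over-fetch 3x candidates, then rerank down to top_n."""
    terms = expand_query(query, complete)
    candidates = vector_search(" ".join(terms), top_n * CANDIDATE_MULTIPLIER)
    return rerank(terms, candidates, top_n)
```

Passing `complete` and `vector_search` in as callables keeps the sketch runnable without network access; in practice `complete` would wrap a chat completion request to gpt-4o-mini. Over-fetching matters because a function with a slightly lower semantic score but a strong name match can only be promoted if it is in the candidate pool to begin with.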

Future Improvements (Level 2-3)

  • BM25 hybrid search
  • Cross-encoder reranking
  • Code-specific embedding models (Voyage Code-3)

Testing

  • Tested on Flask repository
  • All existing tests pass (31/31)

Level 1 search improvements:
- Add SearchEnhancer service with LLM-powered query expansion
- Extract rich metadata (docstrings, params, return types) for embeddings
- Implement keyword boosting for function name matching
- Add reranking to combine semantic + keyword scores

Results: 39% → 64% match accuracy on authentication queries

Technical changes:
- New: backend/services/search_enhancer.py
- Modified: indexer_optimized.py (rich embedding text, query expansion, reranking)
- Added SUPABASE_SERVICE_ROLE_KEY to docker-compose.yml
- Added EMBEDDING_MODEL config to .env.example

vercel Bot commented Dec 3, 2025

@DevanshuNEU is attempting to deploy a commit to the Dev's projects Team on Vercel.

A member of the Team first needs to authorize it.

vercel Bot commented Dec 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| opencodeintel | Ready | Preview | Comment | Dec 3, 2025 2:26am |

DevanshuNEU merged commit 55cde6c into OpenCodeIntel:main on Dec 3, 2025
6 checks passed