-
Notifications
You must be signed in to change notification settings - Fork 5
feat(search): Search V3 'Project Brain' with Cohere Reranking #224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
DevanshuNEU
merged 16 commits into
OpenCodeIntel:main
from
DevanshuNEU:feature/search-v3-project-brain
Jan 26, 2026
Merged
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
f7ea9cd
WIP: Search V2 hybrid fixes - camelCase tokenization, backward compatβ¦
DevanshuNEU c19fab8
feat(search): Search V3 'Project Brain' - Full Overhaul
DevanshuNEU b4d3d97
feat(search): integrate Cohere reranking with YAML formatting
DevanshuNEU 331a580
feat(search): gate Cohere reranking behind pro_user flag
DevanshuNEU da674c0
fix(query): preserve CamelCase in code term extraction
DevanshuNEU 6646c18
fix(search): use get_running_loop instead of deprecated get_event_loop
DevanshuNEU 21d7a30
fix(metrics): add gauge method and use it for rerank avg_score
DevanshuNEU d1476e2
fix(scripts): catch Exception instead of bare except in extended_querβ¦
DevanshuNEU 2655a90
fix(scripts): add load_dotenv + catch Exception in final_v3_test
DevanshuNEU 6989b31
fix(scripts): add load_dotenv to human_query_test
DevanshuNEU aa86547
feat(search): pass pro_user flag through indexer search_v3 method
DevanshuNEU e8325e2
fix(tests): mock search_v3 instead of semantic_search in playground test
DevanshuNEU 661706f
fix(scripts): stricter test file detection in has_test_file
DevanshuNEU 757bd07
fix(search): filter tests in V2 fallback when include_tests=False
DevanshuNEU 72605e6
refactor: consolidate test file detection into shared utility
DevanshuNEU ba3370f
fix(utils): use anchored regex patterns for test file detection
DevanshuNEU File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,247 @@ | ||
| #!/usr/bin/env python3 | ||
| """ | ||
| Search V3 vs V2 Benchmark | ||
| Run with: python3 scripts/benchmark_search_v3.py | ||
|
|
||
| Compares: | ||
| - V2 (OpenAI embeddings + Cohere reranking) | ||
| - V3 (Voyage AI embeddings + Query Understanding + Code Graph + Cohere reranking) | ||
| """ | ||
| import asyncio | ||
| import os | ||
| import sys | ||
| import time | ||
| from typing import List, Dict, Tuple | ||
|
|
||
| # add parent to path | ||
| sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) | ||
|
|
||
| from dotenv import load_dotenv | ||
| load_dotenv() | ||
|
|
||
| from services.indexer_optimized import OptimizedCodeIndexer | ||
|
|
||
| # Test queries representing real developer scenarios | ||
| TEST_QUERIES = [ | ||
| { | ||
| "query": "how to add authentication", | ||
| "expected_keywords": ["auth", "middleware", "authenticate", "credential"], | ||
| "description": "Developer wants to add auth to their app" | ||
| }, | ||
| { | ||
| "query": "handle websocket messages", | ||
| "expected_keywords": ["websocket", "message", "send", "receive", "on_"], | ||
| "description": "Developer working with WebSockets" | ||
| }, | ||
| { | ||
| "query": "return json from endpoint", | ||
| "expected_keywords": ["json", "response", "jsonresponse", "return"], | ||
| "description": "Developer wants to return JSON data" | ||
| }, | ||
| { | ||
| "query": "validate request data", | ||
| "expected_keywords": ["valid", "request", "data", "schema"], | ||
| "description": "Developer needs input validation" | ||
| }, | ||
| { | ||
| "query": "middleware that runs before request", | ||
| "expected_keywords": ["middleware", "before", "dispatch", "call_next"], | ||
| "description": "Developer needs pre-request processing" | ||
| }, | ||
| { | ||
| "query": "error handling", | ||
| "expected_keywords": ["error", "exception", "handler", "catch"], | ||
| "description": "Looking for error handling patterns" | ||
| }, | ||
| { | ||
| "query": "route decorator", | ||
| "expected_keywords": ["route", "decorator", "path", "endpoint"], | ||
| "description": "Developer needs routing functionality" | ||
| }, | ||
| { | ||
| "query": "database session", | ||
| "expected_keywords": ["database", "session", "db", "connection"], | ||
| "description": "Working with database sessions" | ||
| }, | ||
| ] | ||
|
|
||
|
|
||
| def score_results(results: List[Dict], expected_keywords: List[str]) -> Tuple[float, int, bool]: | ||
| """ | ||
| Score search results based on expected keywords | ||
| Returns: (score 0-10, matches count, is_test_in_top_3) | ||
| """ | ||
| if not results: | ||
| return 0.0, 0, False | ||
|
|
||
| # combine text from top 3 results | ||
| top_3_text = "" | ||
| has_test_in_top_3 = False | ||
|
|
||
| for r in results[:3]: | ||
| name = r.get("name", "").lower() | ||
| qualified = r.get("qualified_name", "").lower() | ||
| summary = (r.get("summary") or "").lower() | ||
| file_path = r.get("file_path", "").lower() | ||
|
|
||
| top_3_text += f" {name} {qualified} {summary} " | ||
|
|
||
| # check for test files | ||
| if "test" in file_path or "test" in name: | ||
| has_test_in_top_3 = True | ||
|
|
||
| # count keyword matches | ||
| matches = sum(1 for kw in expected_keywords if kw.lower() in top_3_text) | ||
| score = min(10.0, (matches / len(expected_keywords)) * 10) | ||
|
|
||
| return score, matches, has_test_in_top_3 | ||
|
|
||
|
|
||
| async def run_benchmark(repo_id: str): | ||
| """Run benchmark comparing V2 vs V3""" | ||
| print("=" * 80) | ||
| print("π§ͺ SEARCH V3 vs V2 BENCHMARK") | ||
| print("=" * 80) | ||
| print() | ||
|
|
||
| indexer = OptimizedCodeIndexer() | ||
|
|
||
| v2_scores = [] | ||
| v3_scores = [] | ||
| v2_times = [] | ||
| v3_times = [] | ||
| v2_test_count = 0 | ||
| v3_test_count = 0 | ||
|
|
||
| for tc in TEST_QUERIES: | ||
| query = tc["query"] | ||
| expected = tc["expected_keywords"] | ||
| desc = tc["description"] | ||
|
|
||
| print(f"π Query: \"{query}\"") | ||
| print(f" Scenario: {desc}") | ||
| print() | ||
|
|
||
| # V2 Search | ||
| start = time.time() | ||
| try: | ||
| v2_results = await indexer.search_v2( | ||
| query=query, | ||
| repo_id=repo_id, | ||
| top_k=5, | ||
| use_reranking=True | ||
| ) | ||
| v2_time = (time.time() - start) * 1000 | ||
| except Exception as e: | ||
| print(f" β V2 Error: {e}") | ||
| v2_results = [] | ||
| v2_time = 0 | ||
|
|
||
| v2_score, v2_matches, v2_has_test = score_results(v2_results, expected) | ||
| v2_scores.append(v2_score) | ||
| v2_times.append(v2_time) | ||
| if v2_has_test: | ||
| v2_test_count += 1 | ||
|
|
||
| # V3 Search | ||
| start = time.time() | ||
| try: | ||
| v3_results = await indexer.search_v3( | ||
| query=query, | ||
| repo_id=repo_id, | ||
| top_k=5, | ||
| include_tests=False, | ||
| use_reranking=True | ||
| ) | ||
| v3_time = (time.time() - start) * 1000 | ||
| except Exception as e: | ||
| print(f" β V3 Error: {e}") | ||
| v3_results = [] | ||
| v3_time = 0 | ||
|
|
||
| v3_score, v3_matches, v3_has_test = score_results(v3_results, expected) | ||
| v3_scores.append(v3_score) | ||
| v3_times.append(v3_time) | ||
| if v3_has_test: | ||
| v3_test_count += 1 | ||
|
|
||
| # Print comparison | ||
| print(f" V2: Score {v2_score:.1f}/10 ({v2_matches}/{len(expected)} keywords) | {v2_time:.0f}ms") | ||
| if v2_results: | ||
| print(f" Top result: {v2_results[0].get('name', 'unknown')}") | ||
|
|
||
| print(f" V3: Score {v3_score:.1f}/10 ({v3_matches}/{len(expected)} keywords) | {v3_time:.0f}ms") | ||
| if v3_results: | ||
| print(f" Top result: {v3_results[0].get('name', 'unknown')}") | ||
|
|
||
| # Winner | ||
| if v3_score > v2_score: | ||
| print(f" π V3 WINS (+{v3_score - v2_score:.1f})") | ||
| elif v2_score > v3_score: | ||
| print(f" π V2 WINS (+{v2_score - v3_score:.1f})") | ||
| else: | ||
| print(f" π€ TIE") | ||
|
|
||
| print() | ||
|
|
||
| # Summary | ||
| print("=" * 80) | ||
| print("π BENCHMARK RESULTS") | ||
| print("=" * 80) | ||
|
|
||
| v2_avg = sum(v2_scores) / len(v2_scores) | ||
| v3_avg = sum(v3_scores) / len(v3_scores) | ||
| v2_total_time = sum(v2_times) | ||
| v3_total_time = sum(v3_times) | ||
|
|
||
| v2_wins = sum(1 for v2, v3 in zip(v2_scores, v3_scores) if v2 > v3) | ||
| v3_wins = sum(1 for v2, v3 in zip(v2_scores, v3_scores) if v3 > v2) | ||
| ties = len(v2_scores) - v2_wins - v3_wins | ||
|
|
||
| print(f""" | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | ||
| β METRIC β V2 β V3 β β | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | ||
| β Average Score β {v2_avg:>6.1f}/10 β {v3_avg:>6.1f}/10 β {"V3 β" if v3_avg > v2_avg else "V2 β" if v2_avg > v3_avg else "TIE":<5}β | ||
| β Total Time β {v2_total_time:>6.0f}ms β {v3_total_time:>6.0f}ms β {"V3 β" if v3_total_time < v2_total_time else "V2 β":<5}β | ||
| β Queries with test in top3 β {v2_test_count:>6} β {v3_test_count:>6} β {"V3 β" if v3_test_count < v2_test_count else "V2 β" if v2_test_count < v3_test_count else "TIE":<5}β | ||
| β Wins β {v2_wins:>6} β {v3_wins:>6} β β | ||
| β Ties β {ties:>6} β {ties:>6} β β | ||
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | ||
| """) | ||
|
|
||
| # Final verdict | ||
| print() | ||
| if v3_avg >= v2_avg + 1.0: | ||
| print("β VERDICT: V3 is SIGNIFICANTLY BETTER - Ready for production!") | ||
| elif v3_avg > v2_avg: | ||
| print("β VERDICT: V3 is BETTER - Consider shipping!") | ||
| elif v3_avg == v2_avg: | ||
| print("β οΈ VERDICT: V3 is EQUAL to V2 - Need more optimization") | ||
| else: | ||
| print("β VERDICT: V3 is WORSE than V2 - Needs more work") | ||
|
|
||
| print() | ||
|
|
||
| # Check for Voyage | ||
| try: | ||
| from services.search_v3.integration import get_search_v3 | ||
| v3 = get_search_v3() | ||
| if v3.is_voyage_enabled: | ||
| print("π Using Voyage AI code-specific embeddings") | ||
| else: | ||
| print("β οΈ Voyage AI not enabled - using OpenAI embeddings") | ||
| print(" Set VOYAGE_API_KEY for better code search accuracy!") | ||
| except Exception as e: | ||
| print(f"β οΈ Could not check Voyage status: {e}") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| # default repo ID (starlette) - change as needed | ||
| REPO_ID = os.getenv("BENCHMARK_REPO_ID", "0323a08f-9d21-4c59-b567-e0629a9bbb24") | ||
|
|
||
| print(f"Using repo_id: {REPO_ID}") | ||
| print("Set BENCHMARK_REPO_ID env var to use a different repo") | ||
| print() | ||
|
|
||
| asyncio.run(run_benchmark(REPO_ID)) |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.