@prosdev prosdev commented Nov 23, 2025

🐛 Problem

Search indexed 566 documents but returned 0 results for every query, leaving the semantic search feature completely non-functional.

Root Cause

LanceDB returns L2 distance (~1.0 for similar vectors), but our code converted it to a score as if the distance were a dissimilarity bounded by 1:

score = 1 - distance = 1 - 1.0 = 0

Every result scored at or near 0 and was filtered out by the threshold.
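
To make the failure concrete, here is a minimal sketch of the old conversion. The function name `oldScore` and the sample distances are hypothetical, chosen only to match the "~1.0 for similar vectors" description above; they are not taken from the repository.

```typescript
// Hypothetical reconstruction of the buggy conversion: score = 1 - distance.
function oldScore(distance: number): number {
  return 1 - distance;
}

// Illustrative L2 distances clustered around 1.0 (assumed values).
const sampleDistances: number[] = [0.95, 1.0, 1.05];
const threshold = 0.7;

// Every score lands near or below zero, so the threshold filter removes all of them.
const oldScores: number[] = sampleDistances.map(oldScore);
const surviving: number[] = oldScores.filter((s) => s >= threshold);
// surviving is an empty array
```

With distances this close to 1.0, no threshold above roughly 0.05 would have let a result through, which matches the "0 results for every query" symptom.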


✅ Solution

1. Fix Distance-to-Similarity Conversion

  • Use exponential decay: score = e^(-distance²)
  • Maps any non-negative distance into the 0-1 range (distance 0 scores 1) and spreads scores out instead of collapsing them to 0
  • Exact matches now score 70-90%, semantic matches 25-60%
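
The conversion above can be sketched as a small pure function. The name `distanceToScore` is an assumption for illustration; the actual method in LanceDBVectorStore may be named and structured differently.

```typescript
// Convert a non-negative L2 distance to a similarity score in (0, 1]
// using exponential decay: score = e^(-distance^2).
function distanceToScore(distance: number): number {
  return Math.exp(-(distance * distance));
}

const exactScore = distanceToScore(0);      // 1: identical vectors
const typicalScore = distanceToScore(1.0);  // 1/e, about 0.368
const farScore = distanceToScore(2.0);      // e^-4, about 0.018
```

A distance of 1.0, which previously mapped to 0, now maps to about 37%, in the same range as the coordinator results shown under Verification.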

2. Fix CLI Metadata Access

  • Changed metadata.file to metadata.path
  • Prevents undefined errors in search output

3. Add Comprehensive Tests

  • 11 new integration tests for search functionality
  • Tests cover: stats, thresholds, limits, sorting, score ranges
  • Protects against regression
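
As a rough illustration of the kinds of properties these tests check (thresholds, limits, sorting, score ranges), here is a sketch of a result validator. The `SearchResult` shape and the `checkResults` helper are hypothetical and are not the project's actual test code; the sample scores come from the "coordinator" query under Verification.

```typescript
// Hypothetical result shape; the real one likely carries more metadata.
interface SearchResult {
  name: string;
  score: number; // expected to be in [0, 1]
}

// Returns a list of human-readable problems; an empty list means all checks pass.
function checkResults(results: SearchResult[], threshold: number, limit: number): string[] {
  const problems: string[] = [];
  if (results.length > limit) {
    problems.push(`expected at most ${limit} results, got ${results.length}`);
  }
  let previous = Infinity;
  for (const r of results) {
    if (r.score < 0 || r.score > 1) problems.push(`${r.name}: score outside 0-1`);
    if (r.score < threshold) problems.push(`${r.name}: score below threshold`);
    if (r.score > previous) problems.push(`${r.name}: not sorted by descending score`);
    previous = r.score;
  }
  return problems;
}

const sample: SearchResult[] = [
  { name: "CoordinatorLogger", score: 0.426 },
  { name: "Coordinator - The Central Nervous System", score: 0.424 },
  { name: "CoordinatorLogger.info", score: 0.356 },
];
const problems: string[] = checkResults(sample, 0.3, 10);
// problems is empty for this sample
```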

4. Update Documentation

  • Real working examples from testing
  • Threshold recommendations (0.7=precise, 0.25=exploratory)
  • Natural language query examples
  • Actual output with real scores
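
For intuition about the threshold recommendations, the score formula can be inverted to find the largest L2 distance that still passes a given threshold. This is a sketch derived only from score = e^(-distance²); the helper name `maxDistanceForThreshold` is hypothetical and not part of the CLI.

```typescript
// Inverse of score = e^(-distance^2): distance = sqrt(-ln(score)).
function maxDistanceForThreshold(threshold: number): number {
  if (threshold <= 0 || threshold > 1) {
    throw new RangeError("threshold must be in (0, 1]");
  }
  return Math.sqrt(-Math.log(threshold));
}

const preciseCutoff = maxDistanceForThreshold(0.7);      // about 0.597
const exploratoryCutoff = maxDistanceForThreshold(0.25); // about 1.177
```

Under this formula, a threshold of 0.7 admits only results closer than roughly 0.6 in L2 distance, while 0.25 admits results out to roughly 1.18.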

🧪 Verification

Before:

$ dev search "coordinator" --threshold 0.7
✖ Found 0 result(s)

After:

$ dev search "coordinator" --threshold 0.3
1. CoordinatorLogger (42.6% match)
2. Coordinator - The Central Nervous System (42.4% match)
3. CoordinatorLogger.info (35.6% match)
✔ Found 3 result(s)

Test Results:

  • ✅ 11/11 integration tests passing
  • ✅ Natural language queries working ("how do agents communicate" → 51.9% match)
  • ✅ Exact term matching ("RepositoryIndexer" → 85.7% match)
  • ✅ Technical concepts ("vector embeddings" → 58.5% match)

📊 Impact

Search Quality Examples:

| Query | Top Result | Score |
|---|---|---|
| RepositoryIndexer | RepositoryIndexer.index | 85.7% |
| vector embeddings | EmbeddingProvider | 58.5% |
| how do agents communicate | Message Architecture | 51.9% |
| error handling | Handle Errors Gracefully | 39.3% |

Score Interpretation:

  • 70-90%: Exact matches, highly relevant
  • 40-60%: Strong semantic matches
  • 25-40%: Related concepts, exploratory
  • <20%: Weak matches
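
One way to apply these bands in a script is a small labelling helper. This is a sketch with a loud assumption: the listed bands leave gaps (60-70% and 20-25%), so the helper treats each band's lower bound as a cut-off and lets everything below 25% count as weak. The function name `interpretScore` is hypothetical.

```typescript
// Map a 0-1 score to a label, using the lower bound of each band as the cut-off.
// Assumption: gaps between the documented bands fall into the band below them.
function interpretScore(score: number): string {
  if (score >= 0.7) return "exact match, highly relevant";
  if (score >= 0.4) return "strong semantic match";
  if (score >= 0.25) return "related concept, exploratory";
  return "weak match";
}

const exactLabel = interpretScore(0.857);   // "RepositoryIndexer" query
const strongLabel = interpretScore(0.585);  // "vector embeddings" query
const relatedLabel = interpretScore(0.393); // "error handling" query
```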

🔗 Commits (Atomic)

  1. fix(search): Correct distance-to-similarity calculation
  2. test(search): Add 11 comprehensive integration tests
  3. chore: Update gitignore, remove tsx dependency
  4. docs: Update READMEs with real examples

Each commit builds and tests independently.


🚀 Next Steps

  • Fix explore similar command (searches filename as text, not content)
  • Consider adding score calibration based on query length
  • Add search result caching for frequently used queries

Dogfooded on dev-agent itself: All examples are real searches on this repository! 🐕

- Fix L2 distance conversion in LanceDBVectorStore
  * Use exponential decay: score = e^(-distance^2)
  * Provides scores in the 0-1 range with better distribution
  * Fixes issue where all results were filtered out

- Fix CLI metadata field reference
  * Changed metadata.file to metadata.path
  * Prevents 'undefined' errors in search output

Fixes search returning 0 results despite indexed data.

- Add 11 integration tests for search functionality
- Tests cover: stats, semantic search, thresholds, limits, sorting
- Tests validate score ranges and metadata structure
- Uses real indexed data from dev-agent repository

Provides regression protection for search bug fixes.

- Add .dev-agent.json and .dev-agent/ to gitignore
- Remove tsx devDependency (was only for debug script)
- Update pnpm-lock.yaml

- Add practical Quick Start with npm link instructions
- Include real search output from testing
- Document threshold recommendations (0.7=precise, 0.25=exploratory)
- Add explore command documentation
- Show actual semantic search scores and results
- Include pro tips for scripting and workflows

Makes documentation reflect actual working functionality.

- Add 9 unit tests for distance-to-similarity conversion
  * Tests the core bug fix: score = e^(-distance²)
  * Validates score ranges, monotonic decrease, edge cases
  * Fast (<2ms), deterministic, always run in CI

- Make integration tests skip in CI by default
  * Set RUN_INTEGRATION=true to run in CI if needed
  * Require pre-indexed data (.dev-agent/)
  * Run locally after `dev index .`

Hybrid approach: Unit tests catch logic bugs, integration tests
validate real behavior when available.
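
The gating rule described in this commit can be sketched as a pure decision function. This is an illustration, not the repository's code: the helper name `shouldRunIntegration` is hypothetical, and it assumes CI is detected through a `CI=true` environment variable (as GitHub Actions sets) and that the caller has already checked whether `.dev-agent/` exists.

```typescript
type Env = Record<string, string | undefined>;

// Decide whether integration tests should run.
// - They always need pre-indexed data (.dev-agent/), produced by `dev index .`.
// - Locally they run whenever that data exists.
// - In CI they are skipped unless RUN_INTEGRATION=true is set.
function shouldRunIntegration(env: Env, hasIndexedData: boolean): boolean {
  if (!hasIndexedData) return false;
  const inCI = env["CI"] === "true"; // assumption: CI sets CI=true
  if (!inCI) return true;
  return env["RUN_INTEGRATION"] === "true";
}

const localRun = shouldRunIntegration({}, true);                                 // true
const ciDefault = shouldRunIntegration({ CI: "true" }, true);                    // false
const ciOptIn = shouldRunIntegration({ CI: "true", RUN_INTEGRATION: "true" }, true); // true
```

Keeping the decision in a function that takes the environment as an argument makes the rule itself easy to unit test without touching the real process environment.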
@prosdev prosdev merged commit 4f9df34 into main Nov 23, 2025
1 check passed