feat: wire yzma embeddings into dedup system #141
Merged
nvandessel merged 1 commit into main on Feb 20, 2026
- Add EmbeddingComparer support to MockClient (Embed, CompareEmbeddings, call tracking, compile-time interface assertion)
- Extract unified ComputeSimilarity with 3-tier fallback (embedding → LLM → Jaccard) into internal/dedup/similarity.go, replacing 3 duplicated implementations
- Add EmbeddingCache for batch dedup so each behavior text is embedded at most once during pairwise comparison
- Wire s.llmClient into MCP handleFloopDeduplicate (was ignored despite being initialized on the server struct)
- Embed merged behaviors in background after successful dedup merges
- Add DefaultEmbeddingDedupThreshold (0.7) constant and EmbeddingThreshold field to DeduplicatorConfig
- Add --embedding-threshold CLI flag to floop deduplicate command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR unifies similarity computation across the deduplication system by extracting 3 duplicate implementations into a single ComputeSimilarity function in internal/dedup/similarity.go.
The refactoring successfully eliminates code duplication while maintaining backward compatibility through fallback chains.

Confidence Score: 4/5

| Filename | Overview |
|---|---|
| cmd/floop/cmd_dedup.go | Added --embedding-threshold flag, removed duplicate similarity logic by delegating to unified dedup.ComputeSimilarity, added EmbeddingCache for O(n) embedding cost |
| internal/dedup/similarity.go | New file implementing unified 3-tier similarity: embedding → LLM → Jaccard with EmbeddingCache for batch dedup optimization |
| internal/dedup/store_dedup.go | Added embeddingCache field, refactored computeSimilarity to delegate to unified function, added effectiveThreshold method for threshold selection |
| internal/llm/mock.go | Extended MockClient to implement EmbeddingComparer interface with Embed, CompareEmbeddings, builder methods, and call tracking |
| internal/mcp/handlers.go | Wired llmClient into dedup handler, added background embedding of merged behaviors, configured EmbeddingThreshold (hardcoded, not from args) |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
Start[Dedup Request CLI/MCP] --> CheckLLM{LLM Client Available?}
CheckLLM -->|Yes| CreateCache[Create EmbeddingCache]
CheckLLM -->|No| Jaccard[Jaccard Similarity]
CreateCache --> PairwiseLoop[For each behavior pair]
PairwiseLoop --> ComputeSim[ComputeSimilarity]
ComputeSim --> TryEmbed{Client supports<br/>EmbeddingComparer?}
TryEmbed -->|Yes| GetCachedA[Cache.GetOrCompute A]
GetCachedA --> GetCachedB[Cache.GetOrCompute B]
GetCachedB --> Cosine[CosineSimilarity]
Cosine -->|Success| CheckEmbedThresh{Score >= EmbeddingThreshold<br/>default: 0.7?}
Cosine -->|Error| TryLLM
TryEmbed -->|No| TryLLM[LLM CompareBehaviors]
TryLLM -->|Success| CheckSimThresh{Score >= SimilarityThreshold<br/>default: 0.9?}
TryLLM -->|Error| Jaccard
Jaccard --> CheckSimThresh
CheckEmbedThresh -->|Yes| Merge[Merge Behaviors]
CheckEmbedThresh -->|No| PairwiseLoop
CheckSimThresh -->|Yes| Merge
CheckSimThresh -->|No| PairwiseLoop
Merge --> EmbedMerged{Embedder Available?}
EmbedMerged -->|Yes| BgEmbed[Background: Embed merged behavior]
EmbedMerged -->|No| SyncStore
BgEmbed --> SyncStore[Sync Store]
SyncStore --> End[Return Report]
Last reviewed commit: 3369872
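
To illustrate the caching and threshold-selection steps in the flowchart, here is a small sketch under stated assumptions. The `EmbeddingCache` shape, the `GetOrCompute` signature, and the standalone `effectiveThreshold` function are hypothetical reconstructions from the names used above; the actual types in internal/dedup may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// EmbeddingCache is a hypothetical sketch keyed by behavior text.
type EmbeddingCache struct {
	mu   sync.Mutex
	vecs map[string][]float32
}

func NewEmbeddingCache() *EmbeddingCache {
	return &EmbeddingCache{vecs: make(map[string][]float32)}
}

// GetOrCompute returns the cached vector for text and calls embed only on a miss.
// The lock is held during embed, which keeps the sketch simple but serializes callers.
func (c *EmbeddingCache) GetOrCompute(text string, embed func(string) ([]float32, error)) ([]float32, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.vecs[text]; ok {
		return v, nil
	}
	v, err := embed(text)
	if err != nil {
		return nil, err // errors are not cached, so a later call can retry
	}
	c.vecs[text] = v
	return v, nil
}

// effectiveThreshold picks the cutoff that matches the tier that produced the score:
// the embedding threshold for embedding scores, the similarity threshold otherwise.
func effectiveThreshold(method string, similarity, embedding float64) float64 {
	if method == "embedding" {
		return embedding
	}
	return similarity
}

// countEmbedCalls runs a pairwise loop over texts and reports how many
// times the (fake) embedder was actually invoked.
func countEmbedCalls(texts []string) int {
	cache := NewEmbeddingCache()
	calls := 0
	embed := func(s string) ([]float32, error) {
		calls++
		return []float32{float32(len(s))}, nil
	}
	for i := 0; i < len(texts); i++ {
		for j := i + 1; j < len(texts); j++ {
			_, _ = cache.GetOrCompute(texts[i], embed)
			_, _ = cache.GetOrCompute(texts[j], embed)
		}
	}
	return calls
}

func main() {
	// 4 behaviors give 6 pairs (12 lookups) but only 4 embed calls.
	fmt.Println(countEmbedCalls([]string{"a", "bb", "ccc", "dddd"}))
	fmt.Println(effectiveThreshold("embedding", 0.9, 0.7))
}
```

The number of pairwise comparisons is still quadratic; only the embedding calls, which are the expensive part, drop to one per distinct text.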
    useLLM := s.llmClient != nil && s.llmClient.Available()
    dedupConfig := dedup.DeduplicatorConfig{
        SimilarityThreshold: threshold,
        EmbeddingThreshold:  constants.DefaultEmbeddingDedupThreshold,
Consider adding an embedding_threshold parameter to the FloopDeduplicateInput schema so that MCP clients can configure this value (it is currently hardcoded).
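
One possible shape for this suggestion, offered as a sketch only: the struct fields, JSON tags, and the `resolveEmbeddingThreshold` and `parseThreshold` helpers below are hypothetical, and the [0, 1] range check is an assumption about what values make sense. Only the 0.7 default comes from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DefaultEmbeddingDedupThreshold mirrors the 0.7 value stated in the PR.
const DefaultEmbeddingDedupThreshold = 0.7

// FloopDeduplicateInput is a hypothetical shape; the real schema may differ.
// A pointer distinguishes "not provided" from an explicit 0.
type FloopDeduplicateInput struct {
	Threshold          float64  `json:"threshold,omitempty"`
	EmbeddingThreshold *float64 `json:"embedding_threshold,omitempty"`
}

// resolveEmbeddingThreshold falls back to the default when the field is absent
// and rejects values outside [0, 1] (an assumed validation rule).
func resolveEmbeddingThreshold(in FloopDeduplicateInput) (float64, error) {
	if in.EmbeddingThreshold == nil {
		return DefaultEmbeddingDedupThreshold, nil
	}
	t := *in.EmbeddingThreshold
	if t < 0 || t > 1 {
		return 0, fmt.Errorf("embedding_threshold must be in [0, 1], got %v", t)
	}
	return t, nil
}

// parseThreshold decodes raw JSON arguments and resolves the threshold.
func parseThreshold(raw string) (float64, error) {
	var in FloopDeduplicateInput
	if err := json.Unmarshal([]byte(raw), &in); err != nil {
		return 0, err
	}
	return resolveEmbeddingThreshold(in)
}

func main() {
	for _, raw := range []string{`{"threshold": 0.9}`, `{"embedding_threshold": 0.8}`} {
		t, err := parseThreshold(raw)
		fmt.Println(t, err)
	}
}
```

Keeping the default in one place means existing MCP clients that never send the field would see no behavior change.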
Summary
- `ComputeSimilarity`: Extracted 3 duplicated similarity functions into a single `internal/dedup/similarity.go` with 3-tier fallback (embedding → LLM → Jaccard). All callers (StoreDeduplicator, CrossStoreDeduplicator, CLI) now delegate to this shared implementation.
- `EmbeddingCache`: Batch dedup operations cache each behavior's embedding so it's computed at most once during pairwise comparison (O(n) embeds instead of O(n²)).
- `handleFloopDeduplicate` now uses `s.llmClient` (was ignored despite being initialized). Merged behaviors are embedded in background after successful dedup.
- `MockClient` supports `EmbeddingComparer`: `Embed()`, `CompareEmbeddings()`, builder methods, and call tracking, which enables testing the full embedding dedup path.
- `DefaultEmbeddingDedupThreshold = 0.7` (cosine similarity distributes differently from Jaccard), `--embedding-threshold` CLI flag, wired through MCP handler.

Bead: `feedback-loop-cqq` (P1)

Test plan

- `go test ./...`: 32 packages pass
- `go build ./cmd/floop`: clean
- `golangci-lint run`: 0 issues
- `floop deduplicate --dry-run --threshold 0.7` with local provider: verify embedding method used
- `floop_deduplicate` with `threshold: 0.7`: verify embedding-based dedup

🤖 Generated with Claude Code