Skip to content

Comments

feat: wire yzma embeddings into dedup system#141

Merged
nvandessel merged 1 commit intomainfrom
feature/yzma-dedup
Feb 20, 2026
Merged

feat: wire yzma embeddings into dedup system#141
nvandessel merged 1 commit intomainfrom
feature/yzma-dedup

Conversation

@nvandessel
Copy link
Owner

Summary

  • Unified ComputeSimilarity: Extracted 3 duplicated similarity functions into a single internal/dedup/similarity.go with 3-tier fallback (embedding → LLM → Jaccard). All callers (StoreDeduplicator, CrossStoreDeduplicator, CLI) now delegate to this shared implementation.
  • EmbeddingCache: Batch dedup operations cache each behavior's embedding so it's computed at most once during pairwise comparison (O(n) embeds instead of O(n²)).
  • MCP handler wired: handleFloopDeduplicate now uses s.llmClient (was ignored despite being initialized). Merged behaviors are embedded in background after successful dedup.
  • MockClient supports EmbeddingComparer: Embed(), CompareEmbeddings(), builder methods, call tracking — enables testing the full embedding dedup path.
  • Embedding-aware threshold: DefaultEmbeddingDedupThreshold = 0.7 (cosine similarity distributes differently from Jaccard), --embedding-threshold CLI flag, wired through MCP handler.

Bead: feedback-loop-cqq (P1)

Test plan

  • go test ./... — 32 packages pass
  • go build ./cmd/floop — clean
  • golangci-lint run — 0 issues
  • Manual: floop deduplicate --dry-run --threshold 0.7 with local provider — verify embedding method used
  • Manual: MCP floop_deduplicate with threshold: 0.7 — verify embedding-based dedup
  • Acceptance: Four golangci-lint behaviors detected as duplicates at threshold 0.7

🤖 Generated with Claude Code

- Add EmbeddingComparer support to MockClient (Embed, CompareEmbeddings,
  call tracking, compile-time interface assertion)
- Extract unified ComputeSimilarity with 3-tier fallback
  (embedding → LLM → Jaccard) into internal/dedup/similarity.go,
  replacing 3 duplicated implementations
- Add EmbeddingCache for batch dedup so each behavior text is embedded
  at most once during pairwise comparison
- Wire s.llmClient into MCP handleFloopDeduplicate (was ignored despite
  being initialized on the server struct)
- Embed merged behaviors in background after successful dedup merges
- Add DefaultEmbeddingDedupThreshold (0.7) constant and
  EmbeddingThreshold field to DeduplicatorConfig
- Add --embedding-threshold CLI flag to floop deduplicate command

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link

greptile-apps bot commented Feb 20, 2026

Greptile Summary

This PR unifies similarity computation across the deduplication system by extracting 3 duplicate implementations into internal/dedup/similarity.go with a 3-tier fallback strategy (embedding → LLM → Jaccard).

Key improvements:

  • Unified similarity logic: All dedup paths (CLI, MCP, store, cross-store) now use ComputeSimilarity for consistency
  • Embedding-first approach: When LLM client supports EmbeddingComparer, embeddings are preferred over full LLM comparison for better performance
  • O(n) embedding cost: EmbeddingCache ensures each behavior text is embedded once during batch pairwise comparisons instead of O(n²)
  • Separate thresholds: EmbeddingThreshold (0.7) vs SimilarityThreshold (0.9) accounts for different score distributions
  • MCP integration: handleFloopDeduplicate now uses s.llmClient and embeds merged behaviors in background
  • Enhanced testing: MockClient now implements EmbeddingComparer with full test coverage

The refactoring successfully eliminates code duplication while maintaining backward compatibility through fallback chains.

Confidence Score: 4/5

  • Safe to merge with one minor enhancement opportunity
  • The refactoring is well-tested with comprehensive test coverage across all similarity paths. Code quality is high with proper fallback chains and error handling. The one minor issue is hardcoding EmbeddingThreshold in the MCP handler instead of exposing it as a parameter, but this doesn't affect correctness.
  • No files require special attention - the MCP handler improvement is optional

Important Files Changed

Filename Overview
cmd/floop/cmd_dedup.go Added --embedding-threshold flag, removed duplicate similarity logic by delegating to unified dedup.ComputeSimilarity, added EmbeddingCache for O(n) embedding cost
internal/dedup/similarity.go New file implementing unified 3-tier similarity: embedding → LLM → Jaccard with EmbeddingCache for batch dedup optimization
internal/dedup/store_dedup.go Added embeddingCache field, refactored computeSimilarity to delegate to unified function, added effectiveThreshold method for threshold selection
internal/llm/mock.go Extended MockClient to implement EmbeddingComparer interface with Embed, CompareEmbeddings, builder methods, and call tracking
internal/mcp/handlers.go Wired llmClient into dedup handler, added background embedding of merged behaviors, configured EmbeddingThreshold (hardcoded, not from args)

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Dedup Request CLI/MCP] --> CheckLLM{LLM Client Available?}
    CheckLLM -->|Yes| CreateCache[Create EmbeddingCache]
    CheckLLM -->|No| Jaccard[Jaccard Similarity]
    
    CreateCache --> PairwiseLoop[For each behavior pair]
    PairwiseLoop --> ComputeSim[ComputeSimilarity]
    
    ComputeSim --> TryEmbed{Client supports<br/>EmbeddingComparer?}
    TryEmbed -->|Yes| GetCachedA[Cache.GetOrCompute A]
    GetCachedA --> GetCachedB[Cache.GetOrCompute B]
    GetCachedB --> Cosine[CosineSimilarity]
    Cosine -->|Success| CheckEmbedThresh{Score >= EmbeddingThreshold<br/>default: 0.7?}
    Cosine -->|Error| TryLLM
    
    TryEmbed -->|No| TryLLM[LLM CompareBehaviors]
    TryLLM -->|Success| CheckSimThresh{Score >= SimilarityThreshold<br/>default: 0.9?}
    TryLLM -->|Error| Jaccard
    
    Jaccard --> CheckSimThresh
    CheckEmbedThresh -->|Yes| Merge[Merge Behaviors]
    CheckEmbedThresh -->|No| PairwiseLoop
    CheckSimThresh -->|Yes| Merge
    CheckSimThresh -->|No| PairwiseLoop
    
    Merge --> EmbedMerged{Embedder Available?}
    EmbedMerged -->|Yes| BgEmbed[Background: Embed merged behavior]
    EmbedMerged -->|No| SyncStore
    BgEmbed --> SyncStore[Sync Store]
    SyncStore --> End[Return Report]
Loading

Last reviewed commit: 3369872

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

useLLM := s.llmClient != nil && s.llmClient.Available()
dedupConfig := dedup.DeduplicatorConfig{
SimilarityThreshold: threshold,
EmbeddingThreshold: constants.DefaultEmbeddingDedupThreshold,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding embedding_threshold parameter to FloopDeduplicateInput schema to allow MCP clients to configure this value (currently hardcoded)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@nvandessel nvandessel merged commit dd579af into main Feb 20, 2026
8 checks passed
@nvandessel nvandessel deleted the feature/yzma-dedup branch February 20, 2026 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant