Phase 3 has been successfully implemented. The knowledge gaps feature detects semantically similar clusters that have few connections, helping identify opportunities to strengthen the knowledge base.
All components have been implemented and are ready for use.
- Empty init file marking the gaps module
- detect_gaps() - Main gap detection function
- ClusterInfo - Dataclass for cluster metadata (size, hubs, tags)
- KnowledgeGap - Dataclass for gap representation
- GapsResult - Complete result with all detected gaps
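As a rough illustration of how these three data structures might fit together, here is a minimal sketch. Only the class names and the size/hubs/tags metadata come from the description above; every field name and type is an assumption, not the actual definition in the module.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterInfo:
    """Metadata for one cluster (field names are hypothetical)."""
    cluster_id: int
    size: int
    hubs: list[str] = field(default_factory=list)              # hub document titles
    tags: list[tuple[str, int]] = field(default_factory=list)  # (tag, count) pairs


@dataclass
class KnowledgeGap:
    """One semantically related but weakly linked cluster pair (hypothetical fields)."""
    cluster_a: ClusterInfo
    cluster_b: ClusterInfo
    semantic_similarity: float
    link_density: float
    cross_links: int
    shared_tags: list[str] = field(default_factory=list)
    boundary_nodes: list[str] = field(default_factory=list)

    @property
    def gap_score(self) -> float:
        # Semantic similarity minus link density, as the algorithm defines it.
        return self.semantic_similarity - self.link_density


@dataclass
class GapsResult:
    """All detected gaps plus a generation timestamp (hypothetical fields)."""
    gaps: list[KnowledgeGap]
    generated: str  # ISO 8601 timestamp string
```

Deriving gap_score as a property rather than storing it keeps the score consistent with its two inputs; the real module may well store it as a plain field instead.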
Algorithm:
- Build graph and compute Louvain clusters
- Load embeddings for all spaces
- Compute cluster centroids (mean of member embeddings)
- For each cluster pair:
  - Compute semantic similarity (cosine of centroids)
  - Compute link density (cross_links / (size_a * size_b))
  - Calculate gap_score = semantic_sim - link_density
- Filter gaps above threshold
- Find boundary nodes and shared tags
- Return sorted by gap_score descending
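The scoring steps above can be sketched in a few lines. This is a simplified illustration, not the code in detector.py: it uses plain Python lists instead of numpy arrays so that it runs without dependencies, takes clusters, embeddings and edges as already-loaded dictionaries, and assumes the threshold check is inclusive. The function names centroid, cosine and score_gaps and the dictionary layout are all hypothetical.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_gaps(clusters, embeddings, edges, min_score=0.3):
    """Score every cluster pair and return the gaps at or above min_score.

    clusters:   {cluster_id: set of doc ids}, every cluster non-empty
    embeddings: {doc_id: list of floats}
    edges:      iterable of (doc_id, doc_id) pairs, treated as undirected
    """
    doc_cluster = {doc: cid for cid, docs in clusters.items() for doc in docs}
    centroids = {
        cid: centroid([embeddings[doc] for doc in docs])
        for cid, docs in clusters.items()
    }

    # Count edges whose endpoints fall in two different clusters.
    cross = {}
    for u, v in edges:
        cu, cv = doc_cluster.get(u), doc_cluster.get(v)
        if cu is None or cv is None or cu == cv:
            continue
        key = (min(cu, cv), max(cu, cv))
        cross[key] = cross.get(key, 0) + 1

    gaps = []
    for a, b in combinations(sorted(clusters), 2):
        sim = cosine(centroids[a], centroids[b])
        links = cross.get((a, b), 0)
        density = links / (len(clusters[a]) * len(clusters[b]))
        gap_score = sim - density
        if gap_score >= min_score:  # inclusive threshold is an assumption
            gaps.append({
                "clusters": (a, b),
                "semantic_sim": sim,
                "link_density": density,
                "cross_links": links,
                "gap_score": gap_score,
            })

    # Largest gaps first.
    gaps.sort(key=lambda g: g["gap_score"], reverse=True)
    return gaps
```

Because link density divides by size_a * size_b, it stays very small for large clusters unless they are heavily interlinked, so in practice the gap score is dominated by the semantic similarity term.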
Helper Functions:
- get_cluster_centroid() - Average embedding of cluster members
- get_cluster_info() - Extract hub docs (top 5 centrality) and top tags
- find_boundary_nodes() - Nodes that link to both clusters
- find_shared_tags() - Tags appearing in both clusters
- count_cross_links() - Count edges between clusters
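To make the three set-based helpers concrete, here is a dependency-free sketch. The function names match the list above, but the signatures are assumptions: the real helpers presumably operate on the networkx graph, whereas this version takes a plain edge list and sets of document ids. It also assumes "links to both clusters" means having at least one neighbor in each cluster.

```python
def count_cross_links(edges, cluster_a, cluster_b):
    """Count undirected edges with one endpoint in each cluster."""
    return sum(
        1
        for u, v in edges
        if (u in cluster_a and v in cluster_b) or (u in cluster_b and v in cluster_a)
    )


def find_boundary_nodes(edges, cluster_a, cluster_b):
    """Return, sorted, the nodes that have a neighbor in both clusters."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return sorted(
        node
        for node, nbrs in neighbors.items()
        if nbrs & cluster_a and nbrs & cluster_b
    )


def find_shared_tags(tags_by_doc, cluster_a, cluster_b):
    """Return, sorted, the tags used by at least one document in each cluster."""
    def cluster_tags(cluster):
        return {tag for doc in cluster for tag in tags_by_doc.get(doc, ())}

    return sorted(cluster_tags(cluster_a) & cluster_tags(cluster_b))
```

For example, with edges a1-a2, a1-b1 and b1-b2 across clusters {a1, a2} and {b1, b2}, there is one cross link and both a1 and b1 count as boundary nodes, since each has a neighbor on either side.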
- format_gaps() - Formats GapsResult as compact TSV/markdown
- Output includes:
  - Gap rank and score
  - Semantic similarity vs link density
  - Hub documents for each cluster
  - Top tags for each cluster
  - Shared tags between clusters
  - Boundary nodes that bridge clusters
Output Format:
# KNOWLEDGE_GAPS count=5 generated=2025-12-10T12:30:00
## GAP rank=1 gap_score=0.70
clusters: 3, 7
semantic_sim: 0.72
link_density: 0.02
cross_links: 2
### CLUSTER_3 size=47
HUBS: Data Tokenization, Distributed Storage, API Design
TAGS: tokenization(12), data(10), api(8)
### CLUSTER_7 size=23
HUBS: Trading Journal, Position Sizing, Risk Management
TAGS: trading(8), risk(5), analytics(4)
SHARED_TAGS: data, analytics
BOUNDARY_NODES: Market Data Feed
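A formatter producing one such gap block could look roughly like the sketch below. It is illustrative only: format_gap and the dictionary keys are hypothetical, and the real format_gaps() works on a GapsResult rather than plain dictionaries.

```python
def format_gap(rank, gap):
    """Render one gap as a text block in the layout shown above.

    `gap` is a plain dict here; its keys are assumptions made for this sketch.
    """
    a, b = gap["cluster_a"], gap["cluster_b"]
    lines = [
        f"## GAP rank={rank} gap_score={gap['gap_score']:.2f}",
        f"clusters: {a['id']}, {b['id']}",
        f"semantic_sim: {gap['semantic_sim']:.2f}",
        f"link_density: {gap['link_density']:.2f}",
        f"cross_links: {gap['cross_links']}",
    ]
    for cluster in (a, b):
        tags = ", ".join(f"{tag}({count})" for tag, count in cluster["tags"])
        lines += [
            f"### CLUSTER_{cluster['id']} size={cluster['size']}",
            f"HUBS: {', '.join(cluster['hubs'])}",
            f"TAGS: {tags}",
        ]
    lines.append(f"SHARED_TAGS: {', '.join(gap['shared_tags'])}")
    lines.append(f"BOUNDARY_NODES: {', '.join(gap['boundary_nodes'])}")
    return "\n".join(lines)
```

Keeping each field on its own labeled line makes the output cheap to scan and easy for Claude to parse without a dedicated reader.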
Added gaps command with options:
- datacortex gaps - Detect gaps for all spaces
- datacortex gaps --space personal - Single space
- datacortex gaps --min-score 0.4 - Custom threshold
Output written to: /tmp/datacortex_gaps_{timestamp}.txt
Complete command documentation including:
- Overview of knowledge gaps concept
- Synthesis guidelines for Claude
- How to name clusters based on hub docs and tags
- Bridge action recommendations (expand nodes, create notes, add links, unify tags)
- Output format guidelines
- Prioritization criteria
The gaps module integrates cleanly with existing Datacortex components:
- Clustering: Uses metrics/clusters.py - compute_clusters() with Louvain algorithm
- Embeddings: Uses ai/embeddings.py - compute_embeddings_for_space() with caching
- Similarity: Uses ai/similarity.py - cosine_similarity() for centroid comparison
- Graph Building: Uses indexer/graph_builder.py - build_graph() for full graph
- Database: Uses core/database.py - get_connection(), space_exists()
# Analyze all spaces with default threshold (0.3)
datacortex gaps
# Analyze specific space
datacortex gaps --space personal
# Use higher threshold for only significant gaps
datacortex gaps --min-score 0.5

# From Claude Code CLI
/datacortex-gaps

This will:
- Run datacortex gaps to generate analysis
- Read the output file
- Synthesize actionable bridge suggestions
- Present gaps with specific recommendations
Each knowledge gap includes:
- Semantic Similarity: How related the clusters are (0-1)
- Link Density: Actual connections / maximum possible (0-1)
- Gap Score: Semantic similarity minus link density
- Cross Links: Absolute count of connections
- Cluster Sizes: Number of documents in each cluster
- Hub Documents: Top 5 most central documents per cluster
- Top Tags: Most frequent tags per cluster (with counts)
- Shared Tags: Tags appearing in both clusters
- Boundary Nodes: Documents that already link both clusters
- High gap score (>0.5): Clusters are very related but barely connected - high priority
- Medium gap score (0.3-0.5): Some semantic relationship, few connections - worth investigating
- Low gap score (<0.3): Either not very related or already well-connected - can ignore
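The three priority bands translate directly into a small helper. This is a hypothetical sketch (classify_gap is not part of the module as described); it treats a score of exactly 0.5 as medium and exactly 0.3 as medium, following the ranges as written.

```python
def classify_gap(gap_score):
    """Map a gap score to the priority bands described above."""
    if gap_score > 0.5:
        return "high"    # very related but barely connected
    if gap_score >= 0.3:
        return "medium"  # some relationship, few connections
    return "low"         # unrelated or already well connected
```

With the default threshold of 0.3, low-band gaps never reach the output file, so the helper mostly serves to separate high from medium priority.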
The formatter and command work together to suggest:
- Expand Boundary Nodes (easiest)
  - Documents already linking both clusters
  - Add more explicit connections in their content
- Create Bridge Notes (most impactful)
  - New documents explicitly connecting themes
  - Suggested titles and key points
- Add Direct Links (quick wins)
  - Specific wiki-links between existing documents
  - Strategic connections between hub docs
- Unify Tags (organizational)
  - Standardize shared tags
  - Add spanning tags to both clusters
All gap analyses are written to:
/tmp/datacortex_gaps_{timestamp}.txt
Format: YYYYMMDD_HHMMSS
Example: /tmp/datacortex_gaps_20251210_133000.txt
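For reference, a path in this format can be built with the standard library as sketched below. The helper name gaps_output_path and its parameters are hypothetical; only the directory, file name prefix and timestamp format come from the text above.

```python
from datetime import datetime
from pathlib import Path


def gaps_output_path(now=None, directory="/tmp"):
    """Build the output path using the YYYYMMDD_HHMMSS timestamp format."""
    if now is None:
        now = datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"datacortex_gaps_{stamp}.txt"
```

Because the timestamp sorts lexicographically in chronological order, the newest analysis is always the last file when the directory listing is sorted by name.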
All dependencies already specified in pyproject.toml:
- numpy - Array operations for centroids
- networkx - Graph analysis (already used for clustering)
- sentence-transformers - Embeddings (already used in Phase 1)
No new dependencies required.
To test after installation:
# Install datacortex in development mode
pip install -e .
# Ensure embeddings are computed
datacortex embed
# Run gap detection
datacortex gaps --space personal
# View output
# View the most recent output file
ls -t /tmp/datacortex_gaps_*.txt | head -1 | xargs cat

- Install datacortex in development mode: pip install -e .
- Compute embeddings if not already done: datacortex embed
- Run gap detection: datacortex gaps
- Test the /datacortex-gaps command in Claude Code CLI
- Review gap suggestions and implement bridge recommendations
The gaps module follows the same pattern as the digest module:
- detector.py - Core algorithm and data structures
- formatter.py - Output formatting for Claude consumption
- CLI integration in commands.py
- Datacore command in .datacore/commands/
This consistency makes the codebase easy to understand and extend.
Potential improvements for future phases:
- Temporal analysis - track how gaps close over time
- Gap visualization in the web UI
- Automated link suggestions (not just manual recommendations)
- Gap prioritization based on user activity (recent edits/views)
- Integration with /today briefing to show new gaps
Status: Phase 3 Complete
Date: 2025-12-10
Location: src/datacortex/