Integrate Graphify for zero-cost code entity extraction #5

@verkligheten

Problem

When ingesting code folders via agent-notes memory ingest, the system concatenates all files as raw text and relies on LLM agents (wiki-compiler at Sonnet/Opus cost) to discover what entities exist (classes, functions, modules) and how they relate. This is:

  • Expensive: ~$0.10-0.50 per file in LLM tokens for entity discovery
  • Non-deterministic: same code may produce different entity lists on different runs
  • Slow: LLM round-trips for purely structural information available in the AST

Solution

Integrate Graphify (47k stars, MIT, Python 3.10+), which uses tree-sitter to parse source files into ASTs locally and extracts code entities deterministically at zero API cost. Wire its extraction into the existing wiki_ingest_folder() pipeline.

Cost Impact

| Step | Current (LLM) | With Graphify |
| --- | --- | --- |
| Code structure (functions, classes, imports) | ~$0.10-0.50/file | Free (tree-sitter, local) |
| Relationships between code entities | LLM inference | Free (AST traversal) |
| Community/module detection | Manual or LLM-inferred | Free (Leiden algorithm) |
| Domain narrative compilation | LLM agents | LLM agents (unchanged) |

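For a typical 80% code / 20% docs project, roughly 80% of the entity-extraction work (the code portion) no longer needs LLM calls. Domain narrative compilation still runs through LLM agents.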
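As a rough back-of-envelope check on that estimate, the sketch below applies the per-file range from the table to a hypothetical 100-file project. The file counts, and the assumption that docs cost the same per file as code, are illustrative and not measured.

```python
def llm_extraction_cost(n_files, per_file_usd=(0.10, 0.50)):
    """Return a (low, high) USD estimate for LLM-based entity discovery."""
    low, high = per_file_usd
    return (round(n_files * low, 2), round(n_files * high, 2))


# Hypothetical project: 80 code files, 20 documentation files.
code_files, doc_files = 80, 20

# Today: every file goes through LLM entity discovery.
before = llm_extraction_cost(code_files + doc_files)

# With Graphify: only the docs still need LLM entity discovery.
after = llm_extraction_cost(doc_files)

# Fraction of the entity-discovery spend that goes away.
saved_fraction = 1 - after[0] / before[0]
```

This covers entity discovery only. The narrative compilation step is unchanged, so total spend drops by less than this fraction.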

Graphify Python API (verified from source + context7)

from pathlib import Path

# File detection
from graphify.detect import detect, classify_file, FileType
detection = detect(Path("./project"))
# -> {"files": {"code": [...], "document": [...], "paper": [...], "image": [...]},
#     "total_files": int, "total_words": int, "skipped_sensitive": [...]}

# File collection
from graphify.extract import collect_files, extract
code_files = collect_files(Path("./src"))  # -> [Path, ...]

# AST extraction (zero API cost for code)
result = extract(code_files, cache_root=Path("."))
# -> {"nodes": [{"id": str, "label": str, "file_type": str, "source_file": str, "source_location": str}],
#     "edges": [{"source": str, "target": str, "relation": str, "confidence": str, "weight": float}],
#     "input_tokens": 0, "output_tokens": 0}

# Graph construction
from graphify.build import build_from_json
G = build_from_json(result)  # -> networkx.Graph

# Community detection
from graphify.cluster import cluster, score_all
communities = cluster(G)  # -> {0: ["node_id_a", "node_id_b"], 1: [...]}
cohesion = score_all(G, communities)  # -> {0: 0.85, 1: 0.72}

# Analysis
from graphify.analyze import god_nodes, surprising_connections
gods = god_nodes(G)  # -> [{"label": str, "degree": int, ...}]
surprises = surprising_connections(G, communities)
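
Because the extraction result is plain data, downstream code can inspect it without importing Graphify. A minimal sketch, using a hand-written sample in the shape documented above; the sample entities, the "code" file type, and the "L10" location format are made up for illustration, and the helper names are hypothetical:

```python
from collections import Counter

# Hand-written sample in the documented extract() shape (values are invented).
sample = {
    "nodes": [
        {"id": "auth_loginhandler", "label": "LoginHandler", "file_type": "code",
         "source_file": "auth.py", "source_location": "L10"},
        {"id": "auth_hash_password", "label": "hash_password", "file_type": "code",
         "source_file": "auth.py", "source_location": "L42"},
    ],
    "edges": [
        {"source": "auth_loginhandler", "target": "auth_hash_password",
         "relation": "calls", "confidence": "EXTRACTED", "weight": 1.0},
        {"source": "auth_loginhandler", "target": "auth_hash_password",
         "relation": "uses", "confidence": "INFERRED", "weight": 0.5},
    ],
    "input_tokens": 0,
    "output_tokens": 0,
}


def edges_with_confidence(result, confidence="EXTRACTED"):
    """Keep only edges whose confidence label matches (default: AST-derived)."""
    return [e for e in result["edges"] if e["confidence"] == confidence]


def relation_counts(result):
    """Count edges per relation type, e.g. how many 'calls' vs 'imports'."""
    return Counter(e["relation"] for e in result["edges"])


ast_edges = edges_with_confidence(sample)
counts = relation_counts(sample)
```

Filtering on "EXTRACTED" would let the cross-reference step trust only edges read directly from the AST and treat inferred ones as hints.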

Key facts:

  • extract() accepts list[Path] of code files, returns dict with nodes/edges
  • build_from_json() accepts extraction dict, returns networkx.Graph
  • cluster() returns dict[int, list[str]] mapping community ID to node IDs
  • Node IDs are deterministic: {filename_stem}_{entity_name} (lowercase, NFKC normalized)
  • Edge confidence: "EXTRACTED" (from AST), "INFERRED", "AMBIGUOUS"
  • Edge relations: "contains", "calls", "imports", "uses", "inherits", "method"
  • Supports 15+ languages: Python, JS/TS, Java, Go, Rust, C/C++, C#, Kotlin, Scala, PHP, Ruby, Swift, Lua, Groovy, Fortran
  • PyPI package: graphifyy (double-y), CLI command: graphify
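
The node ID format above can be illustrated with a short sketch. This is only a reading of the stated format ({filename_stem}_{entity_name}, lowercase, NFKC normalized); the function name is hypothetical, and Graphify's own implementation may handle punctuation or collisions differently.

```python
import unicodedata
from pathlib import Path


def make_node_id(source_file, entity_name):
    """Build an ID as '<file stem>_<entity name>', NFKC-normalized and lowercased."""
    stem = Path(source_file).stem  # "src/Auth.py" -> "Auth"
    raw = f"{stem}_{entity_name}"
    return unicodedata.normalize("NFKC", raw).lower()


plain_id = make_node_id("src/Auth.py", "LoginHandler")
# Full-width letters fold to ASCII under NFKC, so visually similar names match.
folded_id = make_node_id("pkg/util.py", "Ｆｏｏ")
```

Deterministic IDs matter here because they let repeated ingests of the same folder map onto the same entity stub pages.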

Architecture

wiki_ingest_folder(folder_path)
    │
    ├── [existing] Walk files, concatenate with --- FILE: markers
    │
    ├── [NEW] if has_code and graphify_available():
    │   ├── collect_files() → code file paths
    │   ├── extract(code_files) → {nodes, edges}
    │   ├── build_from_json() → NetworkX graph
    │   ├── cluster() → communities
    │   ├── graph_to_wiki_terms() → {entities, concepts, edges_by_entity}
    │   └── save_graph_json() → raw/<slug>-graph.json
    │
    ├── Merge graphify-discovered entities/concepts with caller-provided ones
    │
    └── [existing] wiki_ingest(merged_entities, merged_concepts)
            ├── Store raw content
            ├── Create source page
            ├── Fan out entity stub pages  ← now pre-populated by Graphify
            ├── Fan out concept stub pages ← now pre-populated by Graphify
            └── Cross-reference (enhanced with edge data)
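
The issue names graph_to_wiki_terms() and its three output keys but does not define the mapping. One possible reading, sketched below on plain dicts: node labels become entities, each community becomes one concept, and edges are indexed by source entity. The community naming rule and the merge_terms() helper are assumptions made for illustration.

```python
from collections import defaultdict


def graph_to_wiki_terms(extraction, communities):
    """Assumed mapping: node labels -> entities, communities -> concepts."""
    label_of = {n["id"]: n["label"] for n in extraction["nodes"]}

    edges_by_entity = defaultdict(list)
    for edge in extraction["edges"]:
        src = label_of.get(edge["source"])
        tgt = label_of.get(edge["target"])
        if src is None or tgt is None:
            continue  # skip edges that point at unknown nodes
        edges_by_entity[src].append((edge["relation"], tgt))

    # Assumption: name a community after its first member; a real version
    # might prefer the highest-degree node instead.
    concepts = [
        f"{label_of[members[0]]} module"
        for _, members in sorted(communities.items())
        if members and members[0] in label_of
    ]
    return {
        "entities": sorted(set(label_of.values())),
        "concepts": concepts,
        "edges_by_entity": dict(edges_by_entity),
    }


def merge_terms(caller_terms, discovered_terms):
    """Union of both lists, caller-provided first, de-duplicated case-insensitively."""
    seen, merged = set(), []
    for term in [*caller_terms, *discovered_terms]:
        key = term.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged


# Invented sample data in the documented extract() / cluster() shapes.
extraction = {
    "nodes": [
        {"id": "auth_loginhandler", "label": "LoginHandler"},
        {"id": "auth_hash_password", "label": "hash_password"},
        {"id": "db_session", "label": "Session"},
    ],
    "edges": [
        {"source": "auth_loginhandler", "target": "auth_hash_password", "relation": "calls"},
        {"source": "auth_loginhandler", "target": "db_session", "relation": "uses"},
    ],
}
communities = {0: ["auth_loginhandler", "auth_hash_password"], 1: ["db_session"]}

terms = graph_to_wiki_terms(extraction, communities)
entities = merge_terms(["LoginHandler", "Billing"], terms["entities"])
```

Putting caller-provided terms first keeps explicit input authoritative and lets Graphify only add what the caller missed.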

Design Constraints

  1. Optional dependency: graphifyy in [project.optional-dependencies], graceful fallback via try/except ImportError
  2. No new CLI commands — auto-detect folder path in existing agent-notes memory ingest
  3. No manual configuration — Graphify extraction runs automatically when available
  4. Backward compatible — all existing tests and workflows unchanged
  5. Single integration point — new code_graph.py module encapsulates all Graphify interaction
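
Constraints 1 and 5 together suggest a guarded import inside code_graph.py. A minimal sketch, assuming the collect_files() and extract() signatures listed earlier in this issue; extract_code_graph() is a hypothetical wrapper name, not existing code.

```python
from pathlib import Path
from typing import Optional

try:
    # Only present when the optional "graphifyy" package is installed.
    from graphify.extract import collect_files, extract
    _GRAPHIFY_AVAILABLE = True
except ImportError:
    collect_files = extract = None
    _GRAPHIFY_AVAILABLE = False


def graphify_available() -> bool:
    """True when the optional Graphify dependency could be imported."""
    return _GRAPHIFY_AVAILABLE


def extract_code_graph(folder: Path) -> Optional[dict]:
    """Return the Graphify extraction dict, or None to fall back to the old path."""
    if not graphify_available():
        return None
    code_files = collect_files(folder)
    if not code_files:
        return None  # nothing to parse, e.g. a docs-only folder
    return extract(code_files, cache_root=folder)
```

Returning None instead of raising keeps wiki_ingest_folder() on its existing behavior whenever Graphify is missing, which is what backward compatibility requires.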

Sub-issues

Dependency Graph

#6 (optional dep)
 └── #7 (code_graph.py)
      ├── #8 (wire into wiki_ingest_folder)
      │    ├── #9 (CLI folder detection)
      │    ├── #10 (cross-ref enrichment)
      │    └── #11 (agent instructions)
      └── #12 (tests — depends on all above)
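
A valid merge order for the sub-issues can be derived from the graph above with the standard library's graphlib. The dependency mapping below is transcribed from the diagram, reading "depends on all above" as #12 depending on every other sub-issue.

```python
from graphlib import TopologicalSorter

# Each key depends on the issues in its value set.
depends_on = {
    6: set(),
    7: {6},
    8: {7},
    9: {8},
    10: {8},
    11: {8},
    12: {6, 7, 8, 9, 10, 11},
}

# static_order() yields every issue after all of its dependencies.
merge_order = list(TopologicalSorter(depends_on).static_order())
```

Issues #9, #10, and #11 have no ordering among themselves, so they can be worked on in parallel once #8 lands.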

Files to Modify

| File | Action |
| --- | --- |
| pyproject.toml | Add graph optional dep, pytest marker |
| agent_notes/services/code_graph.py | NEW: Graphify extraction boundary |
| agent_notes/services/wiki_backend.py | Modify wiki_ingest_folder(), _cross_reference() |
| agent_notes/commands/memory.py | Modify do_ingest() for folder auto-detect |
| agent_notes/data/agents/wiki-compiler.md | Add graph.json usage instructions |
| agent_notes/data/skills/obsidian-memory/SKILL.md | Document folder auto-detection |
| tests/unit/services/test_code_graph.py | NEW: extraction module tests |
| tests/unit/services/test_wiki_backend.py | Add Graphify integration tests |
| tests/functional/memory/test_memory_command.py | Add folder detection test |
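
For the do_ingest() folder auto-detect change, one way to classify the argument is sketched below. The function name and the three categories are hypothetical; the actual inputs accepted by agent-notes memory ingest may differ.

```python
import tempfile
from pathlib import Path


def detect_ingest_source(source):
    """Classify an ingest argument as 'folder', 'file', or free-form 'text'."""
    try:
        path = Path(source)
        if path.is_dir():
            return "folder"
        if path.is_file():
            return "file"
    except (OSError, ValueError):
        pass  # over-long or otherwise invalid paths are treated as text
    return "text"


# Usage: a temp folder, a file inside it, and a plain note.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_text("hello")
    kinds = [
        detect_ingest_source(tmp),
        detect_ingest_source(str(note)),
        detect_ingest_source("remember to rotate the API keys"),
    ]
```

Only the "folder" branch would route into wiki_ingest_folder(), so existing single-file and text workflows stay untouched.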

Branch

feat/graphify-integration from develop

Labels: enhancement, epic (multi-issue feature spanning several PRs)
