@dstengle dstengle commented Nov 5, 2025

feat: Add markdown structure processing to graph
Description
Summary
Adds markdown structure processing that converts markdown elements (headings, sections, lists, tables, code blocks, and blockquotes) into RDF graph entities, with their relationships and metadata attached.

Key Changes

  1. KB Entity Models (src/knowledgebase_processor/models/kb_entities.py)
    Added 7 new Pydantic models for markdown structure:

KbHeading - Markdown headings (h1-h6) with level and hierarchy
KbSection - Content sections with heading relationships
KbList - Ordered/unordered lists with item counts
KbListItem - Individual list items with parent relationships
KbTable - Tables with row/column counts and headers
KbCodeBlock - Code blocks with language and line count
KbBlockquote - Blockquotes with nesting levels
All models include RDF property mappings, position tracking, and Schema.org types.
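
As a rough illustration of what one of these models carries, the sketch below uses a standard-library dataclass as a dependency-free stand-in for a Pydantic model. The class name `HeadingSketch` and every field name are assumptions made for illustration, not the fields of the real `KbHeading`, and the RDF property mappings are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeadingSketch:
    """Hypothetical stand-in for KbHeading; all field names are assumptions."""

    kb_id: str                               # URI identifying the entity in the graph
    text: str                                # heading text without the leading '#' marks
    level: int                               # 1-6, matching h1-h6
    start_line: int                          # position tracking: first line of the element
    end_line: int                            # position tracking: last line of the element
    parent_heading_id: Optional[str] = None  # hierarchy link, None for a top-level heading

    def __post_init__(self) -> None:
        # Reject values that cannot describe a markdown heading.
        if not 1 <= self.level <= 6:
            raise ValueError(f"heading level must be 1-6, got {self.level}")
        if self.end_line < self.start_line:
            raise ValueError("end_line must not precede start_line")


heading = HeadingSketch(
    kb_id="http://example.org/kb/notes.md/heading/L1-L1",  # placeholder URI
    text="Project Notes",
    level=1,
    start_line=1,
    end_line=1,
)
```

In the actual code base the same checks would presumably be expressed as Pydantic field constraints rather than a `__post_init__` hook.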

  2. Markdown Structure Processor (src/knowledgebase_processor/processor/markdown_structure_processor.py)
    Converts markdown elements to KB entities
    Maintains parent-child relationships (heading↔section, list↔items)
    Tracks position information (start/end line numbers)
    Uses deterministic ID generation based on position for reproducibility
    Provides statistics on extracted structure
  3. Integration (src/knowledgebase_processor/processor/entity_processor.py)
    Integrated into main processing pipeline
    Automatically extracts structure from all documents
    Processes alongside todos, wikilinks, and named entities
  4. ID Generation (src/knowledgebase_processor/utils/id_generator.py)
    Added generate_markdown_element_id() method
    Deterministic URIs based on element type and position
  5. Specification-Based Tests
    Created 5 new test cases in specs/test_cases/:
    markdown_structure_01_single_heading
    markdown_structure_02_code_block
    markdown_structure_03_list
    markdown_structure_04_table
    markdown_structure_05_blockquote
    Regenerated all 60 existing spec test outputs to include new entities
    Added scripts/regenerate_spec_outputs.py utility for batch updates
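
To make the idea of position-based deterministic IDs concrete, here is a minimal sketch. It is not the signature or URI layout of `generate_markdown_element_id()`; the function name `element_uri_sketch`, its parameters, the `BASE` namespace, and the URI pattern are all assumptions chosen for illustration.

```python
from urllib.parse import quote

BASE = "http://example.org/kb/"  # placeholder namespace (assumption)


def element_uri_sketch(
    document_path: str, element_type: str, start_line: int, end_line: int
) -> str:
    """Build a URI from element type and position only (hypothetical layout).

    No random component is involved, so the same document always yields
    the same URI for the same element.
    """
    doc = quote(document_path, safe="/")  # percent-encode spaces and similar characters
    return f"{BASE}{doc}/{element_type.lower()}/L{start_line}-L{end_line}"


first = element_uri_sketch("notes/daily.md", "Heading", 1, 1)
second = element_uri_sketch("notes/daily.md", "Heading", 1, 1)
other = element_uri_sketch("notes/daily.md", "Heading", 5, 5)
```

Because `first` and `second` are identical strings, expected outputs generated in one run still match in the next, which is what makes regenerated spec outputs stable. The trade-off is that an element's ID changes when content above it shifts its line numbers.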
Impact
All markdown structure elements are now fully represented in the knowledge graph with:

✅ Proper RDF types and Schema.org mappings
✅ Position metadata (start/end line numbers)
✅ Parent-child relationships
✅ Queryable via SPARQL
✅ Deterministic, reproducible entity IDs
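
One common way to derive parent-child relationships between headings is a stack keyed on heading level. The sketch below shows that general technique; it is an assumption about how such a hierarchy could be computed, not a copy of the processor's logic, and `assign_heading_parents` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def assign_heading_parents(levels: List[int]) -> List[Optional[int]]:
    """Return, for each heading index, the index of its parent heading (or None)."""
    parents: List[Optional[int]] = []
    stack: List[Tuple[int, int]] = []  # (level, index) of headings that are still open
    for index, level in enumerate(levels):
        # A heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parents.append(stack[-1][1] if stack else None)
        stack.append((level, index))
    return parents


# Heading levels in document order: h1, h2, h3, h2, h1
parents = assign_heading_parents([1, 2, 3, 2, 1])
# -> [None, 0, 1, 0, None]
```

Each heading is pushed and popped at most once, so the pass is linear in the number of headings.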
Test Plan

All 61 specification tests pass

RDF converter handles all new entity types

Deterministic ID generation ensures test reproducibility

Integration tests verify end-to-end processing

Spec tests use declarative approach per project standards
Testing Results
============================= test session starts ==============================
collected 61 items

tests/test_specifications.py::test_specifications PASSED x60
tests/test_specifications.py::test_test_cases_directory_exists PASSED

===================== 61 passed, 31 warnings in 1.51s =========================

Implements comprehensive markdown structure processing functionality
that converts markdown elements (headings, sections, lists, tables,
code blocks, and blockquotes) into RDF graph entities.

Changes:
- Add KB entity models for markdown structure elements (KbHeading,
  KbSection, KbList, KbListItem, KbTable, KbCodeBlock, KbBlockquote)
- Create MarkdownStructureProcessor to convert markdown elements
  to KB entities with proper relationships
- Integrate MarkdownStructureProcessor into EntityProcessor pipeline
- Add generate_markdown_element_id method to EntityIdGenerator
- Add comprehensive test coverage for all markdown structure types

All markdown structure elements are now processed into the RDF graph
with proper metadata including position information, nesting levels,
and parent-child relationships.

Tests: All 9 tests pass

Converts markdown structure processing tests to follow the project's
specification-driven testing methodology instead of unit tests.

Changes:
- Remove unit test file from tests/processor directory
- Create 5 new specification test cases for markdown structure:
  - markdown_structure_01_single_heading
  - markdown_structure_02_code_block
  - markdown_structure_03_list
  - markdown_structure_04_table
  - markdown_structure_05_blockquote
- Update markdown structure processor to use deterministic IDs based
  on position instead of random UUIDs for sections, lists, tables,
  and code blocks
- Regenerate all 60 spec test expected outputs to include new
  markdown structure entities in RDF graphs
- Add regenerate_spec_outputs.py script for batch updating test
  expectations when processor output changes

Test Results: All 61 specification tests pass

This aligns with the project's specification-driven testing approach
where behavior is captured in declarative artifacts (input.md and
expected_output.ttl files) rather than imperative Python test code.
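
A minimal sketch of how declarative cases like these could be discovered, assuming each case is a directory holding `input.md` and `expected_output.ttl`. The helper name `discover_spec_cases` is hypothetical, and comparing the produced graph against the expected Turtle file is deliberately left out.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_spec_cases(root: Path) -> List[Path]:
    """Return case directories that contain both declarative artifacts."""
    return sorted(
        case
        for case in root.iterdir()
        if case.is_dir()
        and (case / "input.md").is_file()
        and (case / "expected_output.ttl").is_file()
    )


# Usage against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    complete = root / "markdown_structure_01_single_heading"
    complete.mkdir()
    (complete / "input.md").write_text("# Title\n", encoding="utf-8")
    (complete / "expected_output.ttl").write_text("", encoding="utf-8")

    incomplete = root / "draft_case"  # no expected output yet, so it is skipped
    incomplete.mkdir()
    (incomplete / "input.md").write_text("text\n", encoding="utf-8")

    found = [case.name for case in discover_spec_cases(root)]
```

A test runner could then parametrize one test function over the discovered directories, so adding a new case means adding files, not Python code.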
@dstengle dstengle merged commit 04406d5 into main Nov 5, 2025
2 checks passed
@dstengle dstengle deleted the claude/markdown-to-graph-processor-011CUpuvxAn2hbzC5SaNFv8x branch November 5, 2025 19:11