Implementation Plan

Goal

Build an MVP local search and browse tool for python-github-backup output. The tool should help both humans and agents search issues, pull requests, discussions, releases, comments, reviews, and attachment metadata without requiring GitHub API access.

The MVP should be useful with this local backup:

C:\CodeBlocks\ggml-org-backup\backup

The project itself lives at:

C:\CodeBlocks\github-backup-browser

Non-goals for the MVP

  • No GitHub API calls.
  • No authentication.
  • No live sync.
  • No import of binary attachment contents into SQLite.
  • No OCR or indexing of images/files attached to issues/PRs/discussions.
  • No source-code repository indexing.
  • No multi-user permissions.
  • No Go implementation.
  • No raw JSON storage in the DB.

Because raw JSON is not stored, a field that was not imported can only be recovered by changing the schema and re-running the import.

User-facing defaults

The tool should be easy for agents to call with minimal arguments.

DB path

Resolve the database path as:

  1. GHBB_DB, if the environment variable is set and non-empty.
  2. ./ghbb.db, relative to the current working directory.

There should be no --db option in the MVP.
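The resolution rule above can be sketched as follows. This is a minimal illustration, not the final implementation: the `environ` and `cwd` parameters are assumptions added so the function can be exercised without touching the real environment.

```python
import os
from pathlib import Path


def get_db_path(environ=None, cwd=None):
    """Return the SQLite path: GHBB_DB if set and non-empty, else ./ghbb.db."""
    # Both parameters are hypothetical test hooks; callers normally omit them.
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)

    override = environ.get("GHBB_DB", "")
    if override:  # an unset and an empty variable both fall through
        return Path(override)
    return cwd / "ghbb.db"
```

Treating an empty `GHBB_DB` the same as an unset one matches rule 1 and avoids accidentally opening a database at an empty path.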

Backup root

ghbb import accepts one optional positional backup root:

ghbb import [BACKUP_ROOT]

If omitted, detect a backup root in this order:

  1. ./backup, if ./backup/repositories exists.
  2. ., if ./repositories exists.
  3. Otherwise fail with a clear error message.
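A possible shape for this detection is sketched below. The names `resolve_backup_root` and `BackupRootError` are hypothetical, and the sketch assumes "exists" means "is a directory"; an explicit root is validated the same way as a detected one.

```python
from pathlib import Path


class BackupRootError(Exception):
    """Raised when no usable backup root is found (hypothetical name)."""


def resolve_backup_root(explicit=None, cwd=None):
    """Return the backup root, preferring an explicit argument."""
    cwd = Path.cwd() if cwd is None else Path(cwd)

    if explicit is not None:
        root = Path(explicit)
        if not (root / "repositories").is_dir():
            raise BackupRootError(f"{root} does not contain repositories/")
        return root

    # Detection order from the plan: ./backup first, then the current directory.
    for candidate in (cwd / "backup", cwd):
        if (candidate / "repositories").is_dir():
            return candidate

    raise BackupRootError(
        f"No backup root found: expected {cwd / 'backup' / 'repositories'} "
        f"or {cwd / 'repositories'}. Pass BACKUP_ROOT explicitly."
    )
```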

Owner/org

There should be no --org option. Derive repository owner and repo name from JSON URLs. See importer.md.
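The authoritative derivation rules live in importer.md. As a rough illustration only, the sketch below assumes two URL shapes, `https://api.github.com/repos/OWNER/REPO/...` and `https://github.com/OWNER/REPO/...`; the function name `derive_owner_repo` and both shapes are assumptions, not something this plan specifies.

```python
from urllib.parse import urlparse


def derive_owner_repo(url):
    """Return (owner, repo) from a GitHub URL, or None if the shape is unknown."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    parts = [p for p in parsed.path.split("/") if p]

    # Assumed API shape: /repos/OWNER/REPO/...
    if host == "api.github.com" and len(parts) >= 3 and parts[0] == "repos":
        return parts[1], parts[2]
    # Assumed web shape: /OWNER/REPO/...
    if host == "github.com" and len(parts) >= 2:
        return parts[0], parts[1]
    return None
```

Returning `None` instead of guessing lets the importer report which file had an unrecognized URL.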

Recommended dependencies

Use uv to manage dependencies.

Initial command:

uv add typer rich fastapi "uvicorn[standard]" jinja2 markdown-it-py bleach

Recommended use:

  • typer: CLI command routing.
  • rich: CLI tables, progress bars, readable errors.
  • fastapi: local web app and JSON API.
  • uvicorn: development/local server.
  • jinja2: HTML templates.
  • markdown-it-py: Markdown rendering for issue/PR/discussion bodies.
  • bleach: sanitize rendered Markdown before serving HTML.
  • sqlite3: Python stdlib; no SQLAlchemy needed for MVP.
  • json, pathlib, datetime, re, urllib.parse: Python stdlib.

Before implementing FTS, confirm Python's SQLite has FTS5:

uv run python -c "import sqlite3; c=sqlite3.connect(':memory:'); c.execute('create virtual table t using fts5(x)'); print('fts5 ok')"
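The same probe can be wrapped in a function so the CLI can report a clear error at runtime instead of failing mid-import. This is a sketch; the name `has_fts5` is hypothetical.

```python
import sqlite3


def has_fts5():
    """Return True if this Python's SQLite build can create FTS5 tables."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE probe USING fts5(x)")
        return True
    except sqlite3.OperationalError:
        # Raised with "no such module: fts5" when FTS5 is not compiled in.
        return False
    finally:
        conn.close()
```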

Suggested project layout

Current project root is already a standalone uv project.

C:\CodeBlocks\github-backup-browser\
  pyproject.toml
  README.md
  docs/
  src/
    ghbb/
      __init__.py          # can keep script entry or delegate to cli.main
      cli.py               # Typer app
      config.py            # DB path and backup root resolution
      db.py                # SQLite connection, schema setup, transactions
      schema.sql           # SQL schema
      importer.py          # backup traversal and import orchestration
      normalizers.py       # issue/PR/discussion/release parsing
      search.py            # FTS query construction and search SQL
      render.py            # Markdown rendering/sanitization helpers
      web.py               # FastAPI app factory
      templates/
        base.html
        index.html
        search.html
        repo.html
        item.html
      static/
        app.css
  tests/
    test_repo_derivation.py
    test_normalizers.py
    test_search_query.py

Update the script entry in pyproject.toml if needed:

[project.scripts]
ghbb = "ghbb.cli:main"

Import strategy

For the MVP, a full rebuild on each import is acceptable and recommended for simplicity.

Suggested import flow:

  1. Resolve DB path from GHBB_DB or ./ghbb.db.
  2. Resolve backup root.
  3. Validate backup root contains repositories/.
  4. Open SQLite.
  5. Ensure schema exists.
  6. In one transaction:
    • clear imported tables;
    • insert one import_runs row;
    • scan all repositories;
    • import labels;
    • import issues;
    • import pull requests;
    • import discussions;
    • import releases and release asset metadata;
    • import attachment manifests as metadata;
    • rebuild FTS documents.
  7. Print counts.

This avoids subtle incremental import bugs. The current test backup has about 30k issue/PR/discussion/release JSON files, which is small enough for a full re-import on every run.

A later version can add incremental import using source file size/mtime/hash columns.
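The transactional core of the flow above (step 6) might look like the sketch below. The `items` and `import_runs` columns are placeholders, since the real schema comes from schema.md, and the FTS rebuild is omitted for brevity; it would run inside the same `with conn:` block. The plan does not say whether `import_runs` history is kept across rebuilds; this sketch keeps it.

```python
import sqlite3
from datetime import datetime, timezone

# Placeholder schema for illustration only; the real tables come from schema.md.
SCHEMA = """
CREATE TABLE IF NOT EXISTS import_runs (
    id INTEGER PRIMARY KEY,
    started_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY,
    repo TEXT NOT NULL,
    kind TEXT NOT NULL,
    number INTEGER NOT NULL,
    title TEXT NOT NULL
);
"""


def rebuild(conn, rows):
    """Replace all imported items with `rows` in a single transaction."""
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM items")
        conn.execute(
            "INSERT INTO import_runs (started_at) VALUES (?)",
            (datetime.now(timezone.utc).isoformat(),),
        )
        conn.executemany(
            "INSERT INTO items (repo, kind, number, title) VALUES (?, ?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Because the delete and the inserts share one transaction, a failure partway through leaves the previous import intact rather than an empty or half-filled database.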

Implementation phases

Phase 1: CLI skeleton and DB config

Deliver:

uv run ghbb --help
uv run ghbb db-path

Tasks:

  • Add Typer CLI.
  • Implement get_db_path() using GHBB_DB or ./ghbb.db.
  • Implement backup root resolution.
  • Add basic error formatting.

Acceptance:

  • uv run ghbb --help works.
  • uv run ghbb db-path prints the resolved database path.
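  • With GHBB_DB set to C:\tmp\test.db in the environment, uv run ghbb db-path prints the override. (The inline VAR=value command prefix is POSIX shell syntax; in PowerShell, set $env:GHBB_DB first.)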

db-path is not essential for end users, but it is useful for agents and smoke tests.

Phase 2: Schema and import

Deliver:

uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats

Tasks:

  • Add schema from schema.md.
  • Implement normalizers from importer.md.
  • Import item/comment/label/attachment metadata.
  • Import release asset metadata.
  • Rebuild FTS.
  • Print import summary.

Acceptance:

  • Import completes without reading binary attachment contents.
  • stats shows expected repo and item counts from testing.md.
  • Re-running import produces the same counts.

Phase 3: CLI search and show

Deliver:

uv run ghbb search "Fabrice Bellard"
uv run ghbb show ggml issue 1
uv run ghbb show llama.cpp pull 10001
uv run ghbb search "KV cache" --json
uv run ghbb show ggml issue 1 --json

Tasks:

  • Implement FTS query builder.
  • Implement filters: --repo, --kind, --state, --author, --label, --limit, --offset, --json.
  • Implement show by repo/kind/key.
  • Format human output with Rich.
  • Return stable JSON for agents.

Acceptance:

  • Search finds matches in item bodies and comments.
  • Results link comment/review hits to parent item.
  • JSON output includes IDs needed for follow-up calls.
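One concern for the FTS query builder is that raw user input is not always a valid FTS5 MATCH expression: characters such as `"`, `-`, `:` and `*` have special meaning. A minimal sketch, assuming each whitespace-separated term should be matched literally and all terms must be present, is to quote every term as an FTS5 string. The name `build_match_query` is hypothetical.

```python
def build_match_query(user_input):
    """Turn free-form text into an FTS5 MATCH expression, or None if empty.

    Each whitespace-separated term becomes a quoted FTS5 string, with any
    embedded double quote doubled. Adjacent strings are implicitly ANDed.
    """
    terms = user_input.split()
    if not terms:
        return None  # caller should skip the MATCH clause entirely
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)
```

For example, `build_match_query("KV cache")` yields `"KV" "cache"`, which is then passed as a bound parameter (`WHERE fts MATCH ?`) and never interpolated into SQL. Supporting exact phrases or prefix search would need a richer parser than this sketch.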

Phase 4: Local web app

Deliver:

uv run ghbb serve

Tasks:

  • Add FastAPI app.
  • Add HTML pages:
    • home/search page;
    • search results;
    • repo browse;
    • item thread view.
  • Add JSON API endpoints from cli-web-api.md.
  • Render Markdown safely.
  • Do not serve local attachment files by default.

Acceptance:

  • Browser opens at http://127.0.0.1:8765/.
  • Search works in the browser.
  • Item page shows title, metadata, body, comments/reviews/replies, labels, and attachment metadata.
  • API endpoints return JSON useful for agents.

Phase 5: Polish

Tasks:

  • Add tests for repo derivation and normalizers.
  • Add README usage examples.
  • Add clear error messages for missing FTS5, bad backup roots, empty DB, malformed JSON.
  • Add import progress bars.
  • Add pagination for web and CLI.
  • Add basic performance pragmas.

SQLite performance recommendations

During import:

PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA temp_store = MEMORY;
PRAGMA foreign_keys = ON;

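Use executemany for bulk inserts where practical. Keep the import inside a single transaction, and issue the pragmas above before it begins: SQLite ignores PRAGMA foreign_keys inside an open transaction and cannot change journal_mode while one is active.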

For full rebuild import, clear tables in dependency order or disable foreign keys only if necessary. Prefer ON DELETE CASCADE and simple deletes from root tables.
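A sketch of how these recommendations could fit together is shown below. The helper names `open_for_import` and `clear_imported` are hypothetical, and the table names passed in are assumed to be trusted constants from the schema, never user input.

```python
import sqlite3

IMPORT_PRAGMAS = (
    "PRAGMA journal_mode = WAL",
    "PRAGMA synchronous = NORMAL",
    "PRAGMA temp_store = MEMORY",
    "PRAGMA foreign_keys = ON",
)


def open_for_import(path):
    """Open the database and apply pragmas before any transaction starts."""
    conn = sqlite3.connect(path)
    for pragma in IMPORT_PRAGMAS:
        conn.execute(pragma)
    return conn


def clear_imported(conn, root_tables):
    """Delete from root tables; ON DELETE CASCADE removes dependent rows."""
    with conn:
        for table in root_tables:
            # Table names are trusted schema constants, not user input.
            conn.execute(f"DELETE FROM {table}")
```

With `foreign_keys = ON` and child tables declared with `ON DELETE CASCADE`, deleting from the root tables is enough to clear comments, reviews, and other dependent rows without tracking the dependency order by hand.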

Security/safety notes

The web app is a local browser for untrusted GitHub-authored content.

  • Bind to 127.0.0.1 by default.
  • Escape all plain text.
  • Sanitize Markdown-rendered HTML with bleach.
  • Do not execute scripts from GitHub content.
  • Do not render raw bodyHTML from GitHub discussion JSON without sanitizing.
  • Do not serve downloaded attachment files in the MVP.
  • Link original attachment URLs as external links.
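For values rendered outside the Markdown pipeline (titles, author names, attachment file names), the stdlib is enough to illustrate the escaping and external-link rules above. This is a sketch with a hypothetical helper name, `external_link`; in the real templates Jinja2 autoescaping would normally cover plain text, and bleach still handles rendered Markdown.

```python
from html import escape
from urllib.parse import urlparse


def external_link(url, label):
    """Return an <a> tag for http(s) URLs, or escaped plain text otherwise."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        # Refuse javascript:, data:, file: and similar schemes.
        return escape(label)
    return (
        f'<a href="{escape(url, quote=True)}" '
        f'rel="noopener noreferrer" target="_blank">{escape(label)}</a>'
    )
```

Restricting links to `http` and `https` keeps GitHub-authored content from injecting `javascript:` URLs, and escaping the label prevents markup in file names from being interpreted.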

Definition of done for MVP

A different agent should be able to run:

cd C:\CodeBlocks\github-backup-browser
uv sync
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats
uv run ghbb search "Fabrice Bellard"
uv run ghbb show ggml issue 1
uv run ghbb serve

And get:

  • correct import counts;
  • useful CLI search results;
  • useful JSON output with --json;
  • a local web UI that can search and browse threads.