Testing and Handoff Notes

Local paths

Project root:

C:\CodeBlocks\github-backup-browser

Test backup root:

C:\CodeBlocks\ggml-org-backup\backup

Git Bash path:

/c/CodeBlocks/ggml-org-backup/backup

Backup script:

C:\CodeBlocks\ggml-org-backup\backup.py

Backup tool source:

C:\CodeBlocks\python-github-backup

Current project state

At the time this plan was written, the project contains a minimal uv Python package:

pyproject.toml
README.md
src/ghbb/__init__.py

pyproject.toml currently has:

[project]
name = "ghbb"
requires-python = ">=3.12"
dependencies = []

[project.scripts]
ghbb = "ghbb:main"

Implementation should likely change the script entry to:

[project.scripts]
ghbb = "ghbb.cli:main"

Test backup expected counts

These counts were observed from C:\CodeBlocks\ggml-org-backup\backup.

Repositories

ggml-org/ggml
ggml-org/llama.cpp

The repository directories are named:

backup/repositories/ggml
backup/repositories/llama.cpp

The directory names omit the owner, so the owner (ggml-org) must be derived from the URLs inside the JSON files.
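A minimal sketch of deriving owner and repo from a URL, assuming the item JSON carries either a REST API URL (with a `repos` path segment) or a web URL of the form `https://<host>/<owner>/<repo>/...`. The function name `repo_from_url` is hypothetical and not part of the existing package.

```python
from urllib.parse import urlparse


def repo_from_url(url: str) -> tuple[str, str]:
    """Return (owner, repo) from a GitHub API or web URL (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # API URLs put owner/repo after a "repos" segment, including
    # enterprise-style paths such as /api/v3/repos/<owner>/<repo>/...
    # Assumption: no web URL has an owner literally named "repos".
    if "repos" in parts:
        parts = parts[parts.index("repos") + 1:]
    if len(parts) < 2:
        raise ValueError(f"cannot derive owner/repo from {url!r}")
    return parts[0], parts[1]


print(repo_from_url("https://api.github.com/repos/ggml-org/ggml/issues/1"))
# ('ggml-org', 'ggml')
print(repo_from_url("https://github.com/ggml-org/llama.cpp/discussions/32"))
# ('ggml-org', 'llama.cpp')
```

Working from the URL path rather than the host name means the same logic covers github.com and enterprise-style hosts.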

Item JSON files

Per repo:

ggml:
  issues:       531
  pulls:        792
  discussions:  137
  releases:      11
  labels files:   1

llama.cpp:
  issues:       7903
  pulls:        11236
  discussions:  3177
  releases:     5941
  labels files:    1

Totals:

issues:       8434
pulls:        12028
discussions:  3314
releases:     5952
items total:  29728
labels files:     2

Attachment manifests and metadata rows

Manifest counts:

ggml issues:          43
ggml pulls:           28
ggml discussions:     12
llama.cpp issues:   1549
llama.cpp pulls:    1172
llama.cpp discussions: 458
manifest total:     3262

Attachment entries inside manifests:

ggml issues:           71
ggml pulls:            53
ggml discussions:      24
llama.cpp issues:    3708
llama.cpp pulls:     3633
llama.cpp discussions:1158
attachment rows:     8647

The size_bytes values in the manifests total about 3.75 GB. The importer must store only the manifest metadata; it must not read the attachment files or import them as BLOBs.
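One way to import attachment metadata without touching the files themselves is sketched below. The `attachments` list and `size_bytes` field come from the manifests described above; the `saved_as` field name and the `manifest_rows` helper are assumptions made for illustration and should be checked against a real manifest.

```python
import json
from pathlib import Path


def manifest_rows(manifest_path: Path) -> list[dict]:
    """Turn one manifest.json into metadata rows without opening attachments."""
    data = json.loads(manifest_path.read_text(encoding="utf-8"))
    rows = []
    for entry in data.get("attachments") or []:
        name = entry.get("saved_as")  # hypothetical field name
        rows.append({
            "manifest": str(manifest_path),
            # The local path is built from strings only; the file is never opened.
            "local_path": str(manifest_path.parent / name) if name else None,
            "size_bytes": int(entry.get("size_bytes") or 0),
        })
    return rows
```

Because only manifest.json is read, the cost per manifest depends on the number of entries, not on the size of the downloaded files.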

Important note about .json under attachments

There are downloaded attachment files with .json extensions. A naive recursive *.json scan will over-count and may try to parse downloaded attachment content as backup metadata.

Only parse item files in known directories and attachments/*/manifest.json.
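A sketch of a scanner that follows this rule, assuming the layout implied by the counting scripts in these notes: `<repo>/<kind>/<item>.json` for items and `<repo>/<kind>/attachments/<n>/manifest.json` for manifests. The names `ITEM_KINDS` and `iter_backup_files` are hypothetical, and labels files are left out here on the assumption that they are imported separately.

```python
from pathlib import Path
from typing import Iterator

ITEM_KINDS = ("issues", "pulls", "discussions", "releases")


def iter_backup_files(repositories: Path) -> Iterator[tuple[str, str, str, Path]]:
    """Yield (repo, kind, role, path) for item files and attachment manifests."""
    for repo_dir in sorted(p for p in repositories.iterdir() if p.is_dir()):
        for kind in ITEM_KINDS:
            kind_dir = repo_dir / kind
            if not kind_dir.is_dir():
                continue
            # Non-recursive glob: never descends into attachments/.
            for path in sorted(kind_dir.glob("*.json")):
                if path.is_file():
                    yield repo_dir.name, kind, "item", path
            # Exactly one level below attachments/, and only manifest.json,
            # so downloaded *.json attachments are never picked up.
            for path in sorted(kind_dir.glob("attachments/*/manifest.json")):
                yield repo_dir.name, kind, "manifest", path
```

Using explicit, non-recursive globs instead of `rglob('*.json')` makes the over-counting problem impossible by construction rather than relying on a filter.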

Manual count commands

From the backup project directory. These commands are written for Git Bash; PowerShell equivalents may differ.

cd /c/CodeBlocks/ggml-org-backup

Count item JSON files:

find backup/repositories -maxdepth 3 -type f -name '*.json' \
  | grep -v '/attachments/' \
  | sort \
  | wc -l

Count per top-level content type with Python:

python - <<'PY'
from collections import Counter
from pathlib import Path

root = Path('C:/CodeBlocks/ggml-org-backup/backup/repositories')
counts = Counter()
for p in root.rglob('*.json'):
    parts = p.relative_to(root).parts  # (repo, kind, ...)
    if len(parts) >= 4 and parts[2] == 'attachments':
        # Under attachments/, count manifests only; skip downloaded *.json files.
        if p.name == 'manifest.json':
            counts[f'{parts[0]}/{parts[1]}/manifest'] += 1
    elif len(parts) >= 2:
        counts[f'{parts[0]}/{parts[1]}'] += 1
print(counts)
PY

Count attachment entries:

python - <<'PY'
import json
from collections import Counter
from pathlib import Path

root = Path('C:/CodeBlocks/ggml-org-backup/backup/repositories')
manifests = Counter()
attachments = Counter()
bytes_total = 0
for p in root.rglob('manifest.json'):
    parts = p.relative_to(root).parts
    if len(parts) >= 4 and parts[2] == 'attachments':
        key = f'{parts[0]}/{parts[1]}'
    else:
        key = 'other'  # a manifest.json outside the expected attachments layout
    manifests[key] += 1
    # read_text closes the file; the attachments themselves are never opened.
    data = json.loads(p.read_text(encoding='utf-8'))
    for a in data.get('attachments') or []:
        attachments[key] += 1
        bytes_total += int(a.get('size_bytes') or 0)
print('manifests', manifests, sum(manifests.values()))
print('attachments', attachments, sum(attachments.values()), 'bytes', bytes_total)
PY

Smoke test sequence

Run from the project root:

cd C:\CodeBlocks\github-backup-browser
uv sync
uv run ghbb --help
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats
uv run ghbb search "Fabrice Bellard"
uv run ghbb show ggml issue 1
uv run ghbb show llama.cpp pull 10001
uv run ghbb search "KV cache" --repo llama.cpp --json
uv run ghbb serve

With an explicit DB path via the GHBB_DB environment variable (cmd.exe):

set GHBB_DB=C:\tmp\ghbb-test.db
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats

PowerShell:

$env:GHBB_DB = "C:\tmp\ghbb-test.db"
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats

Git Bash:

GHBB_DB=/c/tmp/ghbb-test.db uv run ghbb import /c/CodeBlocks/ggml-org-backup/backup
GHBB_DB=/c/tmp/ghbb-test.db uv run ghbb stats
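On the implementation side, honoring GHBB_DB could look like the sketch below. The default file name `ghbb.db` and the helper `resolve_db_path` are hypothetical; these notes do not specify where the database lives when the variable is unset.

```python
import os
from pathlib import Path

DEFAULT_DB = Path("ghbb.db")  # hypothetical default, not specified in these notes


def resolve_db_path(env=None) -> Path:
    """Return the DB path from GHBB_DB, falling back to the assumed default."""
    env = os.environ if env is None else env
    value = env.get("GHBB_DB", "").strip()
    # An unset or blank variable means "use the default".
    return Path(value) if value else DEFAULT_DB
```

Taking the environment as an optional argument keeps the function easy to test without modifying the real process environment.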

Expected smoke test content

ghbb search "Fabrice Bellard"

Should find at least:

repo:  ggml-org/ggml
kind:  issue
key:   1
title: Implement gpt2tc/nncp with ggml

The issue body mentions Fabrice Bellard, gpt2tc, libnc, and nncp.

ghbb show ggml issue 1

Should show:

  • issue #1 title;
  • issue body;
  • closed state;
  • comments including a comment by ggerganov;
  • labels, if imported.

ghbb show llama.cpp pull 10001

Should show:

  • PR #10001 title/body;
  • PR labels;
  • review comments from comment_data;
  • reviews from review_data;
  • regular comments from comment_regular_data if any.

ghbb show ggml discussion 32

Should show a discussion with GraphQL-shaped fields.

ghbb show llama.cpp release b1046

Should show release metadata and release asset metadata.

Automated tests to add

Use pytest if adding test dependencies:

uv add --dev pytest
uv run pytest

Recommended unit tests:

  1. test_repo_from_api_url
    • https://api.github.com/repos/ggml-org/ggml/issues/1 -> ggml-org, ggml
    • enterprise-like host also works.
  2. test_repo_from_web_url
    • https://github.com/ggml-org/llama.cpp/discussions/32 -> ggml-org, llama.cpp
  3. test_normalize_issue_sample
    • sample ggml/issues/1.json -> issue #1, repo ggml-org/ggml, comments > 0.
  4. test_normalize_pull_sample
    • sample llama.cpp/pulls/10001.json -> PR #10001, review data present.
  5. test_normalize_discussion_sample
    • sample ggml/discussions/32.json -> discussion #32.
  6. test_normalize_release_sample
    • sample llama.cpp/releases/b1046.json -> release item key b1046, assets present.
  7. test_attachment_manifest_sample
    • sample manifest imports metadata and computes local path without reading image files.
  8. test_fts_query_builder
    • handles KV cache, quoted phrases, punctuation-heavy code identifiers.

Integration test:

  • Create temp DB with GHBB_DB pointing to a temporary file.
  • Import a tiny fixture backup with one repo and a few files.
  • Run search and show functions against it.

Avoid making the full 5.2 GB backup part of automated tests. Use it for manual/integration smoke tests only.

Edge cases to handle

  • Empty body fields.
  • Deleted discussion comments with null/empty body.
  • Missing users/authors.
  • Labels missing description or color.
  • Releases without assets.
  • Release tags with slashes or unusual characters.
  • Repo names with dots, e.g. llama.cpp.
  • Markdown tables and code fences.
  • FTS queries containing +, #, ., /, _, -, :.
  • Attachment manifest exists but parent item JSON is absent.
  • Downloaded attachment file has .json extension.
  • Discussion bodyHTML exists but should not be trusted unsanitized.
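For the FTS punctuation case in the list above, one defensive approach is to quote every user term before it reaches the MATCH expression. This sketch assumes the index is SQLite FTS5, where a double-quoted string is treated as literal text and an embedded double quote is escaped by doubling it. The function name `build_fts_query` is hypothetical.

```python
import re

# A double-quoted phrase, or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def build_fts_query(text: str) -> str:
    """Quote each term so characters like + # . / _ - : are not parsed as syntax."""
    terms = []
    for phrase, word in _TOKEN.findall(text):
        term = (phrase or word).strip()
        if not term:
            continue
        # Escape embedded double quotes SQL-style, then wrap the term in quotes.
        terms.append('"' + term.replace('"', '""') + '"')
    # Space-separated quoted strings are combined with implicit AND in FTS5.
    return " ".join(terms)


print(build_fts_query("KV cache"))           # "KV" "cache"
print(build_fts_query('"Fabrice Bellard"'))  # "Fabrice Bellard"
print(build_fts_query("llama.cpp #123"))     # "llama.cpp" "#123"
```

Quoting makes the query syntactically safe; whether a term such as `#123` actually matches still depends on how the chosen tokenizer splits punctuation.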

Performance checks

The full backup is about 5.2 GB, mostly because of attachments. Import time should scale with the number of item and manifest files, not with attachment byte size.

Performance sanity checks:

  • Import should not open image/binary attachment files.
  • Memory use should not grow with full backup size; parse files one at a time.
  • SQLite inserts should be batched inside one transaction.
  • FTS rebuild should be done once per import, not after every item.
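The streaming and batching checks above can be sketched with the standard sqlite3 module. The `items` table, its columns, and the function names are hypothetical stand-ins for whatever schema the implementation chooses; only the shape (a generator feeding `executemany` inside one transaction) is the point.

```python
import json
import sqlite3
from pathlib import Path
from typing import Iterable, Iterator


def iter_rows(paths: Iterable[Path]) -> Iterator[tuple[str, str]]:
    """Parse one JSON file at a time so memory use stays flat."""
    for path in paths:
        data = json.loads(path.read_text(encoding="utf-8"))
        yield str(path), data.get("title") or ""


def import_items(conn: sqlite3.Connection, paths: Iterable[Path]) -> int:
    """Insert all rows in a single transaction and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (path TEXT PRIMARY KEY, title TEXT)"
    )
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO items (path, title) VALUES (?, ?)",
            iter_rows(paths),
        )
    # A full-text index rebuild, if the schema has one, would run here:
    # once per import, after all rows are in place.
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

`INSERT OR REPLACE` keyed on the file path also makes a repeated import idempotent in this sketch, which is convenient for smoke testing.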

Done criteria for another agent

An implementation is acceptable when:

  1. uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup completes.
  2. uv run ghbb stats --json reports:
    • 2 repositories;
    • 8,434 issues;
    • 12,028 pulls;
    • 3,314 discussions;
    • 5,952 releases;
    • 8,647 attachment metadata rows.
  3. uv run ghbb search "Fabrice Bellard" finds ggml issue 1.
  4. uv run ghbb show ggml issue 1 --json returns the complete issue body and comments.
  5. uv run ghbb serve starts on 127.0.0.1:8765 and supports /api/search and /api/items/{id}.
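
The count criteria in item 2 can be checked mechanically. This is a sketch only: the key names in `EXPECTED` are hypothetical, since these notes do not define the JSON shape of `ghbb stats --json`, and would need to be adjusted to the real output.

```python
# Expected counts from the done criteria; key names are assumed, not specified.
EXPECTED = {
    "repositories": 2,
    "issues": 8434,
    "pulls": 12028,
    "discussions": 3314,
    "releases": 5952,
    "attachments": 8647,
}


def count_mismatches(stats: dict) -> list[str]:
    """Return one message per count that differs from the expected value."""
    problems = []
    for key, want in EXPECTED.items():
        got = stats.get(key)
        if got != want:
            problems.append(f"{key}: expected {want}, got {got}")
    return problems
```

An empty list means every count matches; the stats dictionary would come from parsing the command's JSON output with `json.loads`.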