Project root:
C:\CodeBlocks\github-backup-browser
Test backup root:
C:\CodeBlocks\ggml-org-backup\backup
Git Bash path:
/c/CodeBlocks/ggml-org-backup/backup
Backup script:
C:\CodeBlocks\ggml-org-backup\backup.py
Backup tool source:
C:\CodeBlocks\python-github-backup
At the time this plan was written, the project contains a minimal uv Python package:
pyproject.toml
README.md
src/ghbb/__init__.py
pyproject.toml currently has:
[project]
name = "ghbb"
requires-python = ">=3.12"
dependencies = []
[project.scripts]
ghbb = "ghbb:main"
Implementation should likely change the script entry to:
[project.scripts]
ghbb = "ghbb.cli:main"
These counts were observed from C:\CodeBlocks\ggml-org-backup\backup.
ggml-org/ggml
ggml-org/llama.cpp
The repository directories are named:
backup/repositories/ggml
backup/repositories/llama.cpp
The directory names do not include the owner, so the owner (ggml-org) must be derived from the URLs inside the JSON files.
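A minimal sketch of that derivation, assuming the item JSON carries either an API URL (api.github.com/repos/...) or a web URL (github.com/...). The function name repo_from_url is hypothetical, and the /api/v3/ prefix covers the usual GitHub Enterprise Server layout rather than anything observed in this backup.

```python
from urllib.parse import urlparse


def repo_from_url(url: str) -> tuple[str, str]:
    """Return (owner, repo) parsed from a GitHub API or web URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments[:1] == ["repos"]:
        # api.github.com form: /repos/{owner}/{repo}/...
        segments = segments[1:]
    elif segments[:3] == ["api", "v3", "repos"]:
        # Assumed enterprise form: /api/v3/repos/{owner}/{repo}/...
        segments = segments[3:]
    if len(segments) < 2:
        raise ValueError(f"no owner/repo in URL: {url!r}")
    # Web URLs are already /{owner}/{repo}/...
    return segments[0], segments[1]
```

Splitting on path segments rather than using a regular expression keeps repo names with dots, such as llama.cpp, intact.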
Per repo:
ggml:
issues: 531
pulls: 792
discussions: 137
releases: 11
labels files: 1
llama.cpp:
issues: 7903
pulls: 11236
discussions: 3177
releases: 5941
labels files: 1
Totals:
issues: 8434
pulls: 12028
discussions: 3314
releases: 5952
items total: 29728
labels files: 2
Manifest counts:
ggml issues: 43
ggml pulls: 28
ggml discussions: 12
llama.cpp issues: 1549
llama.cpp pulls: 1172
llama.cpp discussions: 458
manifest total: 3262
Attachment entries inside manifests:
ggml issues: 71
ggml pulls: 53
ggml discussions: 24
llama.cpp issues: 3708
llama.cpp pulls: 3633
llama.cpp discussions: 1158
attachment rows: 8647
The size_bytes values in the manifests total about 3.75 GB. The importer must record attachment metadata only; it must not read those files or import them as BLOBs.
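One way to keep the import metadata-only is to read nothing but manifest.json. The sketch below assumes the manifest has a top-level attachments list whose entries carry size_bytes (as the counting script in this plan does) and that the manifest's parent directory name identifies the item. The function name manifest_rows and the returned keys are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def manifest_rows(manifest_path: Path) -> Iterator[dict]:
    """Yield one metadata dict per attachment entry; attachment files are never opened."""
    data = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Assumption: attachments/<item>/manifest.json, so the parent name is the item key.
    item_key = manifest_path.parent.name
    for entry in data.get("attachments") or []:
        yield {
            "item_key": item_key,
            "size_bytes": int(entry.get("size_bytes") or 0),
            # Keep the untouched entry so unknown fields are not lost.
            "raw_json": json.dumps(entry, sort_keys=True),
        }
```

Storing the raw entry alongside the few parsed columns avoids guessing at manifest fields that have not been inspected yet.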
There are downloaded attachment files with .json extensions. A naive recursive *.json scan will over-count and may try to parse downloaded attachment content as backup metadata.
Only parse item files in known directories and attachments/*/manifest.json.
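A sketch of such a restricted scan, assuming the layout repositories/<repo>/<kind>/<file>.json with manifests under <kind>/attachments/<item>/manifest.json. The kind list and the function name iter_metadata_files are assumptions; the labels directory is left out because its name is not stated here.

```python
from pathlib import Path
from typing import Iterator

# Assumed content-type directories; extend if the backup contains others.
ITEM_KINDS = ("issues", "pulls", "discussions", "releases")


def iter_metadata_files(repositories: Path) -> Iterator[tuple[str, str, str, Path]]:
    """Yield (repo, kind, role, path) for item files and attachment manifests only."""
    for repo_dir in sorted(p for p in repositories.iterdir() if p.is_dir()):
        for kind in ITEM_KINDS:
            kind_dir = repo_dir / kind
            if not kind_dir.is_dir():
                continue
            # Item files sit directly in the kind directory, so no recursion.
            for path in sorted(kind_dir.glob("*.json")):
                if path.is_file():
                    yield repo_dir.name, kind, "item", path
            # manifest.json is the only file read below attachments/.
            for path in sorted(kind_dir.glob("attachments/*/manifest.json")):
                yield repo_dir.name, kind, "manifest", path
```

Because nothing below attachments/ other than manifest.json matches either pattern, a downloaded attachment that ends in .json is never parsed.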
From the backup project directory:
cd /c/CodeBlocks/ggml-org-backup
These commands are written for Git Bash; PowerShell equivalents may differ.
Count item JSON files:
find backup/repositories -maxdepth 3 -type f -name '*.json' \
| grep -v '/attachments/' \
| sort \
| wc -l
Count per top-level content type with Python:
python - <<'PY'
from pathlib import Path
from collections import Counter
root=Path('C:/CodeBlocks/ggml-org-backup/backup/repositories')
c=Counter()
for p in root.rglob('*.json'):
    rel=p.relative_to(root)
    parts=rel.parts
    if len(parts)>=4 and parts[2]=='attachments':
        if p.name == 'manifest.json':
            c[f'{parts[0]}/{parts[1]}/manifest'] += 1
    elif len(parts)>=2:
        c[f'{parts[0]}/{parts[1]}'] += 1
print(c)
PY
Count attachment entries:
python - <<'PY'
import json
from pathlib import Path
from collections import Counter
root=Path('C:/CodeBlocks/ggml-org-backup/backup/repositories')
mc=Counter(); ac=Counter(); bytes_total=0
for p in root.rglob('manifest.json'):
    rel=p.relative_to(root); parts=rel.parts
    if len(parts)>=4 and parts[2]=='attachments':
        key=f'{parts[0]}/{parts[1]}'
    else:
        key='other'
    mc[key]+=1
    data=json.loads(p.read_text(encoding='utf-8'))
    for a in data.get('attachments') or []:
        ac[key]+=1
        bytes_total += int(a.get('size_bytes') or 0)
print('manifests', mc, sum(mc.values()))
print('attachments', ac, sum(ac.values()), 'bytes', bytes_total)
PY
Run from the project root:
cd C:\CodeBlocks\github-backup-browser
uv sync
uv run ghbb --help
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats
uv run ghbb search "Fabrice Bellard"
uv run ghbb show ggml issue 1
uv run ghbb show llama.cpp pull 10001
uv run ghbb search "KV cache" --repo llama.cpp --json
uv run ghbb serve
With an explicit DB path via an environment variable (cmd.exe):
set GHBB_DB=C:\tmp\ghbb-test.db
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats
PowerShell:
$env:GHBB_DB = "C:\tmp\ghbb-test.db"
uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup
uv run ghbb stats
Git Bash:
GHBB_DB=/c/tmp/ghbb-test.db uv run ghbb import /c/CodeBlocks/ggml-org-backup/backup
GHBB_DB=/c/tmp/ghbb-test.db uv run ghbb stats
The search for "Fabrice Bellard" should find at least:
repo: ggml-org/ggml
kind: issue
key: 1
title: Implement gpt2tc/nncp with ggml
The issue body mentions Fabrice Bellard, gpt2tc, libnc, and nncp.
uv run ghbb show ggml issue 1 should show:
- issue #1 title;
- issue body;
- closed state;
- comments, including a comment by ggerganov;
- labels, if imported.
uv run ghbb show llama.cpp pull 10001 should show:
- PR #10001 title/body;
- PR labels;
- review comments from comment_data;
- reviews from review_data;
- regular comments from comment_regular_data, if any.
Showing a discussion should display a discussion with GraphQL-shaped fields.
Showing a release should display release metadata and release asset metadata.
Use pytest if adding test dependencies:
uv add --dev pytest
uv run pytest
Recommended unit tests:
- test_repo_from_api_url
  - https://api.github.com/repos/ggml-org/ggml/issues/1 -> ggml-org, ggml
  - enterprise-like host also works.
- test_repo_from_web_url
  - https://github.com/ggml-org/llama.cpp/discussions/32 -> ggml-org, llama.cpp
- test_normalize_issue_sample
  - sample ggml/issues/1.json -> issue #1, repo ggml-org/ggml, comments > 0.
- test_normalize_pull_sample
  - sample llama.cpp/pulls/10001.json -> PR #10001, review data present.
- test_normalize_discussion_sample
  - sample ggml/discussions/32.json -> discussion #32.
- test_normalize_release_sample
  - sample llama.cpp/releases/b1046.json -> release item key b1046, assets present.
- test_attachment_manifest_sample
  - sample manifest imports metadata and computes local path without reading image files.
- test_fts_query_builder
  - handles KV cache, quoted phrases, punctuation-heavy code identifiers.
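For the FTS query builder, one possible approach is to quote every term so that punctuation is never interpreted as FTS5 query syntax. This is a sketch under the assumption that the index is an SQLite FTS5 table and that terms are combined with implicit AND; the name build_fts_query is hypothetical.

```python
import re

# Either a double-quoted phrase or a run of non-whitespace characters.
_TERM_RE = re.compile(r'"([^"]*)"|(\S+)')


def build_fts_query(user_query: str) -> str:
    """Turn free-form user input into a safely quoted FTS5 MATCH expression."""
    terms = []
    for phrase, word in _TERM_RE.findall(user_query):
        text = (phrase if phrase else word).strip()
        if not text:
            continue
        # FTS5 strings are double-quoted; embedded quotes are escaped by doubling.
        terms.append('"' + text.replace('"', '""') + '"')
    # Space-separated strings are combined with implicit AND by FTS5.
    return " ".join(terms)
```

With this sketch, KV cache becomes "KV" "cache", a quoted phrase is passed through as a single phrase, and an identifier such as ggml_init() is wrapped whole and left to the tokenizer to split.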
Integration test:
- Create temp DB with GHBB_DB pointing to a temporary file.
- Import a tiny fixture backup with one repo and a few files.
- Run search and show functions against it.
Avoid making the full 5.2 GB backup part of automated tests. Use it for manual/integration smoke tests only.
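A tiny fixture backup could be generated at test time rather than checked in. The sketch below is illustrative only: the helper name make_fixture_backup is hypothetical, the issue fields follow the GitHub REST shape, and storing issue comments under comment_data is an assumption about the backup tool's output.

```python
import json
from pathlib import Path


def make_fixture_backup(root: Path) -> Path:
    """Create a one-repo backup tree with one issue and one attachment manifest."""
    issues = root / "repositories" / "ggml" / "issues"
    attachment_dir = issues / "attachments" / "1"
    attachment_dir.mkdir(parents=True)
    issue = {
        "number": 1,
        "title": "Implement gpt2tc/nncp with ggml",
        "state": "closed",
        "body": "Mentions Fabrice Bellard.",
        "url": "https://api.github.com/repos/ggml-org/ggml/issues/1",
        # Assumed key for issue comments in the backup format.
        "comment_data": [{"user": {"login": "ggerganov"}, "body": "ok"}],
    }
    (issues / "1.json").write_text(json.dumps(issue), encoding="utf-8")
    manifest = {"attachments": [{"size_bytes": 3}]}
    (attachment_dir / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    # Decoy: a downloaded attachment that happens to end in .json.
    (attachment_dir / "payload.json").write_text("not metadata", encoding="utf-8")
    return root
```

In a pytest test this would typically be called with the tmp_path fixture, with GHBB_DB set through monkeypatch.setenv so the temporary database is removed along with the directory.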
- Empty body fields.
- Deleted discussion comments with null/empty body.
- Missing users/authors.
- Labels missing description or color.
- Releases without assets.
- Release tags with slashes or unusual characters.
- Repo names with dots, e.g. llama.cpp.
- Markdown tables and code fences.
- FTS queries containing +, #, ., /, _, -, :.
- Attachment manifest exists but parent item JSON is absent.
- Downloaded attachment file has .json extension.
- Discussion bodyHTML exists but should not be trusted unsanitized.
The full backup is about 5.2 GB because of attachments. With respect to attachments, import cost should scale with the number of manifests, not with attachment byte size.
Performance sanity checks:
- Import should not open image/binary attachment files.
- Memory use should not grow with full backup size; parse files one at a time.
- SQLite inserts should be batched inside one transaction.
- FTS rebuild should be done once per import, not after every item.
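The last two checks can be sketched with the standard sqlite3 module. The schema here (an items table plus an external-content FTS5 table named items_fts) is hypothetical, and the sketch assumes the SQLite build includes FTS5.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator

Row = tuple[str, str, str, str, str]  # repo, kind, item_key, title, body


def chunks(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Split a row stream into lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def import_items(conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int = 500) -> int:
    """Insert rows in batches inside one transaction, then rebuild FTS once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id INTEGER PRIMARY KEY, repo TEXT, kind TEXT, item_key TEXT, title TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS items_fts "
        "USING fts5(title, body, content='items', content_rowid='id')"
    )
    total = 0
    with conn:  # commits once on success, rolls back on error
        for chunk in chunks(rows, batch_size):
            conn.executemany(
                "INSERT INTO items (repo, kind, item_key, title, body) VALUES (?, ?, ?, ?, ?)",
                chunk,
            )
            total += len(chunk)
        # Single rebuild of the external-content index after all rows are in.
        conn.execute("INSERT INTO items_fts(items_fts) VALUES ('rebuild')")
    return total
```

Because rows is consumed lazily in chunks, a generator that parses one file at a time keeps memory use flat regardless of backup size.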
An implementation is acceptable when:
- uv run ghbb import C:\CodeBlocks\ggml-org-backup\backup completes.
- uv run ghbb stats --json reports:
  - 2 repositories;
  - 8,434 issues;
  - 12,028 pulls;
  - 3,314 discussions;
  - 5,952 releases;
  - 8,647 attachment metadata rows.
- uv run ghbb search "Fabrice Bellard" finds ggml issue 1.
- uv run ghbb show ggml issue 1 --json returns the complete issue body and comments.
- uv run ghbb serve starts on 127.0.0.1:8765 and supports /api/search and /api/items/{id}.
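The commands named in these criteria imply a CLI shape along the following lines. This is a sketch using argparse from the standard library: the subcommands and flags are the ones that appear in this plan, while the function name build_parser and the argument names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the import, stats, search, show and serve commands."""
    parser = argparse.ArgumentParser(prog="ghbb")
    sub = parser.add_subparsers(dest="command", required=True)

    p_import = sub.add_parser("import", help="import a backup directory")
    p_import.add_argument("backup_root")

    p_stats = sub.add_parser("stats", help="print item counts")
    p_stats.add_argument("--json", action="store_true")

    p_search = sub.add_parser("search", help="full-text search")
    p_search.add_argument("query")
    p_search.add_argument("--repo")
    p_search.add_argument("--json", action="store_true")

    p_show = sub.add_parser("show", help="show one item")
    p_show.add_argument("repo")
    p_show.add_argument("kind")
    p_show.add_argument("key")  # kept as text so release tags such as b1046 work
    p_show.add_argument("--json", action="store_true")

    sub.add_parser("serve", help="start the local web server")
    return parser
```

For example, build_parser().parse_args(["show", "ggml", "issue", "1", "--json"]) yields a namespace with command set to show and json set to True.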