diff --git a/.agents/skills/specsmith-audit/SKILL.md b/.agents/skills/specsmith-audit/SKILL.md new file mode 100644 index 00000000..755b54c4 --- /dev/null +++ b/.agents/skills/specsmith-audit/SKILL.md @@ -0,0 +1,59 @@ +--- +name: specsmith-audit +description: Run specsmith audit to check for governance drift between requirements, tests, and architecture. Required before advancing an AEE phase. +--- + +# Specsmith Audit + +Checks the project for drift between requirements (ARCHITECTURE.md), test cases, and the codebase. Must pass before advancing an AEE phase. + +## How to run + +```bash +specsmith audit +``` + +## Interpreting results + +``` +29 PASS ← all requirements have matching tests and implementation + 2 WARN ← drift detected — investigate these + 0 FAIL +``` + +**All items must be PASS or suppressed before `specsmith phase advance`.** + +## When a WARN appears + +1. Read the warning — it will reference a requirement ID (e.g. `R20`) and describe what's missing +2. Fix it: add the missing test, update ARCHITECTURE.md, or implement the requirement +3. Re-run `specsmith audit` to confirm it passes +4. If it's a confirmed false positive: `specsmith audit --suppress ` + +## Suppressing a false positive + +Only suppress if you've verified the requirement IS met but the audit can't detect it automatically: + +```bash +specsmith audit --suppress SEAL-XXXX-001 +``` + +Suppressions are permanent and stored in governance state — use them sparingly. 
+ +## Common causes of WARN + +- Requirement in ARCHITECTURE.md has no corresponding test case +- Test exists but requirement ID reference is missing from the test +- Implementation exists but the architecture doc wasn't updated to match + +## After fixing all warnings + +```bash +specsmith audit # confirm all pass +specsmith phase advance # advance the phase +specsmith save # commit the phase bump +``` + +## Quick audit before a session + +Run `specsmith audit` at the start of a session to catch any drift from the previous session before making new changes. diff --git a/.agents/skills/specsmith-save/SKILL.md b/.agents/skills/specsmith-save/SKILL.md new file mode 100644 index 00000000..e466b8fa --- /dev/null +++ b/.agents/skills/specsmith-save/SKILL.md @@ -0,0 +1,51 @@ +--- +name: specsmith-save +description: Run specsmith save to commit and push all current changes with governance state backup. Use at the end of any work session or after completing a feature/fix. +--- + +# Specsmith Save + +Saves governance state: backs up the ESDB, commits any staged/unstaged changes, and pushes to the remote. + +## When to use + +- At the end of any work session +- After implementing a feature, fix, or refactor +- After advancing a phase +- Whenever the user says "save", "commit and push", or "specsmith save" + +## How to run + +```bash +specsmith save +``` + +## What it does (in order) + +1. **ESDB backup** — snapshots the epistemic state database +2. **Commit** — stages all changes and commits with a governance-aware message (or reports "Nothing to commit") +3. **Push** — pushes the branch to origin (or reports "Everything up-to-date") + +## Expected successful output + +``` + ✓ esdb_backup: JSON fallback (no WAL to backup) + ✓ commit: Nothing to commit ← or a commit hash + ✓ push: Everything up-to-date ← or "pushed to origin/phase-next" +``` + +## If there are changes to commit + +Specsmith will auto-stage and commit. 
It generates a governance-aware commit message from the diff. You can also pre-stage and commit manually first: + +```bash +git add -A +git commit -m "feat: description + +Co-Authored-By: Oz " +specsmith save # will see nothing to commit, just pushes +``` + +## Do NOT use `git push` directly + +Always use `specsmith save` — it ensures the ESDB backup runs before the push, keeping governance state consistent with the remote. diff --git a/.agents/skills/specsmith/SKILL.md b/.agents/skills/specsmith/SKILL.md new file mode 100644 index 00000000..a11e1c7d --- /dev/null +++ b/.agents/skills/specsmith/SKILL.md @@ -0,0 +1,69 @@ +--- +name: specsmith +description: Reference for the specsmith AEE governance tool used in this project. Use this to understand specsmith commands, the session workflow, and how to interact with the governance layer correctly. +--- + +# Specsmith — Project Governance Tool + +Specsmith is the AEE (Agile Epistemic Engineering) governance CLI used in this project. It manages requirements, phases, audit trails, and session state. It wraps git with governance-aware commits and backs up the epistemic state DB (ESDB). + +## Key concepts + +- **ESDB** — Epistemic State Database. Tracks certainty, audit state, session memory. Backed up on `specsmith save`. +- **Phases** — AEE lifecycle: Inception → Elaboration → Construction → Transition → Validation → Hardening → Release. Advance with `specsmith phase advance`. +- **Ledger** — Running log of changes in `LEDGER.md`. Auto-updated by commits. +- **Audit** — Checks requirements vs tests vs architecture for drift. Run before advancing a phase. +- **Save** — ESDB backup + governance-aware git commit + push. + +## Session workflow + +``` +1. specsmith audit # check for drift before working +2. +3. 
specsmith save # commit + push + ESDB backup +``` + +## Common commands + +| Command | What it does | +|---------|-------------| +| `specsmith save` | ESDB backup → commit (if needed) → push | +| `specsmith audit` | Drift/health check — requirements vs tests vs arch | +| `specsmith audit --suppress ` | Accept a known false positive | +| `specsmith phase` | Show current AEE phase | +| `specsmith phase advance` | Advance to the next phase (requires clean audit) | +| `specsmith commit` | Governance-aware commit (wraps git commit) | +| `specsmith ledger` | Show/manage the change ledger | +| `specsmith compress` | Compress old ledger entries | +| `specsmith req` | Manage requirements | +| `specsmith test` | Manage test cases | +| `specsmith status` | VCS/CI/PR status | + +## Commit conventions + +Specsmith commits follow: `type: message` where type is one of: +`feat`, `fix`, `refactor`, `test`, `docs`, `chore`, `perf` + +Always append `Co-Authored-By: Oz ` when committing as an AI agent. + +## Important rules + +- **Never use `git commit` directly** — use `specsmith save` or `specsmith commit` so governance state stays consistent. +- **Run `specsmith audit` before advancing a phase** — a phase advance with drift will fail. +- **Suppressed audit findings** are stored permanently; only suppress genuine false positives. +- After `specsmith save` outputs `✓ push: Everything up-to-date` with nothing to commit, the repo is fully clean. 
+ +## Audit result codes + +- `PASS` — requirement/test/arch is consistent +- `WARN` — drift detected, investigate +- `SKIP` / suppressed — accepted false positive +- Numbers like `R20`, `R21` — requirement IDs in ARCHITECTURE.md + +## Phase advancement + +```bash +specsmith audit # must be all-pass (or suppressed) +specsmith phase advance # bumps phase, writes ledger entry +specsmith save # commit the phase bump +``` diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index ce735548..0ffdd220 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -68,7 +68,7 @@ jobs: frontend-tests: name: Frontend tests (Playwright) runs-on: ubuntu-latest - timeout-minutes: 25 + timeout-minutes: 35 # includes 2-min research-loop SSE test needs: [] # run in parallel with backend-tests steps: @@ -87,6 +87,24 @@ jobs: python -m pip install --upgrade pip pip install -e ".[dev]" || pip install -e "." || true + # Build the frontend FIRST so the backend can mount dist/ via StaticFiles + # when it starts. If the frontend is built after the backend starts, the + # StaticFiles mount is never registered and Playwright gets 404s. + - name: Set up Node.js 20 + uses: actions/setup-node@v4 + with: + node-version: "20" + cache: npm + cache-dependency-path: frontend/package-lock.json + + - name: Install frontend dependencies + working-directory: frontend + run: npm ci + + - name: Build frontend + working-directory: frontend + run: npm run build + - name: Start backend in background working-directory: backend env: @@ -98,7 +116,7 @@ jobs: --host 127.0.0.1 --port 8001 \ --log-level warning & echo $! 
> /tmp/backend.pid - # Wait for backend to be healthy (up to 30s) + # Wait for backend to be healthy (up to 30s) AND frontend to be served for i in $(seq 1 30); do if curl -sf http://127.0.0.1:8001/api/v1/health > /dev/null 2>&1; then echo "Backend healthy after ${i}s" @@ -107,21 +125,8 @@ jobs: sleep 1 done curl -sf http://127.0.0.1:8001/api/v1/health || echo "Backend may not be running" - - - name: Set up Node.js 20 - uses: actions/setup-node@v4 - with: - node-version: "20" - cache: npm - cache-dependency-path: frontend/package-lock.json - - - name: Install frontend dependencies - working-directory: frontend - run: npm ci - - - name: Build frontend - working-directory: frontend - run: npm run build + # Verify frontend is served + curl -sf http://127.0.0.1:8001/ > /dev/null && echo "Frontend OK" || echo "Frontend not served" - name: Install Playwright browsers (Chromium only) working-directory: frontend @@ -132,6 +137,10 @@ jobs: env: CI: "true" BACKEND_RUNNING: "true" + # Backend at port 8001 serves both the built frontend (StaticFiles) and + # the API on the same origin. No separate Vite preview server needed. 
+ PLAYWRIGHT_USE_BACKEND: "1" + PLAYWRIGHT_BACKEND_URL: "http://127.0.0.1:8001" run: | npx playwright test \ --reporter=github \ diff --git a/.gitignore b/.gitignore index 5b91a5c4..4b4063b1 100644 --- a/.gitignore +++ b/.gitignore @@ -100,6 +100,7 @@ temp/ *.db-wal # ---- Local config / secrets ---- +project.yml .env .env.* *.pem @@ -147,6 +148,7 @@ data/.keys.json # ---- Database files ---- data/glossa.db +frontend/data/glossa.db # ---- Backend runtime logs at root ---- backend/uvicorn_stdout.log @@ -187,6 +189,13 @@ backend/glossa_lab/data/phase16_corpora/kalyanaraman_devanagari_corpus.txt # Phase-18 derived: large stream txt regenerable from CSV backend/glossa_lab/data/phase18_corpora/rv_padapatha_stream.txt +# ---- Sign image raw download cache (reconstructable via harvest_ivc2tyc_signs.py) ---- +backend/static/signs/originals/ivc2tyc_cache/ +# The manifest.json and processed M*.png files ARE committed (authoritative). +# The originals/ folder itself is also gitignored to keep the repo lean. +backend/static/signs/originals/ +!backend/static/signs/originals/.gitkeep + # ---- ML model weights (large binaries, not for version control) ---- *.pt *.pth @@ -216,5 +225,8 @@ docs/*.bak *.fdb_latexmk *.synctex.gz +# Bulk mine outputs — large JSON files (regenerable from scripts) +outputs/phase*_bulk_mine_*.json + # Private correspondence — local context only, never pushed .correspondence/ diff --git a/= b/= new file mode 100644 index 00000000..e69de29b diff --git a/AGENTS.md b/AGENTS.md index 00fa22b1..2f33b0cc 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -40,3 +40,26 @@ The following project-specific rule files apply to all sessions: - `docs/research/NORMALIZATION_RULES.md` — Indus sign normalization rules for corpus processing and sign-ID canonicalization. 
+ +--- + +## MCP server + +A FastMCP server lives at `backend/glossa_mcp/server.py` and exposes 27 tools +for querying and controlling the backend without manual API calls: + +- **Status/metrics** — `get_status`, `get_system_metrics` +- **Jobs** — `list_jobs`, `get_job`, `create_job`, `cancel_job`, `get_job_results` +- **Experiments** — `list_experiments`, `get_experiment`, `run_experiment` +- **Research loop** — `start_research_loop`, `get_research_loop_status`, + `stop_research_loop`, `get_research_loop_results`, `get_anchor_staging` +- **Foundation check** — `run_foundation_check` +- **Discovery** — `list_discovery_items`, `get_discovery_stats`, + `trigger_discovery_fetch`, `update_discovery_item_status` +- **Dashboard** — `get_latest_insight`, `get_dashboard_highlights` +- **Anchor sets** — `list_anchor_sets`, `get_anchor_set`, `create_anchor_set` +- **Reports** — `list_reports`, `get_report` + +The server connects to the running backend at `GLOSSA_BASE_URL` +(default: `http://127.0.0.1:8001`). Tools return clean JSON error objects when +the backend is unreachable — they never crash the MCP process. diff --git a/ATTRIBUTION.md b/ATTRIBUTION.md new file mode 100644 index 00000000..90747021 --- /dev/null +++ b/ATTRIBUTION.md @@ -0,0 +1,110 @@ +# Attribution, Data Sources & Contact + +**Glossa-Lab** is an open-source AI-assisted research platform for the computational +analysis of ancient and undeciphered writing systems. This project depends on the +work of many scholars and data providers whose contributions we are committed to +crediting accurately. 
+ +--- + +## If a citation or credit is missing — contact us immediately + +If you are a researcher, data provider, or rights-holder and you believe your work +has been used without proper attribution, or if you have any concern about how your +material appears in this project: + +**Please contact Tristen Kyle Pierson directly:** + +> **Email:** tpierson@bitconcepts.tech +> **Subject line:** "Attribution concern — Glossa-Lab" + +We treat attribution concerns as urgent. You will receive a response within 48 hours. +If the concern is valid, we will correct the attribution, update the repository, and +update any published outputs immediately. + +You may also open a GitHub issue at: +https://github.com/BitConcepts/glossa-lab/issues + +--- + +## Primary data sources + +All data sources used in this project are documented in detail in +[CITATIONS.md](./CITATIONS.md). Key sources include: + +| Source | Authors | License | Used for | +|--------|---------|---------|---------| +| Holdat LLC Indus Corpus v3 | Miller 2025 | Proprietary — statistical derivatives only, no raw data redistributed | Primary inscription corpus | +| Mahadevan 1977 (M77) | Iravatham Mahadevan | Public domain (ASI / Govt. of India) | Sign numbering (M001–M397) | +| DEDR | Burrow & Emeneau 1984 | © Clarendon Press — reference use | Dravidian etymological evidence | +| Parpola 1994 / 2010 | Asko Parpola | © CUP / open conference paper | Decipherment framework, phoneme map | +| ePSD2 | Tinney et al. / Penn | CC BY-SA | Sumerian/Akkadian name corpus | +| CDLI | Englund et al. | CC BY-NC-SA 3.0 | Bibliographic reference only (no data committed) | +| CISI Vols 1–3 | Joshi, Shah, Parpola et al. | © Suomalainen Tiedeakatemia | Reference only (no data redistributed) | +| Wells 2006 / 2015 | Bryan K. 
Wells | Open access / © Archaeopress | Sign list cross-reference | +| Fuls 2022/2023 | Andreas Fuls | © independently published | Sign catalog cross-reference | +| ICIT | Wells & Fuls | Restricted (TU Berlin) | API reference; no data committed | +| Nair 2026 | Ashish Nair | CC BY (arXiv) | Independent replication study cited | +| Laursen 2010 | Steffen Terp Laursen | © Wiley / AAE | Gulf seal catalog, fish-sign validation | +| Crawford 2001 | Harriet Crawford | © Archaeology International | Dilmun/Saar seal reference | +| ePSD2 names subset | Penn Babylonian Section | CC BY-SA | Meluhhan name matching (null results) | +| Tamburini 2025 | Fabio Tamburini | CC BY (Frontiers) | SA algorithm methodology reference | + +For the complete bibliography with BibTeX entries, license analysis, and per-file +attribution, see [CITATIONS.md](./CITATIONS.md) and +[research/indus/DATA_LICENSES.md](./research/indus/DATA_LICENSES.md). + +--- + +## License compliance summary + +- **Holdat LLC corpus (proprietary):** Not redistributed. Only statistical + derivatives (positional frequencies, bigram counts, candidate readings) appear + in outputs. +- **ePSD2 (CC BY-SA):** Used only for Meluhhan name matching experiments that + produced null results. Not incorporated into released research outputs. + The CC BY 4.0 licence on `research/indus/` outputs is unaffected. +- **CDLI (CC BY-NC-SA):** No CDLI tablet text committed to this repository. + All CDLI references are bibliographic only. +- **Copyrighted academic sources (CISI, Parpola 1994, Mahadevan 2003):** Used + as structured analytical references (sign numbers, phoneme assignments, crosswalk + mappings). No verbatim text reproduced. Defensible as academic fair use / fair + dealing. +- **PyMuPDF (AGPL):** Used only in standalone research scripts, not in the + deployed backend. AGPL network-use provisions do not apply. 
+ +Released research outputs (`research/indus/`, anchor tables, phase reports, +supplemental datasets) are original analysis released under **CC BY 4.0**. + +--- + +## Acknowledgements + +This project is indebted to the following scholars and institutions +(see [CITATIONS.md §Acknowledgements](./CITATIONS.md) for full details): + +Iravatham Mahadevan (1930–2018) · Asko Parpola · Bryan K. Wells · +Andreas Fuls · William Miller Sr (Holdat LLC) · Ashish Nair · +Steffen Terp Laursen · Harriet Crawford · Petteri Koskikallio · +Roja Muthiah Research Library (Chennai) · University of Pennsylvania Museum · +TIFR (Rao, Yadav, Vahia, Joglekar, Adhikari) · Tamburini (Frontiers AI) + +--- + +## How to cite Glossa-Lab + +```bibtex +@software{glossalab2026, + author = {Pierson, Tristen Kyle}, + title = {Glossa-Lab: An agentic computational linguistics research platform + for statistical analysis and decipherment of ancient writing systems}, + year = {2026}, + url = {https://github.com/BitConcepts/glossa-lab}, + note = {BitConcepts LLC. MIT licence (source); CC BY 4.0 (research outputs).} +} +``` + +--- + +*Last reviewed: June 2026. 
Contact tpierson@bitconcepts.tech for any attribution +concern — we respond within 48 hours.* diff --git a/CITATION.cff b/CITATION.cff index b61e2bf0..48627243 100644 --- a/CITATION.cff +++ b/CITATION.cff @@ -32,11 +32,13 @@ references: - type: article title: > A Falsifiable Computational Decipherment Hypothesis for the Indus Valley Script: - 605 Proto-Dravidian Sign Readings Validated Across Two Independent Corpora + 161 Candidate Proto-Dravidian Anchors and a Three-Slot Positional Grammar authors: - family-names: Pierson given-names: Tristen Kyle affiliation: "BitConcepts LLC" + email: tpierson@bitconcepts.tech year: 2026 + doi: "10.5281/ZENODO.20414696" notes: "Preprint v2 — Not peer-reviewed" - url: "https://github.com/BitConcepts/glossa-lab/tree/main/research/indus" + url: "https://doi.org/10.5281/ZENODO.20414696" diff --git a/CITATIONS.md b/CITATIONS.md index 45530456..f16a4c4d 100644 --- a/CITATIONS.md +++ b/CITATIONS.md @@ -1218,7 +1218,7 @@ Additional acknowledgements since the last update: --- -*Last updated: 2026-05-13.* +*Last updated: June 2026. For attribution concerns contact tpierson@bitconcepts.tech — we respond within 48 hours. See also ATTRIBUTION.md.* --- diff --git a/LEDGER.md b/LEDGER.md index 00f8c753..8e1ee192 100644 --- a/LEDGER.md +++ b/LEDGER.md @@ -4,1324 +4,111 @@ Append-only record of all meaningful work in Glossa Lab. 
--- -## Archived (102 entries) +## Archived (35 entries) -*Archived on 2026-05-18* +*Archived on 2026-06-07* -- ## [2026-03-31] Entry — Repository scaffold and governance bootstrap — — -- ## [2026-03-31] Entry — Governance hardening, architecture extension, requirements, tests, and implementation scaffold — — -- ## [2026-04-01] Entry — Replace .ps1 with .cmd wrappers, verify shell.cmd — — -- ## [2026-04-01] Entry — Complete all open TODOs: API, DB, logs, tray, services, CI — — -- ## [2026-04-01] Entry — Pipeline engine, block entropy analysis, Rao 2009 replication — — -- ## [2026-04-01] Entry — Indus script corpus, multi-language analysis, PDF report — — -- ## [2026-04-01] Entry — Complete decipherment toolkit build + Indus preparation — — -- ## [2026-04-02] Entry — NSB estimator, Sumerian corpus, logosyllabic pipeline, frontend visualization — — -- ## [2026-04-02] Entry — OS integration tool, Playwright test suite, port isolation — — -- ## [2026-04-02] Entry — Linear B validation study + Linear A undeciphered analysis — — -- ## [2026-04-02] Entry — Real Linear A phoneme-level analysis (tylerlengyel.com data) — — -- ## [2026-04-02] Entry — Linear A anti-circularity experiment suite (7 experiments) — — -- ## [2026-04-03] Entry — Publishable paper, study archive, assumption-free pipelines — — -- ## [2026-04-03] Entry — Rate-limit pacing, admin dashboard backend, and frontend CRUD expansion — — -- ## [2026-04-04] Entry — CI green, frontend view completion, experiments, and Luwian model result — — -- ## [2026-04-05] Entry — Study Builder, SSE streaming, pipelines CRUD, Playwright CI — — -- ## [2026-04-06] Entry — Tray service refactor, ICIT corpus extraction, Mahadevan OCR, Reports improvements — — -- ## [2026-04-06] Entry — Full research platform expansion, Ollama model manager, AI Chat, IDE panel — — -- ## [2026-04-07] Entry — Session audit, docked AI chat completion, LEDGER recovery — — -- ## [2026-04-07] Entry — PDF OCR corpus, research experiments, database fixes, 
decipherment push — — -- ## [2026-04-07] Entry — Indus decipherment study: structural + phonological analysis — — -- ## [2026-04-07] Entry — Sign value assignment, prediction validation, academic PDF — — -- ## [2026-04-07] Entry — PDF fixes, report_utils module, crosswalk, rebus tests — — -- ## [2026-04-07] Entry — Deep analysis: sign corrections, formula, full equivalence classes — — -- ## [2026-04-07] Entry — AI chat table fix + sign identification session — — -- ## [2026-04-07] Entry — Decipherment synthesis: fish anchored, sign 220=tree, first readings — — -- ## [2026-04-07] Entry — Deep-dive: meen-um, fish clustering, tree sign, phonetic inventory — — -- ## [2026-04-07] Entry — Sign expansion: 48/503/615, maa-, M77 inventory, token coverage — — -- ## [2026-04-05] Entry — ExperimentsView full CRUD and API client expansion — — -- ## [2026-04-05] Entry — Platform orchestration: Study Builder, SSE streaming, pipelines CRUD, CI Playwright — — -- ## [2026-04-06] Entry — Full experiment suite run, all reports regenerated — — -- ## [2026-04-06] Entry — Full-day session: OCR, real ICIT corpus, UI overhaul, platform work — — -- ## [2026-04-08] Entry — Left sidebar nav, AI bubble positioning, GlossaShell, Ollama default model — — -- ## [2026-04-08] Entry — Full platform session: AI action system, model profiles, terminal fixes, governance loop, Indus formula discovery — — -- ## [2026-04-09] Entry — Dr. Fuls 5-tier validation sprint: beam decipherment engine — — -- ## [2026-04-10] Entry — GPU acceleration, experiment registration, CLI-to-UI bridge, AGENT.md — — -- ## [2026-04-10] Entry — Queued experiments run + corpus expansion + UI improvements — — -- ## [2026-04-10] Entry — P6-P9 experiments, new corpora, Glossa AI major upgrade, fine-tuning guide — — -- ## [2026-04-11] Entry — PLANNING: Global Ancient Language Research Platform — PLANNED — not yet started. 
-- ## [2026-04-14] Entry - GPU, Parallel Execution, Graph Nodes, Corpus, Tests, and Full Compliance Audit — — -- ## [2026-04-14] Entry — H15 Graph-First Rule, Fuls RTL Results, 4 New Atomic Nodes, 10 New Graph Specs — — -- ## [2026-04-15] Entry — Geez Baseline Run (Graph Experiment, No Anchors) — — -- ## [2026-04-15] Entry — Geez Anchor-Convergence Benchmark (Full, Graph-Based) — — -- ## [2026-04-15] Entry — H16 Complete: Graph-Only Catalog, 33 Atomic Nodes, 37 Specs, Subroutine Ports — — -- ## [2026-04-15] Entry — All H16 Phases: User-Definable Platform Complete — — -- ## [2026-04-16] Entry — Geez v2 + UI Completions (Dr. Fuls April 2026) — — -- ## [2026-04-16] Entry — Help system docs overhaul + corpus token-type inspector — — -- ## [2026-04-17] Entry — Help complete rewrite + Dr. Fuls technical Q&A — — -- ## [2026-04-21] Entry — H16 Complete: Graph-First Platform, All Plans Executed, Indus Research Pivot — — -- ## [2026-04-22] Entry — Indus Research Priorities 1–5 & 7: South Dravidian LM, Pali LM, 4 New Graph Experiments, Geez Calibration — WIP (Mohenjo-daro + some other sites). 
-- ## [2026-04-22] Entry — CISI Corpus Import + Playwright Locator Fixes — — -- ## [2026-04-22] Entry — Decipherment Experiments + Governance Fix + CISI Deep Analysis — — -- ## [2026-04-22] Entry — Extended Decipherment: 10-Anchor SA, Dravidian-Pali CISI, Inscription Readings, P324 Cross-Validation — — -- ## [2026-04-22] Entry — P324 Revision, Koyil Hypothesis, Optimal Anchor Set in DB — INFERRED structural, NOT SA-phonotactically confirmed -- ## [2026-04-22] Entry — P332=o Discovery, 6-Anchor SA, CV Pair Structure, AnchorSetLoader Integration — — -- ## [2026-04-22] Entry — 3 UI Fixes: LTR/RTL Badges, Jobs→Reports, Error Modal — — -- ## [2026-04-22] Entry — AG2 Integration + Anchor Set Corrected — — -- ## [2026-04-22] Entry — Report Template Cleanup — — -- ## [2026-04-22] Entry — Decipherment Sprint Phases 0-8 + H16/Platform Verification — — -- ## [2026-04-23] Governance Infrastructure + Research Intelligence — — -- ## [2026-04-23] AI Chat Migration & Run Status Fixes — — -- ## [2026-04-28] Entry — H17 enforcement: observable runner, heartbeat thread, full Fuls re-run — — -- ## [2026-04-28] Entry — Phase-10: CTT graph nodes + dense-coupling primitives + Indus graph experiment — — -- ## [2026-04-28] Entry — Phase-10 limitation fixes, cleanup, real run — — -- ## [2026-05-04] Entry — Executable AI insights, AI-profile suggester, Phase-30a M77 length stratification — — -- ## [2026-05-04] Entry — Phase-30b/c, dashboard SSE, AI-profile dedup, anchor-set DB upsert — — -- ## [2026-05-05] Entry — UI polish mega-bundle recovery + 7 no-key Discovery fetchers + Settings reorg — — -- ## [2026-05-06] Entry — Project architecture, patent APIs, discovery polish, LLM fixes — — -- ## [2026-05-06] Entry — Dashboard polish: experiment ID resolution, action buttons, error logging, Results deep-link — — -- ## [2026-05-06] Entry - Project-scoped UI overhaul, Correspondence log, Collaboration features — — -- ## [2026-05-11] Entry — Session recovery: WARP.md merge, unrecorded sessions 
recap, Gulf corpus tasks — — -- ## [2026-05-11] Entry — V18+ campaign, Phase-32 synthesis, Fuls brief, corpus audit + code fixes — — -- ## [2026-05-11] Entry — Fact-check round: corpus audit, TB circularity, icon assignments — — -- ## [2026-05-11] Entry — Tasks 1-12: full research + UI sprint — — -- ## [2026-05-11] Entry — Foundation check: TB LM fix attempt + comprehensive validation — — -- ## [2026-05-11] Entry — Citations, foundation check feature, clean Tamil LM — — -- ## [2026-05-11] Entry — Provider Registry, Model Assignments, Scoring, Logging, and Deployment — — -- ## [2026-05-11] Entry — Phase-32 T4, Gap Analysis, Docs, Foundation Check — — -- ## [2026-05-11] Entry — Phase-32 T4 Syllable Rerun + Final Session Wrap — — -- ## Session: 2026-05-12 — Phase-32 T3/T7/T8 + Negative Controls — — -- ## [2026-05-13] Entry — Phase-33 All Experiments (8 runs + email report) — — -- ## [2026-05-14] Entry — Phase-33 T2 Graph, Synthesis, and Session Closure — — -- ## [2026-05-14] Entry — Phase-34 T1/T7 Anchored SA, T3 TB Clean, Sign-Reading — — -- ## [2026-05-14] Entry — Phase-35 Equalization, Anchor Augmentation, Discovery Fetch — — -- ## [2026-05-14] Entry — Phase-36 Density Equalization + Discovery Mining + Synthesis — — -- ## [2026-05-14] Entry — Phase-37 Fixes, Corpus Realignment Batch 1, Email — — -- ## [2026-05-14] Entry — Phase-38: Confirmed 1.056x Dravidian Advantage (High Power) — — -- ## [2026-05-14] Entry — ICIT-Scale Indus Corpus Reconstruction (Branch Setup + Full Pipeline) — — -- ## [2026-05-14] Entry — ICIT Corpus: Free Source Acquisition + Failure Diagnosis + All Fixes — — -- ## [2026-05-14] Entry — Recovery Plan Execution: Browser Automation + Reverse Engineering — Phase C (Google login + network capture) requires user action -- ## [2026-05-14] Entry — Museums of India: full API acquisition (4,417 records) — — -- ## [2026-05-14] Entry — BREAKTHROUGH: indusscript.in Firestore dump — 3,085 IM77 texts acquired — — -- ## [2026-05-15] Entry — Phase-39: Sangam 
LM, Multi-Language Falsification, Corpus Batch 2 — — -- ## [2026-05-15] Entry — CBETA Repository Investigation and Correct Acquisition — — -- ## [2026-05-15] Entry — OCR Pipeline Ready + All IIIF Images Downloaded — — -- ## [2026-05-15] Entry — Glyph Classifier Pipeline (IoU k-NN) + 93 Inferred Sequences — — -- ## [2026-05-15] Entry — Option A: CNN Classifier (Negative Result) + Corpus Alignment Audit — — -- ## [2026-05-15] Entry — Corpus/ICIT-Scale Reconstruction Branch Merged — — -- ## [2026-05-15] Entry — Phase-40: Expanded Corpus SA, CNN GPU Training, GPU Rule Fix — — -- ## [2026-05-15] Entry — Phase-41: 300K SA confirmation, corpus validation, sign ID fix — — -- ## [2026-05-15] Entry — Phase-42: V2 corpus catalog alignment — deeper root cause; Penn Museum IP-blocked — — -- ## [2026-05-17] Entry — AGENTS.md H20 + Glossa-Lab Indus Evidence Graph infrastructure (Batch 1+2) — — +- ## Archived (25 entries, 2026-05-29) — — +- ## [2026-05-26] Entry — Competing LM Convergence Test + Dravidianist Outreach Sent — — +- ## [2026-05-26] Entry — Phase 295: Infrastructure Sprint + Bulk Mine 5000 — — +- ## [2026-05-29] Entry — Research Loop Phases 5-7: Experiment Builder + Insight Selection + DB Persistence — — +- ## [2026-05-29] Entry — Governance Migration + Phase Advancement Sprint — — +- ## [2026-05-29] Entry — Full Phase Advancement + SA Experiment Diagnosis + Research Loop Verification — — +- ## [2026-05-29] Entry — Session Close — — +- ## [2026-05-29] Entry — UI Feature Sprint: Pause/Resume, Auto-Queue, Arrange Fix, ETA Fix — — -## [2026-05-15] Entry — Phase-43: V3 corpus built; Dravidian confirmed independent; terminal signs mapped +## [2026-06-07] Entry — Data Integrity Fixes + Kalyanaraman Integration + Phase Advancement + Dynamic Phases Objective: -All-tier execution: T1 V3 corpus + SA, T2 sign mapping (rebus/suffixes/fish/CV pair), -T3 emails (Fuls + Penn Museum), T4 CTT expansion + contact zone + cross-validation. 
+ Fix foundation check data integrity bugs, integrate Kalyanaraman rebus + papers as second-source validation, fix phase advancement system, add + dynamic phase generation. What was done: -T1.1 V3 corpus (indus_corpus_v3.py) — NEW: - - Built from Firestore indusarrays JSONL dump directly (no intermediate layer) - - 3,137 sequences from 2,665 dockeys (Mahadevan concordance entries 1001-9905) - - 12,494 sign instances, mean length 3.98 signs/inscription - - *NNN filter: 3.8% of tokens removed, 85% of dockeys completely clean - - Multi-site: Mohenjo-daro + Harappa + Chanhu-daro + other sites all present - - File: backend/glossa_lab/data/indus_corpus_v3.py - -T1.2 indus_corpus_v2.py *NNN filter — APPLIED: - - Added startswith('*') skip in _parse_diplomatic_to_ints() and _extract_sequences() - - This was the root cause of Phase-42 V2 SA failure - -T1.3 V3 SA (Dravidian vs Sanskrit) — CRITICAL RESULT: - - Dravidian score/token: -4.1525 Sanskrit score/token: -4.6362 - - DRAVIDIAN WINS: +0.484 log-units/token = 11.6% less penalized per token - - This is the FIRST confirmation of the Dravidian advantage on a corpus - INDEPENDENT of M77 Holdat. V3 uses 3,137 inscriptions from indusscript.in - Firestore, covering all major sites. - - SA: 3 seeds x 30K iterations, GPU (CUDA RTX 4070 SUPER) - - Report: phase43_all.json (T1_3_v3_sa) - -T2.2 Terminal sign table — 20 STRONG suffix candidates identified: - - From corpus-scale T/I/M profiling across 3,137 V3 sequences - - 20 TERMINAL_STRONG (T>=0.60), 10 TERMINAL_MODERATE, 40 INITIAL_STRONG, 61 MEDIAL_STRONG - - CRITICAL REVISION: M77/342 (n=1318, T=0.703) -- the most common sign IS terminal! - Previous assumption ('phonetic ka/na') is OVERTURNED. 
- M77/342 = genitive suffix -n (Proto-Dravidian *-in) is now PRIMARY hypothesis - - M77/176 (T=0.892, n=344) = -um (additive enclitic) - - M77/328 (T=0.853, n=299) = -ku (dative) - - M77/211 (T=0.817, n=218) = -al (agentive) or aal (person) - - M77/1 (T=0.683, n=123) = -il (locative) - -T2.4 Fish sign disambiguation — CRITICAL: - - M77/267: INITIAL_STRONG (I=0.806, n=356) -- NOT phonetic meen - Title/determinative element. Used at inscription start as royal/priestly title. - terminal_frac=3.8% -- signs that follow it do NOT behave as case-suffix followers. - - M77/72: MEDIAL_STRONG (M=0.691, n=181) -- PHONETIC meen (fish) - terminal_frac=25.9% -- SUPPORTED as phonetic 'meen' with case suffix followers - - M77/59: MEDIAL_STRONG (M=0.793, n=334) -- phonetic meen variant or kol - terminal_frac=33.8% -- also supported - - IMPLICATION: M77/267 is a fish LOGOGRAM/DETERMINATIVE, not a phoneme sign. - -T2.3 CV pair search (ko=king Mahadevan analog): - - Best candidate: M77/267 + M77/99 (count=251 at inscription start, dominance=0.74) - - M77/267 (INITIAL_STRONG) always followed by M77/99 (MEDIAL_STRONG) in 74% of cases - - This [267][99] sequence = fixed royal title formula (Mahadevan equivalent of CISI P324+P332) - - M77/99: purely MEDIAL (M=0.861, n=642) -- the most common medial phoneme after title signs - -T2.1 Top-20 rebus table: full rebus mapping for all 20 most frequent V3 signs documented. - - Positional roles computed and Dravidian reading candidates assigned. - - M77/123 (n=187, M=0.904) = LOGOGRAM (unicorn/Pasupati) -- highest confidence logogram. - -T3.1 Dr. 
Fuls email sent: - - To: [email redacted] (Resend id: f33f4c33) - - Subject: Phase-43 Update: Dravidian Advantage Confirmed on Independent Corpus - - Includes: V3 corpus result, terminal sign table, fish disambiguation, title formula - - Renewed ICIT corpus request with stronger evidence - - File: reports/phase43_fuls_email.txt - -T3.2 Penn Museum draft sent for review: - - To: tpierson@bitconcepts.tech for review (Resend id: 943ddddc) - - Draft to send to: [email redacted] - - Requesting batch image access for ~7,515 identified Indus seal objects - - File: reports/phase43_penn_museum_request_draft.txt - -T3.3 Holdat collection probe: - - Searched all indusscript-probe files -- no separate 'holdat' Firestore collection - - indusscript.in app uses ONLY 'indusarrays' collection - - M77 Holdat data in indus_research.jsonl (indusscript-m77 source) = same data as V3 - -T4.1 DEDR root recall expansion: - - Phase-10 baseline: 0.0% recall (CV-only role map) - - Phase-43 (10 anchors including fish signs): 24.3% of inscriptions match >=1 DEDR root - - Top match: 'meen' (243 hits) driven by fish sign anchors - - True non-fish DEDR recall estimated ~2-3% - - File: reports/phase43_all.json (T4_1_ctt_dedr_expansion) - -T4.2 Multi-site contact zone: - - V3 has all major sites: M-daro 502 / Harappa 727 / Chanhu 217 / Other 874 dockeys - - NOTE: Harappa (727 dockeys) is LARGER than Mohenjo-daro (502 dockeys) in V3! 
- - Mohenjo-daro vs Harappa sign overlap: Jaccard=0.602 (substantial shared vocabulary) - - Harappa-exclusive signs: M77/277, M77/3, M77/38, M77/201, M77/398 - (candidates for Harappa-specific trade/administrative logograms) - - mayig repository (Mohenjo-daro only) is superseded by V3 for contact zone work - -T4.3 Cross-validation: - - indusscript-m77 entries have accession_number=None -- no dockey lookup possible - - V3 and indusscript-m77 are SAME DATA (both from Firestore indusarrays) - - Cross-validation confirms: V3 is the canonical form of the same corpus - -Foundation check: not run (corpus data/analysis session) - -Files changed: - backend/glossa_lab/data/indus_corpus_v3.py (NEW -- Firestore V3 corpus loader) - backend/glossa_lab/data/indus_corpus_v2.py (FIXED -- *NNN filter added) - backend/scripts/phase43_all.py (NEW -- T1 part 1: SA run) - backend/scripts/phase43_part2.py (NEW -- T2-T4: all analysis) - backend/scripts/send_phase43_emails.py (NEW -- T3 email sender) - backend/scripts/_analyse_firestore.py (NEW -- inspection utility) - reports/phase43_all.json (NEW -- all T1-T4 results) - reports/phase43_fuls_email.txt (NEW) - reports/phase43_penn_museum_request_draft.txt (NEW) - reports/phase43_insights_email.txt (NEW) - LEDGER.md (this entry) - -Open TODOs (Phase-44): - CRITICAL: - 1. Determine M77/342 = -n vs phonetic: - Run bigram context analysis -- what signs precede 342? - If title/initial signs precede 342 most of the time, genitive confirmed. - 2. Determine M77/99 phonetic value: - 99 is purely MEDIAL (M=0.861) and always follows title signs - Candidates: 'ka', 'na', 'ta' -- test against DEDR genitive forms - 3. V3 SA with 300K iterations -- confirm lift ratio matches M77 Holdat (1.0566x) - 4. Contact zone analysis: Harappa-exclusive signs vs DEDR trade words - 5. Penn Museum: send institutional image request after tpierson review - - HIGH: - 6. ICIT corpus (dependent on Dr. Fuls response) - 7. *NNN sign documentation: what do RMRL *001, *002... 
etc. represent? - Check RMRL bulletins and indusscript.in documentation for supplementary sign list - -Risks: - - M77/342 = -n hypothesis inverts our prior assignment; needs bigram context confirmation - - V3 SA at 30K iters is exploratory; 300K needed for convergence comparable to M77 - - DEDR recall 24.3% is mostly fish-sign driven; true phonetic recall is ~2-3% - - Penn Museum images remain blocked; institutional contact outcome uncertain - -Key findings (summary): - DRAVIDIAN WINS on V3 (independent corpus) -- FIRST independent replication - M77/342 = genitive -n (REVISED from 'phonetic ka/na') - M77/267 = title determinative (NOT phonetic meen) - M77/72 = phonetic meen (confirmed) - [M77/267][M77/99] = fixed royal title formula (251x, 74% dominance) - V3 has full multi-site coverage: Harappa > Mohenjo-daro in size - -Next step: - Phase-44 T1: Bigram context of M77/342 (what precedes the genitive?). - Phase-44 T2: Phonetic value of M77/99 from DEDR genitive pattern matching. - - -## [2026-05-17] Entry — Corpus versioning policy + Evidence Graph Batches 1-5 - -Objective: - H20 email rule. Corpus versioning (V1 primary, date-tracked). - Evidence Graph Batches 1-5: literature sweep, claim extraction, null model - analysis, contact zone, Hunt tripartite grammar test. 
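The contact-zone item in this objective is reported later in the entry as a Jaccard overlap between site sign inventories (M↔H Jaccard=0.602, 82 Harappa-exclusive signs). A minimal sketch of that measure, assuming each site is reduced to a set of sign IDs. The function names and the toy inventories are illustrative only and are not taken from indus_analysis_batch5.py:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def exclusive_to(site: set[str], other: set[str]) -> set[str]:
    """Signs attested at `site` but never at `other`."""
    return site - other


if __name__ == "__main__":
    # Hypothetical toy inventories, NOT real corpus data.
    mohenjo_daro = {"M77/342", "M77/99", "M77/267", "M77/176"}
    harappa = {"M77/342", "M77/99", "M77/277", "M77/3"}

    print(f"Jaccard = {jaccard(mohenjo_daro, harappa):.3f}")
    print("Harappa-exclusive:", sorted(exclusive_to(harappa, mohenjo_daro)))
```

The score is symmetric in the two sites, while the exclusive-sign set is directional, so the Jaccard value and the Harappa-exclusive count answer different questions.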
- -What was done: - -Corpus versioning: - - Created glossa-indus/CORPUS_VERSIONS.md documenting: - * V1 = indus_research.jsonl (user primary, tracked by date, NOT version number) - * Firestore source = supplementary external data (labeled by date, not V2) - * Version bump rules: only after initial research complete + major structural change - - Renamed indus_corpus_v3.py -> indus_corpus_firestore.py - (clarifies it is NOT the user's corpus but a supplementary external source) - - indus_corpus_v2.py remains as-is but is documented as the V1 primary loader - -Batch 3 literature sweep: - - 9 papers catalogued; 6 downloaded as PDFs (open access confirmed) - - Downloaded: Yadav 2010 PLoS ONE n-grams (2.1MB, CC BY), Yadav 2009 arXiv - preprint, Rao 2010 ACL CL reply to Sproat, Parpola 2010 Dravidian solution - (Helsinki repository), Sinha 2010 arXiv network analysis, Farmer-Sproat- - Witzel 2004 Academia.edu - - Metadata-only: Rao 2009 Science (paywalled), Mahadevan 2009 RMRL (institutional) - - Failed: Rao 2009 PNAS (403) -> registered as metadata-only - - All 9 registered in literature/documents/ with provenance - -Batch 4 claim extraction: - - indus_claims.py processes all registered literature documents - - Pattern matching + manual curation for key papers - - 7 claims extracted and saved to claims/extracted_claims/ - - Key manually-curated claims: - * parpola_2010: Dravidian rebus hypothesis [partially_supported] - Evidence: Phase-43 SA +0.484 log-units [VERIFIED] - * parpola_2010: fish sign = meen [partially_supported] - Evidence: M77/72 terminal_frac=25.9% [SUPPORTED] - * farmer_sproat_witzel: non-linguistic hypothesis [contradicted] - Evidence: conditional entropy + positional structure - * yadav_2010: Zipf-Mandelbrot [strongly_supported] - * yadav_2010: text-beginning and text-ending signs [strongly_supported] - -Batch 5 analysis tests (CRITICAL NEW FINDINGS): - - Null model 1 (random shuffle): - - Real positional entropy: 1.0223 vs null mean: 1.2374 - - Effect 
size: 231.89σ -- MASSIVE positional structure above random baseline - - CONCLUSION: Indus sign positional behavior is FAR above random chance - (p < 10^-100 effectively) - - Null model 2 (frequency-preserved shuffle): - - Top-20 real bigrams appear at >50% rate in shuffled null: 0.13/20 on average - - CONCLUSION: Real bigram structure is REAL sequential dependency, - NOT predicted by frequency alone - - Null model 3 (site-preserved shuffle): - - Cross-site shared bigrams: real=1094, null mean=~1070, effect=1.58σ - - CONCLUSION: Moderate cross-site recurrence -- shared script across sites - is confirmed but site-specific variation exists - - Contact zone (formal): - - M↔H Jaccard=0.602, 82 Harappa-exclusive signs - - Confirms Phase-43 T4.2 contact zone analysis - - Hunt tripartite grammar test (LANDMARK): - - formula_rate=0.355 (35.5% of 3+ sign inscriptions follow I→M→T structure) - - null_expected_rate=0.006 (0.6%) - - LIFT: 59x above null baseline - - CONCLUSION: STRUCTURAL PREFIX-MEDIAL-SUFFIX PATTERN EXISTS - Consistent with BOTH Hunt model AND Dravidian suffix model - Cannot distinguish between them without visual sign classification - (faunal vs celestial sign identification requires Mahadevan visual catalog) - -Foundation check: not run - -Files changed: - AGENTS.md (modified -- H20 email rule) - backend/glossa_lab/data/indus_corpus_firestore.py (NEW -- renamed from v3) - glossa-indus/CORPUS_VERSIONS.md (NEW) - glossa-indus/README.md (NEW) - glossa-indus/config/claim_schema.yaml (NEW) - glossa-indus/config/sign_schema.yaml (NEW) - glossa-indus/config/models.yaml (NEW) - glossa-indus/config/dedupe_rules.yaml (NEW) - glossa-indus/config/test_registry.yaml (NEW) - glossa-indus/config/sources.yaml (NEW) - glossa-indus/scripts/indus_intake.py (NEW) - glossa-indus/scripts/indus_literature_batch3.py (NEW) - glossa-indus/scripts/indus_claims.py (NEW) - glossa-indus/scripts/indus_analysis_batch5.py (NEW) - glossa-indus/hypotheses/models/parpola_proto_dravidian.yaml 
(NEW) - glossa-indus/hypotheses/models/roif_guild_ledger.yaml (NEW -- stub) - glossa-indus/hypotheses/models/hunt_civic_ritual.yaml (NEW -- stub) - glossa-indus/raw/papers/ (6 PDFs downloaded) - glossa-indus/literature/documents/ (9 JSON records) - glossa-indus/claims/extracted_claims/ (4 JSON records) - glossa-indus/analysis/null_models/ (3 JSON results) - glossa-indus/analysis/artifact_context/ (contact zone JSON) - glossa-indus/analysis/positional/ (Hunt tripartite test JSON) - glossa-indus/reports/ingestion_reports/batch3_literature_report.json (NEW) - glossa-indus/reports/claim_reports/batch4_claims_report.json (NEW) - glossa-indus/reports/model_reports/batch5_analysis_report.json (NEW) - LEDGER.md (this entry) - -Open TODOs (Batches 6-8): - 1. Upload Roif paper to glossa-indus/raw/user_uploads/ (stubs waiting) - 2. Upload Hunt paper to glossa-indus/raw/user_uploads/ (stubs waiting) - 3. Build indus_analyze.py (dedupe run + synthesis report) - 4. Phase-44 T1: Bigram context analysis for M77/342 = -n confirmation - 5. Phase-44 T2: M77/99 phonetic value from DEDR pattern - 6. V3 SA 300K iterations (convergence verification) - 7. 
Penn Museum: send institutional request after tpierson review - -Risks: - - Hunt vs Dravidian-suffix models: both predict tripartite structure; - visual sign classification required to distinguish them - - Null model 3 (site-preserved) effect_size=1.58σ -- moderate, not significant - at 2σ threshold; more inscriptions needed for stronger cross-site test - - Batch 4 claims: only 7 claims extracted; most PDFs have embedded text that - is not yet fully processed through the pattern extractor - -Key findings: - Positional structure: 231σ above random shuffle null [LANDMARK] - Bigram structure: top-20 bigrams NOT predicted by frequency alone [VERIFIED] - Tripartite formula: 35.5% of inscriptions, 59x above null [VERIFIED] - Hunt + Dravidian-suffix models BOTH predict this -- structural ambiguity remains - Contact zone: Harappa has 82 exclusive signs vs Mohenjo-daro - -Next step: - Upload Roif and Hunt papers to glossa-indus/raw/user_uploads/. - - -## [2026-05-17] Entry — Evidence Graph Batches 6-8, CI/CD, WAL Fix, governance-tool, repo cleanup - -Objective: - Complete Evidence Graph platform (API + frontend + tests), harden database - reliability, update AI model registries, and perform comprehensive repository cleanup. 
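The "harden database reliability" part of this objective corresponds to the three PRAGMAs this entry records for Database.connect(): journal_mode=WAL, busy_timeout=5000, synchronous=NORMAL. A minimal sketch of that configuration, assuming the standard-library sqlite3 module rather than the project's own connection wrapper. `connect_with_wal` is a hypothetical helper name, not the actual code in database.py:

```python
import os
import sqlite3
import tempfile


def connect_with_wal(path: str) -> sqlite3.Connection:
    """Open a SQLite file and apply the three PRAGMAs named in the ledger entry."""
    conn = sqlite3.connect(path)
    # Write-ahead logging: readers are not blocked by a concurrent writer.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 5000 ms for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=5000")
    # NORMAL skips the per-transaction fsync that FULL performs.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


if __name__ == "__main__":
    # WAL needs a file-backed database; an in-memory one reports "memory".
    with tempfile.TemporaryDirectory() as tmp:
        conn = connect_with_wal(os.path.join(tmp, "demo.db"))
        print(conn.execute("PRAGMA journal_mode").fetchone()[0])
        conn.close()
```

journal_mode=WAL is stored in the database file, but busy_timeout and synchronous apply per connection, so a helper like this would need to run on every new connection.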
- -What was done: - -Batch 6 — Evidence Graph REST API (backend/glossa_lab/api/indus_evidence.py): - - 11 endpoints implemented: GET /library, POST /upload, POST /import-url, - POST /intake/run, GET /claims, GET /hypotheses, GET/PUT /sweep/config, - POST /sweep/run, GET /sweep/candidates, POST /sweep/intake - - Full multipart PDF upload → background intake pipeline - - Sweep engine reuses discovery fetcher infrastructure with per-project TopicProfile - - Registered under /api/v1/indus-evidence/ router - -Batch 7 — Evidence Graph Experiment Builder atomic nodes: - - experiment_graph_indus_evidence.py: 7 nodes in 'Evidence Graph' category - IndusLiteratureLoader, IndusClaimsLoader, CrossHypothesisMatrix, - HiddenHypothesisGen, IndusClaimTester, IndusNullModelTest, IndusIntakeRunner - - New port types: 'claims' (#b45309) and 'papers' (#0891b2) in PORT_COLORS - - Wired into ATOMIC_NODES via try/except import block - -Batch 8 — Frontend Evidence Graph workspace (IndusEvidenceView.tsx): - - 3-tab UI: Library (dropzone, URL import, re-run, paper list, hypothesis stats) - - Claims tab: type/status/sign filters, expandable claim cards - - Sweep tab: config editor, Save Config, Run Sweep, candidates list with Import - - Evidence Graph nav item in Research sidebar - - Discovery → Evidence import action (🗂 → Evidence on Indus/Harappan items) - -SQLite WAL fix (backend/glossa_lab/database.py): - - Added 3 PRAGMAs in Database.connect(): journal_mode=WAL, busy_timeout=5000, - synchronous=NORMAL - - Eliminated 14+ 'database is locked' failures in concurrent test suite - - Test results improved from 428/16 to 445/0 (zero failures) - -GitHub Actions CI/CD (.github/workflows/ci.yml): - - 3-job pipeline: backend-tests (pytest), playwright-tests, lint (ruff) - - Triggers on push + PR to main - - All 445 backend tests + 39 Playwright Evidence Graph tests covered - -governance-tool model-rate-limits.json update (both root and backend/.governance-tool/): - - Added: o3, o4-mini, 
gpt-4.1/mini/nano, gemini-2.5-flash variants, - gemini-2.5-pro-preview-05-06, claude-sonnet-4-20250514, claude-opus-4-5, - gemini-3-pro-preview, gpt-5.4 preview aliases - - Removed stale gpt-3.5-turbo wildcard duplication - -model_intelligence.py (backend/glossa_lab/ai/model_intelligence.py): - - Added gpt-5.4 to static fallback with top-tier benchmark scores - -Repository cleanup (chore commit 2026-05-17): - - Deleted 113 obsolete backend/scripts/: phase30-43, v5-v8/v18 loops, - email send scripts, probe/debug scripts, old run scripts, one-off scripts - - Deleted 29 stale backend/reports/: INDUS_V5-V24 round files, INDUS_V18 - loop email, miscellaneous state files (progression.json, etc.) - - Deleted 2 glossa-indus/scripts/ batch scripts (already superseded) - - Deleted 16 reports/ email drafts and chat test JSON files - - Updated .gitignore: *.pt/*.pth/*.ckpt/*.safetensors; user_uploads/*.pdf - - Updated extracted claims JSON (6 files) + batch4 report (re-extracted in tests) - -Documentation/governance updates (this session): - - foundation_check.py API check #8: changed from FAIL (V8-V24 file check) to - permanent PASS with 'archived' note — files deleted in cleanup - - AGENTS.md H19 check #8: updated to reflect archived status - - AGENTS.md DECIPHERMENT RESEARCH ASSET REGISTRY: added Evidence Graph subsystem table - - docs/REQUIREMENTS.md: updated header date; added R16 CI/CD Pipeline section - - docs/TESTS.md: added TEST-CI-001 through TEST-CI-004 section - -Commits this session: - b8bcec7 — Evidence Graph Batch 6 (API) - 227e927 — Evidence Graph Batch 7 (atomic nodes) - 7d8f7a1 — Evidence Graph Batch 8 (frontend) - f80e2c3 — Evidence Graph tests (20 API + 45 atomic node + 39 Playwright) - 7d44d85 — Navigation test fixes (5 pre-existing failures resolved) - 12e99a7 — governance-tool + model_intelligence.py updates - 438cc69 — SQLite WAL fix (445/0 tests) - 83d20bf — CI/CD GitHub Actions pipeline - (cleanup commit) — comprehensive repo cleanup - (docs commit) — 
this entry + architecture/req/test updates - -Files changed: - backend/glossa_lab/api/indus_evidence.py (NEW) - backend/glossa_lab/experiment_graph_indus_evidence.py (NEW) - frontend/src/components/IndusEvidenceView.tsx (NEW) - frontend/src/components/Discovery/DiscoveryView.tsx (MODIFIED — Evidence import) - frontend/src/App.tsx (MODIFIED — Evidence Graph route) - backend/tests/test_indus_evidence_api.py (NEW — 20 tests) - backend/tests/test_evidence_atomic_nodes.py (NEW — 45 tests) - frontend/e2e/evidence-graph.spec.ts (NEW — 39 tests) - backend/glossa_lab/database.py (MODIFIED — WAL PRAGMAs) - .github/workflows/ci.yml (NEW — 3-job CI pipeline) - .governance-tool/model-rate-limits.json (MODIFIED) - backend/.governance-tool/model-rate-limits.json (MODIFIED) - backend/glossa_lab/ai/model_intelligence.py (MODIFIED — gpt-5.4) - backend/glossa_lab/api/foundation_check.py (MODIFIED — check #8 archived) - docs/REQUIREMENTS.md (MODIFIED — date + R16) - docs/TESTS.md (MODIFIED — TEST-CI section) - AGENTS.md (MODIFIED — H19, Evidence Graph registry) - LEDGER.md (this entry) - .gitignore (MODIFIED — *.pt, user_uploads/*.pdf) - [113 backend/scripts/ deleted, 29 backend/reports/ deleted, etc.] - -Checks run: - shell.cmd test → 445 passed, 0 failed (after WAL fix) - npx playwright test e2e/evidence-graph.spec.ts → 39/39 pass - npx playwright test e2e/navigation.spec.ts → 28/28 pass - Foundation check → PASS (H19 check #8 updated to archived) - -Results: - Evidence Graph platform fully operational: 11 REST endpoints, 7 Experiment - Builder nodes, 3-tab frontend workspace, 39 Playwright tests, 20 API tests, - 45 atomic node tests. - Zero flaky tests in full suite (WAL fix resolved all concurrency failures). - CI/CD pipeline active. - Repository cleaned: ~160 obsolete files removed, gitignore updated. - -Open TODOs: - 1. Phase-44 T1: Bigram context analysis for M77/342 = -n confirmation - 2. Phase-44 T2: M77/99 phonetic value from DEDR genitive pattern matching - 3. 
V3 SA 300K iterations (convergence verification) - 4. Upload Roif and Hunt papers to glossa-indus/raw/user_uploads/ - 5. Penn Museum institutional contact (after tpierson review of draft) - -Risks: - - dashboard.py /api/v1/dashboard/decipherment: V8-V24 round files deleted; - dashboard will return empty progression array — acceptable as campaign is archived. - Decipherment progress panel will show no history. Low priority to fix. - - dashboard.py references INDUS_V7_FULL_PUSH.json etc. (also deleted); function - handles missing files gracefully (empty latest_reports dict). - - Hunt vs Dravidian-suffix: tripartite structure 59x above null but models are not - yet distinguishable without visual sign classification. - -Next step: - Phase-44 T1: Run bigram context analysis for M77/342 to confirm -n genitive reading. - Upload Roif/Hunt papers via Evidence Graph UI. - ---- - - -## [2026-05-17] Entry — Phase-44: Infrastructure Fix, TamilTB LM Expansion, M342/M99 Phonetic Experiments, Dashboard + UI Fixes - -Objective: -Clean anchor set noise from V8-V24 archive, rebuild CISI corpus, expand Dravidian LM with -TamilTB data, run two targeted phonetic experiments (M342 genitive, M99 DEDR), fix the -decipherment dashboard for archived campaign state, and update UI navigation. - -What was done: - -1. ANCHOR SET CLEAN (phase44_infrastructure.py): - - Removed 196 nir-placeholder entries added by V8-V24 autonomous loop - - Kept 137 real assignments: 7 HIGH, 54 MEDIUM, 75 LOW, 1 UNCERTAIN - - 7 HIGH reads: M342=ay/a, M176=an/an, M099=kol/kol, M062=erutu, M045=yanai, M016=kaliru, M006=puli - - Report: reports/phase44_infrastructure.json - -2. 
CISI CORPUS REBUILD (phase44_infrastructure.py): - - Rebuilt indus_cisi_corpus.json from mayig/CISI data - - 179 inscriptions / 1003 sign tokens / 182 distinct signs - - Mean inscription length: 5.6 signs - - Top-5: P324(99), P122(76), P385(35), P086(35), P050(32) - - Gulf corpus: PARTIAL (laursen_2010_table1.json exists but contains no sign sequences) - -3. DRAVIDIAN LM EXPANSION (phase44_rebuild_dravidian_lm.py): - - Integrated TamilTB v0.1 (morphologically annotated Tamil TreeBank, ~3,489 words, CC-SA 3.0) - - LM bigrams: 184 -> 944 (+413%, zero English contamination) - - New _citation added: E.1 (DEDR), E.2, E.3 (TamilTB) - - Report: reports/phase44_dravidian_lm_rebuild.json - -4. PHASE-44 T1 — M342 BIGRAM CONTEXT / GENITIVE READING (phase44_t1_m342_bigrams.py): - - Target: M342 (candidate reading: ay/a or genitive suffix -n) - - Corpus: 584 occurrences across 502 inscriptions, 9 sites - - Avg relative position: 0.561 (expected 0.6+ for terminal case marker) - - Top pre-M342 signs: M099(81), M211(45), M342(35), M267(31) - - Top post-M342 sign: M176(122) by large margin - - Cross-site pre-M342 Jaccard (Mohenjo-daro vs Harappa): 0.429 (shared vocabulary) - - Verdict: UNCERTAIN - Findings: anchor signs M099/M267/M176 appear in genitive contexts; top-3 pre-signs - account for 77.8% of preceding contexts (possible noun-class restriction); - avg position < 0.6 weakens strict terminal case-marker hypothesis - - Epistemic status: [UNCERTAIN] — consistent with but not decisive for genitive reading - - Report: reports/phase44_t1_m342_bigrams.json - -5. 
PHASE-44 T2 — M99 PHONETIC VALUE FROM DEDR (phase44_t2_m99_dedr.py): - - Target: M99 in the fixed M267->M099 title formula - - Formula count: 84 of 389 occurrences (21.6%); across all 9 sites - - 96.4% of formula instances have pre-extension (not bare 2-sign formula) - - Top pre-M267 sign in formula context: M059(7), M293(7), M328(7) - - Top post-M099 sign: M342(81 in full corpus) — genitive follows M99 - - DEDR search: 'kol' root found 50 hits in DEDR OCR text - - Best candidates: - DEDR 2173: kol = rod, staff, city, hold (all Dravidian) - DEDR 2174: kol = take, receive, have (Tamil reflexive auxiliary) - DEDR 2209: kon/kor = kill, cut (less likely for title formula) - - Verdict: SUPPORTED - M99 = kol/kol (Dravidian reflexive auxiliary; title formula context) - kol as title element could encode: holder/taker (kol=take), or fort/dwelling (kol=city) - - Epistemic status: [SUPPORTED, medium confidence] - - Report: reports/phase44_t2_m99_dedr.json - -6. FRONTEND: DeciphermentPanel archived state fix (frontend/src/components/DeciphermentPanel.tsx): - - Panel now detects archived decipherment campaign via backend response shape - - Shows real coverage metrics (anchors: 7 HIGH / 137 total) rather than NA/0% - - Displays archive banner when V8-V24 campaign data not present - -7. UI NAVIGATION: Help panel moved to sidebar bottom section (frontend/src/App.tsx): - - Help link relocated from Research section to bottom persistent sidebar area - - Foundation Check icon adjusted - -8. FRONTEND REBUILD: - - npm run build completed; new bundle: index-DMsLPwTh.js replaces index-jwv6u_va.js - - Old bundle deleted, dist/index.html updated - -9. DOCUMENTATION: - - README.md: fixed box-drawing character formatting error (lines 113-114 inside code fence); - updated Current Research Status section to reflect Phase-44 findings - - docs/USER_GUIDE.md: updated to reflect Phase-44 research state, UI navigation changes, - and current anchor/corpus counts + 1. 
ANCHOR TOTAL BUG FIX: + - fa["total"] was set to H+M count only by 3 writer paths (promotion, + auto-fix, cleanup script) but foundation check expected len(anchors) + - Fixed in: api/research_loop.py, api/foundation_check.py, scripts/_fix_anchors.py + - Repaired INDUS_FINAL_ANCHORS.json total: 260 → 286 + - Foundation check: 40 pass, 0 fail, 8 warn + + 2. STALE DASHBOARD FIX: + - "Next steps" showed stale fix_foundation proposal after FC was fixed + - ResearchLoopPanel now filters out fix_foundation proposals when + live FC shows 0 failures + + 3. KALYANARAMAN REBUS INTEGRATION: + - Built rebus lexicon from 52 PDFs: 24 rebus pairs, 200 Dravidian terms, + 144 sign references, 31 craft vocabulary items + - Created KalyanCrossValidation atomic node + graph experiment + - First run: 105 overlapping signs, 113 new candidates, complementarity=1.0 + - Auto-queued on every anchor promotion (alongside SA experiments) + - Data file: backend/glossa_lab/data/kalyanaraman_rebus.json + + 4. CGSA EXPERIMENT FIX: + - ClusterMapper node referenced undefined `_log` instead of `logger` + - One-line fix in experiment_graph.py + + 5. SA MULTI-LANGUAGE BUILD FIX: + - "nw semitic" split into ["nw", "semitic"] by whitespace regex + - Added pre-normalization for multi-word language names before splitting + + 6. PHASE ADVANCEMENT FIX: + - Phase 5 was terminal — "Complete Phase" cleared state but coverage + kept returning Phase 5. No Phase 6 existed. + - Added completed_through_phase tracking in phase_state.json + - Added Phase 6 (Peer Review) and Phase 7 (Publication) + - _get_phase_for_coverage now skips completed phases + - Verified: Phase 5 → 6 transition works via API + + 7. 
DYNAMIC PHASE GENERATOR: + - New module: pipelines/phase_generator.py + - Auto-generates phase goals from available experiments + project state + - Persists to outputs/phase_goals.json (editable) + - API: GET /phase/goals, POST /phase/goals, POST /phase/generate + - config.py loads dynamic goals when available, falls back to defaults + + 8. STAGING REVIEW: + - 14 staged candidates rejected (all from blocker_sign_context, + recommended=false, statistically_sufficient=false, SA delta 4.9-5.0%) Files changed: - backend/glossa_lab/data/indus_cisi_corpus.json (NEW — 179 inscriptions, 1003 tokens) - backend/glossa_lab/data/dravidian_tamil_lm.json (MODIFIED — 944 bigrams, TamilTB integrated) - backend/reports/INDUS_FINAL_ANCHORS.json (MODIFIED — 137 real anchors, 196 placeholders removed) - backend/scripts/phase44_infrastructure.py (NEW) - backend/scripts/phase44_rebuild_dravidian_lm.py (NEW) - backend/scripts/phase44_t1_m342_bigrams.py (NEW) - backend/scripts/phase44_t2_m99_dedr.py (NEW) - reports/phase44_infrastructure.json (NEW) - reports/phase44_dravidian_lm_rebuild.json (NEW) - reports/phase44_t1_m342_bigrams.json (NEW) - reports/phase44_t2_m99_dedr.json (NEW) - frontend/src/components/DeciphermentPanel.tsx (MODIFIED — archived state) - frontend/src/App.tsx (MODIFIED — Help navigation) - frontend/dist/index.html (MODIFIED — new bundle hash) - frontend/dist/assets/index-DMsLPwTh.js (NEW — rebuilt bundle) - frontend/dist/assets/index-jwv6u_va.js (DELETED — old bundle) - README.md (MODIFIED — formatting fix + Phase-44 research status) - docs/USER_GUIDE.md (MODIFIED — Phase-44 state, UI nav) - LEDGER.md (this entry) + backend/glossa_lab/api/research_loop.py (promotion total fix + Kalyanaraman auto-queue) + backend/glossa_lab/api/foundation_check.py (auto-fix total calculation) + backend/glossa_lab/api/experiments.py (multi-word language normalization) + backend/glossa_lab/api/phase.py (goals CRUD + generate endpoints) + backend/glossa_lab/config.py (Phase 6+7, dynamic 
goal loading) + backend/glossa_lab/experiment_graph.py (ClusterMapper _log fix, Kalyanaraman node registration) + backend/glossa_lab/experiment_graph_kalyanaraman.py (NEW — cross-validation node) + backend/glossa_lab/pipelines/phase_advancer.py (completed_through_phase tracking) + backend/glossa_lab/pipelines/phase_generator.py (NEW — dynamic phase generation) + backend/glossa_lab/data/kalyanaraman_rebus.json (NEW — rebus lexicon) + backend/glossa_lab/experiments/graphs/indus_kalyanaraman_crossval.json (NEW) + backend/scripts/_fix_anchors.py (total = len(anchors)) + backend/scripts/_build_kalyanaraman_lexicon.py (NEW — lexicon builder) + backend/reports/INDUS_FINAL_ANCHORS.json (total repaired) + frontend/src/components/ResearchLoopPanel.tsx (stale proposal filter) Checks run: - - phase44_infrastructure.py: exit 0; 196 anchors removed, CISI rebuilt - - phase44_rebuild_dravidian_lm.py: exit 0; 944 bigrams, 0% contamination - - phase44_t1_m342_bigrams.py: exit 0; UNCERTAIN verdict - - phase44_t2_m99_dedr.py: exit 0; SUPPORTED verdict - - Frontend rebuild: npm run build success - - README.md formatting verified: no remaining || lines in code fences - -HEADLINE RESULTS: - [SUPPORTED, medium confidence] M99 = kol/kol (DEDR 2173/2174) - - Title formula M267->M099 attested 84x across all 9 major Indus sites - - DEDR root confirmed present in Tamil/Dravidian lexicon - [UNCERTAIN] M342 genitive reading: anchor signs in context but position < 0.6 threshold - [INFRASTRUCTURE] Anchor set purged: 333 -> 137 (real assignments only) - [INFRASTRUCTURE] CISI corpus rebuilt: 179 inscriptions, 1003 tokens - [INFRASTRUCTURE] Dravidian LM 5.1x expanded: 184 -> 944 bigrams + - Foundation check: 40 pass, 0 fail, 8 warn + - Kalyanaraman cross-validation: COMPLEMENTARY (144 signs, 113 new candidates) + - Phase advancement: Phase 5 → 6 verified via API + - npm run build: clean, 0 TS errors + - Backend health: healthy + - specsmith audit: 29 pass, 2 issues (ledger TODOs + scaffold type) Open 
TODOs: - 1. V3 SA 300K iterations for convergence verification (phase-44 T3 pending) - 2. Run indus_intake.py on user_uploads PDFs (2 papers pending) - 3. Fetch Wells ICIT, Hunt 2014, Fuls papers for Evidence Graph - 4. Expand Sangam LM further (current 944 bigrams still thin vs ICIT-scale corpora) - 5. Gulf seal corpus: source laursen_2010 data with inscription sequences - -Risks: - - M342 genitive UNCERTAIN: avg position 0.561 is lower than expected for a pure terminal marker; - need larger corpus or clearer positional criterion - - Gulf corpus: Laursen 2010 table1 data has no sign sequences; need OCR or transcription source - - Dravidian LM at 944 bigrams still sparse for high-n SA convergence; Phase-38 SA used equalized LM - - H19: Phase-44 results are internal only; not publishable without ICIT corpus validation + - [ ] Contact Suresh Kolichala via Academia.edu with review packet + - [ ] Wait for Dravidianist responses (Renganathan, Murugaiyan, Kobayashi) + - [ ] Check SSRN status (submission ID 6827038) Next step: - Phase-44 T3: V3 corpus SA at 300K iterations for independent replication. - Literature intake: upload Roif + Hunt papers via Evidence Graph UI. - ---- - - -## [2026-05-17] Entry — CI fixes, sweep bug, Playwright tests - -Objective: -Fix all CI failures, repair evidence sweep fetcher bug, complete Playwright test suite. - -What was done: - -1. FOUNDATION CHECK: 17/17 PASS (was 16/17). Backend reload applied permanent - PASS for V8-V24 archived campaign check (check #8). - -2. pyproject.toml: Fixed malformed python-multipart dependency entry (\n was - literal text not a newline) — caused pip install to fail with parse error. - -3. CI: Added backend startup to Playwright job (previously backend was missing, - causing ECONNREFUSED on all /api/v1/* calls and job timeout). - -4. model_intelligence.py: Eliminated competing sqlite3 connection from - start_intelligence_sync() startup path. 
Old code opened ~80 synchronous - sqlite3 connections (one per model) via run_in_executor, causing - SQLITE_BUSY_SNAPSHOT race with aiosqlite — broke test_create_anchor_set_minimal. - Fix: async _sync_static_fallback_async() writes through the single aiosqlite - connection; _KNOWN_MODELS extracted as module-level constant. - -5. analysis.py: Export endpoint returned 500 on corpora with non-ASCII names - (e.g. Ge'ez with U+2019). HTTP Content-Disposition headers must be latin-1. - Fix: NFKD normalize + ASCII encode before building the filename. - -6. indus_evidence.py: Sweep fetcher silently fetched 0 items every run due to - 'RawItem object has no attribute doi'. RawItem only has title/url/source/topic/ - published_at/lang/raw. Fixed: extract doi/authors/summary/pdf_url/kind from - item.raw dict. - -7. Playwright tests: 132 tests, 0 failures locally after fixes. Key fixes: - - AI Chat textarea: getByRole('textbox') instead of CSS [placeholder*='anything'] - - Sign search: check count text (\d+ signs) instead of invisible