From 90ac121fb2d32f48afb79fb610faed8ed29c1505 Mon Sep 17 00:00:00 2001
From: Ran Jiao
Date: Fri, 15 May 2026 16:53:43 +0800
Subject: [PATCH] Digest: auto-detect research-style vs project-context docs (no new flag)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

User context: 'I need to reference academic papers + industry tech reports periodically. Is the current /pmo digest suitable for this?'

Analysis: current digest's schema (TL;DR + Key facts + Open questions + What PMO must remember + Section map) was designed for project-context docs (term sheets, internal constraints, screenshots). It's underspec'd for academic content where the reuse value lives in Method / Limitations / Reproducibility / Applicability, not in 'key facts.'

User direction: do NOT add a --paper flag. Have the agent auto-detect the document type from signals in the document itself.

Implementation: classify into 2 types at digest time, weighted-signal heuristic with explicit fallback.

New: `## Document-type auto-detection` section in digests.md
- 6 weighted signals (length, structural markers, citations density, file source / URL, writing style, section structure)
- Threshold: ≥3 signals → high-confidence research-style; ≤1 → high-confidence project-context; =2 → ambiguous, AskUserQuestion
- Logged in digest front-matter `Doc type:` field for audit + later manual override via --refresh

New schema: `## Digest schemas (two variants share lifecycle)`
- Project-context schema (existing) → state/digest_TEMPLATE.md unchanged
- Research-style schema (new) → state/digest_paper_TEMPLATE.md
  - TL;DR (plain language, not the abstract verbatim)
  - Method (sample / setup / baseline / eval — judgeable + transferable)
  - Key results (number + context, not raw claim)
  - Limitations (author-acknowledged + reader-observed)
  - Reproducibility (data / code / clarity / barriers)
  - Applicable to this project?
    (verdict: use / partially / don't, + reason)
  - Followup citations (2-3 max with one-line reason each — NOT dump)
  - Open questions + What PMO must remember (carried over from existing schema)
  - Extra front-matter: Authors, Published, Source venue (arXiv:NNNN.NNNNN etc.)

Plus a `### Why a different schema for research` paragraph explaining the rationale per section, so future contributors don't add Method back to the project-context schema or strip Method from research.

Shared across both schemas (unchanged):
- inputs/ → knowledge// flow
- Status: active/archived/eternal/superseded lifecycle
- archive_inactive_days candidate detection
- knowledge/INDEX.md registration
- -digest.md naming convention

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 pmo/reference/digests.md           | 74 ++++++++++++++++++++++++++++-
 pmo/state/digest_paper_TEMPLATE.md | 76 ++++++++++++++++++++++++++++++
 2 files changed, 148 insertions(+), 2 deletions(-)
 create mode 100644 pmo/state/digest_paper_TEMPLATE.md

diff --git a/pmo/reference/digests.md b/pmo/reference/digests.md
index f132a12..85aff96 100644
--- a/pmo/reference/digests.md
+++ b/pmo/reference/digests.md
@@ -48,7 +48,8 @@ Argument: a path inside `inputs/` (relative or absolute), or a magic string `--p
 ### Reading + drafting

 4. Read the full source. For long docs, read in chunks but produce ONE coherent digest.
-5. Draft the digest using the schema below (see `## Digest schema`). Save to a working file at `inputs/-digest.draft.md` so the user can see progress.
+4a. **Auto-classify document type** (see `## Document-type auto-detection` below). Two classes: `project-context` (default schema) vs `research-style` (academic paper / industry tech report — uses the extended schema). High-confidence classifications proceed silently; low-confidence cases ask the user once via `AskUserQuestion`.
+5. Draft the digest using the schema matching the detected type (see `## Digest schemas`).
Save to a working file at `inputs/-digest.draft.md` so the user can see progress.

 ### Verification (Q6: b+c locked)

@@ -76,7 +77,47 @@ Argument: a path inside `inputs/` (relative or absolute), or a magic string `--p
 - Per Q6c (user can hand-edit), respect any sections marked `` — do NOT overwrite without explicit AskUserQuestion confirmation.
 - Update `Source SHA-256` and `Last digested` fields.

-## Digest schema (template at `state/digest_TEMPLATE.md`)
+## Document-type auto-detection
+
+Every digest run classifies the source into one of **two** types. The classification controls which schema/template is used downstream. **NO user-facing flag** — the agent decides from signals in the document itself, asks the user only when uncertain.
+
+### The two types
+
+- **`project-context`** (default) — anything the user drops in to give PMO project background: term sheets, internal docs, constraints lists, screenshots, user notes, regulatory text. Uses `state/digest_TEMPLATE.md`. Typical length < 5000 words.
+- **`research-style`** — academic papers, industry tech reports, white papers, formal technical notes. Uses `state/digest_paper_TEMPLATE.md` (extended schema with Method / Results / Limitations / Reproducibility / Applicability / Followup citations). Typical length ≥ 8 pages.
+
+### Classification signals
+
+Compute weighted evidence from the source. **Each signal toward `research-style` = +1 point. ≥3 points → high-confidence `research-style`. 0–1 points → high-confidence `project-context`.
2 points → ambiguous, ask user.**
+
+| Signal | Counts toward `research-style` if … |
+|---|---|
+| Length | ≥ ~5000 words / ≥ 8 pages (PDF page count, or ≈ 30K chars in markdown/text) |
+| Structural markers | Source contains explicit "Abstract", "References" / "Bibliography", numbered figure captions ("Figure 1: …"), equation blocks, DOI string, arXiv ID (`arXiv:NNNN.NNNNN`) |
+| Citations density | ≥ 10 distinct external citations / bibliography entries |
+| File source / URL | Path or URL contains `arxiv` / `acm` / `ieee` / `nature` / `springer` / `ssrn` / `nber` / named research-org domains; or filename pattern `--.pdf` / pure arXiv ID `NNNN.NNNNN.pdf` |
+| Writing style | First-person plural method narration ("We propose / We evaluate / Our results show"), hypothesis-testing language, defined notation introduced early |
+| Section structure | Has at minimum 3 of: Introduction / Background / Method (or Approach / Methodology) / Experiments / Results / Discussion / Conclusion / Limitations |
+
+### Behavior by confidence
+
+- **High-confidence `research-style`** (≥ 3 signals): proceed silently with `digest_paper_TEMPLATE.md`.
+- **High-confidence `project-context`** (≤ 1 signal): proceed silently with `digest_TEMPLATE.md`.
+- **Ambiguous** (2 signals, or specifically conflicting evidence): one `AskUserQuestion` call, header `"Doc type"`, options: `Research-style — paper / tech report (Recommended) | Project-context — internal doc / constraints / notes`. User picks once, no further prompts; chosen template applies through the rest of the digest run.
+
+The classification is logged in the digest's front-matter `Doc type:` field for audit. If the user disagrees later, they can manually flip the field and re-run `/pmo digest .md --refresh` to regenerate with the alternative schema.
+
+### Why no flag
+
+The user already drops the source file at `inputs/`.
Adding `--paper` would force a decision the agent can make from the document's own structure — extra cognitive load with no upside. Auto-detect with explicit fallback to confirmation is the cleaner contract.
+
+## Digest schemas (two variants share lifecycle, differ in schema)
+
+Both schemas share: front-matter (Source / SHA / Status / Topics / Referenced by), placement under `knowledge//`, the same `Status: active | archived | eternal | superseded` lifecycle, the same `archive_inactive_days` archive-candidate detection, the same `knowledge/INDEX.md` registration.
+
+They differ in **body sections** to match what's worth capturing about each type.
+
+### Project-context schema (template at `state/digest_TEMPLATE.md`)

 ```markdown
 # Digest —
@@ -111,6 +152,35 @@ Argument: a path inside `inputs/` (relative or absolute), or a magic string `--p
 - §2 ...
 ```

+### Research-style schema (template at `state/digest_paper_TEMPLATE.md`)
+
+For academic papers and industry technical reports. Replaces the project-context body with sections that capture what's worth knowing about research:
+
+- **TL;DR** — paper's core claim in plain user-language (NOT the abstract verbatim).
+- **Method** — sample / dataset, setup, baseline, evaluation. Specific enough to judge soundness and transferability.
+- **Key results** — number + the context that makes it meaningful (not just the paper's claim).
+- **Limitations** — split: author-acknowledged + reader-observed (critical reading).
+- **Reproducibility** — data available? code available? method clarity? barriers to reuse?
+- **Applicable to this project?** — concrete verdict (`use` / `partially use` / `don't use`) + reason + landing point.
+- **Followup citations** — 2–3 max, with one-line reason each is worth pulling. NOT a dump of the references list.
+- **Open questions + What PMO must remember** — same as project-context schema; carried over because they're universal.
+
+Extra front-matter fields (`Authors`, `Published`, `Source venue` like `arXiv:NNNN.NNNNN`) are added so the bibliography surface in the digest is searchable + citeable.
+
+The rationale for each extra section is in `### Why a different schema for research` below.
+
+### Why a different schema for research
+
+Academic content has properties that the project-context schema underserves:
+
+- **Reuse value lives in the method, not the result.** A paper's "we got 92.4% on X" matters less than "they did Y, this is whether Y transfers." The Method section forces capture of what's actually reusable.
+- **Claims need critical reading, not bullet-point summarisation.** Limitations (both stated and observed) is mandatory because a digest without it sells the paper, doesn't evaluate it.
+- **Reproducibility is the engineering gate.** "Could I do this myself?" should be answered before the paper's results are believed at face value.
+- **Papers are nodes in a citation graph.** Followup citations capture the 2-3 worth-following edges so future-you knows where to read next.
+- **Generalised "Key facts" loses precision.** Numbers without dataset / cohort / baseline are decoration; the research-style schema forces context attached to each result.
+
+If a digest of a paper would fit comfortably into the project-context schema, that's a signal the paper wasn't worth a full digest in the first place — maybe a 3-line journal note + a saved URL is the right capture.
+
 ## `knowledge/INDEX.md` — the catalog

 PMO maintains this file automatically. Updated on every `/pmo digest`, archive operation, and full rebuild during `end-phase-retro`.
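
The scoring rule in `## Document-type auto-detection` can be sketched roughly as follows. This is a minimal illustration, not code from the patch: the names `SIGNALS`, `classify`, `length_signal`, and `section_structure_signal` are hypothetical, and the heading regex is an assumed way of spotting section titles. Only the two signals with explicit numeric thresholds in the table (length, section structure) are implemented; the others are assumed to arrive as precomputed booleans.

```python
import re
from typing import Mapping

# The six signals named in the classification table (identifiers are hypothetical).
SIGNALS = (
    "length",
    "structural_markers",
    "citation_density",
    "file_source",
    "writing_style",
    "section_structure",
)

# Method / Approach / Methodology count as one section, per the table.
SECTION_GROUPS = (
    "introduction",
    "background",
    "method|approach|methodology",
    "experiments",
    "results",
    "discussion",
    "conclusion",
    "limitations",
)


def length_signal(text: str) -> bool:
    """Length signal: roughly 5000 words or more (word-count variant only)."""
    return len(text.split()) >= 5000


def section_structure_signal(text: str) -> bool:
    """Section signal: at least 3 of the listed section names appear as headings.

    Assumption: a heading is a line that starts with the section name,
    optionally preceded by '#' marks or a section number.
    """
    hits = 0
    for alts in SECTION_GROUPS:
        pattern = rf"^\s*(?:#+\s*)?(?:\d+(?:\.\d+)*\.?\s+)?(?:{alts})s?\b"
        if re.search(pattern, text, re.IGNORECASE | re.MULTILINE):
            hits += 1
    return hits >= 3


def classify(signals: Mapping[str, bool]) -> tuple[str, bool]:
    """Return (doc_type, needs_user_confirmation) from +1-per-signal scoring."""
    score = sum(1 for name in SIGNALS if signals.get(name, False))
    if score >= 3:
        return "research-style", False
    if score <= 1:
        return "project-context", False
    # Exactly 2 points: ambiguous, so the user is asked once.
    return "ambiguous", True
```

Returning the confirmation flag alongside the type keeps the "ask once" decision in one place, so a caller would only need to check the second element before choosing a template.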
diff --git a/pmo/state/digest_paper_TEMPLATE.md b/pmo/state/digest_paper_TEMPLATE.md
new file mode 100644
index 0000000..09c931e
--- /dev/null
+++ b/pmo/state/digest_paper_TEMPLATE.md
@@ -0,0 +1,76 @@
+# Digest —
+
+> Source: knowledge//
+> Source SHA-256:
+> Received: by
+> Last digested: by PMO
+> Status: active
+> Superseded by: —
+> Topics:
+> Verification level: standard
+> Doc type: research-style # auto-detected; do not edit
+> Authors / affiliation:
+> Published / dated:
+> Source venue:
+> Referenced by: (auto-grep on INDEX rebuild)
+
+## TL;DR (≤ 3 sentences)
+
+
+
+## Method
+
+
+
+- **Sample / dataset**:
+- **Setup**:
+- **Baseline**:
+- **Evaluation**:
+
+## Key results (with context)
+
+Each result = number + the context that makes it meaningful (not the paper's claim as-is).
+
+- ****: , vs baseline , on . (§)
+- ...
+
+## Limitations
+
+Both author-acknowledged AND critical-reading limitations the digest writer observed.
+
+- **Author-acknowledged**:
+- **Reader-observed**:
+
+## Reproducibility
+
+- **Data available?**:
+- **Code available?**:
+- **Method described clearly enough to re-implement?**:
+- **Barrier to using in this project**:
+
+## Applicable to this project?
+
+Concrete judgment, not aspirational.
+
+- **Verdict**: use / partially use / don't use
+- **Reason**:
+- **If "use"**: where in the project this lands (file path / phase / task-id)
+- **If "partially use"**: which part transfers, which doesn't
+
+## Followup citations (2–3 max)
+
+References from this paper that look worth reading next. NOT a dump of the references list — only the 2-3 highest-leverage ones, with the reason each is worth pulling.
+
+- :
+- :
+
+## Open questions
+
+-
+- ...
+
+## What PMO must remember in future work
+
+-
+-
+- ...
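
A digest written from this template could be checked for completeness with something like the sketch below. It is an illustration under stated assumptions, not part of the patch: `REQUIRED_SECTIONS`, `doc_type`, and `missing_sections` are hypothetical names, and the check assumes the `## ` headings and the `> Doc type:` front-matter line appear exactly as in the template above.

```python
import re
from typing import List, Optional

# Heading prefixes taken from the research-style template's "## " sections.
REQUIRED_SECTIONS = (
    "TL;DR",
    "Method",
    "Key results",
    "Limitations",
    "Reproducibility",
    "Applicable to this project?",
    "Followup citations",
    "Open questions",
    "What PMO must remember",
)


def doc_type(digest: str) -> Optional[str]:
    """Read the `Doc type:` value from the block-quoted front-matter, if present."""
    match = re.search(r"^>\s*Doc type:\s*([\w-]+)", digest, re.MULTILINE)
    return match.group(1) if match else None


def missing_sections(digest: str) -> List[str]:
    """List required sections that have no matching `## ` heading in the digest."""
    headings = [
        line[3:].strip() for line in digest.splitlines() if line.startswith("## ")
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]
```

Matching on heading prefixes rather than full titles tolerates the parenthetical hints in the template headings, such as the sentence limit on the TL;DR.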