
Digest auto-detects research-style vs project-context (no new flag) #22

Open
ranjiao wants to merge 1 commit into main from perry/digest-auto-detect-doctype

@ranjiao ranjiao commented May 15, 2026

Summary

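User feedback: 'From time to time I need to reference academic papers / industry technical reports. Is the current digest suited to handling this kind of information?'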

Analysis: current digest schema (TL;DR + Key facts + Open questions + What PMO must remember + Section map) was tuned for project-context docs — term sheets, constraints, screenshots, internal notes. It underserves academic / industry research content where the reuse value lives in Method / Limitations / Reproducibility / Applicability, not in 'key facts.'

User direction: NO new flag. Have the agent auto-detect the document type from signals in the source itself, and fall back to one AskUserQuestion when the result is ambiguous.

Auto-detection logic (new § in digests.md)

Two classes:

  • project-context — existing default; term sheets, internal docs, constraints, screenshots
  • research-style — papers, industry tech reports, white papers, formal technical notes

Six weighted signals computed at digest time:

| Signal | Counts toward research-style if … |
|---|---|
| Length | ≥ 5000 words / ≥ 8 pages |
| Structural markers | Abstract / References / numbered figure captions / equations / DOI / arXiv ID |
| Citations density | ≥ 10 distinct external citations |
| File source / URL | arxiv / acm / ieee / nature / springer / ssrn etc.; or `<author>-<year>-<topic>.pdf` pattern |
| Writing style | First-person plural method narration, hypothesis testing |
| Section structure | ≥ 3 of: Introduction / Background / Method / Experiments / Results / Discussion / Conclusion / Limitations |

Threshold: ≥ 3 signals = high-confidence research-style (proceed silently). ≤ 1 signal = high-confidence project-context (proceed silently). Exactly 2 signals = ambiguous → the agent asks one AskUserQuestion and the user picks once.
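The threshold rule can be sketched as a small function. This is a minimal illustration, not the agent's actual logic (which lives as prose in digests.md): the names `DocType`, `SIGNAL_NAMES`, and `classify` are hypothetical, and every signal is assumed to count equally because the description gives no per-signal weights.

```python
from enum import Enum
from typing import Mapping


class DocType(Enum):
    RESEARCH_STYLE = "research-style"
    PROJECT_CONTEXT = "project-context"
    AMBIGUOUS = "ambiguous"  # would trigger one AskUserQuestion


# Hypothetical identifiers for the six signals listed in the table.
SIGNAL_NAMES = (
    "length",
    "structural_markers",
    "citations_density",
    "file_source",
    "writing_style",
    "section_structure",
)


def classify(signals: Mapping[str, bool]) -> DocType:
    """Map signal hits to a document type using the stated thresholds.

    `signals` maps a signal name to True when it counts toward
    research-style. Missing signals are treated as not firing.
    """
    unknown = set(signals) - set(SIGNAL_NAMES)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")

    # Assumption: each signal has weight 1.
    hits = sum(1 for name in SIGNAL_NAMES if signals.get(name, False))

    if hits >= 3:
        return DocType.RESEARCH_STYLE
    if hits <= 1:
        return DocType.PROJECT_CONTEXT
    return DocType.AMBIGUOUS  # exactly 2 signals
```

For example, a document that fires only `length` and `file_source` lands in the ambiguous band, which is the case where the user is asked once.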

The classification is logged in the digest's Doc type: front-matter field for audit, and can be manually overridden via --refresh.
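One possible reading of the override path is that a type already recorded in the digest wins over a fresh detection on refresh. The sketch below assumes front-matter is written as plain `Key: value` lines; the helper names `read_doc_type` and `resolve_doc_type` are hypothetical, and the actual behavior of --refresh is not specified in this description.

```python
import re
from typing import Optional

# Assumption: the field appears on its own line as "Doc type: <value>".
DOC_TYPE_LINE = re.compile(r"^Doc type:[ \t]*(?P<value>.*?)[ \t]*$", re.MULTILINE)
VALID_DOC_TYPES = {"project-context", "research-style"}


def read_doc_type(digest_text: str) -> Optional[str]:
    """Return the recorded doc type, or None if absent or unrecognized."""
    match = DOC_TYPE_LINE.search(digest_text)
    if match is None:
        return None
    value = match.group("value")
    return value if value in VALID_DOC_TYPES else None


def resolve_doc_type(digest_text: str, detected: str) -> str:
    """Prefer a type recorded in the digest over the freshly detected one."""
    recorded = read_doc_type(digest_text)
    return recorded if recorded is not None else detected
```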

Two schemas

Both share: front-matter convention, placement under knowledge/<topic>/, Status: active | archived | eternal | superseded lifecycle, archive-candidate detection, INDEX registration.

They differ in body sections only.

Project-context (unchanged)

state/digest_TEMPLATE.md — TL;DR + Key facts + Open questions + What PMO must remember + Section map.

Research-style (NEW)

state/digest_paper_TEMPLATE.md:

  • TL;DR — plain user-language, not abstract verbatim
  • Method — sample / setup / baseline / eval (judgeable + transferable)
  • Key results — number + context (not raw claim)
  • Limitations — author-acknowledged + reader-observed (critical reading)
  • Reproducibility — data / code / clarity / barriers
  • Applicable to this project? — verdict (use / partially / don't) + reason + landing point
  • Followup citations — 2-3 max with one-line reason each
  • Open questions + What PMO must remember (carried over)
  • Extra front-matter: Authors, Published, Source venue

Each section has a stated rationale in a "### Why a different schema for research" paragraph, so future edits don't drift the boundary between the two schemas in either direction.
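The two schemas can be summarized as data. The sketch below is illustrative only: the template paths and section names are taken from this description, while `TEMPLATES`, `BODY_SECTIONS`, and `missing_sections` are hypothetical names, and the substring check is a deliberately naive stand-in for real heading parsing.

```python
from typing import Dict, List, Tuple

# Template path per document type, as listed in this PR.
TEMPLATES: Dict[str, str] = {
    "project-context": "state/digest_TEMPLATE.md",
    "research-style": "state/digest_paper_TEMPLATE.md",
}

# Body sections per schema; front-matter and lifecycle are shared.
BODY_SECTIONS: Dict[str, Tuple[str, ...]] = {
    "project-context": (
        "TL;DR",
        "Key facts",
        "Open questions",
        "What PMO must remember",
        "Section map",
    ),
    "research-style": (
        "TL;DR",
        "Method",
        "Key results",
        "Limitations",
        "Reproducibility",
        "Applicable to this project?",
        "Followup citations",
        "Open questions",
        "What PMO must remember",
    ),
}


def missing_sections(doc_type: str, digest_text: str) -> List[str]:
    """List schema sections whose titles do not appear in the digest text."""
    return [name for name in BODY_SECTIONS[doc_type] if name not in digest_text]
```

A check like this could flag a research-style digest that silently dropped Method or Limitations, which is the kind of drift the rationale paragraph is meant to prevent.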

Why this is better than --paper flag

  • One mental model: drop file in inputs/, run /pmo digest <path>, agent does the right thing
  • Document-type evidence lives IN the document, not in the user's memory of which flag to pass
  • Borderline cases (long-form blog post that reads academic, or a short methodology memo) get explicit user confirmation rather than a wrong silent default

Files changed

  • pmo/reference/digests.md: +144 lines (auto-detection signals, two-schema description, rationale)
  • pmo/state/digest_paper_TEMPLATE.md: new file, 76 lines
  • pmo/state/digest_TEMPLATE.md: unchanged (still the project-context template)
  • No SKILL.md change (digest reference is loaded only on /pmo digest; type-detection is a reference-internal concern)
  • No new flag, no new subcommand syntax

Test plan

  • Drop an arXiv PDF (e.g., a paper with arxiv in URL, 12 pages, abstract+references) → confirm agent silently picks research-style and writes the research-style schema from state/digest_paper_TEMPLATE.md.
  • Drop a 2-page screenshot or term sheet → confirm agent silently picks project-context.
  • Drop a borderline doc (e.g., a 6-page technical memo from a company blog) → confirm one AskUserQuestion fires and the user's pick is honored.
  • Check the resulting digest's Doc type: front-matter field is set correctly.

🤖 Generated with Claude Code

User context: 'I need to reference academic papers + industry tech reports
periodically. Is the current /pmo digest suitable for this?'

Analysis: current digest's schema (TL;DR + Key facts + Open questions +
What PMO must remember + Section map) was designed for project-context
docs (term sheets, internal constraints, screenshots). It's underspec'd
for academic content where the reuse value lives in Method / Limitations
/ Reproducibility / Applicability, not in 'key facts.'

User direction: do NOT add a --paper flag. Have the agent auto-detect
the document type from signals in the document itself.

Implementation: classify into 2 types at digest time, weighted-signal
heuristic with explicit fallback.

New: `## Document-type auto-detection` section in digests.md
- 6 weighted signals (length, structural markers, citations density,
  file source / URL, writing style, section structure)
- Threshold: ≥3 signals → high-confidence research-style; ≤1 →
  high-confidence project-context; =2 → ambiguous, AskUserQuestion
- Logged in digest front-matter `Doc type:` field for audit + later
  manual override via --refresh

New schema: `## Digest schemas (two variants share lifecycle)`
- Project-context schema (existing) → state/digest_TEMPLATE.md unchanged
- Research-style schema (new) → state/digest_paper_TEMPLATE.md
  - TL;DR (plain language, not the abstract verbatim)
  - Method (sample / setup / baseline / eval — judgeable + transferable)
  - Key results (number + context, not raw claim)
  - Limitations (author-acknowledged + reader-observed)
  - Reproducibility (data / code / clarity / barriers)
  - Applicable to this project? (verdict: use / partially / don't, + reason)
  - Followup citations (2-3 max with one-line reason each — NOT dump)
  - Open questions + What PMO must remember (carried over from existing schema)
  - Extra front-matter: Authors, Published, Source venue (arXiv:NNNN.NNNNN etc.)

Plus a `### Why a different schema for research` paragraph explaining the
rationale per section, so future contributors don't add Method back to
the project-context schema or strip Method from research.

Shared across both schemas (unchanged):
- inputs/ → knowledge/<topic>/ flow
- Status: active/archived/eternal/superseded lifecycle
- archive_inactive_days candidate detection
- knowledge/INDEX.md registration
- <source>-digest.md naming convention

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
