Digest auto-detects research-style vs project-context (no new flag) by ranjiao · Pull Request #22 · ranjiao/Perry

ranjiao · 2026-05-15T08:54:33Z

Summary

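User feedback: 'From time to time I need to consult academic papers and industry technical reports. Is the current digest suited to handling this kind of material?'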

Analysis: current digest schema (TL;DR + Key facts + Open questions + What PMO must remember + Section map) was tuned for project-context docs — term sheets, constraints, screenshots, internal notes. It underserves academic / industry research content where the reuse value lives in Method / Limitations / Reproducibility / Applicability, not in 'key facts.'

User direction: NO new flag. Have the agent auto-detect the document type from signals in the source itself, fall back to one AskUserQuestion when ambiguous.

Auto-detection logic (new § in digests.md)

Two classes:

project-context — existing default; term sheets, internal docs, constraints, screenshots
research-style — papers, industry tech reports, white papers, formal technical notes

Six weighted signals computed at digest time:

Signal	Counts toward `research-style` if …
Length	≥ 5000 words / ≥ 8 pages
Structural markers	Abstract / References / numbered figure captions / equations / DOI / arXiv ID
Citations density	≥ 10 distinct external citations
File source / URL	arxiv / acm / ieee / nature / springer / ssrn etc.; or `<author>-<year>-<topic>.pdf` pattern
Writing style	First-person plural method narration, hypothesis testing
Section structure	≥ 3 of: Introduction / Background / Method / Experiments / Results / Discussion / Conclusion / Limitations
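Three of the rows above are mechanical enough to sketch in code. The following is a minimal illustration only, not the logic in digests.md (which is prose guidance for the agent): the function names, the hostname matching, and the filename regex are all assumptions made for this example, and the page-count half of the Length row is left out.

```python
import re
from urllib.parse import urlparse

# Hosts and section names are taken from the table; everything else is hypothetical.
RESEARCH_HOSTS = ("arxiv", "acm", "ieee", "nature", "springer", "ssrn")
SECTION_NAMES = (
    "introduction", "background", "method", "experiments",
    "results", "discussion", "conclusion", "limitations",
)


def length_signal(text: str) -> bool:
    """Length row: at least 5000 words (page count is not checked here)."""
    return len(text.split()) >= 5000


def source_signal(source: str) -> bool:
    """File source / URL row: known host label, or <author>-<year>-<topic>.pdf."""
    parsed = urlparse(source.lower())
    host = parsed.hostname or ""
    # Match whole hostname labels so 'signature.pdf' does not count as 'nature'.
    if any(label in RESEARCH_HOSTS for label in host.split(".")):
        return True
    filename = parsed.path.rsplit("/", 1)[-1]
    return re.fullmatch(r"[a-z]+-\d{4}-[a-z0-9-]+\.pdf", filename) is not None


def section_signal(text: str) -> bool:
    """Section structure row: at least 3 listed headings, each on its own line."""
    found = set()
    for line in text.splitlines():
        # Drop markdown '#' markers and leading numbering such as '2.' or '3.1'.
        heading = re.sub(r"^[#\s\d.]+", "", line).strip().lower()
        if heading in SECTION_NAMES:
            found.add(heading)
    return len(found) >= 3
```

The remaining rows (structural markers, citations density, writing style) need judgment about the text and are probably better left to the agent than to regexes.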

Threshold: ≥ 3 signals = high-confidence research-style (proceed silently). ≤ 1 = high-confidence project-context (proceed silently). Exactly 2 = ambiguous → one AskUserQuestion, which the user answers once.
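The threshold rule can be written as a small decision function. This is a sketch under stated assumptions: `DocType` and `classify` are hypothetical names, and each signal counts equally because the description gives no per-signal weights.

```python
from enum import Enum


class DocType(Enum):
    RESEARCH_STYLE = "research-style"
    PROJECT_CONTEXT = "project-context"
    AMBIGUOUS = "ambiguous"  # caller should raise one AskUserQuestion


def classify(signals: dict[str, bool]) -> DocType:
    """Map the number of fired signals onto the three outcomes in the PR."""
    hits = sum(1 for fired in signals.values() if fired)
    if hits >= 3:
        return DocType.RESEARCH_STYLE
    if hits <= 1:
        return DocType.PROJECT_CONTEXT
    return DocType.AMBIGUOUS  # exactly 2 signals


# Example: a long memo from a company blog that has paper-like sections.
example = {
    "length": True,
    "structural_markers": False,
    "citations_density": False,
    "file_source": False,
    "writing_style": False,
    "section_structure": True,
}
print(classify(example).value)
```

With two signals fired, the example lands in the ambiguous band, which is the case where the agent asks the user instead of guessing.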

The classification is logged in the digest's `Doc type:` front-matter field for audit, and can be manually overridden later via --refresh.

Two schemas

Both share: front-matter convention, placement under knowledge/<topic>/, Status: active | archived | eternal | superseded lifecycle, archive-candidate detection, INDEX registration.

They differ in body sections only.

Project-context (unchanged)

state/digest_TEMPLATE.md — TL;DR + Key facts + Open questions + What PMO must remember + Section map.

Research-style (NEW)

state/digest_paper_TEMPLATE.md:

TL;DR — plain user-language, not abstract verbatim
Method — sample / setup / baseline / eval (judgeable + transferable)
Key results — number + context (not raw claim)
Limitations — author-acknowledged + reader-observed (critical reading)
Reproducibility — data / code / clarity / barriers
Applicable to this project? — verdict (use / partially / don't) + reason + landing point
Followup citations — 2-3 max with one-line reason each
Open questions + What PMO must remember (carried over)
Extra front-matter: Authors, Published, Source venue
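One way to keep a research-style digest honest against the list above is a completeness check. The sketch below is illustrative only: `missing_sections` is a hypothetical helper, and it assumes each section appears as a markdown heading line starting with `#`, which the actual template may or may not do.

```python
# Section names copied from the list above; the heading format is an assumption.
RESEARCH_SECTIONS = [
    "TL;DR",
    "Method",
    "Key results",
    "Limitations",
    "Reproducibility",
    "Applicable to this project?",
    "Followup citations",
    "Open questions",
    "What PMO must remember",
]


def missing_sections(digest_text: str) -> list[str]:
    """Return the expected sections that have no matching heading line."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in digest_text.splitlines()
        if line.startswith("#")
    }
    return [name for name in RESEARCH_SECTIONS if name.lower() not in headings]


# Build a draft that has every section except Limitations.
draft = "\n\n".join(
    f"## {name}\n(placeholder)" for name in RESEARCH_SECTIONS if name != "Limitations"
)
print(missing_sections(draft))  # ['Limitations']
```

A check like this would flag a digest that silently drops Limitations or Reproducibility, the sections the research-style schema exists to capture.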

Each section has a stated rationale in a `### Why a different schema for research` paragraph, so future edits don't drift the boundary in either direction.

Why this is better than --paper flag

One mental model: drop file in inputs/, run /pmo digest <path>, agent does the right thing
Document-type evidence lives IN the document, not in the user's memory of which flag to pass
Borderline cases (long-form blog post that reads academic, or a short methodology memo) get explicit user confirmation rather than a wrong silent default

Files changed

pmo/reference/digests.md: +144 lines (auto-detection signals, two-schema description, rationale)
pmo/state/digest_paper_TEMPLATE.md: new file, 76 lines
pmo/state/digest_TEMPLATE.md: unchanged (still the project-context template)
No SKILL.md change (digest reference is loaded only on /pmo digest; type-detection is a reference-internal concern)
No new flag, no new subcommand syntax

Test plan

Drop an arXiv PDF (e.g., a paper with arxiv in URL, 12 pages, abstract+references) → confirm agent silently picks research-style and writes the research-style schema from state/digest_paper_TEMPLATE.md.
Drop a 2-page screenshot or term sheet → confirm agent silently picks project-context.
Drop a borderline doc (e.g., a 6-page technical memo from a company blog) → confirm one AskUserQuestion fires and the user's pick is honored.
Check the resulting digest's Doc type: front-matter field is set correctly.

🤖 Generated with Claude Code

User context: 'I need to reference academic papers + industry tech reports periodically. Is the current /pmo digest suitable for this?'

Analysis: current digest's schema (TL;DR + Key facts + Open questions + What PMO must remember + Section map) was designed for project-context docs (term sheets, internal constraints, screenshots). It's underspec'd for academic content where the reuse value lives in Method / Limitations / Reproducibility / Applicability, not in 'key facts.'

User direction: do NOT add a --paper flag. Have the agent auto-detect the document type from signals in the document itself.

Implementation: classify into 2 types at digest time, weighted-signal heuristic with explicit fallback.

New: `## Document-type auto-detection` section in digests.md
- 6 weighted signals (length, structural markers, citations density, file source / URL, writing style, section structure)
- Threshold: ≥3 signals → high-confidence research-style; ≤1 → high-confidence project-context; =2 → ambiguous, AskUserQuestion
- Logged in digest front-matter `Doc type:` field for audit + later manual override via --refresh

New schema: `## Digest schemas (two variants share lifecycle)`
- Project-context schema (existing) → state/digest_TEMPLATE.md unchanged
- Research-style schema (new) → state/digest_paper_TEMPLATE.md
  - TL;DR (plain language, not the abstract verbatim)
  - Method (sample / setup / baseline / eval; judgeable + transferable)
  - Key results (number + context, not raw claim)
  - Limitations (author-acknowledged + reader-observed)
  - Reproducibility (data / code / clarity / barriers)
  - Applicable to this project? (verdict: use / partially / don't, + reason)
  - Followup citations (2-3 max with one-line reason each; NOT dump)
  - Open questions + What PMO must remember (carried over from existing schema)
  - Extra front-matter: Authors, Published, Source venue (arXiv:NNNN.NNNNN etc.)

Plus a `### Why a different schema for research` paragraph explaining the rationale per section, so future contributors don't add Method back to the project-context schema or strip Method from research.

Shared across both schemas (unchanged):
- inputs/ → knowledge/<topic>/ flow
- Status: active/archived/eternal/superseded lifecycle
- archive_inactive_days candidate detection
- knowledge/INDEX.md registration
- <source>-digest.md naming convention

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ranjiao wants to merge 1 commit into main from perry/digest-auto-detect-doctype