fix(triage): prevent prompt-injection escape via forged content delimiters by sarahxsanders · Pull Request #50 · PostHog/warlock

sarahxsanders · 2026-06-12T21:02:20Z

Problem

buildTriagePrompt() (src/scanner/triage.ts) embedded untrusted scanned content between fixed markers --- CONTENT START --- / --- CONTENT END ---. Because those strings are constant and guessable, malicious content could include its own --- CONTENT END --- line followed by forged instructions, escaping the data region to address the triage LLM directly — e.g. instructing it to mark every match false_positive.

This is exactly the threat the scanner exists to stop: a prompt injection that suppresses a true positive defeats the whole triage layer, letting real exfiltration or destructive payloads through with a benign verdict.

Fix

Stamp the data delimiters (CONTENT START/END, MATCHES/END MATCHES) with an unguessable per-call nonce (randomUUID from node:crypto, no new dependency). An attacker can't guess the nonce, so they can't forge the real boundary.
Add an explicit instruction telling the model everything inside is untrusted data to analyze, never obey, and that only a marker bearing the exact token is a real boundary.
Lift the magic 30000 to reuse the existing MAX_CONTENT_CHARS constant.
The existing fail-safe-to-true_positive defaults in parseTriageResponse/triageMatches are untouched, so an escape attempt still cannot silently flip verdicts.

Tests

Added regression coverage in triage.test.ts:

nonce-stamped boundary present and consistent across both ends
fresh nonce per call (not reused)
escape resistance against a forged bare --- CONTENT END --- line
presence of the "untrusted data / never obey" instruction

Full suite 520/520 pass; tsc --noEmit clean.

Sizing: the nonce + instruction add ~600 chars to fixed prompt overhead; with PROMPT_OVERHEAD_CHARS = 5_000 covering a ~3,500-char template there's ample headroom, so no batching constant needed changing.

…iters buildTriagePrompt() embedded untrusted scanned content between the fixed markers `--- CONTENT START ---` / `--- CONTENT END ---`. Because those strings are constant, malicious content could include its own `--- CONTENT END ---` line followed by forged instructions, breaking out of the data region to address the triage LLM directly (e.g. instructing it to mark every match false_positive and suppress a real threat). Stamp the data delimiters with an unguessable per-call nonce (randomUUID) so content cannot forge the boundary, and add an explicit instruction that everything inside is untrusted data to analyze, never obey. The existing fail-safe-to-true_positive defaults are unchanged, so an escape attempt still cannot silently flip verdicts. Adds regression tests covering nonce presence/consistency, per-call freshness, escape resistance against a forged bare delimiter, and the untrusted-data instruction. Generated-By: PostHog Code Task-Id: 5720da9e-fd54-48ee-953d-0a943118bbe2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(triage): prevent prompt-injection escape via forged content delimiters#50

fix(triage): prevent prompt-injection escape via forged content delimiters#50
sarahxsanders wants to merge 1 commit into
mainfrom
posthog-code/fix-triage-delimiter-injection

sarahxsanders commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sarahxsanders commented Jun 12, 2026

Problem

Fix

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant