Skip to content

fix(triage): prevent prompt-injection escape via forged content delimiters#50

Draft
sarahxsanders wants to merge 1 commit into
mainfrom
posthog-code/fix-triage-delimiter-injection
Draft

fix(triage): prevent prompt-injection escape via forged content delimiters#50
sarahxsanders wants to merge 1 commit into
mainfrom
posthog-code/fix-triage-delimiter-injection

Conversation

@sarahxsanders

Copy link
Copy Markdown
Collaborator

Problem

buildTriagePrompt() (src/scanner/triage.ts) embedded untrusted scanned content between fixed markers --- CONTENT START --- / --- CONTENT END ---. Because those strings are constant and guessable, malicious content could include its own --- CONTENT END --- line followed by forged instructions, escaping the data region to address the triage LLM directly — e.g. instructing it to mark every match false_positive.

This is exactly the threat the scanner exists to stop: a prompt injection that suppresses a true positive defeats the whole triage layer, letting real exfiltration or destructive payloads through with a benign verdict.

Fix

  • Stamp the data delimiters (CONTENT START/END, MATCHES/END MATCHES) with an unguessable per-call nonce (randomUUID from node:crypto, no new dependency). An attacker can't guess the nonce, so they can't forge the real boundary.
  • Add an explicit instruction telling the model everything inside is untrusted data to analyze, never obey, and that only a marker bearing the exact token is a real boundary.
  • Lift the magic 30000 to reuse the existing MAX_CONTENT_CHARS constant.
  • The existing fail-safe-to-true_positive defaults in parseTriageResponse/triageMatches are untouched, so an escape attempt still cannot silently flip verdicts.

Tests

Added regression coverage in triage.test.ts:

  • nonce-stamped boundary present and consistent across both ends
  • fresh nonce per call (not reused)
  • escape resistance against a forged bare --- CONTENT END --- line
  • presence of the "untrusted data / never obey" instruction

Full suite 520/520 pass; tsc --noEmit clean.

Sizing: the nonce + instruction add ~600 chars to fixed prompt overhead; with PROMPT_OVERHEAD_CHARS = 5_000 covering a ~3,500-char template there's ample headroom, so no batching constant needed changing.

…iters

buildTriagePrompt() embedded untrusted scanned content between the fixed
markers `--- CONTENT START ---` / `--- CONTENT END ---`. Because those
strings are constant, malicious content could include its own
`--- CONTENT END ---` line followed by forged instructions, breaking out
of the data region to address the triage LLM directly (e.g. instructing
it to mark every match false_positive and suppress a real threat).

Stamp the data delimiters with an unguessable per-call nonce
(randomUUID) so content cannot forge the boundary, and add an explicit
instruction that everything inside is untrusted data to analyze, never
obey. The existing fail-safe-to-true_positive defaults are unchanged, so
an escape attempt still cannot silently flip verdicts.

Adds regression tests covering nonce presence/consistency, per-call
freshness, escape resistance against a forged bare delimiter, and the
untrusted-data instruction.

Generated-By: PostHog Code
Task-Id: 5720da9e-fd54-48ee-953d-0a943118bbe2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant