fix(triage): prevent prompt-injection escape via forged content delimiters#50
Draft
sarahxsanders wants to merge 1 commit into
Draft
fix(triage): prevent prompt-injection escape via forged content delimiters#50sarahxsanders wants to merge 1 commit into
sarahxsanders wants to merge 1 commit into
Conversation
…iters buildTriagePrompt() embedded untrusted scanned content between the fixed markers `--- CONTENT START ---` / `--- CONTENT END ---`. Because those strings are constant, malicious content could include its own `--- CONTENT END ---` line followed by forged instructions, breaking out of the data region to address the triage LLM directly (e.g. instructing it to mark every match false_positive and suppress a real threat). Stamp the data delimiters with an unguessable per-call nonce (randomUUID) so content cannot forge the boundary, and add an explicit instruction that everything inside is untrusted data to analyze, never obey. The existing fail-safe-to-true_positive defaults are unchanged, so an escape attempt still cannot silently flip verdicts. Adds regression tests covering nonce presence/consistency, per-call freshness, escape resistance against a forged bare delimiter, and the untrusted-data instruction. Generated-By: PostHog Code Task-Id: 5720da9e-fd54-48ee-953d-0a943118bbe2
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
buildTriagePrompt()(src/scanner/triage.ts) embedded untrusted scanned content between fixed markers--- CONTENT START ---/--- CONTENT END ---. Because those strings are constant and guessable, malicious content could include its own--- CONTENT END ---line followed by forged instructions, escaping the data region to address the triage LLM directly — e.g. instructing it to mark every matchfalse_positive.This is exactly the threat the scanner exists to stop: a prompt injection that suppresses a true positive defeats the whole triage layer, letting real exfiltration or destructive payloads through with a benign verdict.
Fix
CONTENT START/END,MATCHES/END MATCHES) with an unguessable per-call nonce (randomUUIDfromnode:crypto, no new dependency). An attacker can't guess the nonce, so they can't forge the real boundary.30000to reuse the existingMAX_CONTENT_CHARSconstant.true_positivedefaults inparseTriageResponse/triageMatchesare untouched, so an escape attempt still cannot silently flip verdicts.Tests
Added regression coverage in
triage.test.ts:--- CONTENT END ---lineFull suite 520/520 pass;
tsc --noEmitclean.Sizing: the nonce + instruction add ~600 chars to fixed prompt overhead; with
PROMPT_OVERHEAD_CHARS = 5_000covering a ~3,500-char template there's ample headroom, so no batching constant needed changing.