feat: entropy-based secret detection for exception code variables #692
ablaszkiewicz wants to merge 1 commit into
posthog-python Compliance Report
Date: 2026-06-23 12:16:28 UTC
✅ All Tests Passed! 45/45 tests passed
Capture Tests: ✅ 29/29 tests passed
Feature_Flags Tests: ✅ 16/16 tests passed
a245e6e to d206ea9
Add a last-resort entropy-based detector that redacts high-entropy, secret-looking values (API keys, tokens, strong passwords) sitting in innocuously-named code variables, after the existing name-pattern and URL-credential checks.

- Known vendor key formats (OpenAI, Anthropic, AWS, Stripe, GitHub, GitLab, Slack, Google, JWT, PEM private keys) are matched directly.
- Structured identifiers (UUIDs, Mongo ObjectIds, hashes), object reprs, file paths and URLs are never flagged.
- Exposed as the `code_variables_detect_secrets` option (default True) with a per-context override, threaded through client/contexts.
- Tighten the masking size caps to keep capture cost bounded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
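As a rough illustration of what an entropy check can look like, here is a minimal sketch of Shannon entropy in bits per character, combined with a length gate. The names `shannon_entropy`, `is_high_entropy`, `MIN_LENGTH` and `MIN_BITS_PER_CHAR` are hypothetical, and the threshold values are picked only for the example; the detector in this PR may compute and tune this differently.

```python
import math
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
MIN_LENGTH = 20
MIN_BITS_PER_CHAR = 3.5


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_high_entropy(text: str) -> bool:
    """Flag strings that are both long enough and random-looking enough."""
    return len(text) >= MIN_LENGTH and shannon_entropy(text) >= MIN_BITS_PER_CHAR


print(is_high_entropy("n8fK2pQ9vX7mL4wR8tY3uZ6bC1dE5gH"))  # fake token
print(is_high_entropy("hello world hello world hello"))    # ordinary text
```

The length gate matters because short strings cannot carry much entropy per character, so a threshold alone would either miss them or flag ordinary words.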
d206ea9 to 615244d
# Synthetic, format-correct fakes (no real credentials). Vendor keys are assembled from
# prefix + body so no complete secret literal lives in source (which trips secret scanners).
def _key(prefix, body):
I had to split the secrets into two parts. Otherwise GitHub wasn't happy.
    ``repr(value)`` but fails closed: redact entirely on any mask match, over-length
    repr, or a raising ``__repr__``."""
    try:
        rendered = repr(value)
do we want to run _looks_like_secret(rendered) here too maybe? 🤔 otherwise something like this bypasses it:
from posthog.exception_utils import _MaskingConfig, _encode_variable, VariableSizeLimiter

class OpaqueToken:
    __slots__ = ()

    def __repr__(self):
        return "n8fK2pQ9vX7mL4wR8tY3uZ6bC1dE5gH"  # high-entropy fake token

config = _MaskingConfig.build(
    [], [],
    mask_url_credentials=False,
    detect_secrets=True,
)
print(_encode_variable(OpaqueToken(), config, VariableSizeLimiter()))
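One way the suggestion could look, as a standalone sketch rather than the library's actual code: render with `repr`, then redact if the call raises, the result is too long, or a secret predicate matches the rendered text. `render_fail_closed`, `REDACTED`, `max_length` and the stand-in predicate `many_distinct_chars` are all hypothetical names introduced for this example; in the PR the predicate would presumably be `_looks_like_secret`.

```python
from typing import Any, Callable

REDACTED = "[redacted]"  # hypothetical placeholder, not the library's real marker


def render_fail_closed(
    value: Any,
    looks_like_secret: Callable[[str], bool],
    max_length: int = 1024,
) -> str:
    """Return repr(value), or REDACTED if the repr raises, is too long, or looks secret."""
    try:
        rendered = repr(value)
    except Exception:
        return REDACTED
    if len(rendered) > max_length:
        return REDACTED
    if looks_like_secret(rendered):
        return REDACTED
    return rendered


def many_distinct_chars(text: str) -> bool:
    """Stand-in predicate for illustration only: long strings with many distinct characters."""
    return len(text) >= 20 and len(set(text)) >= 16


class OpaqueToken:
    def __repr__(self) -> str:
        return "n8fK2pQ9vX7mL4wR8tY3uZ6bC1dE5gH"  # fake token from the comment above


print(render_fail_closed(OpaqueToken(), many_distinct_chars))
print(render_fail_closed(42, many_distinct_chars))
```

Running the predicate on the rendered string rather than on the original value is what closes the bypass: the object itself is not a string, but its repr is what ends up in the captured payload.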
Currently we detect secrets based on keys and values containing common secret phrases like `api-key`, `password`, etc.

@hpouillot had an amazing idea: entropy-based secret detection. This PR does that and extends our detection with popular secret formats like `sk-ant`, `ey...`, `gh_...`, `glpat_...`.

There is a problem, however: sometimes high-entropy strings that are not secrets get classified as secrets. We do our best to recognize genuine high-entropy non-secrets and skip entropy detection for them (for example UUIDs, Mongo ObjectIds, hashes, object reprs, file paths and URLs).

There are many more cases and rules; they are in the PR.
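The ordering described here (known formats first, then structured non-secrets, then entropy as a last resort) can be sketched as a small classifier. Everything below is an assumption for illustration: the function `classify`, the prefix list and the regular expressions are not taken from the PR, and the real rules are more extensive.

```python
import re

# Illustrative prefixes only; the PR's actual vendor list is longer.
KNOWN_PREFIXES = ("sk-ant-", "ghp_", "glpat-", "AKIA")
# JWTs are three base64url segments; the header segment starts with "eyJ".
JWT_RE = re.compile(r"^eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$")

HEX = "[0-9a-fA-F]"
STRUCTURED_PATTERNS = (
    re.compile(rf"^{HEX}{{8}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{4}}-{HEX}{{12}}$"),  # UUID
    re.compile(rf"^{HEX}{{24}}$"),                                    # Mongo ObjectId
    re.compile(rf"^(?:{HEX}{{32}}|{HEX}{{40}}|{HEX}{{64}})$"),        # common hex digests
    re.compile(r"^<.*>$"),                                            # object repr
    re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://"),                       # URL
    re.compile(r"^(?:/|\./|\.\./|~/|[A-Za-z]:\\)"),                   # file path
)


def classify(value: str) -> str:
    """Decide which stage handles a string value."""
    if value.startswith(KNOWN_PREFIXES) or JWT_RE.match(value):
        return "known_secret_format"
    if any(pattern.search(value) for pattern in STRUCTURED_PATTERNS):
        return "structured_non_secret"
    return "needs_entropy_check"


print(classify("123e4567-e89b-12d3-a456-426614174000"))
print(classify("/usr/lib/python3/site.py"))
print(classify("n8fK2pQ9vX7mL4wR8tY3uZ6bC1dE5gH"))
```

Checking known formats before the exclusions means a value with a recognized vendor prefix is redacted even if it would otherwise resemble a structured identifier.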
I also tightened our search limits based on real-world exceptions.
I ran a benchmark on 3 real exceptions. It measures how long it takes to capture variables from frames.