Skip to content

feat: panic autocapture#157

Merged
cat-ph merged 16 commits into
mainfrom
ci/rust-panic-autocapture-v2
Jun 29, 2026
Merged

feat: panic autocapture#157
cat-ph merged 16 commits into
mainfrom
ci/rust-panic-autocapture-v2

Conversation

@cat-ph

@cat-ph cat-ph commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What

Adds panic autocapture to the Rust SDK, rebuilt on top of the now-merged runtime-independent transport.

When the global client is initialized with init_global, a process-wide std::panic hook is installed by default — crash reporting, on by default. It captures each panic as a personless $exception event (the panic payload, panic-site $exception_panic_file/_line/_column, and a call-site stack trace honoring capture_stacktrace), then chains the previously installed hook.

Capture routes through the transport's background worker, so:

  • no async runtime is required (works for both the async and blocking clients);
  • before_send hooks still apply; and
  • a panic inside before_send is a joinable worker-thread panic, not a double-panic process abort.

Public API

  • ErrorTrackingOptions::capture_panics (default true) — set false to opt out of the automatic global install.
  • ErrorTrackingOptions::panic_flush_timeout_ms (default 2000) — how long the hook blocks the panicking thread waiting for the flush before letting the panic proceed. Kept short and separate from shutdown_timeout_ms so a slow/unreachable endpoint can't freeze the crash or delay its panic message.
  • install_panic_hook(Arc<Client>) — install the hook for a standalone (non-global) client. It takes an Arc because the hook is 'static; a disabled client is a no-op and never latches the single process-wide hook.
  • New error variant Error::PanicHookAlreadyInstalled.

Notes

  • Delivery is best-effort: the $exception is flushed ahead of queued retries, but under sustained transport backpressure (slow/unreachable endpoint or a large capture backlog) it may not be sent within the bound before the process exits. A true queue-jumping priority path is better suited to the sender-pool rework than this std-mpsc worker.
  • On the panicking thread the hook does only panic-safe, tracing-free work (enqueue + bounded flush); the HTTP send and before_send run on the worker.

Testing

cargo test across the four feature configs (default, capture-v1, --no-default-features --features error-tracking ± capture-v1); clippy -D warnings; cargo doc; public-api snapshot regenerated.

tested on a devbox

CleanShot 2026-06-25 at 19 56 46

cat-ph added 10 commits June 25, 2026 18:28
Install a process-wide panic hook with `install_panic_hook(Arc<Client>)` that
captures panics as personless `$exception` events: the panic payload as the
exception value, panic-site `$exception_panic_file` / `_line` / `_column` from
`PanicInfo::location()`, and a call-site stacktrace honoring the client's
`capture_stacktrace`.

Unlike the original draft (#130), the hook routes through the client's transport
— a non-blocking enqueue plus a synchronous flush — rather than a dedicated
synchronous HTTP path with its own timeout. The actual send and `before_send`
run on the transport's background worker (a std::thread), where the panic count
is zero, so a panic there is a joinable thread panic instead of the process
abort a panic on the hook thread would cause. This lets `before_send` run (no
longer bypassed) and drops the separate `panic_capture_timeout_seconds` knob;
teardown is bounded by the worker's flush. Installing twice returns
`Error::PanicHookAlreadyInstalled`.

Also strips v0 crate disambiguators (`std[hex]::...`) from demangled frame names
so internal-frame matching and server-side grouping see stable symbols.
A user `before_send` hook that panics fires the panic hook on the transport
worker thread itself (the hook runs before `catch_unwind` catches the
`before_send` panic). There, `flush_blocking()` enqueued a `Flush` to the worker
and blocked on `recv` — but the worker can't process its own `Flush` mid-hook,
so it deadlocked, stalling delivery and hanging later `flush`/`shutdown`.

`flush_blocking` now returns early when called from the worker thread (matched
by its name): the panic event is already enqueued and ships on the worker's next
cycle. Adds a watchdog-bounded regression test.
The round-1 fix detected the worker thread by its name and only skipped
the synchronous self-flush. That was both imprecise and incomplete:

- Imprecise: the thread name is a shared string, so any unrelated thread
  named `posthog-transport` would also be treated as the worker.
- Incomplete: it stopped the self-flush deadlock but not recursion. When
  a `before_send` hook panics, the panic hook fires on the worker; still
  building and enqueuing the `$exception` re-enters `before_send`, which
  panics again, forever.

Capture the worker's `ThreadId` at spawn and compare against it, then
skip the panic capture entirely (not just the flush) when running on
this client's worker. That fixes both the deadlock and the recursion at
the source. Strengthen the regression test to panic unconditionally in
`before_send` so it exercises the recursion path, not just the flush.
The panic hook runs on the panicking thread before unwinding releases any
locks held at the panic site, then blocks in flush_blocking() waiting for
the worker. The worker runs before_send for the queued events. If a
before_send hook needs a lock the panic site still holds, the worker
blocks on that lock, the panicking thread blocks on the worker, and the
process hangs forever instead of crashing and reporting.

Bound flush_blocking with rx.recv_timeout(shutdown_timeout) instead of an
unbounded recv. The handle captures the (clamped) shutdown_timeout at
spawn, reusing the caller's existing delivery budget rather than adding a
knob. On the deadlock the hook now returns once the budget elapses and the
panic proceeds; the worker unblocks on its own once unwinding frees the
lock. Factor the clamp ceiling into MAX_SHUTDOWN_TIMEOUT, shared with the
worker drain. Add a regression test that panics while holding a lock a
before_send hook also takes.
capture_panic routed the $exception through Client::capture, which is
once on a full queue. Both run arbitrary tracing-subscriber code on the
already-panicking thread: a subscriber that panics aborts the process
(nested panic), and one that waits on a lock the panic site holds hangs it
before the previous hook runs. That contradicts the hook's own invariant
of doing only panic-safe work.

Add a tracing-free enqueue path for the panic hook: TransportHandle gains
enqueue_panic (no full-queue warn!), and each Client gains a
non-instrumented enqueue_panic_event. capture_panic now uses it instead of
capture. Factor the pure slot reservation into reserve_slot, shared by the
warning try_reserve and the silent panic path, and pull the shared
stamp-and-send into send_reserved so enqueue and enqueue_panic stay DRY.
The panicking thread now only builds the event, reserves a slot, sends on
the channel, and waits on the bounded flush — no subscriber code.
The previous commit bounded the panic flush by reusing shutdown_timeout_ms
(default 30s). A cross-engine review flagged that this is the wrong budget
for a crash path: the hook flushes on the panicking thread before the
panic message prints, so when PostHog is slow or unreachable - or a
backlog is queued at panic time - every panic freezes the process and
delays its own crash message for up to 30s. The rare before_send-lock
deadlock is only one way to hit that bound; an unreachable host hits it on
every panic.

Add a dedicated panic_flush_timeout_ms to ErrorTrackingOptions (default
2000ms), decoupled from shutdown_timeout_ms, so the dying process stays
responsive and operators can tune it. The transport gains a
flush_blocking_timeout(timeout) (clamped to MAX_SHUTDOWN_TIMEOUT); the
no-argument flush_blocking becomes test-only. capture_panic now flushes
with the configured budget. Regenerate the public-api snapshot for the new
builder setter.
Per review feedback, panic autocapture shouldn't require finding and
calling a standalone function. init_global now installs the panic hook
automatically — crash reporting, on by default, like Sentry. This is the
clean path because the global client lives in a process `static`, whose
lifetime matches the `'static` panic hook exactly: the hook borrows it
directly, no Arc needed.

- Add `ErrorTrackingOptions::capture_panics` (default `true`). When set,
  `init_global` installs a hook that captures panics through the global
  client. Set it `false` to opt out.
- Keep `install_panic_hook(Arc<Client>)` as the explicit path for a
  standalone (non-global) client — a non-`'static` client can only feed a
  process-global hook via an `Arc` the caller keeps alive, so this stays a
  separate, opt-in entry point. The flag does not affect standalone clients.
- Factor the latch + take/set/chain logic into a shared `install_hook`;
  both entry points are thin wrappers over it. The global hook reads the
  client from the `OnceLock` at panic time, so it owns no handle.
- Teach the stacktrace stripper about the new `install_hook` and
  `maybe_install_global_panic_hook` closure frames so the first
  application frame is still surfaced.

Tests: the flag default/builder, and an end-to-end init_global panic
capture (the global client is a set-once OnceLock, so that is the only
test that initializes it). Regenerate the public-api snapshot for the new
builder setter.
…pture

Two autoreview findings on the global panic-capture path:

- The panic hook enqueued the `$exception` then sent a normal `Flush`,
  which services older held retries before the live buffer. With a retry
  backlog behind a slow endpoint, the hook's short bounded wait could be
  spent on stale retries and the crash event never attempted before the
  process exits. Add a dedicated `Control::FlushCaptures` that drains the
  live buffer (the `$exception`) first, then older retries best-effort;
  the panic flush uses it.

- `maybe_install_global_panic_hook` only checked `capture_panics`, so a
  disabled global client (blank key / `disabled(true)`) still latched the
  single process-wide hook — which then no-ops — blocking a later
  `install_panic_hook` on an enabled standalone client. Gate install on a
  new `should_capture_global_panics` predicate (`!is_disabled && flag`),
  unit-tested across enabled / disabled / opted-out clients.
…t best-effort

- `install_panic_hook` now returns `Ok(())` without installing when the
  client is disabled, mirroring the global path's guard. A disabled client
  can't send, so latching the single process-wide hook would only block a
  later enabled install (and silently skip `init_global`'s best-effort
  auto-install). Regression-tested.
- Document that panic delivery is best-effort: the just-captured
  `$exception` is flushed ahead of queued retries, but under sustained
  backpressure (slow/unreachable endpoint or a large capture backlog) it
  may not be sent within `panic_flush_timeout_ms` before the process exits.
  Prioritizing past a saturated channel would need a transport-level
  priority path, which belongs with the sender-pool rework rather than
  bolted onto the std-mpsc worker.
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

posthog-rs-v0 Compliance Report

Date: 2026-06-29 00:08:50 UTC
Duration: 15514ms

⚠️ Some Tests Failed

31/45 tests passed, 14 failed


Capture Tests

29/29 tests passed

View Details
Test Status Duration
Format Validation.Event Has Required Fields 128ms
Format Validation.Event Has Uuid 138ms
Format Validation.Event Has Lib Properties 141ms
Format Validation.Distinct Id Is String 140ms
Format Validation.Token Is Present 136ms
Format Validation.Custom Properties Preserved 142ms
Format Validation.Event Has Timestamp 139ms
Retry Behavior.Retries On 503 5134ms
Retry Behavior.Does Not Retry On 400 2140ms
Retry Behavior.Does Not Retry On 401 2134ms
Retry Behavior.Respects Retry After Header 5094ms
Retry Behavior.Implements Backoff 15088ms
Retry Behavior.Retries On 500 5092ms
Retry Behavior.Retries On 502 5091ms
Retry Behavior.Retries On 504 5091ms
Retry Behavior.Max Retries Respected 15086ms
Deduplication.Generates Unique Uuids 92ms
Deduplication.Preserves Uuid On Retry 5021ms
Deduplication.Preserves Uuid And Timestamp On Retry 10033ms
Deduplication.Preserves Uuid And Timestamp On Batch Retry 5036ms
Deduplication.No Duplicate Events In Batch 23ms
Deduplication.Different Events Have Different Uuids 22ms
Compression.Sends Gzip When Enabled 23ms
Batch Format.Uses Proper Batch Structure 23ms
Batch Format.Flush With No Events Sends Nothing 24ms
Batch Format.Multiple Events Batched Together 101ms
Error Handling.Does Not Retry On 403 2082ms
Error Handling.Does Not Retry On 413 2067ms
Error Handling.Retries On 408 5071ms

Feature_Flags Tests

⚠️ 2/16 tests passed, 14 failed

View Details
Test Status Duration
Request Payload.Request With Person Properties Device Id 67ms
Request Payload.Flags Request Uses V2 Query Param 41ms
Request Payload.Flags Request Hits Flags Path Not Decide 26ms
Request Payload.Flags Request Omits Authorization Header 26ms
Request Payload.Token In Flags Body Matches Init 24ms
Request Payload.Groups Round Trip 31ms
Request Payload.Groups Default To Empty Object 44ms
Request Payload.Person Properties Distinct Id Auto Populated When Caller Omits It 21ms
Request Payload.Disable Geoip False Propagates As Geoip Disable False 45ms
Request Payload.Disable Geoip Omitted Defaults To False 23ms
Request Payload.Flag Keys To Evaluate Contains Only Requested Key 53ms
Request Lifecycle.No Flags Request On Init Alone 20ms
Request Lifecycle.No Flags Request On Normal Capture 68ms
Request Lifecycle.Two Flag Calls Produce Two Remote Requests 24ms
Request Lifecycle.Mock Response Value Is Returned To Caller 26ms
Side Effect Events.Get Feature Flag Captures Feature Flag Called Event 35ms

Failures

request_payload.request_with_person_properties_device_id

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-6cfddce4-c1e8-4866-a1d6-eb536e73cd04'

request_payload.flags_request_uses_v2_query_param

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-5515de72-511d-41c8-8efd-f6f6aa0bb15c'

request_payload.flags_request_hits_flags_path_not_decide

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-eef1202f-e786-43a9-a4b8-dccfe765d568'

request_payload.flags_request_omits_authorization_header

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-b368f3a6-95c7-4cf5-a148-486bd363f4b6'

request_payload.token_in_flags_body_matches_init

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-d5a53c77-4867-4e9d-a4c8-51b5de61e113'

request_payload.groups_round_trip

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-2e4023e2-68d4-4ad0-bb45-dbdadac8c713'

request_payload.groups_default_to_empty_object

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-b0d8b39b-a257-49e2-8503-95bbbd0bf622'

request_payload.person_properties_distinct_id_auto_populated_when_caller_omits_it

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-eed3c43a-b4d6-4f01-bcc8-d9b9a13d5b84'

request_payload.disable_geoip_false_propagates_as_geoip_disable_false

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-610506a2-65d2-4b29-a02a-cf77d0e2a796'

request_payload.disable_geoip_omitted_defaults_to_false

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-603cb943-5166-46ac-8de1-6504a8c12920'

request_payload.flag_keys_to_evaluate_contains_only_requested_key

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-dabec705-8e7a-47cf-b973-2c9a27e3a5dc'

request_lifecycle.two_flag_calls_produce_two_remote_requests

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-24d3dc19-f4a9-4054-b50b-005edeac7396'

request_lifecycle.mock_response_value_is_returned_to_caller

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-77635b63-28b2-40a9-809e-1e3e34c9eea7'

side_effect_events.get_feature_flag_captures_feature_flag_called_event

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-99bdf335-edd6-4d51-b917-0e79fe710df7'

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

posthog-rs-v1 Compliance Report

Date: 2026-06-29 00:09:37 UTC
Duration: 23005ms

⚠️ Some Tests Failed

96/110 tests passed, 14 failed


Capture_V1 Tests

94/94 tests passed

View Details
Test Status Duration
Endpoint And Method.Targets V1 Endpoint 148ms
Endpoint And Method.Does Not Use Legacy Endpoints 147ms
Required Headers.Has Authorization Bearer Header 147ms
Required Headers.Has Content Type Json 148ms
Required Headers.Has Posthog Sdk Info Format 147ms
Required Headers.Has Posthog Attempt Header 147ms
Required Headers.Has Posthog Request Id 146ms
Required Headers.Has Posthog Request Timestamp 147ms
Required Headers.Has User Agent 146ms
Body Format.Body Has Created At And Batch 149ms
Body Format.No Api Key In Body 133ms
Body Format.No Sent At In Body 132ms
Event Format.Event Has Required Root Fields 134ms
Event Format.Event Uuid Is Valid 131ms
Event Format.Event Timestamp Is Rfc3339 142ms
Event Format.Distinct Id Is String 133ms
Event Format.Distinct Id At Root Not Properties 134ms
Event Format.Custom Properties Preserved 133ms
Event Format.Set Properties Preserved 134ms
Event Format.Set Once Properties Preserved 130ms
Event Format.Groups Properties Preserved 123ms
Event Format.Sdk Generates Uuid If Not Provided 122ms
Event Format.Event Has Required Root Fields Batch 169ms
Event Format.Event Uuid Is Valid Batch 169ms
Event Format.Event Timestamp Is Rfc3339 Batch 168ms
Event Format.Distinct Id Is String Batch 169ms
Event Format.Distinct Id At Root Not Properties Batch 169ms
Event Format.Custom Properties Preserved Batch 168ms
Event Format.Set Properties Preserved Batch 166ms
Event Format.Set Once Properties Preserved Batch 161ms
Event Format.Groups Properties Preserved Batch 147ms
Event Format.Sdk Generates Uuid If Not Provided Batch 139ms
Batch Behavior.Multiple Events In Single Batch 182ms
Batch Behavior.Batch Envelope Smoke 168ms
Batch Behavior.Flush With No Events Sends Nothing 91ms
Batch Behavior.Flush At Triggers Batch 1113ms
Batch Behavior.Created At Reflects Batch Creation Time 101ms
Deduplication.Generates Unique Uuids 174ms
Deduplication.Different Events Same Content Different Uuids 112ms
Deduplication.Preserves Uuid On Retry 5115ms
Deduplication.Preserves Timestamp On Retry 5070ms
Deduplication.Preserves Uuid And Timestamp On Batch Retry 5106ms
Deduplication.No Duplicate Events In Batch 139ms
Header Behavior On Retry.Attempt Header Starts At One 74ms
Header Behavior On Retry.Attempt Header Increments On Retry 10074ms
Header Behavior On Retry.Request Id Preserved On Retry 5067ms
Header Behavior On Retry.Different Requests Have Different Request Ids 2069ms
Header Behavior On Retry.Request Timestamp Changes On Retry 5065ms
Response Format Validation.Success Response Has Uuid Keyed Results 60ms
Response Format Validation.Success Response Has Ok For Each Event 30ms
Response Format Validation.Success No Retry After When All Ok 34ms
Response Format Validation.Success Retry After Present When Retry Events 31ms
Response Format Validation.Success No Retry After When Drop Only 33ms
Response Format Validation.Response Echoes Request Id 28ms
Retry Behavior.Retries On 408 5038ms
Retry Behavior.Retries On 500 5028ms
Retry Behavior.Retries On 503 5024ms
Retry Behavior.Retries On 504 5023ms
Retry Behavior.Retryable Errors Have Retry After 2025ms
Retry Behavior.Respects Retry After On Retryable Error 8024ms
Retry Behavior.Does Not Retry On 400 2027ms
Retry Behavior.Does Not Retry On 401 2031ms
Retry Behavior.Does Not Retry On 402 2033ms
Retry Behavior.Does Not Retry On 413 2027ms
Retry Behavior.Does Not Retry On 415 2027ms
Retry Behavior.Non Retryable Errors Have No Retry After 2026ms
Retry Behavior.Implements Backoff 15030ms
Retry Behavior.Max Retries Respected 15024ms
Partial Batch Handling.Handles 200 Full Success 2042ms
Partial Batch Handling.Handles 200 With All Ok 3048ms
Partial Batch Handling.Does Not Retry Dropped Events 3046ms
Partial Batch Handling.Does Not Retry Limited Events 3028ms
Partial Batch Handling.Prunes Ok Events On Partial Retry 5030ms
Partial Batch Handling.Prunes Dropped Events On Partial Retry 5025ms
Partial Batch Handling.Retries Only Retry Events From Partial 5026ms
Partial Batch Handling.Partial Retry Preserves Uuids 5024ms
Partial Batch Handling.Partial Retry Attempt Header Increments 5032ms
Partial Batch Handling.Partial Retry Request Id Preserved 5031ms
Partial Batch Handling.Respects Retry After On Partial 5024ms
Partial Batch Handling.Unknown Result Treated As Terminal 3023ms
Partial Batch Handling.Mixed Ok Drop Limited No Retry 3038ms
Compression.Sends Gzip Content Encoding 39ms
Compression.No Content Encoding When Disabled 21ms
Compression.Compressed Body Is Decompressible 22ms
Error Handling.Does Not Retry On Unknown 4Xx 2022ms
Event Options.Cookieless Mode Override 23ms
Event Options.Disable Skew Correction Override 22ms
Event Options.Process Person Profile Override 22ms
Event Options.Product Tour Id Override 21ms
Event Options.Unset Options Omitted 23ms
Event Options.Options Override In Batch 22ms
Geoip And Historical Migration.Geoip Disable Injected Into Properties 20ms
Geoip And Historical Migration.Historical Migration Set In Body 22ms
Geoip And Historical Migration.Historical Migration Absent By Default 22ms

Feature_Flags Tests

⚠️ 2/16 tests passed, 14 failed

View Details
Test Status Duration
Request Payload.Request With Person Properties Device Id 12ms
Request Payload.Flags Request Uses V2 Query Param 13ms
Request Payload.Flags Request Hits Flags Path Not Decide 12ms
Request Payload.Flags Request Omits Authorization Header 15ms
Request Payload.Token In Flags Body Matches Init 15ms
Request Payload.Groups Round Trip 15ms
Request Payload.Groups Default To Empty Object 13ms
Request Payload.Person Properties Distinct Id Auto Populated When Caller Omits It 12ms
Request Payload.Disable Geoip False Propagates As Geoip Disable False 13ms
Request Payload.Disable Geoip Omitted Defaults To False 12ms
Request Payload.Flag Keys To Evaluate Contains Only Requested Key 14ms
Request Lifecycle.No Flags Request On Init Alone 11ms
Request Lifecycle.No Flags Request On Normal Capture 24ms
Request Lifecycle.Two Flag Calls Produce Two Remote Requests 12ms
Request Lifecycle.Mock Response Value Is Returned To Caller 15ms
Side Effect Events.Get Feature Flag Captures Feature Flag Called Event 12ms

Failures

request_payload.request_with_person_properties_device_id

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-7d154a82-5052-498d-9371-1ae9f47b991a'

request_payload.flags_request_uses_v2_query_param

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-76a70551-8047-4113-b3f3-f76a6d4c4ea8'

request_payload.flags_request_hits_flags_path_not_decide

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-6091aff5-9b11-468d-91a7-93ec11f040d4'

request_payload.flags_request_omits_authorization_header

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-c7397306-d943-4631-b558-1a4ce08d77a4'

request_payload.token_in_flags_body_matches_init

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-fa8064ee-f8b9-4d67-8832-03f26a321830'

request_payload.groups_round_trip

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-50f9d928-5a88-41c4-8aca-448833ba4665'

request_payload.groups_default_to_empty_object

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-bc8d7f42-9918-4b7e-88f1-47e157df6996'

request_payload.person_properties_distinct_id_auto_populated_when_caller_omits_it

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-58c903c2-2cdc-437d-bd76-7ef0024125b1'

request_payload.disable_geoip_false_propagates_as_geoip_disable_false

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-5b7cf57c-01f7-4d92-acd4-a2901d38fba3'

request_payload.disable_geoip_omitted_defaults_to_false

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-318c9720-51df-4590-be0e-b8c9b4ca5bbc'

request_payload.flag_keys_to_evaluate_contains_only_requested_key

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-43102c95-8951-4029-ae62-2d26966f18ad'

request_lifecycle.two_flag_calls_produce_two_remote_requests

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-bb878c68-dc2c-48dc-83e0-25d0a0be6ba3'

request_lifecycle.mock_response_value_is_returned_to_caller

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-ad68e721-bfa0-418c-93b3-6365e25ca61e'

side_effect_events.get_feature_flag_captures_feature_flag_called_event

404, message='Not Found', url='http://sdk-adapter:8080/get_feature_flag?test_id=t-c3a04cbc-8266-4280-ab9e-133cb3b432b6'

@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown

Comments Outside Diff (1)

  1. src/error_tracking.rs, line 876-901 (link)

    P2 PanicHookReset::Drop skips restore on panic — can cascade test failures

    When a test assertion fails (i.e. the test thread panics), Drop sees std::thread::panicking() == true and silently skips restore(). PANIC_HOOK_INSTALLED is left true and the installed hook is left in place. The next serialized test (holding panic_hook_test_lock()) will then call install_panic_hook(...).unwrap(), which returns Err(PanicHookAlreadyInstalled) and panics — its PanicHookReset also skips restore, and the failure cascade continues for every remaining panic-hook test in the suite.

    A common pattern for this situation is to call restore() unconditionally in Drop (it's safe — restore only calls std::panic::set_hook and an atomic store, neither of which panics), or to use std::panic::always_abort temporarily during restore.

Reviews (1): Last reviewed commit: "fix: don't latch the panic hook for a di..." | Re-trigger Greptile

Comment thread src/client/async_client.rs
Comment thread src/error_tracking.rs
@cat-ph cat-ph marked this pull request as ready for review June 25, 2026 16:57
@cat-ph cat-ph requested a review from a team as a code owner June 25, 2026 16:57
@cat-ph cat-ph requested review from a team, ablaszkiewicz and hpouillot June 25, 2026 16:57
@marandaneto

Copy link
Copy Markdown
Member

$exception_level should be fatal for panics

Comment thread .sampo/changesets/panic-autocapture.md Outdated
Comment thread api/public-api.txt Outdated
Comment thread src/client/async_client.rs
Comment thread src/client/async_client.rs
Comment thread src/client/async_client.rs
Comment thread src/client/async_client.rs
Comment thread src/error_tracking.rs
Comment thread src/error_tracking.rs Outdated
@marandaneto

Copy link
Copy Markdown
Member

@cat-ph left a bunch of comments/suggestions mainly to keep a cleaner public API and some 'better' behaviors imo, feel free to push back if you dont think we need that
also a few of those suggestions can be follow ups

@marandaneto

Copy link
Copy Markdown
Member

something my agent told:

Global panic-hook tests look fragile

The test initializes the real process-wide global client:

  GLOBAL_CLIENT: OnceLock<Client>

That cannot be reset. The comment says it is the only test doing that, but this is easy to break later and can be
order-dependent with future tests.

Better options:

  • move global panic-hook behavior to an integration test in a separate process
  • or isolate it behind test-only reset hooks
  • or avoid asserting behavior that requires setting the real OnceLock

and:

Maybe avoid “first frame must be app frame” tests

Tests currently assert that frame 0 is the user panic site. That encourages brittle stripping logic.

A more robust test would assert:

  • the panic site frame exists
  • it is marked in_app = true
  • SDK/runtime frames are marked in_app = false

and:

Consider whether standalone clients should have clearer docs

capture_panics only affects init_global. Standalone clients require:

  install_panic_hook(Arc<Client>)

That is reasonable, but easy to miss. The docs/changelog should be very explicit:

  • global client: panic capture on by default
  • standalone client: no automatic install
  • manual install is process-wide and only one PostHog hook can be installed

Comment thread src/error_tracking.rs Outdated
cat-ph added 3 commits June 26, 2026 18:34
Panics are unrecoverable (the process is unwinding/aborting), so the
`$exception_level` of the captured `$exception` is now `fatal` instead of
`error`, matching how the other SDKs classify crashes. The shared panic
matcher now asserts it.
Per review: the panic stacktrace dropped its leading frames using a
hand-maintained function-name list (the panic hook + capture machinery).
That list was brittle — it drifted whenever internals were renamed or
inlined, and it threw away frames the error-tracking UI can collapse on
its own.

Stop dropping. `capture_raw_panic_frames` now keeps the full backtrace,
and in-app classification marks the machinery as `in_app = false`:
extend `default_in_app_function` to cover bare unwind symbols
(`rust_begin_unwind`, the `__rust`/`___rust` shims) and the async runtime
crates (`futures_core`/`futures_util`), on top of the existing
`std`/`core`/`alloc`/`backtrace`/`posthog_rs`/`reqwest`/`tokio`. The user's
panic site stays in-app; everything above it is kept but flagged out of
app. Drop the now-dead `is_internal_panic_frame` and reframe its test
around classification; the panic matcher now asserts the machinery is
present and out-of-app rather than expecting frame 0 to be the panic site.

The manual `capture_exception` path still trims its own capture frames
(separate behavior, left as-is for now).
The `init_global_installs_panic_capture_by_default` unit test initialized
the real process-wide `GLOBAL_CLIENT` (a set-once `OnceLock` that can't be
reset), which made it order-dependent: any future test that touched the
global, or a reordering, could break the suite. It can't be fixed with
dependency injection either — the panic hook needs a `'static` client, so
only the global `OnceLock` qualifies and there's nothing to mock.

Move that end-to-end check to its own integration-test binary
(`tests/test_panic_global.rs`), where the `OnceLock` is naturally isolated,
and drop the in-process test plus its now-unused `init_global_test` helper.
The gate (`should_capture_global_panics`) and the hook mechanism
(`install_panic_hook`) stay covered by the in-process unit tests; the
integration test only exercises the `init_global` -> hook wiring, and runs
across all error-tracking feature configs (V0/V1 x async/blocking).
@cat-ph

cat-ph commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

I resolved some previous thread pre-emptively, but some updates:

  • changed the level to fatal
  • updated the in_app classification / frame stripping logic
  • fixed the test fragility

and them some remaining discussions/thoughts:

  • extra defensiveness in the enqueue path - kinda intended to try and keep code that could panic to the minimum
  • whether we should have a separate, lower panic_flush_timeout_ms - I'd say yes, but if we want to align timeouts across SDKs and think this may not be needed, maybe we can 'hardcode' this to 2s for now and not have it public (we can always just expose it later on if we need to), open to thoughts
  • whether enabled by default, IMO yes?

@marandaneto

Copy link
Copy Markdown
Member

I resolved some previous thread pre-emptively, but some updates:

  • changed the level to fatal
  • updated the in_app classification / frame stripping logic
  • fixed the test fragility

and them some remaining discussions/thoughts:

  • extra defensiveness in the enqueue path - kinda intended to try and keep code that could panic to the minimum
  • whether we should have a separate, lower panic_flush_timeout_ms - I'd say yes, but if we want to align timeouts across SDKs and think this may not be needed, maybe we can 'hardcode' this to 2s for now and not have it public (we can always just expose it later on if we need to), open to thoughts
  • whether enabled by default, IMO yes?

cool stuff, lets keep that way then

Comment thread src/error_tracking.rs
@marandaneto

Copy link
Copy Markdown
Member

Add a test that the actual panic-site frame is in_app = true

The current test checks:

  • panic site frame exists
  • panic/unwind machinery is in_app = false

But it does not verify that the user panic-site frame is marked in_app = true.

I’d ask for this, ideally in tests/test_panic_global.rs, because that panic originates from an integration-test crate,
not inside posthog_rs itself.

Something like:

  frames.iter().any(|frame| {
      frame["function"].as_str().is_some_and(|f| f.contains("...panic site..."))
          && frame["in_app"] == true
  })

Comment thread src/error_tracking.rs Outdated
@marandaneto

Copy link
Copy Markdown
Member

@cat-ph great work and nice addition
left a few comments/non blockers (suggestions), but happy as is!

cat-ph added 3 commits June 29, 2026 01:19
Addresses review feedback on the panic-autocapture API surface:

- Global-only: a panic hook is process-global (std::panic::set_hook), so it
  pairs with the process-global client. Dropped the public
  install_panic_hook(Arc<Client>) — it exposed an Arc/'static lifetime wart
  and muddy multi-client semantics for a niche case. The hook now installs
  solely via the global client; install_panic_hook is kept #[cfg(test)] to
  exercise the shared hook path against a standalone client. (Re-exposable
  later without a breaking change if a real need appears.)

- Opt-in: ErrorTrackingOptions::capture_panics now defaults to false. Panic
  autocapture is enabled explicitly, matching PostHog's other SDKs (e.g.
  Android exception autocapture is opt-in) and avoiding seizing the process
  panic hook unsolicited.

- Fixed flush timeout: removed the public panic_flush_timeout_ms option in
  favor of a private 2s PANIC_FLUSH_TIMEOUT const. The dying-thread flush
  budget isn't something callers should tune for now; easy to expose later.

Regenerated the public-api snapshot (drops install_panic_hook and the
panic_flush_timeout_ms builder setter), updated the changeset and the
init_global / option docs, and adjusted the affected tests (the integration
test now opts in via capture_panics(true); the bounded-flush test relies on
the fixed 2s timeout).
…ite is in-app

- Extend the default in-app classifier so framework/error-plumbing frames
  (`tracing`, `tracing_core`, `log`, `anyhow`) are marked `in_app = false`,
  alongside the existing std/core/alloc/backtrace/futures/posthog_rs/reqwest/
  tokio set, per review.
- Add the missing positive assertion: the integration test now panics from a
  named site in the test crate (not `posthog_rs`) and verifies that frame is
  kept and marked `in_app = true` — the in-process unit tests can't show this
  because a panic there originates inside `posthog_rs` (classified out-of-app).
Removing the `panic_flush_timeout_ms` option meant the bounded-flush tests
(which deadlock `before_send` on purpose) waited the full production budget,
adding ~2s of wall-clock each. Split `PANIC_FLUSH_TIMEOUT` by `cfg(test)`:
200ms under test, 2s in production. The tests only need to prove the flush is
bounded, not the production duration, so this restores their speed without
weakening coverage.
@cat-ph

cat-ph commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

aight last few small changes to simplify actually, I think it's better:

  • disable by default (what other SDKs do)
  • hardcode panic flush timeout (can make it configurable later or align with other SDK timeouts)
  • keep global only, simplifies the public API, we can add per-client later if others find it useful

will test again thoroughly tomorrow (today? 😆) & merge unless any objections 🚀

@cat-ph cat-ph merged commit 4294c51 into main Jun 29, 2026
25 checks passed
@cat-ph cat-ph deleted the ci/rust-panic-autocapture-v2 branch June 29, 2026 14:37
cat-ph added a commit that referenced this pull request Jun 29, 2026
Panic autocapture (#157) was squash-merged and released as v0.15.0, and
its final form diverged from the snapshot this branch was built on.
Reconcile so main's panic implementation is authoritative while keeping
this branch's native instruction-address + debug-image capture:

- Panic capture keeps every frame (in_app marks SDK/runtime), per main,
  but now also collects the $debug_images those frames reference so the
  server can symbolicate native panic frames.
- Manual capture keeps the address+name SDK-frame stripping plus native
  capture.
- Drop this branch's stale panic variants superseded by main: opt-out
  default, configurable panic_flush_timeout_ms, public install_panic_hook
  (now #[cfg(test)]), and is_internal_panic_frame (panics no longer strip).
- Drop the already-released panic-autocapture changeset; keep the
  unreleased native-symbolication changeset.
- Tests: union of main's panic/in-app tests and this branch's native
  tests; public-api.txt matches main (native adds no public symbols).
Comment thread src/error_tracking.rs

let mut event = Event::new_anon("$exception");
if let Some(location) = panic_info.location() {
event.insert_prop("$exception_panic_file", location.file())?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to understand, why adding those properties ? They won't be displayed on the UI and they are not processed, are they ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants