
feat(coglet): add Sentry error reporting for infrastructure errors#2865

Open
markphelps wants to merge 3 commits into main from mphelps/sentry-integration

markphelps commented Mar 25, 2026

Summary

  • Adds Sentry error reporting to the coglet parent process, activated when SENTRY_DSN is set in the environment
  • Automatically captures infrastructure-level errors (setup failures, worker crashes, IPC errors) via the sentry tracing layer — zero overhead when disabled
  • Upgrades reqwest from 0.12 to 0.13 to share the same version with the sentry SDK

Details

How it works

The integration uses the sentry crate's tracing layer, which plugs into the existing tracing-subscriber stack:

  • ERROR-level tracing events → captured as Sentry events (issues)
  • WARN-level tracing events → captured as Sentry breadcrumbs (context for future errors)
  • INFO/DEBUG/TRACE → ignored (avoids noise from prediction failures logged at info level)
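
The level mapping above can be sketched as a small pure function. This is a std-only model, not the sentry crate's API: `Level`, `SentryAction`, and `sentry_action` are hypothetical names introduced for illustration, and the real layer's filter types may look different.

```rust
/// Severity levels, mirroring the five levels used by the `tracing` crate.
/// (Hypothetical stand-in type, defined here so the sketch has no dependencies.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// What happens to an event at a given level.
/// (Hypothetical stand-in type, not the sentry crate's filter type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SentryAction {
    /// Reported as a Sentry event (issue).
    Event,
    /// Stored as a breadcrumb and attached to a later event.
    Breadcrumb,
    /// Dropped; never reaches Sentry.
    Ignore,
}

/// Map a tracing level to the action described in the PR summary.
pub fn sentry_action(level: Level) -> SentryAction {
    match level {
        Level::Error => SentryAction::Event,
        Level::Warn => SentryAction::Breadcrumb,
        Level::Info | Level::Debug | Level::Trace => SentryAction::Ignore,
    }
}
```

Under this mapping, a prediction failure logged at info level falls into the `Ignore` arm, which is why user model exceptions stay out of Sentry.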

What gets reported

Sentry is initialized in the parent process only (the HTTP server + orchestrator). All errors from the worker subprocess are already surfaced to the parent via IPC (control channel and slot sockets), where the orchestrator logs them at error! level.

| Error type | Source (parent process) | Tracing level |
|---|---|---|
| Setup failures ("Worker initialization failed") | coglet-python/src/lib.rs | error! |
| Worker fatal errors ("Worker fatal") | coglet/src/orchestrator.rs | error! |
| Control channel errors | coglet/src/orchestrator.rs | error! |
| Slot socket errors | coglet/src/orchestrator.rs | error! |
| File upload failures | coglet/src/orchestrator.rs | error! |
| Worker panics | sentry panic integration | automatic |
| Worker crash (control channel closed) | coglet/src/orchestrator.rs | warn! (breadcrumb) |

Prediction failures (user model exceptions) are logged at info level and are not reported to Sentry.

Worker subprocess errors (11 error! callsites in worker.rs) are not directly captured by Sentry, but are relayed to the parent via IPC where the orchestrator captures them.

Sentry context/tags

All events are enriched with:

  • coglet.version — coglet build version
  • coglet.predictor_ref — predictor module reference
  • coglet.max_concurrency — configured concurrency
  • python.version — Python runtime version
  • cog.sdk_version — cog SDK version
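
As a rough sketch of how the tag set above might be assembled before it is handed to Sentry, the following builds a key/value map from a metadata struct. Only the five tag keys come from this PR; `ScopeMetadata`, its field names, and `scope_tags` are hypothetical, and the actual configure_sentry_scope() may be structured differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical container for the values attached to every event.
pub struct ScopeMetadata {
    pub coglet_version: String,
    pub predictor_ref: String,
    pub max_concurrency: usize,
    pub python_version: String,
    pub sdk_version: String,
}

/// Build the tag map using the tag keys listed in the PR description.
/// All values are rendered as strings, since tags are plain key/value text.
pub fn scope_tags(meta: &ScopeMetadata) -> BTreeMap<&'static str, String> {
    BTreeMap::from([
        ("coglet.version", meta.coglet_version.clone()),
        ("coglet.predictor_ref", meta.predictor_ref.clone()),
        ("coglet.max_concurrency", meta.max_concurrency.to_string()),
        ("python.version", meta.python_version.clone()),
        ("cog.sdk_version", meta.sdk_version.clone()),
    ])
}
```

Keeping the keys in one function makes it easy to check that every event carries the same five tags.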

Zero overhead when disabled

When SENTRY_DSN is not set:

  • sentry::init() creates a disabled no-op client
  • sentry_tracing_layer() returns None; tracing-subscriber implements Layer for Option<L>, so a None layer does nothing when events are dispatched
  • configure_sentry_scope() is a no-op on a disabled client
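
The optional-layer pattern behind these points can be modeled without any dependencies. The sketch below uses a simplified, hypothetical `Layer` trait and a `RecordingLayer` in place of the real tracing-subscriber and sentry types. The `sentry_layer` helper and its treatment of an empty DSN as "disabled" are assumptions made for illustration.

```rust
/// Minimal stand-in for a tracing layer: something that observes events.
/// (Hypothetical trait; the real `Layer` trait has a richer interface.)
pub trait Layer {
    fn on_event(&mut self, message: &str);
}

/// Hypothetical layer that records every event it sees.
#[derive(Debug, Default)]
pub struct RecordingLayer {
    pub seen: Vec<String>,
}

impl Layer for RecordingLayer {
    fn on_event(&mut self, message: &str) {
        self.seen.push(message.to_owned());
    }
}

/// An optional layer is itself a layer: `Some` forwards, `None` does nothing.
impl<L: Layer> Layer for Option<L> {
    fn on_event(&mut self, message: &str) {
        if let Some(inner) = self {
            inner.on_event(message);
        }
    }
}

/// Build the layer only when a non-empty DSN is supplied.
/// Assumption: a missing or blank DSN both mean "reporting disabled".
pub fn sentry_layer(dsn: Option<&str>) -> Option<RecordingLayer> {
    match dsn {
        Some(value) if !value.trim().is_empty() => Some(RecordingLayer::default()),
        _ => None,
    }
}
```

Because the `Option` wrapper implements the same trait, the subscriber setup code stays identical whether or not a DSN is present. Only the value returned by the constructor changes.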

reqwest upgrade (0.12 → 0.13)

Upgraded to share the same reqwest version with sentry 0.47 and avoid pulling in two separate versions. Breaking changes in 0.13 are minimal (TLS defaults changed, feature renames) and don't affect our usage.

TLS behavior: reqwest 0.13 is built with its rustls-no-provider feature, alongside an explicit rustls dependency with the ring feature enabled. Certificate verification uses rustls-platform-verifier (platform-native, equivalent to the old rustls-tls-native-roots behavior). The ring crypto provider avoids pulling in aws-lc-sys, which requires cmake at build time.

Health-only mode

When no predictor is specified (health-only mode), Sentry is still initialized but configure_sentry_scope() is intentionally skipped since there's no model metadata to attach. Infrastructure errors are still captured with default Sentry context (release, environment).
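
The branch described here amounts to a single decision on whether a predictor reference exists. A minimal sketch, with `ScopeSetup` and `plan_scope` as hypothetical names that do not appear in the PR:

```rust
/// How the Sentry scope ends up configured after startup.
/// (Hypothetical type used only to illustrate the branch.)
#[derive(Debug, PartialEq, Eq)]
pub enum ScopeSetup {
    /// A predictor is present, so model metadata is attached to events.
    WithModelMetadata { predictor_ref: String },
    /// Health-only mode: only default context (release, environment) applies.
    DefaultContextOnly,
}

/// Decide the scope setup. Sentry itself is initialized in both cases;
/// only the metadata enrichment step is skipped in health-only mode.
pub fn plan_scope(predictor_ref: Option<&str>) -> ScopeSetup {
    match predictor_ref {
        Some(reference) => ScopeSetup::WithModelMetadata {
            predictor_ref: reference.to_owned(),
        },
        None => ScopeSetup::DefaultContextOnly,
    }
}
```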

Files changed

  • crates/Cargo.toml — Added sentry and rustls workspace deps, upgraded reqwest to 0.13
  • crates/coglet/Cargo.toml — Added rustls.workspace = true
  • crates/coglet/src/lib.rs — Added install_crypto_provider() for ring TLS setup
  • crates/coglet/src/webhook.rs — Added crypto provider init to test setup
  • crates/coglet-python/Cargo.toml — Added sentry.workspace = true
  • crates/coglet-python/src/sentry_integration.rs — New module: init_sentry(), configure_sentry_scope(), sentry_tracing_layer()
  • crates/coglet-python/src/lib.rs — Crypto provider + Sentry init before tracing, layer added to subscriber, scope configured with metadata
  • crates/Cargo.lock — Updated lockfile

Report infrastructure-level errors to Sentry when SENTRY_DSN is set.
Uses the sentry tracing layer to automatically capture ERROR-level
tracing events (setup failures, worker crashes, IPC errors) as Sentry
issues and WARN-level events as breadcrumbs. Zero overhead when no DSN
is configured.

Also upgrades reqwest from 0.12 to 0.13 to share the same version
with the sentry SDK and avoid duplicate dependencies.

markphelps marked this pull request as ready for review March 25, 2026 19:29
markphelps requested a review from a team as a code owner March 25, 2026 19:29
markphelps added this to the 0.18.0 milestone Mar 25, 2026

…add comment for health-only path

- Fix double is_enabled() check in init_sentry() (reviewer issue #3)
- Use ring crypto provider instead of aws-lc-rs to avoid cmake build
  dependency. Add rustls with ring feature explicitly, use reqwest's
  rustls-no-provider feature, and export install_crypto_provider()
  from coglet core (reviewer issue #7)
- Remove sentry's rustls feature (reqwest already configures TLS)
- Add install_crypto_provider() call to serve_impl() and webhook tests
- Add comment explaining intentional omission of configure_sentry_scope
  in health-only path (reviewer issue #6)
- Note: reqwest 0.13 with rustls-platform-verifier uses platform-native
  certificate verification, equivalent to the old rustls-tls-native-roots
  behavior (reviewer issue #4 — confirmed not a regression)