feat(trace sampler): emit sampler.seen/kept metrics #1316

Draft
thieman wants to merge 26 commits into thieman/error-tracking-standalone from thieman/sampler-metrics

thieman commented Apr 6, 2026

Summary

Implements all 6 sampler metrics from datadog-agent/pkg/trace/sampler/metrics.go and rare_sampler.go, part of #1134.

| Metric | Type | Tags |
| --- | --- | --- |
| datadog.trace_agent.sampler.seen | Counter | sampler, target_service, target_env¹, sampling_priority² |
| datadog.trace_agent.sampler.kept | Counter | same as sampler.seen |
| datadog.trace_agent.sampler.size | Gauge | sampler (priority / no_priority / error) |
| datadog.trace_agent.sampler.rare.hits | Counter | none |
| datadog.trace_agent.sampler.rare.misses | Counter | none |
| datadog.trace_agent.sampler.rare.shrinks | Gauge | none |

¹ target_env only emitted for priority, no_priority, rare, and error samplers
² sampling_priority only emitted for the priority sampler
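
The tag rules in the table and footnotes above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `SamplerName`, `metric_tags`, the `Probabilistic` variant, and the numeric rendering of `sampling_priority` are all assumptions made for the example.

```rust
/// Hypothetical sampler identifiers used for the `sampler` tag.
/// `Probabilistic` is an assumed variant standing in for any sampler
/// that does not get a `target_env` tag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SamplerName {
    Priority,
    NoPriority,
    Rare,
    Error,
    Probabilistic,
}

impl SamplerName {
    fn as_str(self) -> &'static str {
        match self {
            SamplerName::Priority => "priority",
            SamplerName::NoPriority => "no_priority",
            SamplerName::Rare => "rare",
            SamplerName::Error => "error",
            SamplerName::Probabilistic => "probabilistic",
        }
    }
}

/// Builds the tag set for sampler.seen / sampler.kept (hypothetical helper).
fn metric_tags(sampler: SamplerName, service: &str, env: &str, priority: Option<i8>) -> Vec<String> {
    let mut tags = vec![
        format!("sampler:{}", sampler.as_str()),
        format!("target_service:{service}"),
    ];
    // Footnote 1: target_env only for priority, no_priority, rare, and error.
    if matches!(
        sampler,
        SamplerName::Priority | SamplerName::NoPriority | SamplerName::Rare | SamplerName::Error
    ) {
        tags.push(format!("target_env:{env}"));
    }
    // Footnote 2: sampling_priority only for the priority sampler.
    // The numeric value is a placeholder; the real tag value format may differ.
    if sampler == SamplerName::Priority {
        if let Some(p) = priority {
            tags.push(format!("sampling_priority:{p}"));
        }
    }
    tags
}
```

Under these assumptions, a rare-sampler observation carries `sampler`, `target_service`, and `target_env`, while only the priority sampler adds `sampling_priority`.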

Implementation details:

  • run_samplers returns a SamplingDecision struct instead of a bare tuple, carrying sampler name and priority for metric recording
  • seen/kept counter handles are lazily registered and cached in a FastHashMap — hot path pays only a hash-map lookup after the first observation of each tag combination
  • rare.hits/rare.misses are incremented inline in RareSampler::sample(); rare.shrinks is updated each time SeenSpans::shrink() fires
  • sampler.size gauges are updated after each transform_buffer batch by polling size() on each score-based sampler
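
The lazy registration described for the seen/kept counters might look something like the sketch below. It uses `std::collections::HashMap` and plain `u64` counters in place of the PR's `FastHashMap` and real metric handles; `MetricsKey`, `Counters`, `SamplerMetrics`, and the `registrations` field are hypothetical names introduced only to make the caching visible.

```rust
use std::collections::HashMap;

/// Hypothetical key: one entry per observed tag combination.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct MetricsKey {
    sampler: &'static str,
    service: String,
    env: String,
    sampling_priority: Option<i8>,
}

/// Stand-in for a pair of registered counter handles.
#[derive(Debug, Default)]
struct Counters {
    seen: u64,
    kept: u64,
}

impl Counters {
    fn observe(&mut self, kept: bool) {
        self.seen += 1;
        if kept {
            self.kept += 1;
        }
    }
}

#[derive(Default)]
struct SamplerMetrics {
    handles: HashMap<MetricsKey, Counters>,
    /// Counts how often the slow registration path ran (for illustration).
    registrations: usize,
}

impl SamplerMetrics {
    fn record(&mut self, key: &MetricsKey, kept: bool) {
        // Hot path: a single lookup once the key has been seen before.
        if let Some(counters) = self.handles.get_mut(key) {
            counters.observe(kept);
            return;
        }
        // Slow path: first observation of this tag combination, so the key
        // is cloned and the counters are "registered" exactly once.
        let mut counters = Counters::default();
        counters.observe(kept);
        self.handles.insert(key.clone(), counters);
        self.registrations += 1;
    }
}
```

The point of the shape is that the key is only cloned on the slow path; repeated observations of the same combination never allocate.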

Test plan

  • All 43 existing trace sampler unit tests pass
  • cargo clippy clean

🤖 Generated with Claude Code

thieman and others added 25 commits April 3, 2026 16:27
Adds a rare sampler that catches traces for span signature combinations
(env, service, name, resource, error type, http status) that are not
seen by the priority sampler, ensuring low-traffic trace shapes remain
represented in sampled data.

- New `RareSampler` with token bucket rate limiting (5 TPS default,
  burst 50) and per-(env,service) shard TTL tracking
- Cardinality-bounded `SeenSpans` with modular-hash shrinking (matches
  Go agent behavior)
- Wired into `run_samplers` ahead of all other samplers in both
  probabilistic and legacy paths
- `record_priority_trace` feedback so priority-sampled signatures don't
  re-trigger as rare within the cooldown window
- Config fields under `apm_config.rare_sampler` (disabled by default)
- 6 unit tests covering: disabled mode, new/repeated signatures, top-
  level vs measured detection, cross-signature independence, rate limit

Closes #1134 (partial — rare sampler only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
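
As a rough illustration of the cardinality-bounded `SeenSpans` with modular-hash shrinking mentioned in the commit above, here is a minimal sketch. It assumes that, once the table exceeds its limit, every signature is folded into `limit / 2` slots by taking it modulo that value; the field names, the `u64` signature type, and the `slot` helper are assumptions for the example, not the PR's actual layout.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical cardinality-bounded table of signature -> expiry.
struct SeenSpans {
    expires: HashMap<u64, Instant>,
    limit: u64,
    shrunk: bool,
    /// Cumulative shrink count (what a `rare.shrinks` gauge could report).
    shrinks: u64,
}

impl SeenSpans {
    fn new(limit: u64) -> Self {
        assert!(limit >= 2, "limit / 2 must be non-zero");
        Self { expires: HashMap::new(), limit, shrunk: false, shrinks: 0 }
    }

    /// After a shrink, signatures are folded into `limit / 2` slots.
    fn slot(&self, sig: u64) -> u64 {
        if self.shrunk { sig % (self.limit / 2) } else { sig }
    }

    fn add(&mut self, sig: u64, expire: Instant) {
        let slot = self.slot(sig);
        self.expires.insert(slot, expire);
        if self.expires.len() as u64 > self.limit {
            self.shrink();
        }
    }

    /// Re-keys every entry modulo `limit / 2`, bounding the table size.
    fn shrink(&mut self) {
        let half = self.limit / 2;
        let old = std::mem::take(&mut self.expires);
        for (sig, expire) in old {
            self.expires.insert(sig % half, expire);
        }
        self.shrunk = true;
        self.shrinks += 1;
    }
}
```

The trade-off in this scheme is that distinct signatures can collide after a shrink, so some rare spans may be treated as already seen in exchange for a hard bound on memory.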
Ports the remaining tests from rare_sampler_test.go and fixes a bug
discovered in the process:

- TTL expiration: verify same signature becomes rare again after cooldown
- record_priority_trace: verify priority-sampled signatures suppress rare
- Cardinality/shrink: verify SeenSpans stays bounded after overflow
- Multiple top-level spans: verify first-expired span gets _dd.rare flag
  and all spans' TTLs are refreshed on keep

Also fixes SeenSpans::add to only apply the TTL_RENEWAL_PERIOD skip when
the stored entry is still live. Previously it could skip updates for
expired entries when TTL < TTL_RENEWAL_PERIOD (only affects tests; in
production TTL=5min always exceeds the 60s renewal period).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
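
The `SeenSpans::add` fix described above can be pictured with a small predicate. This is a sketch under stated assumptions: `should_update` is a hypothetical helper (the real logic lives inside `add`), and the 60-second value mirrors the renewal period quoted in the commit message.

```rust
use std::time::{Duration, Instant};

/// Renewal period quoted in the commit message (60s).
const TTL_RENEWAL_PERIOD: Duration = Duration::from_secs(60);

/// Returns true when the stored expiry should be overwritten with
/// `new_expire`. Hypothetical helper for illustration.
fn should_update(stored: Option<Instant>, new_expire: Instant, now: Instant) -> bool {
    match stored {
        // The skip applies only while the stored entry is still live and
        // the new expiry is less than one renewal period ahead of it.
        Some(old)
            if old > now
                && new_expire.saturating_duration_since(old) < TTL_RENEWAL_PERIOD =>
        {
            false
        }
        // Missing, expired, or sufficiently stale entries are always written.
        _ => true,
    }
}
```

Without the `old > now` check, a test using a TTL shorter than the renewal period would see an expired entry skipped instead of refreshed, which matches the bug the commit describes.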
- `enable_rare_sampler` moves to a top-level `apm_config` field
  (YAML: `apm_config.enable_rare_sampler`, env: `DD_APM_ENABLE_RARE_SAMPLER`)
  rather than nested under `apm_config.rare_sampler`
- Rename `cooldown_period_secs` serde key to `cooldown`
  (YAML: `apm_config.rare_sampler.cooldown`)
- Add KEY_ALIAS `apm_config.enable_rare_sampler` → `apm_enable_rare_sampler`
  so the YAML value is visible under the same flat key the env var sets
- Read `apm_enable_rare_sampler` explicitly in `from_configuration` to
  let the env var override the YAML-deserialized value

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…wrapper

Read the flag at the root level via ApmConfiguration (renamed to
apm_enable_rare_sampler) so KEY_ALIASES handles YAML/env var precedence
generically. Removes the manual try_get_typed override.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing path

Removes `record_priority_trace` which had no equivalent in the Go agent —
priority-sampled traces don't suppress the rare sampler in Go V0. Also
removes the `else` guard in the probabilistic block that was briefly changed
to V1 behavior, keeping the V0 else-if chain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-driven eligibility test

The previous multi-top-level test slept past TTL so r1 was expired when
trace2 was sampled, diverging from TestMultipleTopeLevels in Go where r1
is still within TTL and r2 (new) gets _dd.rare. Rewrites the test to match
Go exactly: no sleep, r2 gets the rare flag, r1 is suppressed by the
refreshed TTL.

Also adds a table-driven span_eligibility test mirroring TestConsideredSpans.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t tests

TokenBucket is a generic rate-limiting primitive with no trace-specific
logic. Moves it to saluki-common::rate so other components can reuse it.
Adds 4 unit tests covering burst exhaustion, time-based refill, capacity
cap, and zero-rate behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
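
For readers unfamiliar with the primitive, a token bucket along the lines described above could be sketched as follows. This is not the `saluki-common::rate` implementation; the constructor, the `allow_at` method, and the explicit `now` parameter (used here so the behavior is deterministic) are assumptions for the example.

```rust
use std::time::Instant;

/// Minimal token bucket sketch with an injectable clock.
struct TokenBucket {
    rate: f64,     // tokens added per second
    capacity: f64, // maximum burst
    tokens: f64,
    last: Instant,
}

impl TokenBucket {
    /// Starts full, so an initial burst of `burst` requests is allowed.
    fn new(rate: f64, burst: u32, now: Instant) -> Self {
        let capacity = f64::from(burst);
        Self { rate, capacity, tokens: capacity, last: now }
    }

    /// Refills based on elapsed time, then tries to take one token.
    fn allow_at(&mut self, now: Instant) -> bool {
        if now > self.last {
            let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
            // Refill is capped at capacity so idle time cannot build an
            // unbounded burst.
            self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
            self.last = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With the rare sampler defaults quoted earlier (5 TPS, burst 50), such a bucket would admit 50 requests immediately and then roughly 5 per second. The four behaviors named in the commit (burst exhaustion, time-based refill, capacity cap, zero rate) each map onto one branch of `allow_at`.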
…pies

- Remove String allocation in handle_trace: use &str directly from
  get_trace_env and reorder record_all_top_level_spans before spans_mut
  so NLL ends the shared borrow before the mutable one is needed
- get_expire returns Option<&Instant> instead of Option<Instant>
- Collapse find_rare_span expiry check to is_none_or
- Use is_some_and in is_top_level_or_measured instead of copied().unwrap_or

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
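
The `Option` combinators named in the commit above can be illustrated with a short sketch. The function bodies, the metric key names `_top_level` and `_dd.measured`, and the strict `<` expiry comparison are assumptions for the example rather than the PR's code; note that `Option::is_none_or` needs Rust 1.82 or newer.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Replaces the `metrics.get(k).copied().unwrap_or(0.0) == 1.0` pattern
/// with `is_some_and`, which avoids the intermediate default value.
fn is_top_level_or_measured(metrics: &HashMap<&str, f64>) -> bool {
    let flag = |key: &str| metrics.get(key).is_some_and(|v| *v == 1.0);
    flag("_top_level") || flag("_dd.measured")
}

/// A signature counts as rare when it has no recorded expiry or the
/// recorded expiry has already passed. Taking `Option<&Instant>` mirrors
/// the borrowed return type of `get_expire`.
fn is_rare(expire: Option<&Instant>, now: Instant) -> bool {
    expire.is_none_or(|e| *e < now)
}
```

Both combinators collapse a "missing or predicate" check into a single expression, which is the refactor the commit describes.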
Adapts Go TestSampling cases from agent_test.go:
- rare catches first occurrence when probabilistic would drop (0%)
- _dd.rare=1 set on kept span
- second occurrence within TTL is dropped
- rare disabled does not catch unsampled traces
- rare catches PriorityAutoDrop in legacy (non-probabilistic) path
- probabilistic 100%/0% coverage with rare disabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements the ETS sampling path from datadog-agent/pkg/trace/agent/agent.go:

- In run_samplers: ETS check runs at the very top (before rare/probabilistic).
  Traces with errors (including exception span events) are routed exclusively
  to the error sampler; traces without errors are dropped immediately.
- In process_trace: dropped ETS traces suppress SSS and analytics events.
- When ETS keeps a trace, sets ets_error=true on TraceSampling so the encoder
  emits _dd.error_tracking_standalone.error="true" as a chunk tag.
- Adds ets_error field to TraceSampling and sampling_mut() accessor to Trace.
- Updates the DD traces encoder to emit the ETS chunk tag when set.

6 new tests: error kept, no-error dropped, SSS suppressed, ets_error flag set,
exception span events treated as errors, ETS disabled uses normal path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
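
The ETS ordering described above (check first, errors routed only to the error sampler, everything else dropped) can be sketched as below. The struct fields, the closure parameters, and the `Option` sampler name are assumptions chosen to keep the example self-contained; the PR's actual `run_samplers` and `SamplingDecision` will differ.

```rust
/// Hypothetical decision type carrying the sampler name for metrics.
#[derive(Debug, PartialEq)]
struct SamplingDecision {
    keep: bool,
    sampler: Option<&'static str>,
    ets_error: bool,
}

/// Hypothetical summary of what the sampler needs to know about a trace.
struct TraceInfo {
    has_error: bool,
    has_exception_event: bool,
}

fn run_samplers(
    ets_enabled: bool,
    trace: &TraceInfo,
    error_sampler: impl FnOnce() -> bool,
    normal_path: impl FnOnce() -> SamplingDecision,
) -> SamplingDecision {
    if ets_enabled {
        // Exception span events count as errors for ETS purposes.
        if !(trace.has_error || trace.has_exception_event) {
            // No error: dropped immediately, no other sampler runs.
            return SamplingDecision { keep: false, sampler: None, ets_error: false };
        }
        // Error traces go exclusively to the error sampler.
        let keep = error_sampler();
        return SamplingDecision { keep, sampler: Some("error"), ets_error: keep };
    }
    // ETS disabled: fall through to rare / probabilistic / priority sampling.
    normal_path()
}
```

In this sketch `ets_error` is only set when the trace is actually kept, matching the description that the encoder emits the chunk tag when ETS keeps a trace.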
Follows the same pattern as enable_rare_sampler:
- YAML: apm_config.error_tracking_standalone.enabled
- Env var: DD_APM_ERROR_TRACKING_STANDALONE

Adds enable_error_tracking_standalone to the ApmConfiguration wrapper
(with rename = "apm_error_tracking_standalone"), a KEY_ALIAS mapping the
nested YAML path to the flat key, and copies the value into ApmConfig in
from_configuration. Removes the now-redundant ErrorTrackingStandaloneConfig
struct. Adds 4 config tests mirroring the rare sampler config tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… per-request allocation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hunk tag from config directly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… in source comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, keep only ETS-added links as permalinks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to avoid per-trace method call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `datadog.trace_agent.sampler.seen` and `datadog.trace_agent.sampler.kept`
counters per (sampler, service, env, sampling_priority) combination, mirroring
`RecordMetricsKey` in datadog-agent/pkg/trace/sampler/metrics.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dd-octo-sts bot added the area/components (Sources, transforms, and destinations.) and transform/trace-sampler (Trace Sampler synchronous transform.) labels Apr 6, 2026

pr-commenter bot commented Apr 6, 2026

Binary Size Analysis (Agent Data Plane)

Target: bc3cc74 (baseline) vs 448ea06 (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 26.35 MiB
Comparison Size: 26.39 MiB
Size Change: +35.58 KiB (+0.13%)
Pass/Fail Threshold: +5%
Result: PASSED ✅

Changes by Module

| Module | File Size | Symbols |
| --- | --- | --- |
| saluki_components::transforms::trace_sampler | +15.28 KiB | 8 |
| core | +6.72 KiB | 81 |
| hashbrown | +6.41 KiB | 4 |
| [sections] | +4.06 KiB | 5 |
| saluki_metrics::builder::MetricsBuilder | +1.96 KiB | 4 |
| serde_core | +643 B | 3 |
| saluki_components::common::datadog | -607 B | 23 |
| saluki_components::encoders::datadog | +575 B | 5 |
| saluki_core::data_model::event | +352 B | 1 |
| http | +148 B | 6 |
| [Unmapped] | -105 B | 1 |
| unicode_segmentation | +88 B | 1 |
| saluki_components::transforms::dogstatsd_prefix_filter | +48 B | 1 |
| aho_corasick | +20 B | 1 |
| saluki_core::topology::blueprint | +9 B | 1 |
| tokio | +8 B | 28 |
| std | -4 B | 15 |
| figment | +3 B | 1 |
| agent_data_plane::cli::run | +2 B | 2 |
| saluki_common::task::instrument | +0 B | 2 |

Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.7% +26.4Ki  +2.1% +22.9Ki    [273 Others]
  [NEW] +25.6Ki  [NEW] +25.5Ki    saluki_components::transforms::trace_sampler::TraceSampler::process_trace::h6a6309045a0d5efa
  [NEW] +17.3Ki  [NEW] +17.2Ki    saluki_components::encoders::datadog::traces::TraceEndpointEncoder::encode_tracer_payload::_{{closure}}::_{{closure}}::h7f60427ea2bbc270
  [NEW] +13.2Ki  [NEW] +13.0Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h56c7675ff9353162
  [NEW] +12.4Ki  [NEW] +12.2Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h516ba3727ce3c3a6
  [NEW] +10.5Ki  [NEW] +10.3Ki    _<saluki_common::task::instrument::InstrumentedTask<F> as core::future::future::Future>::poll::h23e186975d91b998
  [NEW] +9.79Ki  [NEW] +9.67Ki    saluki_components::common::datadog::apm::ApmConfig::from_configuration::h06ca569aa0fe3ec0
  [NEW] +5.71Ki  [NEW] +5.55Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush::_{{closure}}::h2da80e1539595dd3
  [NEW] +5.20Ki  [NEW] +5.05Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h5fabbb1c95dca95f
  [NEW] +3.83Ki  [NEW] +3.71Ki    <webpki::error::Error as core::fmt::Debug>::fmt.11262
  [NEW] +3.34Ki  [NEW] +3.17Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::try_split_request::_{{closure}}::hfd70afba07daed71
  [DEL] -3.34Ki  [DEL] -3.18Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::try_split_request::_{{closure}}::h3d6146a97391a71f
  [DEL] -3.83Ki  [DEL] -3.71Ki    <webpki::error::Error as core::fmt::Debug>::fmt.11256
  [DEL] -3.93Ki  [DEL] -3.78Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::hde9fde945ac088d0
  [DEL] -5.72Ki  [DEL] -5.57Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush::_{{closure}}::he10db712ee5679b7
  [DEL] -10.5Ki  [DEL] -10.3Ki    _<saluki_common::task::instrument::InstrumentedTask<F> as core::future::future::Future>::poll::hfbfad7a42c71c843
  [DEL] -11.0Ki  [DEL] -10.9Ki    saluki_components::common::datadog::apm::ApmConfig::from_configuration::hdcad4928457e0f0c
  [DEL] -12.5Ki  [DEL] -12.4Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::habd4db8865bce350
  [DEL] -13.2Ki  [DEL] -13.1Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::hcd0b9ca9428d1499
  [DEL] -16.3Ki  [DEL] -16.1Ki    _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::hf202555a35ab77d7
  [DEL] -17.3Ki  [DEL] -17.1Ki    saluki_components::encoders::datadog::traces::TraceEndpointEncoder::encode_tracer_payload::_{{closure}}::_{{closure}}::he82b1e6b98565373
  +0.1% +35.6Ki  +0.1% +32.1Ki    TOTAL


pr-commenter bot commented Apr 6, 2026

Regression Detector (Agent Data Plane)

This comment was omitted because it was over 65,536 characters. Please check the GitLab job logs to see its output.

…isses/shrinks)

Implements the 4 missing metrics from datadog-agent/pkg/trace/sampler/:
- datadog.trace_agent.sampler.size (gauge, sampler tag) — current signature-table
  size for priority, no_priority, and error samplers; updated after each batch
- datadog.trace_agent.sampler.rare.hits (counter) — traces kept by rare sampler
- datadog.trace_agent.sampler.rare.misses (counter) — traces rejected by rare sampler
- datadog.trace_agent.sampler.rare.shrinks (gauge) — cumulative shard shrink count

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tobz changed the title from "feat(trace-sampler): emit sampler.seen/kept metrics" to "feat(trace sampler): emit sampler.seen/kept metrics" Apr 7, 2026
thieman force-pushed the thieman/error-tracking-standalone branch from 99425a8 to 2d5cb62 April 9, 2026 14:01