feat(trace sampler): emit sampler.seen/kept metrics #1316
Draft
thieman wants to merge 26 commits into thieman/error-tracking-standalone
Adds a rare sampler that catches traces for span signature combinations (env, service, name, resource, error type, http status) that are not seen by the priority sampler, ensuring low-traffic trace shapes remain represented in sampled data.

- New `RareSampler` with token bucket rate limiting (5 TPS default, burst 50) and per-(env,service) shard TTL tracking
- Cardinality-bounded `SeenSpans` with modular-hash shrinking (matches Go agent behavior)
- Wired into `run_samplers` ahead of all other samplers in both probabilistic and legacy paths
- `record_priority_trace` feedback so priority-sampled signatures don't re-trigger as rare within the cooldown window
- Config fields under `apm_config.rare_sampler` (disabled by default)
- 6 unit tests covering: disabled mode, new/repeated signatures, top-level vs measured detection, cross-signature independence, rate limit

Closes #1134 (partial: rare sampler only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
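For readers unfamiliar with the cardinality bound mentioned in this commit, here is a minimal sketch of one way modular-hash shrinking can work. The struct layout, field names, and the choice of `cardinality / 2` as the modulus are assumptions made for illustration; they are not taken from the PR's `SeenSpans` or from the Go agent source.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical stand-in for `SeenSpans`: maps a span signature to its expiry.
struct SeenSpans {
    expires: HashMap<u64, Instant>,
    /// Upper bound on the number of tracked signatures.
    cardinality: usize,
    /// Once a shrink has happened, signatures are reduced modulo this value.
    shrink_modulus: Option<u64>,
    /// Number of shrinks performed so far.
    shrinks: u64,
}

impl SeenSpans {
    fn new(cardinality: usize) -> Self {
        Self { expires: HashMap::new(), cardinality, shrink_modulus: None, shrinks: 0 }
    }

    /// Map a raw signature to the key actually stored in the table.
    fn key(&self, signature: u64) -> u64 {
        match self.shrink_modulus {
            Some(modulus) => signature % modulus,
            None => signature,
        }
    }

    fn add(&mut self, signature: u64, expire: Instant) {
        let key = self.key(signature);
        self.expires.insert(key, expire);
        if self.expires.len() > self.cardinality {
            self.shrink();
        }
    }

    /// Fold every stored signature into a smaller key space. Colliding
    /// signatures share one entry afterwards; which expiry survives a
    /// collision is arbitrary in this sketch.
    fn shrink(&mut self) {
        let modulus = (self.cardinality / 2).max(1) as u64;
        let old = std::mem::take(&mut self.expires);
        for (signature, expire) in old {
            self.expires.insert(signature % modulus, expire);
        }
        self.shrink_modulus = Some(modulus);
        self.shrinks += 1;
    }
}
```

The trade-off in this sketch is precision for memory: after a shrink, distinct signatures can alias each other, so some genuinely rare shapes may be treated as already seen, but the table can no longer grow past the bound.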
Ports the remaining tests from rare_sampler_test.go and fixes a bug discovered in the process:

- TTL expiration: verify same signature becomes rare again after cooldown
- record_priority_trace: verify priority-sampled signatures suppress rare
- Cardinality/shrink: verify SeenSpans stays bounded after overflow
- Multiple top-level spans: verify first-expired span gets _dd.rare flag and all spans' TTLs are refreshed on keep

Also fixes SeenSpans::add to only apply the TTL_RENEWAL_PERIOD skip when the stored entry is still live. Previously it could skip updates for expired entries when TTL < TTL_RENEWAL_PERIOD (only affects tests; in production TTL=5min always exceeds the 60s renewal period).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
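The `SeenSpans::add` fix described in this commit can be illustrated with a small sketch. The free function, its signature, and the plain `HashMap` are hypothetical simplifications; only the constant name `TTL_RENEWAL_PERIOD`, its 60s value, and the "skip only while the entry is live" rule come from the commit message.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Renewal period from the commit message (60s).
const TTL_RENEWAL_PERIOD: Duration = Duration::from_secs(60);

/// Hypothetical simplification of `SeenSpans::add`.
/// Returns true when the stored expiry was written, false when the write was skipped.
fn add(expires: &mut HashMap<u64, Instant>, signature: u64, now: Instant, ttl: Duration) -> bool {
    let new_expire = now + ttl;
    if let Some(stored) = expires.get(&signature) {
        // The fix: the renewal skip applies only while the stored entry is live.
        let still_live = *stored > now;
        let gain = new_expire.saturating_duration_since(*stored);
        if still_live && gain < TTL_RENEWAL_PERIOD {
            return false;
        }
    }
    expires.insert(signature, new_expire);
    true
}
```

With a 10s TTL, a second `add` one second later is skipped because the entry is live and the gain is small, while an `add` after the entry has expired is written. Without the `still_live` check, that last call would also be skipped, since the gain (11s in this case) is below the renewal period.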
- `enable_rare_sampler` moves to a top-level `apm_config` field (YAML: `apm_config.enable_rare_sampler`, env: `DD_APM_ENABLE_RARE_SAMPLER`) rather than nested under `apm_config.rare_sampler`
- Rename `cooldown_period_secs` serde key to `cooldown` (YAML: `apm_config.rare_sampler.cooldown`)
- Add KEY_ALIAS `apm_config.enable_rare_sampler` → `apm_enable_rare_sampler` so the YAML value is visible under the same flat key the env var sets
- Read `apm_enable_rare_sampler` explicitly in `from_configuration` to let the env var override the YAML-deserialized value

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…wrapper Read the flag at the root level via ApmConfiguration (renamed to apm_enable_rare_sampler) so KEY_ALIASES handles YAML/env var precedence generically. Removes the manual try_get_typed override. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… env var Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing path Removes `record_priority_trace` which had no equivalent in the Go agent — priority-sampled traces don't suppress the rare sampler in Go V0. Also removes the `else` guard in the probabilistic block that was briefly changed to V1 behavior, keeping the V0 else-if chain. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-driven eligibility test The previous multi-top-level test slept past TTL so r1 was expired when trace2 was sampled, diverging from TestMultipleTopeLevels in Go where r1 is still within TTL and r2 (new) gets _dd.rare. Rewrites the test to match Go exactly: no sleep, r2 gets the rare flag, r1 is suppressed by the refreshed TTL. Also adds a table-driven span_eligibility test mirroring TestConsideredSpans. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t tests TokenBucket is a generic rate-limiting primitive with no trace-specific logic. Moves it to saluki-common::rate so other components can reuse it. Adds 4 unit tests covering burst exhaustion, time-based refill, capacity cap, and zero-rate behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
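As a rough illustration of the primitive this commit moves, here is a minimal token bucket sketch. The field names, the `try_acquire` method, and passing `now` explicitly are assumptions for the example, not the actual `saluki-common::rate` API; the 5 TPS / burst 50 figures in the usage note are the rare sampler defaults quoted earlier in this PR.

```rust
use std::time::Instant;

/// Hypothetical token bucket; not the actual `saluki-common::rate` type.
struct TokenBucket {
    /// Tokens added per second.
    rate: f64,
    /// Maximum number of tokens the bucket can hold (the burst size).
    capacity: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts full, so an initial burst of `capacity` is allowed.
    fn new(rate: f64, capacity: f64, now: Instant) -> Self {
        Self { rate, capacity, tokens: capacity, last_refill: now }
    }

    /// Refill based on elapsed time, then try to take one token.
    fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            // Refill is capped at capacity so idle time cannot bank extra burst.
            self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With `TokenBucket::new(5.0, 50.0, t0)`, the first 50 calls at `t0` succeed and the 51st fails; one second later five more succeed. Taking `now` as a parameter keeps the sketch deterministic under test, which is one way to cover the burst exhaustion, refill, capacity cap, and zero-rate cases the commit lists.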
…pies

- Remove String allocation in handle_trace: use &str directly from get_trace_env and reorder record_all_top_level_spans before spans_mut so NLL ends the shared borrow before the mutable one is needed
- get_expire returns Option<&Instant> instead of Option<Instant>
- Collapse find_rare_span expiry check to is_none_or
- Use is_some_and in is_top_level_or_measured instead of copied().unwrap_or

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
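The `Option` idioms named in this commit can be sketched as follows. The function bodies, the plain `HashMap` storage, and the metric keys `_top_level` and `_dd.measured` are assumptions for illustration; only the function names and the use of `is_none_or` / `is_some_and` come from the commit message. `Option::is_none_or` requires Rust 1.82 or later.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Borrow the stored expiry instead of copying it out.
fn get_expire(expires: &HashMap<u64, Instant>, signature: u64) -> Option<&Instant> {
    expires.get(&signature)
}

/// A signature counts as rare when it was never seen or its entry has expired.
fn is_rare(expires: &HashMap<u64, Instant>, signature: u64, now: Instant) -> bool {
    get_expire(expires, signature).is_none_or(|expire| *expire <= now)
}

/// Hypothetical eligibility check; the metric key names are assumed.
fn is_top_level_or_measured(metrics: &HashMap<&str, f64>) -> bool {
    ["_top_level", "_dd.measured"]
        .iter()
        .any(|key| metrics.get(*key).is_some_and(|value| *value == 1.0))
}
```

Both combinators fold the "missing" case into the predicate, which avoids an intermediate `copied().unwrap_or(..)` and keeps the borrow of the map as short as the expression itself.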
Adapts Go TestSampling cases from agent_test.go:

- rare catches first occurrence when probabilistic would drop (0%)
- _dd.rare=1 set on kept span
- second occurrence within TTL is dropped
- rare disabled does not catch unsampled traces
- rare catches PriorityAutoDrop in legacy (non-probabilistic) path
- probabilistic 100%/0% coverage with rare disabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements the ETS sampling path from datadog-agent/pkg/trace/agent/agent.go:

- In run_samplers: ETS check runs at the very top (before rare/probabilistic). Traces with errors (including exception span events) are routed exclusively to the error sampler; traces without errors are dropped immediately.
- In process_trace: dropped ETS traces suppress SSS and analytics events.
- When ETS keeps a trace, sets ets_error=true on TraceSampling so the encoder emits _dd.error_tracking_standalone.error="true" as a chunk tag.
- Adds ets_error field to TraceSampling and sampling_mut() accessor to Trace.
- Updates the DD traces encoder to emit the ETS chunk tag when set.

6 new tests: error kept, no-error dropped, SSS suppressed, ets_error flag set, exception span events treated as errors, ETS disabled uses normal path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
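The routing rule in the first bullet can be summarized with a small sketch. The `Route` enum, `SpanSummary` struct, and `route_trace` function are hypothetical names invented for this example; the real `run_samplers` operates on full traces and returns a sampling decision rather than a route.

```rust
/// Hypothetical outcome of the ETS check at the top of `run_samplers`.
#[derive(Debug, PartialEq)]
enum Route {
    /// ETS enabled and the trace has an error: only the error sampler decides.
    ErrorSamplerOnly,
    /// ETS enabled and no error: drop without consulting any sampler.
    DropImmediately,
    /// ETS disabled: fall through to the rare/probabilistic/priority chain.
    NormalSamplers,
}

/// Hypothetical per-span view carrying just the fields this check needs.
struct SpanSummary {
    error: bool,
    has_exception_event: bool,
}

fn route_trace(ets_enabled: bool, spans: &[SpanSummary]) -> Route {
    if !ets_enabled {
        return Route::NormalSamplers;
    }
    // Exception span events count as errors, per the commit message.
    let has_error = spans.iter().any(|span| span.error || span.has_exception_event);
    if has_error {
        Route::ErrorSamplerOnly
    } else {
        Route::DropImmediately
    }
}
```

Running the check before every other sampler means a trace with no errors never reaches the rare or probabilistic samplers while ETS is on, which matches the "routed exclusively" wording above.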
Follows the same pattern as enable_rare_sampler:

- YAML: apm_config.error_tracking_standalone.enabled
- Env var: DD_APM_ERROR_TRACKING_STANDALONE

Adds enable_error_tracking_standalone to the ApmConfiguration wrapper (with rename = "apm_error_tracking_standalone"), a KEY_ALIAS mapping the nested YAML path to the flat key, and copies the value into ApmConfig in from_configuration. Removes the now-redundant ErrorTrackingStandaloneConfig struct. Adds 4 config tests mirroring the rare sampler config tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… per-request allocation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hunk tag from config directly Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… in source comments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, keep only ETS-added links as permalinks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to avoid per-trace method call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `datadog.trace_agent.sampler.seen` and `datadog.trace_agent.sampler.kept` counters per (sampler, service, env, sampling_priority) combination, mirroring `RecordMetricsKey` in datadog-agent/pkg/trace/sampler/metrics.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
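A minimal sketch of per-tag-combination counters, assuming a plain `std::collections::HashMap` and integer counters in place of real metric handles. `MetricsKey`, `SamplerCounters`, `SamplerMetrics`, and the `registrations` field are hypothetical names for illustration; they are not the PR's types or the Go agent's `RecordMetricsKey`.

```rust
use std::collections::HashMap;

/// Hypothetical tag combination identifying one seen/kept counter pair.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct MetricsKey {
    sampler: &'static str,
    service: String,
    env: String,
    /// Assumed optional here, for samplers that do not report a priority.
    sampling_priority: Option<i32>,
}

/// Stand-ins for real counter handles.
#[derive(Default)]
struct SamplerCounters {
    seen: u64,
    kept: u64,
}

#[derive(Default)]
struct SamplerMetrics {
    handles: HashMap<MetricsKey, SamplerCounters>,
    /// Counts how many times a new counter pair had to be created.
    registrations: usize,
}

impl SamplerMetrics {
    fn record(&mut self, key: MetricsKey, kept: bool) {
        let registrations = &mut self.registrations;
        // Create the counter pair on first sight of this tag combination,
        // then reuse the cached entry on every later call.
        let counters = self.handles.entry(key).or_insert_with(|| {
            *registrations += 1;
            SamplerCounters::default()
        });
        counters.seen += 1;
        if kept {
            counters.kept += 1;
        }
    }
}
```

Every trace increments `seen` and only kept traces increment `kept`, so the ratio of the two gives the effective keep rate per tag combination. Note that this sketch builds an owned key on each call; a production version would likely avoid that allocation on the hot path.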
Binary Size Analysis (Agent Data Plane)

Target: bc3cc74 (baseline) vs 448ea06 (comparison)

| Module | File Size | Symbols |
|---|---|---|
| saluki_components::transforms::trace_sampler | +15.28 KiB | 8 |
| core | +6.72 KiB | 81 |
| hashbrown | +6.41 KiB | 4 |
| [sections] | +4.06 KiB | 5 |
| saluki_metrics::builder::MetricsBuilder | +1.96 KiB | 4 |
| serde_core | +643 B | 3 |
| saluki_components::common::datadog | -607 B | 23 |
| saluki_components::encoders::datadog | +575 B | 5 |
| saluki_core::data_model::event | +352 B | 1 |
| http | +148 B | 6 |
| [Unmapped] | -105 B | 1 |
| unicode_segmentation | +88 B | 1 |
| saluki_components::transforms::dogstatsd_prefix_filter | +48 B | 1 |
| aho_corasick | +20 B | 1 |
| saluki_core::topology::blueprint | +9 B | 1 |
| tokio | +8 B | 28 |
| std | -4 B | 15 |
| figment | +3 B | 1 |
| agent_data_plane::cli::run | +2 B | 2 |
| saluki_common::task::instrument | +0 B | 2 |

Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
+1.7% +26.4Ki +2.1% +22.9Ki [273 Others]
[NEW] +25.6Ki [NEW] +25.5Ki saluki_components::transforms::trace_sampler::TraceSampler::process_trace::h6a6309045a0d5efa
[NEW] +17.3Ki [NEW] +17.2Ki saluki_components::encoders::datadog::traces::TraceEndpointEncoder::encode_tracer_payload::_{{closure}}::_{{closure}}::h7f60427ea2bbc270
[NEW] +13.2Ki [NEW] +13.0Ki saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h56c7675ff9353162
[NEW] +12.4Ki [NEW] +12.2Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h516ba3727ce3c3a6
[NEW] +10.5Ki [NEW] +10.3Ki _<saluki_common::task::instrument::InstrumentedTask<F> as core::future::future::Future>::poll::h23e186975d91b998
[NEW] +9.79Ki [NEW] +9.67Ki saluki_components::common::datadog::apm::ApmConfig::from_configuration::h06ca569aa0fe3ec0
[NEW] +5.71Ki [NEW] +5.55Ki saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush::_{{closure}}::h2da80e1539595dd3
[NEW] +5.20Ki [NEW] +5.05Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h5fabbb1c95dca95f
[NEW] +3.83Ki [NEW] +3.71Ki <webpki::error::Error as core::fmt::Debug>::fmt.11262
[NEW] +3.34Ki [NEW] +3.17Ki saluki_components::common::datadog::request_builder::RequestBuilder<E>::try_split_request::_{{closure}}::hfd70afba07daed71
[DEL] -3.34Ki [DEL] -3.18Ki saluki_components::common::datadog::request_builder::RequestBuilder<E>::try_split_request::_{{closure}}::h3d6146a97391a71f
[DEL] -3.83Ki [DEL] -3.71Ki <webpki::error::Error as core::fmt::Debug>::fmt.11256
[DEL] -3.93Ki [DEL] -3.78Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::hde9fde945ac088d0
[DEL] -5.72Ki [DEL] -5.57Ki saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush::_{{closure}}::he10db712ee5679b7
[DEL] -10.5Ki [DEL] -10.3Ki _<saluki_common::task::instrument::InstrumentedTask<F> as core::future::future::Future>::poll::hfbfad7a42c71c843
[DEL] -11.0Ki [DEL] -10.9Ki saluki_components::common::datadog::apm::ApmConfig::from_configuration::hdcad4928457e0f0c
[DEL] -12.5Ki [DEL] -12.4Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::habd4db8865bce350
[DEL] -13.2Ki [DEL] -13.1Ki saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::hcd0b9ca9428d1499
[DEL] -16.3Ki [DEL] -16.1Ki _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::hf202555a35ab77d7
[DEL] -17.3Ki [DEL] -17.1Ki saluki_components::encoders::datadog::traces::TraceEndpointEncoder::encode_tracer_payload::_{{closure}}::_{{closure}}::he82b1e6b98565373
+0.1% +35.6Ki +0.1% +32.1Ki TOTAL
Regression Detector (Agent Data Plane)

This comment was omitted because it was over 65,536 characters. Please check the Gitlab Job logs to see its output.
…isses/shrinks)

Implements the 4 missing metrics from datadog-agent/pkg/trace/sampler/:

- datadog.trace_agent.sampler.size (gauge, sampler tag): current signature-table size for priority, no_priority, and error samplers; updated after each batch
- datadog.trace_agent.sampler.rare.hits (counter): traces kept by rare sampler
- datadog.trace_agent.sampler.rare.misses (counter): traces rejected by rare sampler
- datadog.trace_agent.sampler.rare.shrinks (gauge): cumulative shard shrink count

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
99425a8 to 2d5cb62
Summary

Implements all 6 sampler metrics from `datadog-agent/pkg/trace/sampler/metrics.go` and `rare_sampler.go`, part of #1134.

| Metric | Tags |
|---|---|
| `datadog.trace_agent.sampler.seen` | `sampler`, `target_service`, `target_env`¹, `sampling_priority`² |
| `datadog.trace_agent.sampler.kept` | `sampler`, `target_service`, `target_env`¹, `sampling_priority`² |
| `datadog.trace_agent.sampler.size` | `sampler` (priority / no_priority / error) |
| `datadog.trace_agent.sampler.rare.hits` | |
| `datadog.trace_agent.sampler.rare.misses` | |
| `datadog.trace_agent.sampler.rare.shrinks` | |

¹ `target_env` only emitted for priority, no_priority, rare, and error samplers

² `sampling_priority` only emitted for the priority sampler

Implementation details:

- `run_samplers` returns a `SamplingDecision` struct instead of a bare tuple, carrying sampler name and priority for metric recording
- `seen`/`kept` counter handles are lazily registered and cached in a `FastHashMap`; the hot path pays only a hash-map lookup after the first observation of each tag combination
- `rare.hits`/`rare.misses` are incremented inline in `RareSampler::sample()`; `rare.shrinks` is updated each time `SeenSpans::shrink()` fires
- `sampler.size` gauges are updated after each `transform_buffer` batch by polling `size()` on each score-based sampler

Test plan

- `cargo clippy` clean

🤖 Generated with Claude Code