feat(trace sampler): implement rare sampler#1311
feat(trace sampler): implement rare sampler#1311gh-worker-dd-mergequeue-cf854d[bot] merged 21 commits intomainfrom
Conversation
Binary Size Analysis (Agent Data Plane)Target: ffeb787 (baseline) vs c12651b (comparison) diff
|
| Module | File Size | Symbols |
|---|---|---|
core |
+6.96 KiB | 20 |
saluki_components::transforms::trace_sampler |
+6.65 KiB | 7 |
hashbrown |
+4.46 KiB | 3 |
[Unmapped] |
+3.26 KiB | 1 |
saluki_components::common::datadog |
-1.34 KiB | 5 |
serde_core |
+796 B | 10 |
saluki_core::data_model::event |
+269 B | 1 |
[sections] |
-133 B | 6 |
agent_data_plane::cli::run |
+116 B | 3 |
unicode_segmentation |
+104 B | 1 |
saluki_components::transforms::dogstatsd_prefix_filter |
+48 B | 1 |
saluki_core::topology::blueprint |
+9 B | 1 |
aho_corasick |
+8 B | 1 |
figment |
-6 B | 1 |
std |
+4 B | 9 |
serde_yaml |
-4 B | 1 |
agent_data_plane::components::tag_filterlist |
+1 B | 1 |
tonic |
+0 B | 2 |
crossbeam_channel |
+0 B | 6 |
tracing_subscriber |
+0 B | 4 |
Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
[NEW] +22.0Ki [NEW] +21.8Ki saluki_components::transforms::trace_sampler::TraceSampler::process_trace::haae362d51dae69a0
[NEW] +13.6Ki [NEW] +13.4Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h1d9cfb93998519c7
[NEW] +9.63Ki [NEW] +9.51Ki saluki_components::common::datadog::apm::ApmConfig::from_configuration::h8e84793e7f99a6e6
[NEW] +5.20Ki [NEW] +5.05Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::had646bb5b345f1a8
+0.3% +3.74Ki +0.3% +3.03Ki [108 Others]
+46% +3.26Ki [ = ] 0 [Unmapped]
[NEW] +1.88Ki [NEW] +1.78Ki hashbrown::raw::RawTable<T,A>::reserve_rehash::hd0ce83933bc8fbac
[NEW] +1.79Ki [NEW] +1.65Ki <tonic::transport::channel::endpoint::Endpoint as core::clone::Clone>::clone.12982
[NEW] +1.79Ki [NEW] +1.69Ki hashbrown::raw::RawTable<T,A>::reserve_rehash::h30b016c18f118cb7
[NEW] +1.10Ki [NEW] +1.00Ki crossbeam_channel::waker::SyncWaker::notify.13122
[NEW] +1.04Ki [NEW] +913 <tracing_subscriber::filter::env::directive::Directive as core::clone::Clone>::clone.13350
[NEW] +878 [NEW] +673 _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::h2e85609667304799
[NEW] +812 [NEW] +713 hashbrown::map::HashMap<K,V,S,A>::remove::hbf94225925c826fd
[DEL] -772 [DEL] -671 crossbeam_channel::waker::SyncWaker::unregister.13124
-15.1% -896 -15.1% -896 [section .rodata]
[DEL] -1.04Ki [DEL] -913 <tracing_subscriber::filter::env::directive::Directive as core::clone::Clone>::clone.13348
[DEL] -1.10Ki [DEL] -1.00Ki crossbeam_channel::waker::SyncWaker::notify.13120
[DEL] -1.79Ki [DEL] -1.65Ki <tonic::transport::channel::endpoint::Endpoint as core::clone::Clone>::clone.12980
[DEL] -11.0Ki [DEL] -10.9Ki saluki_components::common::datadog::apm::ApmConfig::from_configuration::hf8c47abaacda9d53
[DEL] -12.5Ki [DEL] -12.4Ki _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::ha43058272b282e7a
[DEL] -16.3Ki [DEL] -16.1Ki _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::h999b6ca59cc02d7a
+0.1% +21.2Ki +0.1% +16.7Ki TOTAL
Regression Detector (Agent Data Plane)This comment was omitted because it was over 65,536 characters.Please check the Gitlab Job logs to see its output. |
|
|
||
| const fn default_rare_sampler_cardinality() -> usize { | ||
| 200 | ||
| } |
There was a problem hiding this comment.
Confirmed these against https://github.com/DataDog/datadog-agent/blob/main/pkg/trace/config/config.go#L615-L617
|
|
||
| #[serde(default = "default_rare_sampler_cardinality")] | ||
| cardinality: usize, | ||
| } |
There was a problem hiding this comment.
| ("proxy.http", "proxy_http"), | ||
| ("proxy.https", "proxy_https"), | ||
| ("proxy.no_proxy", "proxy_no_proxy"), | ||
| ("apm_config.enable_rare_sampler", "apm_enable_rare_sampler"), |
There was a problem hiding this comment.
| self.apm_config.rare_sampler_enabled(), | ||
| self.apm_config.rare_sampler_tps(), | ||
| std::time::Duration::from_secs_f64(self.apm_config.rare_sampler_cooldown_period_secs()), | ||
| self.apm_config.rare_sampler_cardinality(), |
There was a problem hiding this comment.
What is our expected behavior when RC eventually changes our config at runtime? Do we end up restarting the entire process to reset things, or do we need to respond to those config changes here somehow?
|
I figured it made sense to add |
andrewqian2001datadog
left a comment
There was a problem hiding this comment.
For this code
I think we want to add a rare check?
In the agent, there is a rare check before the no priority "legacy" path.
Added a test to show that the we decide to drop even when rare matches
/// Demonstrates the current ADP/agent divergence for OTLP traces without a sampling priority.
///
/// The Datadog Agent still honors the rare decision for this shape, but ADP currently routes
/// OTLP no-priority traces through the upstream OTLP sampling branch and drops them when that
/// rate check fails.
#[test]
fn rare_sampler_otlp_no_priority_differs_from_agent() {
let mut sampler = create_sampler_with_rare_enabled();
sampler.probabilistic_sampler_enabled = false;
sampler.error_sampling_enabled = false;
sampler.otlp_sampling_rate = 0.0;
let mut meta = saluki_common::collections::FastHashMap::default();
meta.insert(
MetaString::from_static(OTEL_TRACE_ID_META_KEY),
MetaString::from("00000000000000000000000000000001"),
);
let span = create_top_level_span(888, 1).with_meta(meta);
let mut trace = create_test_trace(vec![span]);
let (keep, priority, decision_maker, root_idx) = sampler.run_samplers(&mut trace);
assert!(!keep, "ADP currently drops this OTLP trace even though rare matched");
assert_eq!(priority, PRIORITY_AUTO_DROP);
assert_eq!(decision_maker, "");
assert_eq!(
trace.spans()[root_idx.unwrap()]
.metrics()
.get(rare_sampler::RARE_KEY)
.copied(),
Some(1.0),
"the rare sampler ran, but its keep decision was not honored"
);
}
| } | ||
|
|
||
| /// Returns `true` if the span is top-level (`_top_level = 1`) or measured (`_dd.measured = 1`). | ||
| fn is_top_level_or_measured(span: &Span) -> bool { |
There was a problem hiding this comment.
[Claude Sonnet 4.6] Good point. Fixed in efd14f6 — is_top_level_or_measured now checks _dd.top_level in addition to _top_level, mirroring HasTopLevel in the Go agent's traceutil package. We don't have an equivalent of UpdateTracerTopLevel / ComputeTopLevel in ADP's ingest pipeline, so checking both keys directly in the sampler is the right approach for now. Also added tracer-top-level to the span_eligibility test case table.
|
[Claude Sonnet 4.6] Good catch — fixed in 7457f0b. Added the rare check at the top of the OTLP branch before the rate-based sampling, matching the agent's behavior at the linked lines. Also adapted the test from documenting the divergence to asserting the correct behavior (keep + |
Adds a rare sampler that catches traces for span signature combinations (env, service, name, resource, error type, http status) that are not seen by the priority sampler, ensuring low-traffic trace shapes remain represented in sampled data. - New `RareSampler` with token bucket rate limiting (5 TPS default, burst 50) and per-(env,service) shard TTL tracking - Cardinality-bounded `SeenSpans` with modular-hash shrinking (matches Go agent behavior) - Wired into `run_samplers` ahead of all other samplers in both probabilistic and legacy paths - `record_priority_trace` feedback so priority-sampled signatures don't re-trigger as rare within the cooldown window - Config fields under `apm_config.rare_sampler` (disabled by default) - 6 unit tests covering: disabled mode, new/repeated signatures, top- level vs measured detection, cross-signature independence, rate limit Closes #1134 (partial — rare sampler only) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ports the remaining tests from rare_sampler_test.go and fixes a bug discovered in the process: - TTL expiration: verify same signature becomes rare again after cooldown - record_priority_trace: verify priority-sampled signatures suppress rare - Cardinality/shrink: verify SeenSpans stays bounded after overflow - Multiple top-level spans: verify first-expired span gets _dd.rare flag and all spans' TTLs are refreshed on keep Also fixes SeenSpans::add to only apply the TTL_RENEWAL_PERIOD skip when the stored entry is still live. Previously it could skip updates for expired entries when TTL < TTL_RENEWAL_PERIOD (only affects tests; in production TTL=5min always exceeds the 60s renewal period). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `enable_rare_sampler` moves to a top-level `apm_config` field (YAML: `apm_config.enable_rare_sampler`, env: `DD_APM_ENABLE_RARE_SAMPLER`) rather than nested under `apm_config.rare_sampler` - Rename `cooldown_period_secs` serde key to `cooldown` (YAML: `apm_config.rare_sampler.cooldown`) - Add KEY_ALIAS `apm_config.enable_rare_sampler` → `apm_enable_rare_sampler` so the YAML value is visible under the same flat key the env var sets - Read `apm_enable_rare_sampler` explicitly in `from_configuration` to let the env var override the YAML-deserialized value Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…wrapper Read the flag at the root level via ApmConfiguration (renamed to apm_enable_rare_sampler) so KEY_ALIASES handles YAML/env var precedence generically. Removes the manual try_get_typed override. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ing path Removes `record_priority_trace` which had no equivalent in the Go agent — priority-sampled traces don't suppress the rare sampler in Go V0. Also removes the `else` guard in the probabilistic block that was briefly changed to V1 behavior, keeping the V0 else-if chain. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-driven eligibility test The previous multi-top-level test slept past TTL so r1 was expired when trace2 was sampled, diverging from TestMultipleTopeLevels in Go where r1 is still within TTL and r2 (new) gets _dd.rare. Rewrites the test to match Go exactly: no sleep, r2 gets the rare flag, r1 is suppressed by the refreshed TTL. Also adds a table-driven span_eligibility test mirroring TestConsideredSpans. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t tests TokenBucket is a generic rate-limiting primitive with no trace-specific logic. Moves it to saluki-common::rate so other components can reuse it. Adds 4 unit tests covering burst exhaustion, time-based refill, capacity cap, and zero-rate behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pies - Remove String allocation in handle_trace: use &str directly from get_trace_env and reorder record_all_top_level_spans before spans_mut so NLL ends the shared borrow before the mutable one is needed - get_expire returns Option<&Instant> instead of Option<Instant> - Collapse find_rare_span expiry check to is_none_or - Use is_some_and in is_top_level_or_measured instead of copied().unwrap_or Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adapts Go TestSampling cases from agent_test.go: - rare catches first occurrence when probabilistic would drop (0%) - _dd.rare=1 set on kept span - second occurrence within TTL is dropped - rare disabled does not catch unsampled traces - rare catches PriorityAutoDrop in legacy (non-probabilistic) path - probabilistic 100%/0% coverage with rare disabled Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OTLP no-priority branch was missing the rare check that the Go agent performs before falling through to rate-based sampling. OTLP traces with no sampling priority and a matching rare signature were incorrectly dropped even when the rare sampler kept them. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_top_level Modern Datadog tracers set _dd.top_level on spans they consider top-level rather than (or in addition to) the legacy _top_level key. The Go agent normalizes _dd.top_level → _top_level via UpdateTracerTopLevel when ClientComputedTopLevel is true, but ADP has no equivalent normalization pass. Mirror the agent's HasTopLevel check by testing both keys directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
efd14f6 to
af78b1a
Compare
webern
left a comment
There was a problem hiding this comment.
Assisted code review. Looks good.
It may have identified some testing gaps if they are of interest:
Go agent_test.go TestSampling — the interesting gaps are in the integration-level interactions. Three scenarios from Go aren't covered in the PR:
- Manual drop short-circuits rare — Go's priority < 0 check fires before rare. A UserDrop(-1) trace should be dropped even if rare=true. The PR has no test for this. Relevant because the Rust code gates rare behind priority < PRIORITY_AUTO_DROP.
- Rare wins over probabilistic when both would keep — Go's probabilistic-rare-100 tests that when probabilistic is 100% and rare is enabled, the trace is kept via the rare path (samplerName=rare, no _dd.p.dm tag). The PR's rare_sampler_catches_unsampled_trace only tests the case where probabilistic is 0%. A test at 100% would verify decision_maker stays empty when rare takes precedence over probabilistic.
- Rare + error trace interaction — Go tests (error-sampled-prio-unsampled, error-sampled) show that rare catches error traces before the error sampler runs. The PR has no test where contains_error=true and rare is enabled — verifying the else-if chain means error sampler is never consulted when rare fires.
All three are run_samplers-level interaction tests (they'd go in mod.rs tests).
| } | ||
|
|
||
| if rare { | ||
| return (true, PRIORITY_AUTO_KEEP, "", Some(root_span_idx)); |
There was a problem hiding this comment.
(Claude Opus 4.6) Minor divergence from Go V0: In the Go legacy path (agent.go:1129-1131), if rare { return true, true } does not modify the trace chunk's priority — it preserves whatever the tracer originally set (could be AutoDrop=0, AutoKeep=1, or UserKeep=2). Here, we explicitly return PRIORITY_AUTO_KEEP, which means a UserKeep(2) trace caught by rare gets downgraded to AutoKeep(1).
The no-priority path (line 377) and OTLP path (line 359) already follow this same pattern, so this is consistent within the Rust codebase. Just flagging the Go delta — could matter if downstream consumers distinguish UserKeep from AutoKeep.
| // Run the rare sampler early, before all other samplers. This mirrors the Go agent behavior | ||
| // where the rare sampler runs first to catch traces that would otherwise be dropped entirely. | ||
| // logic taken from: https://github.com/DataDog/datadog-agent/blob/main/pkg/trace/agent/agent.go#L1078 | ||
| let rare = self.rare_sampler.sample(trace, root_span_idx); |
There was a problem hiding this comment.
(Claude Opus 4.6) Pre-existing issue (not introduced by this PR), but worth noting now that the rare sampler amplifies its impact: In Go V0 (agent.go:1070-1077), ErrorTrackingStandalone is checked before the rare sampler call, gating all samplers from running on non-error traces. In Rust, run_samplers has no ETS guard — the rare sampler runs unconditionally, and the ETS check at process_trace:454 only fires on the keep=false fallback path. This means with ETS enabled, the rare sampler (and other samplers) can keep non-error traces that Go would drop, violating the ETS contract.
…ns in legacy path The Go agent returns the trace chunk's existing priority unchanged when rare wins in the priority path. We were always returning PRIORITY_AUTO_KEEP, downgrading UserKeep(2) traces to AutoKeep(1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
[Claude Sonnet 4.6] Re: webern's first comment (priority downgrade) — fixed in 750a672. The rare branch in the priority path now returns Re: the ETS concern — noted on #1314 for follow-up there. |
… in review Covers gaps identified against Go agent_test.go TestSampling: - manual drop short-circuits before rare (UserDrop=-1 drops even if rare matches) - rare wins over probabilistic at 100% rate (_dd.p.dm not set when rare fires first) - rare catches error traces before error sampler is consulted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
[Claude Sonnet 4.6] Re: webern's test gap review — added all three in b9fb81e:
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
/merge |
|
View all feedbacks in Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
The expected merge time in
|
5b862ff
into
main
## Summary Implements the rare sampler from `datadog-agent/pkg/trace/sampler/rare_sampler.go` matching the Go V0 behavior (`runSamplers`). Ensures low-traffic trace shapes that would otherwise be dropped by probabilistic or priority sampling remain visible. ## Changes - **`rare_sampler.rs`** — new module with `RareSampler`, a per-(env,service)-shard `SeenSpans` with cardinality-bounded TTL tracking and modular-hash shrinking on overflow - **`signature.rs`** — expose `span_hash_for_rare` helper (`compute_span_hash` with `env=""`, `with_resource=true`) - **`apm.rs`** — add `RareSamplerConfig` with `tps`, `cooldown`, `cardinality` fields; `enable_rare_sampler` lives at the `ApmConfiguration` wrapper level to support env var remapping (`APM_ENABLE_RARE_SAMPLER`); disabled by default - **`mod.rs`** — wire rare sampler into `run_samplers` before all other samplers in both probabilistic and legacy (priority) paths - **`saluki-common/src/rate.rs`** — new `TokenBucket` (tokens-per-second refill, burst capacity) extracted as a reusable primitive ## Behavioral notes - Rare sampler runs before all other samplers on every trace - In the probabilistic path: rare short-circuits probabilistic/error sampling (else-if chain matching Go V0) - In the legacy path: rare is checked after manual-drop but before priority/no-priority samplers - When rare wins: `_dd.rare = 1` set on the first rare span; no `_dd.p.dm` set (matches Go) - Default: disabled (`apm_config.enable_rare_sampler: false`), 5 TPS, 300s cooldown, 200 cardinality per shard ## Test plan - [ ] `rare_sampler.rs`: 11 unit tests covering disabled mode, new signature kept, TTL drop, expiry re-sample, measured spans, non-top-level ignored, multi-top-level flag placement, rate limiting, cardinality shrink, span eligibility - [ ] `saluki-common/src/rate.rs`: 4 unit tests for `TokenBucket` - [ ] `apm.rs`: 5 config tests (disabled by default, enable via YAML, enable via env var, env var overrides YAML, tuning via YAML) - [ ] `mod.rs`: 7 interaction tests (rare catches 0% probabilistic drop, `_dd.rare=1` set, no resample within TTL, rare disabled drops, legacy path PriorityAutoDrop caught, 100%/0% probabilistic baseline) Partial fix for #1134 (rare sampler only; error tracking standalone and priority/no-priority sampler gaps remain). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: travis.thieman <travis.thieman@datadoghq.com> 5b862ff
Summary
Implements the rare sampler from
datadog-agent/pkg/trace/sampler/rare_sampler.gomatching the Go V0 behavior (runSamplers). Ensures low-traffic trace shapes that would otherwise be dropped by probabilistic or priority sampling remain visible.Changes
rare_sampler.rs— new module withRareSampler, a per-(env,service)-shardSeenSpanswith cardinality-bounded TTL tracking and modular-hash shrinking on overflowsignature.rs— exposespan_hash_for_rarehelper (compute_span_hashwithenv="",with_resource=true)apm.rs— addRareSamplerConfigwithtps,cooldown,cardinalityfields;enable_rare_samplerlives at theApmConfigurationwrapper level to support env var remapping (APM_ENABLE_RARE_SAMPLER); disabled by defaultmod.rs— wire rare sampler intorun_samplersbefore all other samplers in both probabilistic and legacy (priority) pathssaluki-common/src/rate.rs— newTokenBucket(tokens-per-second refill, burst capacity) extracted as a reusable primitiveBehavioral notes
_dd.rare = 1set on the first rare span; no_dd.p.dmset (matches Go)apm_config.enable_rare_sampler: false), 5 TPS, 300s cooldown, 200 cardinality per shardTest plan
rare_sampler.rs: 11 unit tests covering disabled mode, new signature kept, TTL drop, expiry re-sample, measured spans, non-top-level ignored, multi-top-level flag placement, rate limiting, cardinality shrink, span eligibilitysaluki-common/src/rate.rs: 4 unit tests forTokenBucketapm.rs: 5 config tests (disabled by default, enable via YAML, enable via env var, env var overrides YAML, tuning via YAML)mod.rs: 7 interaction tests (rare catches 0% probabilistic drop,_dd.rare=1set, no resample within TTL, rare disabled drops, legacy path PriorityAutoDrop caught, 100%/0% probabilistic baseline)Partial fix for #1134 (rare sampler only; error tracking standalone and priority/no-priority sampler gaps remain).
🤖 Generated with Claude Code