Skip to content

feat(trace sampler): implement rare sampler#1311

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 21 commits intomainfrom
thieman/rare-sampler
Apr 8, 2026
Merged

feat(trace sampler): implement rare sampler#1311
gh-worker-dd-mergequeue-cf854d[bot] merged 21 commits intomainfrom
thieman/rare-sampler

Conversation

@thieman
Copy link
Copy Markdown
Contributor

@thieman thieman commented Apr 3, 2026

Summary

Implements the rare sampler from datadog-agent/pkg/trace/sampler/rare_sampler.go matching the Go V0 behavior (runSamplers). Ensures low-traffic trace shapes that would otherwise be dropped by probabilistic or priority sampling remain visible.

Changes

  • rare_sampler.rs — new module with RareSampler, a per-(env,service)-shard SeenSpans with cardinality-bounded TTL tracking and modular-hash shrinking on overflow
  • signature.rs — expose span_hash_for_rare helper (compute_span_hash with env="", with_resource=true)
  • apm.rs — add RareSamplerConfig with tps, cooldown, cardinality fields; enable_rare_sampler lives at the ApmConfiguration wrapper level to support env var remapping (APM_ENABLE_RARE_SAMPLER); disabled by default
  • mod.rs — wire rare sampler into run_samplers before all other samplers in both probabilistic and legacy (priority) paths
  • saluki-common/src/rate.rs — new TokenBucket (tokens-per-second refill, burst capacity) extracted as a reusable primitive

Behavioral notes

  • Rare sampler runs before all other samplers on every trace
  • In the probabilistic path: rare short-circuits probabilistic/error sampling (else-if chain matching Go V0)
  • In the legacy path: rare is checked after manual-drop but before priority/no-priority samplers
  • When rare wins: _dd.rare = 1 set on the first rare span; no _dd.p.dm set (matches Go)
  • Default: disabled (apm_config.enable_rare_sampler: false), 5 TPS, 300s cooldown, 200 cardinality per shard

Test plan

  • rare_sampler.rs: 11 unit tests covering disabled mode, new signature kept, TTL drop, expiry re-sample, measured spans, non-top-level ignored, multi-top-level flag placement, rate limiting, cardinality shrink, span eligibility
  • saluki-common/src/rate.rs: 4 unit tests for TokenBucket
  • apm.rs: 5 config tests (disabled by default, enable via YAML, enable via env var, env var overrides YAML, tuning via YAML)
  • mod.rs: 7 interaction tests (rare catches 0% probabilistic drop, _dd.rare=1 set, no resample within TTL, rare disabled drops, legacy path PriorityAutoDrop caught, 100%/0% probabilistic baseline)

Partial fix for #1134 (rare sampler only; error tracking standalone and priority/no-priority sampler gaps remain).

🤖 Generated with Claude Code

@dd-octo-sts dd-octo-sts bot added area/components Sources, transforms, and destinations. transform/trace-sampler Trace Sampler synchronous transform. labels Apr 3, 2026
@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented Apr 3, 2026

Binary Size Analysis (Agent Data Plane)

Target: ffeb787 (baseline) vs c12651b (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 26.46 MiB
Comparison Size: 26.48 MiB
Size Change: +21.16 KiB (+0.08%)
Pass/Fail Threshold: +5%
Result: PASSED ✅

Changes by Module

Module File Size Symbols
core +6.96 KiB 20
saluki_components::transforms::trace_sampler +6.65 KiB 7
hashbrown +4.46 KiB 3
[Unmapped] +3.26 KiB 1
saluki_components::common::datadog -1.34 KiB 5
serde_core +796 B 10
saluki_core::data_model::event +269 B 1
[sections] -133 B 6
agent_data_plane::cli::run +116 B 3
unicode_segmentation +104 B 1
saluki_components::transforms::dogstatsd_prefix_filter +48 B 1
saluki_core::topology::blueprint +9 B 1
aho_corasick +8 B 1
figment -6 B 1
std +4 B 9
serde_yaml -4 B 1
agent_data_plane::components::tag_filterlist +1 B 1
tonic +0 B 2
crossbeam_channel +0 B 6
tracing_subscriber +0 B 4

Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +22.0Ki  [NEW] +21.8Ki    saluki_components::transforms::trace_sampler::TraceSampler::process_trace::haae362d51dae69a0
  [NEW] +13.6Ki  [NEW] +13.4Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h1d9cfb93998519c7
  [NEW] +9.63Ki  [NEW] +9.51Ki    saluki_components::common::datadog::apm::ApmConfig::from_configuration::h8e84793e7f99a6e6
  [NEW] +5.20Ki  [NEW] +5.05Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::had646bb5b345f1a8
  +0.3% +3.74Ki  +0.3% +3.03Ki    [108 Others]
   +46% +3.26Ki  [ = ]       0    [Unmapped]
  [NEW] +1.88Ki  [NEW] +1.78Ki    hashbrown::raw::RawTable<T,A>::reserve_rehash::hd0ce83933bc8fbac
  [NEW] +1.79Ki  [NEW] +1.65Ki    <tonic::transport::channel::endpoint::Endpoint as core::clone::Clone>::clone.12982
  [NEW] +1.79Ki  [NEW] +1.69Ki    hashbrown::raw::RawTable<T,A>::reserve_rehash::h30b016c18f118cb7
  [NEW] +1.10Ki  [NEW] +1.00Ki    crossbeam_channel::waker::SyncWaker::notify.13122
  [NEW] +1.04Ki  [NEW]    +913    <tracing_subscriber::filter::env::directive::Directive as core::clone::Clone>::clone.13350
  [NEW]    +878  [NEW]    +673    _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::h2e85609667304799
  [NEW]    +812  [NEW]    +713    hashbrown::map::HashMap<K,V,S,A>::remove::hbf94225925c826fd
  [DEL]    -772  [DEL]    -671    crossbeam_channel::waker::SyncWaker::unregister.13124
 -15.1%    -896 -15.1%    -896    [section .rodata]
  [DEL] -1.04Ki  [DEL]    -913    <tracing_subscriber::filter::env::directive::Directive as core::clone::Clone>::clone.13348
  [DEL] -1.10Ki  [DEL] -1.00Ki    crossbeam_channel::waker::SyncWaker::notify.13120
  [DEL] -1.79Ki  [DEL] -1.65Ki    <tonic::transport::channel::endpoint::Endpoint as core::clone::Clone>::clone.12980
  [DEL] -11.0Ki  [DEL] -10.9Ki    saluki_components::common::datadog::apm::ApmConfig::from_configuration::hf8c47abaacda9d53
  [DEL] -12.5Ki  [DEL] -12.4Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::ha43058272b282e7a
  [DEL] -16.3Ki  [DEL] -16.1Ki    _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::h999b6ca59cc02d7a
  +0.1% +21.2Ki  +0.1% +16.7Ki    TOTAL

@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented Apr 3, 2026

Regression Detector (Agent Data Plane)

This comment was omitted because it was over 65,536 characters.Please check the Gitlab Job logs to see its output.


const fn default_rare_sampler_cardinality() -> usize {
200
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


#[serde(default = "default_rare_sampler_cardinality")]
cardinality: usize,
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

("proxy.http", "proxy_http"),
("proxy.https", "proxy_https"),
("proxy.no_proxy", "proxy_no_proxy"),
("apm_config.enable_rare_sampler", "apm_enable_rare_sampler"),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.apm_config.rare_sampler_enabled(),
self.apm_config.rare_sampler_tps(),
std::time::Duration::from_secs_f64(self.apm_config.rare_sampler_cooldown_period_secs()),
self.apm_config.rare_sampler_cardinality(),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is our expected behavior when RC eventually changes our config at runtime? Do we end up restarting the entire process to reset things, or do we need to respond to those config changes here somehow?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke with @tobz on this, going to leave a TODO and an issue here for now. #1326

@dd-octo-sts dd-octo-sts bot added the area/core Core functionality, event model, etc. label Apr 6, 2026
@thieman thieman changed the title feat(trace-sampler): implement rare sampler feat(trace sampler transform): implement rare sampler Apr 6, 2026
@thieman thieman changed the title feat(trace sampler transform): implement rare sampler feat(trace sampler): implement rare sampler Apr 6, 2026
@dd-octo-sts dd-octo-sts bot added the area/ci CI/CD, automated testing, etc. label Apr 6, 2026
@thieman
Copy link
Copy Markdown
Contributor Author

thieman commented Apr 6, 2026

I figured it made sense to add trace sampler as an accepted PR title since there will be a few more of these coming, seems like it will let us merge with that failing since it checks against the action definition on main.

Copy link
Copy Markdown
Contributor

@andrewqian2001datadog andrewqian2001datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this code

I think we want to add a rare check?

In the agent, there is a rare check before the no priority "legacy" path.

Added a test to show that the we decide to drop even when rare matches

/// Demonstrates the current ADP/agent divergence for OTLP traces without a sampling priority.
    ///
    /// The Datadog Agent still honors the rare decision for this shape, but ADP currently routes
    /// OTLP no-priority traces through the upstream OTLP sampling branch and drops them when that
    /// rate check fails.
    #[test]
    fn rare_sampler_otlp_no_priority_differs_from_agent() {
        let mut sampler = create_sampler_with_rare_enabled();
        sampler.probabilistic_sampler_enabled = false;
        sampler.error_sampling_enabled = false;
        sampler.otlp_sampling_rate = 0.0;

        let mut meta = saluki_common::collections::FastHashMap::default();
        meta.insert(
            MetaString::from_static(OTEL_TRACE_ID_META_KEY),
            MetaString::from("00000000000000000000000000000001"),
        );
        let span = create_top_level_span(888, 1).with_meta(meta);
        let mut trace = create_test_trace(vec![span]);

        let (keep, priority, decision_maker, root_idx) = sampler.run_samplers(&mut trace);
        assert!(!keep, "ADP currently drops this OTLP trace even though rare matched");
        assert_eq!(priority, PRIORITY_AUTO_DROP);
        assert_eq!(decision_maker, "");
        assert_eq!(
            trace.spans()[root_idx.unwrap()]
                .metrics()
                .get(rare_sampler::RARE_KEY)
                .copied(),
            Some(1.0),
            "the rare sampler ran, but its keep decision was not honored"
        );
    }

}

/// Returns `true` if the span is top-level (`_top_level = 1`) or measured (`_dd.measured = 1`).
fn is_top_level_or_measured(span: &Span) -> bool {
Copy link
Copy Markdown
Contributor

@andrewqian2001datadog andrewqian2001datadog Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non blocking but I think if we are planning to port over the trace agent to ADP then we would have to add the tracer logic to sampling.

Copy link
Copy Markdown
Contributor Author

@thieman thieman Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Sonnet 4.6] Good point. Fixed in efd14f6is_top_level_or_measured now checks _dd.top_level in addition to _top_level, mirroring HasTopLevel in the Go agent's traceutil package. We don't have an equivalent of UpdateTracerTopLevel / ComputeTopLevel in ADP's ingest pipeline, so checking both keys directly in the sampler is the right approach for now. Also added tracer-top-level to the span_eligibility test case table.

@thieman
Copy link
Copy Markdown
Contributor Author

thieman commented Apr 6, 2026

[Claude Sonnet 4.6] Good catch — fixed in 7457f0b. Added the rare check at the top of the OTLP branch before the rate-based sampling, matching the agent's behavior at the linked lines. Also adapted the test from documenting the divergence to asserting the correct behavior (keep + _dd.rare = 1 on first occurrence).

thieman and others added 9 commits April 6, 2026 16:44
Adds a rare sampler that catches traces for span signature combinations
(env, service, name, resource, error type, http status) that are not
seen by the priority sampler, ensuring low-traffic trace shapes remain
represented in sampled data.

- New `RareSampler` with token bucket rate limiting (5 TPS default,
  burst 50) and per-(env,service) shard TTL tracking
- Cardinality-bounded `SeenSpans` with modular-hash shrinking (matches
  Go agent behavior)
- Wired into `run_samplers` ahead of all other samplers in both
  probabilistic and legacy paths
- `record_priority_trace` feedback so priority-sampled signatures don't
  re-trigger as rare within the cooldown window
- Config fields under `apm_config.rare_sampler` (disabled by default)
- 6 unit tests covering: disabled mode, new/repeated signatures, top-
  level vs measured detection, cross-signature independence, rate limit

Closes #1134 (partial — rare sampler only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ports the remaining tests from rare_sampler_test.go and fixes a bug
discovered in the process:

- TTL expiration: verify same signature becomes rare again after cooldown
- record_priority_trace: verify priority-sampled signatures suppress rare
- Cardinality/shrink: verify SeenSpans stays bounded after overflow
- Multiple top-level spans: verify first-expired span gets _dd.rare flag
  and all spans' TTLs are refreshed on keep

Also fixes SeenSpans::add to only apply the TTL_RENEWAL_PERIOD skip when
the stored entry is still live. Previously it could skip updates for
expired entries when TTL < TTL_RENEWAL_PERIOD (only affects tests; in
production TTL=5min always exceeds the 60s renewal period).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `enable_rare_sampler` moves to a top-level `apm_config` field
  (YAML: `apm_config.enable_rare_sampler`, env: `DD_APM_ENABLE_RARE_SAMPLER`)
  rather than nested under `apm_config.rare_sampler`
- Rename `cooldown_period_secs` serde key to `cooldown`
  (YAML: `apm_config.rare_sampler.cooldown`)
- Add KEY_ALIAS `apm_config.enable_rare_sampler` → `apm_enable_rare_sampler`
  so the YAML value is visible under the same flat key the env var sets
- Read `apm_enable_rare_sampler` explicitly in `from_configuration` to
  let the env var override the YAML-deserialized value

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…wrapper

Read the flag at the root level via ApmConfiguration (renamed to
apm_enable_rare_sampler) so KEY_ALIASES handles YAML/env var precedence
generically. Removes the manual try_get_typed override.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
thieman and others added 7 commits April 6, 2026 16:44
…ing path

Removes `record_priority_trace` which had no equivalent in the Go agent —
priority-sampled traces don't suppress the rare sampler in Go V0. Also
removes the `else` guard in the probabilistic block that was briefly changed
to V1 behavior, keeping the V0 else-if chain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-driven eligibility test

The previous multi-top-level test slept past TTL so r1 was expired when
trace2 was sampled, diverging from TestMultipleTopeLevels in Go where r1
is still within TTL and r2 (new) gets _dd.rare. Rewrites the test to match
Go exactly: no sleep, r2 gets the rare flag, r1 is suppressed by the
refreshed TTL.

Also adds a table-driven span_eligibility test mirroring TestConsideredSpans.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t tests

TokenBucket is a generic rate-limiting primitive with no trace-specific
logic. Moves it to saluki-common::rate so other components can reuse it.
Adds 4 unit tests covering burst exhaustion, time-based refill, capacity
cap, and zero-rate behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pies

- Remove String allocation in handle_trace: use &str directly from
  get_trace_env and reorder record_all_top_level_spans before spans_mut
  so NLL ends the shared borrow before the mutable one is needed
- get_expire returns Option<&Instant> instead of Option<Instant>
- Collapse find_rare_span expiry check to is_none_or
- Use is_some_and in is_top_level_or_measured instead of copied().unwrap_or

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adapts Go TestSampling cases from agent_test.go:
- rare catches first occurrence when probabilistic would drop (0%)
- _dd.rare=1 set on kept span
- second occurrence within TTL is dropped
- rare disabled does not catch unsampled traces
- rare catches PriorityAutoDrop in legacy (non-probabilistic) path
- probabilistic 100%/0% coverage with rare disabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OTLP no-priority branch was missing the rare check that the Go agent
performs before falling through to rate-based sampling. OTLP traces with
no sampling priority and a matching rare signature were incorrectly dropped
even when the rare sampler kept them.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_top_level

Modern Datadog tracers set _dd.top_level on spans they consider top-level
rather than (or in addition to) the legacy _top_level key. The Go agent
normalizes _dd.top_level → _top_level via UpdateTracerTopLevel when
ClientComputedTopLevel is true, but ADP has no equivalent normalization
pass. Mirror the agent's HasTopLevel check by testing both keys directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman thieman force-pushed the thieman/rare-sampler branch from efd14f6 to af78b1a Compare April 6, 2026 20:44
@dd-octo-sts dd-octo-sts bot removed the area/ci CI/CD, automated testing, etc. label Apr 6, 2026
Copy link
Copy Markdown
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assisted code review. Looks good.

It may have identified some testing gaps if they are of interest:

Go agent_test.go TestSampling — the interesting gaps are in the integration-level interactions. Three scenarios from Go aren't covered in the PR:

  1. Manual drop short-circuits rare — Go's priority < 0 check fires before rare. A UserDrop(-1) trace should be dropped even if rare=true. The PR has no test for this. Relevant because the Rust code gates rare behind priority < PRIORITY_AUTO_DROP.
  2. Rare wins over probabilistic when both would keep — Go's probabilistic-rare-100 tests that when probabilistic is 100% and rare is enabled, the trace is kept via the rare path (samplerName=rare, no _dd.p.dm tag). The PR's rare_sampler_catches_unsampled_trace only tests the case where probabilistic is 0%. A test at 100% would verify decision_maker stays empty when rare takes precedence over probabilistic.
  3. Rare + error trace interaction — Go tests (error-sampled-prio-unsampled, error-sampled) show that rare catches error traces before the error sampler runs. The PR has no test where contains_error=true and rare is enabled — verifying the else-if chain means error sampler is never consulted when rare fires.

All three are run_samplers-level interaction tests (they'd go in mod.rs tests).

}

if rare {
return (true, PRIORITY_AUTO_KEEP, "", Some(root_span_idx));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Claude Opus 4.6) Minor divergence from Go V0: In the Go legacy path (agent.go:1129-1131), if rare { return true, true } does not modify the trace chunk's priority — it preserves whatever the tracer originally set (could be AutoDrop=0, AutoKeep=1, or UserKeep=2). Here, we explicitly return PRIORITY_AUTO_KEEP, which means a UserKeep(2) trace caught by rare gets downgraded to AutoKeep(1).

The no-priority path (line 377) and OTLP path (line 359) already follow this same pattern, so this is consistent within the Rust codebase. Just flagging the Go delta — could matter if downstream consumers distinguish UserKeep from AutoKeep.

// Run the rare sampler early, before all other samplers. This mirrors the Go agent behavior
// where the rare sampler runs first to catch traces that would otherwise be dropped entirely.
// logic taken from: https://github.com/DataDog/datadog-agent/blob/main/pkg/trace/agent/agent.go#L1078
let rare = self.rare_sampler.sample(trace, root_span_idx);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Claude Opus 4.6) Pre-existing issue (not introduced by this PR), but worth noting now that the rare sampler amplifies its impact: In Go V0 (agent.go:1070-1077), ErrorTrackingStandalone is checked before the rare sampler call, gating all samplers from running on non-error traces. In Rust, run_samplers has no ETS guard — the rare sampler runs unconditionally, and the ETS check at process_trace:454 only fires on the keep=false fallback path. This means with ETS enabled, the rare sampler (and other samplers) can keep non-error traces that Go would drop, violating the ETS contract.

…ns in legacy path

The Go agent returns the trace chunk's existing priority unchanged when rare
wins in the priority path. We were always returning PRIORITY_AUTO_KEEP,
downgrading UserKeep(2) traces to AutoKeep(1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman
Copy link
Copy Markdown
Contributor Author

thieman commented Apr 8, 2026

[Claude Sonnet 4.6] Re: webern's first comment (priority downgrade) — fixed in 750a672. The rare branch in the priority path now returns priority instead of PRIORITY_AUTO_KEEP, matching Go's behavior of leaving the chunk priority untouched. Added a test covering the UserKeep(2) case specifically.

Re: the ETS concern — noted on #1314 for follow-up there.

… in review

Covers gaps identified against Go agent_test.go TestSampling:
- manual drop short-circuits before rare (UserDrop=-1 drops even if rare matches)
- rare wins over probabilistic at 100% rate (_dd.p.dm not set when rare fires first)
- rare catches error traces before error sampler is consulted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman
Copy link
Copy Markdown
Contributor Author

thieman commented Apr 8, 2026

[Claude Sonnet 4.6] Re: webern's test gap review — added all three in b9fb81e:

  1. manual_drop_short_circuits_before_rare — UserDrop(-1) is dropped even when rare would match
  2. rare_wins_over_probabilistic_no_decision_maker_tag — rare takes precedence at 100% probabilistic rate, _dd.p.dm not set
  3. rare_catches_error_trace_before_error_sampler — rare keeps a no-priority error trace before the error sampler is consulted

@thieman
Copy link
Copy Markdown
Contributor Author

thieman commented Apr 8, 2026

/merge

@gh-worker-devflow-routing-ef8351
Copy link
Copy Markdown

gh-worker-devflow-routing-ef8351 bot commented Apr 8, 2026

View all feedbacks in Devflow UI.

2026-04-08 18:44:07 UTC ℹ️ Start processing command /merge


2026-04-08 18:44:13 UTC ℹ️ MergeQueue: waiting for PR to be ready

This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
It will be added to the queue as soon as checks pass and/or get approvals. View in MergeQueue UI.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.


2026-04-08 18:52:09 UTC ℹ️ MergeQueue: merge request added to the queue

The expected merge time in main is approximately 15m (p90).


2026-04-08 19:04:22 UTC ℹ️ MergeQueue: This merge request was merged

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit 5b862ff into main Apr 8, 2026
59 of 60 checks passed
dd-octo-sts bot pushed a commit that referenced this pull request Apr 8, 2026
## Summary

Implements the rare sampler from `datadog-agent/pkg/trace/sampler/rare_sampler.go` matching the Go V0 behavior (`runSamplers`). Ensures low-traffic trace shapes that would otherwise be dropped by probabilistic or priority sampling remain visible.

## Changes

- **`rare_sampler.rs`** — new module with `RareSampler`, a per-(env,service)-shard `SeenSpans` with cardinality-bounded TTL tracking and modular-hash shrinking on overflow
- **`signature.rs`** — expose `span_hash_for_rare` helper (`compute_span_hash` with `env=""`, `with_resource=true`)
- **`apm.rs`** — add `RareSamplerConfig` with `tps`, `cooldown`, `cardinality` fields; `enable_rare_sampler` lives at the `ApmConfiguration` wrapper level to support env var remapping (`APM_ENABLE_RARE_SAMPLER`); disabled by default
- **`mod.rs`** — wire rare sampler into `run_samplers` before all other samplers in both probabilistic and legacy (priority) paths
- **`saluki-common/src/rate.rs`** — new `TokenBucket` (tokens-per-second refill, burst capacity) extracted as a reusable primitive

## Behavioral notes

- Rare sampler runs before all other samplers on every trace
- In the probabilistic path: rare short-circuits probabilistic/error sampling (else-if chain matching Go V0)
- In the legacy path: rare is checked after manual-drop but before priority/no-priority samplers
- When rare wins: `_dd.rare = 1` set on the first rare span; no `_dd.p.dm` set (matches Go)
- Default: disabled (`apm_config.enable_rare_sampler: false`), 5 TPS, 300s cooldown, 200 cardinality per shard

## Test plan

- [ ] `rare_sampler.rs`: 11 unit tests covering disabled mode, new signature kept, TTL drop, expiry re-sample, measured spans, non-top-level ignored, multi-top-level flag placement, rate limiting, cardinality shrink, span eligibility
- [ ] `saluki-common/src/rate.rs`: 4 unit tests for `TokenBucket`
- [ ] `apm.rs`: 5 config tests (disabled by default, enable via YAML, enable via env var, env var overrides YAML, tuning via YAML)
- [ ] `mod.rs`: 7 interaction tests (rare catches 0% probabilistic drop, `_dd.rare=1` set, no resample within TTL, rare disabled drops, legacy path PriorityAutoDrop caught, 100%/0% probabilistic baseline)

Partial fix for #1134 (rare sampler only; error tracking standalone and priority/no-priority sampler gaps remain).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: travis.thieman <travis.thieman@datadoghq.com> 5b862ff
@tobz tobz deleted the thieman/rare-sampler branch April 10, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/components Sources, transforms, and destinations. area/core Core functionality, event model, etc. mergequeue-status: done transform/trace-sampler Trace Sampler synchronous transform.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants